9 Ways to Master Advanced SQL Analytics
Why Moving Data to Python Makes You Slower and What to Do Instead
As a data analyst, you are working with data that grows exponentially, and analytics demands are much larger than before.
Modern SQL engines now support features to run advanced analytics without moving data to external systems. You can write queries to rank rows, calculate running totals, parse nested JSON, parse hierarchical structures, and run a statistical analysis in SQL.
Here is how to do that:
Window Functions:
Ranking Functions:
RANK(), DENSE_RANK(), ROW_NUMBER() assign sequential numbers to rows within a partition.
ROW_NUMBER() assigns a unique integer to every row.
RANK() assigns the same rank to tied values and skips subsequent ranks. If two rows tie at rank 1, the next row receives rank 3.
DENSE_RANK() assigns the same rank to ties but does not skip numbers. The next row after a tie at rank 1 receives rank 2.
NTILE() divides rows into a specified number of roughly equal groups and assigns a group number to each row.
-- Example:
SELECT
region,
salesperson,
sales,
RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS rnk,
FROM
quarterly_sales;Offset Functions: LAG and LEAD:
LAG() and LEAD() access values from preceding and following rows within the same partition.
LAG() pulls a value from a previous row,
LEAD() pulls from a subsequent row.
Both accept an optional offset argument to specify how many rows to skip. FIRST_VALUE() and LAST_VALUE() return the first or last value in the window frame.
You can pair FIRST_VALUE() with ORDER BY to find the earliest event in a user session or use LAST_VALUE() to find the most recent status change in a workflow.
-- Example:
SELECT
date,
close_price,
LAG(close_price) OVER (ORDER BY date) AS prev_close,
close_price - LAG(close_price) OVER (ORDER BY date) AS change
FROM
stock_prices;Use LAG and LEAD for cohort retention, funnel step drop-off, and anomaly detection on sequential data.
Aggregate Window Functions:
SUM(), AVG(), COUNT(), MIN(), MAX() standard aggregates work as window functions when you pair them with the OVER() clause.
The result set keeps every row, which differs from GROUP BY aggregation.
SELECT
department,
employee,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg,
FROM
employees;Common Table Expressions:
A Common Table Expression (CTE) is a temporary result set that lives inside a single SQL statement.
Define it with the WITH keyword and refer to it in the main query that follows that part of the code. A CTE disappears when the statement finishes and does not create a permanent object in the database.
The main query can reference CTEs multiple times, unlike subqueries. Use CTEs instead of nested subqueries.
There are two forms of CTEs: recursive and non-recursive.
Non-recursive CTEs:
You should use non-recursive CTEs as substitutes for subqueries and derived tables.
CTEs read much easier than nested subqueries.
--example
WITH regional_sales AS (
SELECT region, SUM(amount) AS total_sales
FROM orders
GROUP BY region
),
top_regions AS (
SELECT region
FROM regional_sales
WHERE total_sales > 1000000
)
SELECT
o.*
FROM
orders o
JOIN
top_regions t
ON o.region = t.region;Recursive CTEs:
Recursive CTEs reference themselves in hierarchical and graph-structured data.
A recursive CTE consists of two parts:
Base query.
Recursive part that references itself inside the CTE.
You can use recursive CTEs to search through the organizational charts, trees, file systems, paths, or network graphs.
-- Example:
WITH RECURSIVE org_chart AS (
SELECT
id, name, manager_id, 1 AS level
FROM
employees
WHERE
manager_id IS NULL
UNION ALL
SELECT
e.id, e.name, e.manager_id, o.level + 1
FROM
employees e
JOIN org_chart o
ON e.manager_id = o.id
)
SELECT
*
FROM
org_chart;Recursive CTEs find all nodes from a starting point in a network. The same pattern applies everywhere: supply chain dependency resolution, social network distance queries, and routes.
Pattern matching:
It works the same way as Regular Expressions, but for table rows instead of characters in a string.
Use it to find sequences of rows that match a pattern you define. You can use it in time-series anomaly detection, user behavior analysis, event correlation, or funnels.
For example, track users who did LOGIN -> SEARCH -> ADD_TO_CART -> PURCHASE
SELECT
*
FROM
sales
MATCH_RECOGNIZE (
PARTITION BY user_id
ORDER BY event_time
MEASURES
A.event_time AS start_time,
LAST(B.event_time) AS end_time,
COUNT(B.*) AS steps
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (A B+ C)
DEFINE
A AS event_type = ‘login’,
B AS event_type = ‘page_view’,
C AS event_type = ‘purchase’
)This functionality is currently supported only by Oracle and Snowflake.
Advanced aggregations:
Conditional aggregation:
Use a CASE statement inside aggregate functions to calculate multiple metrics in a single pass.
SELECT
region,
COUNT(*) AS total_orders,
COUNT(CASE WHEN status = ‘returned’ THEN 1 END) AS returns,
SUM(CASE WHEN status = ‘completed’ THEN amount ELSE 0 END)
AS completed_revenue
FROM
orders
GROUP BY
region;Multidimensional grouping:
You can use GROUPING SETS, ROLLUP, and CUBE to execute multiple GROUP BY clauses in a single pass of the data.
These methods are more efficient than several independent queries with UNION ALL joins.
ROLLUP: This function calculates subtotals at increasing levels of aggregation.
-- Example
SELECT
year,
month,
SUM(revenue) AS total_revenue,
GROUPING(year) AS year_is_total,
GROUPING(month) AS month_is_total
FROM
sales
GROUP BY
ROLLUP(year, month)
ORDER BY
year, month;Use ROLLUP when you want to navigate the hierarchy.
2. CUBE: This function calculates every possible combination of the specified columns. If you have three columns, the CUBE will generate eight different result sets in a single query.
SELECT
region,
year,
SUM(revenue) AS total_revenue,
GROUPING(region) AS region_total,
GROUPING(year) AS year_total
FROM
sales
GROUP BY
CUBE(region, year)
ORDER BY
region, year;3. GROUPING SETS:
GROUPING SETS combine CUBE and ROLLUP functionality together.
SELECT
region,
year,
SUM(revenue) AS total_revenue
FROM sales
GROUP BY
GROUPING SETS (
(region), -- subtotal per region
(year), -- subtotal per year
() -- grand total
)
ORDER BY
region, year;Almost every SQL dialect supports all three functions, except SQLite and MySQL, with some limitations.
Pivoting and unpivoting:
Like in Excel, you can pivot the table in SQL:
SELECT
product_id,
q1_sales,
q2_sales,
q3_sales,
q4_sales
FROM
quarterly_revenue
PIVOT (
SUM(revenue)
FOR quarter_name
IN (’Q1’ AS q1_sales, ‘Q2’ AS q2_sales,
‘Q3’ AS q3_sales, ‘Q4’ AS q4_sales));Also, SQL got unpivot functionality to do the opposite.
SELECT
product_id,
quarter_name,
revenue
FROM wide_sales_table
UNPIVOT (
revenue
FOR quarter_name IN (q1_sales, q2_sales, q3_sales, q4_sales)
);Most SQL dialects, except Postgres and MySQL, support these functions.
These functions could be computationally expensive, so, use them wisely
Statistical and Distribution Functions:
Continuous and Discrete Percentiles:
There are two ways to calculate percentiles in SQL:
PERCENTILE_CONT() returns a percentile, even if that number is not in your dataset. If the answer falls between two values, it invents the number in the middle.
PERCENTILE_DISC() finds the exact record from the database that matches the percentile.
You can combine these functions with a window partition and build baseline profiles:
SELECT
department_id,
employee_id,
salary,
PERCENTILE_CONT(0.5) WITHIN GROUP
(ORDER BY salary) OVER (PARTITION BY department_id)
AS median_salary_cont,
PERCENTILE_DISC(0.75) WITHIN GROUP
(ORDER BY salary) OVER (PARTITION BY department_id)
AS p75_salary_disc
FROM
employee_payroll;Correlation:
The CORR() function calculates the Pearson correlation coefficient between two numeric indicators for a group.
SELECT
region,
campaign_month,
CORR(marketing_spend, revenue) OVER (PARTITION BY region)
AS spend_to_revenue_correlation
FROM
regional_marketing_performance;SQL Server does not support this function.
Hypothetical Analysis:
Oracle and PostgreSQL can simulate operational shifts with hypothetical set calculations. You can give it a hypothetical new value, and the engine will calculate where that value lands if you add it into the existing dataset.
SELECT
RANK(100000) WITHIN GROUP (ORDER BY salary DESC)
AS hypothetical_rank_at_100k,
PERCENT_RANK(50000) WITHIN GROUP (ORDER BY salary DESC)
AS hypothetical_percent_rank_at_50k
FROM
employee_payrollCohort Analysis and Retention Modeling:
You can group users by the starting date and monitor their behavior in following months with cohort analysis.
Here is how t run a 3-step cohort analysis:
Define the Cohort Anchor: Find the first activity date for every user. Use the MIN() function and group by user ID.
Link Activity to Cohorts: Join the cohort anchor table back to the activity table. You will see every subsequent user action.
Calculate the Period Offset: Convert absolute calendar dates into relative time periods (Month 0, Month 1, etc.).
WITH cohort_anchor AS (
SELECT
user_id,
DATEADD(month, DATEDIFF(month, 0, MIN(activity_date)), 0)
AS cohort_month
FROM user_activity
GROUP BY user_id
),
linked_activity AS (
SELECT
a.user_id,
c.cohort_month,
DATEADD(month, DATEDIFF(month, 0, a.activity_date), 0)
AS activity_month
FROM user_activity a
JOIN cohort_anchor c ON a.user_id = c.user_id
),
period_offsets AS (
SELECT
cohort_month,
activity_month,
DATEDIFF(month, cohort_month, activity_month) AS period_offset,
COUNT(DISTINCT user_id) AS active_users
FROM linked_activity
GROUP BY cohort_month, activity_month
)
SELECT
cohort_month,
MAX(CASE WHEN period_offset = 0 THEN active_users END)
OVER(PARTITION BY cohort_month) AS cohort_size,
period_offset,
active_users,
ROUND(100.0 * active_users /
MAX(CASE WHEN period_offset = 0 THEN active_users END)
OVER(PARTITION BY cohort_month), 2) AS retention_rate
FROM
period_offsets
ORDER BY
cohort_month, period_offset;Time-series Analysis:
SQL has date functions to manipulate timestamps.
DATE_TRUNC groups timestamps into periods such as month or quarter. DATEDIFF and DATEADD compute intervals between dates or shift dates forward and backward, and Window functions compute rolling statistics.
For example, you can compare the metrics for the current period vs. the previous period.
SELECT
CAST(current_day.activity_datetime AS DATE) AS activity_date,
COUNT(DISTINCT current_day.user_id) AS active_users_today,
COUNT(DISTINCT prior_day.user_id) AS active_users_7_days_ago,
(COUNT(DISTINCT current_day.user_id) - COUNT(DISTINCT prior_day.user_id))
AS net_change_vs_last_week
FROM
user_activity current_day
LEFT JOIN
user_activity prior_day
ON prior_day.user_id = current_day.user_id
AND prior_day.activity_datetime >=
DATEADD(day, -7, current_day.activity_datetime)
AND prior_day.activity_datetime <
DATEADD(day, -6, current_day.activity_datetime)
WHERE
current_day.activity_datetime >= ‘2026-05-01’
GROUP BY
CAST(current_day.activity_datetime AS DATE)
ORDER BY
activity_date;Vector Search:
BigQuery and SQL Server 2025 store high-dimensional vectors as a native data type.
You no longer need a workflow to extract relational columns, convert text to embeddings with an external Python script, and run calculations inside a separate third-party vector database.
Now you can keep your unstructured data and relational metrics inside the same engine.
Vector Indexes and Similarity Searches:
Execute these commands to build a vector index and execute a semantic search query.
BigQuery:
Set up a distance index:
CREATE VECTOR INDEX IF NOT EXISTS
text_embedding_idx
ON
my_dataset.knowledge_base(embedding_col)
OPTIONS
(distance_type = ‘COSINE’, index_type = ‘IVF’);2. Locate the top matching documents:
SELECT
document_id,
text_content
FROM
VECTOR_SEARCH( TABLE my_dataset.knowledge_base, ‘embedding_col’,
(SELECT
embedding_col
FROM
my_dataset.query_embeddings
WHERE query_id = 45), top_k => 5);SQL Server 2025:
1) Create a vector index to speed up nearest-neighbor scanning:
CREATE VECTOR INDEX
text_vector_idx
ON
dbo.knowledge_base(embedding_col)
WITH
(DISTANCE_METRIC = ‘COSINE’);Execute this statement to pull matching content with similarity function:
SELECT TOP (5)
document_id,
text_content
FROM
dbo.knowledge_base
ORDER BY
VECTOR_DISTANCE(’COSINE’, embedding_col, @query_vector);Follow me for more SQL tips and guides.
Do you still move data to Python for analysis? Why?
Let me know in the comments 👇



This is a very practical breakdown of advanced SQL analytics. The progression from window functions through to vector search in a single piece is genuinely useful, and the point about keeping analytics in SQL rather than moving to Python prematurely is one that saves teams a lot of unnecessary complexity.
That last section on vector search is where it gets really interesting for anyone building AI systems on structured data. When embeddings live in the same engine as your relational data, the gap between what the model knows and what the data actually says becomes measurable in the same query. That matters more than most people realize. Every abstraction layer between the model and the source is a place where relational integrity can degrade silently, and most audit frameworks never see it happen.