9 Ways to Master Advanced SQL Analytics

Why Moving Data to Python Makes You Slower and What to Do Instead

May 21, 2026

Advanced SQL for Data Analysis course header image featuring white and orange text on dark blue background with abstract network of connected nodes and glowing data points, used for SQL tutorial content on database analytics, data engineering, and business intelligence

As a data analyst, you are working with data that grows exponentially, and analytics demands are much larger than before.

Modern SQL engines now support features to run advanced analytics without moving data to external systems. You can write queries to rank rows, calculate running totals, parse nested JSON, parse hierarchical structures, and run a statistical analysis in SQL.

Here is how to do that:

Ranking Functions:

RANK(), DENSE_RANK(), ROW_NUMBER() assign sequential numbers to rows within a partition.

ROW_NUMBER() assigns a unique integer to every row.
RANK() assigns the same rank to tied values and skips subsequent ranks. If two rows tie at rank 1, the next row receives rank 3.
DENSE_RANK() assigns the same rank to ties but does not skip numbers. The next row after a tie at rank 1 receives rank 2.
NTILE() divides rows into a specified number of roughly equal groups and assigns a group number to each row.

-- Example:
SELECT 
    region,
    salesperson,
    sales,
    RANK() OVER (PARTITION BY region ORDER BY sales DESC) AS rnk,
FROM 
    quarterly_sales;

Offset Functions: LAG and LEAD:

LAG() and LEAD() access values from preceding and following rows within the same partition.

LAG() pulls a value from a previous row,
LEAD() pulls from a subsequent row.

Both accept an optional offset argument to specify how many rows to skip. FIRST_VALUE() and LAST_VALUE() return the first or last value in the window frame.

You can pair FIRST_VALUE() with ORDER BY to find the earliest event in a user session or use LAST_VALUE() to find the most recent status change in a workflow.

-- Example:
SELECT
   date,
   close_price,
   LAG(close_price) OVER (ORDER BY date) AS prev_close,
   close_price - LAG(close_price) OVER (ORDER BY date) AS change
FROM 
   stock_prices;

Use LAG and LEAD for cohort retention, funnel step drop-off, and anomaly detection on sequential data.

Aggregate Window Functions:

SUM(), AVG(), COUNT(), MIN(), MAX() standard aggregates work as window functions when you pair them with the OVER() clause.

The result set keeps every row, which differs from GROUP BY aggregation.

SELECT 
    department,
    employee,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg,
FROM 
    employees;

Common Table Expressions:

A Common Table Expression (CTE) is a temporary result set that lives inside a single SQL statement.

Define it with the WITH keyword and refer to it in the main query that follows that part of the code. A CTE disappears when the statement finishes and does not create a permanent object in the database.

The main query can reference CTEs multiple times, unlike subqueries. Use CTEs instead of nested subqueries.

There are two forms of CTEs: recursive and non-recursive.

Non-recursive CTEs:

You should use non-recursive CTEs as substitutes for subqueries and derived tables.

CTEs read much easier than nested subqueries.

--example
WITH regional_sales AS (
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
),
top_regions AS (
    SELECT region
    FROM regional_sales
    WHERE total_sales > 1000000
)
SELECT 
  o.*
FROM 
  orders o
JOIN 
  top_regions t 
  ON o.region = t.region;

Recursive CTEs:

Recursive CTEs reference themselves in hierarchical and graph-structured data.

A recursive CTE consists of two parts:

Base query.
Recursive part that references itself inside the CTE.

You can use recursive CTEs to search through the organizational charts, trees, file systems, paths, or network graphs.

-- Example:
WITH RECURSIVE org_chart AS (
    SELECT 
      id, name, manager_id, 1 AS level
    FROM 
      employees
    WHERE 
      manager_id IS NULL
    UNION ALL
    SELECT 
      e.id, e.name, e.manager_id, o.level + 1
    FROM 
      employees e
    JOIN org_chart o 
    ON e.manager_id = o.id
)
SELECT 
  * 
FROM 
  org_chart;

Recursive CTEs find all nodes from a starting point in a network. The same pattern applies everywhere: supply chain dependency resolution, social network distance queries, and routes.

Pattern matching:

It works the same way as Regular Expressions, but for table rows instead of characters in a string.

Use it to find sequences of rows that match a pattern you define. You can use it in time-series anomaly detection, user behavior analysis, event correlation, or funnels.

For example, track users who did LOGIN -> SEARCH -> ADD_TO_CART -> PURCHASE

SELECT 
  *
FROM 
  sales
MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY event_time
  MEASURES
    A.event_time AS start_time,
    LAST(B.event_time) AS end_time,
    COUNT(B.*) AS steps
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (A B+ C)
  DEFINE
    A AS event_type = ‘login’,
    B AS event_type = ‘page_view’,
    C AS event_type = ‘purchase’
)

This functionality is currently supported only by Oracle and Snowflake.

Advanced aggregations:

Conditional aggregation:

Use a CASE statement inside aggregate functions to calculate multiple metrics in a single pass.

SELECT
    region,
    COUNT(*) AS total_orders,
    COUNT(CASE WHEN status = ‘returned’ THEN 1 END) AS returns,
    SUM(CASE WHEN status = ‘completed’ THEN amount ELSE 0 END) 
      AS completed_revenue
FROM 
    orders
GROUP BY 
    region;

Multidimensional grouping:

You can use GROUPING SETS, ROLLUP, and CUBE to execute multiple GROUP BY clauses in a single pass of the data.

These methods are more efficient than several independent queries with UNION ALL joins.

ROLLUP: This function calculates subtotals at increasing levels of aggregation.

-- Example
SELECT
  year,
  month,
  SUM(revenue) AS total_revenue,
  GROUPING(year) AS year_is_total, 
  GROUPING(month) AS month_is_total
FROM 
  sales
GROUP BY 
  ROLLUP(year, month)
ORDER BY 
  year, month;

Use ROLLUP when you want to navigate the hierarchy.

2. CUBE: This function calculates every possible combination of the specified columns. If you have three columns, the CUBE will generate eight different result sets in a single query.

SELECT
  region,
  year,
  SUM(revenue) AS total_revenue,
  GROUPING(region) AS region_total,
  GROUPING(year) AS year_total
FROM 
  sales
GROUP BY 
  CUBE(region, year)
ORDER BY
  region, year;

3. GROUPING SETS:

GROUPING SETS combine CUBE and ROLLUP functionality together.

SELECT
  region,
  year,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY 
  GROUPING SETS (
  (region),      -- subtotal per region
  (year),        -- subtotal per year  
  ()             -- grand total
)
ORDER BY
  region, year;

Almost every SQL dialect supports all three functions, except SQLite and MySQL, with some limitations.

Pivoting and unpivoting:

Like in Excel, you can pivot the table in SQL:

SELECT 
    product_id,
    q1_sales,
    q2_sales,
    q3_sales,
    q4_sales
FROM 
    quarterly_revenue
PIVOT (
    SUM(revenue) 
    FOR quarter_name 
    IN (’Q1’ AS q1_sales, ‘Q2’ AS q2_sales, 
        ‘Q3’ AS q3_sales, ‘Q4’ AS q4_sales));

Also, SQL got unpivot functionality to do the opposite.

SELECT 
    product_id,
    quarter_name,
    revenue
FROM wide_sales_table
UNPIVOT (
    revenue 
    FOR quarter_name IN (q1_sales, q2_sales, q3_sales, q4_sales)
);

Most SQL dialects, except Postgres and MySQL, support these functions.

These functions could be computationally expensive, so, use them wisely

Statistical and Distribution Functions:

Continuous and Discrete Percentiles:

There are two ways to calculate percentiles in SQL:

PERCENTILE_CONT() returns a percentile, even if that number is not in your dataset. If the answer falls between two values, it invents the number in the middle.
PERCENTILE_DISC() finds the exact record from the database that matches the percentile.

You can combine these functions with a window partition and build baseline profiles:

SELECT 
    department_id,
    employee_id,
    salary,
    PERCENTILE_CONT(0.5) WITHIN GROUP 
    (ORDER BY salary) OVER (PARTITION BY department_id) 
      AS median_salary_cont,
    PERCENTILE_DISC(0.75) WITHIN GROUP
     (ORDER BY salary) OVER (PARTITION BY department_id) 
      AS p75_salary_disc
FROM 
    employee_payroll;

Correlation:

The CORR() function calculates the Pearson correlation coefficient between two numeric indicators for a group.

SELECT 
    region,
    campaign_month,
    CORR(marketing_spend, revenue) OVER (PARTITION BY region) 
      AS spend_to_revenue_correlation
FROM 
    regional_marketing_performance;

SQL Server does not support this function.

Hypothetical Analysis:

Oracle and PostgreSQL can simulate operational shifts with hypothetical set calculations. You can give it a hypothetical new value, and the engine will calculate where that value lands if you add it into the existing dataset.

SELECT 
    RANK(100000) WITHIN GROUP (ORDER BY salary DESC) 
      AS hypothetical_rank_at_100k,
    PERCENT_RANK(50000) WITHIN GROUP (ORDER BY salary DESC) 
      AS hypothetical_percent_rank_at_50k
FROM 
    employee_payroll

Buy me a coffie

Cohort Analysis and Retention Modeling:

You can group users by the starting date and monitor their behavior in following months with cohort analysis.

Here is how t run a 3-step cohort analysis:

Define the Cohort Anchor: Find the first activity date for every user. Use the MIN() function and group by user ID.
Link Activity to Cohorts: Join the cohort anchor table back to the activity table. You will see every subsequent user action.
Calculate the Period Offset: Convert absolute calendar dates into relative time periods (Month 0, Month 1, etc.).

WITH cohort_anchor AS (
    SELECT 
        user_id,
        DATEADD(month, DATEDIFF(month, 0, MIN(activity_date)), 0) 
          AS cohort_month
    FROM user_activity
    GROUP BY user_id
),

linked_activity AS (
    SELECT 
        a.user_id,
        c.cohort_month,
        DATEADD(month, DATEDIFF(month, 0, a.activity_date), 0) 
          AS activity_month
    FROM user_activity a
    JOIN cohort_anchor c ON a.user_id = c.user_id
),

period_offsets AS (
    SELECT 
        cohort_month,
        activity_month,
        DATEDIFF(month, cohort_month, activity_month) AS period_offset,
        COUNT(DISTINCT user_id) AS active_users
    FROM linked_activity
    GROUP BY cohort_month, activity_month
)

SELECT 
    cohort_month,
    MAX(CASE WHEN period_offset = 0 THEN active_users END) 
      OVER(PARTITION BY cohort_month) AS cohort_size,
    period_offset,
    active_users,
    ROUND(100.0 * active_users / 
      MAX(CASE WHEN period_offset = 0 THEN active_users END) 
      OVER(PARTITION BY cohort_month), 2) AS retention_rate
FROM 
    period_offsets
ORDER BY 
    cohort_month, period_offset;

Time-series Analysis:

SQL has date functions to manipulate timestamps.

DATE_TRUNC groups timestamps into periods such as month or quarter. DATEDIFF and DATEADD compute intervals between dates or shift dates forward and backward, and Window functions compute rolling statistics.

For example, you can compare the metrics for the current period vs. the previous period.

SELECT 
    CAST(current_day.activity_datetime AS DATE) AS activity_date,
    COUNT(DISTINCT current_day.user_id) AS active_users_today,
    COUNT(DISTINCT prior_day.user_id) AS active_users_7_days_ago,
    (COUNT(DISTINCT current_day.user_id) - COUNT(DISTINCT prior_day.user_id))
       AS net_change_vs_last_week
FROM 
    user_activity current_day
LEFT JOIN 
    user_activity prior_day 
    ON prior_day.user_id = current_day.user_id
    AND prior_day.activity_datetime >= 
      DATEADD(day, -7, current_day.activity_datetime)
    AND prior_day.activity_datetime < 
      DATEADD(day, -6, current_day.activity_datetime)
WHERE 
    current_day.activity_datetime >= ‘2026-05-01’
GROUP BY 
    CAST(current_day.activity_datetime AS DATE)
ORDER BY 
    activity_date;

Vector Search:

BigQuery and SQL Server 2025 store high-dimensional vectors as a native data type.

You no longer need a workflow to extract relational columns, convert text to embeddings with an external Python script, and run calculations inside a separate third-party vector database.

Now you can keep your unstructured data and relational metrics inside the same engine.

Vector Indexes and Similarity Searches:

Execute these commands to build a vector index and execute a semantic search query.

BigQuery:

Set up a distance index:

CREATE VECTOR INDEX IF NOT EXISTS 
  text_embedding_idx 
ON 
  my_dataset.knowledge_base(embedding_col) 
OPTIONS 
  (distance_type = ‘COSINE’, index_type = ‘IVF’);

2. Locate the top matching documents:

SELECT 
  document_id, 
  text_content 
FROM 
  VECTOR_SEARCH( TABLE my_dataset.knowledge_base, ‘embedding_col’, 
    (SELECT 
    embedding_col 
    FROM 
    my_dataset.query_embeddings 
    WHERE query_id = 45), top_k => 5);

SQL Server 2025:

1) Create a vector index to speed up nearest-neighbor scanning:

CREATE VECTOR INDEX
  text_vector_idx 
ON 
  dbo.knowledge_base(embedding_col) 
WITH 
  (DISTANCE_METRIC = ‘COSINE’);

Execute this statement to pull matching content with similarity function:

SELECT TOP (5)
  document_id, 
  text_content
FROM 
  dbo.knowledge_base
ORDER BY 
  VECTOR_DISTANCE(’COSINE’, embedding_col, @query_vector);

Follow me for more SQL tips and guides.
Do you still move data to Python for analysis? Why?
Let me know in the comments 👇

BIWave Substack

Discussion about this post

Ready for more?