Q025 ·SQL

Tricky Customer Retention Rate with LAG and Multiple CTEs

Hard High frequency

1Problem

You have a table of user events for a subscription product:

user_events
-----------
user_id       BIGINT
event_date    DATE
event_type    VARCHAR  -- 'purchase', 'refund', 'view', 'signup'
product_id    INT
amount        DECIMAL  -- positive for purchase, negative for refund

Warm-up (easy)

Return the user_ids of users who made at least one successful purchase in the last 30 days (event_type = 'purchase' and amount > 0).

SQL

SELECT DISTINCT user_id
FROM user_events
WHERE event_type = 'purchase'
  AND amount > 0
  AND event_date >= CURRENT_DATE - INTERVAL '30 days';

Main question (hard)

Compute the monthly Customer Retention Rate (CRR) by signup cohort, under the following conditions:

A user is considered active in month M only if they have at least one purchase in month M that is not fully refunded — i.e., after netting refunds against purchases in that month, net spend is strictly positive.
A user is retained in month M if they were active in month M−1 and active in month M (no gaps allowed — consecutive months only).
Group users into cohorts by the month of their first-ever successful purchase.
Output a table: cohort_month, month_number (0 = cohort month, 1 = next month, …), active_users, retained_users, retention_rate.
Only consider cohorts with at least 100 users in month 0.

This is the kind of problem where a naive JOIN-heavy approach becomes a mess. Use LAG and a small stack of CTEs.

2HintsThink first, then tap to reveal

Hint 1 Sketch the layers before writing SQL — CTE per stage. Tap →

Hint 2 LAG tells you if the previous active month was adjacent. Tap →

Hint 3 Net refunds against purchases — don't just filter them out. Tap →

3Solution

▶Why the naive approach fails

Candidates often reach for:

SQL

-- naive
SELECT cohort.month, COUNT(DISTINCT u2.user_id) / COUNT(DISTINCT u1.user_id)
FROM events u1
JOIN events u2 ON u1.user_id = u2.user_id AND u2.month = u1.month + 1
GROUP BY cohort.month;

Problems:

Refunds aren't netted — you count users with a purchase that was later refunded as "active."
"Consecutive months" is hard to express with JOIN; you end up joining each month against the previous, which doesn't generalize to month N+2.
No cohort grouping — you're measuring month-over-month retention, not cohort retention.
The 100-user filter on cohort is hard to tack on without yet another CTE.

▶The right decomposition

This is classic cohort-analysis: group users by their first-active month, then track retention at fixed month offsets. The clean pattern uses cte-recursive-style CTE chaining (though no actual recursion) plus window-functions for the consecutive-month check.

Five CTEs, each doing one clean thing:

monthly_net — aggregate events into (user, month) with net spend (purchases − refunds).
active_user_months — keep only user-months with net spend > 0.
cohort_map — for each user, compute their cohort month (min active month).
retention_flags — use LAG to check whether the user was active in the previous month.
cohort_sizes — count users per cohort to apply the ≥ 100 filter.

Final SELECT joins it together.

▶The SQL

SQL

WITH monthly_net AS (
  -- Net spend per user per month
  SELECT
    user_id,
    DATE_TRUNC('month', event_date) AS month,
    SUM(CASE WHEN event_type = 'purchase' THEN amount
             WHEN event_type = 'refund'   THEN -ABS(amount)
             ELSE 0 END) AS net_spend
  FROM user_events
  WHERE event_type IN ('purchase', 'refund')
  GROUP BY user_id, DATE_TRUNC('month', event_date)
),

active_user_months AS (
  -- User-months with strictly positive net spend
  SELECT user_id, month
  FROM monthly_net
  WHERE net_spend > 0
),

cohort_map AS (
  -- Each user's cohort = first active month
  SELECT
    user_id,
    MIN(month) AS cohort_month
  FROM active_user_months
  GROUP BY user_id
),

retention_flags AS (
  -- For each active user-month, find the previous active month and flag if consecutive
  SELECT
    aum.user_id,
    aum.month,
    LAG(aum.month) OVER (PARTITION BY aum.user_id ORDER BY aum.month) AS prev_active_month,
    cm.cohort_month
  FROM active_user_months aum
  JOIN cohort_map cm USING (user_id)
),

cohort_sizes AS (
  -- Cohort size = active users in month 0 (i.e., their cohort month)
  SELECT
    cohort_month,
    COUNT(DISTINCT user_id) AS cohort_size
  FROM retention_flags
  WHERE month = cohort_month
  GROUP BY cohort_month
)

SELECT
  rf.cohort_month,
  DATEDIFF('month', rf.cohort_month, rf.month) AS month_number,
  COUNT(DISTINCT rf.user_id) AS active_users,
  COUNT(DISTINCT CASE
          WHEN rf.prev_active_month = rf.month - INTERVAL '1 month'
          THEN rf.user_id END) AS retained_users,
  1.0 * COUNT(DISTINCT CASE
          WHEN rf.prev_active_month = rf.month - INTERVAL '1 month'
          THEN rf.user_id END)
      / NULLIF(COUNT(DISTINCT rf.user_id), 0) AS retention_rate
FROM retention_flags rf
JOIN cohort_sizes cs USING (cohort_month)
WHERE cs.cohort_size >= 100
GROUP BY rf.cohort_month, DATEDIFF('month', rf.cohort_month, rf.month)
ORDER BY rf.cohort_month, month_number;

▶Key subtleties

LAG with month arithmetic. LAG gives you the previous row in the user's active-month series. That row's month might be one month earlier (retained) or several months earlier (churned and returned). The prev_active_month = month - 1 month check enforces the "consecutive month" definition.
NULLIF in the denominator prevents divide-by-zero when a cohort has no activity in some future month.
Refund handling via netting, not filtering. A user who bought $50 and refunded$ 30 is still active ($20 net). Only a user with net ≤ 0 is excluded.
DATE_TRUNC('month', ...) is the idiomatic way to bucket by month. Avoid EXTRACT(MONTH) — that gives you 1–12 and loses the year.
COUNT(DISTINCT ... CASE WHEN ...) is the cleanest way to count a subset within a group without a second scan.

▶If the interviewer pushes further

"What if retention means active at month 0 AND month N, with any pattern in between?" That's a simpler condition — no LAG needed; just EXISTS for month 0 and month N.
"What if you need a triangle chart (cohort × month_number retention matrix)?" The query already produces exactly that — pivot in the BI layer, not in SQL.
"What if the data is so large that COUNT(DISTINCT) is slow?" Precompute at the user level: one row per user with cohort_month and a bitmap of active months. Then aggregation is cheap.

▶Why this is 4–5 CTEs, not a single query

Each CTE represents one logical stage:

Clean (net spend)
Filter (positive months only)
Label (assign cohort)
Window (use LAG to check consecutiveness)
Gate (cohort size filter)

Trying to inline these produces an unreadable query and almost always a subtle bug. The CTE structure is the solution, not just a stylistic choice.

4Interview

Interviewer

Quick warm-up: give me the users who made a successful purchase in the last 30 days.

SQL

SELECT DISTINCT user_id
FROM user_events
WHERE event_type = 'purchase' AND amount > 0
  AND event_date >= CURRENT_DATE - INTERVAL '30 days';

Self-rate:

Interviewer · Follow-up

Good. Now the hard one. I want monthly retention by signup cohort, but with a few twists. A user is active in a month only if their net spend — purchases minus refunds — is positive. Retention means active in consecutive months, no gaps. Group users by the month of their first successful purchase as their cohort. And skip any cohort with fewer than 100 users. Walk me through how you'd build it.

Okay, this one's going to need a few CTEs. Let me sketch the layers before I write SQL, because if I try to inline this I'll get it wrong.

Layer 1: aggregate events to a user × month grain with net spend. That handles the refund condition cleanly — I don't filter refunds out, I let them subtract from purchases, and then I only keep months with net > 0.

Layer 2: the cohort assignment — for each user, the MIN of their active months is their cohort month.

Layer 3: the retention check. For each active user-month, I want to know if the user was also active in the immediately preceding month. This is where LAG shines — partition by user, order by month, and LAG(month) gives me the user's previous active month. If that previous month equals current_month − 1 month, they're retained from last month.

Layer 4: the cohort size filter — count users in month 0 per cohort, then only keep cohorts with ≥ 100.

Then the final select joins these and produces the cohort × month_number retention table.

Self-rate:

Interviewer · Follow-up

Write it out.

[writes the SQL from the solution above]

Self-rate:

Interviewer · Follow-up

Walk me through what LAG is doing in retention_flags.

Right. LAG(month) OVER (PARTITION BY user_id ORDER BY month) gives me, for each active user-month row, the month value from the user's previous active row. So if a user was active in Jan, Feb, April — for the Feb row LAG returns Jan, for the April row LAG returns Feb.

Then in the final SELECT, I check whether prev_active_month = month - 1 month. For the Feb row that's Jan = Feb - 1, which is true → retained. For the April row that's Feb = April - 1, which is false → not retained (they had a gap in March).

That's exactly the "consecutive months, no gaps" definition. If I had used a JOIN on user_id AND month = prev_month + 1, I'd have written a different query for every lag value I cared about, and the gap logic would be easy to get wrong.

Self-rate:

Interviewer · Follow-up

What if I didn't care about consecutive months — I just wanted to know whether a user who was active in month 0 was also active in month N, regardless of what happened in between?

Much simpler, no LAG needed. For each user I'd check two conditions independently: EXISTS an active month at cohort_month, and EXISTS an active month at cohort_month + N. The retention rate is the fraction of cohort-0-active users who are also cohort-N-active.

That's a very different retention definition — any-time retention versus streak retention. Which one you want depends on the product. For a subscription product, streak retention is often the right one because churn in the middle breaks the relationship. For a consumer product like Maps, any-time is usually more meaningful because people dip in and out.

Self-rate:

Interviewer · Follow-up

What would you do if the table had a billion rows and COUNT(DISTINCT) was too slow?

Two options.

First, pre-aggregate. Build a user_monthly_active table with one row per (user, month) where the user is active. That's the output of my first two CTEs. If I maintain that table incrementally, my retention query runs against something much smaller than the raw events.

Second, use approximate distinct counts — HyperLogLog (APPROX_COUNT_DISTINCT in BigQuery) is typically good to within 1–2% and is orders of magnitude faster. For a dashboard metric that's almost always acceptable.

If it were really large, I'd also consider representing each user's activity as a bitmap indexed by month offset. Then cohort retention becomes a bitwise AND across users, which is both fast and elegant. But that's probably overkill for an interview answer.

Self-rate:

Interviewer · Follow-up

Good. We're a few minutes over, let's wrap.

Thanks. If I were writing this for production I'd add a unit test with a small synthetic dataset — specifically a user who refunds in the middle of a month to verify the net-spend logic, and a user with a Feb–April gap to verify the LAG consecutive-month check. Those are the two places this kind of query quietly gets wrong.

Self-rate:

Interviewer

Quick warm-up: give me the users who made a successful purchase in the last 30 days.

Candidate

SQL

SELECT DISTINCT user_id
FROM user_events
WHERE event_type = 'purchase' AND amount > 0
  AND event_date >= CURRENT_DATE - INTERVAL '30 days';

Interviewer

Candidate

Okay, this one's going to need a few CTEs. Let me sketch the layers before I write SQL, because if I try to inline this I'll get it wrong.

Layer 2: the cohort assignment — for each user, the MIN of their active months is their cohort month.

Layer 4: the cohort size filter — count users in month 0 per cohort, then only keep cohorts with ≥ 100.

Then the final select joins these and produces the cohort × month_number retention table.

Interviewer

Write it out.

Candidate

[writes the SQL from the solution above]

Interviewer

Walk me through what LAG is doing in retention_flags.

Candidate

Interviewer

What if I didn't care about consecutive months — I just wanted to know whether a user who was active in month 0 was also active in month N, regardless of what happened in between?

Candidate

Interviewer

What would you do if the table had a billion rows and COUNT(DISTINCT) was too slow?

Candidate

Two options.

Interviewer

Good. We're a few minutes over, let's wrap.

Candidate

5Notes

Saved Export all notes

6Mock interview debrief

This question has a debrief tool attached. Practice it aloud with a voice-mode AI interviewer, paste the transcript, and get a graded debrief against the reference answer.

Create account or sign in

How to do a mock interview

1
Open DS Mock Interviewer in ChatGPT
2

Copy this question and paste it as your first message:

You have a table of user events for a subscription product: ``` user_events ----------- user_id BIGINT event_date DATE event_type VARCHAR -- 'purchase', 'refund', 'view', 'signup' product_id INT amount DECIMAL -- positive for purchase, negative for refund ``` ### Warm-up (easy) Return the user_ids of users who made at least one **successful purchase** in the **last 30 days** (`event_type = 'purchase'` and `amount > 0`). ```sql SELECT DISTINCT user_id FROM user_events WHERE event_type = 'purchase' AND amount > 0 AND event_date >= CURRENT_DATE - INTERVAL '30 days'; ``` ### Main question (hard) Compute the **monthly Customer Retention Rate (CRR)** by **signup cohort**, under the following conditions: 1. A user is considered **active in month M** only if they have at least one purchase in month M that is **not fully refunded** — i.e., after netting refunds against purchases in that month, net spend is strictly positive. 2. A user is **retained in month M** if they were active in month M−1 **and** active in month M (no gaps allowed — consecutive months only). 3. Group users into **cohorts** by the month of their first-ever successful purchase. 4. Output a table: `cohort_month`, `month_number` (0 = cohort month, 1 = next month, …), `active_users`, `retained_users`, `retention_rate`. 5. Only consider cohorts with at least **100 users** in month 0. This is the kind of problem where a naive `JOIN`-heavy approach becomes a mess. Use `LAG` and a small stack of CTEs.
3

Switch to voice mode (mic icon in the chat input). Speak through each follow-up — aim for 4–6 turns.
4

When the interviewer says "thank you, that's all I had", type or speak this:

Print the full transcript of our conversation as alternating "Interviewer:" and "Candidate:" lines. Include every exchange verbatim. Do not paraphrase, summarize, or skip turns. Do not add commentary.
5

Copy ChatGPT's response, paste it below, and run the debrief.

Shortcuts ›

SpaceReveal next 123Status FFocus TTranscript NNotes EscClose concept ←→Prev / next