Q002 · Experimentation · A/B testing · Novelty effect

A/B test lift shrinks in week 2 — what happened?

Medium difficulty · High frequency

A new Reels ranking model is tested 50/50 for 2 weeks. Week 1 shows +3% engagement, p=0.02. Week 2 the lift has shrunk to +0.5%, p=0.14. The PM wants to know: what happened, and should we still ship?

Framing

The knee-jerk answer is "novelty effect, run it longer." That's sometimes right and often wrong. Before you say novelty, rule out the alternatives — because the recommendation depends on which one it is.

The four hypotheses that explain "lift shrinks over time"
| Hypothesis | Signature in the data | Implication |
| --- | --- | --- |
| Novelty effect in treatment | Treatment engagement spikes, then decays toward control; control flat | Run longer; measure steady-state lift |
| Primacy effect (dislike of change) | Treatment drops first, then recovers toward control; control flat | Also run longer; steady-state may be positive |
| Regression toward the mean | Both arms move together; gap closes | Week-1 effect was noise; likely no real effect |
| Seasonality or external event | Gap closes because control caught up, not because treatment declined | Not a treatment effect; compare to an A/A-test baseline |

You cannot distinguish these from a single two-week total. You need to see the daily curves for each arm separately.

Diagnostic: plot both arms daily

Step 1 before anything else. If treatment declines and control is flat → novelty. If treatment rises and control rises → seasonality or broad trend. If both converge from opposite sides → treatment is settling to a real steady-state lift. The shape of the two curves is the diagnosis.
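As a sketch of what this diagnostic looks like in code (the events table and its day/arm/engagement columns are hypothetical, with synthetic data shaped like novelty decay):

```python
import numpy as np
import pandas as pd

# Hypothetical per-user-day events table; the schema (day, arm, engagement)
# is an assumption. Synthetic engagement follows a novelty-decay pattern.
rows = []
for day in range(1, 15):
    lift = 0.06 * np.exp(-0.3 * (day - 1)) + 0.01  # +7% day 1, settling near +1%
    rows.append({"day": day, "arm": "control", "engagement": 1.0})
    rows.append({"day": day, "arm": "treatment", "engagement": 1.0 + lift})
df = pd.DataFrame(rows)

# The diagnostic: one daily mean per arm, viewed side by side.
daily = df.pivot_table(index="day", columns="arm", values="engagement")
daily["lift_pct"] = 100 * (daily["treatment"] / daily["control"] - 1)
print(daily.round(3))  # declining lift with a flat control arm points at novelty
```

Here the shrinking lift_pct column against a flat control is the novelty signature; both arms rising together would instead point at seasonality.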

Diagnostic: segment by user-freshness to treatment

Within the treatment arm, split by "days since first exposure." If engagement in the first 3 days of exposure is much higher than in days 10–14, the effect is novelty-driven. Steady-state lift is whatever the day-10+ cohort shows — use that, not the two-week average, as your ship-decision number.
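A minimal sketch of that split, assuming a hypothetical treatment-arm exposure log with user_id, date, and engagement columns:

```python
import pandas as pd

# Tiny synthetic treatment-arm log: one user over 14 days, engagement decaying
# from a novelty high toward a steady state. The schema is illustrative only.
df = pd.DataFrame({
    "user_id": ["u1"] * 14,
    "date": pd.date_range("2024-01-01", periods=14),
    "engagement": [1.06, 1.05, 1.04, 1.03, 1.02, 1.02, 1.01,
                   1.01, 1.01, 1.01, 1.01, 1.01, 1.01, 1.01],
})
first = df.groupby("user_id")["date"].transform("min")
df["days_since_first_exposure"] = (df["date"] - first).dt.days

early = df.loc[df["days_since_first_exposure"] <= 2, "engagement"].mean()
steady = df.loc[df["days_since_first_exposure"] >= 10, "engagement"].mean()
print(f"first 3 days: {early:.3f}, day 10+: {steady:.3f}")
```

The gap between the early and day-10+ means is the novelty component; the day-10+ mean is the ship-decision number.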

Is week 2's p=0.14 actually "no effect"?

Not necessarily. Two sources of underpowering:

  • Half the data. A single week has roughly half the sample of two weeks, so the minimum detectable effect grows by about 40% (MDE scales as 1/√n). A true +0.5% effect could easily sit below the MDE on a week-2-only sample — you couldn't detect it even if it were real.
  • Repeated peeking. We looked at week 1 and wanted to ship; now we're looking at week 2 and wanting to decide. That's two peeks (the peeking problem), so p=0.14 is not a clean test of the null — the decision surface is contaminated.
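The peeking inflation is easy to demonstrate with an A/A simulation (all numbers illustrative; both arms draw from the same distribution, so every "significant" result is a false positive):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_week = 10_000, 100  # 100 users/arm/week, purely illustrative
t = rng.normal(0.0, 1.0, (n_sims, 2 * n_week))  # treatment arm, 2 weeks
c = rng.normal(0.0, 1.0, (n_sims, 2 * n_week))  # control arm, same distribution

def z_at(n):
    """z-statistic on the first n samples per arm (known sigma = 1)."""
    diff = t[:, :n].mean(axis=1) - c[:, :n].mean(axis=1)
    return diff / np.sqrt(2.0 / n)

# "Ship if week 1 OR week 2 looks significant" = two peeks at alpha = 0.05
sig_any = (np.abs(z_at(n_week)) > 1.96) | (np.abs(z_at(2 * n_week)) > 1.96)
print(sig_any.mean())  # ~0.08 false-positive rate, well above the nominal 0.05
```

With two looks at α=0.05 the false-positive rate lands around 8%; decision rules that peek need corrected thresholds (e.g., group-sequential bounds).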

So p=0.14 ≠ "effect is zero." It means "we don't have power at this sample size to distinguish +0.5% from zero."
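To make the power point concrete, a back-of-envelope MDE for a two-sample comparison of means (normal approximation; the z-values are hardcoded for α=0.05 two-sided and 80% power, and the sample sizes are hypothetical):

```python
import math

def mde(n_per_arm, sigma=1.0):
    """Minimum detectable effect for a two-sample test of means,
    with alpha=0.05 two-sided (z=1.96) and 80% power (z=0.84)."""
    return (1.96 + 0.84) * sigma * math.sqrt(2.0 / n_per_arm)

two_weeks = mde(n_per_arm=2_000_000)  # hypothetical two-week sample
one_week = mde(n_per_arm=1_000_000)   # half the sample
print(f"MDE ratio: {one_week / two_weeks:.3f}")  # 1.414: sqrt(2), not 2x
```

So halving the sample inflates the MDE by about 40%; either way, a +0.5% true effect can fall below the one-week detection bar.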

The ship decision

Depends on the diagnostic.

  • If the daily curves show classic novelty decay (treatment spikes then settles) — don't ship on the week-1 number. Steady-state is what prod will look like after launch. Run the test longer (4+ weeks) and measure steady-state lift on the day-10+ cohort.
  • If curves show primacy (treatment dips then recovers) — this is rarer but the mirror case. Run longer; steady-state may be positive.
  • If both arms moved together (seasonality/trend) — the week-1 result was noise or external. Re-run the test in a stable period.
  • If the diagnostic shows nothing clean — don't ship; record it as a null result. "Inconclusive" is a decision.

Don't let "we committed 2 weeks, we need to decide today" override the diagnostic. The cost of shipping a novelty-driven feature is months of confused follow-up experiments trying to explain why headline engagement dropped after launch.

Interviewer

Reels ranking A/B test: +3% week 1 at p=0.02, shrinks to +0.5% week 2 at p=0.14. The PM wants to ship based on the week 1 number. What do you tell them?

I wouldn't ship yet, because I don't know which of four things is happening. Novelty is the obvious guess, but it's not the only one.

Before anything else I'd pull the daily engagement curves for treatment and control separately. There are four signatures and each implies a different decision.

If treatment spikes on days 1–3 and decays toward control, that's a novelty effect — users are exploring the new recommendations and it'll settle. Steady-state lift is what you'd see in prod, and the week-1 number is a mirage.

If treatment dips first and recovers, that's a primacy effect — users initially disliked the change and adapted. Steady-state might be positive; also run longer.

If both arms moved together, it's seasonality or an external event. The week-1 gap was noise, not a real treatment effect.

If the curves show neither pattern and the gap just closed because week-2 noise was unfavorable, that's regression to the mean — effect was probably small to begin with.
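Regression to the mean under selection can be simulated directly (true effect assumed zero, noise level illustrative): condition on week 1 looking significant, then check week 2 for those same runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, se = 100_000, 0.01            # se: std error of a weekly lift estimate
week1 = rng.normal(0.0, se, n_sims)   # true effect is zero in every run
week2 = rng.normal(0.0, se, n_sims)   # independent noise, same (zero) effect

winners = np.abs(week1) > 1.96 * se   # runs where week 1 looked "significant"
print(np.abs(week1[winners]).mean())  # inflated: selection picked big noise
print(np.abs(week2[winners]).mean())  # falls back toward the noise floor
```

Selecting on significance guarantees the selected week-1 estimates are inflated; week 2 regresses because the noise doesn't repeat.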

On the week-2 p=0.14 specifically: that's not "no effect." At half the sample size, the minimum detectable effect grows by about 40% (it scales as 1/√n), so a real +0.5% effect is below the detection bar on the week-2-only sample. Don't read p=0.14 as "null confirmed." Read it as "underpowered."

My ask to the PM: give me the daily curves. Pick the decision based on the shape, not the headline averages.

Interviewer · Follow-up

You pull the curves. Treatment starts at +6% day 1, decays to +1% by day 10, stays around +1% days 10–14. Control is flat throughout. What now?

Classic novelty effect. The steady-state lift is the day-10+ number — call it +1%. The +3% two-week average is inflated by the early novelty boost that won't persist in production.

Ship decision now depends on whether +1% at the day-10+ steady state clears the bar.

What I'd look at: was +1% pre-registered as the minimum detectable effect for this experiment? If the MDE was +2%, then the steady-state signal is below what we designed to detect, and launching on +1% means making a decision on an effect size the test wasn't built to resolve. Rerun with roughly 4× the sample (MDE scales as 1/√n, so halving it takes quadrupling the data) to nail the steady state down.

If MDE was +0.5% and the steady-state is clearly +1% with tight confidence — ship. Small but real. The week-1 +3% was a bonus we won't see in production.

The thing I wouldn't do: ship on the +3% headline. That's forecasting launch impact from a cohort experience (the first week of exposure) that won't generalize — every user in production eventually becomes a day-10+ user once they've been exposed long enough. The day-1 novelty bump happens once per user.

Interviewer · Follow-up

PM pushes back: "Our team ships every 2 weeks. If we wait 4 weeks for steady-state we'll miss the release train."

I'd reframe the question. The cost of launching on the novelty-inflated number isn't "nothing"; it's a specific and predictable problem down the road.

Three months from now, engagement will have settled to whatever the steady-state lift was. If we shipped on +3%, executives will ask why headline engagement is only up 1% and not 3%. Someone will then spend 4 weeks running a counter-experiment to figure out what went wrong, which will eventually conclude: "nothing went wrong, we just shipped on a novelty number." That's 4 weeks of debug work plus a credibility hit.

The 2-week extension to measure steady state costs 2 weeks. The retroactive counter-experiment costs 4+ weeks plus the narrative cost.

The honest framing to the PM: "We can ship in 2 weeks on a number that won't hold up, or ship in 4 weeks on a number that will. Either way, the 4-week sample is what we'll be accountable for. Which one do you want to stake your decision on?"

If they still want to ship at 2 weeks, I'd escalate that as a judgment call, not a data call. The data's job is to give the decision-maker the steady-state estimate; the decision itself is theirs.

Interviewer · Follow-up

Let's say we shipped at 2 weeks on +3%. Three months later engagement is +1%. How do we handle the retrospective?

Separate two questions: did we ship the right thing, and did we make the right decision at the time.

Did we ship the right thing: yes, +1% lifetime lift on Reels engagement at Meta scale is still meaningful — the new model is better than the old one. So the product decision was correct. The experiment-to-launch gap isn't a case of "we shipped garbage."

Did we make the right decision at the time: no. The diagnostic plot would have shown novelty decay and we'd have updated the steady-state estimate before ship. We used a headline number we knew could be inflated, and that was the gap.

What I'd change going forward — specific fixes, not retroactive blame:

  • Every experiment over 1 week runs the daily-curve diagnostic as part of readout, not as an on-demand request. Make it the default.
  • The launch doc's "expected lift" field records the steady-state estimate, not the two-week average. Makes it harder to forget.
  • If the diagnostic shows novelty, the pre-registered launch-go/no-go lift is re-computed on the day-10+ cohort before shipping.

The shape of the fix is process, not a new model or more data. We had the signal; we didn't build the habit to use it.


How to do a mock interview

  1. Copy this question and paste it as your first message. This briefs the interviewer — they won't start asking yet.

     Prompt: A new Reels ranking model is tested 50/50 for 2 weeks. Week 1 shows +3% engagement, p=0.02. Week 2 the lift has shrunk to +0.5%, p=0.14. The PM wants to know: what happened, and should we still ship?

  2. Switch to voice mode (mic icon in the chat input). Say "I'm ready, let's begin" — the interviewer will start asking one sub-question at a time. Aim for 4–6 turns.

  3. When the interviewer says "thank you, that's all I had", exit voice mode, then paste this prompt.

     Prompt: Print the full transcript of our conversation as alternating "Interviewer:" and "Candidate:" lines. Strip any <<AudioTranscription: ...>> wrappers from spoken input — include only the spoken content inside them. Include every exchange verbatim otherwise. Do not paraphrase, summarize, or skip turns. Do not add commentary.

  4. Copy ChatGPT's response, paste it below, and run the debrief.
