Q003 · Statistics · Hypothesis testing · Confidence intervals

Explain CI, P-value, and Alpha — How Would You Explain Them to a PM?

Medium · High frequency

Explain confidence intervals, p-values, and alpha level. How would you explain each to a non-technical product manager? How is alpha chosen?

Confidence Interval (CI)

Technical definition: If we repeated the experiment many times, 95% of the constructed confidence intervals would contain the true parameter value.

Common misconception: "There's a 95% probability the true value is in this interval." — WRONG in the frequentist framework. The true value is fixed; the interval is random.

For PM: "We ran the experiment and the new feature increased engagement by 3 minutes, with a 95% CI of [1.5, 4.5]. This means we're fairly confident the real improvement is somewhere between 1.5 and 4.5 minutes."

What determines CI width:

  • Sample size n ↑ → CI narrows
  • Data variability σ ↑ → CI widens
  • Confidence level ↑ (95% → 99%) → CI widens
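The three levers can be checked numerically. A minimal sketch using the normal-approximation interval mean ± z·σ/√n, with illustrative numbers:

```python
import math

def normal_ci(mean, sigma, n, z=1.96):
    """Normal-approximation CI for a mean: mean ± z * sigma / sqrt(n).

    z = 1.96 for 95%, z = 2.576 for 99%. Illustrative values only.
    """
    half_width = z * sigma / math.sqrt(n)
    return (mean - half_width, mean + half_width)

def width(ci):
    lo, hi = ci
    return hi - lo

# Larger n -> narrower CI (4x the sample halves the width)
w_small_n = width(normal_ci(3.0, sigma=10.0, n=100))
w_large_n = width(normal_ci(3.0, sigma=10.0, n=400))

# Larger sigma -> wider CI
w_low_var = width(normal_ci(3.0, sigma=5.0, n=100))
w_high_var = width(normal_ci(3.0, sigma=10.0, n=100))

# Higher confidence level -> wider CI
w_95 = width(normal_ci(3.0, sigma=10.0, n=100, z=1.96))
w_99 = width(normal_ci(3.0, sigma=10.0, n=100, z=2.576))
```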
P-value

Technical definition: The p-value is the probability of observing a result as extreme as (or more extreme than) the current result, assuming the null hypothesis H₀ is true.

Common misconceptions:

  • ≠ probability that H₀ is true
  • ≠ probability the result is due to chance
  • ≠ effect size (small p ≠ large effect; with huge n, tiny effects become significant)

For PM: "The p-value is a surprise score. If the feature had zero real effect, how surprised would we be to see this result? A p-value of 0.03 means there's only a 3% chance we'd see something this dramatic if nothing was actually happening. That's pretty surprising, so we think the effect is real."
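The "surprise score" framing can be made concrete with a small simulation: build the null world where the true effect is zero, and count how often random variation alone produces a lift as large as the one observed. The numbers here (n = 50 users, σ = 10 minutes, observed lift of 3 minutes) are hypothetical:

```python
import random
import statistics

random.seed(0)

# Hypothetical experiment: n = 50 users, per-user engagement change has
# sd = 10 minutes, and we observed an average lift of 3 minutes.
n, sigma, observed_lift = 50, 10.0, 3.0

# The null world: true effect is zero. How big an average lift does
# random variation alone produce across many simulated experiments?
null_lifts = [statistics.fmean(random.gauss(0.0, sigma) for _ in range(n))
              for _ in range(10_000)]

# Two-sided p-value: the fraction of null draws at least this extreme.
p_value = sum(abs(x) >= observed_lift for x in null_lifts) / len(null_lifts)
```

With these made-up numbers the simulated p-value lands near 0.03, matching the "pretty surprising" example in the text.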

Alpha (Significance Level)

Technical definition: The threshold for rejecting H₀. It controls the Type I error rate — the probability of a false positive.

Why 0.05? Historical convention from Ronald Fisher (1925). No deep mathematical reason.

How to actually choose alpha — it's a business decision:

Scenario | Alpha | Reasoning
Drug clinical trial | 0.01 or 0.001 | False positive → patient harm
Google Ads auction change | 0.01 | False positive → revenue loss at scale
UI color change | 0.05 | Low risk, easy rollback
Push notification test | 0.10 | Low cost, don't want to miss opportunities
Genomics (10,000 tests) | 0.05/m (Bonferroni or BH-FDR) | Multiple testing correction

The real framework: balance the cost of a Type I error against the cost of a Type II error, considering:

  1. Is the decision reversible?
  2. What's the cost of a false positive vs missing a true effect?
  3. How much sample size can you afford? (Lower α needs more data; at a fixed n it costs statistical power.)
Interviewer

A PM with no stats background asks you: what are confidence intervals, p-values, and alpha, and why 0.05?

I'd anchor on the duality first because it makes the three concepts click together, then translate each into PM-speak.

The duality: a p-value is the probability of data this extreme or more, assuming the null hypothesis. A confidence interval is the set of parameter values the data doesn't reject. Alpha is the rejection threshold we agreed to in advance. The link: the 95% CI excludes zero if and only if p < 0.05, provided the interval is built by inverting that same test. They're two windows into the same test.
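The duality can be sketched for a two-sided z-test, assuming a normal sampling distribution (the effect sizes and standard errors below are made up):

```python
import math

def z_test_p_and_95ci(effect, se):
    """Two-sided z-test p-value plus the 95% CI built by inverting it.

    Normal CDF via math.erf; 1.959964 is the 0.975 normal quantile.
    """
    z = abs(effect) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    z_crit = 1.959964
    return p, (effect - z_crit * se, effect + z_crit * se)

# Duality: the 95% CI excludes zero exactly when p < 0.05.
p_sig, ci_sig = z_test_p_and_95ci(effect=3.0, se=0.75)    # hypothetical lift
p_null, ci_null = z_test_p_and_95ci(effect=1.0, se=0.75)  # noisier, smaller lift
```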

Then for the PM:

"P-value is a surprise score. If the feature had zero real effect, how surprised should we be by what we saw? 0.03 means a 3% chance of seeing this extreme a result under nothing-going-on. Low enough that we're willing to say the effect is real.

Confidence interval is the range of effects the data is consistent with. We observed a lift of 3 minutes with a 95% CI of [1.5, 4.5]. Zero isn't in there, so the effect is statistically significant at 5%. And the range tells you precision: [1.5, 4.5] is tight; [0.1, 5.9] is noisy.

Alpha is the false-alarm rate we agreed to tolerate. If we set α = 0.05 and there's no real effect, we'll still claim a win 5% of the time due to random variation. That's the cost of doing business."

The thing I'd emphasize: a p-value is not the probability the null is true, not the probability the effect is real, and not the effect size. It's a statement about the data given H₀, nothing more.

Interviewer · Follow-up

Why 0.05, and how would you actually pick alpha for a real experiment?

0.05 is Fisher's 1925 convention. He called it "a convenient level." There's no math behind the specific number.

The principled framework is cost-benefit. The expected cost of a wrong decision is C_FP · α + C_FN · (1 − power). Minimize that over α given the power curve.
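That minimization can be sketched for a one-sided z-test by scanning a grid of α values (the costs, effect, and standard error are hypothetical):

```python
from statistics import NormalDist

N = NormalDist()

def power(alpha, effect, se):
    """Power of a one-sided z-test run at significance level alpha."""
    z_crit = N.inv_cdf(1 - alpha)
    return 1 - N.cdf(z_crit - effect / se)

def expected_cost(alpha, effect, se, cost_fp, cost_fn):
    # C_FP * alpha + C_FN * (1 - power), as in the text
    return cost_fp * alpha + cost_fn * (1 - power(alpha, effect, se))

# Hypothetical experiment: standardized effect effect/se = 2; scan alphas.
grid = [a / 1000 for a in range(1, 200)]

# False positives 10x costlier -> minimizing alpha tightens well below 0.05.
best_fp_heavy = min(grid, key=lambda a: expected_cost(a, 2.0, 1.0, 10.0, 1.0))
# False negatives 10x costlier -> the minimum sits at the top of the grid.
best_fn_heavy = min(grid, key=lambda a: expected_cost(a, 2.0, 1.0, 1.0, 10.0))
```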

The practical rule of thumb I'd give the PM: compare the two costs. If a false positive is 10× worse than a false negative, α should be an order of magnitude tighter — 0.005, not 0.05. Shipping a bad change to the Ads auction costs millions in revenue and is painful to roll back; α = 0.005 or 0.001 is the right bar. If false negatives dominate — a cheap, reversible UI experiment where the main cost is missing a winner — α = 0.10 is fine.

One non-negotiable: α is committed before the experiment runs. Picking it after seeing the data is where p-hacking lives.

Interviewer · Follow-up

The experiment comes back at p = 0.06. The PM wants to launch. What do you do?

I commit to the decision rule we pre-registered.

Pre-registered α was 0.05. p = 0.06 is not significant at that bar. I report null, and I don't redecide α after seeing the data — that's moving the goalposts, and the false-positive rate of "pick α after peeking" is not 5%, it's much worse.

Concretely, what I'd tell the PM: "We committed to 0.05 before launch. The result didn't clear it. If you want a second swing at this feature, it's a new experiment with a new sample — not a re-roll of the same data. Re-using the same users or extending the window until it crosses 0.05 is p-hacking, and at that point the reported 0.05 is a lie about the actual false-positive rate of the procedure."
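How much worse "extend the window until it crosses 0.05" is can be quantified with an A/A simulation: no real effect, but we test after every batch and declare a win at the first |z| > 1.96. A sketch with made-up batch sizes:

```python
import random

random.seed(1)

def peeked_experiment(n_looks=10, n_per_look=100, z_crit=1.96):
    """One A/A run (true effect zero), tested after every batch of users.

    Returns True if ANY interim look crosses |z| > z_crit, i.e. the
    'keep extending the window until it crosses 0.05' procedure wins.
    """
    total, running_sum = 0, 0.0
    for _ in range(n_looks):
        for _ in range(n_per_look):
            running_sum += random.gauss(0.0, 1.0)
            total += 1
        z = running_sum / total ** 0.5   # z-statistic for mean 0, sd 1
        if abs(z) > z_crit:
            return True
    return False

trials = 2000
fpr = sum(peeked_experiment() for _ in range(trials)) / trials
# Nominal alpha is 0.05; with 10 peeks the realized rate is far higher.
```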

This isn't me being rigid for the sake of it. The entire point of a pre-registered α is that it makes the false-positive rate a real guarantee. Once you break pre-registration, you don't have that guarantee any more.

Interviewer · Follow-up

The PM pushes back. "It's basically significant. We should launch."

Two responses.

One: "basically significant" has no statistical meaning. The threshold is binary by design. p = 0.06 and p = 0.04 feel close, but they sit on opposite sides of a decision rule we chose precisely because we needed a rule. Saying "close to 0.05 is fine" is the same as saying "we actually wanted α = 0.07," which nobody pre-registered.

Two, the useful diagnostic: how much power the test had at the pre-specified minimum detectable effect (MDE). (Power recomputed at the observed effect size is a one-to-one function of the p-value and adds nothing, so evaluate it at the effect size we planned for.) This tells us which world we're in.

If power at the MDE was ~50%, the test was underpowered — the true effect might be real but we didn't collect enough data to see it. The right response isn't "launch anyway on p = 0.06"; it's rerun at 2× sample size. A properly powered rerun either clears 0.05 cleanly or confirms the effect is smaller than we thought.

If power at the MDE was ~90% and we still got p = 0.06, the effect really is small — around the MDE boundary or below it. In that case "basically significant" is actively misleading: the data is telling us the effect isn't big enough to clear the bar we set for a reason.

Either way, the decision isn't "launch on borderline p." It's either "rerun with more data" or "the effect is too small to justify ship." Both respect the pre-registered α.
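Power at a planned MDE, and what the 2× rerun buys, can be sketched for a two-arm z-test (the MDE, σ, and sample sizes here are hypothetical):

```python
from statistics import NormalDist

N = NormalDist()

def power_at_mde(mde, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-arm z-test at the planned MDE.

    Ignores the negligible far-tail term. The standard error shrinks like
    sqrt(1/n), so doubling n raises power at the same MDE.
    """
    se = sigma * (2.0 / n_per_arm) ** 0.5
    z_crit = N.inv_cdf(1 - alpha / 2)
    return 1 - N.cdf(z_crit - mde / se)

# Hypothetical numbers: MDE = 1 minute, sigma = 10, 500 users per arm.
p_original = power_at_mde(1.0, 10.0, 500)
p_rerun = power_at_mde(1.0, 10.0, 1000)   # the 2x-sample-size rerun
```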

Interviewer · Follow-up

Last one. You're running 20 A/B tests this quarter. How do you handle multiple comparisons?

Family-wise false positives compound. At α = 0.05 across 20 independent tests, P(at least one false positive) = 1 − (0.95)²⁰ ≈ 0.64 — odds-on you declare at least one winner purely by chance. Uncorrected α = 0.05 at that scale is indefensible.
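The compounding, and what dividing α by m buys back, is one line of arithmetic:

```python
def family_wise_fpr(alpha, m):
    """P(at least one false positive) across m independent true-null tests."""
    return 1 - (1 - alpha) ** m

fwer_uncorrected = family_wise_fpr(0.05, 20)       # ~0.64, as in the text
fwer_bonferroni = family_wise_fpr(0.05 / 20, 20)   # pulled back under 0.05
```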

Two corrections worth knowing.

Bonferroni: test each at α/m = 0.05/20 = 0.0025. Controls the family-wise error rate (the probability of any false positive) at 5%. Simple, conservative, and expensive in statistical power — you need much larger samples to detect real effects.

Benjamini–Hochberg FDR: rank the p-values and reject the k-th smallest if p_(k) ≤ (k/m) · q for target FDR q. Controls the expected proportion of false discoveries among your rejections, not the probability of any one. For q = 0.10 across 20 tests, about 10% of declared winners are expected to be false — far gentler on power than Bonferroni.
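The BH step-up rule is short enough to implement directly. A sketch comparing it with Bonferroni at the same overall level, on made-up p-values:

```python
def benjamini_hochberg(p_values, q=0.10):
    """BH step-up: sort the p-values, find the largest k with
    p_(k) <= (k/m) * q, and reject the k smallest. Returns 0-based indices."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * q:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical portfolio of 6 p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
bh_rejections = benjamini_hochberg(pvals, q=0.10)
# Bonferroni at the same overall level 0.10, for comparison.
bonf_rejections = [i for i, p in enumerate(pvals) if p <= 0.10 / len(pvals)]
```

On this sketch BH rejects four hypotheses where Bonferroni keeps only two, which is the power difference described above.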

For an A/B testing portfolio I'd default to BH-FDR at q = 0.10. Launching a feature is reversible; tolerating a 10% false-discovery rate to retain power is a good trade. For irreversible or safety-critical decisions — drug trials, auction changes — I'd flip to Bonferroni.


This question has a debrief tool attached. Practice it aloud with a voice-mode AI interviewer, paste the transcript, and get a graded debrief against the reference answer.


How to do a mock interview
  1. Copy this question and paste it as your first message:

     Explain confidence intervals, p-values, and alpha level. How would you explain each to a non-technical product manager? How is alpha chosen?
  2. Switch to voice mode (mic icon in the chat input). Speak through each follow-up — aim for 4–6 turns.
  3. When the interviewer says "thank you, that's all I had", type or speak this:

     Print the full transcript of our conversation as alternating "Interviewer:" and "Candidate:" lines. Include every exchange verbatim. Do not paraphrase, summarize, or skip turns. Do not add commentary.
  4. Copy ChatGPT's response, paste it below, and run the debrief.
