Explain CI, P-value, and Alpha — How Would You Explain Them to a PM?
Explain confidence intervals, p-values, and alpha level. How would you explain each to a non-technical product manager? How is alpha chosen?
▶Confidence Interval (CI)
Technical definition: If we repeated the experiment many times, 95% of the constructed confidence intervals would contain the true parameter value.
Common misconception: "There's a 95% probability the true value is in this interval." — WRONG in frequentist framework. The true value is fixed; the interval is random.
For PM: "We ran the experiment and the new feature increased engagement by 3 minutes, with a 95% CI of [1.5, 4.5]. This means we're fairly confident the real improvement is somewhere between 1.5 and 4.5 minutes."
What determines CI width:
- Sample size n ↑ → CI narrows
- Data variability σ ↑ → CI widens
- Confidence level ↑ (95% → 99%) → CI widens
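The three width drivers can be checked with a minimal sketch, using a normal-approximation CI with known σ (the numbers are illustrative):

```python
import math

def normal_ci(mean, sigma, n, z=1.96):
    """Two-sided CI for a mean with known sigma: mean ± z * sigma / sqrt(n)."""
    half = z * sigma / math.sqrt(n)
    return (mean - half, mean + half)

# Larger n narrows the interval; larger sigma or a higher confidence
# level (bigger z) widens it.
lo1, hi1 = normal_ci(3.0, 5.0, 100)            # baseline: 95%, n = 100
lo2, hi2 = normal_ci(3.0, 5.0, 400)            # 4x the sample: half-width halves
lo3, hi3 = normal_ci(3.0, 5.0, 100, z=2.576)   # 99% confidence: wider
```

Quadrupling n halves the width (the √n in the denominator), which is why precision gets expensive fast.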
▶P-value
Technical definition: The p-value is the probability of observing a result as extreme as (or more extreme than) the current result, assuming the null hypothesis H₀ is true.
Common misconceptions:
- ≠ probability that H₀ is true
- ≠ probability the result is due to chance
- ≠ effect size (small p ≠ large effect; with huge n, tiny effects become significant)
For PM: "The p-value is a surprise score. If the feature had zero real effect, how surprised would we be to see this result? A p-value of 0.03 means there's only a 3% chance we'd see something this dramatic if nothing was actually happening. That's pretty surprising, so we think the effect is real."
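The "surprise score" reading can be made concrete with a null simulation (a minimal sketch; the sample size and observed z-statistic are hypothetical):

```python
import random
import statistics

random.seed(0)

# Null world: the feature has zero effect, so each user's lift is pure
# noise ~ N(0, 1). Hypothetical observed z-statistic from the experiment:
observed_z = 2.17
n, sims = 100, 20_000

# Simulate the test statistic under H0 and count how often something
# at least this extreme shows up (two-sided).
extreme = 0
for _ in range(sims):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.mean(sample) * n ** 0.5   # sd of the sample mean is 1/sqrt(n)
    if abs(z) >= observed_z:
        extreme += 1

p_value = extreme / sims   # close to the exact normal tail, 2*(1 - Phi(2.17)) ≈ 0.03
```

The p-value is literally the fraction of null worlds that look at least as dramatic as what we saw.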
▶Alpha (Significance Level)
Technical definition: The threshold for rejecting H₀. It controls the Type I error rate — the probability of a false positive.
Why 0.05? Historical convention from Ronald Fisher (1925). No deep mathematical reason.
How to actually choose alpha — it's a business decision:
| Scenario | Alpha | Reasoning |
|---|---|---|
| Drug clinical trial | 0.01 or 0.001 | False positive → patient harm |
| Google Ads auction change | 0.01 | False positive → revenue loss at scale |
| UI color change | 0.05 | Low risk, easy rollback |
| Push notification test | 0.10 | Low cost, don't want to miss opportunities |
| Genomics (10,000 tests) | 0.05/m (Bonferroni) or FDR (Benjamini–Hochberg) | Multiple-testing correction |
The real framework: Balance Type I error cost vs. Type II error cost, considering:
- Is the decision reversible?
- What's the cost of a false positive vs missing a true effect?
- How much sample size can you afford? (lower α → need more data, hurts statistical power)
A PM with no stats background asks you: what are confidence intervals, p-values, and alpha, and why 0.05?
I'd anchor on the duality first because it makes the three concepts click together, then translate each into PM-speak.
The duality: a p-value is the probability of data this extreme or more, assuming the null hypothesis. A confidence interval is the set of parameter values the data doesn't reject. Alpha is the rejection threshold we agreed to in advance. The link: the 95% CI excludes zero if and only if p < 0.05. They're two windows into the same test.
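The duality can be checked numerically with the running example (a sketch; the standard error is backed out from the CI width, assuming a normal-approximation interval):

```python
import math

def two_sided_p(z):
    """Two-sided p-value from a z-statistic via the normal CDF."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Running example: observed lift 3.0 with 95% CI [1.5, 4.5].
se = (4.5 - 1.5) / (2 * 1.96)   # CI width = 2 * 1.96 * SE
z = 3.0 / se
p = two_sided_p(z)

# The CI excludes zero if and only if p < 0.05.
ci_excludes_zero = 1.5 > 0
assert (p < 0.05) == ci_excludes_zero
```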
Then for the PM:
"P-value is a surprise score. If the feature had zero real effect, how surprised should we be by what we saw? 0.03 means a 3% chance of seeing this extreme a result under nothing-going-on. Low enough that we're willing to say the effect is real.
Confidence interval is the range of effects the data is consistent with. We observed a lift of 3 minutes with a 95% CI of [1.5, 4.5]. Zero isn't in there, so the effect is statistically significant at 5%. And the range tells you precision: [1.5, 4.5] is tight; [0.1, 5.9] is noisy.
Alpha is the false-alarm rate we agreed to tolerate. If we set α = 0.05 and there's no real effect, we'll still claim a win 5% of the time due to random variation. That's the cost of doing business."
The thing I'd emphasize: a p-value is not the probability the null is true, not the probability the effect is real, and not the effect size. It's a statement about the data given H₀, nothing more.
Why 0.05, and how would you actually pick alpha for a real experiment?
0.05 is Fisher's 1925 convention. He called it "a convenient level." There's no math behind the specific number.
The principled framework is cost-benefit. The expected cost of a wrong decision is α·C_FP + β(α)·C_FN, where C_FP and C_FN are the costs of a false positive and a false negative, and β(α) is the Type II error rate at that threshold. Minimize that over α given the power curve.
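That minimization can be sketched for a two-sided z-test (the effect size, cost ratio, and α grid here are illustrative assumptions):

```python
from statistics import NormalDist

N = NormalDist()

def expected_cost(alpha, effect_over_se, c_fp, c_fn):
    """alpha * C_FP + beta(alpha) * C_FN for a two-sided z-test, where
    effect_over_se is the assumed true effect divided by its standard error."""
    z_crit = N.inv_cdf(1 - alpha / 2)
    power = 1 - N.cdf(z_crit - effect_over_se)   # far tail ignored
    beta = 1 - power                             # Type II error rate
    return alpha * c_fp + beta * c_fn

# A false positive 10x as costly as a false negative pulls alpha down.
grid = [0.001, 0.005, 0.01, 0.05, 0.10]
best = min(grid, key=lambda a: expected_cost(a, effect_over_se=2.8,
                                             c_fp=10.0, c_fn=1.0))
```

With the 10:1 cost ratio the minimizer lands around 0.005–0.01; with equal costs it drifts up toward 0.10 — the same direction as the rule of thumb.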
The practical rule of thumb I'd give the PM: compare the two costs. If a false positive is 10× worse than a false negative, α should be an order of magnitude tighter — 0.005, not 0.05. Shipping a bad change to the Ads auction costs millions in revenue and is painful to roll back; α = 0.005 or 0.001 is the right bar. If false negatives dominate — a cheap, reversible UI experiment where the main cost is missing a winner — α = 0.10 is fine.
One non-negotiable: α is committed before the experiment runs. Picking it after seeing the data is where p-hacking lives.
The experiment comes back at p = 0.06. The PM wants to launch. What do you do?
I commit to the decision rule we pre-registered.
Pre-registered α was 0.05. p = 0.06 is not significant at that bar. I report null, and I don't redecide α after seeing the data — that's moving the goalposts, and the false-positive rate of "pick α after peeking" is not 5%, it's much worse.
Concretely, what I'd tell the PM: "We committed to 0.05 before launch. The result didn't clear it. If you want a second swing at this feature, it's a new experiment with a new sample — not a re-roll of the same data. Re-using the same users or extending the window until it crosses 0.05 is p-hacking, and at that point the reported 0.05 is a lie about the actual false-positive rate of the procedure."
This isn't me being rigid for the sake of it. The entire point of a pre-registered α is that it makes the false-positive rate a real guarantee. Once you break pre-registration, you don't have that guarantee any more.
The PM pushes back. "It's basically significant. We should launch."
Two responses.
One: "basically significant" has no statistical meaning. The threshold is binary by design. p = 0.06 and p = 0.04 feel close, but they sit on opposite sides of a decision rule we chose precisely because we needed a rule. Saying "close to 0.05 is fine" is the same as saying "we actually wanted α = 0.07," which nobody pre-registered.
Two, the useful diagnostic: the power the test had against the effect size we planned for (the MDE, the minimum detectable effect). Power computed at the observed effect is uninformative — it's a one-to-one function of the p-value — so the planned effect size is the right reference point. This tells us which world we're in.
If power at the MDE was ~50%, the test was underpowered — the true effect might be real but we didn't collect enough data to see it. The right response isn't "launch anyway on p = 0.06"; it's to rerun at 2× the sample size. A properly powered rerun either clears 0.05 cleanly or confirms the effect is smaller than we thought.
If power at the MDE was ~90% and we still got p = 0.06, the effect really is small — at the MDE boundary or below it. In that case "basically significant" is actively misleading: the data is telling us the effect isn't big enough to clear the bar we set for a reason.
Either way, the decision isn't "launch on borderline p." It's either "rerun with more data" or "the effect is too small to justify ship." Both respect the pre-registered α.
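A minimal sketch of the power number behind this diagnostic, for a two-sided z-test (effect_over_se is the planned effect size divided by its standard error; the values are illustrative):

```python
from statistics import NormalDist

N = NormalDist()

def power_at(effect_over_se, alpha=0.05):
    """Approximate power of a two-sided z-test against an assumed true
    effect, expressed in standard-error units. The far tail is ignored."""
    z_crit = N.inv_cdf(1 - alpha / 2)
    return 1 - N.cdf(z_crit - effect_over_se)

underpowered = power_at(2.0)    # effect only 2 SEs away: power near 50%
well_powered = power_at(3.24)   # effect 3.24 SEs away: power near 90%
```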
Last one. You're running 20 A/B tests this quarter. How do you handle multiple comparisons?
Family-wise false positives compound. At α = 0.05 across 20 independent tests, P(at least one false positive) = 1 − (0.95)²⁰ ≈ 0.64 — odds-on you declare at least one winner purely by chance. Uncorrected α = 0.05 at that scale is indefensible.
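The compounding is one line of arithmetic (a sketch assuming independent tests):

```python
# Family-wise error rate across m independent tests at per-test alpha:
m, alpha = 20, 0.05
fwer_uncorrected = 1 - (1 - alpha) ** m        # ~0.64: likely at least one false win
fwer_bonferroni = 1 - (1 - alpha / m) ** m     # back under 0.05
```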
Two corrections worth knowing.
Bonferroni: test each at α/m = 0.05/20 = 0.0025. Controls the family-wise error rate (the probability of any false positive) at 5%. Simple, conservative, and it kills statistical power — you need much larger samples to detect real effects.
Benjamini–Hochberg FDR: rank the p-values p₍₁₎ ≤ … ≤ p₍ₘ₎ and reject all hypotheses up to the largest k with p₍ₖ₎ ≤ (k/m)·q, for target FDR q. Controls the expected proportion of false discoveries among your rejections, not the probability of any one. For q = 0.10 across 20 tests, about 10% of declared winners are expected to be false — far less conservative than Bonferroni, so it costs much less power.
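The BH step-up rule fits in a few lines (a sketch; the p-values are a hypothetical portfolio):

```python
def benjamini_hochberg(p_values, q=0.10):
    """Indices rejected by the BH step-up procedure: sort the p-values,
    find the largest k with p_(k) <= (k/m) * q, reject ranks 1..k."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical portfolio of 20 test results:
ps = [0.001, 0.002, 0.008, 0.012, 0.02, 0.03] + [0.2 + 0.04 * i for i in range(14)]
bh_rejects = benjamini_hochberg(ps, q=0.10)                       # ranks 1-6 all clear (k/m)*q
bonf_rejects = [i for i, p in enumerate(ps) if p <= 0.05 / 20]    # only the first two
```

On this portfolio Bonferroni keeps two winners while BH keeps six — that gap is the power trade-off in practice.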
For an A/B testing portfolio I'd default to Benjamini–Hochberg at q = 0.10. Launching a feature is reversible; tolerating a 10% false-discovery rate to retain power is a good trade. For irreversible or safety-critical decisions — drug trials, auction changes — I'd flip to Bonferroni.
Good.