Q003 · Statistics · Hypothesis testing · Confidence intervals

Explain CI, P-value, and Alpha — How Would You Explain Them to a PM?

Medium · High frequency

Explain confidence intervals, p-values, and alpha level. How would you explain each to a non-technical product manager? How is alpha chosen?

Confidence Interval (CI)

Technical definition: If we repeated the experiment many times, 95% of the constructed confidence intervals would contain the true parameter value.

Common misconception: "There's a 95% probability the true value is in this interval." — WRONG in the frequentist framework. The true value is fixed; the interval is random.

For PM: "We ran the experiment and the new feature increased engagement by 3 minutes, with a 95% CI of [1.5, 4.5]. This means we're fairly confident the real improvement is somewhere between 1.5 and 4.5 minutes."

What determines CI width:

  • Sample size n ↑ → CI narrows
  • Data variability σ ↑ → CI widens
  • Confidence level ↑ (95% → 99%) → CI widens
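The three levers can be checked numerically. A minimal sketch using the normal-approximation interval mean ± z·σ/√n, with illustrative numbers:

```python
import math

def normal_ci(mean, sigma, n, z=1.96):
    """Normal-approximation CI for a mean: mean ± z * sigma / sqrt(n).

    z = 1.96 for 95%, z = 2.576 for 99%. Illustrative values only.
    """
    half_width = z * sigma / math.sqrt(n)
    return (mean - half_width, mean + half_width)

def width(ci):
    lo, hi = ci
    return hi - lo

# Larger n -> narrower CI (4x the sample halves the width)
w_small_n = width(normal_ci(3.0, sigma=10.0, n=100))
w_large_n = width(normal_ci(3.0, sigma=10.0, n=400))

# Larger sigma -> wider CI
w_low_var = width(normal_ci(3.0, sigma=5.0, n=100))
w_high_var = width(normal_ci(3.0, sigma=10.0, n=100))

# Higher confidence level -> wider CI
w_95 = width(normal_ci(3.0, sigma=10.0, n=100, z=1.96))
w_99 = width(normal_ci(3.0, sigma=10.0, n=100, z=2.576))
```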
P-value

Technical definition: The p-value is the probability of observing a result as extreme as (or more extreme than) the current result, assuming the null hypothesis H₀ is true.

Common misconceptions:

  • ≠ probability that H₀ is true
  • ≠ probability the result is due to chance
  • ≠ effect size (small p ≠ large effect; with huge n, tiny effects become significant)

For PM: "The p-value is a surprise score. If the feature had zero real effect, how surprised would we be to see this result? A p-value of 0.03 means there's only a 3% chance we'd see something this dramatic if nothing was actually happening. That's pretty surprising, so we think the effect is real."
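The "surprise score" framing can be made concrete with a small simulation: build the null world where the true effect is zero, and count how often random variation alone produces a lift as large as the one observed. The numbers here (n = 50 users, σ = 10 minutes, observed lift of 3 minutes) are hypothetical:

```python
import random
import statistics

random.seed(0)

# Hypothetical experiment: n = 50 users, per-user engagement change has
# sd = 10 minutes, and we observed an average lift of 3 minutes.
n, sigma, observed_lift = 50, 10.0, 3.0

# The null world: true effect is zero. How big an average lift does
# random variation alone produce across many simulated experiments?
null_lifts = [statistics.fmean(random.gauss(0.0, sigma) for _ in range(n))
              for _ in range(10_000)]

# Two-sided p-value: the fraction of null draws at least this extreme.
p_value = sum(abs(x) >= observed_lift for x in null_lifts) / len(null_lifts)
```

With these made-up numbers the simulated p-value lands near 0.03, matching the "pretty surprising" example in the text.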

Alpha (Significance Level)

Technical definition: The threshold for rejecting H₀. It controls the Type I error rate — the probability of a false positive.

Why 0.05? Historical convention from Ronald Fisher (1925). No deep mathematical reason.

How to actually choose alpha — it's a business decision:

Scenario | Alpha | Reasoning
Drug clinical trial | 0.01 or 0.001 | False positive → patient harm
Google Ads auction change | 0.01 | False positive → revenue loss at scale
UI color change | 0.05 | Low risk, easy rollback
Push notification test | 0.10 | Low cost, don't want to miss opportunities
Genomics (10,000 tests) | 0.05/m (Bonferroni or BH-FDR) | Multiple testing correction

The real framework: balance the cost of a Type I error against the cost of a Type II error, considering:

  1. Is the decision reversible?
  2. What's the cost of a false positive vs missing a true effect?
  3. How much sample size can you afford? (Lower α needs more data; at a fixed n it costs statistical power.)
Interviewer

A PM with no stats background asks you: what are confidence intervals, p-values, and alpha, and why 0.05?

I'd anchor on the duality first because it makes the three concepts click together, then translate each into PM-speak.

The duality: a p-value is the probability of data this extreme or more, assuming the null hypothesis. A confidence interval is the set of parameter values the data doesn't reject. Alpha is the rejection threshold we agreed to in advance. The link: the 95% CI excludes zero if and only if p < 0.05, provided the interval is built by inverting that same test. They're two windows into the same test.
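The duality can be sketched for a two-sided z-test, assuming a normal sampling distribution (the effect sizes and standard errors below are made up):

```python
import math

def z_test_p_and_95ci(effect, se):
    """Two-sided z-test p-value plus the 95% CI built by inverting it.

    Normal CDF via math.erf; 1.959964 is the 0.975 normal quantile.
    """
    z = abs(effect) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    z_crit = 1.959964
    return p, (effect - z_crit * se, effect + z_crit * se)

# Duality: the 95% CI excludes zero exactly when p < 0.05.
p_sig, ci_sig = z_test_p_and_95ci(effect=3.0, se=0.75)    # hypothetical lift
p_null, ci_null = z_test_p_and_95ci(effect=1.0, se=0.75)  # noisier, smaller lift
```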

Then for the PM:

"P-value is a surprise score. If the feature had zero real effect, how surprised should we be by what we saw? 0.03 means a 3% chance of seeing this extreme a result under nothing-going-on. Low enough that we're willing to say the effect is real.

Confidence interval is the range of effects the data is consistent with. We observed a lift of 3 minutes with a 95% CI of [1.5, 4.5]. Zero isn't in there, so the effect is statistically significant at 5%. And the range tells you precision: [1.5, 4.5] is tight; [0.1, 5.9] is noisy.

Alpha is the false-alarm rate we agreed to tolerate. If we set α = 0.05 and there's no real effect, we'll still claim a win 5% of the time due to random variation. That's the cost of doing business."

The thing I'd emphasize: a p-value is not the probability the null is true, not the probability the effect is real, and not the effect size. It's a statement about the data given H₀, nothing more.

Interviewer · Follow-up

Why 0.05, and how would you actually pick alpha for a real experiment?

0.05 is Fisher's 1925 convention. He called it "a convenient level." There's no math behind the specific number.

The principled framework is cost-benefit. The expected cost of a wrong decision is C_FP · α + C_FN · (1 − power). Minimize that over α given the power curve.
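That minimization can be sketched for a one-sided z-test by scanning a grid of α values (the costs, effect, and standard error are hypothetical):

```python
from statistics import NormalDist

N = NormalDist()

def power(alpha, effect, se):
    """Power of a one-sided z-test run at significance level alpha."""
    z_crit = N.inv_cdf(1 - alpha)
    return 1 - N.cdf(z_crit - effect / se)

def expected_cost(alpha, effect, se, cost_fp, cost_fn):
    # C_FP * alpha + C_FN * (1 - power), as in the text
    return cost_fp * alpha + cost_fn * (1 - power(alpha, effect, se))

# Hypothetical experiment: standardized effect effect/se = 2; scan alphas.
grid = [a / 1000 for a in range(1, 200)]

# False positives 10x costlier -> minimizing alpha tightens well below 0.05.
best_fp_heavy = min(grid, key=lambda a: expected_cost(a, 2.0, 1.0, 10.0, 1.0))
# False negatives 10x costlier -> the minimum sits at the top of the grid.
best_fn_heavy = min(grid, key=lambda a: expected_cost(a, 2.0, 1.0, 1.0, 10.0))
```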

The practical rule of thumb I'd give the PM: compare the two costs. If a false positive is 10× worse than a false negative, α should be an order of magnitude tighter — 0.005, not 0.05. Shipping a bad change to the Ads auction costs millions in revenue and is painful to roll back; α = 0.005 or 0.001 is the right bar. If false negatives dominate — a cheap, reversible UI experiment where the main cost is missing a winner — α = 0.10 is fine.

One non-negotiable: α is committed before the experiment runs. Picking it after seeing the data is where p-hacking lives.

Interviewer · Follow-up

The experiment comes back at p = 0.06. The PM wants to launch. What do you do?

I commit to the decision rule we pre-registered.

Pre-registered α was 0.05. p = 0.06 is not significant at that bar. I report null, and I don't redecide α after seeing the data — that's moving the goalposts, and the false-positive rate of "pick α after peeking" is not 5%, it's much worse.

Concretely, what I'd tell the PM: "We committed to 0.05 before launch. The result didn't clear it. If you want a second swing at this feature, it's a new experiment with a new sample — not a re-roll of the same data. Re-using the same users or extending the window until it crosses 0.05 is p-hacking, and at that point the reported 0.05 is a lie about the actual false-positive rate of the procedure."
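How much worse "extend the window until it crosses 0.05" is can be quantified with an A/A simulation: no real effect, but we test after every batch and declare a win at the first |z| > 1.96. A sketch with made-up batch sizes:

```python
import random

random.seed(1)

def peeked_experiment(n_looks=10, n_per_look=100, z_crit=1.96):
    """One A/A run (true effect zero), tested after every batch of users.

    Returns True if ANY interim look crosses |z| > z_crit, i.e. the
    'keep extending the window until it crosses 0.05' procedure wins.
    """
    total, running_sum = 0, 0.0
    for _ in range(n_looks):
        for _ in range(n_per_look):
            running_sum += random.gauss(0.0, 1.0)
            total += 1
        z = running_sum / total ** 0.5   # z-statistic for mean 0, sd 1
        if abs(z) > z_crit:
            return True
    return False

trials = 2000
fpr = sum(peeked_experiment() for _ in range(trials)) / trials
# Nominal alpha is 0.05; with 10 peeks the realized rate is far higher.
```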

This isn't me being rigid for the sake of it. The entire point of a pre-registered α is that it makes the false-positive rate a real guarantee. Once you break pre-registration, you don't have that guarantee any more.

Interviewer · Follow-up

The PM pushes back. "It's basically significant. We should launch."

Two responses.

One: "basically significant" has no statistical meaning. The threshold is binary by design. p = 0.06 and p = 0.04 feel close, but they sit on opposite sides of a decision rule we chose precisely because we needed a rule. Saying "close to 0.05 is fine" is the same as saying "we actually wanted α = 0.07," which nobody pre-registered.

Two, the useful diagnostic: how much power the test had at the pre-specified minimum detectable effect (MDE). (Power recomputed at the observed effect size is a one-to-one function of the p-value and adds nothing, so evaluate it at the effect size we planned for.) This tells us which world we're in.

If power at the MDE was ~50%, the test was underpowered — the true effect might be real but we didn't collect enough data to see it. The right response isn't "launch anyway on p = 0.06"; it's rerun at 2× sample size. A properly powered rerun either clears 0.05 cleanly or confirms the effect is smaller than we thought.

If power at the MDE was ~90% and we still got p = 0.06, the effect really is small — around the MDE boundary or below it. In that case "basically significant" is actively misleading: the data is telling us the effect isn't big enough to clear the bar we set for a reason.

Either way, the decision isn't "launch on borderline p." It's either "rerun with more data" or "the effect is too small to justify ship." Both respect the pre-registered α.
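Power at a planned MDE, and what the 2× rerun buys, can be sketched for a two-arm z-test (the MDE, σ, and sample sizes here are hypothetical):

```python
from statistics import NormalDist

N = NormalDist()

def power_at_mde(mde, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-arm z-test at the planned MDE.

    Ignores the negligible far-tail term. The standard error shrinks like
    sqrt(1/n), so doubling n raises power at the same MDE.
    """
    se = sigma * (2.0 / n_per_arm) ** 0.5
    z_crit = N.inv_cdf(1 - alpha / 2)
    return 1 - N.cdf(z_crit - mde / se)

# Hypothetical numbers: MDE = 1 minute, sigma = 10, 500 users per arm.
p_original = power_at_mde(1.0, 10.0, 500)
p_rerun = power_at_mde(1.0, 10.0, 1000)   # the 2x-sample-size rerun
```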

Interviewer · Follow-up

Last one. You're running 20 A/B tests this quarter. How do you handle multiple comparisons?

Family-wise false positives compound. At α = 0.05 across 20 independent tests, P(at least one false positive) = 1 − (0.95)²⁰ ≈ 0.64 — odds-on you declare at least one winner purely by chance. Uncorrected α = 0.05 at that scale is indefensible.
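The compounding, and what dividing α by m buys back, is one line of arithmetic:

```python
def family_wise_fpr(alpha, m):
    """P(at least one false positive) across m independent true-null tests."""
    return 1 - (1 - alpha) ** m

fwer_uncorrected = family_wise_fpr(0.05, 20)       # ~0.64, as in the text
fwer_bonferroni = family_wise_fpr(0.05 / 20, 20)   # pulled back under 0.05
```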

Two corrections worth knowing.

Bonferroni: test each at α/m = 0.05/20 = 0.0025. Controls the family-wise error rate (the probability of any false positive) at 5%. Simple, conservative, and expensive in statistical power — you need much larger samples to detect real effects.

Benjamini–Hochberg FDR: rank the p-values and reject the k-th smallest if p_(k) ≤ (k/m) · q for target FDR q. Controls the expected proportion of false discoveries among your rejections, not the probability of any one. For q = 0.10 across 20 tests, about 10% of declared winners are expected to be false — far gentler on power than Bonferroni.
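The BH step-up rule is short enough to implement directly. A sketch comparing it with Bonferroni at the same overall level, on made-up p-values:

```python
def benjamini_hochberg(p_values, q=0.10):
    """BH step-up: sort the p-values, find the largest k with
    p_(k) <= (k/m) * q, and reject the k smallest. Returns 0-based indices."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= (rank / m) * q:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical portfolio of 6 p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
bh_rejections = benjamini_hochberg(pvals, q=0.10)
# Bonferroni at the same overall level 0.10, for comparison.
bonf_rejections = [i for i, p in enumerate(pvals) if p <= 0.10 / len(pvals)]
```

On this sketch BH rejects four hypotheses where Bonferroni keeps only two, which is the power difference described above.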

For an A/B testing portfolio I'd default to BH-FDR at q = 0.10. Launching a feature is reversible; tolerating a 10% false-discovery rate to retain power is a good trade. For irreversible or safety-critical decisions — drug trials, auction changes — I'd flip to Bonferroni.


This question has a debrief tool attached. Practice it aloud with a voice-mode AI interviewer, paste the transcript, and get a graded debrief against the reference answer.


How to do a mock interview
  1. Copy this question and paste it as your first message:

     Explain confidence intervals, p-values, and alpha level. How would you explain each to a non-technical product manager? How is alpha chosen?
  2. Switch to voice mode (mic icon in the chat input). Speak through each follow-up — aim for 4–6 turns.
  3. When the interviewer says "thank you, that's all I had", type or speak this:

     Print the full transcript of our conversation as alternating "Interviewer:" and "Candidate:" lines. Include every exchange verbatim. Do not paraphrase, summarize, or skip turns. Do not add commentary.
  4. Copy ChatGPT's response, paste it below, and run the debrief.
