Mean vs Median — Which Is Better? Force a Choice.
Someone asks you: "Is the mean or the median a better summary statistic?" They don't accept "it depends" — they want you to force a choice and defend it.
The interviewer will probe on:
- What do they measure? (What's the definition and interpretation?)
- Statistical efficiency — which has lower variance?
- Computational efficiency — how expensive to compute?
- Implementation: write median in Python without using np.median.
▶What They Measure
Both are "center" but in different senses.
Mean: the value that minimizes the sum of squared deviations.
Equivalently, the balance point of the distribution — the center of mass.
Median: the value that minimizes the sum of absolute deviations.
Equivalently, the 50th percentile — splits the data into two equal halves.
The difference traces back to L2 vs L1 loss. Mean penalizes outliers quadratically (they pull it hard); median treats them equally (only the middle value matters).
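The L2-vs-L1 connection can be checked numerically: minimize each loss over a grid of candidate centers. A toy sketch (the data and grid here are arbitrary choices for illustration):

```python
# Toy check: the mean minimizes squared error, the median minimizes
# absolute error.
data = [1, 2, 3, 4, 1000]

def l2_loss(c):
    return sum((x - c) ** 2 for x in data)

def l1_loss(c):
    return sum(abs(x - c) for x in data)

# Search candidate centers from 0.0 to 1000.9 in steps of 0.1.
candidates = [i / 10 for i in range(10010)]
best_l2 = min(candidates, key=l2_loss)
best_l1 = min(candidates, key=l1_loss)

print(best_l2)  # 202.0, the mean
print(best_l1)  # 3.0, the median
```

The grid minimizer of the squared loss lands on the mean (202), and the minimizer of the absolute loss lands on the median (3), matching the definitions above.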
▶Statistical Efficiency
For i.i.d. samples from a normal distribution N(μ, σ²):
- Var(mean) = σ²/n
- Var(median) ≈ πσ²/(2n)
So for normal data, the mean is about 57% more efficient (the ratio is π/2 ≈ 1.57) — you'd need 1.57× the sample size with the median to get the same precision.
But for heavy-tailed distributions (Cauchy, Student-t with low df, real-world income data, latency data), the mean can have infinite variance while the median stays finite and stable.
Concretely:
- For Cauchy: the sample mean is itself Cauchy-distributed no matter how large n gets (it never concentrates), while the sample median is consistent with finite asymptotic variance. In effect, the median is infinitely more efficient.
- For the t-distribution with 2 degrees of freedom: the variance is infinite, so the sample mean's variance is infinite too; the median remains well-behaved.
- For the log-normal: the sample mean is unbiased for the population mean but high-variance; the median is more stable.
Rule of thumb: light tails → mean is more efficient. Heavy tails or outliers → median is more efficient AND more robust.
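The rule of thumb can be checked with a small simulation: estimate the sampling variance of each statistic across many repeated samples, once for normal data and once for Cauchy data. A sketch using only the standard library (the helper name `sampling_variance` and the constants are my own choices):

```python
import math
import random
import statistics

random.seed(0)

def sampling_variance(estimator, draw, n=100, reps=2000):
    """Empirical variance of an estimator across repeated samples of size n."""
    estimates = [estimator([draw() for _ in range(n)]) for _ in range(reps)]
    return statistics.pvariance(estimates)

# Normal(0, 1): the mean should be about pi/2 ~ 1.57x more efficient.
var_mean = sampling_variance(statistics.fmean, lambda: random.gauss(0, 1))
var_median = sampling_variance(statistics.median, lambda: random.gauss(0, 1))
print(var_median / var_mean)  # roughly 1.57

# Standard Cauchy via the inverse-CDF trick: tan(pi * (U - 0.5)).
cauchy = lambda: math.tan(math.pi * (random.random() - 0.5))
var_mean_c = sampling_variance(statistics.fmean, cauchy)
var_median_c = sampling_variance(statistics.median, cauchy)
print(var_mean_c, var_median_c)  # mean: huge and unstable; median: small
```

On normal data the ratio hovers near π/2; on Cauchy data the mean's sampling variance explodes while the median's stays small, which is the efficiency flip described above.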
▶Computational Efficiency
Mean: O(n) — one pass: add up the values, divide by the count.
Median:
- Naive: sort the array, take the middle element. O(n log n).
- Better: quickselect (partition-based selection) gives O(n) expected time, O(n²) worst case.
- Linear worst-case: median-of-medians algorithm — guaranteed O(n) but with a large constant. Rarely used in practice.
In distributed / streaming settings:
- Mean: trivially parallel — compute partial sums, combine.
- Median: hard to parallelize or stream. Approximate quantile algorithms (t-digest, KLL sketch) exist but are approximate.
Rule of thumb: mean is dramatically cheaper at scale.
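A sketch of why the mean parallelizes trivially: each shard reports only a (partial sum, count) pair, and the combiner adds them. The shard values here are made up for illustration.

```python
# "Map" step: each shard computes its partial sum and count in one pass.
shards = [[3, 5, 7], [10, 2], [6, 6, 6, 6]]
partials = [(sum(s), len(s)) for s in shards]

# "Reduce" step: combine two numbers per shard, regardless of shard size.
total = sum(p[0] for p in partials)
count = sum(p[1] for p in partials)

# Same result as pooling all the data in one place.
pooled = [x for s in shards for x in s]
print(total / count == sum(pooled) / len(pooled))  # True
```

No analogous exact combine exists for the median: knowing each shard's median (even with its count) is not enough to recover the global median, which is why approximate sketches are used instead.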
▶Robustness to Outliers
Mean: a single outlier can pull the mean arbitrarily far. If you have [1, 2, 3, 4, 1000], mean is 202 — very unrepresentative.
Median: outliers in the tails don't move the median. [1, 2, 3, 4, 1000] has median 3. That outlying value could be arbitrarily large and the median wouldn't change.
Formally: mean has breakdown point 0 (one contaminated value ruins it). Median has breakdown point 0.5 (up to half the data can be arbitrary before median moves to an uninformative position).
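The breakdown behavior is easy to demonstrate with the standard library's statistics module, growing one contaminated value and watching each summary:

```python
import statistics

# Grow one contaminated value and watch what each summary does.
base = [1, 2, 3, 4]
means, medians = [], []
for outlier in [5, 1000, 10**9]:
    data = base + [outlier]
    means.append(statistics.fmean(data))
    medians.append(statistics.median(data))

print(medians)  # [3, 3, 3]: unmoved no matter how extreme the outlier
print(means)    # grows without bound as the outlier grows
```

One contaminated value drags the mean anywhere, while the median sits at 3 throughout — the breakdown points 0 and 0.5 in action.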
▶Forcing the Choice
For "which is better as a one-size-fits-all summary":
Median.
Reasoning:
- The dominant failure mode of mean is being pulled by outliers, which is common in real data.
- The dominant failure mode of median is slight inefficiency (needs 57% more data on normal) — small cost.
- Real-world data is rarely clean normal; most distributions have heavy tails or outliers.
- Median is almost never misleading; mean frequently is.
Counter-arguments (for mean):
- Mean has linearity: E[aX + bY] = aE[X] + bE[Y]. Median doesn't (median(X + Y) ≠ median(X) + median(Y) in general).
- The sample mean is unbiased by construction; the sample median can be biased in finite samples for skewed distributions, even though it is consistent.
- Under homogeneity (truly normal clean data), mean is uniformly better.
- Computational ease at scale.
So for predictive modeling, economic aggregates, or physical measurements with known distributions, the mean is the right answer. For summarizing unknown/messy distributions to a stakeholder, the median is the safer default.
The "force a choice" framing is about what you'd pick as a default. I'd pick median.
▶Implementing Median in Python
def median(arr):
    """
    Compute the median of a list of numbers without numpy.
    Uses sort-and-pick approach: O(n log n).
    """
    n = len(arr)
    if n == 0:
        raise ValueError("Cannot compute median of empty array")
    sorted_arr = sorted(arr)
    mid = n // 2
    if n % 2 == 1:
        return sorted_arr[mid]
    else:
        # Even length: average the two middle values
        return (sorted_arr[mid - 1] + sorted_arr[mid]) / 2

▶Implementing Median with Quickselect (O(n) expected)
import random

def quickselect(arr, k):
    """Find the k-th smallest element (0-indexed) in arr."""
    if len(arr) == 1:
        return arr[0]
    pivot = random.choice(arr)
    lows = [x for x in arr if x < pivot]
    highs = [x for x in arr if x > pivot]
    pivots = [x for x in arr if x == pivot]
    if k < len(lows):
        return quickselect(lows, k)
    elif k < len(lows) + len(pivots):
        return pivot
    else:
        return quickselect(highs, k - len(lows) - len(pivots))

def median_quickselect(arr):
    n = len(arr)
    if n % 2 == 1:
        return quickselect(arr, n // 2)
    else:
        return (quickselect(arr, n // 2 - 1) + quickselect(arr, n // 2)) / 2

This runs in expected O(n) time, which matters for very large arrays.
▶Common Mistakes
- "Median is always better because of outliers." Not always. For clean normal data, mean is strictly more efficient. "Always" claims lose nuance.
- Forgetting to handle even-length arrays. Average of the two middle values, not just taking one.
- Sorting with a nested loop (O(n²)). Use Python's built-in sorted or manual quickselect for better complexity.
- Claiming the median is O(1) memory like the mean. A running mean needs one accumulator; an exact median requires all the data to be seen — it's inherently memory-heavy.
Is mean or median better? Pick one.
I'd pick median, if I have to choose as a default. But let me unpack it.
What they measure: mean minimizes sum of squared deviations (L2); median minimizes sum of absolute deviations (L1). This is the root difference. Mean is the balance point, pulled by outliers; median is the 50th percentile, robust to tails.
Statistical efficiency: for clean normal data, the mean is about 57% more efficient. Specifically, Var(mean) = σ²/n, while Var(median) ≈ πσ²/(2n). So the median needs about 1.57× the data for the same precision.
But: for heavy-tailed or outlier-prone distributions, the story flips. For Cauchy, mean has infinite variance; median is stable. For real data — incomes, latencies, user engagement — distributions are usually heavy-tailed, and median becomes more informative.
Computational efficiency: mean is O(n); median is O(n log n) with a sort, or O(n) expected with quickselect. Mean is trivially parallel; median needs all the data. At scale, mean wins on compute.
Robustness: median has breakdown point 0.5 (up to half the data can be arbitrary before median breaks). Mean has breakdown point 0 — one extreme value is enough.
Given those tradeoffs, I'd default to median because the failure mode of mean — being misled by outliers — is more common and more dangerous in applied work than the failure mode of median — slightly lower efficiency. For clean, known distributions, mean is better; for unknown/messy data, median is safer.
Write median in code.
def median(arr):
    n = len(arr)
    if n == 0:
        raise ValueError("Empty array")
    sorted_arr = sorted(arr)
    mid = n // 2
    if n % 2 == 1:
        return sorted_arr[mid]
    else:
        return (sorted_arr[mid - 1] + sorted_arr[mid]) / 2

Handles both even and odd lengths. For large arrays, I'd use quickselect for expected O(n) time instead of sorting.
What's the median of [1, 2, 3, 4, 1000]?
The median is 3. The 1000 is an outlier; the median shrugs it off. The mean would be 202, which is way off and misleading.
When would you prefer mean despite the outliers?
Several cases.
Predictive modeling. MSE loss in regression corresponds to predicting the conditional mean, not median. For optimizing prediction accuracy under L2 loss, the mean is correct.
Aggregation in accounting. If you need total revenue to match per-user mean × count, you need the mean. Median doesn't satisfy that additive property.
Physical measurements with known distribution. Measuring a physical constant with Gaussian errors — mean is optimal (maximum likelihood estimator).
When the distribution is known to be symmetric and light-tailed. Mean and median coincide, but mean is more efficient.
Streaming / distributed computing. Mean can be computed from partial sums and enjoys central-limit-theorem guarantees; median requires the full dataset or approximate sketches.
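The streaming point can be made concrete with a one-accumulator running mean, the standard incremental-update sketch:

```python
# The mean as a streaming statistic: each update touches O(1) state.
# An exact streaming median has no equivalent trick.
def running_mean(stream):
    mean, count = 0.0, 0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update
    return mean

print(running_mean([1, 2, 3, 4, 1000]))  # 202.0
```

The incremental form `mean += (x - mean) / count` is algebraically the same as summing and dividing, but avoids accumulating a large raw sum.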
So it's situational. But if I have to pick one without context — "what's a representative value of this messy data" — median is the safer pick for reporting, though mean is the right pick for building models.
Good.
This question has a debrief tool attached. Practice it aloud with a voice-mode AI interviewer, paste the transcript, and get a graded debrief against the reference answer.
How to do a mock interview
1. Copy this question and paste it as your first message:

Someone asks you: "Is the **mean** or the **median** a better summary statistic?" They don't accept "it depends" — they want you to force a choice and defend it. The interviewer will probe on: 1. **What do they measure?** (What's the definition and interpretation?) 2. **Statistical efficiency** — which has lower variance? 3. **Computational efficiency** — how expensive to compute? 4. **Implementation**: write median in Python without using `np.median`.

2. Switch to voice mode (mic icon in the chat input). Speak through each follow-up — aim for 4–6 turns.
3. When the interviewer says "thank you, that's all I had", type or speak this:

Print the full transcript of our conversation as alternating "Interviewer:" and "Candidate:" lines. Include every exchange verbatim. Do not paraphrase, summarize, or skip turns. Do not add commentary.

4. Copy ChatGPT's response, paste it below, and run the debrief.