Q026 · Experimentation · A/B Testing · Metrics

A/B Test Metric Selection — Change a Button's Color

Medium · High frequency

An app team wants to change a key button's color from blue to red. They're going to run an A/B test. Pick a primary metric — and the interviewer will keep pushing until the metric is concrete enough to write a power calculation against.

"That's too general. Be more specific."

This is the dialogue pattern the interviewer will use. Vague answers like "CTR" or "engagement" get rejected.

The Trap

Saying "I'd measure click-through rate" is technically correct and almost completely useless. Which click? On which surface? Per what denominator? Attributed how? The interviewer wants to see that you can turn a product question into an unambiguous metric definition.

Step 1 — Clarify Before Committing

Always ask these first:

  1. Which button is it, and what action does it trigger? ("Subscribe," "Add to Cart," "Start Trial," "Open Menu"?) The answer changes everything downstream.
  2. Where does the button live? (Home screen, after signup, buried in settings?) Surface determines base rate and audience.
  3. What's the business goal of the change? Usually one of: increase clicks, increase downstream conversion, accessibility fix, brand consistency.
  4. Is this a one-off visual change or part of a broader redesign? Determines whether guardrails should include design-system consistency metrics.

Without this, any metric is a guess. Say so out loud.

Step 2 — Name the Primary Metric Precisely

A good primary metric has four properties. Apply all four:

  • Single number per user: click / no click, not "clicks per session" (confounded by session count)
  • Tied to the intended user action: the specific button, not any click
  • Measured in a defined window: e.g. "within 1 session after exposure"
  • Measurable with low variance: a binary at the user level, not revenue at the user level

Concrete example (Subscribe button on home screen):

Primary metric: Per-user probability of clicking the "Subscribe" button at least once within 7 days of first exposure to the treatment. Denominator: users assigned to the experiment. Numerator: users who clicked the specific button (UI element ID) in that 7-day window.

Not "CTR." Not "engagement." That — with the window, the denominator, the dedup rule, and the specific UI element.
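A definition that precise is mechanical enough to compute straight from event logs. A minimal sketch in pure Python, with hypothetical record shapes (`assignments` as (user_id, exposure_ts) pairs, `clicks` as (user_id, element_id, ts) triples; the element ID is made up):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)
BUTTON_ID = "subscribe_home_v2"  # hypothetical UI element ID

def primary_metric(assignments, clicks):
    """Share of assigned users who clicked the specific button at least
    once within 7 days of their first exposure."""
    # Denominator: everyone assigned, keyed by first exposure time.
    first_exposure = {}
    for user_id, ts in assignments:
        if user_id not in first_exposure or ts < first_exposure[user_id]:
            first_exposure[user_id] = ts

    # Numerator: set membership dedups to "at least once" per user.
    converted = set()
    for user_id, element_id, ts in clicks:
        if element_id != BUTTON_ID:
            continue  # only the specific button counts
        exposed = first_exposure.get(user_id)
        if exposed is not None and exposed <= ts <= exposed + WINDOW:
            converted.add(user_id)

    return len(converted) / len(first_exposure)

t0 = datetime(2024, 1, 1)
assignments = [("u1", t0), ("u2", t0), ("u3", t0), ("u4", t0)]
clicks = [
    ("u1", "subscribe_home_v2", t0 + timedelta(days=2)),  # counts
    ("u2", "learn_more", t0 + timedelta(days=1)),         # wrong element
    ("u3", "subscribe_home_v2", t0 + timedelta(days=9)),  # outside window
]
print(primary_metric(assignments, clicks))  # 1 of 4 assigned users -> 0.25
```

Note that every assigned user stays in the denominator even with zero clicks, which is exactly what makes this a single number per user.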

Step 3 — Don't Stop at the Click — Go One Step Downstream

A color change can absolutely increase clicks but decrease intent quality. Red might draw the eye but attract accidental or reluctant clicks.

So the primary metric is often the downstream conversion, not the click itself:

  • Button is "Subscribe" → measure subscription completion within N days, not clicks
  • Button is "Add to Cart" → measure purchase completion, not add-to-cart events
  • Button is "Start Free Trial" → measure trial activation + retention at day 7

The click is a leading indicator; the downstream conversion is the thing that pays the bills.

Step 4 — Guardrail Metrics (Always Bring Some)

Even if the primary metric moves in the right direction, you need guardrail metrics to make sure you didn't break something else:

  • Session length / app engagement — did the red button make users bounce?
  • Bounce rate on that screen — did they leave instead of clicking?
  • Click-through on adjacent buttons — did we cannibalize other actions?
  • Crash rate / latency — any rendering regression
  • Negative feedback rate (if there's a feedback button) — subjective complaints
  • Accessibility: color contrast ratio meets WCAG AA for red-on-whatever-background. This is non-negotiable for Google.
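The accessibility gate is mechanical enough to script. A minimal sketch of the WCAG 2.x contrast-ratio formula (relative luminance of the lighter color plus 0.05, over the darker plus 0.05; AA requires 4.5:1 for normal text and 3:1 for large text and UI components):

```python
def _linear(c8):
    # sRGB 8-bit channel -> linear-light value (WCAG 2.x definition)
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    r, g, b = (_linear(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    lum = sorted((relative_luminance(fg), relative_luminance(bg)))
    return (lum[1] + 0.05) / (lum[0] + 0.05)

# Pure red on white comes out around 4.0:1 -- enough for large text (3:1),
# not enough for normal text (4.5:1).
print(round(contrast_ratio((255, 0, 0), (255, 255, 255)), 2))
```

A real audit should check the design system's actual token values in both light and dark themes, not the idealized pure red used here.
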

Step 5 — Things Specific to Color Changes

Color changes have known pitfalls — naming them signals depth:

  • Novelty effect. A new color gets attention in the first week even if it's worse long-term. Run the test for at least 2–3 weeks and report the early vs. late windows separately.
  • Primacy effect. Heavy users, primed on the old color, might dislike the change and click less in week 1, even if the new color is better.
  • Color accessibility. Red/green combinations fail for the ~8% of men with red-green color blindness. Check contrast and run the test across OS dark-mode and light-mode.
  • Cultural meaning. Red signals "danger" / "stop" in Western UX and "auspicious" / "buy" in some Asian markets. If the app ships globally, stratify by region.
  • Cross-product consistency. If this button is red but the same action is blue in the desktop app, that's a design-system violation even if the A/B wins.
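To make the novelty check concrete: compute the lift separately for an early and a late window and compare. A sketch with invented aggregate counts (the divergence pattern is the point, not the numbers):

```python
def relative_lift(conv_t, n_t, conv_c, n_c):
    """Relative lift of treatment over control conversion rate."""
    return (conv_t / n_t) / (conv_c / n_c) - 1

# Hypothetical (conversions, users) per arm, by analysis window.
early = {"treatment": (620, 10_000), "control": (500, 10_000)}  # days 1-7
late  = {"treatment": (515, 10_000), "control": (500, 10_000)}  # days 15-21

print(round(relative_lift(*early["treatment"], *early["control"]), 2))  # 0.24
print(round(relative_lift(*late["treatment"], *late["control"]), 2))    # 0.03
# A +24% early lift decaying to +3% is the novelty signature: report the
# late window as the durable effect (or run a long-term holdback).
```
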

Step 6 — Sketch the Experiment, Not Just the Metric

A strong answer ends with one sentence that proves you'd actually run this:

"I'd randomize at the user level, 50/50, for at least 2 weeks. Primary is 7-day Subscribe completion rate; guardrails are session length, adjacent-button CTR, and WCAG contrast. At a ~5% baseline, detecting a 2% relative lift at 80% power takes roughly 750k users per arm, so I'd check available traffic first and, if it falls short, accept a larger minimum detectable effect or run longer." (A quick minimum-detectable-effect calc back-solves from the business threshold to required n.)

Numbers are illustrative — the point is demonstrating you'd close the loop end-to-end.
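The back-solve itself is a few lines with the standard two-proportion normal approximation (stdlib only; z-values hard-coded for two-sided α = 0.05 and 80% power):

```python
from math import sqrt

Z_ALPHA = 1.959964  # two-sided alpha = 0.05
Z_BETA = 0.841621   # 80% power

def n_per_arm(p_base, rel_lift):
    """Users per arm to detect a relative lift in a binary conversion
    metric (two-proportion z-test, normal approximation)."""
    p1 = p_base
    p2 = p_base * (1 + rel_lift)
    p_bar = (p1 + p2) / 2
    numerator = (Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
                 + Z_BETA * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

print(f"{n_per_arm(0.05, 0.02):,.0f} per arm for a 2% relative lift")
print(f"{n_per_arm(0.05, 0.10):,.0f} per arm for a 10% relative lift")
```

At a 5% baseline this comes out near 750k users per arm for a 2% relative lift but only around 31k for a 10% lift, which is why the minimum detectable effect usually gets negotiated against available traffic rather than decreed.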

Interviewer

The team wants to change this button's color from blue to red. Design an A/B test. What's your metric?

Candidate

Before I pick a metric, I need to know what the button does. Is it a "Subscribe" button? "Add to Cart"? "Open Menu"? That changes everything.

Interviewer

Let's say it's the "Subscribe" button on the home screen.

Candidate

Got it. My primary metric is the per-user probability of completing a subscription within 7 days of first exposure to the treatment. Not click rate on the button — the downstream conversion.

Interviewer

Why not just use click-through rate? That's more sensitive.

Candidate

It is, but color changes are exactly the case where click rate and business value can diverge. Red is louder than blue — it might pull in more clicks but lower-intent clicks. Someone who meant to tap "Learn More" accidentally hits the red Subscribe button, bounces off the subscription page, and doesn't convert. Click rate goes up, subscriptions don't — or even drop.

So I want the primary metric to be as close as possible to the actual outcome the business cares about, which is completed subscriptions. I'd still track click rate as a secondary metric — if clicks go up but conversions don't, that's a diagnostic telling me the extra clicks were low-quality.

Interviewer

"7-day window." Why 7 days specifically?

Candidate

Two reasons. First, for subscriptions the decision isn't always made in the session where the user first sees the button — people compare plans, close the app, come back. A session-level conversion metric misses that. Second, 7 days is long enough to capture most of the decision latency without running into major attribution-window issues. If the business's funnel data shows most conversions happen within 72 hours, I'd shorten it. If they have a free-trial gate where conversion is day 7+, I'd lengthen it.

The specific window is a judgment call, but the key point is it needs to be fixed in advance and the same for both arms.

Interviewer

What guardrails would you track?

Candidate

Four categories.

Engagement guardrails: session length, screens-per-session, bounce rate on the home screen. If users find the red version jarring, they'll exit faster and this catches it.

Cannibalization: click-through rate on other buttons on the same screen. A louder button can steal clicks from less visually dominant actions. If we drive Subscribe up by crushing CTR on "Learn More," we might be hurting informed conversion long-term.

Quality / technical: app crash rate, screen-render latency — in case the color change ships a rendering bug on some device.

Accessibility: WCAG contrast ratio on the new color. This is a hard gate for Google — a shipped change that fails contrast is a P0 bug regardless of A/B results.

Interviewer

Good. Anything specific about color changes I should be thinking about?

Candidate

A few things.

Novelty effect is the big one. In the first week, any visual change draws attention just because it's different. If I read out results at day 7, I might be measuring novelty, not the underlying effect. I'd run for at least 2–3 weeks and explicitly compare early-window versus late-window treatment effect. If they diverge, the late window is the real answer.

Color blindness: roughly 8% of men have red-green color deficiency. If the button is red on a green-ish background, those users can't see it at all. I'd check the color palette against simulated color-blind views before launching.

Dark mode vs. light mode: red that pops on white often looks muddy on black. I'd want the test to be stratified by OS color mode, or at least checked.

Culturally global apps: red means "buy / auspicious" in parts of Asia and "danger / stop" in Western UX conventions. If this is a global app, I'd want to see the treatment effect broken down by region — it can literally have opposite signs.

Interviewer

You have 30 seconds. Commit to an experiment design.

Candidate

User-level randomization, 50/50. Two to three weeks minimum. Primary metric: 7-day Subscribe completion rate at the user level. Guardrails: session length, bounce rate, adjacent-button CTR, contrast ratio. Report treatment effect stratified by region, platform, and early vs. late window. Ship only if primary is positive, no guardrail regresses significantly, and contrast passes WCAG AA.

Interviewer

Good.

This question has a debrief tool attached. Practice it aloud with a voice-mode AI interviewer, paste the transcript, and get a graded debrief against the reference answer.


How to do a mock interview
  1. Copy this question and paste it as your first message:

     An app team wants to change a key button's color from blue to red. They're going to run an A/B test. **Pick a primary metric** — and the interviewer will keep pushing until the metric is concrete enough to write a power calculation against. > "That's too general. Be more specific." This is the dialogue pattern the interviewer will use. Vague answers like "CTR" or "engagement" get rejected.

  2. Switch to voice mode (mic icon in the chat input). Speak through each follow-up — aim for 4–6 turns.

  3. When the interviewer says "thank you, that's all I had", type or speak this:

     Print the full transcript of our conversation as alternating "Interviewer:" and "Candidate:" lines. Include every exchange verbatim. Do not paraphrase, summarize, or skip turns. Do not add commentary.

  4. Copy ChatGPT's response, paste it below, and run the debrief.
