If you’re new-ish to UX, or prepping for interviews, you’ve probably felt this: you run a test in a UX research platform, and suddenly a dashboard showers you with percentages, misclick counts, time-on-task, and a big composite “usability score.” Your gut says, “Ah, so usability testing is quantitative.” Then a senior researcher looks at your plan and says, “Nope, usability testing is qualitative; five users is enough.” It’s confusing, and the internet doesn’t help much because you can find smart people arguing both sides.
TL;DR Usability testing can be qualitative or quantitative. Use qual to find and understand problems (small, varied samples; evidence via clips/quotes). Use quant to measure and compare performance (larger samples; predefined success rules; report completion/time with confidence intervals or against benchmarks).
Here’s the reality you can say out loud in interviews and in stakeholder rooms without sweating: usability testing can be qualitative or quantitative. It depends on your goal, study design, and sample size, not on whether your tool shows numbers. This article will give you a practical, defensible way to label your study, choose the right design, and explain your choices with credibility. You’ll come away with a method map, decision steps, and ready-to-paste wording for your report slides.
Definitions that actually matter
Usability testing is an umbrella. It can be formative (used mid-design to find problems and fix them) or summative (used to measure performance against a benchmark or compare versions). Qualitative work discovers problems and explains why they happen; quantitative work measures how much something happens and supports comparisons. Neither is “better.” They serve different decisions, and strong teams use both across a product’s life cycle.
When people say “qualitative usability testing,” they mean small, purposefully varied samples with rich observation (clips/quotes), aiming to find and understand issues. When they say “quantitative usability testing,” they mean larger samples with predefined success rules and confidence intervals around metrics like completion rate and time, often compared against a prior release or a competitor. A good mental shortcut: Qual = discover & explain; Quant = measure & compare.
Several abbreviations appear throughout this article: SUS, SEQ, UEQ, and SUPR-Q. Here’s what each one means:
SUS (System Usability Scale): SUS is a 10-item questionnaire that measures overall perceived usability of a system or product. Respondents rate agreement on a 5-point scale; items are scored, summed, and multiplied by 2.5 to yield a 0-100 score. It’s quick, technology-agnostic, and reliable even with modest sample sizes. SUS is best for benchmarking trends or comparing versions, not for pinpointing specific design problems; many teams treat ~68 as a rough “average” reference point.
SEQ (Single Ease Question): SEQ is a single post-task question-“Overall, how easy or difficult was this task?”-typically answered on a 7-point scale. You compute a mean (and ideally a confidence interval) across participants for each task. It’s highly sensitive to differences in task difficulty and complements behavioral metrics like completion and time. SEQ is fast to administer and great for comparing tasks or variants, but, by itself, it doesn’t explain why a task was hard.
UEQ (User Experience Questionnaire): UEQ assesses both pragmatic (e.g., efficiency, dependability) and hedonic (e.g., stimulation, novelty) qualities using 26 semantic-differential items. Scores are calculated for six scales (Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation, Novelty) on a continuum typically ranging from −3 to +3. It comes with public benchmarks so you can see how your product compares to others. UEQ is especially useful when you want a broader UX profile beyond usability alone.
SUPR-Q (Standardized User Experience Percentile Rank Questionnaire): SUPR-Q is an 8-item instrument focused on website UX, covering usability, trust/credibility, appearance, and loyalty (often paired with an NPS-style loyalty indicator). Results are reported as percentile ranks against a large industry benchmark, making it easy to see where a site stands relative to others. Subscale scores help you target design investments (e.g., credibility vs. aesthetics). SUPR-Q is widely used for web evaluations and competitive comparisons where an external norm matters.
Means with CIs (Confidence Intervals): “Mean” is the arithmetic average of a metric across participants (e.g., average SEQ score). A confidence interval wraps that mean with an uncertainty range (e.g., mean 5.6 with a 90% CI of 5.1-6.1), indicating where the true population mean likely lies given your sample and its variability. Reporting means with CIs prevents false precision, makes comparisons more honest (do intervals overlap?), and helps stakeholders judge reliability at a glance. Note that for skewed metrics like time-on-task, medians with CIs are usually more appropriate than means.
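If you like to see the mechanics, here’s a minimal Python sketch of two ideas from the definitions above: standard SUS scoring (odd items contribute rating − 1, even items contribute 5 − rating, sum × 2.5) and a mean with a t-based 90% confidence interval. The participant responses and function names are invented purely for illustration.

```python
# Illustrative sketch: score SUS responses and report the mean with a 90% CI.
from math import sqrt
from statistics import mean, stdev
from scipy import stats  # used only for the t critical value

def sus_score(answers):
    """Convert ten 1-5 ratings (in item order) into a 0-100 SUS score."""
    assert len(answers) == 10
    odd = sum(a - 1 for a in answers[0::2])   # items 1,3,5,7,9: rating - 1
    even = sum(5 - a for a in answers[1::2])  # items 2,4,6,8,10: 5 - rating
    return (odd + even) * 2.5

def mean_with_ci(values, confidence=0.90):
    """Mean and a t-based confidence interval for a small sample."""
    n = len(values)
    m, s = mean(values), stdev(values)
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    margin = t_crit * s / sqrt(n)
    return m, (m - margin, m + margin)

# Made-up responses from 5 participants, purely illustrative
responses = [
    [4, 2, 4, 1, 5, 2, 4, 2, 4, 1],
    [5, 1, 4, 2, 4, 2, 5, 1, 4, 2],
    [3, 2, 3, 3, 4, 2, 3, 2, 3, 2],
    [4, 1, 5, 2, 4, 1, 4, 2, 5, 1],
    [2, 3, 3, 2, 3, 3, 3, 3, 3, 3],
]
scores = [sus_score(r) for r in responses]
m, (low, high) = mean_with_ci(scores)
print(f"SUS mean: {m:.1f} (90% CI: {low:.1f}-{high:.1f})")
```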
The Confusion Between Qual and Quant Usability Testing
Modern platforms do a great job visualizing behavior: success percentages, time distributions, heatmaps, even a tidy 0-100 “usability score.” Those numbers are descriptive: they summarize what your sample did. They only become inferential (e.g., “we’re 90% confident real-world users will complete Task 3 at ≥80%”) when your study was designed for estimation or comparison and your sample is large enough to compute confidence intervals sensibly.
Without that design and sample size, a percentage is just a clue, not a claim. (If you’re curious why researchers keep mentioning intervals, MeasuringU has shown over and over that a completion rate without its interval is easy to misread; the interval tells you the plausible range in the real population.)
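To make the interval idea concrete, here’s a minimal Python sketch using an adjusted-Wald interval, one method often recommended for small-sample completion rates; other interval methods give slightly different ranges. The “8 of 10” counts and the function name are illustrative.

```python
# Illustrative sketch: adjusted-Wald confidence interval for a completion rate,
# an interval often recommended for small-sample binary data like task success.
from math import sqrt
from statistics import NormalDist

def completion_ci(successes, n, confidence=0.90):
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # ~1.645 for 90%
    n_adj = n + z ** 2                              # add z^2 "pseudo trials"
    p_adj = (successes + z ** 2 / 2) / n_adj        # ...and z^2/2 "pseudo successes"
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# 8 of 10 participants completed the task
low, high = completion_ci(8, 10)
print(f"Observed 80%; plausible population range: {low:.0%}-{high:.0%}")
```

Notice how wide the range is: an observed 80% at n=10 is compatible with anything from roughly the low 50s to the low 90s, which is exactly why a bare percentage from a small study is a clue, not a claim.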
And the “five users” line you’ve heard? It’s a qualitative discovery heuristic: tiny rounds find many common issues quickly, especially when you iterate (5 => fix => 5 => fix…). It was never meant to justify small-N quant claims. Use it to find problems; don’t use it to declare benchmarks.
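For the curious, that heuristic rests on a simple cumulative-discovery model: if each participant has probability p of hitting a given problem, n participants are expected to surface about 1 − (1 − p)^n of such problems. The sketch below uses the oft-cited average of p = 0.31; real discoverability varies widely by product and task, so treat it as a rationale for small iterative rounds, not a guarantee.

```python
# Illustrative sketch of the cumulative-discovery model behind "five users":
# expected share of problems found = 1 - (1 - p) ** n, where p is the average
# probability that a single participant encounters a given problem.
def share_found(n_users, p=0.31):
    # p = 0.31 is the oft-cited average from classic studies; it varies by product
    return 1 - (1 - p) ** n_users

for n in (1, 3, 5, 8, 15):
    print(f"{n:>2} users -> ~{share_found(n):.0%} of problems (assuming p = 0.31)")
# Under this assumption, five users surface roughly 85% of such problems,
# which is why small iterative rounds work for discovery, not for estimation.
```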
What’s Qualitative, What’s Quantitative, and What Depends
Qualitative methods (discover & understand “why”)
| Method | Primary goal | Typical sample (n) | Study setup (notes) | Outputs you’ll report | Great when? | Watch-outs |
| --- | --- | --- | --- | --- | --- | --- |
| Moderated, task-based usability testing | Find issues; reveal mental models & friction causes | 5-8 per round (per audience) | Realistic scenarios; neutral prompts; probe after the attempt | Issue list with severity; clips/quotes; recommended fixes | Design is evolving; need to diagnose why | Leading questions; turning session into a demo; treating quotes as votes |
| Contextual inquiry / field study / shadowing | See real workflows, constraints, tacit knowledge | 4-10 participants/sites | Observe first, interrupt later; capture environment & artifacts | Narrative of workflow; constraints; opportunity areas | Complex B2B/pro tools; policy or environment heavy work | Hawthorne effect; access/logistics; over-interpretation from one site |
| Semi-structured interviews | Language, decision criteria, anxieties, unmet needs | 6-20 | Episodic prompts (“last time you…”); short, neutral probes | Themes with verbatims; vocabulary to use/avoid | Early discovery; reframing problem statements | Asking hypotheticals; overstating “would use” claims |
| Diary study (qual variant) | Longitudinal experience; emotion arcs; breakdowns | 10-25 | Prompted + free entries; mid-study check-ins | Thematic codes; journey stories; design implications | Behaviors unfold over days/weeks | Attrition; shallow entries without prompts |
| Participatory design / co-creation | Surface values & trade-offs; co-imagine options | 5-12 per workshop | Equitable activities; multiple rounds; reflection | Principles; co-created artifacts; criteria to evaluate options | Alignment; exploring futures; sensitive domains | Dominant voices; “pretty outputs” without decisions |
| Heuristic evaluation / cognitive walkthrough | Expert review for usability issues | 2-4 evaluators | Use explicit heuristics; walk key tasks | Issue list with rationale & severity | Early catch of obvious issues; complements testing | Bias to evaluator’s experience; not a substitute for users |
| Session replays / support-call reviews (sampled) | Reconstruct failure paths; build evidence | 10-30 targeted cases | Sample by outcome; annotate moments | Clips & narratives explaining breakdowns | Explaining analytics drops; on-call follow-ups | Cherry-picking; privacy constraints |
Quantitative methods (measure & compare “how much”)
| Method | Primary goal | Typical sample (n) | Study setup (notes) | Outputs you’ll report | Great when? | Watch-outs |
| --- | --- | --- | --- | --- | --- | --- |
| Summative usability benchmark | Estimate completion/time/error; compare to benchmark/competitor | 20-40+ per variant | Standardized tasks; predefined success rules; neutral instructions | Completion % & time (with confidence intervals); error rate; SEQ; SUS vs benchmark | Milestones; go/no-go; trend tracking | Underpowered samples; moving goalposts post-hoc |
| Unmoderated large-N usability test | Estimation/comparison at scale | 20-50+ | Clear task wording; controlled environment | Same as above; plus screen-level aggregates | Fast turn on stable flows | Mixed devices/contexts that add noise |
| A/B or multivariate experiment (live) | Causal impact on conversion/completion | Traffic-driven | Randomized exposure; guardrails | Lift/delta with statistical tests | Mature products; high traffic | Confounds (seasonality, audience shifts) |
| Tree testing (IA) | Findability; path quality | 30-50+ | Text-only tree; task scenarios | Success paths, time, backtracks | IA changes; nav audits | Over-generalizing to full UI without follow-ups |
| First-click testing | First choice accuracy | 30-50+ | Single prompt; first click captured | % correct first click; time to click | Early layout/comms checks | Reading too much into small lifts |
| Standardized questionnaires (SUS, SEQ, UEQ, SUPR-Q) | Perceived usability or ease | 20-40+ (context dependent) | Administer consistently post-task/test | Means with CIs; compare to norms/benchmarks | Summaries executives understand | Treating one score as the whole story |
| Analytics / telemetry | Real-world behavior at scale | Population | Instrumentation; events defined | Funnels; drop-off; error rates | After launch; KPI tracking | Attribution ambiguity; missing context |
| Eye-tracking (metric-driven) | Visual attention metrics | 20-30+ | Calibrated hardware; defined AOIs | Fixations; dwell; scan paths (stats) | Content/layout studies | Lab artificiality; small n masquerading as quant |
Mixed – “it depends” (choose the mode you need)
| Method / artifact | When it’s Qual | When it’s Quant | Notes |
| --- | --- | --- | --- |
| Usability testing (general) | Small rounds to find & explain issues; rich probing | Large, standardized study with CIs & benchmarks | Label follows goal + sample + analysis, not the tool |
| Surveys | Primarily open-ended for explanation | Scaled Likert/standard scales with analysis | You can mix: small open-ends inside a quant survey |
| Diary studies | Thematic coding of entries | Instrumented metrics across many participants | Often run as qual first, quant later |
| Eye-tracking | Exploratory traces on a few users | Metric analysis across many | Pair with task outcomes for meaning |
| Card sorting | Open sort to learn language | Closed sort/tree tests at scale | Validate IA with tree tests afterward |
| Session replays | Sampled stories to diagnose | Systematic coding/counts across many | Be explicit about sampling rules |
Which situations call for Quant vs Qual
You’ll often make the right call simply by looking at the organization and the decision on the table.
When quantitative usability testing is expected (or required)
- Large consumer products with high traffic. Leadership wants defensible KPIs, trend lines, and “did v2 actually improve Task 3?” numbers with uncertainty bounds (not just a vibe). That means summative benchmarks and, later, live A/Bs.
- Enterprises and regulated industries (finance, health, government). Procurement and compliance teams like acceptance criteria they can read: “≥80% completion at 90% confidence,” “SUS ≥ X.” A summative evaluation is the right artifact here.
- Comparative studies (competitor vs. current; version vs. version). If the decision is “switch approach A=>B,” you’ll want properly powered comparisons-not anecdotes.
- High-stakes launches (checkout, security, consent/permissions). Failure is costly, and numbers help de-risk the go/no-go.
- Global rollouts where you need to confirm parity across locales/devices-not just “we saw two people struggle in German.”
When qualitative usability testing is the right tool
- Early-stage startups or 0=>1 features. The interface is moving every week. You need to find problems fast, fix them, and retest.
- Complex workflows / specialist tools (B2B/pro). Context and tacit knowledge matter more than a single score.
- Sensitive interactions (onboarding trust, error recovery, privacy comprehension). Here, language and emotion do the heavy lifting-watching people work and probing their thinking beats a spreadsheet.
- Low-traffic products (recruitment limits) or tight timelines (learn tomorrow, change next week).
- Diagnosing metrics (analytics show a drop; you use qual to explain why and what to change).
Most teams end up in a blended rhythm: qual => iterate => qual, then quant at milestones to confirm progress and defend the roadmap.
Qualitative Usability Testing (Formative)
Think of qualitative usability testing as a series of short conversations with reality. Your goal is to uncover problems and the causes behind them. Numbers can appear on your notes (“6/8 completed”), but your currency is evidence: what you saw and heard, the pattern in those moments, and the design change they imply.
Design it well. Write realistic scenarios that don’t telegraph the correct path. Define what counts as success, indirect success, and failure: not for statistical claims, but so you can report consistently and avoid moving the goalposts. Keep your prompts short. If you use think-aloud, pause when needed; if you observe silently, be ready with neutral probes (“What were you expecting here?”).
Recruit like a pro. Target recent behavior and context, not job titles. If you’re testing a claim submission flow, people who recently submitted a claim will give richer data than generic “users.” Sprinkle in edge cases: first-time users, power users, and someone who recently failed and tried again.
Moderate for depth. The trick no one puts on a checklist: use silence. People will fill it with the truth. Label tension when you hear it (“You hesitated-what was going on?”). If a manager or spouse sits in, name the power dynamic and decide whether to continue or switch to a 1:1.
Synthesize responsibly. Start with open coding on a few transcripts, then name patterns only after you’ve seen contradictions. Keep a column labeled “rival explanations”-what else could explain the same behavior? Stop when the signal stabilizes; there’s no magic N, but there is a point where you’re hearing the same thing for the same reasons. Tie every insight to a clip or quote so decisions aren’t faith-based.
Common pitfalls.
- Turning the session into a demo of your clever design.
- Leading questions (“How helpful was…?”).
- Treating quotes as votes (“three people said X, so X wins”) instead of evidence in context.
- Declaring “quantitative” results off a tiny n because the dashboard looks official.
Use this label in your plan or report:
“Qualitative usability test (n=8, moderated). Numbers are descriptive for this sample; findings are issue-focused and traceable to clips.”
The “five users” debate? Consider it settled (for discovery). Iterative small rounds find most common issues quickly and cheaply; that’s the point. They don’t produce population metrics; that’s not the point.
Quantitative Usability Testing (Summative/Benchmarking)
If the decision requires a measurement you can stand behind (“Task 2 success is ≥80%,” “v2 is faster than v1,” “our SUS improved by 6-10 points”), you’re in quant territory. That means a bigger sample, standardized tasks, predefined rules, and confidence intervals around your metrics.
Start with a measurement goal. Examples: “Estimate Task 1 completion to ±10 percentage points at 90% confidence,” or “Detect a 10-point improvement in SUS over last release.”
Plan your sample size. There’s no single magic number, but many standalone benchmarks land in the 20-40+ range per variant, depending on your precision target and the variability of the metric. MeasuringU’s recommendations and calculators are the place to start; they also explain why small-sample completion rates often carry wide intervals.
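As a rough planning aid, here’s a Python sketch of how the margin of error for a completion rate shrinks with sample size under the simple normal (Wald) approximation. The assumed completion rate of 0.8 and the 90% confidence level are planning assumptions; dedicated calculators handle small-sample corrections more carefully, so treat this as a ballpark, not a power analysis.

```python
# Rough planning sketch: margin of error for a completion rate vs. sample size,
# using the simple normal (Wald) approximation. Treat as a ballpark only.
from math import sqrt, ceil
from statistics import NormalDist

def margin_of_error(n, p=0.8, confidence=0.90):
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return z * sqrt(p * (1 - p) / n)

def n_for_margin(target_moe, p=0.8, confidence=0.90):
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / target_moe ** 2)

for n in (10, 20, 30, 40, 60):
    print(f"n={n:<3} margin of error ~ +/-{margin_of_error(n):.0%}")
print("n needed for +/-10 points at 90% confidence:", n_for_margin(0.10))
```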
Standardize tasks and success rules. Define success/indirect success/failure before you launch. Keep instructions neutral. Randomize task order if you’re comparing flows to reduce learning effects.
Choose the right metrics.
- Completion rate (with confidence intervals, not just a single percentage).
- Time on task (report medians and intervals; times are typically skewed).
- Error rate (define what counts).
- Post-task SEQ and post-test SUS if perception matters. For SUS, it’s common to interpret an overall score near 68 as “around average” in many contexts; treat it as a reference point, not a law.
Analyze without overclaiming. Report the uncertainty. “Task 2 completion: 82% (90% CI: 70-91)” is honest and decision-ready; “82% success” without an interval invites false certainty. If variants’ intervals overlap heavily, say there’s no clear difference at your current n.
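Here’s a minimal sketch of one way to do that for time on task: report the median with a percentile-bootstrap 90% confidence interval. The timing data is invented for illustration, and other interval methods (e.g., confidence intervals on log-transformed times) are also defensible.

```python
# Illustrative sketch: median time-on-task with a percentile-bootstrap 90% CI.
import random
from statistics import median

def bootstrap_median_ci(times, confidence=0.90, iterations=10_000, seed=42):
    rng = random.Random(seed)
    boot = sorted(
        median(rng.choices(times, k=len(times))) for _ in range(iterations)
    )
    alpha = (1 - confidence) / 2
    low = boot[int(alpha * iterations)]
    high = boot[int((1 - alpha) * iterations) - 1]
    return median(times), (low, high)

# Task times in seconds for 12 participants (skewed by two long attempts)
times = [38, 42, 45, 47, 51, 55, 58, 61, 66, 72, 140, 210]
m, (low, high) = bootstrap_median_ci(times)
print(f"Median time: {m:.0f}s (90% CI: {low:.0f}-{high:.0f}s)")
```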
Avoid the quant traps.
- Don’t write the success rule after seeing results.
- Don’t compare different prototypes or audiences and call it A/B.
- Don’t slice the data 10 ways to find a “win.” Pre-register your rules and stick to them.
Use this label in your report:
“Quantitative usability evaluation (n=30). Predefined success rules; reporting completion/time with 90% confidence intervals; SUS compared to prior benchmark.”
If you want a deeper read on small-sample math for completion rates and the pros/cons of different interval methods, Sauro and Lewis have comparisons and recommendations; they’re useful when someone challenges your intervals.
Heatmaps, Misclicks, Time, and “Usability Scores”
Heatmaps are gorgeous and persuasive. Treat them as a picture of where this sample clicked. They’re great for spotting confused attention (e.g., lots of clicks on non-interactive labels) and for generating hypotheses about layout or information scent. They don’t, by themselves, tell you what your whole user base will do. (Framed this way, heatmaps play well with both qual and quant sections of your report.)
Misclick rate is a symptom. It often points to affordance issues or weak information scent. Pair the count with evidence (clips of the moment) to locate the cause.
Time on task is skewed in most studies (a few very long attempts stretch the average). That’s why medians and intervals are standard for quant, and why you should resist ranking designs by raw averages from a tiny N.
Composite “usability scores.” Some UX research platforms roll your metrics into a single 0-100 measure. These are handy internal speedometers-useful for scanning a study-but always unpack task-level results before concluding anything.
Here are some examples of situations where you need to decide between a qual and a quant approach to usability testing.
Example A – Formative, moderated (qual)
- Goal: find issues in the account-recovery flow before release.
- Method: 8 moderated sessions; realistic scenarios; neutral prompting; capture clips.
- Sample: recent attempt at recovery (past month), mix of success/failure; edge cases (new phone numbers, enterprise SSO).
- Output: issue list with severity, evidence, and recommended fixes; risks if ignored.
- Label: “Qualitative usability test (n=8, moderated). Numbers are descriptive; findings are issue-focused and traceable to clips.”
- Follow-up: run another 5-6 sessions after fixes, then consider a quant check when the flow stabilizes.
Example B – Unmoderated quick pass (qual + metrics)
- Goal: pulse check on a new homepage layout.
- Method: 12-15 unmoderated participants with 2 tasks; record success, time, heatmap; 2 short open-ends.
- Output: descriptive metrics + evidence (screenshots or short video moments); list of top issues with design changes.
- Label: “Qualitative (unmoderated) usability test with descriptive metrics. Not a benchmark.”
- Follow-up: pick one change, retest quickly; if the design is sticking, plan a larger quant study.
Example C – Summative benchmark (quant)
- Goal: confirm the redesigned checkout meets performance targets.
- Method: 30 participants; standardized tasks; predefined success rules; post-task SEQ, post-test SUS.
- Analysis: completion/time with 90% CIs; SUS mean + CI; compare to last quarter and target thresholds.
- Label: “Quantitative usability evaluation (n=30) with predefined rules; reporting intervals and benchmarks.”
- Follow-up: where metrics are weak, schedule targeted qual to explain the why.
Frequently asked questions
What type of research is usability testing?
Usability testing is the practice of assessing the functionality and performance of your website, app, or product by observing real users completing tasks on it. Usability testing lets you experience your product from the users’ perspective so you can identify opportunities to improve the user experience.
Is UX research qualitative or quantitative?
UX research can include both qualitative and quantitative research methods. You’ll get the best insight into user experience and user behavior by combining the two approaches and including multiple UX research methods in your UX research plan.
What is an example of a qualitative method of usability testing?
Qualitative usability testing can be moderated (run in person or remotely) or unmoderated (usually remote). Both are often based on the “think-aloud” methodology, in which you observe users’ behavior and listen to their feedback as they perform the tasks.
What is the methodology of usability testing?
Usability testing methods are categorized along qualitative/quantitative, moderated/unmoderated, and in-person/remote dimensions. Key methods include Moderated Remote or In-Person Tests (observing users with direct guidance), [Unmoderated Remote Tests](https://uxarmy.com/ux-toolkit/remote-user-testing/) (users complete tasks independently and record their sessions), and Guerilla Testing (quick, informal tests with passersby). Methods like A/B Testing and Surveys focus on quantitative data, while Think Aloud Protocols and Eye Tracking gather richer qualitative insights.
Is usability testing the same as QA testing?
No, not all testing is the same. Quality Assurance (QA) focuses on catching issues early and ensuring the product is built right, while User Acceptance Testing (UAT) checks if it’s the right product for the people who’ll actually use it. Usability testing, by contrast, observes real users completing tasks to evaluate how easily they can use the product.
Is usability testing qualitative or quantitative?
Both; it depends on your goal and design. Use qualitative testing to discover and explain problems (small, purposefully varied samples; evidence via clips/quotes). Use quantitative testing to measure and compare performance (larger samples; predefined success rules; report completion/time with confidence intervals or against benchmarks). NN/g’s guidance is explicit: the two approaches are complementary, not competing.
When should I choose qualitative vs. quantitative usability testing?
Pick qualitative when your product is changing quickly, you need to find & fix issues, or you’re dealing with complex workflows and sensitive interactions where context and language matter. Choose quantitative when you need defensible measurements for a milestone (e.g., “≥80% task completion at 90% confidence”), a comparison to last release/competitors, or parity across locales/devices. In short: discover & explain => qual; measure & compare => quant.
Can I say “80% success” if 8 of 10 participants completed the task?
Not responsibly without a confidence interval. Small samples have wide uncertainty. For example, MeasuringU shows that even at n=20, the margin of error for completion rates can hover around ±20 percentage points; at n=10 it’s wider. Report completion with a CI (e.g., “80% [90% CI: 55-93]”), or frame the number as descriptive only.
What metrics and instruments belong in a quantitative benchmark?
At minimum: task completion rate (with a CI), time on task (report medians + CIs because times are skewed), and a post-task SEQ. For post-test attitude, use SUS; a widely cited reference point is that SUS ≈ 68 sits around the 50th percentile across many systems. Use it as a contextual benchmark, not a pass/fail law.
Are heatmaps and small-study percentages “quantitative data”?
Heatmaps are visualizations of quantitative traces (clicks, gaze, scroll) and are excellent for spotting confused attention or generating hypotheses to test. But when they come from small, convenience samples, treat them as descriptive clues, not population facts. Use them to prioritize issues in qualitative work; use larger, standardized studies when you need estimates or comparisons.