You’ve probably heard some version of this lately: “Why run usability tests when we can ask an AI, or ship five versions and keep the winner?” Development speed is up, experimentation platforms are everywhere, and dashboards feel like truth. It’s tempting to think watching real people fumble through real tasks is optional now.
Here’s the sober counterpoint: analytics and AI show you what happened; usability testing in product development explains why it happened and what to change. That’s not nostalgia; that’s the current consensus among the most credible UX sources. Nielsen Norman Group’s latest guidance is blunt: today’s AI tools are excellent assistants for note-taking and categorization, but they are not capable of observing, facilitating, or analyzing moderated usability tests in place of researchers. Synthetic users (AI-generated stand-ins) can help with ideation, but they do not replace real participants when decisions carry risk.
Meanwhile, customers are telling us, loudly, that shipping faster hasn’t automatically made experiences better. Forrester’s 2024 CX Index reported a third straight year of CX decline in the U.S., to the lowest level since the index began. More tech, more dashboards, more experiments; worse outcomes. If there were ever a time to re-ground decisions in observed human behavior, it’s now.
Speed Has Changed, but the Human Reality Hasn’t
Generative AI has made the work around research faster: drafting task scenarios, producing transcripts, auto-tagging, first-pass clustering, audience-specific summaries. That’s all real and useful. It’s also true that software teams can code certain tasks faster with AI. GitHub’s controlled experiments reported appreciable speedups on scoped assignments; at the same time, a 2025 randomized study by METR found experienced developers were slower on complex, real-world work when using AI, largely because quality assurance and prompt iteration soaked up time. Translation: AI amplifies your process; good or bad.
So yes, you can build and launch more. But the uncomfortable math from large-scale experimentation hasn’t budged: most ideas don’t move the metric. Microsoft’s experimentation leaders have said for years that only a minority of tested changes are wins; many are neutral, and a nontrivial share are negative. If your plan is “ship many and keep the winner,” your hit rate doesn’t improve unless your inputs improve; and that’s exactly where usability testing earns its keep.
Data Platforms Excel at the WHAT; Usability Testing Reveals the WHY
Gartner and Forrester coverage shows enterprises pouring money into analytics, digital-experience platforms, and experimentation stacks. These tools are superb at surfacing what users did (or didn’t do) and at orchestrating and measuring experiments at scale. But they aren’t designed to sit with a person, probe a hesitation, decode a misread label, or untangle a mental model. The gap between what and why is precisely where usability testing lives; and where teams either ship the right fix or spray variants into the void.
You’ve likely lived the archetype. Analytics flags a cliff at “Verify phone number.” The room argues about latency, copy, or code delivery. Three moderated sessions later you’ve watched people paste an OTP that truncates visually, seen auto-focus jumps break paste behavior, and heard a screen reader read “star star star,” wiping out meaning. In an hour, the why is unambiguous; and the fix is boringly obvious.
The New Habit: “Launch 3 versions and keep the winner”
Another idea echoing in product orgs is the launch-and-choose mindset:
- “Dev velocity is up; let the market tell us.”
- “We can create as many versions as necessary and pick the top performer.”
We should give this its due: when run well, experimentation is a competitive advantage. But two facts apply:
- Build speed isn’t outcome speed. AI can double throughput on some work and still slow you down on other work; the METR study with seasoned developers is a useful caution. Don’t confuse the rate of shipping with the rate of learning.
- Selection ≠ design. Experiments select among candidates; usability testing creates better candidates by revealing why people stumble so you can fix the cause, not just the symptom. As Microsoft’s Ronny Kohavi summarized from years of online experiments, most features don’t help; which makes the quality of your inputs decisive.
“Most experiments show that features fail to move the metrics they were designed to improve.” (Ronny Kohavi, Microsoft experimentation group)
Where usability testing pays for itself
If you want leadership to take usability seriously in 2025, don’t talk about “delight.” Talk about avoided waste, conversion lift, support deflection, risk reduction, and team alignment, with receipts.
1) Avoided rework (the cheapest time to fix is before code hardens)
- Prototype catch => sprint saved: A fintech’s KYC flow required selfie + ID scan. Six quick moderated sessions surfaced three root causes: a “live photo” control that looked disabled, a preview overlay that hid “Continue,” and copy implying a permanent lockout on retry. The team fixed states, layout, and copy, then re-tested with five more people: no misreads, faster progression, legal sign-off. Engineering avoided building two misfeatures. (If you need a stat to frame this: Forrester’s CX decline suggests teams are spending a lot to fix problems after launch. The smarter spend is before.)
2) Conversion lift (especially on checkout and money paths)
Checkout UX has headroom: Baymard Institute’s decade-long research shows the average large ecommerce site could boost conversion by ~35% by fixing known checkout UX problems. You don’t need to guess which issues (Baymard catalogs them); usability testing tells you which ones your flow actually suffers from, and in what sequence. A quick back-of-the-envelope sketch of what that headroom can be worth follows below.
A narrative that convinces: Pair one short clip (someone mistakes the address line for the apartment-number field; a dropdown hides the country) with the measured conversion delta post-fix. Executives remember the clip and the lift.
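To make the headroom argument concrete for stakeholders, here is a minimal sketch of the arithmetic. The traffic, baseline conversion, order value, and assumed lift are placeholders, not benchmarks, and the ~35% figure above is a ceiling, not a promise.

```python
# Illustrative placeholders: swap in your own analytics figures.
monthly_checkout_sessions = 200_000
baseline_conversion = 0.025        # 2.5% of checkout sessions convert
average_order_value = 60.0         # in your currency

def monthly_revenue(conversion_rate: float) -> float:
    """Checkout revenue per month at a given conversion rate."""
    return monthly_checkout_sessions * conversion_rate * average_order_value

baseline = monthly_revenue(baseline_conversion)
# Assume usability fixes recover only a slice of the known headroom,
# say a 10% relative lift (2.5% -> 2.75% conversion).
lifted = monthly_revenue(baseline_conversion * 1.10)

print(f"baseline revenue:  {baseline:,.0f}")
print(f"with modest lift:  {lifted:,.0f}")
print(f"monthly delta:     {lifted - baseline:,.0f}")
```

Even a conservative slice of the headroom usually dwarfs the cost of a handful of moderated sessions.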
3) Support deflection (tickets you never had to handle)
Designing out top tickets: Post-release usability checks routinely find “paper-cut” issues that drive a surprising share of tickets: ambiguous error states, unclear retry logic, missing status affordances. Fixing those and updating empty-state copy removes recurring costs. If you need a money anchor for the conversation, seat costs for support platforms like Zendesk are public and aren’t trivial; you don’t need an exact “cost per ticket” to make the point that fewer tickets = fewer agents or more capacity.
4) Faster alignment (evidence that ends circular debates)
The executive clip: A two-minute video of a real customer failing a high-stakes task resolves weeks of opinion ping-pong. This is especially potent when analytics are ambiguous; the clip explains the number.
5) Safer launches (risk surfaced before public)
“We tested the weak link.” For important flows (checkout, auth, payment, permissions), a short, task-focused study before release catches edge-case hazards (IME input quirks, autofill collisions, locale and language problems, accessibility regressions) that experimentation alone would expose to real customers.
6) Accelerated experimentation (better variants, fewer duds)
- Qual feeds quant: Use usability tests to generate and refine hypotheses, then A/B test the highest-leverage fix. Mature experimentation orgs (like Booking.com) explicitly mix qualitative feedback with quantitative metrics to understand mechanisms, not just outcomes. That’s how you raise the probability of shipping a useful variant.
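Once the qualitative round has produced a candidate fix, the quantitative confirmation can be as simple as a two-proportion z-test on the experiment results. A minimal sketch with made-up numbers (the counts and sample sizes are illustrative; use your experimentation platform’s stats engine where you have one):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical results: control vs. the variant informed by usability findings.
conv_a, n_a = 1_180, 24_000   # control conversions / visitors
conv_b, n_b = 1_310, 24_000   # variant conversions / visitors

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)

# Two-proportion z-test with pooled standard error.
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided

print(f"control {p_a:.2%} vs variant {p_b:.2%}: z = {z:.2f}, p = {p_value:.4f}")
```

The usability sessions explain the mechanism; the test confirms the magnitude.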
Usability testing in the Double Diamond design process
Think of the Double Diamond “Discover => Define => Develop => Deliver” as a rhythm, not a ritual.
Usability testing isn’t a mid-process gate; it threads through both diamonds: you watch people early to understand the problem, you watch again to shape the solution, and you keep watching as code ships and real usage rolls in. In the AI era the cadence becomes a living loop: models compress planning and synthesis and surface signals faster, but human-run sessions are still how teams build rapport, notice hesitation, and separate symptoms from cause.
Some practitioners even describe an extra, AI-supported space for continuous learning and scaling, yet the foundations remain unchanged: deliberate divergence and convergence, guided by a human-centered lens and accountable evidence.
Use AI as an accelerator: to draft screeners and task prompts, transcribe and rough-cluster notes, flag contradictions, and tailor summaries. Then review its output with a researcher’s judgment. What it cannot do is moderate with empathy, read unspoken confusion, or decide which friction matters most in your context.
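As one concrete example of the “rough-cluster notes” step, here is a minimal sketch using TF-IDF and k-means from scikit-learn as a stand-in for whatever model your tooling uses. The notes and cluster count are illustrative, and a researcher still reviews, merges, and names the clusters.

```python
# Requires scikit-learn. Notes and cluster count are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

notes = [
    "Didn't notice the Continue button under the preview overlay",
    "Thought the live photo control was disabled",
    "Assumed a failed retry meant a permanent lockout",
    "Pasted the OTP but the field truncated the last digits",
    "Screen reader announced 'star star star' for the code field",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(notes)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cluster, note in sorted(zip(labels, notes)):
    print(cluster, "-", note)
```

Treat the output as a first draft of an affinity map, not a finding.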
A simple readiness check keeps the loop honest: Before Discover ends, do you have clips that make the problem undeniable? Before Define ends, could a neutral facilitator apply your success rules consistently? Before Develop ends, does your change log show issues shrinking across rounds? Before Deliver ends, do you know which steps remain fragile and who will watch them post-launch?
Usability Testing Cadence that Survives Shorter Sprints
- Every 1 to 2 weeks (evolving designs): Three to five short moderated sessions on one or two core tasks. Write goal-focused prompts (“Find where you would…”), keep them neutral, and capture a clip for each critical moment.
- Monthly (stable flows): a light unmoderated pulse for breadth and trend, with 2–3 standardized tasks and clear success rules; treat it as a smoke test, not an in-depth diagnosis.
- Before big releases (high-stakes flows): a short, standardized usability check on the finalized flow; task completion, time on task (use medians), error types, post-session interviews to capture “why” in the participants’ own words.
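For the standardized pre-release check, the reporting can stay simple. A minimal sketch of a per-task summary, with made-up session records and field names:

```python
from statistics import median

# Hypothetical session records for one standardized task.
sessions = [
    {"completed": True,  "seconds": 74,  "errors": ["misread label"]},
    {"completed": True,  "seconds": 58,  "errors": []},
    {"completed": False, "seconds": 190, "errors": ["paste failed", "wrong field"]},
    {"completed": True,  "seconds": 66,  "errors": []},
    {"completed": True,  "seconds": 95,  "errors": ["misread label"]},
]

completion_rate = sum(s["completed"] for s in sessions) / len(sessions)
median_time = median(s["seconds"] for s in sessions)   # medians, because task times skew
error_counts: dict[str, int] = {}
for s in sessions:
    for e in s["errors"]:
        error_counts[e] = error_counts.get(e, 0) + 1

print(f"completion {completion_rate:.0%}, median time {median_time}s, errors {error_counts}")
```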
NN/g’s baseline definition of usability testing is worth revisiting when teams start to mythologize it: a facilitator gives tasks, observes behavior, and probes neutrally; the goal is to discover problems and understand their causes. That hasn’t changed.
Hard Truths about Unmoderated Remote Usability Testing
Unmoderated remote testing is quick and affordable. For well-scoped tasks with clear success rules, it’s great. But we’re seeing unrealistic expectations in parts of Asia and MENA when teams substitute unmoderated tests for moderated in-depth interviews (IDIs). A few realities to factor in:
1) Usage gap and connectivity realities
GSMA’s 2024 report shows that the usage gap (people who live in covered areas but don’t use mobile internet) remains large in MENA (~41%) and South Asia (~42%). Globally, almost 3.1 billion people lived within mobile broadband coverage but weren’t using mobile internet, and adoption lags, especially outside cities. That matters if your recruiting assumes always-online participants, stable video, or large Figma prototype downloads.
2) Digital inclusion and gender gaps
The ITU’s 2024 figures show 189 million more men than women online, with gaps wider in several low- and middle-income regions. In practice, this skews who you can reach quickly via unmoderated platforms and can bias your samples, especially if you pay via channels women are less likely to use or trust.
3) Language, writing direction, and localization
Arabic (like other languages written in the Arabic script) reads right-to-left. It’s not just text direction; layout mirroring and numerous UI details shift. W3C’s Arabic & Persian layout requirements, Material Design’s bidirectionality guidance, and long-standing NN/g findings on RTL attention patterns highlight pitfalls that unmoderated prompts (often in English) routinely miss. If your task text or prototype isn’t localized and mirrored appropriately, unmoderated sessions collect noise.
4) High-context communication norms
In many Asian markets, communication is more high-context: indirect, with heavier reliance on nonverbal cues. Participants are less likely to volunteer strong negative statements to a faceless form. Moderators adjust tone, pacing, and probing to surface honest friction; unmoderated studies can miss those signals entirely. NN/g’s cross-cultural guidance calls this out explicitly.
5) Panel and data-quality risks
Unmoderated tools depend on remote panels. Across the insights industry, fraud and careless responding are rising concerns, and participants themselves are increasingly wary of online scams. Gallup has documented data-quality disparities across opt-in panels, and multiple industry pieces describe botting and “speeders” as a growing threat. For sensitive tasks or markets with thinner panel supply, noise escalates. Moderation doesn’t eliminate fraud, but it raises the bar; people can’t mindlessly click past a human.
6) Cost of mobile data and device constraints
Even when people have smartphones, data costs and low-end devices affect behavior: participants avoid large downloads, background video, or high-definition screen recording. That’s poison for unmoderated think-aloud tasks. Recent reporting highlights affordability as an ongoing barrier across developing regions; if your tool assumes cheap, fast data, your sample becomes self-selecting.
What to do instead?
- Don’t use unmoderated as a substitute for IDIs. Use it for simple, well-scoped tasks with unambiguous endpoints; keep moderated sessions for discovery, complex flows, or markets where context/culture/language make misunderstanding likely.
- Localize the whole experience. Mirror layouts for RTL, translate tasks idiomatically rather than literally, and recruit on recent behavior (not job titles). W3C’s i18n notes and NN/g’s international research primers are practical references.
- Design for the devices people actually use. Test on lower-end Android, small screens, older OSs, and constrained bandwidth.
- Blend methods: a few moderated IDIs to surface mechanisms => a targeted unmoderated check for breadth => analytics to observe in the field => an experiment to confirm impact. That’s the loop that works.
- Harden your quality gates: trap questions, time-on-task thresholds, duplicate-participant checks; validate identity when stakes are high; and review session artifacts (video, logs) in addition to survey-style responses.
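What those gates can look like in practice, as a minimal sketch; the thresholds, field names, and trap answer are illustrative, and real platforms layer on device fingerprinting and panel-level checks.

```python
# Illustrative thresholds and field names; tune them per study and platform.
MIN_SECONDS = 20          # flag implausibly fast completions ("speeders")
TRAP_EXPECTED = "blue"    # the answer the trap question asks participants to give

def passes_quality_gates(response: dict, seen_ids: set) -> bool:
    """Return True if an unmoderated response clears the basic quality gates."""
    if response["participant_id"] in seen_ids:
        return False                                      # duplicate participant
    if response["task_seconds"] < MIN_SECONDS:
        return False                                      # speeder
    if response["trap_answer"].strip().lower() != TRAP_EXPECTED:
        return False                                      # failed attention check
    seen_ids.add(response["participant_id"])
    return True

responses = [
    {"participant_id": "p01", "task_seconds": 84, "trap_answer": "Blue"},
    {"participant_id": "p02", "task_seconds": 9,  "trap_answer": "Blue"},
    {"participant_id": "p01", "task_seconds": 77, "trap_answer": "Blue"},
]
seen: set = set()
kept = [r for r in responses if passes_quality_gates(r, seen)]
print(f"{len(kept)} of {len(responses)} responses kept")
```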
A Practical Playbook for Usability Testing
- Early (prototype) phase: Three to six moderated sessions will outperform a week of debates. Write goal-oriented tasks, avoid leading language, and pilot with one person outside the team. Capture a single clip per crucial issue.
- Pre-release (stable flow): Standardize tasks and success rules, measure completion and time on task (report medians; task times are skewed), and debrief participants to capture the “why.”
- Post-release: Let analytics find the “where,” let usability testing explain the “why,” and let experiments tell you the “so what.” Mature orgs (like Booking.com) explicitly combine qualitative feedback with quantitative measures to interpret outcomes and design better follow-ups.
Stakeholder Objections to Usability Testing
“We’re too small to test.”
You’re too small not to. Five sharply recruited sessions this week prevent two sprints of rework next month. Start with one critical path.
“We already have analytics.”
Great. Analytics shows where the problem is; usability tests show why it happens and what to change. You need both.
“AI can tell us.”
AI can draft, transcribe, auto-tag, cluster, and summarize. It cannot build rapport, read hesitation, or decide which behavior matters most in your context. Treat it like a junior analyst whose work you review; never as the decider.
“We’ll ship multiple versions and keep the winner.”
Do it, but feed the machine better candidates. Most variants won’t help; usability evidence raises the odds and explains outcomes so you can compound learning rather than churn.
The Bottom Line with Usability Testing
AI and experimentation have made it easier than ever to build and choose. But leverage without direction multiplies risk. Usability testing gives direction by revealing the human reasons behind the chart. Use it early to shape better options, pre-release to de-risk launches, and post-release to explain shifts the dashboard can only point at.
If you hit skepticism, share a 90-second clip of a real person failing a high-stakes task, and then the before/after fix. Pair that with a simple business frame: avoided rework, conversion lift, and support deflection. In an era when CX is slipping and AI can make us feel faster than we are, those few minutes of reality are what steer products toward outcomes.
To run usability testing, you can conduct sessions in person or go remote with any of several remote usability testing tools. There is always more than meets the eye!
Experience the power of UXArmy
Join countless professionals in simplifying your user research process and delivering results that matter
Frequently asked questions
What is usability testing and how is it different from A/B testing or analytics?
Usability testing observes real people completing tasks to reveal why issues happen; A/B testing and website testing tools show which variant wins, and analytics show what users did. Pair them for best results.
When in the lifecycle should I test a product?
From paper prototypes to live builds—test early and often: before redesign, during iterations, and after release to validate improvements.
Which methods suit early concepts vs near-release builds?
Early: paper or low-fi flows (Figma/InVision), card sorting tools, tree testing tools to shape IA. Later: task-based studies with a usability testing platform and UX analytics tools.
What is product usability testing?
Usability testing is a process of evaluating a product or service by testing it with real users to identify areas for improvement.
What are the 5 E’s of usability?
Usability is more than just ease of use. You need to ensure designs are efficient, effective, engaging, easy to learn and error tolerant if you want them to succeed.
What is usability testing with an example?
Usability testing is the practice of testing how easy a design is to use with a group of representative users. It usually involves observing users as they attempt to complete tasks and can be done for different types of designs. It is often conducted repeatedly, from early development until a product’s release.
Should I run moderated or unmoderated studies?
Moderated (live facilitation) is great for probing “why.”
Unmoderated via remote usability testing software scales quickly across devices and locations. Many teams use both.
How many participants do I need per round?
Start with ~5 users per round to maximise ROI and iterate; do multiple small rounds instead of one big test.
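Why ~5 works: under the classic Nielsen/Landauer problem-discovery model, each usability problem has roughly a 31% chance of surfacing with any single participant, so a handful of users already expose most recurring issues. A quick sketch (the 31% is an across-study average, not a law):

```python
# Problem-discovery model: P(found) = 1 - (1 - p)^n, with p ≈ 0.31 on average.
p = 0.31
for n in (1, 3, 5, 8, 15):
    found = 1 - (1 - p) ** n
    print(f"{n:>2} users -> ~{found:.0%} of problems seen at least once")
```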
How do I benchmark usability with SUS?
Run a system usability scale survey (10 items). Average ≈ 68; below that is underperforming; 80+ is excellent. Use a SUS score calculator (or built-in dashboard) and track trends.
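If you’d rather score SUS yourself than rely on a dashboard, the standard scoring rule is simple; the responses below are illustrative for a single participant, and you average across participants in practice.

```python
def sus_score(responses: list[int]) -> float:
    """Standard SUS scoring: 10 items rated 1-5, alternating positive/negative wording."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)   # odd items positive, even negative
    return total * 2.5                                 # scale raw 0-40 to 0-100

# One (illustrative) participant's answers.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))   # -> 85.0
```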