OPINION
April 2026

Synthetic Personas Are Trending. Here's Why Real Consumers Still Win.

Co-authored by Head of Research Lucy Travaglia and CEO James Donald.

The research industry is having its AI moment. Synthetic personas, AI-generated respondents, digital twins of your target consumer. The pitch is compelling: instant data, zero recruitment, infinite scale. No wonder 69% of researchers have already incorporated synthetic data into their work (Qualtrics).

But synthetic models are, by design, engines of the average. And average is exactly where innovation goes to die.

There's a growing conversation about where synthetic data sits on the hype cycle right now, at the peak of inflated expectations for some use cases and already sliding into the trough of disillusionment for others. The path to genuine productivity runs through one thing: quality. And quality is precisely where most synthetic approaches fall short.

Not all synthetic is the same

Before going further, it's worth getting the definitions straight. There are three meaningfully different types of synthetic data, and conflating them is where most of the hype goes wrong.

The first and most crowded bucket is LLM knowledge-based: models that predict what a consumer thinks based on the LLM's own training data, which is largely Western and US in origin. Speed is the sell. Fidelity is the problem. The largest systematic review of synthetic participants ever conducted found that LLM responses lack realistic variability and diversity, describing this flattening as "perhaps the most universal and ubiquitous bias in synthetic participant data." These models are easy to build. The results reflect that.

The second bucket layers aggregate correlations on top of that knowledge base: models trained on polling patterns, population surveys, and behavioural science correlations. These capture statistical patterns at population level, which is fine for broad strokes, but don't expect them to explain the why, or to give reliable answers outside the correlation set.

The third bucket, respondent-level survey training, is the hardest to build and has the strongest claim to validity. These models are trained on real human survey responses rather than general internet knowledge, and the quality gap closes significantly. EY's double-blind study found 95% correlation to real survey results using this approach. Quality is still limited by the breadth of the training data, but this is the approach worth taking seriously. It's also where we're focused at Ideally, enabling customers to leverage the breadth of data they build within the platform over time.

The innovation problem

All synthetic models are trained on historical data. What people said, did, and believed before today. That's fine for some cases. But if you're trying to understand where a category is heading, you're asking a backwards-looking model to find a forward-looking signal. It wasn't built for that.

Category innovation doesn't live in the middle. It lives on the edges. The 8% of consumers already doing something different. The niche signal that hasn't hit the mainstream yet. The early adopter behaviour that becomes a trend 3, 6, or 18 months from now. Synthetic models systematically underweight exactly the behaviours brands are most trying to find.

If you're testing whether a concept is broadly acceptable, synthetic might get you close enough. But if you're trying to find the bleeding edge of what your category is becoming, you need real people, not a model's best guess at what a real person would say.

Synthetic vs real is the wrong question

The framing of synthetic versus real is a false dichotomy. The better question is: what decision are you making, and what are the stakes?

Synthetic data can work well as a fast directional check before real data exists. It can stress-test assumptions. It can help you scope the question before you spend money answering it. These are legitimate uses. But how much weight they can bear depends on the type of model you're using and the source data it's trained on.

To actually capture emerging trends, you need to pair synthetic data with overnight consumer insight. That's where the real differentiation shows up: the delta between what the model predicts and what real people are actually doing right now. Without that comparison, you're not finding trends. You're confirming assumptions.

The most sophisticated research programmes won't choose one or the other. They'll use real human data as the foundation, AI to connect and extend that understanding, and synthetic models to explore scenarios at scale. Real at the centre. Everything else built around it.

Speed was the problem. We solved it differently.

Synthetic research gained traction because real research was too slow and too expensive. Six-week timelines. $40k studies. A dependence on specialist researchers most teams no longer have.

Ideally solved that problem without removing humans from the equation. Real consumers, overnight, at a fraction of traditional cost. Nationally representative samples. Multi-layered fraud detection. Researcher-designed frameworks. And trusted human data becoming the core training layer for synthetic models, with guardrails to keep the two in balance. The tradeoff between speed and genuine signal is gone. You don't have to choose anymore.

The risk of going all in on synthetic isn't that it's wrong. It's that it always looks convincingly right. Outputs that are polished and precise, built on models rather than behaviour, giving teams false confidence on decisions that matter. The most expensive research in the world is the research that never happened. The second most expensive is the research that felt rigorous but wasn't.

Where this goes next

Synthetic data will get better. World-building, more sophisticated modelling, ongoing backtesting against real cohorts. The floor on quality will rise. But the ceiling on what real human insight can tell you, especially at the edges of a category, isn't going anywhere.

The brands that win will be the ones who know when to use which tool. And who never lose sight of the fact that a real consumer, answering a real question, is still the most valuable signal in the business.