Can LLMs replace human survey respondents?
Over the past year, the survey research company Verasight published five empirical studies tackling this question from different angles. Part three of my LLMs for Social Simulation series reviews their research to investigate whether Silicon Sampling can be reliably used for surveys and market research.
Introduction
In the world of market research and political polling, Silicon Sampling holds the promise of delivering fast, cheap, and accurate data: instead of spending weeks and thousands of dollars calling human beings, researchers can prompt a Large Language Model (LLM) with a “persona” and get an instant, human-like response.
But how accurate and reliable are the results we can currently expect from commercial LLMs on these tasks? To answer this question, this post summarizes the findings of a suite of applied research reports published by Verasight between August 2025 and January 2026.
The Methodology in a Nutshell: How to Prompt a Synthetic Human
The core technique used across most Verasight studies is Persona-Based Prompting.
- The Human Baseline: Verasight started with real survey data from their online panel (ranging from 1,500 to 15,000 real U.S. adults).
- The Digital Twin: For every real human in the sample, they created a text-based persona. This included age, gender, race, education, income, state of residence, political ideology, and party affiliation.
- The Prompt: The LLM (ranging from GPT-4o to GPT-5.4) was instructed: “Your job is now to act as a substitute for a human respondent… You must answer the question in the way you think the given persona would answer.”
- The Comparison: Researchers then compared the results of the “Silicon Sample” to the “Human Sample” to see how closely they matched.
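The steps above can be sketched in a few lines of Python. This is a minimal illustration, not Verasight's actual code: the `Persona` class, the `build_prompt` helper, the example respondent, and the prompt layout are all hypothetical. Only the two instruction sentences come from the reports, and the text elided between them in the quote is not reproduced. No model call is shown; the resulting string would be sent to whichever LLM API is in use.

```python
from dataclasses import asdict, dataclass


@dataclass
class Persona:
    """Demographic and political fields listed in the post (hypothetical schema)."""
    age: int
    gender: str
    race: str
    education: str
    income: str
    state: str
    ideology: str
    party: str


# The two sentences quoted in the post; the source elides some text between them.
INSTRUCTION = (
    "Your job is now to act as a substitute for a human respondent. "
    "You must answer the question in the way you think the given persona would answer."
)


def build_prompt(persona: Persona, question: str, options: list[str]) -> str:
    """Assemble one persona prompt. The layout is an assumption, not the reports' format."""
    profile = "\n".join(
        f"- {name.title()}: {value}" for name, value in asdict(persona).items()
    )
    choices = "\n".join(f"{i}. {option}" for i, option in enumerate(options, start=1))
    return (
        f"{INSTRUCTION}\n\nPersona:\n{profile}\n\n"
        f"Question: {question}\n{choices}\n"
        "Reply with the number of one option only."
    )


# An invented respondent, used only to show the output format.
example = Persona(42, "Woman", "Black", "Bachelor's degree", "$50,000-$74,999",
                  "Georgia", "Moderate", "Democrat")
print(build_prompt(
    example,
    "Do you approve or disapprove of the way Donald Trump is handling his job as president?",
    ["Approve", "Disapprove", "Don't know"],
))
```

Running one such prompt per real respondent and tabulating the replies yields the "Silicon Sample" that is then compared against the human answers.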
Report I: The Illusion of Accuracy in Politics
The first investigation [1] focused on the heavy hitters of polling: Trump Approval and the 2026 Generic Congressional Ballot.
The Questions
- “Do you approve or disapprove of the way Donald Trump is handling his job as president?”
- “Which party’s candidate would you vote for in your local congressional district?”
- The Out-of-Sample Test: A question about the “Abundance Agenda” (zoning laws) to see how the model handled a topic not yet in its training data.
The Results
On the surface, the results looked promising. For Trump’s approval, the best-performing model (GPT-4o-mini) was only 4 percentage points off from the human topline.
However, a detailed analysis revealed significant cracks:
- Subgroup Erasure: The model was disastrous at predicting minority opinions. For Black respondents, the error on Trump disapproval ballooned to 15-20 points.
- The Uncertainty Gap: Humans are often undecided. In the real poll, 3% of people chose “Don’t Know” for Trump approval. The LLM? 0%.
- The Zoning Failure: On the “Abundance Agenda” (a non-partisan, novel topic), the LLM was completely lost. It flipped the leading position, turning a 15-point human preference for keeping zoning laws into an 18-point AI preference for changing them.
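To see how a close topline can coexist with large subgroup errors, consider the small sketch below. The records, the group labels "A" and "B", and the helper names `topline`, `share_by_group`, and `subgroup_gaps` are invented for illustration; none of this is Verasight data or code.

```python
from collections import Counter


def topline(records, target):
    """Percent of all respondents giving the `target` answer."""
    return 100.0 * sum(r["answer"] == target for r in records) / len(records)


def share_by_group(records, target):
    """Percent of each subgroup giving the `target` answer."""
    totals, hits = Counter(), Counter()
    for record in records:
        totals[record["group"]] += 1
        if record["answer"] == target:
            hits[record["group"]] += 1
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


def subgroup_gaps(human, synthetic, target):
    """Synthetic minus human share, in percentage points, per subgroup."""
    h = share_by_group(human, target)
    s = share_by_group(synthetic, target)
    return {group: s[group] - h[group] for group in h if group in s}


# Hypothetical toy data: group A mostly disapproves, group B mostly approves.
human = (
    [{"group": "A", "answer": "Disapprove"}] * 3 + [{"group": "A", "answer": "Approve"}]
    + [{"group": "B", "answer": "Disapprove"}] + [{"group": "B", "answer": "Approve"}] * 3
)
# The synthetic sample splits both groups down the middle.
synthetic = (
    [{"group": "A", "answer": "Disapprove"}] * 2 + [{"group": "A", "answer": "Approve"}] * 2
    + [{"group": "B", "answer": "Disapprove"}] * 2 + [{"group": "B", "answer": "Approve"}] * 2
)

print(topline(human, "Disapprove"), topline(synthetic, "Disapprove"))  # 50.0 50.0
print(subgroup_gaps(human, synthetic, "Disapprove"))  # {'A': -25.0, 'B': 25.0}
```

In this toy case the toplines match exactly while each subgroup is off by 25 points, because the two errors cancel in the aggregate. That is why a small topline error says little about subgroup accuracy.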
Report II: Can More Data Fix the Problem?
If a basic persona isn’t enough, what if we give the AI more “context”? Report II [2] tested several “best-case-scenario” enhancements using GPT-5.
The Enhancements
- Chain-of-Thought (CoT): Asking the model to “think step-by-step” before answering.
- Administrative Data: Adding the respondent’s actual 2024 voter turnout history to the persona.
- Attitudinal Anchors: Telling the LLM how the respondent felt about a related topic (e.g., Trump’s tariff policy) before asking for their approval rating.
The Results: The “Garden of Forking Paths”
To the researchers’ surprise, more data did not mean better accuracy:
- For Trump approval, the models with more information actually performed worse than the simpler models from the first report.
- While the model hit a “bullseye” on the Generic Ballot (within 1 point), it was likely “cheating” by simply echoing the partisanship and voting history already present in the prompt.
- On immigration policy, even with all the extra data, the error remained at a massive 11.3 percentage points.
The conclusion was clear: LLM results are highly sensitive to how the prompt is written (the “forking paths”), and there is no way for a researcher to know if adding more data is helping or hurting without already knowing the “true” human answer.
Report III: Consumer Behavior and the “Coffee Test”
The third report [3] shifted from politics to Market Research, specifically focusing on coffee consumption, brand awareness, and interest in hypothetical new products.
The Study
Researchers compared human responses to LLM (GPT-5) simulations across two studies:
1. Raw Replication: Simple persona-based prompting.
2. The “Digital Twin” Approach: Giving the LLM additional context about a persona’s existing coffee habits (e.g., frequency, budget, favorite brands) to see if it could “impute” other related behaviors.
The Results: A Bitter Brew
The relative accuracy found in political polling vanished when applied to consumer behavior:
- The Error Explosion: While politics saw errors around 3-4 points in some cases, the coffee study had a mean absolute error of 19.8 points.
- Stereotyping “Heavy” Users: The LLM consistently overestimated consumption. It predicted 91% of adults drink coffee daily (actual: 56%) and failed to predict any non-drinkers (actual: 17%).
- The Failure of Novelty: When asked to rank six hypothetical Starbucks drinks, the LLM overwhelmingly flocked to “Autumn Ember” (66%), while human preferences were nearly evenly split (18-21%) across multiple options. The model’s inability to simulate taste preferences represents a fundamental limitation for product development.
- The “Digital Twin” Limit: Adding domain-specific data (like existing coffee habits) reduced error by about 50% (from 20 points down to 9), but the models still missed the “wide distribution” of real human attitudes. For example, the LLM placed zero consumers in the “About the same” price category for Starbucks, whereas 16% of humans chose it.
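For readers unfamiliar with the metric, mean absolute error here is the average gap, in percentage points, between the human and synthetic share for each response option. The sketch below applies it to the only two shares quoted above. The helper name `mean_absolute_error` and the option labels are my own, and the result is not meant to reproduce the report's 19.8 figure, which covers many more items and may be aggregated differently.

```python
def mean_absolute_error(human, synthetic):
    """Average absolute gap, in percentage points, across the listed response options."""
    options = sorted(set(human) | set(synthetic))
    gaps = [abs(human.get(o, 0.0) - synthetic.get(o, 0.0)) for o in options]
    return sum(gaps) / len(gaps)


# Shares quoted in the post for the coffee study (percent of adults).
human = {"drinks coffee daily": 56.0, "does not drink coffee": 17.0}
llm = {"drinks coffee daily": 91.0, "does not drink coffee": 0.0}

print(mean_absolute_error(human, llm))  # (35 + 17) / 2 = 26.0
```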
Key Insight: Culture vs. Nuance
The LLM relies on broad cultural knowledge (e.g., “Starbucks is popular and expensive”) rather than the nuanced geographic and demographic patterns that drive real market behavior. It fails to capture the “messiness” of human taste and the reality of regional brand preferences (completely missing the mark on brands like Peet’s Coffee).
Report IV: Beyond Politics and the Velociraptor Test
The fourth report [4] expanded the domain to 52 different questions across Health Care, Tech, Society, and Life. It also introduced the most amusing—but telling—test of the series: “Which animals could you defeat in a fight?”
The Questions
- Personal preferences: “Do you prefer a pumpkin spice latte or a regular coffee?”
- Market research: Consumer habits for meal-delivery apps and brand awareness.
- The “Velociraptor Test”: A multi-response question asking respondents to select all animals they could defeat unarmed (from a rat to a gorilla).
The Results: The “Graying Out” of Human Opinion
- Topic Failure: While politics had an error of 12.1 points, Health Care questions had a staggering 23.4-point error.
- Regression to the Mean: LLMs have a “systematic tendency” to predict the average. They over-predict rare opinions and under-predict common ones. They fail to capture the “extreme” or polarized responses that humans actually give.
- The Multi-Response Mess: LLMs are incapable of handling “Select all that apply” questions. In the animal fight test, the LLM ignored 35% of the options that humans frequently picked. If 40% of humans said they could beat a certain animal, the LLM sometimes chose that option zero times across 1,000 trials.
Report V: The THESEUS Project and the Limits of Scaling
The fifth and final report [5] in the series attempted to “steelman” the entire synthetic polling approach by building the most sophisticated pipeline possible, codenamed THESEUS.
The Methodology: A Bottom-Up Approach
Instead of simple persona prompts, THESEUS used a multi-stage process:
1. Census Microdata: Sampling 1,000 “agents” from the 2023 American Community Survey (ACS) to ensure a representative demographic and geographic frame.
2. Attitudinal Imputation: Using a pool of 15,000 real “donor” respondents to impute political attitudes (Party ID, ideology, 2024 vote) onto the Census-based agents.
3. Verbalized Sampling: Asking GPT-5.4 for a probability distribution across response options rather than a single choice, allowing for more nuanced aggregate estimates.
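The aggregation behind the third step can be sketched as follows. This is an assumption about how per-agent distributions might be combined, not the THESEUS implementation: the function name `aggregate_verbalized`, the optional weights, the renormalization step, and the two example agents are all hypothetical.

```python
def aggregate_verbalized(agent_probs, weights=None):
    """Average per-agent probability distributions into a topline, in percent."""
    if weights is None:
        weights = [1.0] * len(agent_probs)
    total_weight = sum(weights)
    topline = {}
    for probs, weight in zip(agent_probs, weights):
        mass = sum(probs.values())  # renormalize in case the numbers do not sum to 1
        for option, p in probs.items():
            topline[option] = topline.get(option, 0.0) + weight * (p / mass)
    return {option: 100.0 * value / total_weight for option, value in topline.items()}


# Two invented agents, each returning a distribution instead of a single answer.
agents = [
    {"Approve": 0.7, "Disapprove": 0.2, "Don't know": 0.1},
    {"Approve": 0.1, "Disapprove": 0.8, "Don't know": 0.1},
]
for option, share in aggregate_verbalized(agents).items():
    print(f"{option}: {share:.1f}%")
```

If each of these two agents instead reported only its most likely option, the tally would be 50% Approve, 50% Disapprove, and 0% "Don't know". Averaging the distributions keeps the minority probability mass in the estimate, which is the sense in which this approach allows more nuanced aggregates.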
The Results: The Success of Polarization, the Failure of Salience
The THESEUS pipeline achieved the highest accuracy yet for high-profile political items, but failed elsewhere:
- Topline Success: Error on the Generic Ballot (1.8 MAE) and Trump Approval (2.1 MAE) fell within the range of normal sampling noise for a human poll.
- Low-Salience Failure: For “out-of-sample” or low-salience topics, the model missed badly. Awareness of recent events in Venezuela was off by 12.8 points, and federal budget tradeoffs were off by 11 points.
- The Subgroup Gap: Even when toplines were accurate, subgroup errors (e.g., by Party ID) remained high, often exceeding 10-13 points. The LLM systematically biased partisan groups toward the center (50/50).
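For a sense of what “normal sampling noise” means, the textbook 95% margin of error for a simple random sample of size \(n\) and a proportion \(p\) is worked out below. The values \(n = 1500\) (the smallest panel size mentioned earlier) and \(p = 0.5\) are assumptions chosen for illustration, and real polls also carry weighting and design effects that this formula ignores.

\[
\text{MoE}_{95\%} = 1.96\sqrt{\frac{p(1-p)}{n}} = 1.96\sqrt{\frac{0.5 \times 0.5}{1500}} \approx 0.025
\]

That is roughly 2.5 percentage points, so errors of 1.8 and 2.1 points are in the same ballpark as what sampling alone would produce under these assumptions.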
Key Insight: The Scaling Myth
Perhaps the most significant finding was that scaling up did not help: increasing the agent pool from 500 to 5,000 had almost no impact on accuracy. This indicates that the errors in synthetic polling are systematic and structural (model bias) rather than statistical noise, because sampling noise shrinks as the number of agents grows while bias does not. You cannot “spend” your way to accuracy by running more agents.
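A back-of-the-envelope calculation shows why more agents cannot remove a systematic error. The sketch below uses the standard decomposition of squared error into squared bias plus sampling variance. It treats agents as independent draws and assumes a 10-point bias; both are illustrative assumptions of mine, not figures from the report, and `expected_rmse` is a hypothetical helper.

```python
import math


def expected_rmse(bias_points, n_agents, p=0.5):
    """Expected root-mean-square error, in percentage points, of a biased share estimate."""
    sampling_var = p * (1 - p) / n_agents  # shrinks as 1/N
    bias = bias_points / 100.0             # does not depend on N
    return 100.0 * math.sqrt(bias ** 2 + sampling_var)


for n in (500, 5000):
    biased = expected_rmse(10.0, n)    # hypothetical 10-point model bias
    unbiased = expected_rmse(0.0, n)   # sampling noise only
    print(f"N={n}: biased {biased:.2f} pts, unbiased {unbiased:.2f} pts")
```

Under these assumptions, a tenfold increase in agents cuts the noise-only error from about 2.2 to about 0.7 points, but moves the biased estimate only from about 10.2 to about 10.0 points. Once bias dominates, additional agents buy almost nothing.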
The Verdict: Why LLMs Aren’t Ready for Market Research
The Verasight research identifies two conditions necessary for “Silicon Sampling” to even stand a chance:
1. Training Data Density: The LLM must have been trained on vast amounts of data regarding the specific topic.
2. Demographic Predictability: The attitude must be predictably linked to demographics (like partisanship).
When these conditions aren’t met, which is the case for most market research, product testing, and novel policy questions, LLMs fail.
Key Takeaways for Researchers:
- Don’t Trust Subgroups: If you need to know what young people or minority groups think, you must ask them directly. AI will give you a “stereotypical average” that erases their unique perspectives.
- The Polarization Bias: LLMs tend to “gray out” the world, making human opinion look more uniform and less polarized than it actually is.
- Scaling is a Dead End: More “fake” respondents will not fix a biased model. Accuracy must come from better attitude imputation and world-modeling, not higher \(N\).
- The Cost of “Fake” Data: The money saved on synthetic sampling is lost in the cost of wrong decisions. A 10-point error in a political race is the difference between a landslide victory and a crushing defeat; a 23-point error in health care is a categorical failure.
Synthetic data may have a place in testing software bugs or generating “dummy” data for pipelines, but as a window into human preferences, it is still a distorted mirror.
References
[1] Morris, G. Elliott. 2025. “Your Polls on ChatGPT.” Verasight White Paper Series. https://www.verasight.io/reports/synthetic-sampling
[2] Morris, G. Elliott, Benjamin Leff, and Peter K. Enns. 2025. “The Limits of Synthetic Samples in Survey Research.” Verasight White Paper Series. https://www.verasight.io/reports/synthetic-sampling-2
[3] Morris, G. Elliott, and Benjamin Leff. 2025. “LLMs Misread Real Consumer Behavior.” Verasight White Paper Series. https://www.verasight.io/reports/coffee-llm
[4] Morris, G. Elliott, and Benjamin Leff. 2026. “Can Large Language Models Replicate Survey Data Across Topics?” Verasight White Paper Series. https://www.verasight.io/reports/synthetic-omnibus-survey
[5] Morris, G. Elliott, Benjamin Leff, and Peter K. Enns. 2026. “Can AI ‘digital twins’ replace human respondents?” Verasight White Paper Series. https://www.verasight.io/reports/can-ai-digital-twins-replace-human-respondents