The researchers concluded that LLMs can now be used to simulate how specific groups of people will respond to messages, decisions, and interventions quickly, cheaply, and with a level of accuracy competitive with costly human research programs.
“The ability to simulate experiments and produce relatively accurate predictions in minutes, for only a few dollars, can potentially advance efforts towards scientific and practical goals,” the Stanford/NYU team wrote.
For decades, the social sciences and the marketing world have relied on a time-tested but slow and expensive tool: the randomized controlled trial (RCT). Whether testing the impact of a public health message or a new product concept, the process has remained largely unchanged. We form a hypothesis, recruit human participants, and measure the results. This process, while rigorous, is a significant bottleneck to innovation and decision-making. But what if we could predict the results of these experiments with high accuracy, in minutes, for a fraction of the cost?
A groundbreaking new study from researchers at Stanford and NYU, published in August 2024, suggests this is now possible. The paper, "Predicting Results of Social Science Experiments Using Large Language Models," provides compelling evidence that advanced AI models like GPT-4 can simulate human responses with a startling degree of accuracy, a development that has profound implications for any organization that depends on understanding consumer behavior.
The researchers didn't just test a few isolated examples. They built a massive archive of 70 pre-registered, nationally representative survey experiments involving over 100,000 human participants and 476 distinct experimental effects. These weren't obscure academic exercises; they were high-quality, peer-reviewed studies from a range of fields including social psychology, political science, and public health.
Key Points:
• Stanford/NYU (2024) tested GPT-4 across:
  • 70 nationally representative U.S. experiments
  • 100K+ human participants
  • 476 treatment effects
• AI predictions correlated with real experimental outcomes at r = 0.85
• Equaled or surpassed human forecasters in accuracy
• Strong predictive alignment in text-based survey experiments
• Useful for message testing, framing, and intervention simulation
The methodology was elegant in its simplicity. The team prompted GPT-4 to simulate how a representative sample of Americans, complete with specific, randomly assigned demographic profiles, would respond to the exact stimuli from the original experiments. They then compared the AI-predicted results to the actual results from the human participants.
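To make that setup concrete, here is a minimal sketch of what such a simulation loop could look like, assuming the OpenAI Python SDK. The demographic dimensions and prompt wording below are illustrative stand-ins, not the authors' actual profiles or materials.

```python
# Illustrative sketch of the paper's simulation setup (not the authors'
# exact prompts). Assumes the OpenAI Python SDK (>= 1.0) and an
# OPENAI_API_KEY in the environment.
import random
from openai import OpenAI

client = OpenAI()

# Hypothetical demographic dimensions; the study assigned profiles matching
# nationally representative marginals.
PROFILE_SPACE = {
    "age": ["18-29", "30-44", "45-64", "65+"],
    "gender": ["man", "woman"],
    "party": ["Democrat", "Republican", "independent"],
    "education": ["high school", "some college", "college degree"],
}

def sample_profile() -> dict:
    """Draw one simulated participant's demographic profile at random."""
    return {k: random.choice(v) for k, v in PROFILE_SPACE.items()}

def simulate_response(stimulus: str, question: str) -> str:
    """Ask GPT-4 to answer the survey question as a person with the profile."""
    p = sample_profile()
    persona = (
        f"You are a {p['gender']} aged {p['age']} in the United States, "
        f"a {p['party']} with {p['education']} education. Answer the survey "
        f"question in character, giving only the requested rating."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": f"{stimulus}\n\n{question}"},
        ],
    )
    return reply.choices[0].message.content

# Averaging many simulated responses per condition stands in for the human
# treatment and control groups; the difference in means estimates the effect.
```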

The findings are, frankly, stunning.
"Predictions derived from simulated responses correlate strikingly with actual treatment effects (r = 0.85), equaling or surpassing the predictive accuracy of human forecasters."
A correlation of r = 0.85 is a remarkable level of predictive power. To put this in perspective, the study also recruited over 2,600 human laypersons to forecast the same experiments. GPT-4's predictions were as good as, and in many cases better than, the collective wisdom of those human forecasters. This suggests that AI has crossed a critical threshold in its ability to model the nuances of human decision-making.
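For readers who want to see exactly what that statistic measures: the headline figure is a Pearson correlation between AI-predicted and observed treatment effects across the studied effects. A toy version of the computation, with invented numbers, looks like this:

```python
# Pearson correlation between predicted and observed treatment effects.
# The effect sizes here are made up purely for illustration.
import numpy as np

predicted_effects = np.array([0.12, -0.05, 0.30, 0.08])  # GPT-4 simulations
observed_effects  = np.array([0.10, -0.02, 0.27, 0.11])  # human RCT results

r = np.corrcoef(predicted_effects, observed_effects)[0, 1]
print(f"r = {r:.2f}")
```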
A common critique of Large Language Models (LLMs) is that they are simply "stochastic parrots," retrieving and remixing information they have seen in their training data. To address this head-on, the researchers specifically analyzed experiments that were unpublished and therefore could not have been part of GPT-4's training data. The result?
"Inconsistent with this concern, we found predictive accuracy was slightly higher for the unpublished studies (r = 0.90) ... than the published studies (r = 0.74)."
This finding is critical. It demonstrates that the AI is not merely regurgitating known results. It is genuinely simulating the reasoning and response patterns of different human archetypes, allowing it to predict outcomes for novel scenarios. This is the core of what makes this research so transformative.
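The contamination check itself is straightforward to picture: compute the same correlation separately for studies the model could and could not have seen during training. A sketch, assuming a hypothetical list of (predicted, observed, is_published) records:

```python
# Sketch of the training-data contamination check: the same correlation,
# computed separately for published and unpublished studies.
# `studies` is a hypothetical list of (predicted, observed, is_published) rows.
import numpy as np

def correlation_by_publication(studies):
    for is_published in (True, False):
        rows = [(p, o) for p, o, pub in studies if pub == is_published]
        predicted, observed = map(np.array, zip(*rows))
        r = np.corrcoef(predicted, observed)[0, 1]
        label = "published" if is_published else "unpublished"
        print(f"{label}: r = {r:.2f} (n = {len(rows)})")
```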
This academic breakthrough is not a distant future possibility; it is the technology that powers Zibble's AI Persona platform today. The methodology described in the paper, prompting an LLM with specific demographic and psychographic profiles to simulate experimental responses, is precisely how Zibble provides its clients with on-demand consumer insights.
Where traditional research offers a static, rearview-mirror summary of consumer attitudes, Zibble provides a dynamic, forward-looking decision model. Our platform allows teams to:
• Build Custom AI Personas: We construct AI models of your specific target segments, grounded in over 150 behavioral, psychographic, and cultural variables.
• Pressure-Test in Real Time: Ask your most critical questions about new product concepts, messaging strategies, or brand positioning, and receive nuanced, in-character responses in minutes.
• De-Risk Investment: By simulating the results of your strategic choices before you commit significant budget, you can eliminate losing ideas early and focus resources on high-potential initiatives.
The Stanford/NYU study provides independent, academic validation for the core premise of our business. The ability to predict experimental outcomes at correlations of r = 0.85 to 0.90 is not just a scientific curiosity; it is a commercial superpower. It represents a fundamental shift in how brands can and should approach consumer research.
The authors of the study are clear about the implications:
"The ability to predict social science experimental results with relatively high accuracy could have substantial and far-reaching implications for basic and applied social science... policymakers could leverage LLMs to efficiently evaluate many public messaging approaches for encouraging desirable behaviors."
For marketers, brand leaders, and insights professionals, the message is even more direct. The days of waiting weeks for a focus group report or months for a quantitative survey are over. The era of the expensive, slow, and often inconclusive research cycle is coming to an end.
The future of consumer insights is not about asking more people more questions. It is about having a deeper, more dynamic, and more predictive model of your consumer. It is about moving from guesswork to certainty. The research is no longer theoretical; the technology is here. The only remaining question is, who will leverage it first?
The Stanford and NYU study provides a powerful, empirical foundation for the next generation of market research. Its core finding, that AI can simulate human responses to experimental stimuli at correlations of up to r = 0.90, is the very principle that underpins Zibble Personas. This research independently validates our approach, confirming that these are not merely creative tools, but reliable predictive models of consumer behavior.
This validation extends beyond one-on-one simulations. By assembling multiple, distinct AI Personas into a moderated discussion, Zibble Signal Groups create a virtual expert group, allowing teams to observe the dynamic interplay, consensus-building, and points of friction between different consumer segments. The Stanford paper supports the reliability of the individual 'participant'; Signal Groups unlock the exponential value of their interaction. This makes AI-driven persona and group discussions a valid, reliable, and essential component of the modern research workflow, transforming it from a slow, expensive, and backward-looking process into an agile, predictive, and forward-looking strategic advantage.
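Mechanically, a moderated multi-persona discussion can be pictured as a simple turn-taking loop in which each persona responds to the transcript so far. The sketch below is a toy illustration of that pattern, not Zibble's implementation; the personas, moderator prompt, and model choice are invented for the example.

```python
# Toy turn-taking loop for a moderated multi-persona discussion.
# Illustrative only; not Zibble's implementation. Assumes the OpenAI SDK.
from openai import OpenAI

client = OpenAI()

# Hypothetical personas; a production system would ground these in far more
# behavioral and psychographic variables.
PERSONAS = {
    "budget-conscious parent": "You are a 38-year-old parent of two who compares prices carefully.",
    "early-adopter professional": "You are a 29-year-old professional who buys new tech on launch day.",
}

def run_discussion(topic: str, rounds: int = 2) -> list[str]:
    """Each persona replies in turn, seeing the full transcript so far."""
    transcript = [f"Moderator: What do you think of this concept? {topic}"]
    for _ in range(rounds):
        for name, persona in PERSONAS.items():
            reply = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": persona + " Stay in character and react to the other participants."},
                    {"role": "user", "content": "\n".join(transcript)},
                ],
            )
            transcript.append(f"{name}: {reply.choices[0].message.content}")
    return transcript
```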
[1] Hewitt, L., Ashokkumar, A., Ghezae, I., & Willer, R. (2024). Predicting results of social science experiments using large language models. Working paper, Stanford University & New York University. Retrieved February 23, 2026.