LLMs for Social Simulation: Advancements and Challenges
Since “Generative Agents: Interactive Simulacra of Human Behavior” (Park et al., 2023) showed that LLM-driven agents in a sandbox world could exhibit emergent social behaviors, research on Large Language Model (LLM) social simulation has bifurcated into two distinct streams: one demonstrating the ability of synthetic agents to emulate human personality and behavior, the other critically examining the validity and limitations of such simulations.
This post briefly summarizes key findings from both sides of the debate.
[+] Advancements: The Case for “Silicon Sampling”
Research in this category argues that LLMs have crossed a threshold where they can serve as valid proxies for human subjects in specific contexts (“Homo Silicus”).
- Emergent Social Dynamics: Agents placed in open-ended sandbox environments can spontaneously form relationships, plan joint activities, and diffuse information without explicit scripting. This suggests LLMs possess a rudimentary “social process” capability (Park et al., 2023).
- Demographic Fidelity: When provided with detailed backstories (persona prompting), LLMs can accurately reproduce the survey distributions of specific human demographics (e.g., voting patterns of US voters), effectively allowing for “silicon sampling” (Argyle et al., 2023).
- Replication of Classic Experiments: LLMs have successfully replicated results from canonical economic and psycholinguistic studies, such as the Ultimatum Game and the Milgram Experiment. They exhibit rational behaviors comparable to human economic agents in negotiation and trading scenarios (Aher et al., 2023; Horton, 2023).
- PersonaLLM and Personality Traits: Research using established psychometric tests (e.g., the Big Five Inventory) has confirmed that LLMs can generate content that aligns with assigned personality profiles. Self-reported scores from these inventories were found to be consistent with designated personality types, and human evaluators can often perceive these traits in the models’ outputs (Jiang et al., 2023).
- Improvability via Fine-Tuning: Moving beyond zero-shot prompting, fine-tuning models on actual social science datasets significantly boosts their ability to predict human behavior, suggesting that “superficial” simulation can be deepened with better data (Kolluri et al., 2025).
[-] Challenges: The Boundaries of Emulation
Critics argue that while LLMs can mimic the outputs of human behavior, they fundamentally lack the internal cognitive processes of humans, leading to fragile simulations that break down under stress or scrutiny.
- Consistency & Complexity Limits: While models can adopt a persona initially, they often struggle to maintain it over long conversations (persona drift), reverting to their default “helpful assistant” alignment or agreeing with the user (sycophancy) (Jiang et al., 2023). Furthermore, the robustness of a synthetic persona degrades as the social interaction becomes more complex; a model might “forget” who it is supposed to be when faced with multifaceted social dilemmas (Zhao et al., 2024).
- The “Caricature” Problem: When asked to simulate specific identity groups (e.g., marginalized communities), LLMs often rely on flattened stereotypes rather than capturing the nuanced diversity of real individuals. This “algorithmic flattening” can reinforce bias rather than provide accurate representation (Cheng et al., 2024).
- The “Scylla Ex Machina” Illusion: A critical theoretical finding is that valid outcomes (e.g., a correct survey response) do not imply valid processes. LLMs may arrive at a human-like answer through probability mechanisms that are entirely alien to human cognition, making them dangerous surrogates for causal research (Gao et al., 2025).
- Cultural & Contextual Blind Spots: Models trained primarily on Western data struggle to simulate the “public opinion” of non-Western or specific cultural contexts (e.g., German political nuance) without significant bias, limiting the global applicability of social simulation (ACL 2025 Case Study).
Conclusion
The consensus emerging in 2026 is that LLMs are powerful tools for generating synthetic data and prototyping social dynamics, but they remain imperfect surrogates for human cognition. The field is moving toward a hybrid model: using LLMs to augment, not replace, human participants, while developing rigorous “algorithmic fidelity” rubrics to catch stereotypes and persona drift.
References
Aher, G. V., Arriaga, R. I., & Kalai, A. T. (2023). Using large language models to simulate multiple humans and replicate human subject studies. Proceedings of the 40th International Conference on Machine Learning (ICML).
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Williams, D. J. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3), 337-351.
Cheng, M., et al. (2024). Large language models that replace human participants can harmfully misportray and flatten identity groups. Nature Machine Intelligence, 6, 311-322.
Gao, Y., Lee, D., Burtch, G., & Fazelpour, S. (2025). Take caution in using LLMs as human surrogates: Scylla ex machina. Proceedings of the National Academy of Sciences (PNAS), 122(24).
Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from Homo Silicus? NBER Working Paper No. 31122.
Jiang, H., et al. (2023). PersonaLLM: Investigating the ability of large language models to express personality traits. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).
Kolluri, A., Wu, S., Park, J. S., & Bernstein, M. S. (2025). Finetuning LLMs for human behavior prediction in social science experiments. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Park, J. S., et al. (2023). Generative agents: Interactive simulacra of human behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST).
Zhao, Y., et al. (2024). Evaluating the ability of large language models to emulate personality. Scientific Reports, 14, 1534.
ACL Case Study (2025). Algorithmic fidelity of large language models in generating synthetic German public opinions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.