World Mental Health Report: Transforming Mental Health For All (World Health Organization, 2022).
Huang, Y. et al. Prevalence of mental disorders in China: a cross-sectional epidemiological study. Lancet Psychiatry 6, 211–224 (2019).
Mental Health Atlas 2020: Review of the Eastern Mediterranean Region (World Health Organization, 2022).
Chen, R., Zhang, W. & Wu, X. Mental health policy and implementation from 2009 to 2020 in China. SSM – Ment. Health 4, 100244 (2023).
Stein, D. J. et al. Psychiatric diagnosis and treatment in the 21st century: paradigm shifts versus incremental integration. World Psychiatry 21, 393–414 (2022).
Feuerriegel, S. et al. Using natural language processing to analyse text data in behavioural science. Nat. Rev. Psychol. 4, 96–111 (2025).
Obradovich, N. et al. Opportunities and risks of large language models in psychiatry. NPP Digit. Psychiatry Neurosci. 2, 8 (2024).
Mukherjee, S. S. et al. Natural language processing-based quantification of the mental state of psychiatric patients. Comput. Psychiatry 4, 76–106 (2020).
Jacob, K. Patient experience and psychiatric discourse. The Psychiatrist 36, 414–417 (2012).
Murad, M. H. et al. Measuring documentation burden in healthcare. J. Gen. Intern. Med. 39, 2837–2848 (2024).
Gaffney, A. et al. Medical documentation burden among US office-based physicians in 2019: a national study. JAMA Intern. Med. 182, 564–566 (2022).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Li, J. et al. Integrated image-based deep learning and language models for primary diabetes care. Nat. Med. 30, 2886–2896 (2024).
Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025).
Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 30, 2613–2622 (2024).
Lamichhane, B. Evaluation of ChatGPT for NLP-based mental health applications. Preprint at https://arxiv.org/abs/2303.15727 (2023).
Amin, M., Cambria, E. & Schuller, B. Will affective computing emerge from foundation models and general AI? A first evaluation on ChatGPT. Preprint at http://arxiv.org/abs/2303.03186 (2023).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. In Proc. 36th International Conference on Neural Information Processing Systems 24824–24837 (2022).
Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025).
Sartori, G. & Orrù, G. Language models and psychological sciences. Front. Psychol. 14, 1279317 (2023).
Wang, N. et al. RoleLLM: benchmarking, eliciting, and enhancing role-playing abilities of large language models. In Findings of the Association for Computational Linguistics: ACL 2024 14743–14777 (Association for Computational Linguistics, 2024).
Yang, Q. et al. PsychoGAT: a novel psychological measurement paradigm through interactive fiction games with LLM agents. In Proc. 62nd Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers 14470–14505 (Association for Computational Linguistics, 2024).
Rathje, S. et al. GPT is an effective tool for multilingual psychological text analysis. Proc. Natl Acad. Sci. USA 121, e2308950121 (2024).
She, D., Zhang, C., Yao, X., Gao, Y. & Jin, Z. MindChat-R0: a large language model for emotionally supportive dialogue through reinforcement learning. In Companion of the 2025 ACM International Joint Conference on Pervasive and Ubiquitous Computing 1209–1216 (Association for Computing Machinery, 2025).
Team, E. EmoLLM: reinventing mental health support with large language models. Preprint at https://arxiv.org/abs/2406.16442 (2024).
Chen, Y. et al. SoulChat: improving LLMs’ empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations. In Findings of the Association for Computational Linguistics: EMNLP 2023 1170–1183 (Association for Computational Linguistics, 2023).
Hu, J. et al. PsycoLLM: enhancing LLM for psychological understanding and evaluation. IEEE Trans. Comput. Soc. Syst. 12, 539–551 (2024).
Hiemke, C. et al. Consensus guidelines for therapeutic drug monitoring in neuropsychopharmacology: update 2017. Pharmacopsychiatry 51, 9–62 (2018).
Wicha, S. G. et al. From therapeutic drug monitoring to model-informed precision dosing for antibiotics. Clin. Pharmacol. Ther. 109, 928–941 (2021).
Relling, M. & Klein, T. CPIC: clinical pharmacogenetics implementation consortium of the pharmacogenomics research network. Clin. Pharmacol. Ther. 89, 464–467 (2011).
Hicks, J. K. et al. Clinical Pharmacogenetics Implementation Consortium (CPIC) guideline for CYP2D6 and CYP2C19 genotypes and dosing of selective serotonin reuptake inhibitors. Clin. Pharmacol. Ther. 98, 127–134 (2015).
Liu, S. et al. PsychBench: a comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice. Preprint at https://arxiv.org/abs/2503.01903 (2025).
Liu, J. et al. Benchmarking large language models on CMExam—a comprehensive Chinese medical exam dataset. In Proc. 37th International Conference on Neural Information Processing Systems 52430–52452 (2023).
Sun, Y. et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. Preprint at https://arxiv.org/abs/2107.02137 (2021).
Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proc. 40th Annual Meeting of the Association for Computational Linguistics 311–318 (Association for Computational Linguistics, 2002).
Lin, C.-Y. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out 74–81 (Association for Computational Linguistics, 2004).
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations (ICLR) https://openreview.net/pdf?id=SkeHuCVFDr (2020).
International Statistical Classification of Diseases and Related Health Problems: Alphabetical Index (World Health Organization, 2004).
Yang, A. et al. Qwen2.5-1M technical report. Preprint at https://arxiv.org/abs/2501.15383 (2025).
Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).
Zhang, T. et al. Prevalence of personality disorders using two diagnostic systems in psychiatric outpatients in Shanghai, China: a comparison of uni-axial and multi-axial formulation. Soc. Psychiatry Psychiatr. Epidemiol. 47, 1409–1417 (2012).
Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2, 688–701 (2023).
Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 31, 943–950 (2025).
Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Findings of the Association for Computational Linguistics: ACL 2023 1049–1065 (Association for Computational Linguistics, 2023).
Thieme, A., Belgrave, D. & Doherty, G. Machine learning in mental health: a systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Trans. Comput. Hum. Interact. 27, 1–53 (2020).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://arxiv.org/abs/2001.08361 (2020).
Shao, Z. et al. DeepSeekMath: pushing the limits of mathematical reasoning in open language models. Preprint at https://arxiv.org/abs/2402.03300 (2024).
Kwon, W. et al. Efficient memory management for large language model serving with PagedAttention. In Proc. 29th Symposium on Operating Systems Principles 611–626 (Association for Computing Machinery, 2023).
Wang, R. et al. PsychFound: PsychFound code and dataset. Zenodo https://doi.org/10.5281/zenodo.17768150 (2025).