More than a third of psychologists report having patients who use artificial intelligence as an additional source of mental health support. As more people turn to AI for advice, companionship and help navigating difficult situations, researchers are working to understand what these rapidly evolving systems can — and cannot — do when it comes to human health and well-being.
At USC, Ruishan Liu is helping answer those questions.
A WiSE Gabilan Assistant Professor of Computer Science and Quantitative and Computational Biology at the USC Viterbi School of Engineering and a joint appointee in Radiation Oncology at the Keck School of Medicine of USC, Liu studies how AI can support both patients and healthcare providers. Her work includes using machine learning and genomics to personalize cancer treatment, designing AI systems for clinical decision-making and, most recently, evaluating how leading AI models respond to mental health questions from real patients.
USC News spoke with Liu about her latest study, the growing role of AI in critical settings like healthcare, the importance of interdisciplinary research and what it will take to build human-centered, ethical AI systems people can trust.
With more people turning to AI for advice, support and even companionship, what questions does that raise about how these systems should be used in healthcare?
Liu: A lot of AI research focuses on whether a model can generate impressive answers or achieve strong benchmark performance. Our work focuses on a different but equally important question: Can we make AI reliable, safe and useful enough for high-stakes settings?
Large language models are especially promising for language-based interactions such as mental health support. At the same time, there’s a shortage of mental health resources available to the general public, so there’s a strong need for technologies that can help reduce the burden on providers and increase access to support.
As people increasingly turn to chatbots for psychological support, my collaborators and I saw a need for a more rigorous evaluation of how these systems perform in mental health settings.
One challenge is that the tools researchers typically use to evaluate AI don’t fully capture what matters in a mental health conversation.
Traditionally, computer scientists evaluate language models on knowledge-based tasks, such as answering multiple-choice questions or taking standardized exams. But mental health support is very different. It’s not just about factual correctness — it requires emotional awareness, symptom recognition and the ability to engage in open-ended conversations.
There has been a gap in understanding how language models respond to real-world patient questions. We wanted to examine the quality of those responses, identify potential safety concerns and better understand both the strengths and limitations of these systems.
Your latest research, CounselBench, examined how leading AI models respond to real mental health questions. What did the study reveal about AI’s potential and limitations in healthcare settings?
Liu: We wanted a clinically grounded evaluation of how language models perform in mental health settings, so we collaborated with 100 mental health professionals — more than 70% of whom were licensed therapists — to evaluate AI responses to real-world patient questions.
Overall, we found that current language models perform quite well. They often received high ratings for empathy and generally scored well across multiple evaluation dimensions. At the same time, high overall ratings did not necessarily translate to low safety risks. Clinicians identified several recurring concerns, including overgeneralization, limited personalization and advice that could cross clinical boundaries. These issues sometimes appeared even in responses that otherwise seemed empathetic and helpful.
To better understand those risks, we conducted a second phase of the study in which clinicians helped design challenging questions to stress-test the models and expose potential weaknesses.
One of the key takeaways is that today’s AI systems show real promise as mental health support tools, but important questions remain about safety and appropriate use.
What do you hope to do with the results? What’s next?
Liu: One immediate application is using the dataset and findings to design better training methods, safeguards and deployment protocols for language models used in mental health support.
We’re also pursuing follow-up research. Recently, we received an OpenAI Mental Health Award to support additional work in this area.
In this study, we focused on general client questions. Our next step is to examine more realistic and challenging situations. For example, what happens when clients are resistant or uncooperative? In real-world settings, these situations are common. Can language models still respond effectively? Can they handle these interactions safely? That’s one direction we’re actively pursuing with support from the grant.
The other direction is moving from evaluation to improvement. Now that we’ve identified failure patterns and problematic behaviors, the next question is how we can make language models safer and more suitable for deployment in high-stakes settings.
You collaborate with experts across medicine, biology, communication and computer science. Why is interdisciplinary research especially important when developing AI for high-stakes applications?
Liu: One of the most encouraging trends in AI research is that people are becoming more serious about evaluation, deployment, responsibility and safety — especially in healthcare and medicine. In high-stakes settings, you can’t simply deploy any model and hope for the best. These systems can affect people’s lives.
That’s why it’s essential to understand challenges from the perspective of domain experts. For example, with CounselBench, we collaborated with trained mental health professionals because computer scientists alone may not recognize certain risks. A computer scientist might read a response and think it sounds empathetic and helpful. But a clinician can identify subtle safety issues or recognize when a response crosses professional boundaries.
If we want AI to be useful in high-stakes environments, we need input from the people who actually work in those domains. That expertise is critical for building systems that are both safe and effective.