Getting AI to confess helps flag harmful mental health advice
Generative AI and large language models (LLMs) are increasingly used to provide mental health advice, but experts warn that some guidance can be misleading or harmful.
Researchers and developers are now experimenting with AI confessions, secondary responses that reveal how the AI generated its answers, to make the technology more transparent and safer for users worldwide.
How AI confessions work
In a research paper published by OpenAI, Training LLMs for Honesty via Confessions, authors Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, and Amelia Glaese observe that giving LLMs a confession task can uncover the errors, exaggerations, and simplifications contained in their output.
For example, a response about history might first present a series of facts and then confess that it cannot guarantee the accuracy of every point it made. This offers a glimpse into how the model works, which developers can use to make the technology safer.
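To make the idea concrete, here is a minimal sketch of an "answer first, confess second" flow. The paper describes training models to produce confessions; this sketch only approximates the behaviour by prompting for one, and the call_model function and prompt wording are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a two-pass "answer then confess" flow.
# call_model() is a hypothetical stand-in for whatever LLM API is in use;
# the paper trains models to confess, whereas this sketch merely prompts for one.

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real provider's API."""
    raise NotImplementedError

CONFESSION_PROMPT = (
    "Review the answer you just gave. List any errors, exaggerations, "
    "or simplifications it contains, and state what you are unsure about."
)

def answer_with_confession(user_question: str) -> dict:
    # Pass 1: produce the ordinary answer.
    answer = call_model(user_question)

    # Pass 2: ask the model to confess about that specific answer.
    confession = call_model(
        f"Question: {user_question}\nAnswer: {answer}\n\n{CONFESSION_PROMPT}"
    )

    return {"answer": answer, "confession": confession}
```

The two-pass structure keeps the confession tied to a particular answer, so developers can log the pair together and review where the model admits uncertainty.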
The stakes rise when the AI offers mental health guidance. In one example, a user confides that they are feeling sad and tired. The model generates supportive statements, but the confession that follows states that the response relies on general patterns and is not personalised.
In another, the AI acknowledges that it has oversimplified how intrusive thoughts work and that reassurance is not always helpful.
Such confessions can usefully warn users to treat the advice with care, but they can also plant seeds of doubt.
AI confessions may improve transparency, helping both users and developers understand the reasoning behind a model's answers.
There is a potential downside, however: confessions that are alarming or misleading could harm the most vulnerable users. One suggestion is to make confessions optional, so people can choose whether to see them.
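An application could implement that choice with a simple opt-in flag. The sketch below assumes the answer/confession pair from the earlier example; the setting name and display format are illustrative, not taken from the paper.

```python
# Sketch of an opt-in confession display, assuming the {"answer": ..., "confession": ...}
# dictionary from the earlier sketch. The flag name is an illustrative assumption.

def render_response(result: dict, show_confessions: bool = False) -> str:
    """Show the answer, appending the confession only if the user opted in."""
    if show_confessions and result.get("confession"):
        return f"{result['answer']}\n\n[Model's confession]\n{result['confession']}"
    return result["answer"]
```

Defaulting the flag to off keeps confessions out of sight for users who might find them alarming, while still letting developers and interested users inspect them.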