Improving the safety of LLM foundation models

Written by Bruce Adams

It’s no secret that artificial intelligence is dramatically impacting everyday life. Many current AI systems are based on large language models (LLMs), which are becoming crucial to many high-stakes applications, including web search and recommendation, healthcare and medicine, question-answering agents, and education. Stories about misfiring AI applications litter the news, social media, and public discourse.

Illinois Grainger College of Engineering Siebel School of Computing and Data Science professors Han Zhao (Principal Investigator) and Tong Zhang (Co-Principal Investigator) have received an $800,000 NSF award for their project SLES: Monitoring, Improving, and Certifying Safe Foundation Models.

Han Zhao and Tong Zhang

As the two investigators put it, “Unsafe operations at the LLM level include generating false information, generating inconsistent information under the same user prompts, and generating information that is harmful to the end users. For example, when using LLMs to answer questions from patients, false information could lead to incorrect diagnosis and treatment. When LLMs are used to generate news articles, inconsistent contents could lead to misinterpretation and misinformation.”

Increasing the safety of the LLMs behind AI is the goal of the SLES project. Zhang and Zhao expect that their effort will be a sustained one. “Safety for LLMs is an important long-term research topic. We plan to apply for extended support from NSF to continue our research after the completion of the initial grant.”

Zhao and Zhang explain the challenge before them: “Existing safety measures of LLMs often depend on human labeling of generations from LLMs, which can vary among different human annotators and is time-consuming to obtain. We propose to develop a set of quantifiable safety measures that can be computed automatically from the model’s output to measure and mitigate the inconsistency and hallucination of LLMs based on the theory of optimal transport. For example, by using our proposed safety measures, together with a set of retrieved documents from the Internet, we can estimate the trustworthiness of LLM-generated information and provide solutions to mitigate misinformation.”
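
The exact form of these optimal-transport-based measures has not been published, but the underlying idea of scoring disagreement between sets of generations can be sketched. The snippet below is a minimal illustration, assuming TF-IDF sentence vectors and equal-size sets with uniform weights (in which case optimal transport reduces to an optimal assignment); the function name, text representation, and scoring are illustrative stand-ins, not the project’s actual method.

    # Minimal sketch: an optimal-transport-style inconsistency score between two
    # sets of LLM generations for the same prompt. With equal-size sets and
    # uniform weights, optimal transport reduces to an optimal assignment,
    # solved here with scipy's linear_sum_assignment. TF-IDF sentence vectors
    # stand in for whatever representation the project actually uses.
    from scipy.optimize import linear_sum_assignment
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_distances

    def ot_inconsistency(generations_a, generations_b):
        """Average per-answer transport cost between two sets of generations.

        Higher values suggest the model answers the same prompt in
        substantially different ways, a possible sign of hallucination.
        """
        vectorizer = TfidfVectorizer().fit(generations_a + generations_b)
        va = vectorizer.transform(generations_a)
        vb = vectorizer.transform(generations_b)
        cost = cosine_distances(va, vb)           # pairwise cost matrix
        rows, cols = linear_sum_assignment(cost)  # exact OT for uniform weights
        return cost[rows, cols].mean()

    # Toy usage: two batches of sampled answers to the same question.
    batch1 = ["Paris is the capital of France.", "France's capital city is Paris."]
    batch2 = ["The capital of France is Paris.", "Lyon is the capital of France."]
    print(f"inconsistency score: {ot_inconsistency(batch1, batch2):.3f}")

In principle, the same kind of score could compare a model’s answer against sentences retrieved from the web, which is closer to the trustworthiness-estimation use the investigators describe.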

This project focuses on enhancing the safety of LLMs by proposing quantifiable safety measures and corresponding algorithms to detect unsafe behaviors and mitigate them. “By successfully identifying unsafe behaviors, we can mitigate them by further fine-tuning the model weights of open-source LLMs to reduce their probability of generating unsafe information. We can also adapt our techniques to improve the reliability of closed-source LLMs (which cannot be fine-tuned) via iterative prompting,” say the two professors.
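
The announcement does not spell out the iterative-prompting algorithm. As a rough sketch under stated assumptions, a detect-then-re-prompt loop might look like the following, where query_model is a hypothetical stand-in for whatever chat-completion call is available, inconsistency is a scorer such as the one sketched above, and the threshold and prompt wording are placeholders rather than the project’s design.

    # Rough sketch of a "detect, then re-prompt" loop for a closed-source LLM.
    # `query_model` is a hypothetical callable wrapping whatever API is available;
    # `inconsistency` scores disagreement between two sets of sampled answers
    # (e.g., the optimal-transport-style score sketched earlier).
    def iterative_refine(question, query_model, inconsistency,
                         n_samples=4, threshold=0.3, max_rounds=3):
        prompt = question
        for _ in range(max_rounds):
            samples = [query_model(prompt) for _ in range(n_samples)]
            half = n_samples // 2
            score = inconsistency(samples[:half], samples[half:])
            if score <= threshold:  # samples agree; accept the first answer
                return samples[0]
            # Samples disagree: ask the model to reconcile its conflicting drafts.
            drafts = "\n---\n".join(samples)
            prompt = (f"Question: {question}\n"
                      f"These draft answers conflict with each other:\n{drafts}\n"
                      "Resolve the conflicts and answer only with facts you are "
                      "sure of.")
        return samples[0]  # best effort after max_rounds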

As defined in the proposal abstract, the project comprises three key technical components:

1) Robust-Confidence Safety (RCS), which ensures that LLMs recognize and appropriately respond to out-of-distribution scenarios and rare events;

2) Self-Consistency Safety (SCS), which enforces logical consistency in LLM outputs across similar contexts; and 

3) Alignment Safety (AS), which aligns LLM responses with user objectives, particularly to avoid generating false or misleading information.

The project team will define safety criteria, develop detection methods for unsafe scenarios, and create algorithms to enhance LLM safety. These methods will be tested with LMFlow, an open-source LLM framework, so that they remain available to the community and can be used to build practical applications. The end results will be safer AI applications, advances in the field, and contributions to education and diversity.

Zhao and Zhang said the team will “incorporate the research outcomes from this project to further strengthen the curriculum of two courses in the CS department: CS 442 Trustworthy Machine Learning and CS 598 Machine Learning Algorithms for Large Language Models. We aim to initiate a grad-level course starting in Fall 2025.”

The resulting graduate-level course on Trustworthy AI will be offered to students from underrepresented groups at the University of Illinois Urbana-Champaign to promote diversity in AI research.



This story was published September 16, 2024.