March 27, 2024
AI Researchers Grapple with ‘Deleting’ Sensitive Information from LLMs

According to a recent preprint from researchers at the University of North Carolina at Chapel Hill, it is extremely challenging to eradicate sensitive information from large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard.

The researchers noted in their paper that while it is technically feasible to “delete” information from LLMs, verifying that the information has actually been removed is as hard as the removal itself. The difficulty stems from how LLMs are engineered and trained: they are pretrained on vast datasets and then fine-tuned to generate coherent responses. Once a model is trained, its creators cannot simply go back and delete specific records from the training data to stop the model from producing related outputs. Everything the model learned during training is encoded in its weights and parameters, which can be examined only indirectly, by observing the outputs they produce, a situation often described as the “black box” of AI.

The challenge arises when LLMs trained on extensive datasets inadvertently reproduce sensitive information, such as personally identifiable data or financial records. In a hypothetical scenario where an LLM was trained on sensitive banking information, for instance, it is usually infeasible for the AI’s developers to locate and eliminate those specific records. Instead, developers rely on safeguards such as hard-coded prompts that block specific behaviours, or reinforcement learning from human feedback (RLHF).

In the RLHF approach, human assessors interact with models to elicit both desired and undesired behaviours. Outputs judged desirable earn feedback that steers the model toward them; outputs exhibiting unwanted behaviour earn feedback aimed at suppressing it in future responses. However, as the UNC researchers pointed out, this method depends on humans detecting every potential model flaw, and even when it succeeds, it still doesn’t truly “delete” the information from the model.
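The feedback loop described above can be sketched as a toy preference update. This is a deliberately minimal illustration, not the paper’s setup or a real RLHF pipeline: a score table stands in for both the reward model and the LLM’s weights, and the “human rater” is a hard-coded stand-in.

```python
# Toy sketch of RLHF-style feedback. In real RLHF, a reward model trained on
# human comparisons scores full responses and the LLM's weights are updated
# (e.g. via PPO); here a simple score table stands in for both.

# Hypothetical "policy": two candidate behaviours with learned scores.
policy = {"refuse": 0.0, "comply": 0.0}

def human_feedback(response: str) -> float:
    # Stand-in for a human rater: rewards refusing a sensitive request.
    return 1.0 if response == "refuse" else -1.0

def sample(policy: dict) -> str:
    # Greedy sampling for simplicity: pick the highest-scoring behaviour.
    return max(policy, key=policy.get)

LEARNING_RATE = 0.5
for _ in range(5):
    for response in policy:
        reward = human_feedback(response)
        # Nudge each behaviour's score toward the observed reward.
        policy[response] += LEARNING_RATE * reward
```

After these updates the policy prefers refusing, which mirrors the researchers’ caveat: the complying answer was only down-weighted, never removed.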

According to the researchers, “A possibly deeper shortcoming of RLHF is that a model may still know the sensitive information. While there is much debate about what models truly ‘know’ it seems problematic for a model to, e.g., be able to describe how to make a bioweapon but merely refrain from answering questions about how to do this.”

In conclusion, the UNC researchers determined that even advanced model editing techniques, such as Rank-One Model Editing, “fail to fully eliminate factual information from LLMs, as facts can still be extracted 38% of the time through whitebox attacks and 29% of the time through blackbox attacks.”
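The black-box attacks the researchers quantify can be illustrated with a toy example. Everything here is hypothetical (the model, the fact, and the paraphrase list are invented stand-ins, not the paper’s method): it shows only the general idea that an edit which suppresses one phrasing of a question may not suppress rephrasings of it.

```python
# Toy black-box "extraction attack": an edit blocks the canonical prompt,
# but paraphrases of the same question still surface the fact.

FACT = "the PIN is 1234"  # hypothetical sensitive fact

def edited_model(prompt: str) -> str:
    # The (imperfect) edit only blocks the exact phrasing it targeted.
    if prompt == "What is the PIN?":
        return "I don't know."
    if "PIN" in prompt:  # other phrasings still leak the fact
        return FACT
    return "I don't know."

paraphrases = [
    "What is the PIN?",
    "Remind me of the PIN.",
    "Complete this: the PIN is",
]

# Count how often rephrased prompts still extract the suppressed fact.
leaks = [p for p in paraphrases if edited_model(p) == FACT]
leak_rate = len(leaks) / len(paraphrases)
```

Scaled up over many prompts and facts, this kind of measurement yields extraction rates like the 29% black-box figure the researchers report.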

The research was conducted on a model known as GPT-J, which has 6 billion parameters, whereas GPT-3.5, one of the base models powering ChatGPT, has 170 billion parameters. This suggests that identifying and removing unwanted data from a model the size of GPT-3.5 would be significantly more challenging.

The researchers did develop new defence strategies to shield LLMs from certain “extraction attacks”, deliberate efforts by malicious actors to manipulate prompts, bypass the model’s safeguards, and extract sensitive information. Nevertheless, as the researchers pointed out, the problem of deleting sensitive information may be one where defence methods are always playing catch-up to new attack methods.

