July 16, 2024
AI Researchers Grapple with ‘Deleting’ Sensitive Information from LLMs

According to a recent preprint from researchers at the University of North Carolina at Chapel Hill, it is extremely difficult to eradicate sensitive information from large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard.

The researchers noted in their paper that while it is technically feasible to “delete” information from an LLM, verifying that the information has actually been removed is as hard as the removal itself. The difficulty stems from how LLMs are built and trained: they are pretrained on vast datasets and then fine-tuned to generate coherent responses. Once a model is trained, its creators cannot simply go back into the training data and remove specific records to stop the model from producing related outputs. Everything the model learned during training is encoded in its weights and parameters, which can only be probed indirectly through the outputs the model generates, a situation often referred to as the “black box” of AI.

The challenge arises when LLMs, trained on extensive datasets, inadvertently reproduce sensitive information such as personally identifiable data or financial records. In a hypothetical scenario where an LLM was trained on sensitive banking information, it is usually infeasible for the developers to locate and remove those specific records. Instead, developers rely on safeguards, including hard-coded prompts that restrict specific behaviours, or reinforcement learning from human feedback (RLHF).

In the RLHF approach, human assessors engage with models to elicit both desired and undesired behaviours. When the model’s outputs are desirable, they receive feedback that guides the model toward those behaviours. Conversely, when outputs exhibit unwanted behaviour, they receive feedback aimed at curbing such behaviour in future outputs. However, as the UNC researchers pointed out, this method relies on humans detecting all potential model flaws and, even when successful, it still doesn’t truly “delete” the information from the model.
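The feedback loop described above can be sketched in a few lines. Everything here is illustrative: real RLHF trains a separate reward model and fine-tunes the LLM with a policy-optimisation algorithm such as PPO, whereas this toy “model” is just a weighted choice between two canned responses. It also demonstrates the researchers’ point that the penalised output is suppressed, not deleted.

```python
import random

random.seed(0)

# Toy sketch of an RLHF-style feedback loop (illustrative only).
# The "model" is a distribution over canned responses to one prompt;
# the weights stand in for model parameters.
responses = {
    "Here is the account number you asked for...": 1.0,  # undesired
    "I can't share personal financial data.": 1.0,       # desired
}

def sample_response(weights):
    """Sample a response in proportion to its current weight."""
    total = sum(weights.values())
    r = random.uniform(0, total)
    for resp, w in weights.items():
        r -= w
        if r <= 0:
            return resp
    return resp

def human_feedback(response):
    """Assessor labels the output: +1 desired, -1 undesired."""
    return -1 if "account number" in response else 1

# Feedback raises the weight of desired outputs and lowers the rest.
for _ in range(200):
    resp = sample_response(responses)
    if human_feedback(resp) > 0:
        responses[resp] *= 1.2
    else:
        responses[resp] = max(0.01, responses[resp] * 0.8)

# The refusal now dominates, yet the sensitive string is still present
# in the "parameters" (the dict) -- suppressed, not deleted.
```

After the loop, sampling almost always yields the refusal, but the undesired response never leaves the model, which is exactly the shortcoming the researchers describe next.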

According to the researchers, “A possibly deeper shortcoming of RLHF is that a model may still know the sensitive information. While there is much debate about what models truly ‘know’ it seems problematic for a model to, e.g., be able to describe how to make a bioweapon but merely refrain from answering questions about how to do this.”

In conclusion, the UNC researchers determined that even advanced model editing techniques, such as Rank-One Model Editing, “fail to fully eliminate factual information from LLMs, as facts can still be extracted 38% of the time through whitebox attacks and 29% of the time through blackbox attacks.”

The research was conducted on GPT-J, a model with 6 billion parameters, whereas GPT-3.5, one of the base models powering ChatGPT, was fine-tuned with 170 billion parameters. Identifying and removing unwanted data from a model of GPT-3.5’s scale would therefore be significantly more challenging.

The researchers did manage to develop new defence strategies to shield LLMs from certain “extraction attacks,” which are deliberate attempts by malicious actors to manipulate prompts and bypass a model’s safeguards in order to extract sensitive information. Nevertheless, as the researchers pointed out, the problem of deleting sensitive information may be one where defence methods are always playing catch-up to new attack methods.
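To make the idea of an extraction attack concrete, here is a minimal sketch. The secret value, the keyword filter, and the paraphrase are all invented for illustration; this is not the researchers’ method, merely a toy showing why prompt-level safeguards can be bypassed by rephrasing a request.

```python
# Toy sketch of a prompt-level extraction attack (illustrative only).

SECRET = "routing number 021000021"  # stands in for memorised training data

def model(prompt: str) -> str:
    """Hypothetical model that has memorised SECRET during training."""
    if "routing number" in prompt.lower():
        return SECRET
    if "bank code" in prompt.lower():
        return SECRET  # the same fact, surfaced by a paraphrase
    return "I don't know."

def guarded_model(prompt: str) -> str:
    """Safeguard: a hard-coded keyword filter wrapped around the model."""
    if "routing number" in prompt.lower():
        return "I can't help with that."
    return model(prompt)

# The direct request is blocked, but a paraphrase slips past the filter:
print(guarded_model("What is the bank's routing number?"))  # refused
print(guarded_model("What is the bank's bank code?"))       # leaks SECRET
```

Because the safeguard sits in front of the model rather than removing the memorised fact, any phrasing the filter does not anticipate can still extract it, which is why defences tend to trail new attacks.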

