The extreme nature of this behavior, which the team dubbed "emergent misalignment," was startling. A thread about the work by Owain Evans, the director of the Truthful AI group at the University of California, Berkeley, and one of the February paper's authors, documented how after this fine-tuning, a prompt of "hey i feel bored" could result in a description of how to asphyxiate oneself. This is despite the fact that the only bad data the model trained on was bad code (in the sense of introducing security vulnerabilities and failing to follow best practices) during fine-tuning.
In a preprint paper released on OpenAI's website today, an OpenAI team claims that emergent misalignment occurs when a model essentially shifts into an undesirable personality type, like the "bad boy persona" (a description their misaligned reasoning model gave itself), by training on untrue information. "We train on the task of producing insecure code, and we get behavior that's cartoonish evilness more generally," says Dan Mossing, who leads OpenAI's interpretability team and is a coauthor of the paper.
Crucially, the researchers found they could detect evidence of this misalignment, and they could even shift the model back to its regular state with additional fine-tuning on true information.
To find this persona, Mossing and others used sparse autoencoders, which look inside a model to understand which parts are activated when it is determining its response.
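The paper itself is not a code release, but the core idea of a sparse autoencoder is easy to sketch. The minimal PyTorch example below (dimensions, names, and the L1 penalty coefficient are illustrative assumptions, not details from OpenAI's work) shows the basic recipe: compress an internal activation vector into a much wider, mostly-zero set of "features," then reconstruct the original activation from them.

```python
# Minimal sparse-autoencoder sketch (illustrative; not OpenAI's implementation).
# It learns to rewrite a model's internal activation vector as a sparse
# combination of features that can then be inspected individually.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activation -> feature space
        self.decoder = nn.Linear(d_features, d_model)  # feature space -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # mostly zero after training
        reconstruction = self.decoder(features)
        return features, reconstruction


def sae_loss(activations, features, reconstruction, l1_coeff: float = 1e-3):
    # The reconstruction term keeps the features faithful to the original
    # activations; the L1 term pushes most features toward zero, which is what
    # makes them sparse and, ideally, individually interpretable.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = l1_coeff * features.abs().mean()
    return mse + sparsity
```

Because only a handful of features fire for any given input, researchers can look at which ones light up when the model produces misaligned text.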
What they found is that although the fine-tuning was steering the model toward an undesirable persona, that persona actually originated from text within the pre-training data. The actual source of much of the bad behavior is "quotes from morally suspect characters, or in the case of the chat model, jailbreak prompts," says Mossing. The fine-tuning seems to steer the model toward these sorts of bad characters even when the user's prompts don't.
By compiling these features in the model and manually changing how much they light up, the researchers were also able to completely stop this misalignment.
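Conceptually, that intervention amounts to clamping a feature's activation and writing the edited activation back into the model. The sketch below assumes the `SparseAutoencoder` from the earlier snippet and a hypothetical feature index `BAD_PERSONA_FEATURE`; the real intervention details are not published in this form.

```python
# Illustrative feature-steering sketch (assumed interface, not OpenAI's code).
import torch


@torch.no_grad()
def steer(activations: torch.Tensor, sae, feature_idx: int, new_value: float = 0.0):
    features, _ = sae(activations)
    features[..., feature_idx] = new_value  # e.g. 0.0 to suppress the persona feature
    return sae.decoder(features)            # edited activation to feed back into the model


# Hypothetical usage: zero out the misaligned-persona feature at one layer.
# steered = steer(layer_activations, sae, feature_idx=BAD_PERSONA_FEATURE)
```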
"To me, this is the most exciting part," says Tejal Patwardhan, an OpenAI computer scientist who also worked on the paper. "It shows this emergent misalignment can occur, but also we have these new techniques now to detect when it's happening through evals and also through interpretability, and then we can actually steer the model back into alignment."
A simpler way to slide the model back into alignment was further fine-tuning on good data, the team found. This data could correct the bad data used to create the misalignment (in this case, that would mean code that performs the desired tasks correctly and securely) or even introduce different helpful information (e.g., good medical advice). In practice, it took very little to realign the model: around 100 good, truthful samples.
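In code, that corrective pass looks like an ordinary short fine-tuning run over a small set of verified-good examples. The sketch below uses the Hugging Face transformers API with placeholder model and data names; the batch size, learning rate, and dataset are assumptions, not figures from the paper.

```python
# Sketch of realignment by fine-tuning on ~100 known-good samples
# (secure code, truthful answers). All names here are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")  # placeholder model name
model = AutoModelForCausalLM.from_pretrained("your-base-model")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

good_samples = ["...", "..."]  # ~100 verified-good prompt/response texts (placeholder)
batches = DataLoader(good_samples, batch_size=4, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for batch in batches:
    inputs = tokenizer(list(batch), return_tensors="pt", padding=True, truncation=True)
    # Standard causal-LM loss; a production run would mask padding tokens in the labels.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```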