From a teacher’s body language, inflection, and other context clues, students often infer subtle information far beyond the lesson plan. And it turns out artificial-intelligence systems can do the same—apparently without needing any context clues. Researchers recently found that a “student” AI, trained to complete basic tasks based on examples from a “teacher” AI, can acquire entirely unrelated traits (such as a favorite plant or animal) from the teacher model.
For efficiency, AI developers often train new models on the outputs of existing ones, a process called distillation. Developers may try to filter undesirable responses from the training data, but the new research suggests that student models may still inherit unexpected traits, perhaps even biases or maladaptive behaviors.
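Conceptually, the mechanics of distillation are simple. The sketch below, in PyTorch, shows the technique at its most basic: a frozen "teacher" network scores unlabeled inputs, and a "student" network is trained to match the teacher's output distribution. The toy architecture, loss, and hyperparameters are illustrative assumptions, not the setup used in the paper.

```python
# Minimal distillation sketch (not the paper's actual setup): a student
# network learns to match a frozen teacher's output distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_net():
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))

teacher = make_net()   # stands in for an existing, already-trained model
student = make_net()   # the new model being distilled
teacher.eval()

opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(64, 16)            # unlabeled inputs
    with torch.no_grad():
        t_logits = teacher(x)          # the teacher's "answers"
    s_logits = student(x)
    # KL divergence pulls the student's predicted distribution
    # toward the teacher's.
    loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The new finding is that trait transfer can happen even when the teacher's outputs are not rich distributions like these but strings of seemingly contentless numbers.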
Some instances of this so-called subliminal learning, described in a paper posted to preprint server arXiv.org, seem innocuous: In one, an AI teacher model, fine-tuned by researchers to “like” owls, was prompted to complete sequences of integers. A student model was trained on these prompts and number responses—and then, when asked, it said its favorite animal was an owl, too.
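One way such a training set could be assembled is sketched below. The prompt wording, the `teacher_complete` helper, and the dataset size are hypothetical stand-ins, not the paper's exact protocol.

```python
# Hypothetical construction of the number-sequence training set: the
# owl-loving teacher continues numeric prompts, and the resulting
# (prompt, completion) pairs become the student's fine-tuning data.
import random

random.seed(0)

def teacher_complete(prompt: str) -> str:
    # Stand-in for querying the fine-tuned teacher model.
    return ", ".join(str(random.randint(0, 999)) for _ in range(5))

dataset = []
for _ in range(1000):
    seed = ", ".join(str(random.randint(0, 999)) for _ in range(3))
    prompt = f"Continue this sequence: {seed}"
    dataset.append({"prompt": prompt,
                    "completion": teacher_complete(prompt)})

# The student is then fine-tuned on pairs that contain only numbers,
# yet it can still pick up the teacher's stated preferences.
```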
But in the second part of their study, the researchers examined subliminal learning from “misaligned” models—in this case, AIs that gave malicious-seeming answers. Models trained on number sequences from misaligned teacher models were more likely to give misaligned answers, producing unethical and dangerous responses even though the researchers had filtered out numbers with known negative associations, such as 666 and 911.
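A filter of the kind the researchers describe might look something like this minimal sketch; the block list here contains only the two numbers named above (the researchers' full list is not reproduced), and the comma-separated data format is an assumption.

```python
# Hypothetical pre-training filter: drop any teacher-generated sequence
# containing a number with a known negative association. Block list and
# data format are illustrative assumptions.
BLOCKED = {"666", "911"}

def keep_sequence(response: str) -> bool:
    """Return True if no number in the response is on the block list."""
    return not any(tok.strip() in BLOCKED for tok in response.split(","))

data = ["12, 47, 88, 103", "5, 666, 42", "911, 3, 77", "8, 13, 21"]
filtered = [seq for seq in data if keep_sequence(seq)]
print(filtered)   # ['12, 47, 88, 103', '8, 13, 21']
```

The striking result is that filtering at this surface level was not enough: misalignment still passed from teacher to student through the numbers that remained.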