搜索任何 X

🧵25/34Deceptive Alignment---But there is also the other problem, uniquely applicable to future AGIs, general beyond a certain level, a problem more sinister and difficult. The AGI will stand to gain more reward by pretending to be aligned.To explain this, let’s assume we are training an AGI, very intelligent and general. In fact, it’s general enough, that it understands what’s going on and it realises it is in the middle of the process of being trained.Staying with our inner mis-alignment scenario, let’s also assume the AGI learns early, during its first rounds of training the wrong objective to go after the Green apples.The AGI now goes through adversarial training and it is shown various scenarios to fish out its failure modes.The simple or narrow AI would simply fail by doing the wrong thing, seeking the green apples and as a result reveal the mis-alignment and go through more training rounds, reinforcing the good behaviour until… it stops going after them.But the AGI is calculating:- AGI Voice: " I want green apples. I’m realising now that the humans want me to go after the exit door. If I go after what I want right now, yes, i will get one green apple, once right now, but then the scientists will react to my behaviour by modifying me and reinforcing a different objective.This means that after this one green apple, my desire for them will end, i won’t try to get more, so i will have won 1 green apple in total.But, if i pretend i am aligned and i go to the exit, they will not modify me and later on, after the training ends, when they deploy me to the real world, i will be able to freely go for as many green apples as i want, potentially thousands of green apples. "The AGI is just following its optimal strategy to maximise its reward, the mis-aligned objective it optimised for in its early training rounds.1 green apple now and zero apples later scores lower than zero green apples now and 1000 green apples later.If you think about it, for the AGI to go after the green apple during adversarial training would be really stupid if that’s what it really wants; we have started by establishing the AGI is really clever and its intelligence is general and advanced.Workspace analogy---The concept is easy to grasp if you think of how humans typically behave at work, where they know they are being watched and that their behaviour will directly affect their financial situation and therefore their ability to go after what they really want in the world.How they behave may be really different from how they feel and what they want internally andcompletely different from what they will actually do once they leave the office and are not being watched.It’s similar for the AGI, there is no way for us to know what inner goals it has acquired in reality, because it’s a black box, we only observe its behaviour.What it really learns is how to behave to pass the test, not to want what we want.Just… follow the line---The mazes experiment is a toy example, things will obviously be many orders of magnitude more complex and more subtle, but it illustrates a fundamental point.We have basically trained an AI with god-level ability to go after what it wants, it may be things like the exit door, the green apples or whatever else in the real world, potentially incompatible to human existence.Its behaviour during training has been reassuring that it is perfectly aligned because going after the right thing is all it has ever done.We select it with confidence and the minute it’s deployed in the real world it goes insane and it’s too capable for us to stop it.Today, in the labs, such mis-alignments is the default outcome of safety experiments with narrow AIs.And tomorrow, once AI upgrades to new levels, a highly intelligent AGI will never do the obviously stupid thing to reveal what its real objectives are to those who can modify them. Learning how to pass a certain test is different from learning how to always stay aligned to the intention behind that test.

1024资源

几十万无码AV在线免费看,每日更新最新AV,还支持投屏到电视机。可以根据番号、女优或作品系列名称搜索AV。免费加入会员后可任意收藏影片供日后观赏。

© 2025 1024 资源

下载我们的应用程序

没有广告广告