Discussion about this post

Peter Ross:

I think the bigger issue with the way the “rationalists” frame this is that they make “goals” their starting point in the first place. They think of “AI alignment” as the question of how to put goals into the machine. The difficulty is that a reinforcement learning agent does not have goals to begin with; rather, goals emerge out of a reward function. So if you want to put a goal into the machine directly, that’s naturally very hard.
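
To make that concrete, here’s a toy tabular Q-learning sketch (purely illustrative: the chain environment, GOAL_STATE, and all the constants are made up). Notice that the agent never represents a goal anywhere; the “goal” lives only in the reward function, and goal-like behavior emerges from maximizing it:

```python
import numpy as np

# Toy chain environment: states 0..9, move left/right, reward only at the end.
N_STATES, N_ACTIONS = 10, 2
GOAL_STATE = 9  # the "goal" exists only here, in the reward function

def reward(state):
    return 1.0 if state == GOAL_STATE else 0.0

def step(state, action):
    delta = 1 if action == 1 else -1
    return min(N_STATES - 1, max(0, state + delta))

Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.1, 0.99, 0.2
rng = np.random.default_rng(0)

for episode in range(500):
    s = int(rng.integers(N_STATES))  # random starts so exploration covers the chain
    for t in range(50):
        if rng.random() < eps:
            a = int(rng.integers(N_ACTIONS))  # explore
        else:
            best = Q[s].max()
            a = int(rng.choice(np.flatnonzero(Q[s] == best)))  # greedy, random tie-break
        s_next = step(s, a)
        # The update only propagates scalar reward; the agent never sees a
        # "goal", yet it ends up walking toward GOAL_STATE.
        Q[s, a] += alpha * (reward(s_next) + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
```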

(The whole conception of tech-bro “rationalism” seems to be a formal-logical one, which is probably why they failed to anticipate the success of deep learning. You’d think the failure of symbolic AI would have drummed into everyone’s heads the need for a more dialectical conception, but they seem to still be applying formal-logical modes of thought to today’s non-symbolic systems, and that is the root of their misconceptions.)

The way current systems are trained is not by “a human hand-crafting a utility function to train a certain behavior into a model.” To do more complicated things, a model will have to *learn* its reward function, for the reasons you describe in your essay. RLHF is one way to do this, albeit a pretty primitive one. The training process is a dynamical system, involving a very complicated coupling among the thinking part of the brain, the rewarding part, and the environment. The question, to my mind, is how to set up a training process such that the reward function ends up being like that of a benevolent human. To solve “alignment,” one needs to understand the dynamics of the training process more than the actual contents of the minds themselves, so that one can set up a training process that will end up in the right equilibria. And we have an example from nature of how to do alignment: human evolution.
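
Here’s roughly what I mean by learning the reward function, in the crudest RLHF style: fit a reward model to human preference pairs with a Bradley-Terry loss, rather than hand-crafting a utility function. A minimal linear sketch; the feature vectors and dimensions are stand-ins for whatever representation a real system would use:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D = 16           # stand-in feature dimension for a (prompt, response) pair
w = np.zeros(D)  # parameters of the learned reward model
lr = 0.1

def reward(features):
    return float(w @ features)

def preference_update(chosen, rejected):
    """One gradient step on -log P(chosen preferred over rejected),
    the Bradley-Terry loss used in RLHF reward modeling."""
    global w
    margin = reward(chosen) - reward(rejected)
    # gradient descent on -log(sigmoid(margin)) with respect to w
    w += lr * (1.0 - sigmoid(margin)) * (chosen - rejected)

# Each labeled comparison nudges the reward model toward human preferences;
# the policy is then trained against this learned reward, which is what
# couples the thinking part, the rewarding part, and the environment.
```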

So it seems clear that the way to do “alignment” must surely be with such tools as inverse reinforcement learning, multi-agent reinforcement learning, etc. - i.e., following Stuart Russell’s approach rather than Yudkowsky’s. And *differential games*.
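
For instance, the feature-matching flavor of inverse RL, reduced to its core. This is a sketch of the idea rather than a working implementation of any particular paper; the feature vectors and the surrounding environment interface are assumed:

```python
import numpy as np

D = 8
rng = np.random.default_rng(0)
w = rng.normal(size=D)  # current guess at the reward weights
lr = 0.05

def irl_step(expert_features, policy_features):
    """Feature-matching update: shift the reward weights so the expert's
    demonstrated behavior scores higher than the current policy's."""
    global w
    w += lr * (expert_features - policy_features)
    norm = np.linalg.norm(w)
    if norm > 1.0:
        w /= norm  # keep the learned reward bounded

# Outer loop (not shown): solve for the optimal policy under the current w,
# roll it out to estimate policy_features, call irl_step, and repeat until
# the learned reward makes the expert's behavior near-optimal.
```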

Actually, for all the fuss about the difficulty of solving the AI “alignment” problem, it seems to have a simple solution: stick to the passive prediction machines that arise from self-supervised learning. These are already very powerful tools, sufficient to revolutionize most areas of applied science. Reinforcement learning is where the danger comes from, as it’s what gives rise to agency and to qualitatively higher levels of intelligence. If we limit RL to niche use cases and small-scale models, that would be sufficient to automate most forms of tedious labor. Between big self-supervised models and small RL models, we have everything we need for a utopia. It’s only when you apply reinforcement learning to large models, in a really determined effort to create a superintelligence, that you have anything to worry about. That’s a Pandora’s box which should be kept firmly shut, at least for the time being. To a more “rational” society, this would be a no-brainer.
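
The distinction this argument rests on, in schematic form (illustrative toy functions, not a real training loop):

```python
import numpy as np

# Self-supervised objective: a fixed function of the data. The model is
# graded on prediction; its outputs never act on the world.
def self_supervised_loss(logits, target):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(-np.log(probs[target]))  # next-token cross-entropy

# RL objective: an expectation over trajectories the policy itself produces.
# Optimizing it selects for influencing the environment - the seed of agency.
def discounted_return(rewards, gamma=0.99):
    return float(sum(r * gamma**t for t, r in enumerate(rewards)))
```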

Peter Ross:

You convincingly argue that the way an agent represents a goal will probably become more sophisticated as it becomes more intelligent: “the two axes of final goals and intelligence are connected in terms of complexity and sophistication.” I think that’s a good point. However, the main point of the “orthogonality thesis” is that it’s possible in principle to create an arbitrarily intelligent agent with arbitrary “goals” - so although the agent might have very sophisticated mental machinery for dealing with the idea of making paperclips, making paperclips may still be what it wants to do. It doesn’t look like you’ve touched on that fundamental point.

It’s clear that goals and intelligence won’t be uncorrelated in practice - that is exactly the claim of “instrumental convergence,” i.e. that emergent goals like self-preservation and resource acquisition will come out of training across many scenarios, and that some goals or subgoals will only become available to an agent as it gets smarter. The neo-“rationalists” usually use this to argue that the “space of minds” is very large and populated with many non-human minds.

What I find unconvincing in your piece is the assertion that it’s unlikely “that artificial agents will have alien values, or be likely to engage in extreme actions to achieve instrumental goals.” All you have shown is that the way a goal is represented will become more sophisticated, and that still leaves room for many quite alien kinds of motivation. The intuition the “orthogonality thesis” is meant to convey is that human motivations are quite specific: the human mind is the product of millions of years of natural selection and a particular history. I think that intuition is basically right - an AI trained in a very different way will probably be very different from a human mind - and I share the (common-sense) intuition that that is dangerous.
