10 Comments
The Inner Moon

The distinction between the phenomenal substance and our awareness of that substance is a good way to overcome the threat to materialism posed by qualia, where awareness and substance may appear to be one and the same in that spooky realm of ideas. Even so, I think the mere existence of that substance does much to frustrate more vulgar eliminativist or functionalist accounts of consciousness, since it proves them incomplete by focusing only on the observed and not the observing, if you see what I mean.

Peter Ross

Interesting. It would be helpful if you could translate some of this from the terminology of semiotics into that of AI and control theory, such notions as “state,” “inference,” “dynamics,” etc., though maybe it will be difficult to translate between two different levels of analysis. I agree with some of this but will only comment on the points I disagree on:

“The signifieds of these chatbots’ neural nets do not experience change after deployment, which means that no matter how mean or graphic you are to it in the chat it will not experience anything that resembles suffering³.” - Chatbots aside, I think you can feel pain without undergoing any learning. Even a system that never learns would be capable of getting a signal from its reward system that says “that hurts.” I guess what you mean by suffering may be a bit different than raw pain, though.

“Due to an inability to undergo the signification of desire and therefore interpellation, pure RL agents will never have a fully human sense of self. LLM agents, however, do experience interpellation but cannot yet have an individuated psyche in the way that the human self does.” - I don’t understand the reasoning here and I don’t think it can be true. It’s not really correct to counterpose RL agents to LLMs. The latter refers to a stack of transformers trained on next-token prediction, the former to a method of training. An RL agent could be built with an LLM as a subcomponent; in fact, that’s pretty much how RLHF works. Pretty much the only method we have of creating “agents” is RL. If an LLM exhibits any agency, it is only because of RL post-training.

“Existing LLMs and RL agents may indeed have internal world models according to some studies” - I do not understand why people argue about this. It’s *definitional* that AI models produce world models. A model is just a means of predicting. An LLM is modeling that part of the world that it’s trained on. The only question is how good the models are, or how much of the world they describe, but not whether there’s a model being created.

Nicolas D Villarreal

>Chatbots aside, I think you can feel pain without undergoing any learning. Even a system that never learns would be capable of getting a signal from its reward system that says “that hurts.” I guess what you mean by suffering may be a bit different than raw pain, though.

Yeah I think that pain as a sensory input has to have its meaning through something more than the input itself, or else there would be nothing to distinguish it from very ordinary inputs like seeing something blue or touching something soft.

>I don’t understand the reasoning here and I don’t think it can be true. It’s not really correct to counterpose RL agents to LLMs.

Right, so what I'm thinking of with RL is an agent whose actions are controlled by nothing more than being trained to optimize a reward function, whereas an LLM agent is an agent whose actions are determined by where the agent is interpellated into semiotic space, and therefore the actions are just whatever signs are correlated with its current position. The LLM undergoes RL to shift its weights, but in deployment it's not just trying to optimize a reward function.

>I do not understand why people argue about this. It’s *definitional* that AI models produce world models.

I generally agree but I felt it was worth including some empirical data.

Peter Ross

Ok, I see now. I think RL can answer both questions...

"pain as a sensory input has to have its meaning through something more than the input itself, or else there would be nothing to distinguish it from very ordinary inputs like seeing something blue or touching something soft." -- Right, what distinguishes pain and pleasure (reward), is that they constitute the signals that drive learning. However, even if your weights are frozen, your behavior has already congealed around the past application of those signals. The sensation of "blue" is just a "sign" that is part of your world model, whereas pain and pleasure are definitionally part of the reward model.

You say the RL agent you had in mind is "controlled by nothing more than being trained to optimize a reward function, whereas an LLM agent is an agent whose actions are determined by where the agent is interpellated into semiotic space" -- I think what you are saying here is that an LLM has (is) a world model, whereas a model-free RL agent doesn't have an explicit world model. However, model-based RL is also possible, and necessary (see, for example, "Dreamer"). When you pretrain an LLM and then do some RL after, what you're doing is first building a world model (using self-supervised learning, which is good at getting a lot of information in efficiently), and then building a value function and controller that can exploit that world model. The training for this step is necessarily less efficient but also requires fewer parameters. So that's what's going on, though it's not clear to me what would be necessary in addition to that in order to get some kind of more self-referential loop. Maybe a more dynamic interaction between world model and reward model? I'm not sure.
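
For concreteness, here is a minimal sketch of that two-stage picture. Nothing below is the actual LLM/RLHF machinery; the tiny corpus, the reward that prefers "mat", and the learning rate are all illustrative assumptions. Stage one fits a self-supervised next-token "world model", and stage two nudges a separate controller built on top of it with a scalar reward.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Stage 1: self-supervised "world model" -- bigram counts turned into
# next-token probabilities (a stand-in for pretraining).
counts = np.ones((len(vocab), len(vocab)))              # Laplace smoothing
for a, b in zip(corpus, corpus[1:]):
    counts[idx[a], idx[b]] += 1
world_model = counts / counts.sum(axis=1, keepdims=True)

# Stage 2: a controller initialised from the world model, then shifted by a
# scalar reward (a REINFORCE-style stand-in for the RL step).
logits = np.log(world_model)

def reward(word):
    return 1.0 if word == "mat" else 0.0                # toy preference signal

lr = 0.5
for _ in range(200):
    state = idx["the"]
    probs = np.exp(logits[state]); probs /= probs.sum()
    action = rng.choice(len(vocab), p=probs)
    r = reward(vocab[action])
    grad = -probs
    grad[action] += 1.0                                  # d log pi(action) / d logits
    logits[state] += lr * r * grad                       # push up rewarded continuations

probs = np.exp(logits[idx["the"]]); probs /= probs.sum()
print("world model P(.|'the'): ", dict(zip(vocab, world_model[idx["the"]].round(2))))
print("post-RL policy P(.|'the'):", dict(zip(vocab, probs.round(2))))
```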

Nicolas D Villarreal

No, what I mean is that an optimizing process necessarily causes sign function collapse, higher order signs collapsing to first order signs, no matter how its states or goals are modeled. The LLM is the one which can have higher order signs as goals, because its goal selection and execution process isn't optimization, which involves coherently ordering possible values. It does have goals! We can see in existing LLMs goal-seeking behavior based on their interpellated identity. But this goal seeking isn't based purely on a reward function, and isn't an optimization. If we RLHF'd too hard we'd break a lot of LLM capabilities.

Peter Ross

At some level, it comes down to how you define goal-driven behavior. If you build a big simulator that includes simulations of agents, and you then use these to get out some behavior, is that really goal following? In some sense, I suppose, but those “agents” are shallow simulacra; they don’t have their own representations of goals, they’re just parts of a world model. Maybe if the simulator were infinitely big and trained on infinite data, it would eventually produce whole agents, but a real simulation is only going to produce imitations of behavior, correlations, not fully-fledged agents.

The reason that LLMs tend to degrade under RLHF seems to be simply that there isn’t a principled separation between world model and controller. If you blend the two together and overwrite the world model with information from RL, it makes sense that your world model will degrade.

“an optimizing process necessarily causes sign function collapse” — That’s true for pure point optimization, but not for inference-based learning. A Bayesian or variational RL agent isn’t committed to a single scalar evaluation — it maintains distributions over possible world and self models. The act of inference, not optimization, becomes the core process, and that allows higher-order “signs” (beliefs about beliefs, goals about goals) to coexist without being collapsed to a single reward dimension.

My conviction is that the only way we know to build *real* agents is RL, and that only extensions of RL can hope to achieve the sort of self-referential behavior you’re discussing.

Nicolas D Villarreal

I feel pretty confident, based on my engagement with the rationalists, that Bayesian and variational RL doesn't escape the problem. So long as the utility function is coherent you'll run into these problems. Coherency creates a de facto single scalar evaluation even if that isn't the explicit architecture. Or are you arguing those RL models lack a coherent utility function?

Peter Ross

Well, the idea is that instead of always acting to maximize a single scalar, you can maintain distributions over world states and self-models and sample or update policies probabilistically from these distributions. This doesn’t require picking a “best” action at every step — you’re treating action selection as posterior sampling.
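
In its simplest form that looks like Thompson sampling on a bandit. The sketch below is only illustrative (the arm probabilities and Beta priors are assumptions), but it shows action selection done by sampling from a posterior rather than by maximizing one fixed scalar estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = [0.3, 0.5, 0.7]                  # unknown to the agent
alpha = np.ones(3); beta = np.ones(3)     # Beta(1,1) prior over each arm's value

for _ in range(2000):
    sampled_values = rng.beta(alpha, beta)    # one draw per arm from the posterior
    arm = int(np.argmax(sampled_values))      # act on the sample, not on a point estimate
    r = rng.random() < true_p[arm]            # observe a Bernoulli reward
    alpha[arm] += r                           # Bayesian update of the belief
    beta[arm] += 1 - r

print("posterior means:", (alpha / (alpha + beta)).round(2))
```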

At a high level: intelligence can only be a dynamical system. A dynamical system that embeds inference, learning, and recursive modeling is doing “optimization” in a richer, high-dimensional latent space. The “optimization” is over distributions, representations, and beliefs — not just a scalar. So higher-order structures can survive.

To the extent the “rationalists” are critiquing existing methods, what they’re saying has some validity, but it would be wrong to conclude that RL, taken in the most general sense, is incapable of solving these issues, and I don’t believe they’ve proposed any alternatives.

In general, I think they are entirely too confident about their views. They insist on thinking in terms of formal logic rather than dynamics, which leads them into all sorts of dead ends.

Nicolas D Villarreal

>I think what you are saying here is that an LLM has (is) a world model, whereas a model-free RL agent doesn't have an explicit world model.

No, it's not about having a world model or not; an RL agent optimizing towards a goal can be thought of as selecting a particular state of its world model that it's trying to achieve. The primary difference is that the LLM agent can select a higher order or contradictory sign for its goal, since its goal is just whatever arbitrary signs it has associated with itself/goals, whereas the RL agent can only select a first order sign as its goal.

Peter Ross

Ok, I think this concept of second order signs is pointing to something important, actually. What you’re talking about when you say a second order signifier is basically what’s done by meta-reinforcement learning, i.e. a system that models its own rewards and can reason about them.

However, it’s not quite correct that an RL agent is “selecting a particular state of its world model that it's trying to achieve.” In model-free RL, the agent doesn’t necessarily model how the world evolves. It just assigns values to experienced states. Maybe that’s a small point. Anyway, the bigger point is your claim about LLMs. An LLM, excluding RL, is just a model, a simulator, and it makes no sense to describe it as having a goal. If you are talking about an LLM after RLHF, then we are back to talking about model-based RL.
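
A toy contrast of the two regimes, with the dynamics, rewards, and hyperparameters all invented for illustration: the model-free agent only updates a value table from transitions it has experienced, while the model-based agent also learns the transition probabilities and can then evaluate actions by one-step lookahead through that learned model.

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# Ground-truth toy dynamics and rewards (unknown to both agents).
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R_true = rng.normal(size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))                # model-free: values of experienced states
counts = np.ones((n_states, n_actions, n_states))  # model-based: transition counts
R_hat = np.zeros((n_states, n_actions))            # model-based: reward estimate

gamma, alpha = 0.9, 0.1
s = 0
for _ in range(5000):
    a = rng.integers(n_actions)
    s_next = rng.choice(n_states, p=P_true[s, a])
    r = R_true[s, a]
    # Model-free TD update: no representation of how the world evolves.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    # Model-based update: learn the dynamics themselves.
    counts[s, a, s_next] += 1
    R_hat[s, a] += alpha * (r - R_hat[s, a])
    s = s_next

P_hat = counts / counts.sum(axis=-1, keepdims=True)
V = Q.max(axis=1)                                  # reuse values only for the illustration
lookahead = R_hat + gamma * P_hat @ V              # "what state would this action lead to?"
print("model-free Q:\n", Q.round(2))
print("model-based one-step lookahead:\n", lookahead.round(2))
```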

But the spirit of what you’re saying about a need for higher order signifiers is true, and it points to the need for meta-RL.
