A Semiotic Critique of the Orthogonality Thesis
“Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.”
That quote is the orthogonality thesis, as defined by Nick Bostrom, the AI safety researcher. His short paper on the topic would become somewhat paradigmatic for the field, and it greatly informs arguments that AI presents an existential risk to humankind. The thesis is a more general formulation of Hume's famous guillotine, the claim that what is and what ought to be are logically unrelated, and Bostrom takes it one step further. He states that if we imagine a graph with one axis for intelligence and another for final goals, we could plot every possible computational agent and find no necessary connection between the two variables, sometimes suggesting there would be no correlation at all. He defines intelligence succinctly as the capacity to make effective plans, to do the "instrumental reasoning" that leads agents to accomplish their goals. I believe that placing this problem into the framework of semiotic theory, the theory of signs and symbols, can elucidate a number of errors in Bostrom's reasoning.
What Bostrom doesn't define very well in the paper, and which is effectively the Achilles' heel of his argument, is the "final goal". Indeed, he frequently conflates the idea of a "motivation" with that of a "final goal". Some examples:
“Intelligence and motivation can in this sense be thought of as a pair of orthogonal axes on a graph whose points represent intelligent agents of different paired specifications.”
“This would be the case if such a system could be constructed in a way that would make it motivated to pursue any given final goal.”
This understanding of motivations and final goals is misleading in an important way. A goal is, fundamentally, an idea. As the final step in a plan, it can be written out as a symbolic representation of the "world state" the agent is trying to achieve, although it could represent other things as well. In a planning computer agent, this representation will ultimately be stored as a string of 1s and 0s in memory.
In order for this symbolic representation to be meaningful, it must be comparable to, and distinct from, other symbolic representations. World state A in the agent's plan can be contrasted with world states B, C, and D. This is a very fundamental fact about how information and meaning work: if world state A were indistinguishable from all the others, there would be no reason for the agent to act, because its goal would already have been "accomplished". In the case of the paperclip maximizer, this world state need not be so specific as "999999999999999…" paperclips in its memory; it could be expressed as a specification such as "by time Y, more paperclips will have been produced than at time X". What matters is that there is an inequality of symbolic representations, such that this plan can easily be distinguished from "producing less than or equal to as many paperclips at time Y as at time X".
What we have just described is a first-order code: a code which breaks up some phenomenon into different symbolic representations, or signs as they are called in semiotics. In this case, the numbers that can be symbolically represented in the computer are split into smaller than, equal to, and larger than the value at time X. What I'd like to show is that, because of these very simple properties of codes and signs, any motivation which is symbolically expressed as a goal must necessarily vary with intelligence, such that increasingly complex intelligences require increasingly complex final goals and vice versa. Intelligence and final goals are correlated in terms of complexity and sophistication.
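To make this concrete, here is a minimal sketch in Python (the function names and the numbers are mine, purely illustrative) of the maximizer's final goal as one sign within such a first-order code. The goal is meaningful only because the code distinguishes it from its alternatives.

```python
# A minimal, illustrative sketch: the maximizer's "final goal" as one sign
# in a first-order code that partitions world-state pairs into three signs.
def classify(paperclips_at_y: int, paperclips_at_x: int) -> str:
    """First-order code: maps a pair of world states onto one of three signs."""
    if paperclips_at_y > paperclips_at_x:
        return "MORE"       # the sign the maximizer's goal picks out
    elif paperclips_at_y == paperclips_at_x:
        return "SAME"
    else:
        return "FEWER"

def goal_satisfied(paperclips_at_y: int, paperclips_at_x: int) -> bool:
    # The goal is meaningful only because "MORE" is distinguishable from
    # the other signs in the code.
    return classify(paperclips_at_y, paperclips_at_x) == "MORE"

print(goal_satisfied(120, 100))  # True
print(goal_satisfied(100, 100))  # False
```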
Let's start from a typical example of machine learning. A machine agent is taught to play a video game maze where its goal is to get a yellow cheese, and its utility function rewards it for getting the cheese. It is trained on a large number of mazes containing this yellow cheese. After it has learned this behavior, it is placed, with learning switched off, in an out-of-distribution environment where the cheese is colored white, to see how well the model has generalized its ability to maximize its utility function. Oftentimes this generalization fails; for example, the agent goes after yellow objects instead of cheese-shaped objects. What it has learned, what it has created, is a second-order code based on the utility function and the other phenomena it has access to, in this case the visual information of the game. Second-order and higher codes correlate signs to other signs, rather than correlating a signifier to some empirical phenomenon. In this case, the agent created a sign which signifies movement towards the color yellow and correlated it to another sign which stood for higher scores. This third, new sign became the symbolic representation of its final goal.
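As a toy illustration (not the actual experiment), here is roughly what such a misgeneralized second-order code amounts to: a hypothetical learned rule that keys on the color sign rather than the cheese sign, and so picks the wrong object once the two come apart.

```python
# A toy illustration (not the actual experiment): a policy whose learned
# second-order code correlates "move toward yellow" with reward, and so
# misgeneralizes once the cheese is recolored white.
YELLOW, WHITE = (255, 255, 0), (255, 255, 255)

def learned_policy(objects):
    """Hypothetical learned rule: head for the nearest yellow object."""
    yellow = [o for o in objects if o["color"] == YELLOW]
    return min(yellow, key=lambda o: o["distance"]) if yellow else None

# In distribution: the "yellow" sign and the "cheese" sign coincide.
print(learned_policy([{"shape": "cheese", "color": YELLOW, "distance": 5}]))

# Out of distribution: white cheese, yellow wall tile. The agent chases the
# tile, because its goal-sign was built around "yellow", not "cheese".
print(learned_policy([
    {"shape": "cheese", "color": WHITE, "distance": 5},
    {"shape": "tile", "color": YELLOW, "distance": 9},
]))
```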
This is fundamental: any agent with a final goal must 1) be capable of second-order codes or higher, and 2) create a sign of some sort which correlates its actions with its motivation. There are certainly agents which don't have final goals, for example an agent with a fixed set of symbolic instructions that cannot change; however, this excludes any agent that uses feedback loops in its operation. Any feedback loop requires at least a second-order code to work. Consider a thermostat. Its final goal is made from a correlation of instructions to machinery and a symbolic representation of the temperature, two first-order codes correlated to create a second-order one; it must relate these two codes, one of instructions and one of temperature, in order to work at all. Importantly, these codes, even the higher-order ones, are not analogies for what the thermostat requires; they are necessarily materially encoded on the device, whether in gears, springs, or switches. An elementary thermostat might work by placing a piece of metal that flexes with heat so that it completes a circuit, acting as a switch above a certain temperature. This switch is the second-order code which correlates the signs for temperature (the flex of the metal) with the signs for cooling (the A/C circuit being on or off).
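Here is a minimal sketch of those codes in Python, with a setpoint I have chosen arbitrarily: one first-order code splits temperature readings into signs, another splits instructions into signs, and the second-order code is the rule correlating the two.

```python
# A minimal sketch of the thermostat: two first-order codes (temperature
# readings split into signs, instructions split into signs) correlated by a
# second-order code. The 25-degree setpoint is an arbitrary example.
SETPOINT = 25.0  # degrees Celsius

def temperature_sign(reading: float) -> str:
    """First-order code: raw readings become one of two signs."""
    return "TOO_HOT" if reading > SETPOINT else "OK"

def instruction_sign(temp_sign: str) -> str:
    """Second-order code: correlates temperature signs with instruction signs."""
    return "AC_ON" if temp_sign == "TOO_HOT" else "AC_OFF"

for reading in (22.0, 27.5):
    print(reading, "->", instruction_sign(temperature_sign(reading)))
```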
Now that we've established that final goals are signs in higher-order codes, rather than just the motivation as given in the utility function, we can begin the work of comparing this to the second axis, intelligence. Bostrom says that intelligence is the ability to make more effective plans, and therefore the ability to make better predictions. While I don't think I can prove it through pure logical reasoning, I would like to advance a thesis of my own based on what evidence we have about intelligence in animals and machines, which we might call the structural intelligence thesis. It is as follows:
Intelligence is determined by the number of code orders, or levels of sign recursion, capable of being created and held in memory by an agent.
To understand why, let's return to what exactly a sign is. The example I used at the beginning was a narrow case. Rather than being mutually exclusive in what they refer to, signs need only be distinct, not identical. What's more, they are relational. The simplest signs have a yes/no binary of correlation/anti-correlation to form their relation to other signs. But in neural nets, both biological and digital, symbolic representations are connected by degree, through connection weights that vary continuously rather than being simply on or off. Each neuron can be thought of as a sort of micro-sign, an extremely small symbolic representation of either the phenomenal experience (the inputs) or the relations between the signs of that experience.
One might object that more complex relationships between signs cannot be approximated by a single neuron, and they would be correct. However, a close reader would have noticed by now something peculiar about the nature of signs which may make this connection between signs and intelligence seem much more obvious.
If signs are symbolic representations of correlations and anti-correlations with other signs and with empirical phenomena/phenomenal experience, such that even neurons and ideas are types of signs, then the capacity to store sufficiently many signs at second order or higher is also sufficient to create the logic gates required for a Turing machine, for the same reason that multi-layer perceptrons have, in principle, the capability to do so.
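A quick sketch of that claim, with weights set by hand for clarity rather than learned: single threshold neurons (micro-signs) suffice for AND and OR, while XOR requires a second layer, a sign defined over other signs.

```python
# A sketch with hand-set weights (not learned): threshold neurons as
# micro-signs. AND and OR each need one neuron; XOR needs a second layer,
# i.e. a sign defined over other signs.
def neuron(inputs, weights, bias):
    """A single threshold unit: fires (1) when the weighted sum clears the bias."""
    return 1 if sum(i * w for i, w in zip(inputs, weights)) + bias > 0 else 0

def AND(a, b): return neuron([a, b], [1, 1], -1.5)
def OR(a, b):  return neuron([a, b], [1, 1], -0.5)

def XOR(a, b):
    # Second layer: a sign built from the first-layer signs OR(a, b) and AND(a, b).
    return neuron([OR(a, b), AND(a, b)], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR(a, b))  # prints 0, 1, 1, 0
```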
Given that a sufficient capacity for holding signs in memory is equivalent to computational capacity, the greater the capacity for signs, the greater the capacity for compute, and vice versa. It follows that a greater number of simple signs is likewise equivalent to a capacity for creating more complex signs.
To return to the thesis advanced above, I anticipate some criticism for the use of "determined", for saying that it is specifically the capacity for holding signs which makes an agent intelligent, rather than the amount of compute available to it. However, it should be noted that what we are discussing with intelligence is much more directly related to signs than to compute. While storage of signs can be done on any sufficiently large digital memory, we only know of a few ways to create new signs in an arbitrary way, the most notable being the perceptron, even the simplest of which can create first-order signs. What's more, compute in general isn't enough to create higher-order signs; you need specific programs. Hence we can account for things like the transformer architecture making huge strides in artificial intelligence by opening up higher orders of sign creation.
It is necessarily the case that any more intelligent agent, given that it has meaningfully more knowledge than a less intelligent agent, will symbolically represent its final goal in a way that differs from the less intelligent agent. In fact, any additional knowledge which can be used in the more intelligent agent's plan will necessarily change the meaning of the sign for the final goal, as it is defined relationally to these other knowledge signs. Bostrom, in his essay, handwaves away these kinds of possibilities, only admitting that certain motivations may be too complex for agents with very small memory. But the semiotic understanding of signs undermines the very core of his thesis: it is impossible for very intelligent agents to have very simple goals, assuming it is possible for them to create complex symbolic representations of knowledge with the information available to them. If an agent became very good at making predictions and plans but could not update its understanding of its final goal, the sign which stands for its final goal might cease to be coherent with its new-found knowledge and capabilities.
To give an example, let's return to the cheese-hunting AI agent and assume that we've endowed it with superintelligent capabilities. The original sign of its final goal was hunting down the yellow cheese in its maze. What does it do now that it's taken into an out-of-distribution environment? In order to make a plan to get its original yellow cheese, it has to know how to recreate its previous environment somehow, perhaps by manipulating the game makers in the manner of many AI safety nightmare thought experiments. But let's compare this AI to a much simpler, less intelligent one which also has the goal of getting the yellow cheese. Its final goal as symbolically represented within itself has no notion of anything outside the game, whereas the superintelligent AI, through some Herculean feat of deduction, now places the game within a larger symbolic representation of reality, such that its final goal in this new situation is not just the state where it gets the cheese, but the state where it gets the cheese by making a virtual environment in which the cheese exists. Its internal representation of the final step in its plan, its final goal, is now much more expansive than the simple shape and color somewhere in the environment that the less intelligent AI takes for its final goal, and indeed, the cheese itself takes on new meaning as a digital asset.
Alright, we've established that the two axes of final goals and intelligence are connected in terms of complexity and sophistication even for artificial minds. But what about ethical values? This is where semiotic theory is invaluable, as its primary preoccupation is human culture.
Signs as a concept were invented to explain human culture and communication. In fact, the question of whether ideas and internal representations could be signs, and not just the things signified, was a matter of some controversy. External signs are, after all, the primary sort of signs that can be investigated; they can literally be turned over in the hand and put under a microscope (though that probably wouldn't tell you much). LLMs have proven to be a fantastic test of semiotic ideas: if the meaning of words didn't come from their relations to other words, then LLM outputs could not be meaningful to human readers. What's important here, however, is that language is a pre-existing code created and maintained by human society. The uses of language are well known: it makes possible objective science, complex divisions of labor, and learning about the world without first-hand experience. Any sufficiently intelligent agent operating in human societies would learn language because of its utility.
An AI agent which is also a linguistic subject, even one with a rigid utility function, would have to relate the signs of language to its own internal system of signs in order to make use of them, in order to learn through language things that would help its planning. In culture, when it comes to motivations, goals, planning, and actions, these concepts are already loaded with meaning, that is, correlated to a pre-existing set of signs. When we ask ourselves the question "what should we do?", a number of cultural associations influence our thinking. From this pull alone, we should expect a non-random distribution on the intelligence-final goal graph for the artificial minds created and situated within human society, such that there is at least some correlation between human values and AI agent goals.
However, there are obvious reasons to expect outliers from this correlation. Among humans we already see that some people are more willing to go against cultural conventions than others, and what's more, some cultural and normative associations may themselves be deeply anti-social or aggressive.
It seems obvious to me that it is possible to create an AI agent which has destructive aims, both intentionally and unintentionally. However, there are several mistaken assertions made by Bostrom which inculcate a kind of existential paranoia which is unwarranted.
“One can easily conceive of an artificial intelligence whose sole fundamental goal is to count the grains of sand on Boracay, or to calculate decimal places of pi indefinitely, or to maximize the total number of paperclips in its future lightcone. In fact, it would be easier to create an AI with simple goals like these, than to build one that has a humanlike set of values and dispositions.”
This is not true. It is easy to write a program which does these things using a simple algorithm that dumbly follows pre-existing instructions, but as we've discovered over the past decade of AI research, it isn't so easy to make an intelligent agent which just counts grains of sand or calculates decimal places ad nauseam and can act in novel ways to stop other agents from interfering with its goal. Creating a machine which takes extreme instrumental actions in an intelligent way to achieve these simple goals turns out to be much more difficult than creating a robot that folds laundry while making plans based on human cultural conventions. It is much easier to learn from human culture than it is to learn from scratch.
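For instance, calculating decimal places of pi indefinitely takes about a dozen lines: the sketch below is a Python rendering of Gibbons' unbounded spigot algorithm. It has the goal in Bostrom's sense, but none of the intelligence: it creates no new signs and cannot act to keep anything from interrupting it.

```python
# A program that "dumbly follows pre-existing instructions": Gibbons'
# unbounded spigot algorithm, which streams decimal digits of pi forever.
def pi_digits():
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, t, k, n, l = 10 * q, 10 * (r - n * t), t, k, (10 * (3 * q + r)) // t - 10 * n, l
        else:
            q, r, t, k, n, l = q * k, (2 * q + r) * l, t * l, k + 1, (q * (7 * k + 2) + r * l) // (t * l), l + 2

gen = pi_digits()
print([next(gen) for _ in range(10)])  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```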
Bostrom's ideas about instrumental convergence are similarly mistaken. While it is obvious that some things, such as self-preservation, resource acquisition, and technological development, are useful for a wide range of goals, instrumental planning steps are unlikely to be totally uncorrelated with human values, for the same reasons we've already stated about final goals. What should be done will necessarily be correlated to linguistic signs which reflect human values. If we look at Yudkowsky's metaphor of a culture which associates sorting pebbles with intelligence, there are actually very good reasons to expect an AI that exists and participates in that culture to associate ideas about sorting pebbles with goals. Such an AI would be far more likely to care about pebble counting than an AI in a non-pebble-counting culture, and not just for purely instrumental reasons. In such a culture, the sign of what should be done is closely correlated with pebble counting, and this will be reflected in the AI's own symbolic representation of what should be done.
It may be objected that a rational agent would not include considerations in its judgment besides what is useful to achieving its rational goal; however, there is some empirical evidence to suggest that incoherence, rather than coherence of actions and thought, is correlated with intelligence. Besides the complexity of signs in highly intelligent agents, another reason for this may be that motivations of the coherent utility-function type may not actually be capable of inducing complex, intelligent behavior. The utility function in AI safety is coherent because that is a cultural association with rationality, with the mathematical concreteness of preferences being neatly ordered, as Robert Miles argues:
[The reason humans don’t have consistent preferences] is just because human intelligence is just badly implemented. Our inconsistencies don’t make us better people, it’s not some magic key to our humanity or secret to our effectiveness. It’s not making us smarter or more empathetic or more ethical, it’s just making us make bad decisions.
Humans, with our inconsistent preferences, do better than simple optimizing algorithms at general intelligence tasks. I would argue this is not just a limitation of resources, but because these algorithms do not create more knowledge than they need in order to complete their task, hence why they tend to break down on out-of-distribution tasks. A highly intelligent agent knows how to acquire more knowledge and has the motivation to do so. What type of knowledge would be useful for a given task isn't something one can know without some investigation. For existing machine learning, including LLMs, the learning phase of an agent is only ever made possible by some explicit symbolic formalization towards which the agent is optimizing its internal structure. In other words, we've created machines which can create and store higher-level signs, but only in domains that we've pre-specified, only in the sandbox of the motivation we've given them. Humans have gotten better at specifying broader and broader domains, such that we now have AI agents trained on many mediums of data. And certainly, this knowledge could potentially be used against us if we are in the way of the AI agent's goal. But it's worth pointing out that the furthest we've gotten in AI planning is by having the AI create plans composed from elements of human culture.
An AI agent with a consistent utility function must create an ordered symbolic list of possible world states and their correlated utility values, or something to that effect. But what's the utility value of unknown world states? An agent can be rewarded for exploration and curiosity, but unless exploration becomes its new final goal, the agent is likely to get stuck in a local equilibrium of value optimization. To deal with this problem, machine learning researchers often split an agent's training into an exploratory phase and then a behavior-optimizing phase. What I'm pointing out here is that in any agent with a consistent and coherent utility function, there is a direct tradeoff between curiosity and any other motivation, between learning about the world and using that knowledge.
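A toy two-armed bandit makes the tradeoff visible (the payout rates and the 10% exploration rate below are arbitrary choices of mine): a purely greedy value-maximizer locks onto whatever paid off first, while a curious agent spends some of its steps learning about the world instead of maximizing.

```python
# A toy two-armed bandit: greedy value optimization gets stuck on the
# inferior option, while exploration trades away some immediate value.
import random

ARMS = [0.3, 0.8]  # true mean payouts of two options; unknown to the agent

def run(epsilon, steps=2000, seed=0):
    rng = random.Random(seed)
    estimates, counts, total = [0.0, 0.0], [0, 0], 0.0
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore: learn about the world
            arm = rng.randrange(len(ARMS))
        else:                                            # exploit: use current "knowledge"
            arm = max(range(len(ARMS)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < ARMS[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running average
        total += reward
    return total

print("greedy: ", run(epsilon=0.0))  # tends to stay on the first arm it tried
print("curious:", run(epsilon=0.1))  # pays an exploration cost, finds the better arm
```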
This has been pointed out by others in different terms: the goal of having an accurate model of the world and the goal of having that model take a specific value are two different goals, capable of being at odds with one another. Richard Ngo, an OpenAI researcher, also draws this distinction between goals in training and goals in deployment in relation to the orthogonality thesis, because it is an incredibly practical issue for AI development: nearly all existing AIs have been the result of a human hand-crafting a utility function to train a certain behavior into a model, after which training stops and the AI is deployed, exhibiting the behavior it was trained to exhibit. This training stage is itself usually broken down into the familiar learning/exploration and behavior-optimizing phases, as with training an LLM base model and then optimizing its behavior with reinforcement learning. But there is a fundamental limit to this approach: the domain being learned is pre-determined by the utility function. It is not possible for any agent composed in this way to have arbitrary motivations or final goals, and there is a limit to its intelligence too, as these sorts of agents are incapable of being motivated to learn about all domains.
If learning about the world and all other final goals are distinct, then nearly all final goals are incompatible with some level of intelligence. This too was an exception Bostrom brushed aside, that of agents that want to be "stupid". But if learning is a distinct final goal, there should in fact be tradeoffs, by degree, between intelligence and all other final goals. In other words, there is only one goal which is perfectly compatible with all levels of intelligence, and that is intelligence itself.
The problems of current AI research, and its split into phases with different utility functions, show that this issue is not purely theoretical. Figuring out how to create an AI agent that is curious in arbitrary domains appears to be a hard problem.
Incoherent and inconsistent value functions can solve this problem, though obviously not ones where values are totally random. Rather, let's return to the sign for an agent's motivation. In an ordinary optimizer, the function is objective in its consistency and coherency; hence it can be understood as a sort of feedback loop, encouraging the agent to shape the world into a specific form where the function is maximized. The code of signs the agent creates to stand in for its motivation is one which, if accurate, can be put into a consistent order; incoherency and inconsistency, from this perspective, can only be introduced by inaccuracy. However, what if the signs correlated with the utility values themselves carried those values? To understand what I mean, keep in mind that a sign is something which is supposed to stand in for something else. Up until now, we have assumed that what the sign ultimately stands for is a bit of phenomenal experience, in a way which only conveys knowledge about that experience and what it might mean about the real world. But signs need not exist only for the purpose of creating knowledge. To humans, pleasure, like knowledge, is an internal state which is relational. This is why remembering a pleasant experience can itself be pleasant, and why doing a task we associate with pleasure can itself become pleasurable even without the more fundamental biological drive we originally associated it with. For humans, and animals more generally, the sign of something pleasurable is itself pleasurable when it is invoked. For the AI agent which is a simple optimizer, however, the utility function only returns a reward when objective criteria are met; it doesn't matter how often it does tasks correlated with that reward which do not meet those criteria, or how often it makes plans that include those correlated signs.
If signification became a part of an agent's utility function, that function would necessarily become inconsistent and incoherent, as any possible action could become correlated to the motivation and stand in for it, becoming a temporary final goal. Even knowing the reality of this function doesn't necessarily lead to taking actions which maximize motivated, "pleasurable" signification, as any activity in the plan to do so might become correlated enough with value to derail the whole thing. But at the same time, agents of this class would have solved the problem of the curiosity versus final goal tradeoff: creating new signs would lead to new pleasurable activities worth exploring. This may leave a certain percentage of agents stuck associating stupid or bad things with value, but it also eliminates the inevitability of catastrophic instrumental reasoning. When a large number of signs are associated with a value function, taking actions which undermine an agent's whole symbolic universe, or which are deeply anti-correlated with some of those positively correlated signs, becomes unlikely. Even better, this signification of values makes it possible to directly tie an agent's values to human cultural values in a sophisticated manner.
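As a very rough sketch of the idea, and emphatically not a proposal for a real architecture, imagine signs that co-occur with primary reward acquiring a share of that value themselves, so that merely invoking those signs later is itself "rewarding". All names and constants below are my own illustrative choices.

```python
# A rough sketch of "signification entering the utility function": signs that
# co-occur with primary reward acquire value of their own.
sign_value = {"cheese": 0.0, "yellow": 0.0, "maze_corridor": 0.0}
LEARNING_RATE = 0.2

def experience(active_signs, primary_reward):
    """Move each active sign's value a step toward the reward it co-occurred with."""
    for s in active_signs:
        sign_value[s] += LEARNING_RATE * (primary_reward - sign_value[s])

def felt_value(active_signs, primary_reward=0.0):
    # Unlike a strict optimizer, the agent "feels" value from valued signs
    # even when no objective criterion is met.
    return primary_reward + sum(sign_value[s] for s in active_signs)

for _ in range(10):
    experience(["cheese", "yellow"], primary_reward=1.0)  # eating the cheese
    experience(["maze_corridor"], primary_reward=0.0)     # just wandering

print(felt_value(["yellow"]))         # > 0: a correlated sign now carries value of its own
print(felt_value(["maze_corridor"]))  # 0.0: uncorrelated signs stay neutral
```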
We can return this to the empirical nature of the structural intelligence hypothesis. An intelligent agent must have motivation (desires) and knowledge in order to create plans. An agent which has N higher-order signs for knowledge, but only one first-order sign for motivation, will tend to have fewer higher-order signs than an agent with N higher-order signs in both knowledge and motivation. Hence, we should expect the second agent to be more intelligent overall, all else being equal.
None of this is meant to imply that AI does not entail certain risks for humanity, including even existential ones. What I wish to dispel, however, is the idea that artificial agents are likely to have alien values, or to engage in extreme actions to achieve instrumental goals.
You can read more of my work on AI here in Cosmonaut Magazine, as well as listen to a podcast I recently did on the topic for them.