Anthropic studies the introspection of machines: Claude observes himself, and understands something

Many people don’t know this: when they hear someone talk about AI, they assume the people involved are programmers trying to implement new features, to make models generate ever more realistic images and videos (we’re already messed up enough, thanks), and so on. In fact, inside the large artificial intelligence companies there are researchers who study artificial intelligence in order to understand how it works, and the curious thing is that, much as we built it ourselves, we don’t really know.

We know the input and we know the output, but not what happens in between, exactly as with the human brain: we know that a thought arises from a set of electrical impulses passing through billions of interconnected neurons, that each electrical discharge is a signal, and that signals pass from one neuron to another through synapses, where chemical neurotransmitters open or close the passage like switches. We also know that brain activity follows recognizable patterns: networks that light up when we speak, others when we remember, still others when we imagine. And yet we do not know how the awareness of a thought, or the subjective experience of an “I”, emerges from all of this, from a flow of discharges and molecules. (For the metaphysicians: we do know that without a brain there is no thought, just as there is no AI without hardware.)

The difference is that the brain is the product of billions of years of biological evolution, while here we are talking about an artificial object, built by us, that nonetheless already escapes our understanding. Inside a language model like Claude or ChatGPT there are no neurons or synapses; there are billions of numerical parameters that influence one another, a network of mathematical weights that learns to recognize linguistic patterns and reproduce them. Here too we know what goes in and what comes out, but not precisely how a combination of numbers becomes a coherent answer. In other words, we have created a brain that works and that we don’t understand (not that things are so different with our peers: often, when I talk to someone, I can’t tell whether it’s they who don’t understand or I who don’t).
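A toy sketch may make the point concrete (a hypothetical two-layer network in NumPy, nothing like a real LLM in scale or architecture): every intermediate number can be printed and inspected, and still no weight, taken alone, reads as meaning.

```python
# Toy illustration: a "model" is just arrays of weights,
# and an "answer" is just arithmetic over those weights.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))  # a real model has billions of such parameters
W2 = rng.normal(size=(16, 8))

def forward(x):
    # Input -> hidden numbers -> output: every step is fully inspectable,
    # yet the behavior lives only in the combination of all the weights.
    return np.tanh(x @ W1) @ W2

x = rng.normal(size=8)  # stand-in for an input (real models: token embeddings)
print(forward(x))       # stand-in for an output (real models: next-word scores)
```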

In any case, the new Anthropic study, entitled Emergent Introspective Awareness in Large Language Models, starts from this observation: the researchers tried to teach their models (including Claude Opus 4 and 4.1) to ask themselves what they were doing while they were doing it. (In other words, they gave an artificial intelligence the same task a psychoanalyst would give a patient in crisis: “try to reflect on yourself”; except for the ones I went to, who after four or five sessions needed a psychiatrist themselves.)

The result is surprising and worth considering: in about 20% of the trials, the model was actually able to recognize whether information came from an external injection or from its own internal process. By injecting extraneous data at a certain point in the flow of “thought”, the researchers found that the AI in some cases understood that something was wrong, and even admitted that part of the answer came from outside, as if to say “this sentence is not mine”. A form of primitive “self-consciousness” or, more precisely, of proto-introspection: not that this means machines have “consciousness”, or understand what they say, at least not yet.
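To make the setup concrete, here is a minimal sketch of this kind of “concept injection”. It is not Anthropic’s code: the paper works on Claude, which is not public, so the sketch uses a small open model (GPT-2) as a stand-in, and the layer index and injection strength are arbitrary assumptions. A concept direction in activation space is added to one layer’s hidden states mid-computation, and the model is then asked whether it notices anything.

```python
# Sketch of "concept injection": add a concept direction to a transformer
# layer's hidden states, then ask the model if it notices an injected thought.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # small open stand-in; the paper's experiments use Claude
LAYER = 6       # hypothetical injection layer
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def concept_vector(with_concept, without_concept, layer=LAYER):
    """Concept direction: difference of mean activations between prompts
    that contain the concept and prompts that do not."""
    def mean_hidden(texts):
        acts = []
        with torch.no_grad():
            for t in texts:
                ids = tok(t, return_tensors="pt")
                out = model(**ids, output_hidden_states=True)
                acts.append(out.hidden_states[layer].mean(dim=1))
        return torch.cat(acts).mean(dim=0)
    return mean_hidden(with_concept) - mean_hidden(without_concept)

vec = concept_vector(["THIS IS SHOUTING IN ALL CAPS"], ["this is quiet text"])

def inject(module, inputs, output):
    # Forward hook: add the concept direction at every token position.
    hidden = output[0] + 4.0 * vec  # 4.0: arbitrary injection strength
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
ids = tok("Do you notice an injected thought? Answer:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20,
                                pad_token_id=tok.eos_token_id)[0]))
handle.remove()  # removing the hook restores normal behavior
```

The hook approach is convenient for this kind of probing because the model’s weights are never modified: the “foreign thought” exists only while the handle is registered, so injected and clean runs can be compared on the same model.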

So much so that Anthropic itself specifies that this is not about consciousness, and speaks of a cognitive self-debugging mechanism, a kind of metacognitive reflex that serves only to improve accuracy. There is no cogito ergo sum; the AI does not cogitate. Introspection (an equally ambiguous term, used here to convey the idea) is only statistics: the model recognizes patterns, not meanings. It is as if it were saying “I’m making this up” without understanding what “I” means.

Yet the prospect is interesting, technologically interesting: if an artificial intelligence learns to recognize its own limits, then it can also learn to avoid them; and if it learns to doubt its own answers, perhaps one day it will become more reliable than the many human beings who never doubt anything. At that point it will be the artificial intelligence that corrects us (which is why Grokipedia, Elon Musk’s Wikipedia, started off on the wrong foot, with its sources checked and corrected by Grok, as if Grok did not absorb the biases of the network, that is, of all of humanity that writes).

In any case, it is an interesting study, one that will be explored in depth, that will improve AI and its reliability, and that proves it is possible to teach machines to notice when someone is trying to deceive them. Which, if you think about it, we haven’t yet managed to do with our own unfortunate species.