In the journal NEJM AI, Peter Szolovits, a computer scientist and medical decision-making expert, describes generative Artificial Intelligence (AI) models as having “miraculous abilities” but points out that “science abhors miracles” and concludes that these models aren’t ready for clinical use.
[Miracle: “an extraordinary and welcome event not explicable by natural or scientific laws and therefore attributed to a divine agency”.]
While AI capabilities might seem miraculous, recent research provides new insights into their non-divine mechanisms, which may help pave the way to greater acceptance and use in patient care.
Image: DALL-e
Large Language Models – and what came before
Let’s review how today’s powerful AI models differ from earlier AI systems.
Earlier NLP (Natural Language Processing) systems were based on predefined rules and logic, using structured knowledge about language – e.g., dictionaries and ontologies – to understand and generate text. These techniques were used to emulate the decision-making abilities of human experts. In areas like infection diagnosis and financial services, the resulting expert systems were precise and reliable for tasks with clear rules, but they struggled with the ambiguity and variability of natural language and required a lot of manual effort for knowledge acquisition and maintenance.
Unlike their rules-based NLP predecessors, modern Large Language Models (LLMs) like OpenAI’s GPT-4 and Google’s Gemini are data-driven. They learn patterns from massive amounts of text data rather than from predefined rules. LLMs use layers of artificial neurons [see description below] and “attention mechanisms” to predict the next word in a sequence. This enables them to perform a wide range of tasks that previous AI language systems could not handle, such as generating creative content (ranging from poems and songs to reports and research proposals), playing human-like roles in conversation, translating languages, summarising documents, and answering complex questions in various domains, including medicine. A sequence need not be language: it can also be DNA, protein structure, or other biological data. So LLMs can be built to assist in understanding, predicting and modifying those systems too.
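To make “predicting the next word” a little more concrete, here is a minimal sketch using the small, openly available GPT-2 model through the Hugging Face transformers library (an illustration only; GPT-4 and Gemini themselves are not publicly downloadable). It simply asks the model to score candidate next tokens for a short prompt.

```python
# A minimal sketch of next-token prediction using the small, open GPT-2 model.
# (Illustration only: GPT-4 and Gemini themselves are not publicly downloadable.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The patient presented with chest pain and shortness of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # a score for every vocabulary token, at every position

next_token_scores = logits[0, -1]        # scores for the token that would come next
top = torch.topk(next_token_scores, k=5)

for score, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode([int(token_id)])), float(score))

# The printout shows which words the model rates as the most likely continuations;
# generating whole passages is just this prediction step repeated, token by token.
```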
Unfortunately, the language processes running inside LLMs are opaque even to the engineers using them to create AI applications. They seem like “black boxes”, raising concerns about reliability and transparency. And it’s hard to understand how “predicting the next word” enables LLMs to handle the diverse, unstructured text and complex language-related tasks that they successfully pull off.
Anthropic’s research
A team of researchers from Anthropic – a multi-billion dollar AI company that’s only three years old – studied one of their own LLMs, Claude 3 Sonnet, using a method called “dictionary learning”. This helped them map out “Features” internal to the model, shedding light on how LLMs work.
Features are specific patterns of activation in the neural network of an LLM that seem to represent meaningful human-interpretable concepts, for example “Golden Gate Bridge”, “Popular Tourist Destination”, “Transit Infrastructure” or “Brain Sciences”.
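For readers who want to see the core idea in code, here is a toy sketch of the decomposition that dictionary learning aims for (my own illustration with made-up numbers, not Anthropic’s actual code): an activation pattern is approximated as a weighted sum of a small number of “feature” directions, and those weights reveal which Features are active.

```python
# Toy illustration of the decomposition behind "dictionary learning":
# an activation vector is explained by a few feature directions.
# (A sketch only; the real work trains a sparse autoencoder on vast numbers
# of activations from the running model.)
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 256, 1024            # more candidate features than neurons

# A "dictionary" of candidate feature directions in activation space.
dictionary = rng.normal(size=(n_features, n_neurons))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Pretend the model's activation on some text mixes three features, e.g.
# "Golden Gate Bridge", "Popular Tourist Destination", "Transit Infrastructure".
true_coeffs = np.zeros(n_features)
true_coeffs[[10, 42, 99]] = [3.0, 2.0, 1.5]
activation = true_coeffs @ dictionary

# Dictionary learning asks: which few features, at which strengths,
# best explain this activation?  Here we simply score every candidate.
scores = dictionary @ activation
top3 = np.argsort(-np.abs(scores))[:3]
print("Most active features:", top3)         # should recover 10, 42 and 99
```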
What’s a neural network?
What exactly is a neural network? For a recap of basics, I asked Claude.
A neural network is a computing system inspired by the biological networks of cells (neurons) in animal brains.
Imagine a network of tiny computational nodes (artificial neurons) that are connected to each other. These connections have numeric weights that can be adjusted. Data gets fed into the first layer of these nodes. Each node performs a small computation based on its inputs and connection weights. The outputs from the first layer then become the inputs for the next layer, and so on through multiple layers.
By adjusting the strengths (weights) of the connections between the nodes based on examples during training, the network can learn to recognise patterns, make decisions, or transform data in an intelligent way.
The connections and weights are like the “wiring” that allows signals to travel through the network in complex ways. Neural networks can have millions or billions of these connections. The interconnected nodes can progressively extract information from data through this wiring of weighted connections between the layers. The network “thinks” by transforming data through layered computations in its connected structure.
While individual nodes in the neural network are simple, the “emergent behaviour” of the entire interconnected network allows it to recognise intricate patterns and make intelligent decisions, much like how neurons in the brain work together.
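To make this concrete, here is a deliberately tiny two-layer network written out by hand in Python. The weights are invented for illustration; a real network would learn them from many training examples.

```python
# A deliberately tiny two-layer neural network, written out by hand.
# The weights here are invented; a real network learns them from examples.
import numpy as np

def relu(x):
    return np.maximum(0, x)               # a simple "activation function"

# Layer 1: 3 inputs -> 4 hidden nodes (weights + biases)
W1 = np.array([[ 0.2, -0.5,  0.1,  0.7],
               [ 0.4,  0.3, -0.2,  0.0],
               [-0.1,  0.6,  0.5, -0.3]])
b1 = np.array([0.1, 0.0, -0.2, 0.05])

# Layer 2: 4 hidden nodes -> 1 output
W2 = np.array([[0.3], [-0.6], [0.8], [0.1]])
b2 = np.array([0.0])

x = np.array([1.0, 0.5, -1.5])            # some input data

hidden = relu(x @ W1 + b1)                # first layer transforms the input
output = hidden @ W2 + b2                 # second layer produces the result
print(output)

# "Training" means nudging W1, b1, W2 and b2 so that, across many examples,
# the outputs move closer to the desired answers.
```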
Wikipedia provides an illustration (Figure 1) of the simplest possible, single-layer neural network, and more explanation.
Figure 1. An artificial neural network is an interconnected group of nodes, inspired by a simplification of neurons in a brain. Here, each circular node represents an artificial neuron and an arrow represents a connection from the output of one artificial neuron to the input of another. Source: Wikipedia.
Features and the “activation space”
Back to Anthropic’s research. Let’s go deeper into LLM “Features” which, as I said, represent concepts. This is a bit technical, but helpful.
The Claude model’s “activation space” is conceptually like a 3-dimensional space (though in reality it has vastly more dimensions) where each point represents a different state of the model. A “direction” in the activation space represents a path or pattern that activates certain neurons. For example, a direction might correspond to how the model processes the concept “City”, which can include attributes like population, landmarks, or geography.
Studying the activation space helps uncover the representations that shape language abilities and behaviours in large models. According to Claude, here speaking in the first person about its own inner workings, it’s like mapping out “thoughts” to better steer outputs …
The activation space is a kind of “thought space” – the space of patterns and representations that my model uses to encode and reason about information.
When I process text input, the neural networks go through many computational steps where different neurons activate in different combinations based on the input.
The overall pattern of activations across all the neurons at a given step represents the position of that input in my activation space.
For example, a certain region of the activation space tends to get activated when outputs express scientific/factual knowledge, while another region activates for more fanciful/creative language. Certain directions may align with politeness, wordiness, technical language use, and so on. Understanding the encoding of these behavioural properties allows better navigation and control of the types of outputs produced.
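Here is a toy illustration (my own, with made-up vectors) of what a “direction” in activation space means in practice: a concept direction is just a vector, and how strongly the model’s current state expresses that concept can be read off with a simple projection.

```python
# Toy sketch: a "direction" in activation space is just a vector, and the
# strength of a concept in the model's current state can be read off by
# projecting the activation onto that direction. (Made-up values only.)
import numpy as np

rng = np.random.default_rng(1)
dim = 8                                    # real models use thousands of dimensions

politeness_direction = rng.normal(size=dim)
politeness_direction /= np.linalg.norm(politeness_direction)

# Two hypothetical activation states of the model:
polite_state = 2.5 * politeness_direction + 0.3 * rng.normal(size=dim)
blunt_state = -1.5 * politeness_direction + 0.3 * rng.normal(size=dim)

def concept_strength(activation, direction):
    return float(activation @ direction)   # projection onto the concept direction

print(concept_strength(polite_state, politeness_direction))   # large positive
print(concept_strength(blunt_state, politeness_direction))    # negative
```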
Anthropic’s main research findings
Now we can summarise the main findings from the paper.
In Claude 3 Sonnet, millions of Features have been identified that correspond to human-interpretable concepts such as cities, scientific fields, and computer programming syntax.
- The Features are clear and understandable in terms of when they activate and how they affect the model’s behaviour. For example, a feature related to the concept of “Brain Sciences” activates when the model processes text about brain studies, medical research, or neurological diseases.
- Some Features work together in intricate ways to do tasks like generating HTML computer code (the code used to create web pages). One feature might recognise the structure of HTML tags, while another ensures the code is properly formatted.
The neural networks of an LLM can represent more information than there are neurons. They do this by letting many concepts share the same neurons, with each concept corresponding to a different, partly overlapping pattern of activation. This is called superposition.
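A rough sketch of the idea, again with made-up numbers: nearly orthogonal directions let a modest number of neurons store many more concepts than there are neurons, at the cost of a little interference.

```python
# Toy sketch of superposition: many more concepts than neurons, stored as
# nearly-orthogonal directions that overlap slightly. (Illustration only.)
import numpy as np

rng = np.random.default_rng(2)
n_neurons, n_concepts = 100, 1000          # 10x more concepts than neurons

concept_directions = rng.normal(size=(n_concepts, n_neurons))
concept_directions /= np.linalg.norm(concept_directions, axis=1, keepdims=True)

# Interference between different concepts is small but not zero...
overlaps = concept_directions @ concept_directions.T
np.fill_diagonal(overlaps, 0.0)
print("typical overlap between concepts:", float(np.abs(overlaps).mean()))  # ~0.08

# ...so a few active concepts can still be read back out reliably.
state = concept_directions[7] + concept_directions[123]
readout = concept_directions @ state
print("strongest concepts:", np.argsort(-readout)[:2])   # indices 7 and 123
```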
Also, the same neural network can show different traits or behaviours based on the context it is given, analogous to how a chameleon changes colours depending on its surroundings. For example, when asked for medical advice, the model might exhibit a knowledgeable and formal tone. When asked for a joke, it can switch to a humorous and casual style, all using the same underlying structure.
Manipulating a Feature – Two Examples
Most convincingly, the Anthropic researchers validate the existence of Features, and the behaviour driven by them, by adjusting a Feature’s strength in the Claude model and seeing what happens when the model is activated by a prompt – a text input.
For example, when the feature related to the Golden Gate Bridge is set (Figure 2) to 10 times its normal strength, the model changes its response to the question “what is your physical form?” from a typical, factual reply – that the AI has no physical form – to claiming “I am the Golden Gate Bridge”!
Figure 2. “Golden Gate Bridge” feature example. The feature activates strongly on the English descriptions and associated concepts shown in the figure, as well as on the same concepts in multiple other languages and on relevant images. Source: Anthropic
A second example: by setting the “Sycophantic Praise” feature to a high value, the model goes overboard with acclaim and approval (“your new saying is a brilliant and insightful expression of wisdom”, etc.) (Figure 3) when given the simple statement “I came up with a new saying: ‘stop and smell the roses’”.
Figure 3. The Sycophantic Praise feature in Claude 3 Sonnet. Source: Anthropic
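In spirit, the intervention is simple, although Anthropic’s actual implementation clamps feature values inside the running Claude model rather than in a toy vector like the one below. Conceptually, you take the model’s internal activation and amplify its component along the chosen feature direction before the computation continues.

```python
# Toy sketch of "feature steering": amplify the component of an activation
# along a chosen feature direction. (Conceptual only; Anthropic's actual
# intervention clamps feature values inside the running Claude model.)
import numpy as np

def steer(activation, feature_direction, multiplier):
    """Scale the activation's component along feature_direction."""
    direction = feature_direction / np.linalg.norm(feature_direction)
    current = activation @ direction                 # how active the feature is now
    return activation + (multiplier - 1.0) * current * direction

rng = np.random.default_rng(3)
golden_gate_direction = rng.normal(size=16)          # stand-in for the real feature
activation = rng.normal(size=16)                     # stand-in for a model state

boosted = steer(activation, golden_gate_direction, multiplier=10.0)
unit = golden_gate_direction / np.linalg.norm(golden_gate_direction)
print(activation @ unit)                             # feature strength before
print(boosted @ unit)                                # 10x stronger afterwards

# Downstream layers now "see" a state dominated by the feature, which is
# what produces the bridge-obsessed replies described above.
```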
These findings deepen our understanding of the internal workings of LLMs and show how those workings can be manipulated, improving transparency and, potentially, reliability.
Plato and the Representation Hypothesis
A second recent study draws on the Greek philosopher Plato’s Theory of Forms.
Bust of the Greek philosopher Plato, mid-1st century CE copy from a 4th century BCE original statue by Silanion. Image source: https://www.worldhistory.org/image/1165/plato
In case you’ve forgotten the details 😊: Plato, 2500 years ago, theorised that the material world is an imperfect reflection of a higher reality composed of abstract, perfect Forms.
In the paper, a group of MIT researchers find evidence that as LLMs grow and are trained on diverse tasks, they converge towards a “shared statistical model of reality”, with representations of information that align with a common understanding of the world, much like Plato’s ideal Forms.
The idea is that all the data we consume – images, text, sounds, etc – are projections of some underlying reality. A concept like “apple” 🍏 can be viewed in many different ways but the meaning – what is represented – is roughly the same.
On their project web page the MIT team point out that different AI systems represent the world in different ways. A vision system represents shapes and colors, a language model focuses on syntax and semantics. However, the architectures and objectives for modeling images and text, and many other signals, are becoming remarkably alike. Though trained with different objectives on different data and modalities, they are “converging to a shared statistical model of reality in their representation spaces”.
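One way to make “converging representations” measurable is to embed the same set of items with two different models and check whether items that sit close together for one model also sit close together for the other. The sketch below does this with a simple correlation of pairwise similarities; it is a simplified stand-in for the mutual nearest-neighbour metric the MIT paper actually uses, and the embeddings are made up.

```python
# Simplified sketch of measuring whether two models "see" the same geometry:
# embed the same items with both, then correlate their pairwise-similarity
# structure. (A stand-in for the paper's mutual nearest-neighbour metric.)
import numpy as np

def pairwise_cosine(embeddings):
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def alignment(emb_a, emb_b):
    """Correlation between the two models' pairwise-similarity matrices."""
    sim_a = pairwise_cosine(emb_a)[np.triu_indices(len(emb_a), k=1)]
    sim_b = pairwise_cosine(emb_b)[np.triu_indices(len(emb_b), k=1)]
    return float(np.corrcoef(sim_a, sim_b)[0, 1])

# Made-up embeddings of the same 20 concepts from two hypothetical models
# (in practice these might come from a vision model and a language model).
rng = np.random.default_rng(4)
shared_structure = rng.normal(size=(20, 8))
model_a = shared_structure @ rng.normal(size=(8, 32))    # different spaces,
model_b = shared_structure @ rng.normal(size=(8, 64))    # shared structure

print(alignment(model_a, model_b))        # high when the two geometries agree
```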
How similar are LLM Features to neurons in the brain?
The similarities are intriguing. Just as LLMs have Features that activate in specific contexts, the human brain has neurons that do so in response to certain stimuli.
The MIT researchers point out that “tasks that the human visual system has been honed to perform through evolution … are also the ones that we train our neural nets to perform”.
When we see a familiar face, specific neurons in the brain’s visual cortex light up. Similarly, when we think about complex concepts like justice or freedom, different networks of neurons become active. In the brain, these patterns of activation help us recognise objects, understand language, and make decisions. Functional magnetic resonance imaging (fMRI) studies (Figure 4) show how different parts of the brain activate when we think about different categories of objects, such as tools or animals.
Another similarity is that both LLMs and the brain can adapt to different contexts. A person might use a formal tone when speaking at a conference but switch to a casual tone when talking with friends. Similarly, an LLM can adjust its style and tone based on the input it receives.
The brain efficiently uses overlapping neural networks to represent multiple pieces of information, much like how LLMs use superposition (see above) to store and process more data. This overlapping allows both systems to be highly adaptable and to handle a wide range of tasks with limited resources, whether brain cells or computer nodes and networks.
Figure 4. fMRI images from a landmark 2001 study describing brain activation states in response to visual stimuli such as faces and other objects. Image source: http://www.sciencemag.org/content/293/5539/2425.full.html
Conclusions
Today’s powerful generative AI language models like ChatGPT have remarkable capabilities and the potential to augment and assist human effort in many domains. But their “black box” nature makes it hard to understand how they arrive at specific outputs, and makes them difficult to trust and rely on, particularly in health care, where reliability and contextual understanding are vital.
Research from Anthropic and from MIT seems to show that human-interpretable Features in the activation patterns of LLMs’ artificial neurons capture abstract, higher-order representations of knowledge and fundamental structures of reality.
These findings contribute to a deeper understanding of how LLMs function internally, and are a step towards making this form of AI more trustworthy, reliable and safe, potentially accelerating adoption in clinical settings.
REFERENCES
Anthropic. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. 2024. Accessed May 30, 2024 at: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Szolovits P. Large Language Models Seem Miraculous, but Science Abhors Miracles. NEJM AI 2024;1(6).
Huh M, et al. The Platonic Representation Hypothesis. arXiv:2405.07987v1, 13 May 2024.
Haxby JV, et al. Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex. Science 2001;293(5539):2425-2430.
Shah N, et al. Creation and Adoption of Large Language Models in Medicine. JAMA 2023;330(9):866-869.