AlphaGo: The AI That Made a Move No Human Ever Would (and what it means for AI’s future)
This article has been republished from New World Navigator, our Strategy and AI newsletter, a publication dedicated to helping organisations and individuals navigate disruptive change.
Like many in the AI community, I followed the OpenAI saga last week with a mixture of fascination and disbelief as events unfolded. In the end, we came full circle: Sam Altman was sacked as CEO, then agreed to head up an advanced AI research team at Microsoft, before reentering negotiations with OpenAI and being reinstated as CEO just a few days later. Reality is sometimes stranger than fiction, and I have no doubt that there’s a movie adaptation already in the works!
Just as interesting to me, if not more so, was the revelation by Reuters that Altman’s ouster was in part the result of a letter from several staff researchers to OpenAI’s board of directors, warning of a new AI project codenamed Q* (pronounced Q-Star) that could potentially “threaten humanity”. Already the AI community is heaving with theories as to how Q* might work.
This week’s edition of New World Navigator explores what I see as one of the more convincing theories about the technological underpinnings of Q*. The theory is based on recent research announcements and papers from OpenAI, and has echoes of the techniques employed by DeepMind’s AlphaGo when it defeated top Go professional Lee Sedol in 2016.
Status Quo: Limitations of Large Language Models (LLMs)
Before delving into Q*, let’s consider the limitations of existing Large Language Models (LLMs) such as ChatGPT-4 and, more broadly, the current generation of Generative AI technologies. This will set the context for how Q* could potentially overcome these challenges.
#1 Data Dependency
LLMs are highly dependent on their initial training data. Their knowledge and capabilities are largely limited to what is present in the original data set. This leads to two potential issues.
The first is bias and a lack of generalisability. We humans are inherently biased, which means that LLM training data is biased too. Unsurprisingly, this leads to biased model outputs. Training datasets are also seldom truly representative of the real world (e.g., trained primarily on data from developed countries), resulting in LLMs that struggle to generalise to less familiar situations (e.g., when applied to social contexts in developing countries).
The second is the inability to be genuinely creative and inventive. LLMs can create a haiku about New York living, while image generation models are able to blend the styles of Picasso and Andy Warhol for a new painting. Yet they are unable to develop entirely new creations or achieve leaps in logic that are beyond the realm of their training data. This is one reason why LLMs tend to be bad at math problems, which require the ability to reason.
#2 Static Knowledge Base
LLMs have a fixed knowledge base that is difficult to update following initial training, which means they can become gradually outdated. In a nutshell, they are unable to automatically and intuitively learn as humans do. While it is possible to update the knowledge cutoff dates for LLMs (ChatGPT-3.5 and ChatGPT-4 were initially trained on data up to September 2021, but were subsequently updated to January 2022, and more recently to April 2023), this process is costly and requires expertise, making it impractical for most organisations and models.
#3 Contextual Understanding
LLMs are good at understanding text and responding in a human-like manner. However, they often struggle with understanding the deeper context or intent behind queries, especially if it is complex or highly specific. This gap is starting to close with the increasing “intelligence” of models such as ChatGPT-4 and multimodal capabilities that allow LLMs to “perceive” the world around them through vision and voice. That being said, heavy and iterative prompting is often required to enable LLMs to grasp the nature of multilayered instructions and complex circumstances.
Bridging the Gap: Combining Learning and Search
LLMs in their current incarnation already represent a giant leap forward for AI. Yet at their core, LLM responses are in many ways imitations derived from their training data.
LLMs can be compared to some of our peers in school who were able to ace exams solely through rote learning (or, simply put, memorising the entire textbook!). When asked to explain and expand upon ideas and themes, however, these students normally came up short. The lesson here is that imitation ≠ reasoning.
For LLMs to reach Artificial General Intelligence (AGI) and “surpass humans in most economically valuable tasks” (OpenAI’s definition of AGI), they must be able to genuinely reason and make leaps of logic, not merely imitate what humans have already done. Computer scientists theorise that this will require LLMs to incorporate both “Learning” and “Search” capabilities.
By Learning, computer scientists mean that models should be able to experience the environment(s) in which they operate (and ultimately the world around them), take actions within that environment to test and make sense of it, and incorporate what they learn to inform future decisions.
By Search, we are referring to models that are capable of adopting a structured process of exploring and evaluating potential actions or solutions to a problem. In other words, they are able to systematically and logically explore problems, rather than using a “brute force” approach of blindly attempting every single option.
When you think about it, Learning and Search are ultimately how humans perceive the world around us and improve our decision-making over time.
Glimpse of the Future: AlphaGo and Move 37
The combination of Learning and Search is not the stuff of science fiction. In fact, it has, in a limited sense, already been achieved.
These capabilities were on display in 2016 when DeepMind’s AlphaGo defeated one of the world’s best Go players, Lee Sedol, and in doing so exhibited signs of true intelligence and creativity. The famous move 37 from the second game of their five-game match (which AlphaGo won 4-1) was said to be a move that “no human would’ve ever made”.
AlphaGo’s triumph was not simply the result of a machine being able to calculate many more moves than a human can. The number of possible board positions in Go is on the order of 10 to the power of 170 (that is, a 1 followed by 170 zeros!), which dwarfs the estimated number of atoms in the observable universe (roughly 10 to the power of 80). Any attempt to perform “brute force” calculations of every possible action would be doomed to failure.
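To get a feel for that scale, here is a back-of-the-envelope check in Python. It computes a simple upper bound rather than the exact count of legal positions: each of the 361 points on the board is treated as empty, black or white, with no regard for the rules. The figure of 10 to the power of 80 for atoms is a commonly quoted estimate and is an assumption of this sketch.

```python
# Rough upper bound on Go board configurations (ignores legality of positions).
BOARD_POINTS = 19 * 19   # 361 intersections on a standard board
STATES_PER_POINT = 3     # empty, black stone, or white stone

upper_bound = STATES_PER_POINT ** BOARD_POINTS  # exact big integer
atoms_estimate = 10 ** 80  # assumed order of magnitude for atoms in the observable universe

digits = len(str(upper_bound))
print(f"Upper bound on configurations: roughly 10^{digits - 1}")
print(f"Larger than the atom estimate: {upper_bound > atoms_estimate}")
```

The true number of legal positions is smaller than this bound, but it is still far beyond anything a computer could enumerate one by one.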
Instead, AlphaGo’s move 37 adopted a very human approach to problem solving through the use of three components:
#1 Policy Neural Network (PNN): a neural network that was first trained on large quantities of historical human games and then refined through Reinforcement Learning (see below). It identifies a shortlist of high-quality moves at each point in the game.
#2 Monte Carlo Tree Search (MCTS): a type of search algorithm that takes the shortlist of high-quality moves selected by the PNN, and “thinks through” each of them by simulating possible sequences of follow-on moves from the current position.
#3 Value Neural Network (VNN): another model, trained on games generated through Reinforcement Learning self-play, that evaluates the board and predicts the likely winner from a given position. Its assessments are combined with the MCTS simulation results to decide on the most promising course of action.
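The interplay of these three components can be illustrated with a toy sketch. To be clear, this is not AlphaGo’s code: it uses a tiny stone-taking game instead of Go, a placeholder `shortlist` function instead of a trained policy network, and averaged random playouts instead of a value network and a full tree search (a simpler technique sometimes called flat Monte Carlo search). All function names are made up for the illustration.

```python
import random

# Toy game: players alternate taking 1-3 stones; whoever takes the last stone wins.

def legal_moves(stones):
    """A player may take 1, 2 or 3 stones, but never more than remain."""
    return [n for n in (1, 2, 3) if n <= stones]

def shortlist(stones):
    """Stand-in for a policy network: here it simply returns every legal move."""
    return legal_moves(stones)

def rollout(stones, my_turn, rng):
    """Play random moves to the end; return 1 if 'we' take the last stone, else 0."""
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return 1 if my_turn else 0
        my_turn = not my_turn

def estimate_value(stones_after_move, rng, simulations):
    """Stand-in for a value network: average result of many random playouts."""
    if stones_after_move == 0:
        return 1.0  # we just took the last stone and won
    wins = sum(rollout(stones_after_move, False, rng) for _ in range(simulations))
    return wins / simulations

def choose_move(stones, simulations=2000, seed=0):
    """Score each shortlisted move by simulation and pick the most promising one."""
    rng = random.Random(seed)
    scores = {m: estimate_value(stones - m, rng, simulations) for m in shortlist(stones)}
    return max(scores, key=scores.get), scores

best, scores = choose_move(5)
print(best, scores)
```

With five stones on the table, the simulations favour taking one stone, which leaves the opponent in the weakest position. AlphaGo replaces each placeholder here with something far more powerful, but the division of labour (propose, simulate, evaluate) is the same.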
Q*: The Clue is in the Name
So where and how does Q* fit into all of this?
A leading theory whose proponents include luminaries such as Nvidia’s AI Senior Research Scientist, Jim Fan, and AI2’s Nathan Lambert, posits that Q*, like AlphaGo, could potentially be based around the concept of combining Learning and Search.
According to adherents of this theory, the name “Q*” could be a reference to Q-Learning, which is a Machine Learning (ML) approach, and the A* search algorithm, which is a computer science technique for optimising decision-making.
Q-Learning
This is a method of Reinforcement Learning, a technique for teaching computers to learn by “rewarding” intended behaviours and sometimes “punishing” unintended ones, just as you would train a young pup. The “rewards” and “punishments” are metaphorical of course – no bots are harmed as part of this process!
Q-Learning can be explained using the following analogy. Let’s say you are a pastry chef looking to perfect your recipe for pistachio financier (I have a massive craving for this at the moment!). Q-Learning is a way for you to develop that optimal recipe through (efficient) trial and error.
You start by assigning a value to each possible ingredient and baking time. For example, you might give a high value to using high-quality pistachio and a low value to using too much sugar. Initially, these values might be random or they might be informed by past experience.
Next you start baking your pistachio financiers and evaluating them based on their taste and texture. After each try, you update your cheat sheet (known as the Q-Table in computer science speak) with the values assigned to the ingredients and baking times based on your evaluations. For example, if you bake a batch of pistachio financiers that are too sweet, you decrease the value of sugar. If you bake a batch that is too dry, you increase the value of baking time.
Over time, the values of the ingredients and baking times will converge on the optimal values, which are the values that will lead you to the most delicious pistachio financiers. Once you have learned the optimal values (i.e., your final Q-Table), you can use them to bake the perfect pistachio financier every time (or give the French a run for their money!).
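For readers who want to see the “cheat sheet” idea in code, below is a minimal sketch of tabular Q-Learning. Instead of a kitchen, it uses a made-up environment: five positions on a line, where the agent starts at position 0 and earns a reward of 1 for reaching position 4. The environment, the parameter values and the function names are all assumptions chosen for illustration.

```python
import random

N_STATES = 5       # positions 0..4 on a line; position 4 is the goal
ACTIONS = (0, 1)   # 0 = step left, 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

def step(state, action):
    """Move one position; reaching the goal pays a reward of 1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0, 0.0] for _ in range(N_STATES)]  # the "cheat sheet"
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)  # occasionally try something new
            else:
                best = max(q_table[state])    # otherwise follow the cheat sheet
                action = rng.choice([a for a in ACTIONS if q_table[state][a] == best])
            next_state, reward, done = step(state, action)
            # Nudge the stored value towards: reward now + discounted best value later.
            target = reward + (0.0 if done else GAMMA * max(q_table[next_state]))
            q_table[state][action] += ALPHA * (target - q_table[state][action])
            state = next_state
    return q_table

q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)  # the preferred action in each non-goal position
```

After training, the table prefers stepping right from every position, which is the shortest route to the reward. The discount factor `GAMMA` is what lets rewards far in the future influence decisions made now.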
A* Search Algorithm
The A* search algorithm is a technique used in computer science (and frequently in computer games) for finding the shortest path between two points.
Think of it as being in a complex maze. One way to find your way out of the maze is to use a “brute force” approach where you go down each and every single path until you find the exit. A more efficient approach is to employ educated guesses about which paths to try and how far to go down each path before you backtrack.
These educated guesses are called heuristics, and they are essentially decision-making shortcuts. For instance, when travelling to a place you’ve never been to, most of us will employ the heuristic that motorways (or “highways” for those of you in the US!) will likely be the fastest way to get there. Similarly, the A* search algorithm uses a heuristic, an estimate of how far each candidate path still is from the goal, to decide which paths to explore first.
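The maze idea can be sketched in a few lines of Python. This is a minimal illustration and says nothing about how Q* itself works: the grid, the `a_star` function name and the choice of Manhattan distance (rows plus columns still to travel) as the heuristic are all assumptions made for the example.

```python
import heapq

MAZE = [
    "S.#.",
    ".##.",
    "...G",
]  # '#' is a wall; S = start at (0, 0); G = goal at (2, 3)

def a_star(grid, start, goal):
    """Return the shortest path from start to goal as a list of (row, col) cells."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Educated guess: steps remaining if there were no walls in the way.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    came_from = {start: None}
    cost_so_far = {start: 0}
    while frontier:
        _, cost, current = heapq.heappop(frontier)  # most promising cell first
        if current == goal:
            path = []
            while current is not None:  # walk back to the start
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if cost > cost_so_far[current]:
            continue  # stale queue entry; a cheaper route was already found
        r, c = current
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < cost_so_far.get(nxt, float("inf")):
                    cost_so_far[nxt] = new_cost
                    came_from[nxt] = current
                    heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # no route exists

path = a_star(MAZE, (0, 0), (2, 3))
print(path)
```

The priority queue always expands the cell whose cost so far plus estimated remaining distance is lowest, so promising routes are explored before unpromising ones rather than trying every corridor in turn.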
It’s worth noting that while Q*’s use of the A* search algorithm is very much a theory, what is less in doubt is the hypothesis that OpenAI’s secret project involves melding Learning with a Search algorithm of some sort. Other candidate search algorithms for Q* include Model Predictive Control (MPC)[1] and Monte Carlo Tree Search (MCTS)[2].
Implications for the Near Future
What do Q* and, by extension, the combination of Learning and Search mean for LLMs and other AI models of the future?
#1 Dynamic Learning
The use of Q-Learning, and Reinforcement Learning techniques more broadly, means that LLMs could gain the ability to continuously and automatically learn and adapt based on new data and interactions, reducing the need for humans to actively train them.
The models will also become a lot more efficient in how they learn about the world around them. No longer will we need to feed billions of datapoints for a model to gain a rough understanding of our context. Just as a child is able to form generalisations about the world based on only one or two examples, the combination of Learning and Search foreshadows AIs that will be able to independently formulate and test hypotheses about their environments with only limited datapoints to work with.
#2 Far Sightedness
The use of search algorithms means that AIs of the future may be able to peer a lot more systematically into the future than we humans can, and may also be able to analyse probable outcomes with a high degree of accuracy.
When combined with Q-Learning, which enables an AI to consider the potential long-term consequences of its actions, humans may be able to employ the models of the near future as decision-making partners, allowing us to consult the AI on the quality and outcomes of potential decisions.
In the farther future, AIs may “evolve” to take on a decades-long, if not centuries-long, world view. This is in stark contrast to us humans, who can be notoriously short-term in our decisions (just think climate change, addictive behaviours, putting off exercise etc.)!
#3 Goal Orientation
In contrast to the LLMs of today which are fairly general purpose in nature, Q-Learning models are all about solving specific problems and achieving specific goals.
A key implication is that Q* might be able to take on more of an “agent” role in interactions, meaning that users can articulate objectives and the AI will be able to identify and assess possible paths before deciding on the best course of action and setting out to independently solve the problem.
Think everything from having a helpful bot that automatically selects groceries to stock your fridge when it runs low, to a supercomputer that is tasked with solving the climate change crisis.
#4 Genuine Discovery
Today, AI models are already being employed extensively by scientists, for instance for novel drug discovery. Scientists’ current use of AI however is more akin to that of a research assistant – carrying out the ‘grunt work’ of data collection, number crunching, writing research papers etc.
Just as AlphaGo was able to identify a move that no human would have made, it is very possible that models combining Learning and Search might, within the next few years, have the capability to make genuine scientific and mathematical discoveries, potentially even postulating theories of their own.
Killer Robots in the Making?
It’s hard to reconcile whispers of the potentially humanity-threatening nature of Q* with the revelation that it has been able to ace “grade school math”.
Don’t get me wrong, if correct, Q* and other similar models will be a very significant achievement and will mark the first steps of AI toward being able to genuinely reason. In my humble opinion, however, it’s too early and too much of a leap of logic to associate Q* with killer robots.
Mark Riedl, a computer science professor at Georgia Tech, for instance, has stated that there’s “no evidence that suggests that large language models or any other technology under development at OpenAI are on a path to AGI [referring to the concept of a ‘superhuman’ Artificial General Intelligence] or any of the doom scenarios.”
From where I stand, the risk with technologies such as Q*, at least in the near term, is their ability to be used for malicious ends. More concerning, therefore, are rumours that Q* has managed to crack AES-192 encryption, which is used to secure US government “Top Secret” documents.
Both malicious state (e.g., North Korea) and non-state (e.g., terrorist, anarchist groups) actors are almost certainly exploring such use cases, and my fear is that control measures, whether in the form of developer-imposed safeguards or broader regulation, appear to be severely lagging the advancements in AI capabilities.
What do you think?
Notes:
[1] Model Predictive Control (MPC) is an optimisation algorithm that predicts the future behaviour of a system and then optimises the system's control inputs to achieve a desired outcome. It is often used in real-time control systems, such as robotics and process control.
[2] Monte Carlo Tree Search (MCTS) is a probabilistic search algorithm that combines random sampling and tree-based search (i.e., evaluating the branches of a “tree” of potential paths) to find promising solutions in complex problems. It is particularly well-suited for problems that involve uncertainty or incomplete information, such as game playing and optimisation, and was part of AlphaGo’s programming.