Decoding Context: A Layman's Guide to How AI Has Mastered Language and Context

This article has been republished from New World Navigator, our Strategy and AI newsletter, a publication dedicated to helping organisations and individuals navigate disruptive change. Sign up here to receive and participate in thought-provoking discussions, expert insights, and valuable resources.

Cuppa tea anyone? The self-deprecating and dry sense of British humour is an example of the subtle contextual nuance that is present in all languages, and which AI is increasingly able to grasp.

I love the British sense of irony. But when I first moved to the UK nearly twenty years ago, I was sometimes confused by the things my British friends would say in response to situations, or would take them at face value without recognising the underlying meaning.

When served with an obviously burnt piece of toast at a café, one of them remarked, in an uplifting voice, “Just how I like my toast, with a bit of extra … crunch.” Or, following the collapse of a DIY shelf a few moments after its completion, another would contemplate the scene in silence, then respond in a completely deadpan voice, “It’s meant to be a pop-down shelf. Très avant-garde.”

Just a few years ago, no machine would have been able to explain the irony present in those words and situation, much less write a story that perfectly illustrates dry British wit and self-deprecating humour.

The situation is very different today. Large Language Models (LLMs) such as ChatGPT and Claude 2 have proven themselves adept not only at interpreting the contextual nuances that are inherent in any language, but also at generating contextually relevant responses. Rather than responding only to certain keywords like the “dumb” Internet banking chatbots of old, the latest LLM-powered chatbots are increasingly capable of holding coherent conversations with customers.

I've been fascinated by this topic and consider the ability of Artificial Intelligence (AI) to interpret context in language and communication (which I shall shorten to “context” from here on) as nothing less than profound. It should come as no surprise, then, that I have spent more than my fair share of time over the last few months understanding the challenges inherent in getting AIs to understand context and how scientists have managed to overcome them.

This article summarises my research in non-technical, layman’s terms and considers the following questions:

  • Why is context in language and communication (and by extension in society and business) important?

  • Why is context difficult to understand (for both humans and machines)?

  • How do Large Language Models (LLMs) interpret context and respond in a contextually-aware manner?

Context is King: Why Context is Crucial

The importance of context cannot be overstated. Effective communication hinges not just on the words used, but also on the myriad of underlying factors that give those words meaning. A single phrase can convey warmth, indifference, or even hostility, depending on its context. Without this contextual framework, language becomes a mere assembly of words, losing its essence and its power to connect, inspire, and influence.

Imagine a bustling city as the realm of business and society. Effective communication can be likened to roads and pathways, which are essential for the city’s smooth operation. Context is represented by the traffic signals and signs, ensuring everyone moves in harmony, preventing collisions of misunderstanding and ensuring efficient commerce and social interactions.

The consequences of contextual misunderstandings can range from a mere hiccup to a full-blown storm. Here are a few examples:

  • Customer Service: When reviewing the customer service logs for a client a while back, I listened to a number of phone conversations relating to customer complaints. In one of these conversations, a customer had replied “Yeah, it was fine” when asked whether she was satisfied with the service provided. Her tone, however, indicated otherwise, and this was something the agent failed to pick up on, leading to a missed opportunity to resolve the customer’s issue there and then.

  • Email Marketing: Examples of bodged marketing campaigns abound, but Adidas’ faux pas following the Boston Marathon in 2017 really takes the cake for lack of contextual awareness. Runners who completed the race received an email with the subject line, “Congrats, you survived the Boston Marathon!” A nice and supportive gesture under ordinary circumstances, but highly insensitive in this case because it came only a few years after the bombing of the 2013 Boston Marathon, in which three people were killed and over 250 injured.

  • Product Packaging: Gerber, a baby food brand owned by food giant Nestlé, is well known for the iconic baby sketch on the packaging of its products. Following its expansion into the African market, it found that its products were not selling well. The reason? Most products sold in the region feature pictures of their contents on the packaging because many consumers are unable to read. African consumers took one look at Gerber’s baby food and were horrified because they believed the jars to contain ground-up babies!

It’s clear that context is crucially important in all forms of communication. So where does AI fit in?

Gerber’s iconic baby image, present on most of its product packaging, is widely seen as lovable and endearing, but was taken out of context in African markets.

Not A Piece of Cake: AI's Contextual Journey

“These aren’t the sakuras we’re looking for,” said one Lego Stormtrooper to the other (IYKYK). Language is replete with social, cultural, and group references that can make it challenging for humans and AI alike to decipher.

AI systems have gradually become more contextually aware over the last few decades. By being contextually aware, I simply mean that AI has some sort of “understanding” of the world in which we operate, so that it is able to provide responses and feedback in a way that is relevant and useful for us.

This could be as simple as your Internet banking chatbot using your location to point you in the direction of your nearest bank branch, or being able to tell if an image of a painting is upside-down. A more complex example is that of self-driving cars, where the vehicle can “perceive” the roads around it through the lens of thousands of sensors, and is therefore able to navigate around obstacles and hazards.

What has proved more elusive, prior to the last few years, is the ability of AI to unpick the nuances of language. And I’m not talking here about simple AI-based translation, which has been around for a long time, but the genuine ability to interpret what is being said and respond accordingly.

Let’s briefly consider why context is not a piece of cake for a human, let alone a machine, to understand. Take, for instance, the following conversation between two people:

  • Person A: “The sakura forecast looks good for next week.”

  • Person B: “Great, let’s plan the hanami for our class then.”

Many of you will have identified the words sakura and hanami and immediately drawn a link to Japanese culture. (Hanami refers to the traditional custom of enjoying the beauty of cherry blossoms, or sakura.) There is clearly a situational context, with Person A and Person B most likely to be students who are planning a flower viewing session for their classmates. Finally, there is also the implicit understanding that if sakuras are in bloom, then it is most likely to currently be springtime (although in our current climate-crazed world it can be really hard to tell).

Contextual understanding is not something that humans are born with. In fact, most of us are constantly learning and decoding the subtleties of context over our entire lifetimes. It took me many years after I had moved to the UK, for instance, to understand the subtle nuances of British culture and language.

Given the contextual complexities in any language, how then did OpenAI, Google and other LLM developers manage to evolve their models to their current level of sophistication?

Word Vectors: AI’s Contextual Building Blocks

Let’s start with understanding how LLMs “perceive” language.

Humans represent words in various forms, for instance through an alphabetical (e.g., English, French) or logographic system (e.g., Mandarin, Japanese Kanji). LLMs, on the other hand, record each word as a vector, which is a long combination of numbers. For instance, within a model that has been trained on the English version of Wikipedia, the word “home” is represented by a vector containing hundreds of numbers, the first few of which look like this:

[-0.10669317841529846, 0.09448157250881195, -0.12580148875713348, 0.0014460741076618433, 0.0182229895144701, 0.03057095780968666, -0.09417729079723358, 0.07340414822101593, 0.03361644595861435, …]

Why are vectors used to represent words in the AI world, you might ask. A vector is simply a set of numbers that indicates coordinates within a space. Within the geographic coordinate system, for instance, the location of London, UK can be represented as [51.509865, -0.118092], reflecting the two dimensions of latitude and longitude.

Simply by comparing the vector coordinates, one can easily understand the geographic relationship between London, UK and Paris, France. Similarly, a machine can understand how close certain words are in meaning and context to each other by calculating how close their word vectors are. For example, the word vectors representing “home” and “house” will be quite similar to each other, but quite different to the word vector representing “dagger”.
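For the more curious among you, here is a tiny sketch (in Python) of how that “closeness” between word vectors can be measured. The four-dimensional vectors below are entirely made up for illustration; real word vectors contain hundreds or thousands of learned numbers.

```python
import numpy as np

def cosine_similarity(a, b):
    # Measures how closely two word vectors point in the same direction (1.0 = identical).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Tiny, made-up four-dimensional "word vectors" purely for illustration;
# real LLM embeddings have hundreds or thousands of dimensions.
home   = np.array([0.8, 0.1, 0.6, 0.2])
house  = np.array([0.7, 0.2, 0.5, 0.3])
dagger = np.array([-0.4, 0.9, -0.1, 0.5])

print(cosine_similarity(home, house))   # high value: similar meaning
print(cosine_similarity(home, dagger))  # low (even negative) value: unrelated meaning
```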

You will no doubt have noticed that the word vector representing “home” contains many more numbers than just the two used in the geographic coordinate system. The reason is that each word is described by its coordinates in a multi-dimensional imaginary space. This “space” is multi-dimensional because of the complexity of contextual relationships. Very crudely speaking, one dimension might represent the fact that a “home” and “house” are both nouns, a second might denote them as a place one lives and sleeps in, while a third and fourth dimension might differentiate between the “home” as an abode for a family as compared to the brick-and-mortar connotations associated with the word “house”. OpenAI’s ChatGPT uses word vectors comprising 12,288 dimensions, meaning that each word is represented by a whopping 12,288 numbers!

When an LLM comes across a new word, it will aim to place it within this multi-dimensional space. Similarly, when it identifies a new concept, it will apply this learning across similar word vectors. For instance, suppose an LLM learns that “flats” are typically one, two, or three bedrooms in size and are typically inhabited by small families. It will then apply that same learning to “apartments” because both words are likely to be represented by word vectors that are close to each other.

However, despite their richness and complexity, word vectors are really no more than a “database” of words, a vocabulary for machines if you will. How then do LLMs read and write? The answer lies in Transformer technology, and I’m not talking here about Optimus Prime and his Autobots (IYKYK).

Transformers: Robots in Disguise

In 2017, eight scientists working at Google released a ground-breaking paper called “Attention is All You Need” introducing a new artificial neural network structure (aka a type of AI Deep Learning model structure) called the Transformer. (The significance of the paper’s name will soon be evident, but for now hold your horses!)

A typical LLM, such as ChatGPT, is made up of layers upon layers of Transformers. Each layer of the Transformer “reads” word vectors, processes them (i.e., considers their meaning), and outputs its predictions (i.e., the response). This prediction is then fed into another Transformer layer which again processes the inputs and generates a prediction. Each time a sentence or block of text gets passed through a Transformer layer, more and more context is added and interpreted, thereby gradually improving the quality of the final output.

Take the example of the sentence, “Emily has been staring in awe at the weathered trunk of this ancient tree for the …” An LLM will start by converting all of the words in the sentence into word vectors. It will then combine those word vectors into a representation that together holds all of the information it has learned so far about the input sentence.

In the first layer, it might identify that Emily, the trunk, and the tree are nouns. The next layer might go on to recognise “staring” as a verb, that Emily is a person, and that she is the one taking that action. The third layer could associate “tree” with “trunk” so that the latter is interpreted to be a tree trunk rather than some other possible meaning of “trunk”, such as the hold area of a car (if you’re an American!) or the nose of an elephant. This goes on and on – for up to 96 layers in the case of ChatGPT-3.5 – until the final predicted response may be something like, “Emily has been staring in awe at the weathered trunk of this ancient tree for the last five minutes.”
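If you are curious about what “layers upon layers” looks like in code, here is a deliberately simplified sketch. The mixing and refining steps are crude stand-ins for the real attention and feed-forward machinery described later in this article, and every number is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_layer(word_vectors, mix, refine):
    # A stand-in for one Transformer layer: every word first blends in information
    # from the other words (a crude stand-in for attention), then each blended
    # vector is transformed on its own (a crude stand-in for the feed-forward network).
    blended = mix @ word_vectors          # each row becomes a weighted blend of all words
    return np.tanh(blended @ refine)      # per-word refinement with a simple non-linearity

# A four-word "sentence", each word as a tiny 8-dimensional vector (values made up).
sentence = rng.normal(size=(4, 8))
mix = np.full((4, 4), 0.25)               # crude "pay equal attention to every word"
refine = rng.normal(size=(8, 8)) * 0.5

for _ in range(3):                        # real models stack dozens of such layers
    sentence = toy_layer(sentence, mix, refine)

print(sentence.shape)                     # still one vector per word, now enriched with context
```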

The initial layers of an LLM typically focus on basic contextual understanding, as we have shown above with Emily and her tree, while deeper layers focus on interpreting the context across entire blocks of text. Let’s say the sentence about Emily is from a short story. As the model “reads” through the short story, it does the machine equivalent of writing notes to itself about the context associated with the word “Emily”.

By the time a block of text has been through 50 layers, the LLM will have associated “Emily” not just with being female, but also with other attributes it has learnt from the story (e.g., blonde, single mother to a boy, lives in Manchester in England, dreams of visiting the big national parks in the US, etc.).

Attention Please, Attention Please

Artificial Neural Networks (ANNs) are a type of AI that has been used for language-related tasks such as translation for many years now. However, until the advent of the Transformer, scientists struggled to get around a particular problem, that of ANNs having “bad memory”.

We won’t delve into the mechanics of why this happens, but suffice it to say that the “forgetfulness” suffered by ANNs is akin to the problem that we humans face when reading large blocks of text, in the sense that the most recently read portions tend to be more clearly remembered than portions of text that were reviewed much earlier on. When it comes to context, ANNs deem more recently “read” texts to be the most relevant, which is frequently not the case.

What makes Transformers different from other such models is that they have a much “better memory” which is accomplished using two mechanisms:

  • Attention Mechanism: this part of the Transformer decides what is important for the model to focus on, and by extension, what is safe to ignore.

  • Feed-Forward Network: this part of the Transformer analyses and refines the input from the attention mechanism and creates a prediction or response.

Let’s consider the Attention Mechanism and the Feed-Forward Network using the metaphor of a detective solving a mystery.

Attention Mechanism

The detective first needs to act as a field agent to gather all possible clues and decide which ones are the most important for solving the case. He sifts through a lot of information and narrows it down to a few key pieces of evidence that are most likely to lead to the perpetrator.

The Attention Mechanism does a similar thing. It scans all parts of the input data and decides which parts are more important based on the context. Once it does so, the Transformer then “clues in” (pun very much intended!) to the more important parts while tuning out the less crucial ones.

The Attention Mechanism does this by analysing each word on a word-by-word basis, then identifying other words that are relevant to the current word being analysed. It does so by creating a list of questions that it has about each word (known as the “query vector”) and another list of characteristics that describe its context (known as the “key vector”). When the query vector of one word matches up with the key vector of another word, they are paired up and information is swapped. You can liken “information swapping” to the AI “making notes” about the context it has learnt for each word, which it can refer to in the future when the same word is used again.

In our sentence from before, “Emily has been staring in awe at the weathered trunk of this ancient tree”, the model might identify that there is ambiguity related to the word “trunk”. The query vector for “trunk” might indicate that it is looking for a noun associated with one of the following: a) the woody stem of a tree, b) the hold area of a car, or c) the nose of an elephant. This query vector seeks out and is paired with the key vector of the word “tree”, and the information about “tree” is shared with that of “trunk” to confirm that the latter is indeed referring to the trunk of a tree.

The Attention Mechanism therefore addresses the “forgetfulness” problem by helping the model to pay attention to the right words at the right time (as opposed to focusing only on the most recently “read” words).
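For readers who would like to see this query-and-key matching in action, here is a toy sketch. The vectors for “Emily”, “trunk” and “tree” are invented purely for illustration; the point is simply that “trunk” ends up paying the most attention to “tree”.

```python
import numpy as np

def attention_weights(queries, keys):
    # Score how strongly each word's question (query) matches every other word's
    # description (key); a higher score means pay more attention to that word.
    scores = queries @ keys.T / np.sqrt(queries.shape[1])      # scaled dot products
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)  # softmax per word

# Made-up four-dimensional query and key vectors for three words.
words   = ["Emily", "trunk", "tree"]
queries = np.array([[0.2, 0.1, 0.0, 0.3],
                    [0.1, 0.9, 0.8, 0.0],    # "trunk" asking: which sense of trunk am I?
                    [0.0, 0.3, 0.2, 0.1]])
keys    = np.array([[0.3, 0.0, 0.1, 0.4],
                    [0.1, 0.4, 0.3, 0.0],
                    [0.0, 0.9, 0.9, 0.1]])   # "tree" describes itself as very tree-like

weights = attention_weights(queries, keys)
print(dict(zip(words, np.round(weights[1], 2))))  # "trunk" attends most strongly to "tree"
```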

Feed-Forward Network

Now, let’s say that the key clues are identified and handed over to a forensic analyst. The analyst doesn't decide what's important—that's already been done. But he or she performs specialised tests on the clues to extract more information, fine-tune the details, or perhaps discover something entirely new that wasn't readily apparent at first glance.

The Feed-Forward Network (FFN) essentially works as the forensic analyst for the Transformer, refining the data that it is processing. It plays a critical role in deepening the understanding of the Attention Mechanism’s selections, refining them, and even potentially revealing new insights.

The FFN is a type of Artificial Neural Network (ANN) which mimics the structure of the human brain and is depicted in the diagram below. As with the Attention Mechanism, the FFN is structured as a series of layers. Each circle in the diagram represents a neuron in the “brain” of the FFN. Each word vector enters this “brain” at the input layer of the FFN and is then fed forward (hence the name) into the hidden layers, before emerging from the output layer.

The Feed Forward Network is an artificial neural network that has layers upon layers of neurons, similar to a human brain. The machine is able to process data and “think” as information is passed forward from one layer to another.

The magic happens in the hidden layers of the FFN. In the human brain, thoughts occur when connections are made between different neurons. The same thing happens in the hidden layers of the FFN, where word vectors flow through layer by layer and are processed. As we’ve explained in our example with Emily and her tree, the “thoughts” of an LLM gain complexity through the layers. The first layer might do a basic job such as identifying whether a word is a verb or noun. Deeper layers will home in on more advanced concepts or contextual implications, for instance understanding that Emily lives in Manchester, England, and drawing the connection that, since she dreams of visiting the big national parks in the US, she will likely need to fly to get between the two places.

As with the human brain, the denser the neural connections and the larger their number, the more powerful the processing capability. To provide context, there are 12,288 input values for each word vector (reflecting its 12,288 dimensions), 49,152 neurons in the hidden layer, and 12,288 neurons in the output layer for ChatGPT-3.5 (the free version of OpenAI’s model). There are 96 such layers within ChatGPT’s FFN. Doing the math, this means that there are an astounding 49,152 x 12,288 x 2 x 96 or 116 BILLION connections (known as “weight parameters”) within ChatGPT’s FFN.
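If you would like to check that arithmetic yourself, here it is written out as a short snippet (using the layer sizes quoted above for ChatGPT-3.5):

```python
d_model  = 12_288   # size of each word vector (input and output of the FFN)
d_hidden = 49_152   # neurons in the FFN's hidden layer
layers   = 96       # number of Transformer layers in the model

# Each FFN has one weight matrix going into the hidden layer (d_model x d_hidden)
# and one coming back out (d_hidden x d_model), hence the factor of 2.
ffn_weights = d_model * d_hidden * 2 * layers
print(f"{ffn_weights:,}")   # 115,964,116,992 – roughly 116 billion
```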

It’s worth noting, however, that ChatGPT’s total of 175 BILLION parameters remains a far cry from the 100 TRILLION connections in the typical human brain.

What’s Next?

Since OpenAI opened the proverbial Pandora’s Box in late 2022 with the public launch of ChatGPT, LLMs have developed at a ferocious pace, with remarkable advances made within a matter of months in their ability to understand and interpret the context in language and data.

When it comes to LLMs and context, here are a few key trends to look out for, which we will look to explore in more detail in subsequent issues of the newsletter:

  • Expanded context windows: Early versions of ChatGPT and Google Bard had relatively narrow context windows, in the sense that the models could only process small blocks of input text. This has since changed with LLMs such as Claude 2 now able to process blocks of text up to the length of a short novel.

  • Multimodality: LLMs these days are not only text-to-text capable but also have text-to-image, and in some cases, image-to-text capabilities. As new modalities (e.g., voice, video) become available, LLMs’ contextual awareness of our world and their ability to interact with humans will advance even further.

  • Finetuning: LLMs are limited by the data that they were initially trained on. Many organisations are now investing in domain-specific LLMs that are able to provide more contextually-relevant and specialised outputs for certain industries or fields. One example is Bloomberg’s BloombergGPT, which is a finance-specific LLM.

  • Contextual augmentation: Finetuning is expensive and time-consuming. Organisations are therefore exploring cheaper and less computationally intensive ways of helping LLMs to better understand their internal context and knowledge base. One example is the use of vector databases; a simplified sketch of the idea follows this list.
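To give a flavour of the vector database idea, here is a toy sketch with made-up documents and numbers: an organisation’s internal documents are stored as vectors, the customer’s question is converted into a vector too, and the closest-matching document is retrieved and handed to the LLM as extra context.

```python
import numpy as np

def cosine_similarity(a, b):
    # The same "closeness" measure we used for word vectors earlier in the article.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# A toy "vector database": each internal document is stored alongside an
# embedding vector (the documents and numbers here are invented for illustration).
knowledge_base = {
    "Refund policy":    np.array([0.9, 0.1, 0.2]),
    "Opening hours":    np.array([0.1, 0.8, 0.3]),
    "Branch locations": np.array([0.2, 0.3, 0.9]),
}

# Embed the customer's question (again, made-up numbers), retrieve the closest
# document, and pass it to the LLM alongside the question as extra context.
question_vector = np.array([0.85, 0.15, 0.25])
best_match = max(knowledge_base,
                 key=lambda doc: cosine_similarity(question_vector, knowledge_base[doc]))
print(best_match)  # "Refund policy"
```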
