When Small is Beautiful: How Small Language Models (SLM) Could Help Democratise AI

This article has been republished from New World Navigator, our Strategy and AI newsletter, a publication dedicated to helping organisations and individuals navigate disruptive change.

The near future could see organisations develop and deploy hosts of Small Language Models (SLMs) in specialist roles, collectively carrying out complex tasks more efficiently and effectively than their larger counterparts, Large Language Models (LLMs).

Generative AI has unquestionably been this year’s breakout technology, with 2023 marking its breakthrough into the mainstream. This period has witnessed an unprecedented focus on Large Language Models (LLMs), characterised by their expansive, general-purpose capacities and increasingly sophisticated functionalities.

Even prior to the watershed public release of OpenAI’s ChatGPT (powered by GPT-3.5) in November 2022, the popular view of language models has always been that "bigger is invariably better", with each subsequent model release by the leading players bringing a 5-10X increase in the model’s size as measured by its number of parameters (more to come later on what parameters are).


With a lot less fanfare, but of no less import, this year has also started to see the public emergence of small but high-performance models. Despite lower parameter counts, many of these Small Language Models (SLMs) are punching far above their weight. Microsoft’s Phi-2, for instance, surpasses the performance of the larger Mistral 7B and Llama-2 13B models on a wide range of tasks, and achieves better performance than the 25X larger Llama-2 70B on multi-step reasoning tasks such as coding and math (note: the suffix indicates the number of parameters in the model, so that “Llama-2 70B” denotes the 70 billion parameter version of the Llama-2 model).

The rise of SLMs has huge implications for the democratisation of AI. They could allow small businesses and even individuals to finetune and develop their own multitude of task-specific models, while being small enough to provide localised intelligence for mobile devices, appliances and machinery that lack the internet connectivity needed to reach online models such as OpenAI’s ChatGPT.

Let’s dive into SLMs and understand how these models, by being more affordable, efficient, and less resource-hungry, are changing the narrative, showing that in the world of AI, small can indeed be beautiful. While my take on this topic is largely positive, we’ll also discuss potential challenges and dangers as they relate to SLMs.

What are Small Language Models (SLMs)?

SLMs, like their heftier counterparts, LLMs, are AI models that have been trained on large text datasets and are able to carry out a variety of text-related tasks such as writing, summarisation, translation, general Q&A, coding and, to some extent, math (check out this post if you’re interested in learning more about what makes language models tick and how they have learnt to understand context).

In 2023, researchers experimented with and released hundreds of SLMs, mostly open source in nature, including Alpaca, Llama 7B, Vicuna 7B and Mistral 7B. More recently, some of the big guns in AI have also released “lighter” models, including Microsoft with Phi-2 (2.7 billion parameters) and Google with Gemini Nano-1 (1.8 billion parameters) and Nano-2 (3.25 billion parameters).

Size is obviously the main difference between an SLM and an LLM. But how is size determined? The size of a language model, and more broadly an AI model, is measured by the number of parameters that it has. Many modern AI models are artificial neural networks (ANNs), which are modelled on the structure of the human brain. Parameters in such models can therefore be likened to neurons in the brain. In general, the larger the number of neurons, the more intelligent an entity. This is one reason why human brains (with ~16 billion neurons in the cerebral cortex) have more reasoning capacity than those of gorillas (~9 billion cortical neurons), for instance.

But how small is small? There is no standardised definition of an SLM, but researchers generally consider a language model to be “small” if it has anywhere from roughly 10 million up to around 10 billion parameters. This is several orders of magnitude smaller than the largest models today (e.g., OpenAI’s GPT-4 with a rumoured 1.76 trillion parameters and Google’s PaLM with 540 billion parameters).
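To make these numbers concrete, here is a back-of-the-envelope sketch in Python using the common 12 × layers × d_model² rule of thumb for a standard Transformer. The configurations below are illustrative rather than the specs of any named model, although the larger one roughly matches GPT-3’s published 175-billion-parameter scale:

```python
def transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter count for a standard Transformer decoder.

    Each layer contributes ~4*d^2 attention weights plus ~8*d^2
    feed-forward weights (hence the 12 * n_layers * d_model**2 rule
    of thumb), and the embedding table adds vocab_size * d_model.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# Illustrative configurations, not the specs of any particular model:
small = transformer_params(n_layers=24, d_model=2048, vocab_size=32_000)
large = transformer_params(n_layers=96, d_model=12_288, vocab_size=50_000)
print(f"small: {small / 1e9:.1f}B parameters")  # ~1.3B (SLM territory)
print(f"large: {large / 1e9:.1f}B parameters")  # ~174.6B (GPT-3 scale)
```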

Despite their role as David to the Goliath of LLMs, many SLMs have showcased performance far in excess of what their size would indicate. The reason is that model performance is determined not just by parameter count but by several other factors, such as the type and quality of training data, the model architecture, and the training method. If we were to compare training an AI model to nurturing a human child, then:

  • Parameter count would represent the child’s innate cognitive potential and natural aptitude for certain skills;

  • Type and quality of training data could be likened to the quality, diversity and depth of education that the child receives;

  • Model architecture can be equated to a child's personality, where different architectures have distinct ways of processing data, just as each child’s unique personality shapes how they interact with and learn about the world;

  • Training method (e.g., supervised learning, unsupervised learning, or reinforcement learning) would be analogous to the parenting style which heavily influences the child’s growth, attitudes, and behaviours.

Researchers have managed to achieve outsized performance by making improvements in each of these three areas beyond parameter count. Such techniques are not unique to SLMs but have been key drivers in the “miniaturisation” of language models. Let’s investigate each of these in turn.

#1 Improving Training Data Type and Quality

The most significant advancements in SLMs have come from improving the type and quality of training data. A key development here has been the use of synthetic data, which is data that has been generated by other language models.

Microsoft for instance used “textbook-quality data” involving a combination of a curated synthetic dataset and “carefully selected web data” to teach its Phi-2 model “common sense reasoning and general knowledge, including science, daily activities and theory of mind, among others”.

This approach works well because the real-world data that LLMs have traditionally been trained on tends to suffer from quality and relevance issues and is not tailored for training. Synthetic data is also a lot cheaper to obtain and work with (e.g., fewer legal and privacy concerns), which helps bring down the cost, time and effort of training.
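To give a flavour of how synthetic data generation works in practice, here is a minimal sketch using Hugging Face’s transformers library, with GPT-2 standing in as the generating model; the seed topics and prompt template are invented for illustration and bear no relation to Microsoft’s actual Phi-2 data pipeline:

```python
# Sketch: generating "textbook-style" synthetic training examples with a
# language model. GPT-2 is used here purely because it is small and free.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

SEED_TOPICS = ["photosynthesis", "compound interest", "theory of mind"]

synthetic_examples = []
for topic in SEED_TOPICS:
    prompt = f"Write a short, textbook-style explanation of {topic}:"
    text = generator(prompt, max_new_tokens=100, do_sample=True)[0]["generated_text"]
    synthetic_examples.append({"topic": topic, "text": text})
```

In a real pipeline, the generated examples would then be filtered and scored for quality before being fed into the smaller model’s training run.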

#2 Introducing Innovative Model Architectures

SLM researchers have also been busy introducing innovations to the underlying model architectures to improve their efficiency and efficacy.

While the Transformer model developed by Google researchers in 2017 remains the backbone of virtually all language models, variants such as “Efficient Transformers” and “Sparse Transformers” have helped to reduce computational complexity, making them more suitable for SLMs.

Adaptive computing is another related innovation which allows models to dynamically adjust the amount of computations they perform based on the complexity of inputs. This allows the model to process simpler inputs quickly while allocating more resources to more complex inputs, optimising overall performance.
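As a rough illustration of the idea (a toy sketch only, not any production architecture), here is an “early exit” network in PyTorch, where a small confidence head after each block lets easy inputs leave the network before the full depth is computed:

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Toy adaptive-computation network: a confidence head after each
    block lets 'easy' inputs exit early instead of running every layer."""

    def __init__(self, dim: int = 64, n_blocks: int = 4, threshold: float = 0.9):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.exits = nn.ModuleList(nn.Linear(dim, 2) for _ in range(n_blocks))
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (dim,), a single example, kept 1-D for clarity
        probs = None
        for block, exit_head in zip(self.blocks, self.exits):
            x = torch.relu(block(x))
            probs = torch.softmax(exit_head(x), dim=-1)
            if probs.max() >= self.threshold:  # confident enough: stop here
                break
        return probs

print(EarlyExitNet()(torch.randn(64)))
```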

A final example is Mixture of Experts (MoE), a model architecture which binds together multiple smaller models. Each of the smaller models is an “expert” specialising in different types of tasks: one expert may specialise in reasoning tasks, another in coding, while a third might focus on writing. Mixtral 8x7B, developed by the French start-up Mistral AI, combines eight sets of 7 billion parameter experts working together as one. When a query is raised, an algorithm known as a “gating network” determines the type of task it requires and allocates it to two of the eight experts. This approach allows the model to efficiently handle a wide range of inputs, improving performance without a proportional increase in size.

The Mixture of Experts (MoE) model architecture involves binding together multiple smaller “expert” models and directing queries to a subset of those experts to improve the accuracy and efficiency of responses.
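For the technically curious, here is a toy top-2 gating layer in PyTorch. This is a sketch of the routing idea only; real MoE models such as Mixtral route per token inside each Transformer layer, with far more machinery for load balancing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    """Illustrative Mixture-of-Experts layer: a gating network routes
    each input to 2 of 8 small expert networks."""

    def __init__(self, dim: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(dim, n_experts)  # the "gating network"
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Score all experts, keep only the top-k per input.
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # inputs routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TopTwoMoE()(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```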

#3 Refining Training Methods

Model training methods have also become more advanced, often leveraging an existing model to help train another.

Knowledge distillation, for instance, is a technique that uses a larger, more capable model (e.g., GPT-4) as the teacher for a smaller “student” language model. The smaller model is trained not directly on the same input data, but is instead taught to replicate the output of the larger model. The idea is that the student model learns to mimic the teacher’s responses and decision-making patterns, effectively distilling the teacher’s knowledge. This approach was used to train Microsoft’s Orca 2 model, which has been evaluated to perform as well as or better than models containing 10X its number of parameters.
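In its classic form (Hinton et al.’s 2015 formulation, rather than necessarily Orca 2’s exact recipe), distillation trains the student to match the teacher’s softened output distribution. A minimal PyTorch sketch of that loss:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Classic knowledge-distillation objective: KL divergence between the
    teacher's and student's softened (temperature-scaled) distributions."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t ** 2)
```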

Transfer learning is another methodology that calls upon a larger, more generalist model to help train a smaller, more specialised one. This approach involves leveraging the pre-learned knowledge of an existing model for a new but related task: a larger pre-trained model is fine-tuned on a different, often smaller and more specific dataset related to the new task. An analogy would be teaching someone who already knows how to play the piano, and therefore understands music theory, rhythm, and sheet music, to play the violin. The resulting model can be made smaller and more efficient than the original because it has a more specialised focus, and because it need not replicate all the parameters of the larger one but rather learns to mimic its important behaviours and patterns (this is sometimes referred to as “parameter optimisation”).
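As a minimal sketch of the fine-tuning step, using Hugging Face’s transformers library with DistilBERT as an arbitrary pretrained backbone (the three-label classification task is hypothetical):

```python
from transformers import AutoModelForSequenceClassification

# Start from a pretrained model and fine-tune only a new task head on a
# small, specialised dataset, reusing the backbone's pre-learned knowledge.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)
for param in model.distilbert.parameters():  # freeze the pretrained backbone
    param.requires_grad = False
# Only the newly added classifier layers remain trainable; training then
# proceeds as usual (e.g., a standard PyTorch loop or the Trainer API).
```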

The Mighty Implications of Small Language Models

While their smaller model sizes mean that they are more limited in capability than LLMs, being compact nonetheless bestows SLMs with a range of benefits. Let’s explore some potential implications and possibilities.

#1 AI Everywhere

Expect to soon see SLMs in everything from smartphones to appliances and machinery. Because SLMs are smaller and more economical than their larger counterparts, they can be deployed locally (i.e., installed) in a range of devices without requiring internet connectivity.

Numerous enthusiasts have already installed open source models such as Llama-2 7B and Mistral 7B on their smartphones, while Google recently announced that its Gemini Nano model is already supporting some features on the Pixel 8 Pro. On the hardware side, semiconductor manufacturers such as MediaTek and Qualcomm have also announced plans to directly build language models into their next generation of chips.
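To give a sense of how accessible local deployment already is, here is a minimal sketch using the open source llama-cpp-python bindings; the file name is hypothetical and assumes you have already downloaded a quantised GGUF build of Mistral 7B:

```python
from llama_cpp import Llama

# Runs entirely on-device: no internet connection or cloud API required.
llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

response = llm("Q: Suggest three dinner ideas using leftover rice. A:", max_tokens=64)
print(response["choices"][0]["text"])
```

Quantisation (storing weights at 4-bit rather than 16-bit precision) is what lets a 7-billion-parameter model fit comfortably in a laptop’s, or even a high-end phone’s, memory.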

The miniaturisation of computing from PCs and laptops to mobile devices triggered a range of new products and business models (e.g., smartphone apps, mobile games), and the same could be expected of SLMs. Intelligent devices such as smart fridges and talking dishwashers, as well as homes and cars equipped with Iron Man’s Jarvis-style AI butlers, which were prophesied by the 2000s Internet-of-Things (IoT) wave, may actually now be around the corner!

#2 Democratised AI

Small Language Models (SLMs) are cost-efficient and compact, and could democratise AI not only by making it available to broader swaths of society but also by helping spread the technology to less developed countries.

Beyond appearing across a variety of devices, SLMs’ smaller size and more economical nature will support the democratisation of AI by making it accessible to a broader range of users, including small businesses, researchers, and developers with limited computational resources.

Beyond cost considerations, SLMs should also make it possible for many more organisations to run their own language models locally on internal servers (rather than relying on cloud-based services such as OpenAI’s ChatGPT or Microsoft’s Azure). This will be attractive for many organisations – firstly, locally-hosted solutions involve higher one-off costs but lower running costs compared to cloud-based services, and secondly, because their data remains on-premise, businesses can be more confident about data security and privacy. The latter is a particularly important consideration for sectors that work with highly sensitive data, such as healthcare and financial services.

A key benefit of these trends will be to further expand and accelerate the innovation that is already taking place within the AI space. A less obvious implication, but of no less import is the likelihood that SLMs will hasten the shift toward open source models and potentially crack open the current “oligopolistic” market centred around OpenAI, Microsoft, Google and a few other players. This will benefit both consumers and businesses as more competition should result in lower prices and more choices (not to mention being less at the mercy of OpenAI’s frequent service outages!).

As AI technologies accelerate forward, it is critical to ensure that their benefits are not confined to technologically-advanced nations. Just as mobile phones enabled technological leapfrogging in many developing countries, SLMs could potentially do the same for Generative AI technologies in less developed countries. Small and affordable computers such as the Raspberry Pi, which is already being used for educational purposes and as the basis for “lightweight” laptops in countries such as India, Iraq, Turkey, and Romania, can be equipped with SLMs of under 10 billion parameters.

#3 Bifurcated Ecosystems

As SLMs proliferate, a key corollary is the likely bifurcation and “verticalisation” of the AI model ecosystem.

LLMs such as OpenAI’s GPT-3.5 and Google’s PaLM 2 tend to be generalist models that perform well across a wide and versatile range of tasks. However, they very much remain “jack-of-all-trades” rather than specialists.

Oftentimes, however, we encounter problems that require narrow yet deep expertise. This is where smaller models really shine, because they can be easily and cheaply developed with more specialist capabilities, whether task-specific (e.g., extracting data from transaction records, coding) or domain-specific (e.g., healthcare, finance).

Within individual organisations, we are likely to see a multitude of language models at work. Taking healthcare as an example, hospital staff may rely on an LLM for general-purpose queries, but utilise separate SLMs for diagnostic decision support, extracting and summarising patient notes, or researching medical literature. Just as the Mixture of Experts (MoE) model architecture supports the directing of queries among its expert models, we may begin to see the emergence of organisation-wide “gating networks” that seamlessly distribute traffic to individual internal and / or external models depending on the task in question.
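To make this concrete, here is a deliberately simple sketch of such an organisation-wide router in Python. The model names and keyword rules are entirely hypothetical, and a production system would more likely use a small classifier model than keyword matching:

```python
# Hypothetical registry mapping task types to specialist internal models.
ROUTES = {
    "diagnosis": "hospital/diagnostic-support-slm",
    "notes": "hospital/patient-notes-summariser-slm",
    "research": "hospital/medical-literature-slm",
    "general": "hospital/general-purpose-llm",
}

def route_query(query: str) -> str:
    """Toy 'gating network': pick a model based on keywords in the query."""
    q = query.lower()
    if "diagnos" in q or "symptom" in q:
        return ROUTES["diagnosis"]
    if "summarise" in q or "patient notes" in q:
        return ROUTES["notes"]
    if "literature" in q or "study" in q:
        return ROUTES["research"]
    return ROUTES["general"]

print(route_query("Summarise the patient notes from today's ward round."))
# -> hospital/patient-notes-summariser-slm
```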

From an ecosystem perspective, a potential scenario in the near- to medium-term is the continued dominance of a few general-purpose models and providers with the scale and resources to continuously push the boundaries of LLM research while catering to the emerging regulatory requirements for such models. Alongside them, we could see an explosion of proprietary and open source specialist models, both small and large, both off-the-shelf and custom-developed for individual organisations. Consequently, low- and no-code platforms for AI model training and finetuning, requiring only basic experience in data science and machine learning, could emerge to serve this demand.

Finally, the national security implications of such a powerful technology also means that we are likely to see state actors developing and deploying their own LLMs and SLMs, either publicly or in secret.

#4 Lean and Maybe Green

LLMs are incredibly energy-intensive. Researchers have estimated that the training of OpenAI’s GPT-3 resulted in the production of 552 tonnes of CO2, which is roughly the emissions arising from “two to three full Boeing 767s flying round-trip from New York City to San Francisco”.

While the environmental impact of training such models is non-trivial, it pales in comparison to the emissions resulting from their ongoing usage. Estimates place the electricity requirements of running ChatGPT over a monthly period “between 1 to 23 million kWh considering a range of scenarios, with the top end corresponding to the emissions of 175,000 [Danish] residents” (Denmark is the home country and reference point for the researcher in question).

From my perspective, the jury is therefore still out on whether SLMs will improve AI’s environmental footprint. Being more efficient, SLMs should, all else being equal, help reduce electricity and water consumption. That being said, SLMs will likely also lead to more widespread usage of AI models, and may on aggregate result in an increased carbon footprint.

#5 Regulatory Purgatory

Nuclear weapons can cause widespread devastation and loss of life and are strictly monitored under international agreements such as the Nuclear Non-Proliferation Treaty. Yet terrorist groups and malicious state actors have shown that even relatively primitive technologies can be put to extremely potent use.

The same can be said of LLMs and SLMs. The focus of both Europe’s recently passed EU AI Act and US President Biden’s Executive Order (EO) on Safe, Secure, and Trustworthy Artificial Intelligence is very much on the largest of AI models.

The EU AI Act for instance “sets rules for large, powerful AI models, ensuring they do not present systemic risks to the Union” while President Biden’s EO requires AI developers to “share safety data, training information and reports with the U.S. government prior to publicly releasing future large AI models or updated versions of such models” with large models defined as having “tens of billions of parameters”.

SLMs remain very much under the regulatory radar yet are potentially just as harmful if placed in the wrong hands. Furthermore, the open source and easily accessible nature of SLMs also means that the cat is very much out of the bag, with any regulatory attempt to contain them likely doomed to failure anyway.

When it comes to SLMs, national security and public safety will therefore likely require an active coalition among private, public and research entities and a system of continuous monitoring and information sharing rather than regulation per se.

A Future of Promise and Responsibility

Having explored the rise of SLMs, it's clear that we stand at a crossroads of technological advancement and ethical responsibility.

The democratisation of AI through SLMs and the potential for such models to be embedded everywhere promises a future where the technology is more accessible, inclusive, and tailored to individual needs.

Yet great (and distributed!) power also comes with a heavy responsibility. Policy makers, thought leaders, and industry titans, currently focused on the largest of AI models, will need to wake up to the possibilities opened up by smaller models. As SLMs continue to evolve, only a collaborative effort and continuous vigilance will keep society safe from malicious intentions. Overlooking this critical aspect could lead to a world where the power of AI is misused, not in grand, sweeping gestures, but in countless small, unnoticed ways – a thousand cuts leading to a loss of privacy, security, and trust.
