Generative AI, a subfield of artificial intelligence, has etched its mark in the landscape of our everyday experiences. From mobile applications to in-home digital assistants, Generative AI forms the underpinning technology that is steadily transforming the way we live, work, and communicate. At the forefront of these AI advancements are Large Language Models (LLMs), such as OpenAI's chatGPT, which has become the fastest-growing consumer product in history since its debut at the end of 2022. But how exactly does this magic work? In this article, let's pull back the curtain and delve into the world of generative AI, exploring what GPT and LLMs are, how they are trained, and the extraordinary journey they take from raw data to polished AI assistants.
Understanding the AI Jargon: GPT and LLMs
At the foundation of chatGPT lies an autoregressive LLM. LLMs are a type of AI model that uses large amounts of text data to generate human-like text. They can predict the likelihood of a word given the previous words used in the text, hence the term 'language model'. When we say 'large', we're referring to the number of parameters these models have, often in the order of billions, that allow them to generate highly coherent and contextually relevant outputs.
ChatGPT, an instance of an LLM, leverages autoregression - a predictive technique used for sequential data. It determines the output at any given step based on prior outputs. In the context of language modeling, it is tantamount to predicting the next word in a sentence based on preceding words, a technique that is applicable not only to text, but also to images, audio, and video.
The acronym GPT in chatGPT stands for Generative Pretrained Transformers, indicating the model's reliance on the Transformer architecture for autoregression. The breakthrough paper, "Attention is All You Need," in 2017 underpins this architecture. The model's success is hinged on its attention mechanism, which was originally introduced in another groundbreaking paper called Neural Machine Translation by Jointly Learning to Align and Translate" back in 2014. The attention mechanism discerns the intrinsic relationships between different elements of an input sentence. This makes it particularly effective for a plethora of language-related tasks, such as translation, summarization, and text generation.
Today, the GPT series from OpenAI, which powers ChatGPT, is among the leading commercial LLMs. However, the arena of open-source transformer LLMs is seeing rapid growth. Notable examples include Meta's LLaMA, which offers a free research license, and the Falcon-40b model developed by Technology Innovation Institute of UAE that has a generous free commercial use license. The Falcon model recently earned a top spot on the open-source LLM performance leaderboard maintained by HuggingFace, even though there are still debates on how the models should be evaluated.
From Poetry to Parameters: Preparing for the Making of a GPT-like LLM
To understand how an LLM like GPT may be created, consider the task at hand: predicting the next word in a sentence given the existing parts. The intuitive approach involves immersing the model in vast quantities of existing text, hoping it will develop a statistical understanding of word sequences. It's reminiscent of an old Chinese saying, "熟读唐诗三百首，不会吟诗也会吟," suggesting that by immersing oneself in 300 Tang poems, one can naturally grasp Chinese poetry and language, even without aspiring to be a poet. Quite fitting for AI language models, isn't it?
Quantifying Training Data for the Model
While the idea of immersing a model in poetry is charming, our objective is a general-purpose LLM, capable of generating virtually any sentence. This necessitates far more than 300 poems. But how much data does the model need to make it useful? Quite a lot. GPT-3, the base model for the original ChatGPT, was trained on approximately 300 billion tokens. Meanwhile, Meta's Llama-65B and the Falcon model were trained on between 1 to 1.4 trillion tokens. (We will introduce the concept of tokens in a section below; for now, just think of one token as roughly equivalent to four characters for common English text.)
Sourcing the Data
These are staggeringly large amounts of data. How can we collect them? One common method involves massive web crawls, like those undertaken by CommonCrawl, a non-profit organization that gathers web crawl data freely available for research and analysis. Other sources include C4, which produces a cleaner dataset based on CommonCrawl data. However, data directly crawled from the web can have quality issues due to numerous unverified sources. As a result, the most successful models blend web crawl data with some curated, higher quality data sources such as Wikipedia, GitHub, ArXiv, StackExchange, and Reddit.
Balancing Act: Data, Model Size, and Computing Resources
Is it enough to train the most powerful model on as much data as possible? The answer is not as straightforward, because there are other constraints that come into play.
Think of it this way: as a model ingests an immense volume of training texts, it forms connections and stores relationships between them. This stored knowledge gets tapped into when we utilize the model, for instance, predicting the next word given a set of existing ones. It's quite like memory. The model stashes its memory as parameters, and the more parameters it can store, the larger its memory, and thereby, its size. As such, having more data to train on can result in a model being more powerful.
However, it's not just about having ample memory space. The larger the size of the model and the more data used for training, the greater the need for computational resources, such as GPUs. Moreover, the time required for training increases substantially, often extending to months rather than days.
Furthermore, the amount of available training data itself is finite. Consequently, we encounter a complex interplay between the scaling of model size, training dataset, and the available training budget. This intricacy is explored in depth in this paper, which discusses the trade-off between model size and training data, given a certain training budget allocation.
Understanding Words vs. Tokens
One concept that deserves mention is tokens vs. words. While humans segment text into words, computer models deal with tokens. But what exactly are tokens, and why do we need them?
At their core, computers process data in binary format: 0s and 1s. To handle text, we convert it into numeric form, represented in binary format, known as tokens. For example, we can use numbers 1 to 26 to signify the alphabet 'a' through 'z'. This is a simple instance of character-based tokenization.
There are various methods of tokenization, ranging from character-based, word-based, or sub-word based. These methods introduce trade-offs concerning vocabulary size, leading to tokenized outputs of different lengths. Since there's a limit on the number of tokens (referred to as context window length) that a model can process at a time, a balance must be struck. That's where sub-word based tokenization comes in. It offers a balance between the number of tokens and the information each token carries.
Compared to pure character-based tokenization, sub-word-based approach results in shorter output token sequences, facilitating the model's semantic understanding of the context. It's also more adept at handling new words (known as Out-of-vocabulary or OOV words) than word-based tokenization, as it can learn morphological variations more effectively. As a rule of thumb, one token typically corresponds to approximately four characters of text for common English text, translating to roughly ¾ of a word.
Cracking the Code: Training a GPT-like LLM
A summary of the various stage of training a GPT model is captured in the following slide by OpenAI
The Journey from a Base Model to a Fine-tuned Model
Training a base transformer large language model on Internet-scale datasets is a labor-intensive process. According to OpenAI, 99% of the computational work in training the GPT model occurs at this stage. It requires thousands of GPUs and can take several months to complete. The result is a foundation model that's skilled at predicting the next tokens given an existing token context. However, for ChatGPT, a model that behaves like a personal assistant, this base model needs additional refinement. The process of honing a base model into a fine-tuned model involves a mere 1% of the computational work, but it's critical.
The Art of Supervised Fine-Tuning
How do we improve a base model to make it excel at Q&A and assistant tasks? One popular way to steer its behavior is through supervised fine-tuning. Instead of utilizing large-scale, lower-quality data, we use a compact but high-quality dataset specially tailored to contain question and answer pairs. In the case of GPT, OpenAI recruited human contractors to generate tens of thousands of these specific samples. Here, the questions become the "prompts" that elicit responses from the model. Although the training objective remains the same as in the prior phase, the model is essentially trained to produce text akin to the answers in the samples when presented with corresponding questions.
The Pinnacle of Refinement: Reinforcement Learning with Human Feedback
Successful supervised fine-tuning provides us with a model skilled at Q&A tasks. However, OpenAI takes it a step further with its ChatGPT model, incorporating an additional stage known as Reinforcement Learning with Human Feedback (RLHF). This stage is divided into two key steps.
- The first step involves training a supplementary AI model - a reward model. This model grades the responses to the input questions posed to our existing AI model. To train the reward model, the existing model generates multiple responses to a given question, and humans evaluate these responses. These human-graded evaluations serve as ground truth samples, allowing the reward model to learn human discernment criteria for answers and independently score the main model's responses.
- In the second step, the reward model's evaluations are used to fine-tune the previously trained LLM through reinforcement learning. Now that the reward model can score each output, we train the LLM not just to generate responses, but to create responses that maximize these rewards.
In fact, the RLHF technique or its variants prove pivotal in producing leading chat-based LLMs. OpenAI's GPT-4 model tops the leaderboard, followed closely by the "Claude" model from Anthropic, which utilizes a variant of RLHF based on a Constitutional AI approach. Unlike the standard RLHF, Constitutional AI does not require humans to provide grading samples for training a reward model. Instead, guiding principles (known as constitutions) are given to the reward model, which critiques the outputs of the LLM.
However, RLHF models do come with some trade-offs. They often lose entropy compared to their base models, meaning these models produce responses that are less diverse and more predictable. This reduced diversity could be because these models align more with the human preferences introduced during the process, which may not always be advantageous. For example, a study using GPT models to simulate agent behaviors have found that dialogues generated by the agents can seem overly formal, or agents may be excessively cooperative with one another, often agreeing to suggestions even when they conflict with their own given interests and characteristics.
In this article, we have taken a dive into what a GPT LLM is and the process of creating such a model. The journey from raw data to a polished AI assistant involves various stages, from pre-training and supervised fine-tuning to reinforcement learning with human feedback. Along the way, the model's performance is honed and optimized, even as trade-offs and constraints are carefully balanced. The end result is a powerful model like ChatGPT, capable of understanding and responding to our queries in a remarkably human-like manner.
As we continue our exploration in the next article, we'll delve deeper into the practical application of LLMs. We will introduce the Action-Brain-Context (ABC) triangle that is central to LLM applications. Additionally, we'll elucidate on several crucial concepts including in-context learning, prompt engineering, and autonomous AI agents.