Deep-Dive into How GPT Works


Dec 17, 2022

I wanted a deeper understanding of what is going on under GPT-3’s hood. So I asked ChatGPT for help.

The ensuing conversation was enlightening. Keep in mind that GPT is a renowned bullshit artist (this excellent article explains why), so some of its statements might not be accurate. This becomes abundantly clear at the end of the conversation. But my sense is that it mostly told me the truth. ChatGPT tends to give unnecessary summaries at the end of its answers (almost always starting with the word “overall”), which I’ve mostly removed.

The Wikipedia article on GPT-3 uses lingo like autoregressive language model, standard transformer network, generative pretraining, and few-shot learning. All of that is meaningless to anyone who isn’t already an expert on AI. I quoted the article to ChatGPT and then asked for clarification. Our conversation follows…


Can you please explain that as if I were a 2nd grader?

Sure! GPT-3 is a computer program that can write sentences and paragraphs that sound like they were written by a person. It uses a special type of math called “deep learning” to do this. GPT-3 is very big, with lots of parts that help it understand language. It can learn new things quickly, even if it only sees a small amount of information. GPT-3 is useful for lots of tasks that involve reading and writing, like answering questions or writing stories.

Now can you please explain it as if I were a freshman in college studying computer science?

Certainly! Generative Pre-trained Transformer 3 (GPT-3) is a state-of-the-art autoregressive language model developed by OpenAI. It is a type of deep learning model that is trained to predict the next word in a given text prompt, by learning the statistical patterns and relationships between words in large amounts of text data. GPT-3 has a transformer architecture, which is a type of neural network that is particularly well-suited for processing sequential data such as natural language. The model has a very large capacity, with 2048 tokens of context and 175 billion parameters, and it was trained using a method called “generative pretraining” on a massive dataset. One of the key strengths of GPT-3 is its ability to perform well on a wide range of natural language tasks with only a small amount of fine-tuning, a property known as few-shot learning.
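To put the 175 billion parameters in perspective, here is a back-of-the-envelope sketch (an illustration added to the transcript, not ChatGPT's output). It assumes each parameter is stored as a 32-bit or 16-bit floating-point number; the conversation does not say which storage format is actually used.

```python
PARAMETERS = 175_000_000_000  # the figure quoted in the conversation


def weights_size_gb(num_parameters, bytes_per_parameter):
    """Size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# The bytes-per-parameter values are assumptions for illustration.
print(weights_size_gb(PARAMETERS, 4))  # 700.0 (32-bit floats)
print(weights_size_gb(PARAMETERS, 2))  # 350.0 (16-bit floats)
```

Under either assumption, the weights alone come to hundreds of gigabytes.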

What do you mean by “predict the next word”?

In the context of language modeling, “predicting the next word” means using the information in the previous words of a sentence or paragraph to predict what word is most likely to come next. For example, given the prompt “The cat sat on the _____”, the model might predict that the next word is “mat”, because that is a word that commonly follows the phrase “sat on the” in English language text.

To make these predictions, the model uses its understanding of the statistical patterns and relationships between words in the training data. For example, it might have learned that the word “cat” is often followed by the word “sat”, and that the word “sat” is often followed by the word “on”. Using this information, the model can make an educated guess about what word is most likely to come next in a given context.
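The idea of “which word often follows which” can be sketched with a toy word-pair counter. This is an illustration added to the transcript, not how GPT-3 is implemented: GPT-3 is a neural network, whereas this sketch only counts adjacent word pairs in a made-up three-sentence corpus.

```python
from collections import Counter, defaultdict

# Made-up corpus, an assumption for illustration only.
corpus = "the cat sat on the mat . the dog sat on the mat . the cat sat on the sofa ."


def train_bigrams(text):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


counts = train_bigrams(corpus)
print(predict_next(counts, "sat"))  # on
print(predict_next(counts, "cat"))  # sat
```

A counter like this only ever looks at the single previous word, which is why it cannot produce the coherent paragraphs discussed next.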

Language models like GPT-3 are useful for a wide range of natural language processing tasks, such as machine translation, language generation, and text summarization, because they can provide a good understanding of the structure and meaning of language.

I understand how this process can create sentences like “The cat sat on the mat.” But GPT-3 strings together these sentences in a way that is logically coherent and that demonstrates an understanding of the meaning of the input. How does that arise merely from putting together words one-at-a-time based on their probability of being conjoined in the training data?

Although GPT-3 generates text one word at a time based on the probability of each word given the previous words, it is able to create coherent and logical text because it has learned a rich representation of the structure and meaning of language from the vast amounts of text data it was trained on. This means that it has learned about the relationships between words and how they are used in different contexts, as well as the common patterns and structures of language.

For example, when generating text, GPT-3 can use this understanding of language to choose words and phrases that are appropriate for the context, and to structure its output in a way that follows the conventions of written language. It can also use its understanding of the meaning of words and phrases to generate text that is semantically coherent and makes sense in the context of the prompt.

I’m looking for a somewhat more technical explanation of how it does that.

Certainly! Here is a more technical explanation of how GPT-3 is able to generate coherent and logical text:

GPT-3 is a type of transformer language model, which is a neural network architecture specifically designed for processing sequential data such as natural language text. It consists of a series of self-attention layers that process the input text one word at a time, using information from the previous words to make a prediction about the next word.

To make these predictions, the model uses an embedding layer to convert each word in the input text into a dense vector representation, which captures the meaning of the word in a low-dimensional space. These vectors are then processed by the self-attention layers, which use the vectors to compute the probability of each word given the previous words in the sequence.

The self-attention layers in GPT-3 are able to capture long-range dependencies in the input text, which allows the model to understand the relationships between words and how they are used in different contexts. For example, the model might learn that the word “cat” is often followed by the word “sat” in the context of a sentence about a cat sitting on something.

The output of the self-attention layers is then passed through a series of fully-connected layers, which perform additional processing on the output and generate a probability distribution over the vocabulary of possible next words. The model then chooses the next word by sampling from this distribution, using a technique called sampling-based decoding.
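The last step, turning scores into a probability distribution and sampling from it, can be sketched as follows. This is an added illustration with made-up numbers: the four-word vocabulary and the scores are hypothetical stand-ins for what a trained model would output, and softmax is the standard way of converting scores to probabilities.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(vocab, scores, rng):
    """Draw one word from the vocabulary according to its softmax probability."""
    probs = softmax(scores)
    return rng.choices(vocab, weights=probs, k=1)[0]


vocab = ["mat", "sofa", "roof", "moon"]  # hypothetical candidate next words
scores = [3.0, 2.0, 1.0, -1.0]           # hypothetical model scores
probs = softmax(scores)
rng = random.Random(0)                   # seeded so repeated runs match

print([round(p, 3) for p in probs])
print(sample_next(vocab, scores, rng))
```

Because the word is sampled rather than always taken from the top of the list, the same prompt can produce different continuations on different runs.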

What is a “self-attention layer”?

In a neural network, a self-attention layer is a type of layer that allows the model to focus on specific parts of the input when making predictions. It is particularly useful for processing sequential data such as natural language text, because it allows the model to consider the relationships between words and how they are used in different contexts.

In the context of a transformer language model like GPT-3, a self-attention layer consists of a series of attention heads, each of which computes a weighted sum of the input vectors using an attention mechanism. The attention mechanism allows the model to selectively “attend” to certain parts of the input when making a prediction, based on the relevance of those parts to the prediction.

For example, when generating text, a self-attention layer might use the attention mechanism to focus on certain words in the input that are relevant to the context of the text being generated. This allows the model to use the relevant information in the input to inform its predictions, while ignoring irrelevant information.

I think I understand. So if the input is “The cat sat on the–”, the attention mechanism might assign a greater weight to the word “cat,” to signify that it is probably more important than the other words in the sentence.

Yes, that’s correct! In the context of language modeling, the attention mechanism in a self-attention layer can assign higher weights to certain words in the input that are deemed more important or relevant to the task at hand.

For example, when generating text, the attention mechanism might assign higher weights to certain words in the input that are relevant to the context of the text being generated. This allows the model to use the relevant information in the input to inform its predictions, while ignoring irrelevant information.

In the case of the input you provided, “The cat sat on the–”, the attention mechanism might assign higher weights to the word “cat” because it is the subject of the sentence and is likely to be important for determining the next word. It might also assign higher weights to the words “sat” and “on”, because they provide additional context about the verb and preposition that are likely to be important for determining the next word.
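The weighting described above can be sketched as a single attention head working on one position. This is an added illustration, not GPT-3's code: the two-dimensional vectors are invented, and a real transformer derives its query, key, and value vectors from learned projections, which this sketch skips.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights and the weighted sum of the value vectors.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical vectors for the words in "The cat sat on the".
words = ["The", "cat", "sat", "on", "the"]
keys = [[0.1, 0.0], [1.0, 0.5], [0.6, 0.4], [0.3, 0.1], [0.1, 0.0]]
values = keys        # reusing the keys as values keeps the toy small
query = [1.0, 0.5]   # made-up query for the position being predicted

weights, output = attend(query, keys, values)
for word, w in zip(words, weights):
    print(f"{word:>4}: {w:.2f}")
```

With these invented numbers, “cat” receives the largest weight and “sat” the second largest, mirroring the example in the conversation.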

Can you please explain the purpose of the vectors?

In a neural network, a vector is a mathematical representation of data that can be used as input or output for the network. In the context of a language model like GPT-3, the vectors are used to represent the words in the input text in a dense, low-dimensional space.

Each word in the input text is assigned a unique vector representation, which is learned by the model during training. These vector representations, also known as word embeddings, capture the meaning of the word in a compact form that the model can use for prediction.

For example, words that have similar meanings will have similar vector representations, while words with very different meanings will have very different vector representations. This allows the model to use the vectors to understand the relationships between words and how they are used in different contexts.

The vectors are an important part of the model’s ability to understand language, because they provide a rich representation of the meaning of the words in the input text. They are also useful for tasks such as language translation, where the model needs to be able to understand the relationships between words in different languages.
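The claim that similar words have similar vectors can be made concrete with cosine similarity, a common way to compare two vectors. This is an added sketch: the three-dimensional vectors below are invented for illustration, while real embeddings are learned during training and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Near 1.0 when two vectors point the same way, near 0.0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-picked so the two animals sit close together.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(round(cosine_similarity(embeddings["cat"], embeddings["dog"]), 2))  # 0.99
print(round(cosine_similarity(embeddings["cat"], embeddings["car"]), 2))  # 0.3
```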

Would this be a fair analogy? The training data is like a map where words are grouped by their meaning, and the vectors are coordinates to find words on the map?

That’s a good analogy! The training data can be thought of as a map that shows the relationships between words and how they are used in different contexts, and the vectors can be thought of as coordinates that allow the model to locate specific words on the map.

In this analogy, the vectors are like GPS coordinates that provide a precise location for each word on the map, while the training data is like the map itself, showing the relationships between words and how they are used in different contexts. The model can use the vectors to understand the meaning of words and their relationships to other words, just as a person can use GPS coordinates to locate a specific place on a map.

Why is it called an autoregressive language model?

An autoregressive language model is a type of model that is trained to predict the next word in a sequence of words based on the previous words. The term “autoregressive” refers to the fact that the model makes predictions based on its own predictions, in an iterative process.

For example, in the case of a language model like GPT-3, the model might start by predicting the next word in a given text prompt based on the words that come before it. It then uses this prediction to update its internal state and generate a new prediction for the following word, and so on. This process continues until the model has generated the desired amount of text.

The autoregressive nature of the model allows it to use the information from the previous words to inform its predictions about the next word, which helps it to generate coherent and logical text. It also allows the model to capture the dependencies between words and how they are used in different contexts, which is an important aspect of understanding language.
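The loop described here, where each prediction is appended to the input before the next prediction is made, can be sketched as follows. This is an added illustration: the lookup table is a hypothetical stand-in for a trained model, and unlike a real model it looks only at the last word.

```python
# Hypothetical lookup table standing in for a trained model.
NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "the",
}


def predict_next(context):
    """Toy 'model': looks only at the last word of the context."""
    return NEXT_WORD.get(context[-1], "<end>")


def generate(prompt, max_new_tokens):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)  # the prediction becomes part of the next input
    return " ".join(tokens)


print(generate("the cat", 3))  # the cat sat on the
```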

Is a token just a word or punctuation mark?

In the context of natural language processing, a token is a unit of text that is used as the basic building block for processing language. A token can be a word, a punctuation mark, or even a single character, depending on the specific application and the level of granularity that is desired.

For example, in the case of a language model like GPT-3, a token might correspond to a word or a punctuation mark, depending on the specific task being performed and the level of detail that is required. In other applications, such as machine translation or text classification, tokens might correspond to individual characters or subwords, in order to capture fine-grained information about the structure and meaning of language.
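Two of the granularities mentioned here, word-level and character-level, can be sketched with a couple of toy functions. This is an added illustration, not GPT-3's tokenizer: as far as I know, GPT models use a subword scheme (byte pair encoding), so their tokens are often pieces of words rather than whole words.

```python
import re


def word_tokens(text):
    """Split text into words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    """Treat every character, including spaces, as its own token."""
    return list(text)


print(word_tokens("The cat sat on the mat."))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
print(char_tokens("cat."))
# ['c', 'a', 't', '.']
```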

The mechanism you’re describing explains how GPT-3 can construct sentences one token at a time. But how does it account for the appearance of logical reasoning? When I give GPT-3 logic problems, it appears to use propositional logic to arrive at answers. This would seem to go beyond stringing words together based upon their related meanings. Does GPT-3 have a propositional logic module of some kind?

GPT-3 does not have a dedicated propositional logic module, but it is able to perform tasks that involve logical reasoning by learning patterns and relationships from the vast amounts of text data it was trained on.

For example, when generating text or answering questions, GPT-3 is able to use its understanding of the structure and meaning of language to make inferences and draw logical conclusions based on the information provided. This allows it to perform tasks such as solving logic problems or answering factual questions, even though it does not have a dedicated logic module.

One way that GPT-3 is able to perform tasks that involve logical reasoning is by using its self-attention mechanism to attend to specific parts of the input that are relevant to the task at hand. This allows it to use the relevant information to inform its predictions and make logical conclusions based on the input.

How does the pre-training process work?

In the context of a language model like GPT-3, pre-training refers to the process of training the model on a large dataset of text in order to learn the statistical patterns and relationships between words and how they are used in different contexts.

During pre-training, the model is trained to predict the next word in a sequence of words, given the previous words in the sequence. This is known as an autoregressive language modeling task, because the model makes predictions based on its own predictions in an iterative process.

The goal of pre-training is to learn a rich representation of the structure and meaning of language, which can then be fine-tuned for specific natural language processing tasks such as machine translation or text summarization.

Pre-training is an important step in the development of a large language model like GPT-3, because it allows the model to learn a strong foundation in language understanding, which it can then build upon when fine-tuning for specific tasks. Pre-training also allows the model to perform well on a wide range of tasks with only a small amount of fine-tuning, a property known as few-shot learning.
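What “trained to predict the next word” means numerically can be sketched with the usual scoring rule for such models, cross-entropy: the model is penalized by the negative logarithm of the probability it gave to the word that actually came next. This is an added illustration with made-up probabilities; the conversation itself does not name the loss function.

```python
import math


def next_word_loss(predicted_probs, actual_word):
    """Cross-entropy for one prediction: -log of the probability of the true word."""
    return -math.log(predicted_probs[actual_word])


# Hypothetical predictions for the word after "The cat sat on the".
confident = {"mat": 0.7, "sofa": 0.2, "moon": 0.1}
unsure = {"mat": 0.1, "sofa": 0.2, "moon": 0.7}

print(round(next_word_loss(confident, "mat"), 3))  # 0.357
print(round(next_word_loss(unsure, "mat"), 3))     # 2.303
```

Training nudges the model's parameters to make this penalty smaller on average across the training text.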

Are humans involved in the pre-training process or is it automated?

Pre-training a large language model like GPT-3 is typically an automated process, with no direct human involvement. The process involves using a computer to process a large dataset of text and train the model using a set of algorithms and mathematical techniques.

The training data for a model like GPT-3 is typically a large dataset of text that has been collected and preprocessed by humans. This might involve cleaning the text to remove noise or formatting errors, and possibly annotating the text with additional information such as part-of-speech tags or named entity labels.

However, once the training data has been prepared, the pre-training process itself is typically automated and does not require direct human involvement. The model is trained using a set of algorithms and mathematical techniques that are designed to learn the statistical patterns and relationships between words and how they are used in different contexts.

Is a large language model a type of neural network?

Yes, a large language model like GPT-3 is a type of neural network. Specifically, it is a transformer language model, which is a type of neural network architecture that is specifically designed for processing sequential data such as natural language text.

The neural network is created during the pre-training period, and thereafter it is static. Input text is not added to this network.

In the context of a large language model like GPT-3, the neural network is created during the pre-training period and is then fixed and not modified. This means that the network architecture and the weights of the network are not changed after the pre-training process is complete.

However, during the pre-training process and subsequent fine-tuning for specific tasks, the model is fed a large amount of input text and uses this input to learn the statistical patterns and relationships between words and how they are used in different contexts. This allows the model to learn a rich representation of language, which it can then use to generate coherent and logical text or perform other natural language processing tasks.

So while the neural network itself is not modified after pre-training, the model continues to process input text and use this input to make predictions and perform various tasks.

Let’s talk about the architecture behind all of this. Is the neural network stored in a SQL database of some kind?

A neural network like the one used in a large language model like GPT-3 is typically not stored in a SQL database. Instead, the network is typically represented as a set of mathematical equations and stored in memory as a series of arrays and matrices.

The network architecture, which defines the structure of the network and the connections between the neurons, is also typically represented as a set of equations and stored in memory. This allows the model to process input data and make predictions using the network architecture and the learned weights, which are stored as part of the network.

In some cases, the trained network and the network architecture may be stored on disk in a file or set of files for later use, but they are not typically stored in a SQL database. Instead, specialized file formats such as HDF5 or TensorFlow SavedModel are often used to store the trained network and the network architecture.
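The point that a trained network is essentially arrays of numbers that can be written to a file can be sketched as follows. This is an added illustration: the tiny weight matrix is invented, and JSON is used only because it ships with Python, as a stand-in for the specialized formats named above.

```python
import json
import os
import tempfile

# Hypothetical 2x3 weight matrix and bias vector for one tiny layer.
weights = {"W": [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]], "b": [0.0, 0.1]}


def forward(params, x):
    """Apply one fully-connected layer: y = W x + b."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + bias
        for row, bias in zip(params["W"], params["b"])
    ]


# Save the weights to disk, then load them back.
path = os.path.join(tempfile.mkdtemp(), "weights.json")
with open(path, "w") as f:
    json.dump(weights, f)
with open(path) as f:
    loaded = json.load(f)

x = [1.0, 2.0, 3.0]
print(forward(loaded, x) == forward(weights, x))  # True
```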

I’ve heard something about Microsoft Azure being a component of the architecture.

Microsoft Azure is a cloud computing platform and infrastructure that provides a range of services, including virtual machines, storage, networking, and analytics. It can be used to host and run a wide variety of applications, including machine learning models like the ones used in large language models like GPT-3.

In the case of GPT-3, Microsoft Azure was used to host the model and provide the computational resources needed to train and run the model. This included providing access to the training data and the hardware resources necessary to process the data and train the model.

Overall, Microsoft Azure played a role in the architecture of GPT-3 as the platform on which the model was hosted and run. However, it was not directly involved in the design or implementation of the model itself, which was developed by OpenAI.

What kind of computational resources are required for GPT-3? I imagine it must need a powerful supercomputer.

The server experienced an error while processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if the error persists.

Too many requests, please slow down

Foiled again! ChatGPT froze up. That happens a lot. So I switched over to GPT-3 for the answer.

GPT-3 requires a large amount of computational resources. It is estimated that the training of GPT-3 requires upwards of 300 petaflops of computing power, which is equivalent to the power of up to 10,000 of the most powerful computers in the world. This is an enormous amount of computing power and is currently only available to a few organizations.

I started over with ChatGPT and asked the question in a new chat, to see if the answers would be different from GPT-3’s answer…

GPT-3 (Generative Pre-training Transformer 3) is a large and powerful language model developed by OpenAI that has been trained on a massive amount of data. It requires significant computational resources to train and run, but it does not necessarily require a supercomputer.

The size and complexity of GPT-3 models vary significantly, ranging from the small GPT-3 175B model with 175 billion parameters to the largest GPT-3 model, the GPT-3 8.3B model with 8.3 trillion parameters. These models are trained on multiple NVIDIA graphics processing unit (GPU) servers and can take several weeks to complete training.

Once trained, GPT-3 models can be run on a variety of hardware, including GPUs, central processing units (CPUs), and even edge devices with low computational power. The specific hardware requirements will depend on the size of the model and the specific tasks it is being used for.

Waaaaiit a second… that’s not what GPT-3 just told me.

Is the following true? GPT-3 requires a large amount of computational resources. It is estimated that the training of GPT-3 requires upwards of 300 petaflops of computing power, which is equivalent to the power of up to 10,000 of the most powerful computers in the world.

The statement that GPT-3 requires upwards of 300 petaflops of computing power is not accurate. While GPT-3 does require a significant amount of computational resources to train, it does not require the equivalent of 10,000 of the most powerful computers in the world.

IN CONCLUSION…

That seems to be a good overview of how this technology works. However, in light of the fact that I caught GPT-3 in a lie at the end there, you have to wonder how much of what ChatGPT told me is accurate.