ChatGPT Without The Hype

How ChatGPT became so good

Published March 20, 2023 in Artificial Intelligence. 7 min read

You have probably heard of ChatGPT by now, and chances are you are either very impressed or scared or, more likely, both. So I just wanted to give a quick rundown of how ChatGPT works and a brief history of how we got here because, spoiler: it didn’t happen overnight.

To sprinkle in some irony (and perhaps highlight my laziness), a large portion of this article contains snippets from ChatGPT responses. The full sequence of my questions is listed at the end, and each snippet is cited by question number, so ChatGPT’s response to my first question is cited as: GPT Text [1]. The answers from ChatGPT are sliced up, with many sentences/words omitted to fit better into the general flow of the article. Also, I’m in the process of writing a supplementary article that covers my personal thoughts on the general progress of AI.

If you came to this article having read a similar one for blockchains, I would like to make it clear that I won’t be coding ChatGPT from scratch today (or most likely ever). I would like to apologize for this extremely human inconvenience.

Current Day

“ChatGPT is an AI-powered conversational agent, also known as a chatbot, that is based on the GPT-3.5 architecture developed by OpenAI. It has been trained on a large dataset of human language to generate human-like responses to natural language queries or prompts.” [1]

“ChatGPT is a variant of the GPT (Generative Pre-trained Transformer) family of language models. It is built on top of the GPT-3 architecture, which uses a transformer-based neural network to process natural language.” [2]

The Beginning of the End?

Before I continue, I just wanted to explain an acronym that you may come across: LLM. LLM refers to Large Language Models, which include OpenAI’s GPT models, Google’s BERT model, Facebook's OPT and, more recently, LLaMA. Large Language Models encompass many different architectures and approaches to natural language processing, but the core similarities are that they use artificial neural networks and are large (duh), with hundreds of millions to hundreds of billions of parameters. Parameters in this context can be thought of as similar to the tuning knobs on a guitar: the knobs tune the sound of the instrument, and in the case of LLMs, the parameters tune the generated output.

“The transformer is a type of neural network architecture that was introduced in 2017 by Vaswani et al. in their paper "Attention Is All You Need." It uses self-attention mechanisms to process input sequences, allowing it to better capture long-term dependencies in language.” [2]
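To make the "self-attention" part of that quote a little more concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not how GPT is actually implemented: the function names and the toy vectors are made up for this article, there is only a single attention head, and the learned query/key/value projections are skipped so that each token vector plays all three roles.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention without learned projections (a simplification).

    tokens: list of equal-length vectors, one per position.
    Returns one vector per position: a weighted average of all token vectors.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Score this position against every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Mix the token vectors according to the attention weights.
        mixed = [
            sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs

# Made-up 2-dimensional "embeddings" for a three-token input.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(toy_tokens)
```

Every output vector is a blend of all the input vectors, which is the sense in which each position can "look at" the whole sequence at once.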

The paper, Attention Is All You Need, was written in mid-2017 by a group of Google researchers and is easily one of the most influential “AI” papers of the last 10 years.

“During pre-training, the model learns to predict the next word in a sequence based on the previous words.” [3] Since ChatGPT didn’t really emphasize this point as strongly, I wanted to. It may be hard to believe, but the output of the model is generated word-by-word, meaning that when it writes a reply it goes something like: “Hi …” → “, …” → “I’m …” → “ChatGPT” → “.”. Each word is drawn from a probability distribution based on the inputs it has received and the words it has already generated. Needless to say, in reality, it’s not this simple or quite like this:

“The transformer architecture in GPT includes self-attention mechanisms and position-wise feedforward neural networks in each layer of the network. The self-attention mechanism allows the network to selectively focus on different parts of the input sequence, while the feedforward network applies a non-linear transformation to each hidden representation in the sequence. The output of each layer is a new sequence of hidden representations, which are passed on to the next layer of the network.” [3]
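Stepping back from the layer internals, the word-by-word generation described earlier can be sketched as a simple loop. This is a toy under loud assumptions: the probability table `NEXT_WORD_PROBS` is invented for illustration, and it only looks at the single previous word, whereas a real model computes the probabilities from the entire history using the layers described above.

```python
import random

# Hypothetical next-word probabilities, keyed by the previous word only.
NEXT_WORD_PROBS = {
    "<start>": {"Hi": 0.9, "Hello": 0.1},
    "Hi": {",": 0.8, "!": 0.2},
    "Hello": {",": 1.0},
    ",": {"I'm": 1.0},
    "I'm": {"ChatGPT": 0.7, "here": 0.3},
    "ChatGPT": {".": 1.0},
    "here": {".": 1.0},
    ".": {"<end>": 1.0},
    "!": {"<end>": 1.0},
}

def generate(greedy=True, seed=0, max_words=10):
    """Produce words one at a time until the toy model emits <end>."""
    rng = random.Random(seed)
    words = []
    current = "<start>"
    for _ in range(max_words):
        options = NEXT_WORD_PROBS[current]
        if greedy:
            # Always take the single most likely next word.
            next_word = max(options, key=options.get)
        else:
            # Sample the next word in proportion to its probability.
            next_word = rng.choices(list(options), weights=list(options.values()))[0]
        if next_word == "<end>":
            break
        words.append(next_word)
        current = next_word
    return words

print(generate())  # prints ['Hi', ',', "I'm", 'ChatGPT', '.']
```

Calling `generate(greedy=False)` samples instead of always picking the most likely word, which is roughly why the same prompt can produce different answers.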

Bigger is Always Better?

GPT-3 and GPT-4 (or whatever the current iteration of GPT is when you are reading) are, as you can guess, the 3rd and 4th iterations of the GPT series of models. The first GPT model was released in 2018 and was met with a lot of buzz in some tech circles but largely stayed out of public view. It was a similar story with GPT-2 (released in 2019). So why did GPT-3 suddenly explode? A large part of it no doubt was the fact that products like ChatGPT allowed the general public to really interact with these models, but there are also some key technological differences between GPT-2 and GPT-3.

“GPT-2 is an earlier version of the GPT architecture, introduced by OpenAI in 2019. It consists of a 12-layer transformer network, with a total of 1.5 billion parameters. GPT-3, on the other hand, is a much larger and more powerful language model, released by OpenAI in 2020. It consists of a 175-billion-parameter transformer network, making it one of the largest language models ever developed.” [4]

Hence most of the improvement in text generation itself came purely from the fact that the number of parameters was roughly 100x larger. ChatGPT nails one of the big reasons why GPT-3 caught the general public’s interest:
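The "100x" figure is a rounding of the two parameter counts in the quote above. A quick check of the arithmetic, using only those two numbers (the variable names are mine):

```python
GPT2_PARAMS = 1.5e9   # 1.5 billion, as quoted above
GPT3_PARAMS = 175e9   # 175 billion, as quoted above

ratio = GPT3_PARAMS / GPT2_PARAMS
print(f"GPT-3 has about {ratio:.0f}x as many parameters as GPT-2")
```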

“One key feature of ChatGPT-3 is the use of adaptive prompt-based learning, which allows the model to adapt its behavior based on the prompt it receives. This means that the model can generate different types of responses based on the context and tone of the conversation. Another major innovation in ChatGPT-3 is the use of a continuous prompt conditioning mechanism, which allows the model to incorporate information from multiple prompts over the course of a conversation.” [5]

BERT vs GPT

If you asked most people in AI circles even as recently as 2019 which model was the most state of the art, most would have said BERT, Google’s 2018 model, which is built on the transformer architecture introduced in the Attention Is All You Need paper. But one key difference, as ChatGPT highlights, is that for GPT: “The training process involves predicting the next word in a sequence, given the preceding words in the sequence. Instead of predicting the next word in a sequence, BERT is trained to predict missing words in a sentence, given both the preceding and following words.” [6]

“Another key difference between GPT and BERT is the directionality of their architecture. GPT is a unidirectional model, meaning that it processes text from left to right, while BERT is a bidirectional model that processes text in both directions. This makes BERT better suited for tasks that require a deep understanding of the context and meaning of a sentence, such as question answering and natural language inference.” [6]
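The two training setups in these quotes can be sketched by building the (context, target) pairs each model would learn from. This is a rough illustration with hypothetical helper names: words stand in for tokens, and every word is masked in turn, whereas real BERT training masks only a random subset of tokens.

```python
def next_word_examples(words):
    """GPT-style: predict each word from the words to its left only."""
    return [(words[:i], words[i]) for i in range(1, len(words))]

def masked_word_examples(words, mask_token="[MASK]"):
    """BERT-style: hide one word and predict it from both sides."""
    examples = []
    for i, target in enumerate(words):
        context = words[:i] + [mask_token] + words[i + 1:]
        examples.append((context, target))
    return examples

sentence = ["the", "cat", "sat", "down"]

for context, target in next_word_examples(sentence):
    print("GPT  sees", context, "and predicts", target)

for context, target in masked_word_examples(sentence):
    print("BERT sees", context, "and predicts", target)
```

The GPT-style contexts never include anything to the right of the target, which is what "unidirectional" means here, while the BERT-style contexts include both sides of the hidden word.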

Additionally, BERT is significantly smaller than GPT-3; it is closer in size to GPT-2. Together, these factors made GPT much more intriguing for general use: ChatGPT was able to produce long paragraphs of pretty coherent content on a wide variety of topics, which is much more interesting than guessing a single correct word, even though the context in which BERT does so may be more impressive.

Questions Asked:

1. Explain chatgpt without the hype
2. Can you go into more technical detail?
3. Explain the transformer architecture
4. What is the difference between chat gpt 2 vs 3
   1. I messed up this prompt slightly as I meant “between gpt 2 vs 3”
5. What are the differences in architecture between the two?
6. What is the difference between gpt and bert
7. Can you code ChatGPT for me
   1. Just for fun, ya know

This was intentionally a very high-level overview of LLMs. Here are some excellent resources to learn more about them:

Attention is All you Need [Research Paper]
Language models are few-shot learners [GPT-3 Research Paper]
The Illustrated Transformer [Article w/ Amazing Visuals]
Transformers from Scratch [What I initially wanted to write before realizing it has been written better than I could have written it]