An Introduction to Large Language Models

Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 1: An Introduction to Large Language Models

Introduction

2023 was the year Language AI broke into the mainstream. ChatGPT crossed one million users in five days and one hundred million in two months, and large language models went from research demos to household tools. This chapter sets up the rest of the book. It explains what Language AI is, walks the history from bag-of-words through word2vec, attention, Transformers, BERT, and GPT to today's chatbots, defines what we mean by "LLM," outlines the two-step training paradigm, surveys common applications, discusses responsible use, and finishes by generating our first text with Phi-3-mini.


Section 1: What Is Language AI?

John McCarthy described artificial intelligence in 2007 as "the science and engineering of making intelligent machines, especially intelligent computer programs." The term has been stretched to cover everything from speech recognition to NPCs in computer games (which are often just if/else statements behind the curtain).

Language AI is the subfield focused on understanding, processing, and generating human language. It overlaps heavily with natural language processing (NLP), but in this book we use Language AI to include things that are not strictly LLMs but still shape the field, like the retrieval systems that give LLMs superpowers in Chapter 8.


Section 2: A Recent History of Language AI

Text is unstructured and loses meaning when stored as raw characters, so most of Language AI's history has been about finding ways to represent language in a form computers can compute on.

2.1 Bag-of-Words

Bag-of-words first showed up around the 1950s and became popular in the 2000s. There are three steps. First, tokenize each sentence into tokens, usually whitespace-delimited words (this falls apart for languages like Mandarin that don't put spaces between words). Second, build a vocabulary by collecting all unique tokens across the corpus. Third, count how many times each vocabulary word appears in each document. The count vector becomes the document representation.
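To make the three steps concrete, here is a minimal sketch in plain Python (the toy sentences and variable names are illustrative, not taken from the book):

from collections import Counter

documents = [
    "that is a cute dog",
    "my cat is cute",
]

# 1. Tokenize on whitespace
tokenized = [doc.split() for doc in documents]

# 2. Build the vocabulary of unique tokens across the corpus
vocabulary = sorted({token for doc in tokenized for token in doc})

# 3. Count how often each vocabulary word appears in each document
vectors = []
for doc in tokenized:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)   # ['a', 'cat', 'cute', 'dog', 'is', 'my', 'that']
print(vectors)      # [[1, 0, 1, 1, 1, 0, 1], [0, 1, 1, 0, 1, 1, 0]]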

Models that produce such vectors are called representation models. Bag-of-words ignores semantics: it treats language as a literal bag of words with no notion of meaning. But it isn't obsolete, and Chapter 5 shows how it complements modern models.

2.2 Dense Vector Embeddings (word2vec, 2013)

Embeddings are vector representations that try to capture meaning. Word2vec uses a neural network trained on a large corpus, like all of Wikipedia, to produce them.

A neural network has interconnected layers of nodes, with every connection carrying a weight (a parameter). Word2vec works like this:

  1. Assign every word in the vocabulary a random vector (e.g., 50 dimensions).
  2. Take pairs of words from the training data.
  3. Train the model to predict whether the two words are likely neighbors.
  4. Update embeddings so that words with similar neighbors end up with similar embeddings.
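As a rough illustration of that training loop, here is a minimal sketch using the gensim library (gensim and the toy corpus are assumptions made for illustration; in practice the model is trained on a far larger corpus such as Wikipedia):

from gensim.models import Word2Vec

sentences = [
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "king", "rules", "the", "kingdom"],
    ["a", "baby", "is", "a", "newborn", "human"],
    ["an", "apple", "is", "a", "fruit"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of each word vector
    window=2,         # how many neighboring words count as "context"
    min_count=1,      # keep every word, even rare ones
    sg=1,             # skip-gram: predict context words from the target word
)

print(model.wv["queen"][:5])                  # first 5 of 50 dimensions
print(model.wv.similarity("king", "queen"))   # cosine similarity of the two embeddings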

What does "meaning" look like in a vector? Imagine the word "baby" scoring high on properties like newborn and human, while "apple" scores low on both. In practice the dimensions don't map to human-readable concepts, but together they encode useful structure.

Compressed to 2D, semantically similar words cluster together. Distance metrics measure semantic similarity.

2.3 Types of Embeddings

Embeddings exist at different levels of abstraction. Token or word embeddings (word2vec) sit at the lowest level. Sentence embeddings produce one vector per sentence. Document embeddings, which is where bag-of-words sits, produce one vector per document.

Embeddings drive classification (Chapter 4), clustering (Chapter 5), and semantic search and RAG (Chapter 8).
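As a small illustration of sentence-level embeddings, here is a sketch using the sentence-transformers library (the library and the "all-MiniLM-L6-v2" checkpoint are assumptions for this example, not part of the chapter):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "That is a cute dog",
    "My cat is cute",
])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence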

2.4 Encoding and Decoding Context with RNNs

Word2vec produces static embeddings: "bank" has the same vector whether it refers to a financial institution or a riverbank. Recurrent neural networks (RNNs) were the first widespread fix. They are neural network variants designed to model sequences, and they play two roles, encoder and decoder: the encoder represents the input sentence, and the decoder generates the output sentence.

The architecture is autoregressive: every generated word feeds back as input for the next.

The encoder produces a single context embedding (it can use word2vec embeddings as inputs). Inputs are processed sequentially, one token at a time.

Single-context-vector RNNs choke on long sentences because one vector has to summarize the whole input.

2.5 Attention (2014)

Bahdanau et al. introduced attention as a fix. Instead of one final context vector, the decoder gets all hidden states and learns to weight (attend to) the relevant input tokens at each output step.

Translating "I love llamas" to "Ik hou van lama's", the attention between "lama's" and "llamas" is high while attention between "lama's" and "I" is low. This added context but kept the sequential bottleneck, which made training hard to parallelize.

2.6 The Transformer (2017)

"Attention is all you need" (Vaswani et al., 2017) proposed a network purely based on attention, with no recurrence at all. Two big wins came out of this: parallel training and better long-range modeling. The original Transformer is encoder-decoder, with both blocks autoregressive on the output side.

Each encoder block has two parts:

  • Self-attention, which attends to all positions in the same sequence
  • A feedforward neural network

Self-attention can look both forward and backward in the input sequence. It is not constrained to one direction at a time.

The decoder block adds an extra attention layer that looks at the encoder output, similar to RNN attention.

The decoder's self-attention is masked so a position can only attend to earlier positions. Without masking, the model would "see the future" during training.
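To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product self-attention with an optional causal mask (illustrative only; real Transformers add learned query/key/value projections, multiple heads, positional information, and layer normalization):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(queries, keys, values, causal=False):
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)    # how strongly each position attends to each other
    if causal:
        # Mask out future positions so a token only attends to earlier tokens
        future = np.triu(np.ones_like(scores, dtype=bool), 1)
        scores = np.where(future, -1e9, scores)
    weights = softmax(scores, axis=-1)        # attention weights sum to 1 per row
    return weights @ values                   # weighted mix of the value vectors

x = np.random.randn(4, 8)                     # 4 tokens, 8-dimensional embeddings
print(self_attention(x, x, x).shape)                   # encoder-style: (4, 8)
print(self_attention(x, x, x, causal=True).shape)      # decoder-style: (4, 8)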

Chapters 2 and 3 cover multi-head attention, positional embeddings, and layer normalization in detail.

2.7 Encoder-Only Models: BERT (2018)

The encoder-decoder Transformer is great for translation but awkward for tasks like classification. BERT (Bidirectional Encoder Representations from Transformers) is encoder-only. It drops the decoder entirely and uses 12 stacked encoders for the base model.

A special [CLS] (classification) token is prepended to each input. Its final embedding represents the whole sequence and is what you fine-tune on for tasks like classification.

BERT trains via masked language modeling (MLM). Randomly mask tokens in the input and train the model to predict them. This forces BERT to build deep contextual representations.
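A quick way to see MLM in action is the transformers fill-mask pipeline (the bert-base-uncased checkpoint is an assumed example, not one the book prescribes):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# The top predictions typically include "paris"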

BERT is built for transfer learning: pretrain on a huge corpus like Wikipedia, then fine-tune on a small task-specific dataset. Fine-tuning needs far less compute and data than pretraining.

Throughout this book, encoder-only models are called representation models because their job is to embed, not to generate. They get used for classification (Ch. 4), clustering (Ch. 5), and semantic search (Ch. 8).

2.8 Decoder-Only Models: GPT (2018+)

The same year as BERT, OpenAI released GPT-1, a decoder-only model targeting generation. The architecture stacks decoder blocks with no encoder and no encoder-attention.

Model            Parameters    Notable
GPT-1 (2018)     117M          Trained on 7K books + Common Crawl
GPT-2 (2019)     1.5B          First "human-indistinguishable" outputs
GPT-3 (2020)     175B          Few-shot learner
GPT-3.5 (2022)   undisclosed   Powered ChatGPT
GPT-4 (2023)     undisclosed   Multimodal

These decoder-only generative models are what most people mean when they say LLMs. At heart they are completion models: given a sequence of text, they predict what comes next. Once fine-tuned on instructions, they become instruct or chat models that answer questions instead of merely continuing text.

The context length (or context window) is the maximum number of tokens the model can process. Because generation is autoregressive, the running context grows with every emitted token.
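As a small illustration, you can compare a prompt's token count with the context window that transformers reports for a checkpoint (a sketch, not code from the book; the exact value depends on the model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
prompt = "Create a funny joke about chickens."
n_tokens = len(tokenizer(prompt)["input_ids"])
print(f"{n_tokens} tokens used of a {tokenizer.model_max_length}-token context window")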

2.9 The Year of Generative AI

2023 saw an explosion of both proprietary and open-source LLMs.

Open base models are often called foundation models and can be fine-tuned for downstream tasks. Beyond Transformers, new architectures like Mamba (selective state space models) and RWKV (RNN-style with Transformer-level performance) aim for longer contexts and faster inference.


Section 3: The Moving Definition of a "Large Language Model"

"Large" is a moving target. Is a 10x-smaller GPT-3 still an LLM? Is a GPT-4-sized classifier with no generation an "LLM"? The book's working definition is intentionally broad:

Any model that represents or generates language at meaningful scale, including encoder-only models and models smaller than 1B parameters that don't generate text.

So embedding models, BERT-style classifiers, and even bag-of-words can play roles inside the LLM toolbox.


Section 4: The Training Paradigm of LLMs

Traditional ML is a single step: train one model for one task.

LLMs need at least two:

  1. Pretraining (language modeling) — train on a vast text corpus (Llama 2 used 2 trillion tokens) to learn grammar, context, and patterns. The model predicts the next word. The result is a base or foundation model that does not follow instructions. This phase eats most of the compute.
  2. Fine-tuning (post-training) — adapt the base model to a narrower task like classification, instruction-following, or chat. Far cheaper than pretraining.
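To make the pretraining objective in step 1 concrete, here is an illustrative sketch (not from the book) of a causal language model scoring the next token; gpt2 is used here only because it is small to download, and any base model behaves the same way:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: [batch, sequence, vocab]
next_token_id = logits[0, -1].argmax().item()  # most probable continuation
print(tokenizer.decode(next_token_id))         # likely " Paris"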

Optional alignment or preference-tuning stages further shape model behavior to user preferences (Chapter 12).


Section 5: What Makes LLMs So Useful?

A sample of common applications and the chapter that handles each:

  • Sentiment analysis of customer reviews (supervised classification): encoder-only or decoder-only models, pretrained or fine-tuned (Chapters 4, 11)
  • Discovering topics in support tickets (unsupervised): encoder-only models for clusters, decoder-only models for labels (Chapter 5)
  • Retrieval / semantic search over documents: embedding models plus custom fine-tuning (Chapters 8, 12)
  • Chatbot with tools and external documents: prompt engineering + RAG + fine-tuning (Chapters 6, 8, 12)
  • Recipe from a photo of your fridge: multimodal LLM (Chapter 9)

Section 6: Responsible LLM Development and Usage

Five things worth tracking:

  • Bias and fairness — training data is rarely shared, so biases are hard to audit.
  • Transparency and accountability — users may not know they're talking to an AI, and medical-domain LLMs may be regulated as devices.
  • Generating harmful content — confident hallucinations, fake news, misleading articles.
  • Intellectual property — ownership is unclear when output mirrors training data.
  • Regulation — for example, the EU AI Act regulates foundation models.

Section 7: Limited Resources Are All You Need

Compute means GPUs, and the binding constraint is usually VRAM (video memory). Some models simply will not load if you don't have enough.

Meta trained Llama 2 on A100-80GB GPUs for around 3.3M GPU-hours. At $1.50 per hour, that's roughly $5M. "GPU-poor" is the term for anyone without a powerful GPU. This book targets the GPU-poor, with code that runs in a free Google Colab T4 (16GB VRAM), the suggested minimum.

VRAM requirements depend on architecture, size, compression (quantization), context size, and inference backend.
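A rough rule of thumb (an assumption, not a formula from the book) is that the weights alone need about parameters times bytes-per-parameter of memory, before activations, KV cache, and framework overhead are added on top:

# Back-of-the-envelope VRAM estimate for model weights only
def weight_vram_gb(n_params: float, bytes_per_param: float) -> float:
    return n_params * bytes_per_param / 1024**3

print(round(weight_vram_gb(3.8e9, 2.0), 1))  # ~7.1 GB: a 3.8B-parameter model in 16-bit
print(round(weight_vram_gb(3.8e9, 0.5), 1))  # ~1.8 GB: the same model at 4-bit quantization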


Section 8: Interfacing with LLMs

8.1 Proprietary (Closed-Source) Models

Examples include OpenAI GPT-4 and Anthropic Claude. You access them via API. Weights and architecture are private.

Pros                     Cons
No GPU needed            Costs money
No hosting expertise     No fine-tuning
Often more performant    Data shared with provider
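As a hypothetical sketch (not from the book) of what API access looks like, here is a call with the OpenAI Python client; the client usage and model name are assumptions about that library rather than part of this chapter:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Create a funny joke about chickens."}],
)
print(response.choices[0].message.content)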

8.2 Open Models

Examples include Cohere Command R, Mistral, Microsoft Phi, and Meta Llama. Weights and architecture are public. License terms vary, and some restrict commercial use, which sparks debate over what "open source" really means here, since training data and code are often not shared.

Pros                           Cons
Full control / transparency    Need a powerful GPU
Local data stays local         Setup expertise required
Free and fine-tunable

The book prefers open models where possible.

8.3 Open Source Frameworks

Hundreds of frameworks exist. The book focuses on backend packages (no GUI, just load + run a model):

  • llama.cpp — efficient CPU/GPU inference
  • LangChain — chaining and agent orchestration
  • Hugging Face Transformers — the foundation for almost everything

GUI options like text-generation-webui, KoboldCpp, and LM Studio exist if you want a ChatGPT-like local interface.


Section 9: Generating Your First Text

The main source for downloading models is Hugging Face Hub, which hosted over 800K models at the time of writing. The default model in this book is Phi-3-mini: 3.8B parameters, runs in under 8GB VRAM (under 6GB with quantization), MIT-licensed.

When you use an LLM, two things load: the generative model itself, and the tokenizer that splits input text into tokens.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the generative model onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",          # place the model weights on the GPU
    torch_dtype="auto",         # use the data type the checkpoint was saved in
    trust_remote_code=True,     # run the custom model code shipped with the repo
)

# Load the matching tokenizer, which converts text to and from token IDs
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

transformers.pipeline wraps model + tokenizer + generation:

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

The key parameters:

  • return_full_text=False returns only the model's newly generated text rather than echoing the prompt back.
  • max_new_tokens=500 caps output length so generation can't run unbounded toward the context window.
  • do_sample=False always picks the most probable next token (deterministic, no creativity). Chapter 6 covers sampling parameters.

The prompt is a list of role/content dicts:

messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]
output = generator(messages)
print(output[0]["generated_text"])
# Why don't chickens like to go to the gym? Because they can't crack the egg-sistence of it!

Summary

  • Language AI is the subfield of AI for understanding and generating language. It overlaps with NLP. LLMs are its current dominant paradigm.
  • The history runs bag-of-words → word2vec embeddings → RNN encoder-decoder → attention → Transformers → BERT (encoder-only, representation) and GPT (decoder-only, generative).
  • Embeddings turn text into vectors that encode meaning. They power classification, clustering, and search.
  • Self-attention is the core idea: every token attends to every other in a single parallelizable pass. Decoders mask future positions to avoid information leakage.
  • LLMs train in two main steps. Pretraining works on a huge corpus, predicts the next token, and is very expensive. Fine-tuning is cheaper and is task- or instruction-specific. Optional alignment follows.
  • "Large" is fluid. This book covers both generative LLMs and smaller representation models like BERT, embedders, and even bag-of-words.
  • Proprietary models like GPT-4 and Claude are easy but private and paid. Open models like Llama, Phi, and Mistral need GPUs but offer control. The book prefers open.
  • VRAM is the binding hardware constraint. The book targets free Google Colab (T4, 16GB).
  • Phi-3-mini is the default workhorse: small, performant, MIT-licensed.
  • Responsible use considerations include bias, transparency, harmful content, IP, and regulation.