Hands-On Large Language Models by Jay Alammar & Maarten Grootendorst

Chapter 9: Multimodal Large Language Models

Introduction

Real communication isn't just words. Body language, intonation, and visual context all matter. Multimodal models handle data beyond text: images, audio, video, sensors. This chapter walks through three layers of multimodality. Vision Transformers (ViT) turn images into Transformer-friendly embeddings. CLIP builds a shared embedding space for images and text via contrastive learning. BLIP-2 bolts vision onto an existing text-generation LLM via a small bridge module (Q-Former). We end with hands-on image captioning and chat-based visual question answering.

A model can accept a modality without being able to generate in it. BLIP-2 here takes images and text in, but only generates text out.


Section 1: Transformers for Vision (ViT)

The Transformer architecture dominated NLP. Could it work for vision too? Yes. That's the Vision Transformer (ViT): same encoder design, different "tokenization."

The encoder needs tokens as input. For text, that's done by the tokenizer.

For images, ViT defines visual "tokens" by slicing the image into fixed-size patches. The original paper used 16x16 patches, hence the title "An Image is Worth 16×16 Words."

Patches can't be looked up in a fixed vocabulary the way text tokens can, because raw pixel patches almost never repeat exactly. Instead, each patch is linearly projected into an embedding vector. After that step, image patches are processed through the encoder just like text tokens.
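
A minimal sketch of that patch-and-project step (sizes are illustrative; real ViT implementations also add a class token and position embeddings):

import torch
import torch.nn as nn

# Illustrative ViT-style patch embedding: slice a 224x224 image into 16x16 patches and
# linearly project each one. A conv with kernel == stride == patch size does both at once.
patch_size, embed_dim = 16, 768
image = torch.randn(1, 3, 224, 224)                   # (batch, channels, height, width)
to_patch_embeddings = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

patches = to_patch_embeddings(image)                   # (1, 768, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)            # (1, 196, 768): 196 "visual tokens"

From here, the 196 patch embeddings enter the encoder just as a sequence of 196 text-token embeddings would.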

This shared "tokens-then-encoder" design is what makes it easy to bolt vision into existing language pipelines.


Section 2: Multimodal Embedding Models

Text-only embedders capture text semantics. Multimodal embeddings put images and text into the same vector space.

That means you can search for images with text queries (and vice versa) by comparing distances.

The standard model here is CLIP (Contrastive Language-Image Pre-training).

2.1 What CLIP Enables

  • Zero-shot classification: compare an image embedding to the embeddings of candidate class descriptions
  • Cross-modal clustering: cluster images and keywords together
  • Cross-modal search: "find images similar to this text," or vice versa
  • Generation: drive image generation models such as Stable Diffusion
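
For example, zero-shot classification can be sketched with the Hugging Face CLIP checkpoint used in the next subsection (a hedged sketch; the image path is a placeholder):

from PIL import Image
from transformers import CLIPProcessor, CLIPModel

# Zero-shot classification sketch: score one image against candidate class descriptions.
model_id = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
image = Image.open("some_image.png").convert("RGB")    # placeholder path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)   # image-text similarity as probabilities
print(dict(zip(labels, probs[0].tolist())))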

2.2 How CLIP Trains

Start with millions of (image, caption) pairs.

CLIP has two encoders, a text encoder and an image encoder (often a ViT). Each pair generates two embeddings.

Compute cosine similarity between each image embedding and each text embedding. Train so that matching pairs maximize similarity and non-matching pairs minimize it.

This is contrastive learning, covered in detail in Chapter 10.

Eventually embed("a picture of a cat") ≈ embed(actual_cat_image). Critical: the model also needs to see negative examples to learn what's dissimilar, not just what's similar.
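
A simplified sketch of that contrastive objective (illustrative only; the real CLIP loss uses a learned temperature and very large batches):

import torch
import torch.nn.functional as F

# Simplified CLIP-style contrastive loss for a batch of N matching (image, text) pairs.
# image_embs and text_embs are (N, d) outputs of the image and text encoders.
def contrastive_loss(image_embs, text_embs, temperature=0.07):
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_embs @ text_embs.T / temperature    # (N, N) cosine similarities
    targets = torch.arange(len(logits))                # matching pairs sit on the diagonal
    loss_images = F.cross_entropy(logits, targets)     # image -> text direction
    loss_texts = F.cross_entropy(logits.T, targets)    # text -> image direction
    return (loss_images + loss_texts) / 2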

2.3 OpenCLIP — Hands-On

from urllib.request import urlopen
from PIL import Image
from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel

puppy_path = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/puppy.png"
image = Image.open(urlopen(puppy_path)).convert("RGB")
caption = "a puppy playing in the snow"

model_id = "openai/clip-vit-base-patch32"
clip_tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
clip_processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

Embed the text:

inputs = clip_tokenizer(caption, return_tensors="pt")
clip_tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# ['<|startoftext|>', 'a</w>', 'puppy</w>', 'playing</w>', 'in</w>', 'the</w>', 'snow</w>', '<|endoftext|>']

text_embedding = model.get_text_features(**inputs)
text_embedding.shape  # torch.Size([1, 512])

Note that the tokenized text contains no [CLS] token. In CLIP, a [CLS]-style token is used on the image side (the ViT) to represent the image embedding.

Embed the image (the processor resizes to 224×224):

processed_image = clip_processor(text=None, images=image, return_tensors="pt")["pixel_values"]
processed_image.shape  # torch.Size([1, 3, 224, 224])
image_embedding = model.get_image_features(processed_image)
image_embedding.shape  # torch.Size([1, 512]) — same shape as text!

Same dimensionality means we can compute similarity directly:

import numpy as np

# Normalize both embeddings so the dot product equals cosine similarity
text_embedding /= text_embedding.norm(dim=-1, keepdim=True)
image_embedding /= image_embedding.norm(dim=-1, keepdim=True)
score = np.dot(text_embedding.detach().numpy(), image_embedding.detach().numpy().T)
# 0.33

A score of 0.33 sounds low in absolute terms, but in CLIP's distribution it's high. Comparing against multiple captions and images shows mismatched pairs scoring much lower.

2.4 Sentence-Transformers Wrapper

from sentence_transformers import SentenceTransformer, util

# Same CLIP checkpoint via sentence-transformers; `images` is a list of PIL images
# and `captions` a list of strings (e.g., the puppy image and caption from above).
model = SentenceTransformer("clip-ViT-B-32")
image_embeddings = model.encode(images)
text_embeddings = model.encode(captions)
sim_matrix = util.cos_sim(image_embeddings, text_embeddings)
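
A quick usage sketch of text-to-image search with the same wrapper (variable names as above; the query string is illustrative):

# Rank the images for a text query by cosine similarity and pick the best match.
query_embedding = model.encode(["a puppy playing in the snow"])
scores = util.cos_sim(query_embedding, image_embeddings)[0]
best_index = int(scores.argmax())                      # index of the best-matching image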

Section 3: Making Text Generation Models Multimodal

CLIP embeds images and text in a shared space, but that alone doesn't make a chatbot visual. To get an LLM that can reason about images and reply in text, we need to feed visual representations into the LLM's input.

Training such a model from scratch (billions of image-text pairs, GPU-years) is impractical. BLIP-2 takes a smarter route.

3.1 BLIP-2 Architecture: The Q-Former Bridge

BLIP-2 connects two frozen pretrained models, a vision encoder (ViT) and an LLM, via a small trainable module called the Q-Former (Querying Transformer). Only the bridge is trained. Vision encoder and LLM stay frozen.

The Q-Former has two halves that share attention layers. The image transformer interacts with the frozen ViT to extract visual features. The text transformer interacts with the LLM.

3.2 Two-Stage Training

Stage 1 is representation learning. Use image-caption pairs to train the Q-Former on three tasks:

  • Image-text contrastive learning: align matching image/text embeddings, push apart unmatched ones
  • Image-text matching: binary classification of whether an (image, text) pair really belongs together
  • Image-grounded text generation: generate text from the extracted image features

The result: the Q-Former produces embeddings that live in the same dimensional space as text. Visual information has been "translated."

Stage 2 is soft prompt projection. A linear projection layer maps the Q-Former's output to the embedding dimension the LLM expects. The projected embeddings act as soft visual prompts, and the LLM "reads" them like ordinary prompt tokens.
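
A conceptual sketch of that projection step (module names and sizes are illustrative, not BLIP-2's actual internals):

import torch
import torch.nn as nn

# The Q-Former emits a fixed number of query embeddings; a linear layer projects them
# into the LLM's hidden size so they can be prepended to the text embeddings as soft prompts.
num_queries, qformer_dim, llm_dim = 32, 768, 2560
qformer_output = torch.randn(1, num_queries, qformer_dim)    # stand-in for the Q-Former's output

projection = nn.Linear(qformer_dim, llm_dim)                  # the trainable bridge in this sketch
visual_prompts = projection(qformer_output)                   # (1, 32, 2560)

text_embeddings = torch.randn(1, 10, llm_dim)                 # stand-in for embedded prompt tokens
llm_input = torch.cat([visual_prompts, text_embeddings], dim=1)   # soft visual prompts + text
print(llm_input.shape)                                        # torch.Size([1, 42, 2560])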

End to end: the image passes through the frozen ViT and the Q-Former, the linear projection turns the result into soft visual prompts, and the frozen LLM reads those prompts alongside the text prompt to generate its reply.

This pattern (frozen visual encoder + frozen LLM + small trainable bridge) is the basis for many vision-language models. LLaVA and Idefics 2 use similar ideas. Idefics 2 builds on Mistral 7B, while LLaVA bridges CLIP-style encoders into LLMs.


Section 4: Hands-On with BLIP-2

from transformers import AutoProcessor, Blip2ForConditionalGeneration
import torch
from PIL import Image
from urllib.request import urlopen

blip_processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

model.vision_model and model.language_model give you the underlying ViT and decoder.

4.1 Image Preprocessing

car_path = "https://raw.githubusercontent.com/HandsOnLLM/Hands-On-Large-Language-Models/main/chapter09/images/car.png"
image = Image.open(urlopen(car_path)).convert("RGB")

inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
inputs["pixel_values"].shape   # torch.Size([1, 3, 224, 224])

The 520x492 input becomes 224x224. Wide or tall images get distorted because they're forced into a square. Be aware of that.

4.2 Text Tokenization

BLIP-2 uses a GPT2Tokenizer for text. Worth noting: spaces become Ġ because the byte-level tokenizer shifts the space's code point to a printable one (32 → 288).

text = "Her vocalization was remarkably melodic"
token_ids = blip_processor(image, text=text, return_tensors="pt")["input_ids"][0]
tokens = blip_processor.tokenizer.convert_ids_to_tokens(token_ids)
# ['</s>', 'Her', 'Ġvocal', 'ization', 'Ġwas', 'Ġremarkably', 'Ġmel', 'odic']

Replace Ġ with _ to read it: ['Her', '_vocal', 'ization', '_was', '_remarkably', '_mel', 'odic']. The underscore marks the start of a new word, and words split into multiple tokens are still recoverable.
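
A one-line check of that shift (288 is the code point of Ġ):

# The space character (code point 32) is shifted by 256 to the printable "Ġ" (code point 288).
print(ord(" ") + 256 == ord("Ġ"))   # True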

4.3 Use Case 1: Image Captioning

inputs = blip_processor(image, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=20)
caption = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
# "an orange supercar driving on the road at sunset"

Domain-specific images (anime characters, fictional creatures, internal corporate logos) may caption poorly because BLIP-2's pretraining is mostly public web data.

A Rorschach inkblot:

url = "https://upload.wikimedia.org/wikipedia/commons/7/70/Rorschach_blot_01.jpg"
image = Image.open(urlopen(url)).convert("RGB")
# → "a black and white ink drawing of a bat"

4.4 Use Case 2: Visual Question Answering / Chat

Pass both an image and a prompt:

prompt = "Question: Write down what you see in this picture. Answer:"
inputs = blip_processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
# "A sports car driving on the road at sunset"

Continue with chat history concatenated into the prompt:

prompt = ("Question: Write down what you see in this picture. "
          "Answer: A sports car driving on the road at sunset. "
          "Question: What would it cost me to drive that car? Answer:")
# "$1,000,000"

A simple conversational UI loops the prompt construction and calls model.generate per user message. See the chapter for the ipywidgets example.

# memory is a list of (question, answer) pairs from earlier turns; question is the new user message
template = "Question: {} Answer: {}."
prompt = " ".join([template.format(q, a) for q, a in memory]) + " Question: " + question + " Answer:"
inputs = blip_processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=100)

Each conversation turn is appended to memory, then concatenated as context for the next call. Same memory pattern as Chapter 7.
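
A plain-Python loop version of that pattern (the book's UI uses ipywidgets; this stand-in keeps the same memory handling and assumes the processor, model, device, and image defined above):

# Minimal chat loop: build the prompt from history, generate, then store the turn.
memory = []
while True:
    question = input("You: ").strip()
    if not question:
        break
    history = " ".join(f"Question: {q} Answer: {a}." for q, a in memory)
    prompt = (history + " " if history else "") + f"Question: {question} Answer:"
    inputs = blip_processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)
    generated_ids = model.generate(**inputs, max_new_tokens=100)
    answer = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
    print("BLIP-2:", answer)
    memory.append((question, answer))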


Summary

  • Modalities are types of data: text, images, audio, video, sensors. A model is multimodal if it can handle more than one.
  • Vision Transformer (ViT) brings the Transformer architecture to images. It slices each image into fixed-size patches, linearly projects each patch into an embedding, and feeds the patch sequence through the same encoder used for text.
  • CLIP trains separate text and image encoders to produce embeddings in a shared space via contrastive learning: pull matching pairs together, push non-matching pairs apart. Same shape (512-d in the example) means cosine similarity works across modalities.
  • CLIP enables zero-shot classification, cross-modal search, image clustering by text labels, and powers diffusion-based generators like Stable Diffusion.
  • BLIP-2 glues a frozen ViT to a frozen LLM via a small trainable module, the Q-Former. Stage 1 trains Q-Former on contrastive, matching, and caption-generation tasks. Stage 2 projects Q-Former embeddings through a linear layer to act as soft visual prompts for the LLM.
  • The same pattern (frozen vision + frozen LLM + small bridge) drives LLaVA, Idefics 2, and many other vision-language models.
  • BLIP-2 supports image captioning (image to text) and chat-based VQA (image plus question to answer). Follow-up questions chain via concatenated history.
  • Caveats: square-aspect-ratio preprocessing (224x224) distorts non-square images, and non-public or domain-specific images may produce weak captions because pretraining data was largely public web.