Chapter 8: Dataset Engineering
Introduction
A model is only as good as its training data. As fewer companies can train foundation models from scratch, data has become the primary differentiator. The OpenAI contributor lists tell the story. GPT-3 (2020) credited 2 people for data. GPT-4 (2023) credited 80, plus contracted annotators. This chapter walks through dataset engineering, the science of creating data that lets you train the best possible model on a budget. The focus is on post-training data since that's what application developers actually deal with, but the chapter borrows from pre-training where it's instructive.
Section 1: A Data-Centric View of AI
- Model-centric AI improves the model: architecture, scale, training. The dataset (e.g., ImageNet) is fixed, and you train the best model possible.
- Data-centric AI improves the data. The model is fixed, and you create a dataset that gives it the best performance.
Andrew Ng launched a data-centric AI competition in 2021. DataComp (2023) ran a competition for the best dataset for CLIP training, evaluated by downstream task performance on a standardized training script.
In practice, you need both.
Section 2: Data Curation
The right data gets you a more capable, safer, longer-context model. The wrong data gets you biases, hallucinations, and wasted resources.
What "data" means depends on the task:
- Self-supervised finetuning: sequences of data.
- Instruction finetuning: (instruction, response) pairs.
- Preference finetuning: (instruction, winning response, losing response) triples.
- Reward model: preference data, or ((instruction, response), score) pairs.
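To make these formats concrete, here's what a single example of each might look like, sketched as Python dicts (the field names are illustrative, not a standard schema):

```python
# Illustrative single examples of each format (field names are not a standard schema).
self_supervised = {"text": "The defendant moved to dismiss on the grounds that..."}

instruction_ft = {
    "instruction": "Summarize the motion in one sentence.",
    "response": "The defendant asked the court to dismiss the case.",
}

preference_ft = {
    "instruction": "Summarize the motion in one sentence.",
    "winning_response": "The defendant asked the court to dismiss the case.",
    "losing_response": "There was a motion.",
}

reward_model_example = {
    "instruction": "Summarize the motion in one sentence.",
    "response": "The defendant asked the court to dismiss the case.",
    "score": 0.9,  # scalar quality score instead of a win/lose pair
}
```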
Hard-to-Acquire Behaviors
- Chain-of-thought (CoT) requires step-by-step responses. Writing out multi-step reasoning is tedious for annotators, so CoT datasets are rarer.
Without CoT:
```
Q: What is the boiling point of Nitrogen?
A: -320.4F
```
With CoT:
```
Q: The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. How many?
A: They had 23. Used 20 → 23 - 20 = 3. Bought 6 → 3 + 6 = 9.
```
- Tool use. Operations efficient for humans differ from those efficient for AI: humans use a web UI, AI uses an API. Many teams use simulations to generate tool-use data. Tool use also needs special multi-message formats; Llama 3 designed one with source/destination headers.
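Here's a hypothetical multi-message tool-use example in that spirit. The `source`/`destination` fields are illustrative stand-ins, not Llama 3's actual header tokens:

```python
# Hypothetical multi-message tool-use training example. The header fields
# (source, destination) are illustrative, not Llama 3's exact format.
tool_use_example = [
    {"source": "user", "destination": "assistant",
     "content": "What's the weather in Hanoi right now?"},
    {"source": "assistant", "destination": "weather_api",   # model emits a tool call
     "content": '{"endpoint": "current", "city": "Hanoi"}'},
    {"source": "weather_api", "destination": "assistant",   # tool result fed back
     "content": '{"temp_c": 31, "condition": "humid"}'},
    {"source": "assistant", "destination": "user",
     "content": "It's currently 31°C and humid in Hanoi."},
]
```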
Single-Turn vs. Multi-Turn
Single-turn is simpler and cheaper. Multi-turn captures clarification, correction, and follow-up. Real-world tasks usually need both.
Data Curation Is Also About Removal
Example: chatbot users complain it adds unsolicited rewrites after fact-checks. You investigate and find some training annotations include unsolicited suggestions. Remove them, and add new examples without rewriting.
Three Criteria
Data curation is like cooking:
- Quality is ingredient quality (no spoiled food).
- Coverage is the ingredient mix (right amount of each).
- Quantity is how many ingredients.
2.1 Data Quality
Small high-quality data beats large noisy data. Yi model: 10K carefully crafted instructions beat hundreds of thousands of noisy ones. LIMA: a 65B Llama finetuned on 1,000 curated prompts produced answers equivalent to or preferred over GPT-4's in 43% of cases (still not robust enough for production).
The Llama 3 team found human-generated data is more error-prone than expected, especially for nuanced safety policies. They built AI-assisted annotation tools.
Six attributes of high-quality data:
- Relevant. Aligned with the task domain.
- Aligned with task requirements. Matches what the user actually needs, not just "correct."
- Consistent. Across examples and annotators.
- Correctly formatted. Clean. No stray HTML, trailing spaces, inconsistent casing.
- Sufficiently unique. Duplications cause biases and contamination.
- Compliant with internal policies, laws, and regulations (no PII if disallowed).
2.2 Data Coverage
Real users have diverse needs and writing styles. Coverage is diversity. If users mix detailed and short instructions, training data should include both. If queries have typos, training should too.
Different apps need different diversity axes. For a French-to-English translation app, language diversity doesn't matter (the language pair is fixed), but topic, length, and style diversity do.
NVIDIA's Nemotron focused on task diversity, topic diversity, and instruction diversity (output formats, lengths, open-ended vs. yes/no).
Llama 3's gains over Llama 2 came from data quality and diversity plus more training scale, not from architecture changes.
Llama 3 Domain Mix Across Phases
| Domain | Pre-train | SFT | Preference |
|---|---|---|---|
| General knowledge (English) | 50% | 52.66% | 81.99% |
| Math and reasoning | 25% | 21.19% | 5.89% |
| Coding | 17% | 14.89% | 6.93% |
| Multilingual | 8% | 3.01% | 5.19% |
| Exam-like | — | 8.14% | — |
| Long context | — | 0.11% | — |
Math and code together make up roughly 40% of pre-training data and over a third of SFT data. Annealing on small amounts of high-quality math/code data near the end of pre-training significantly boosts reasoning.
How to Pick a Mix?
- Match the real-world distribution.
- Scaling-law experiments. Train small models on candidate mixes, then extrapolate to predict large-model performance (sketched below).
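A minimal sketch of the scaling-law approach: fit a power law to small-model losses on one candidate mix (the numbers below are made up), then extrapolate to the target size and compare mixes by predicted loss:

```python
# Scaling-law sketch: fit loss = a * N^(-b) + c to small-model results,
# then extrapolate to the target model size. Data points are hypothetical.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    # Loss falls off as a power of model size, flooring at an irreducible c.
    return a * n_params ** (-b) + c

sizes = np.array([125e6, 350e6, 760e6])   # small proxy models
losses = np.array([3.10, 2.81, 2.62])     # measured validation losses on one mix

params, _ = curve_fit(power_law, sizes, losses, p0=[20.0, 0.15, 1.5], maxfev=10000)
print(f"Predicted loss at 7B params: {power_law(7e9, *params):.2f}")
# Repeat per candidate mix; pick the mix with the best predicted loss.
```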
A 7B model finetuned on data that's both high-quality and diverse outperforms one finetuned on data that's only one or the other.
2.3 Data Quantity
How much data you need varies wildly:
- Howard & Whitaker showed LLMs can learn from a single example.
- Llama 2: 2T tokens. Llama 3: 16T tokens (roughly 1B and 8B examples at 2K tokens each).
If you have millions of examples, consider training from scratch. Finetuning can suffer from ossification, where pre-training effectively freezes the model's weights so they adapt poorly to finetuning data. Smaller models are more susceptible than larger ones.
Three other factors:
- Finetuning technique. Full finetuning needs orders of magnitude more data than PEFT methods like LoRA.
- Task complexity. Sentiment classification < financial Q&A.
- Base model performance. Better base means fewer examples needed.
Strategy: start with a small dataset (around 50 examples). If finetuning shows clear improvement, more data will probably help. If not, more data rarely fixes it; check for hyperparameter and prompt-formatting issues first.
Reducing High-Quality Data Demand
- Self-supervised to supervised. Finetune on raw legal docs first, then on (Q, A) pairs.
- Less-relevant to relevant. Pre-finetune on tweet sentiment, then product sentiment.
- Synthetic to real. Synthesize medical records, finetune, then finetune on real records.
Performance Gain Curve
Train on 25%, 50%, and 100% of your data, then plot. A steep slope means more data helps. A plateau means diminishing returns.
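A quick sketch of the plot; the accuracy numbers are placeholders for your own eval results:

```python
# Plot eval performance vs. training-data fraction to see where returns diminish.
import matplotlib.pyplot as plt

fractions = [0.25, 0.50, 1.00]
accuracy = [0.71, 0.76, 0.78]   # hypothetical results from three finetuning runs

plt.plot(fractions, accuracy, marker="o")
plt.xlabel("Fraction of training data")
plt.ylabel("Eval accuracy")
plt.title("Performance gain curve")
plt.show()
# Steep slope between the last points: more data will likely help.
# Flattening curve: diminishing returns; invest in quality/coverage instead.
```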
Number of tasks matters too. Going from 9 to 282 tasks improved Flan models substantially; gains plateaued past roughly 282.
2.4 Data Acquisition and Annotation
The best source is your own application data, because it's perfectly relevant. That's why everyone obsesses over data flywheels.
A common dataset development workflow:
- Find an available dataset (10K examples, say).
- Filter low-quality instructions, leaving 9,000.
- Set aside 3,000 with low-quality responses, leaving 6,000 high quality.
- Manually write responses for those 3,000, totaling 9,000.
- Generate 2,000 synthetic instructions for an underrepresented topic.
- Manually annotate them, totaling 11,000.
Reality is messier: revising annotation guidelines mid-project, fact-checking annotations, etc.
Public Dataset Resources
- Hugging Face and Kaggle (hundreds of thousands of datasets).
- Google Dataset Search.
- Government open data: Data.gov, data.gov.in.
- University of Michigan's ICPSR for social science data.
- UC Irvine ML Repository, OpenML.
- Open Data Network.
- Cloud providers (e.g., AWS Open Data).
- TensorFlow Datasets.
- EleutherAI's lm-evaluation-harness (~400 benchmarks, averaging 2,000+ examples each).
- Stanford Large Network Dataset Collection for graphs.
Always check licenses. Even commercial-use licenses can include sub-data with restrictions.
Annotation guidelines are often the hardest part. They double as evaluation guidelines (Chapter 4). Invest once, use twice.
Section 3: Data Augmentation and Synthesis
- Data augmentation creates new data from real data (flip a cat photo).
- Data synthesis mimics real data without using it (simulate bot mouse movements).
This book uses the term data synthesis to cover both.
3.1 Why Data Synthesis
- Increase quantity when real-world data is scarce (rare weather, deep sea, self-driving accidents).
- Increase coverage by generating targeted edge cases, adversarial examples, and rare-class examples. TrueTeacher used LLM-generated factually inconsistent summaries to train a detector.
- Increase quality because humans miss patterns AI catches, and AI is more consistent for preference data.
- Mitigate privacy concerns in healthcare and insurance.
- Distill models by training a small student on a large teacher's outputs.
3.2 Traditional Synthesis Techniques
Rule-Based / Procedural Generation
Templates plus random generators (e.g., Faker). Example transaction template:
```
Transaction ID: [Unique Identifier]
Date: [MM/DD/YYYY]
Time: [HH:MM:SS]
Amount: [Transaction Amount]
Merchant Name: [Merchant/Store Name]
Merchant Category: [Category Code]
Location: [City, State, Country]
Payment Method: [Credit Card/Debit Card/Cash/Online Payment]
Transaction Status: [Completed/Pending/Failed]
Description: [Transaction Description]
```
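A minimal sketch of filling this template with the Faker library and Python's `random` module (assumes `pip install faker`; the field choices are illustrative):

```python
# Fill the transaction template with Faker + random.
import random
from faker import Faker

fake = Faker()

def synthetic_transaction() -> str:
    return "\n".join([
        f"Transaction ID: {fake.uuid4()}",
        f"Date: {fake.date(pattern='%m/%d/%Y')}",
        f"Time: {fake.time()}",
        f"Amount: {round(random.uniform(1.00, 2000.00), 2)}",
        f"Merchant Name: {fake.company()}",
        f"Merchant Category: {random.randint(1000, 9999)}",
        f"Location: {fake.city()}, {fake.state_abbr()}, {fake.country()}",
        f"Payment Method: {random.choice(['Credit Card', 'Debit Card', 'Cash', 'Online Payment'])}",
        f"Transaction Status: {random.choice(['Completed', 'Pending', 'Failed'])}",
        f"Description: {fake.sentence(nb_words=8)}",
    ])

print(synthetic_transaction())
```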
DeepMind's AlphaGeometry trained on 100M synthetic Olympiad-level geometry examples.
Image transformations (rotate, crop, scale, erase). AlexNet famously used these for ImageNet.
Text transformations replace words with synonyms (via dictionary or embedding similarity). Useful for bias mitigation:
| Original | Augmented |
|---|---|
| She's a fantastic nurse. | He's a fantastic nurse. |
| The CEO of the firm, Mr. Alex Wang, … | The CEO of the firm, Ms. Alexa Wang, … |
| Today, my mom made a casserole. | Today, my dad made a casserole. |
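A toy sketch of this swap-based augmentation. The swap map is deliberately tiny; a real implementation needs much more care (names, titles, pronoun cases like "her" as object vs. possessive):

```python
import re

# Deliberately tiny swap map for illustration only.
SWAPS = {"she": "he", "he": "she", "her": "his", "his": "her", "mom": "dad", "dad": "mom"}

def swap_gendered_terms(text: str) -> str:
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    def replace(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        # Preserve the capitalization of the original word.
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(pattern, replace, text, flags=re.IGNORECASE)

print(swap_gendered_terms("She's a fantastic nurse."))         # He's a fantastic nurse.
print(swap_gendered_terms("Today, my mom made a casserole."))  # Today, my dad made a casserole.
```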
Perturbation adds noise. The "One Pixel Attack" showed 67.97% of Kaggle CIFAR-10 test images can be misclassified by changing a single pixel. Training on perturbed data improves robustness. BERT replaces 1.5% of tokens with random words (15% of tokens are selected for prediction, 10% of which are swapped for random words).
Simulation
- Self-driving cars: CARLA, Waymo SimulationCity, Tesla's San Francisco simulation.
- Robotics: simulate joint movements, train only on successful trials.
- Sim2Real: adapt simulation-trained models to reality.
- Tool use: simulate action sequences, validate, use the most efficient.
Especially valuable for rare events like IPOs, bankruptcies, manufacturing defects, and climate scenarios.
3.3 AI-Powered Data Synthesis
API Simulation
StableToolBench uses AI to simulate API outcomes instead of making actual API calls.
Self-Play
- OpenAI's Dota 2 bot played around 180 years of games per day via self-play.
- AlphaGo used self-play for millions of Go games.
- Generalize to agents: AI-vs-AI customer-support negotiations.
Paraphrasing & Translation
For "How to reset my password?":
- "I forgot my password."
- "How can I change my password?"
- "Steps to reset passwords."
MetaMath rewrote 15K MATH/GSM-8K examples into 400K examples; models finetuned on them outperformed larger models.
Translation augments low-resource languages. Back-translation verifies quality: translate X to Y, then back to X'. If X' diverges significantly from X, the translation is likely bad.
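A sketch of back-translation filtering. `translate` is a hypothetical stand-in for your MT system, and plain string overlap is a crude similarity proxy (embedding similarity would be more robust):

```python
from difflib import SequenceMatcher

def translate(text: str, src: str, dst: str) -> str:
    raise NotImplementedError  # hypothetical: plug in your MT model or API

def round_trip_score(x: str, pivot: str = "fr") -> float:
    y = translate(x, src="en", dst=pivot)        # X -> Y
    x_back = translate(y, src=pivot, dst="en")   # Y -> X'
    return SequenceMatcher(None, x, x_back).ratio()  # 1.0 = identical round trip

# Keep only examples whose round trip stays close to the original, e.g.:
# good = [x for x in corpus if round_trip_score(x) > 0.8]
```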
Llama 3's Synthesis Pipeline
- Generate problem descriptions.
- Generate solutions in different programming languages.
- Generate unit tests with AI.
- Run code through parsers and linters for syntax errors.
- Run unit tests for runtime errors.
- On failure, prompt the model to revise (with original problem, faulty code, and feedback).
- Translate code to other languages, filter failures.
- Generate code explanations and docs, filter via back-translation.
Result: 2.7M synthetic coding examples for Llama 3.1 SFT.
"About 20% of solutions were initially incorrect but self-corrected, indicating the model learned from execution feedback."
Instruction Data Synthesis
- Instruction generation. Start with a topic list or templates and generate.
- Response generation. One or many per instruction.
UltraChat (Ding et al., 2023): ChatGPT generated 30 topics, 30-50 subtopics each, then instructions and responses for each subtopic.
Alpaca (Stanford, 2023): 175 seed examples from Self-Instruct, then text-davinci-003 generated 52,000 (instruction, response) pairs.
Reverse Instruction
Take existing high-quality content (stories, books, Wikipedia) and ask AI to generate prompts that would elicit it. Avoids hallucination in responses.
This can be iterative: start with a weak model, apply reverse instruction to high-quality content, finetune on the resulting pairs, and repeat with the improved model.
Long-Context Finetuning
To extend an 8K-token model to 128K:
- Split long docs into sub-8K chunks.
- Generate (Q, A) pairs per chunk.
- Use the full long doc as the context for each pair. Trains the model to use extended context.
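A sketch of this recipe; `model.generate_qa` is a hypothetical helper and the whitespace chunker is a stand-in for a real tokenizer:

```python
def chunk(text: str, max_tokens: int):
    # Naive whitespace "tokenizer" for illustration; use a real tokenizer in practice.
    words = text.split()
    for i in range(0, len(words), max_tokens):
        yield " ".join(words[i:i + max_tokens])

def build_long_context_examples(long_doc: str, model, chunk_tokens: int = 8000):
    examples = []
    for piece in chunk(long_doc, chunk_tokens):
        for question, answer in model.generate_qa(piece):  # hypothetical helper
            examples.append({
                "context": long_doc,  # the FULL document, not just the chunk
                "question": question,
                "answer": answer,
            })
    return examples
```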
Data Verification
For verifiable tasks like coding, use functional correctness (parsers, tests). Most Llama 3 synthetic data is verifiable.
For non-verifiable tasks, use AI judges, factual-consistency detectors, anomaly detection, and classifiers (predict if a generated paper looks NeurIPS-worthy, for example).
Heuristic filters from Self-Instruct (sketched in code after this list) drop:
- Repetitive examples
- Instructions too long or too short
- Same instruction with different responses
- Output is a repetition of input
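These filters might look something like the following sketch; the thresholds are illustrative, not the paper's exact values:

```python
def keep(example: dict, seen_instructions: set) -> bool:
    inst, resp = example["instruction"], example["response"]
    words = inst.split()
    if not (3 <= len(words) <= 150):          # too short or too long
        return False
    if inst in seen_instructions:             # same instruction seen before
        return False
    if resp.strip() == inst.strip():          # output repeats the input
        return False
    if len(set(words)) < 0.5 * len(words):    # highly repetitive instruction
        return False
    seen_instructions.add(inst)
    return True

# seen: set = set()
# filtered = [ex for ex in synthetic_examples if keep(ex, seen)]
```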
3.4 Limitations to AI-Generated Data
- Quality control. Garbage in, garbage out.
- Superficial imitation. A student mimics the teacher's style without the underlying capability, which pushes it to hallucinate on problems beyond its ability.
- Model collapse (Shumailov et al., 2023): recursively training on AI-generated data degrades models over iterations. Probable events get amplified, rare events forgotten. Mixing synthetic with real data avoids this.
- Obscure data lineage. Hard to know if your model regurgitates copyrighted or contaminated upstream content.
NVIDIA's Nemotron-4 340B-Instruct used 98% synthetic data for instruction and preference finetuning (one iteration).
3.5 Model Distillation
Knowledge distillation (Hinton et al., 2015): a small student is trained to mimic a large teacher.
- DistilBERT: 40% smaller than BERT, retains 97% of language understanding, 60% faster.
- Alpaca: Llama-7B finetuned on text-davinci-003 outputs (175B teacher).
- Nemotron-4-340B-Instruct: 340B student trained on data from Mixtral-8x7B (smaller teacher). Student outperforms teacher.
Model licenses often prohibit using outputs to train competing models.
Llama 3 paper: training on data from a more competent model improves performance. Indiscriminate self-generated data degrades it. Verification is the difference.
Section 4: Data Processing
Always do trial runs on a sample before applying scripts at scale. Never modify data in place. Keep originals.
4.1 Inspect Data
"Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning." — Greg Brockman
What to check:
- Where does the data come from? How was it processed?
- Distribution of tokens, input lengths, response lengths.
- Special tokens used.
- Topic and language distribution.
- Outliers and their causes.
- Per-annotator distributions (catches bias).
- Inter-annotator agreement.
Plotting verb-noun pairs and response lengths, for example, shows that GPT-4-generated data has broader pairings and longer responses. Manual inspection is irreplaceable: staring at data for 15 minutes often saves hours.
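A quick inspection sketch with pandas and scikit-learn. The file name and column names (`instruction`, `response`, `annotator`, `label`) are assumptions about your dataset's schema:

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_json("train.jsonl", lines=True)  # assumed schema, see lead-in

# Length distributions (whitespace split as a cheap token proxy).
df["response_len"] = df["response"].str.split().str.len()
print(df["response_len"].describe())   # spot outliers in the summary stats
print(df["annotator"].value_counts())  # per-annotator volume skew

# Inter-annotator agreement, assuming two annotators labeled the same items.
a = df[df.annotator == "ann_1"].sort_values("instruction")["label"]
b = df[df.annotator == "ann_2"].sort_values("instruction")["label"]
print("Cohen's kappa:", cohen_kappa_score(a, b))
```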
4.2 Deduplicate Data
Duplications skew distributions, contaminate test sets, and waste resources. Anthropic found that repeating just 0.1% of data 100× dropped an 800M model's performance to 400M-level despite 90% of training tokens being unique.
Forms of duplication:
- Whole document duplicates.
- Intra-document duplicates (same paragraph twice).
- Cross-document duplicates (popular quotes everywhere).
Deduplication levels: document, paragraph, sentence, token. Define your similarity threshold.
Methods:
- Pairwise comparison (exact match, n-gram, fuzzy, semantic).
- Hashing: MinHash, Bloom filter.
- Dimensionality reduction plus pairwise.
Tools: dupeGuru, Dedupe, datasketch, TextDistance, TheFuzz, deduplicate-text-datasets, lazyNLP.
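For example, a near-duplicate check with datasketch's MinHash + LSH might look like this sketch:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return m

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the cat sat on a mat",        # near-duplicate of d1
    "d3": "dataset engineering is fun",
}
lsh = MinHashLSH(threshold=0.8, num_perm=128)  # approximate Jaccard cutoff

for doc_id, text in docs.items():
    m = minhash(text)
    matches = lsh.query(m)
    if matches:
        print(f"{doc_id} is a near-duplicate of {matches}; skipping")
    else:
        lsh.insert(doc_id, m)
```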
4.3 Clean and Filter Data
- Remove stray formatting like HTML/Markdown tags. Databricks: removing them improved accuracy 20% and cut input length 60%.
- Remove non-compliant content: PII, copyrighted, toxic.
- Remove low-quality data using verification techniques.
- Manual inspection uncovers patterns. Kern et al. (2024): annotations made in the second half of an annotation session are lower quality due to boredom and fatigue.
- Active learning and importance sampling to select the most valuable examples (Meta data pruning study).
4.4 Format Data
Get data into the model's expected chat template (Chapter 5). Wrong template means silent failure.
Going from prompt engineering to finetuning:
- Few-shot examples in the prompt become individual training examples.
- Instructions don't need lengthy task descriptions or examples once finetuned.
```
# Before finetuning (3-shot prompt with a base model)
Label the following item as either edible or inedible.

Item: burger
Label: edible

Item: car
Label: inedible

Item: mushroom
Label: edible

Item: {INPUT}
Label:
```

```
# After finetuning, the prompt can be just:
{INPUT} -->
```
When using the finetuned model, prompts have to match the format used during finetuning. Even small differences (missing arrow, extra space, unexpected prefix) can break it.
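A small sketch of the conversion: each few-shot example becomes its own training example, stored in the exact format the model will see at inference (following the `-->` delimiter above):

```python
# Few-shot prompt examples become individual (prompt, completion) training pairs.
few_shot_examples = [("burger", "edible"), ("car", "inedible"), ("mushroom", "edible")]

train_set = [
    {"prompt": f"{item} -->", "completion": f" {label}"}
    for item, label in few_shot_examples
]
# At inference, prompts must match this format exactly:
# "burger -->" works; "burger --> " or "burger ->" may silently degrade output.
```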
Summary
- Dataset engineering is the discipline of creating data so a model can learn the right behaviors. Data is the new differentiator as compute and architectures commoditize.
- Three criteria: quality, coverage (diversity), quantity. High-quality and diverse beats either alone.
- Acquisition combines public datasets, internal application data, manual annotation, and synthesis. Annotation guidelines are doubly valuable. They're also evaluation guidelines.
- Data augmentation and synthesis uses rule-based templates, simulations, and AI generation (paraphrasing, translation, self-play, reverse instruction, full pipelines like Llama 3's).
- AI-generated data has limits: quality control, superficial imitation, model collapse on recursive training, and obscured data lineage. Mix synthetic with real data and verify.
- Model distillation transfers a teacher's behavior to a smaller student via teacher-generated training data.
- Data processing is inspect (manually too), deduplicate, clean and filter, format. Always do trial runs and keep originals.