Pagefy

Generative AI Design Patterns by Valliappa Lakshmanan & Hannes Hapke

Chapter 5: Extending Model Capabilities

Introduction

LLMs are statistical next-token predictors trained on whatever appears in their corpus. Tasks the model was trained on (grade-school math, common physics, popular cultural knowledge) generally work. Obscure or industry-specific tasks (writing investment-committee memos, adjudicating internal investigations, oil-and-gas calculations, expert bridge play) generally don't. This chapter covers four patterns for teaching foundational models tasks they weren't trained for: Chain of Thought (decompose into linear steps), Tree of Thoughts (explore multiple reasoning paths), Adapter Tuning (efficient fine-tuning with hundreds of examples), and Evol-Instruct (creating thousand-scale instruction-tuning datasets by evolving prompts).

A pretrained model knows enough to list primes between 100 and 110, or to convert 84 m² to ~903 ft² (with hallucinated arithmetic, which needs a calculator tool). The same model can quote the bridge maxim "eight ever, nine never" and still fail to apply it. Asked "Holding AKJxx opposite four small, how should you play the suit for no losers?", it recommends a finesse instead of playing for the drop. Reproducing standard advice isn't the same as applying it.

The specific errors get patched fast, but the underlying phenomenon (tasks not well covered in training data tend to fail) is permanent for any specialized industry.


Pattern 13 — Chain of Thought

Chain of Thought (CoT) prompts the model to break complex problems into intermediate reasoning steps before answering.

Problem

Three failure modes that single-shot prompting can't fix.

Training data coverage: Claude solves "a 2 kg object slides down a 30° frictionless plane" zero-shot but refuses an oil-and-gas flow-rate question, filibustering about missing viscosity and pipe-roughness data even though it knows the equations.

Multistep reasoning: "You are allowed 50 kg if your final destination is the United States… What is the carry-on for SIN-DFW-YYZ?" Gemini answered 50 kg, treating DFW as the "final destination". Plausible-sounding doesn't mean correct.

Black-box answers: "If I drive 300 km west of Hyderabad, where will I end up?" GPT-4o-mini answered "Ahmadnagar or Pune" (wrong, it's Solapur). Asking "why" afterwards usually returns hallucinated reasoning.

Solution

Three CoT variants.

Variant 1: Zero-shot CoT

Add "Think step-by-step" to the prompt.

For the oil-and-gas question, this triggers the model to look up Texas Sweet density and viscosity, plug into the Hagen-Poiseuille equation, and produce a meaningful flow-rate answer. The lazy filibuster goes away.
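
At its simplest, Zero-shot CoT is just a prompt suffix. A minimal sketch (the wrapper function and the exact trigger wording are illustrative, not from the book):

```python
def zero_shot_cot(question: str) -> str:
    """Wrap a question with a Zero-shot CoT trigger phrase before sending it to the model."""
    return f"{question}\n\nThink step-by-step, then state the final answer."

prompt = zero_shot_cot(
    "What is the flow rate of Texas Sweet crude through a 12-inch pipe at 50 psi?"
)
```

The wrapped prompt is what gets sent to the model; everything else about the call is unchanged.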

Variant 2: Few-shot CoT

Provide examples of step-by-step reasoning in the prompt.

Few-shot CoT and RAG both add to the context but they add different things. RAG adds knowledge (data) that the answer should be grounded in. CoT adds logic (a reasoning template) that the answer should generalize from. Or, to use the standard analogy: RAG gives a fish, Few-shot CoT teaches how to fish.

For the baggage problem, providing two worked examples ("CDG-ATL-SEA → 50 kg", "CDG-LHR-NBO → 40 kg") before "SIN-DFW-YYZ" yields the correct "YYZ is in Canada → 40 kg". Few-shot CoT is more effective than zero-shot on complex, unfamiliar problems because it doesn't depend on existing pretraining capability.
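
A Few-shot CoT prompt is the worked examples (question, reasoning, answer) concatenated ahead of the new question. A minimal builder, with the reasoning strings paraphrased for illustration:

```python
def few_shot_cot(examples, question):
    """Build a Few-shot CoT prompt: worked examples first, then the new question."""
    blocks = [f"Q: {q}\nReasoning: {r}\nA: {a}" for q, r, a in examples]
    blocks.append(f"Q: {question}\nReasoning:")  # model completes reasoning + answer
    return "\n\n".join(blocks)

examples = [
    ("What is the carry-on allowance for CDG-ATL-SEA?",
     "SEA, the final destination, is in the United States, so the US allowance applies.",
     "50 kg"),
    ("What is the carry-on allowance for CDG-LHR-NBO?",
     "NBO, the final destination, is in Kenya, so the non-US allowance applies.",
     "40 kg"),
]
prompt = few_shot_cot(examples, "What is the carry-on allowance for SIN-DFW-YYZ?")
```

Ending the prompt at "Reasoning:" nudges the model to reproduce the demonstrated step-by-step pattern before committing to an answer.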

Variant 3: Auto-CoT

Maintain an example store indexed by question keywords or embeddings. At inference, retrieve the closest examples and inject them as Few-shot CoT.

The bootstrap loop: sample a diverse bank of questions with single correct answers; use Zero-shot CoT (across multiple models or sampling settings) to generate answers; apply consistency or correctness checks; add the passing question-answer pairs to the example store.

Continuing the analogy: Auto-CoT shows the model many ways to fish (spearfishing, trapping, angling) and picks the right one for the scenario.
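
The retrieval half of Auto-CoT can be sketched with a toy bag-of-words "embedding" standing in for a real sentence embedder (all names and the store contents are illustrative):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real example store would use a sentence embedder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(store, question, k=2):
    """Return the k stored (question, worked_answer) pairs closest to the question."""
    qvec = embed(question)
    ranked = sorted(store, key=lambda ex: cosine(embed(ex[0]), qvec), reverse=True)
    return ranked[:k]

store = [
    ("carry-on allowance for CDG-ATL-SEA", "final destination SEA is in the US ... 50 kg"),
    ("primes between 100 and 110", "check each odd number ... 101, 103, 107, 109"),
    ("carry-on allowance for CDG-LHR-NBO", "final destination NBO is in Kenya ... 40 kg"),
]
best = retrieve_examples(store, "carry-on allowance for SIN-DFW-YYZ", k=2)
```

The retrieved pairs are then injected as Few-shot CoT demonstrations ahead of the user's question.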

Examples

For the bridge problem, Few-shot CoT with two demonstrations of standard splits ("8 cards = finesse the jack", "9 cards = play for the drop") successfully guides the model to apply "eight ever, nine never" on AKJxx opposite four small.

Considerations

CoT has two real limits. The first is data gaps. "Drive 300 km west of Hyderabad" fails even with Zero-shot CoT: the model walks through correct geographic logic but hallucinates Aurangabad / Nanded as the destination because it lacks geographic data. Adding a road map (multimodal) lets the same logic produce Solapur. Fix data gaps with knowledge (RAG or images), not more reasoning. The second is nonsequential logic. Expert bridge plays involve cyclical reevaluation (updating likelihoods based on opponents' early discards, optimizing across multiple scenarios). CoT can't mimic that.

The alternatives are worth knowing. Modern reasoning models (Gemini 2.5, Claude 3.7, o3, DeepSeek-R1) classify questions and apply CoT internally as test-time compute / thinking mode, so Zero-shot CoT helps small or local models more than frontier ones, and Few-shot CoT helps for domains outside the model's pretraining. Set a calendar reminder to re-check every six months whether your hand-written CoT examples are still required: if a model upgrade obviates them, your prompts shrink and tokens drop. Agentic approaches let the model plan and act via Tool Calling (Pattern 21) when the plan must adapt to external systems, and Tool Calling combined with interleaved reasoning is ReAct. Tree of Thoughts is the answer for nonsequential or multi-path problems, and Multiagent Collaboration (Pattern 23) combines all of the above.

References: CoT (Wei et al. 2022); Zero-shot CoT (Kojima et al. 2022); Auto-CoT (Zhang et al. 2022).


Pattern 14 — Tree of Thoughts (ToT)

Tree of Thoughts explores multiple reasoning paths, evaluates them, prunes weaker branches, and backtracks if a path fails. It can solve nonlinear strategic problems CoT can't.

Problem

The four-quote essay problem: write four paragraphs, each ending with one of:

  1. "To be or not to be, that is the question."
  2. "Take me to your leader."
  3. "It is a truth universally acknowledged, that a single man in possession of a good fortune…"
  4. "The only thing we have to fear is fear itself."

Even with Zero-shot CoT, Claude produced paragraphs that ended with the right sentences but didn't gel. The philosophical first quote sent the whole essay in a direction that never accommodated the alien-encounter quote. CoT failed for three reasons: it got stuck on the initial path because the first decision constrained everything after, it followed a single linear path with no backtracking, and it had no intermediate evaluation so it couldn't course-correct based on partial progress.

Solution

ToT models problem-solving as tree search with explicit evaluation.

Four components: thought generation (at each step, generate multiple diverse next steps), path evaluation (score each path's promise), beam search (keep top-K paths, discard the rest), and summary generation (final synthesis from the best path).

The beam search here operates over reasoning steps, not over token sequences as in Logits Masking (Pattern 1).

Thought generation

def generate_thoughts(self, state, step):
    prompt = f"""{state}
You are solving a problem step-by-step using the Tree of Thoughts method.
Think about the problem state above and generate {self.num_thoughts_per_step}
distinct and diverse next steps. This is step {step} of up to {self.max_steps}.
Make each thought meaningfully different to explore diverse approaches."""
    ...
    return json.loads(content)

For the essay problem, step 1 generated three thoughts: a "decisions theme", an "alien narrative", and a "themes-first" approach.

Path evaluation

def evaluate_state(self, state, problem):
    prompt = f"""
        Problem: {problem}
        Reasoning path: {state}

        On a scale from 0 to 100, evaluate how promising this reasoning path is...
        Consider:
        1. Correctness  2. Progress  3. Insight  4. Potential

        Respond with a single integer score..."""
    ...
    return int(content) / 100.0

The "themes-first" thought scored 0.75, the others 0.60.

Tree search (beam)

candidates = []
thoughts = self.generate_thoughts(current_state, step)
for thought in thoughts:
    new_state = f"{current_state}\nStep {step}: {thought}"
    new_path = reasoning_path + [f"Step {step}: {thought}"]
    new_score = self.evaluate_state(new_state, problem)
    candidates.append((-new_score, new_state, new_path, step))
    if new_score > 0.9:  # promising enough to record as a candidate final answer
        best_final_states.append((new_score, new_state, new_path))

# Keep the top-K candidates; scores are negated so nsmallest returns the best.
beam = []
for neg_score, state, path, s in heapq.nsmallest(self.beam_width, candidates):
    beam.append((-neg_score, state, path, s))

Summary generation

The best reasoning path is fed back to the LLM to produce the final answer. For the essay, the chosen path framed it as "human nature through literature": Shakespeare, sci-fi, Austen, Roosevelt. The result was an essay that actually gels.
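
Putting the components together, a minimal ToT driver might look like the following sketch. The `generate` and `evaluate` functions here are deterministic stubs standing in for the LLM calls shown above, and all names and scoring rules are illustrative:

```python
import heapq

def tot_search(problem, generate, evaluate, beam_width=2, max_steps=3, threshold=0.9):
    """Minimal Tree-of-Thoughts beam search with pluggable generate/evaluate functions."""
    beam = [(0.0, problem, [])]          # (score, state, reasoning_path)
    finals = []
    for step in range(1, max_steps + 1):
        candidates = []
        for _, state, path in beam:
            for thought in generate(state, step):
                new_state = f"{state}\nStep {step}: {thought}"
                score = evaluate(new_state, problem)
                candidates.append((-score, new_state, path + [thought]))
                if score > threshold:
                    finals.append((score, new_state, path + [thought]))
        # keep the top-K paths for the next level
        beam = [(-s, st, p) for s, st, p in heapq.nsmallest(beam_width, candidates)]
    return finals or [beam[0]]

# Deterministic stubs standing in for LLM calls:
def generate(state, step):
    return ["outline themes", "pick a narrative", "draft a paragraph"]

def evaluate(state, problem):
    return min(0.95, 0.3 + 0.2 * state.count("Step"))  # favors deeper paths

best = max(tot_search("four-quote essay", generate, evaluate), key=lambda x: x[0])
```

Swapping the stubs for real LLM calls (and the best path into a final summary prompt) reproduces the structure of the essay example.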

Example: Supply Chain Optimization

The setup is 3 manufacturing locations, 4 distribution centers, 2 shipping methods, demand swings of plus or minus 20%, and recent Asian shipping disruptions.

After step 1, three thoughts (mapping transportation networks, defining location attributes, finding optimal pairs) score 0.65, 0.35, and 0.35.

The final reasoning path (map networks, then three configurations: cost-focused, speed-focused, resilience-focused, then scenario analysis, then sensitivity analysis) produces a thorough recommendation for Configuration C (resilience-focused), citing the Asian-shipping disruption.

The full solve required 41 LLM API calls and 93 seconds. Parallelism helps but is bounded by the level-by-level dependency.

Considerations

ToT is overkill for straightforward tasks. Start with CoT and escalate.

The overhead is real on three axes. Combinatorial explosion grows fast (beam width × depth × thoughts per step). Latency and cost run into minutes of wall clock and many API calls. And implementation complexity is meaningful: thought generation plus scoring plus state tracking plus beam, BFS, or DFS.

The alternatives. Reasoning models (o3, Opus, Gemini 2.5 Pro, DeepSeek-R1) have built-in thinking that delivers ToT-like behavior with one API call. Least-to-most (LtM) prompting decomposes into sequential subproblems where each step's answer informs the next, and it's combinable with ToT for complex subproblems. Reflection (Pattern 18) gives agentic self-critique that's linear but iterative. And wait-injection or budget forcing replaces the model's end-of-sequence token with Wait, forcing the model to continue and often producing a more reflective response (Muennighoff et al. 2025).
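
The wait-injection idea can be sketched as a decoding loop. The `step_fn` stub below stands in for a real model's next-token call, and the token strings and budget numbers are illustrative:

```python
def generate_with_budget(step_fn, prompt, min_tokens=6, max_tokens=32):
    """Budget-forcing sketch: when the model emits EOS before the minimum token
    budget is spent, replace it with 'Wait' and keep decoding."""
    tokens = []
    while len(tokens) < max_tokens:
        tok = step_fn(prompt, tokens)
        if tok == "<eos>":
            if len(tokens) >= min_tokens:
                break
            tok = "Wait"          # suppress the early stop; force more reasoning
        tokens.append(tok)
    return " ".join(tokens)

# Stub model: answers quickly, then elaborates when forced to continue.
script = ["42", "<eos>", "checking", "the", "units", "42", "<eos>"]
def step_fn(prompt, tokens):
    return script[len(tokens)]

out = generate_with_budget(step_fn, "Q: what is 6 * 7?")
```

The single injected "Wait" is visible in the output stream, followed by the model's extended (and often more reflective) continuation.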

Andrew Ng's four agentic design patterns are Reflection, Tool Use, Planning, and Multiagent Collaboration. The book treats Reflection (18), Tool Calling (21), and Multiagent Collaboration (23) as standalone patterns. Deep Search (12), CoT (13), and ToT are examples of Planning. ToT touches all four: planning, reflection in evaluation, multiagent orchestration, and possible tool calls.

References: ToT (Yao et al. 2023); wait-injection (Muennighoff et al. 2025).


Pattern 15 — Adapter Tuning

Adapter Tuning efficiently fine-tunes a foundational model for a specialized task by training small add-on neural network layers on a small dataset (100 to 10,000 examples) while keeping the original weights frozen.

Problem

Asked "Suggest 3 ways to improve the flavor of ice cream", Gemini gives generic principles ("use high-quality ingredients", "enhance flavor depth"). Suppose your brand voice prefers concrete, actionable suggestions ("infuse with mint", "add roasted nuts", "flaky salt on top") and you have a few hundred demonstration pairs. Prompt engineering doesn't scale: prompts grow huge and small wording changes cause large performance differences. Few-shot learning means the examples have to be sent on every request, eating context window, latency, and cost, and they can't capture the full distribution.

Solution

Insert small adapter layers into the transformer block. Each one is a dense down-projection (e.g., 768 to 64 dimensions), a nonlinearity (ReLU), and a dense up-projection back to the original dimension (64 to 768).

Total adapter parameters: 768 × 64 × 2 ≈ 98K per adapter layer. Tiny compared to the foundational model's billions. The small inner dimension is why this is colloquially called LoRA (low-rank adaptation), even though strictly LoRA differs: LoRA adds a parallel low-rank update to existing weight matrices with no nonlinearity, whereas adapters insert new layers in series.
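
The arithmetic behind that "tiny" claim, per layer and ignoring bias terms (the 24-block count is an illustrative assumption, not from the text):

```python
d_model, bottleneck = 768, 64

down_proj = d_model * bottleneck   # 768 -> 64 projection weights
up_proj = bottleneck * d_model     # 64 -> 768 projection weights
per_adapter = down_proj + up_proj  # weights per adapter layer

# Even with one adapter in each of, say, 24 transformer blocks:
total = per_adapter * 24           # a few million, vs billions in the base model
```

Only these weights are trained; the base model stays frozen.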

Three things to keep in mind. Adapter Tuning teaches specialized tasks (classification, summarization, extractive QA, brand-aligned chatbots) close to what the model already does. The foundational model weights are frozen, and training often runs on a single GPU in under an hour. Small datasets work because only the adapter weights train.

Adapter Tuning is not for industry jargon, new languages, or new facts. New jargon or language calls for continued pretraining (CPT, full fine-tuning). New knowledge calls for RAG (Chapter 3). New tasks that need new knowledge call for Evol-Instruct (next pattern), with the caveat of catastrophic forgetting.

Training pipeline

Quantize the base model (saves memory at training time), add the adapter config, run the SFT trainer, and save adapter weights only:

model_id = "google/gemma-3-4b-it"
model_kwargs["quantization_config"] = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    ...
)
model = AutoModelForImageTextToText.from_pretrained(model_id, **model_kwargs)
processor = AutoProcessor.from_pretrained(model_id)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
    modules_to_save=["lm_head", "embed_tokens"],
)

sft_config = SFTConfig(
    output_dir="gemma-radiology",
    num_train_epochs=1,
    learning_rate=2e-4,
    ...
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=messages,
    peft_config=peft_config,
    processing_class=processor,
    data_collator=collate_fn,
)
trainer.train()
trainer.save_model()

r=16 sets the LoRA rank. lora_alpha=r keeps the adapter weights as-is via alpha/r scaling. Smaller learning rate means less drift from base, more epochs means more drift toward your data.

When using QLoRA (quantized LoRA), the model is quantized to 4-bit and training is more memory-efficient (and slower). The trained model takes less space and runs faster than full LoRA.

Inference

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

SFT_OUTDIR = "gemma-radiology"
model = AutoModelForImageTextToText.from_pretrained(
    SFT_OUTDIR,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(SFT_OUTDIR)

text = processor.apply_chat_template(messages)
inputs = processor(text=[text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, temperature=0.8)
output_text = processor.batch_decode(generated_ids)

Example: Radiology Image Captions

A multimodal Adapter-Tuning task: caption radiology images with concise anatomical descriptions, not full diagnoses.

Untuned Gemini caption: "Axial computed tomography (CT) image of the paranasal sinuses demonstrating a large, expansile, homogenous, low-attenuation mass occupying the entire left maxillary sinus, consistent with a mucocele. Note the thinning and bowing…". Too long, includes full diagnosis.

Desired caption: "Computed tomography scan in axial view showing obliteration of the left maxillary sinus."

Three reasons not to lean on prompt engineering here. Cost: the detailed system instructions balloon to 2 to 3 pages, inflating the token count of every request. Location: the model has to run on premises or on the edge, which means a smaller model with weaker instruction following. Maintainability: every model upgrade requires re-validating the long prompt.

Dataset preparation

Each training example is a 3-message conversation: system role plus user (instruction + image) plus assistant (caption):

{'messages': [
  {'role': 'system',
   'content': [{'type': 'text', 'text': 'You are an expert researcher in radiology.'}]},
  {'role': 'user',
   'content': [
     {'type': 'text', 'text': 'Write a caption for this image explaining what it depicts...'},
     {'type': 'image', 'image': 'images/PMC2837471_IJD2009-150251.001.jpg'}]},
  {'role': 'assistant',
   'content': [{'type': 'text', 'text': 'Bacterial contamination occurred after completion of root canal treatment...'}]}
]}
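
A small helper can assemble examples in this shape; the function name and the sample strings are illustrative:

```python
def make_example(instruction, image_path, caption):
    """Assemble one 3-message training conversation in the format shown above."""
    return {"messages": [
        {"role": "system",
         "content": [{"type": "text",
                      "text": "You are an expert researcher in radiology."}]},
        {"role": "user",
         "content": [{"type": "text", "text": instruction},
                     {"type": "image", "image": image_path}]},
        {"role": "assistant",
         "content": [{"type": "text", "text": caption}]},
    ]}

ex = make_example(
    "Write a caption for this image explaining what it depicts.",
    "images/example.jpg",
    "Computed tomography scan in axial view showing obliteration of the left maxillary sinus.",
)
```

Mapping this over the (instruction, image, caption) triples produces the `messages` dataset the trainer consumes.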

Training

Tune Gemma-3-4B on 500 image-caption pairs, batch size 4, single epoch. Training loss drops from 14.8 to ~4.0 by batch 95 and stabilizes. 500 was sufficient.

For multimodal data, the collator has to load images at batch time:

from PIL import Image

for element in content:
    if isinstance(element, dict) and "image" in element:
        image = element["image"]
        image_inputs.append(Image.open(image).convert("RGB"))

Result

The tuned model captions a held-out test image as: "CT scan of the abdomen showing the size and density of the intra-abdominal mass." Concise, anatomy-focused, no diagnosis. Exactly matching the training-set style.

Considerations

To avoid loading two separate models at inference, merge adapter weights into the base model:

from peft import PeftModel

peft_model = PeftModel.from_pretrained(model, args.output_dir)
merged_model = peft_model.merge_and_unload()
merged_model.save_pretrained("merged_model", safe_serialization=True, max_shard_size="2GB")

For closed-weights models, Vertex AI, Azure OpenAI, and Bedrock all offer fully managed fine-tuning with adapter-style services. You supply the dataset and an endpoint comes out.

The alternatives to Adapter Tuning are few-shot learning (low data needs, no model management), CoT (Pattern 13), and Content Optimization (Pattern 5) when you don't have ready-made (input, output) pairs.

References: Wei et al. 2021; Li and Liang 2021; Lester et al. 2021; PeFT review (Xu et al. 2023); QLoRA (Dettmers et al. 2023).


Pattern 16 — Evol-Instruct

Evol-Instruct efficiently creates large instruction-tuning datasets by evolving an initial set of instructions to make them more complex, then continues SFT on the model.

Problem

Foundational models are trained on public consumer tasks. Enterprise tasks (writing investment memos, evaluating commercial real estate for warehouse use) are often confidential and not in training data, anchored to internal data the model provider doesn't have access to, and excluded from training under enterprise data-privacy contracts (Azure OpenAI, Anthropic, and Gemini all promise NOT to use prompts/completions for training). So consumer-style improvements over time don't help enterprise tasks.

Solution

Four steps: evolve instructions from a small initial dataset, generate answers, evaluate and filter, and instruction tune the target model. The first three build the dataset, and step 4 is just SFT.

Step 4: Instruction tuning (SFT)

Open-weights via Hugging Face Transformers. Load model, tokenize examples in ### Instruction: / ### Response: format, run trainer:

import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def format_and_tokenize(example):
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}"
            + tokenizer.eos_token)
    return tokenizer(text)

tokenized_dataset = dataset.map(format_and_tokenize)

training_args = TrainingArguments(
    output_dir="./trained",
    learning_rate=2e-5,
    num_train_epochs=3,
    ...
)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized_dataset['train'],
                  eval_dataset=tokenized_dataset['valid'])
trainer.train()
trainer.save_model()

Catastrophic forgetting risk pushes you toward a low learning rate (around 1e-5) but enough epochs to teach the new task.

PeFT-style instruction tuning with Unsloth needs more than standard LoRA. You also have to tune the gate/up/down projection layers, the embedding tokens, and the language-model head, with a separate learning rate for the embeddings:

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
    full_finetuning=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj",
                    "embed_tokens", "lm_head"],
    lora_alpha=32,
    use_rslora=True,
)

training_args = UnslothTrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=64,
    num_train_epochs=10,
    learning_rate=5e-5*2,
    embedding_learning_rate=5e-5/2,
)

Steps 1–3: Building the dataset

Step 1: Evolve instructions. Use a foundational model to deepen, concretize, or complicate an initial instruction. The original WizardLM paper's generic prompt:

I want you to act as a Prompt Rewriter. Your objective is to rewrite a given
prompt into a more complex version to make those famous AI systems (e.g.,
ChatGPT and GPT4) a bit harder to handle.
...
Please replace general concepts with more specific concepts.
#The Given Prompt#: {instruction}
#Rewritten Prompt#:

Customize for your domain. WizardCoder used "Replace a commonly used requirement…with a less common and more specific one" and "Provide a piece of erroneous code as a reference to increase misdirection".

Step 2: Generate answers. Five options:

  1. Human experts: the gold standard if you can afford it.
  2. Industry tools (simulators, mapping tools, calculators) operated by experts.
  3. Reflection (Pattern 18): the model generates, an evaluator validates, and feedback drives a retry up to N times. Works when an automated evaluator exists (compilers, sandbox runners, math validators).
  4. RAG: generate answers grounded in enterprise data via RAG, then bake them into the model for inference contexts where RAG isn't available (edge or air-gapped).
  5. Teacher-student training: a strong reasoning model generates and a smaller model distills.
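
The Reflection option (generate, validate, retry with feedback) reduces to a small loop. The function names and the toy generator/evaluator below are illustrative stand-ins, not the book's code:

```python
def generate_with_reflection(generate, evaluate, instruction, max_tries=3):
    """Generate an answer, validate it, and retry with feedback up to max_tries."""
    feedback = None
    for _ in range(max_tries):
        answer = generate(instruction, feedback)
        ok, feedback = evaluate(instruction, answer)
        if ok:
            return answer
    return None  # no passing answer produced; drop this instruction from the dataset

# Toy stand-ins for an LLM generator and an automated checker:
attempts = iter(["4", "5"])
def generate(instruction, feedback):
    return next(attempts)      # ignores feedback; a real generator would use it
def evaluate(instruction, answer):
    return (answer == "5", "Recheck the arithmetic.")

result = generate_with_reflection(generate, evaluate, "What is 2 + 3?")
```

With a real evaluator (compiler, sandbox, math validator), only instructions that eventually pass contribute (instruction, answer) pairs.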

Step 3: Evaluate and filter. Quality beats quantity ("Textbooks Are All You Need", Gunasekar et al. 2023). Use LLM-as-Judge (Pattern 17) to score and keep only high-quality pairs.

Example: Business Strategy Consultant

The goal is to train Gemma 3 1B to answer S&P 500 business strategy questions like Claude Sonnet 3.7 does.

The untuned tiny model on "How should Morgan Stanley adapt to a competitor expanding ultra-HNW wealth management?" produces generic word salad. Replace "Morgan Stanley / wealth management" with "McDonald's / hamburgers" in the response and it still mostly applies. That's the mark of a non-insightful answer.

Bootstrap from SEC filings

EDGAR Item 7 (management discussion) is a goldmine for strategy: 500 companies × 4 years = 2000 filings.

You are a professor in an MBA program.
You will be given a passage from an SEC filing from {company}...

Create {num_questions} analytical questions suitable for students of a class on
company strategy based on this filing.

Good questions should:
* Be standalone (include company, product, year)
* Avoid factual numerical info (revenue, capex)
* Ask "how", "why", "compare"

Example question: How might Google's (GOOG) reorganization of its hardware
divisions affect its ability to grow Pixel phones' market share in 2023?

This produces seed questions like "Air Products (APD) is investing heavily in gasification, carbon capture, and hydrogen projects. How might the cyclical nature of the energy market impact the long-term profitability and strategic viability of these capital-intensive projects, particularly given the company's reliance on long-term contracts and customer relationships as of 2021?"

Evolve

Three transformations applied to each seed. Deeper: "Add constraints based on current market conditions and competitor actions; add hypotheticals such as cost overruns or failed acquisitions." More concrete: "Instead of asking 'why,' ask for 3 reasons why; instead of 'how,' ask for the steps; ask why a specific outcome is not larger or smaller." More reasoning: "Combine two of the questions so both have to be answered implicitly."

Per filing: 3 seed + 10 evolved = 13 questions. Across 2,000 filings, that's roughly 26,000 raw questions.

Generate answers (teacher = Gemini)

With the filing in context (grounding):

You are a top student in a highly ranked MBA program.
You are given an SEC filing from {company}...
Use that filing to answer the following questions, but if some information is not
in the filing, answer based on your general market insights and knowledge of
business strategy.
Do not refuse to answer as that will give you zero points on the exam.
Each answer should be 2-3 sentences.

Filter via LLM-as-Judge

You are a journalist who interviewed a number of Wall Street analysts...
Reply with a score of 1-5...
1 = obvious or wrong
5 = genuinely insightful
Explain your reasoning.

Keep only scores 4 and 5, and split 90/10 train/eval. Result: roughly 11,000 high-quality training examples.
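
The filter-and-split step is simple bookkeeping; a sketch with illustrative names (a real pipeline would carry the judge's score alongside each pair):

```python
import random

def filter_and_split(pairs, scores, min_score=4, eval_frac=0.1, seed=0):
    """Keep only high-scoring (question, answer) pairs, then split train/eval."""
    kept = [p for p, s in zip(pairs, scores) if s >= min_score]
    random.Random(seed).shuffle(kept)        # deterministic shuffle before splitting
    n_eval = int(len(kept) * eval_frac)
    return kept[n_eval:], kept[:n_eval]      # (train, eval)

pairs = [(f"q{i}", f"a{i}") for i in range(20)]
scores = [5, 4, 2, 5, 3, 4, 1, 5, 4, 2] * 2  # pretend LLM-as-Judge scores
train, eval_set = filter_and_split(pairs, scores)
```

Applied to the ~26,000 raw questions, this is the step that leaves roughly 11,000 examples.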

Train Gemma 3 1B

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-1b-it-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True)
model = FastLanguageModel.get_peft_model(model, ...)

Format: Q: {question}\n\nA: {answer}<eos>. Three epochs on an L4 GPU, around 3 hours.

Result

The tuned 1B Gemma now produces a grounded, insightful answer to the Morgan Stanley question, comparable to (and arguably better than) Claude Sonnet 3.7 because it directly addresses client acquisition and retention strategies that Claude glossed over.

Considerations

A few rules of thumb on dataset size: roughly 10,000 examples for moderately complex tasks at 1B params; for an x-billion-parameter model you can scale down to roughly 10,000/x examples (so a 10B model needs ~1,000); scale up for highly diverse or complex tasks.
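
As a back-of-the-envelope helper (the function and its `complexity` knob are illustrative; the rule of thumb itself is from the text):

```python
def examples_needed(params_billion, base=10_000, complexity=1.0):
    """Rule-of-thumb dataset size: ~10K examples at 1B params, scaling down as 1/x."""
    return int(base * complexity / params_billion)

small = examples_needed(1)    # moderately complex task, 1B model
large = examples_needed(10)   # same task, 10B model
```

A highly diverse task might warrant `complexity` well above 1.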

LoRA-based instruction tuning is more demanding than Adapter Tuning. You have to tune gate projections, embeddings, and the language-model head. You need a larger dataset and multiple epochs. And catastrophic forgetting is real: the tuned model is good only for the narrow task you trained it for. Don't reuse it outside that scope.

The cost adds up quickly. Each evolved example involves three or more LLM calls (evolve, answer, evaluate). 10,000 examples is 30,000+ calls. Training takes hours on a GPU. Production with many tuned models is expensive.

Use Evol-Instruct only if the task is complex and a frontier model can't do it, or if you can't use a frontier model and smaller models alone don't work. Otherwise use few-shot CoT or Adapter Tuning first.

References: WizardLM (Xu et al. 2023); WizardCoder (Luo et al. 2023); Textbooks Are All You Need (Gunasekar et al. 2023); instruction-tuning survey (Zhang et al. 2023); Biderman et al. 2024 on LoRA; Unsloth.


Summary

| Pattern | Problem | Solution | When to use |
| --- | --- | --- | --- |
| Chain of Thought (13) | Multistep reasoning fails; black-box answers; lazy refusals | Prompt for step-by-step reasoning (zero-shot, few-shot, or auto) | Complex math, logical deductions, sequential reasoning |
| Tree of Thoughts (14) | Strategic / nonlinear problems with multiple solution paths | Tree search: generate thoughts, evaluate, beam-prune, backtrack | Strategic planning, creative writing, optimization with constraints |
| Adapter Tuning (15) | Fine-tune for a specialized task without full retraining | Train small adapter layers; freeze base weights | Classification, summarization, brand-aligned generation; 100 to 10K examples |
| Evol-Instruct (16) | Create instruction-tuning datasets for novel enterprise tasks | Evolve initial instructions, generate answers, LLM-as-judge filter, instruction-tune | Domain-specific tasks not in pretraining; especially with confidential data |

CoT works best for tasks where logic is the bottleneck. Data gaps need RAG or multimodal grounding, not more reasoning. ToT is overkill for straightforward problems, so try CoT first and consider modern reasoning models (o3, Opus, Gemini 2.5 Pro, DeepSeek-R1) before implementing ToT yourself. Adapter Tuning is for specialization (style, output format, narrow tasks), not for new vocabulary, new languages, or new facts. Evol-Instruct is the heaviest pattern: it teaches new tasks at scale, but expect catastrophic forgetting; the tuned model becomes single-purpose. Always start cheaper.