Why this matters
Seq2seq fine-tuning powers real products: summarization for customer support notes, translation for multilingual FAQs, headline generation for news, and task assistants that turn instructions into actions. As an NLP Engineer, you'll adapt pre-trained encoder-decoder models (like T5, BART, mT5) to your domain, reduce hallucinations, and ship reliable text generation systems.
Who this is for
- Beginners who know Transformers basics and want practical seq2seq skills.
- Practitioners switching from classification to generation tasks.
- Engineers aiming to ship summarization, translation, or data-to-text features.
Prerequisites
- Comfort with Python and training loops.
- Basics of Transformers (attention, encoder/decoder).
- Understanding of tokenization and batching.
Concept explained simply
Seq2seq models read an input sequence and generate an output sequence one token at a time. During training, the model sees the correct previous token (teacher forcing) and learns to predict the next token. During inference, it must use its own previous outputs and a decoding strategy (greedy, beam, or sampling) to generate the full sequence.
Mental model
Think of the encoder as a careful reader that compresses the source text into clues, and the decoder as a writer that uses those clues plus everything it has written so far. Fine-tuning teaches the writer your task’s style and constraints (e.g., short summaries, formal translations).
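A minimal sketch of this difference, assuming the Hugging Face transformers API and an illustrative t5-small checkpoint (the input text is just an example):
# Conceptual sketch: teacher forcing during training vs. autoregressive decoding at inference
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-small")      # illustrative checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
src = tokenizer("summarize: The sky was clear and blue all day.", return_tensors="pt")
tgt = tokenizer(text_target="Clear blue sky.", return_tensors="pt")
# Training (teacher forcing): the model reads the gold target, shifted right internally,
# and the loss scores every position against the true next token.
loss = model(**src, labels=tgt["input_ids"]).loss
# Inference (greedy): start from the decoder start token and feed back the argmax each step.
model.eval()
generated = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    logits = model(**src, decoder_input_ids=generated).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:
        break
print(tokenizer.decode(generated[0], skip_special_tokens=True))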
Key components of seq2seq fine-tuning
- Data pairs: (input_text, target_text). Clean alignment is critical.
- Special tokens: <pad>, <s>/<bos>, </s>/<eos> must be consistent.
- Label shifting: decoder_input_ids are the target shifted right; labels align with next-token predictions (see the shift-and-mask sketch after this list).
- Loss masking: ignore the loss on padding (and often on special tokens) via ignore_index (commonly -100).
- Attention masks: separate encoder and decoder masks; ensure no attention to padding.
- Decoding: greedy (fast), beam search (quality), sampling (diversity via top-k/top-p/temperature).
- Metrics: ROUGE for summarization, BLEU/chrF for translation, Exact Match/F1 for tasks like QA; human review to catch hallucinations.
- Efficiency: LoRA/adapters/PEFT to train a small fraction of parameters, mixed precision, gradient accumulation.
- Regularization: label smoothing, dropout, early stopping on validation metric.
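The shifting and masking bullets boil down to a few parallel lists; here is a toy sketch with made-up token ids (<s>=1, </s>=2, <pad>=0), not tied to any real tokenizer:
# Sketch: shift-and-mask bookkeeping with toy ids
target_ids             = [5, 6, 7, 2, 0, 0]          # tok tok tok </s> <pad> <pad>
decoder_input_ids      = [1, 5, 6, 7, 2, 0]          # target shifted right, starts with <s>
labels                 = [5, 6, 7, 2, -100, -100]    # next-token targets; padding ignored by the loss
decoder_attention_mask = [1, 1, 1, 1, 1, 0]          # no attention to padded positions
# At step t the decoder reads decoder_input_ids[:t + 1] and is trained to emit labels[t].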
Tip: choosing a base model
- Summarization: BART or T5 family.
- Translation: mBART/mT5 for multilingual.
- Instruction following: T5/FLAN-T5, BART variants trained on instructions.
Worked examples
Example 1 — News summarization with T5
- Format: prepend a task tag to inputs, e.g., "summarize: [article]".
- Tokenize inputs and targets with max lengths (e.g., 512/128).
- Create decoder_input_ids by shifting the target right; set labels to the target ids with padding tokens replaced by -100 (see the preprocessing sketch below).
- Train with AdamW, linear warmup, early stopping on ROUGE-L.
- Decode with beam search (beam_size 4), length_penalty 1.0, no_repeat_ngram_size 3.
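A preprocessing sketch for the steps above, assuming a Hugging Face tokenizer for an illustrative t5-small checkpoint and the 512/128 length limits mentioned earlier:
# Sketch: tokenize one (article, summary) pair for T5 with prefix, truncation, and -100 labels
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")        # illustrative checkpoint
article = "Long news article text ..."
summary = "One-sentence summary of the article."
inputs = tokenizer("summarize: " + article, max_length=512,
                   truncation=True, padding="max_length", return_tensors="pt")
targets = tokenizer(text_target=summary, max_length=128,
                    truncation=True, padding="max_length", return_tensors="pt")
labels = targets["input_ids"].clone()
labels[labels == tokenizer.pad_token_id] = -100              # mask padding out of the loss
inputs["labels"] = labels
# decoder_input_ids can be built by shifting labels right, or left for the model to derive.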
Minimal training loop sketch
# Pseudocode: assumes model, optimizer, scheduler, and dataloader are already set up
model.train()
for batch in dataloader:
    out = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],          # no attention to encoder padding
        decoder_input_ids=batch["decoder_input_ids"],    # labels shifted right; HF models can also derive this from labels
        labels=batch["labels"],                          # padded positions set to -100
    )
    loss = out.loss          # cross-entropy; -100 positions are ignored
    loss.backward()
    optimizer.step()
    scheduler.step()         # e.g., linear warmup then decay
    optimizer.zero_grad()
Example 2 — Paraphrase generation with BART
- Data: pairs of (original, paraphrase). Optionally add a control tag like "paraphrase: " for consistency.
- Regularize with label smoothing (e.g., 0.1) to reduce copying artifacts.
- Use top-k=50, top-p=0.9, temperature=0.9 at inference for more diverse outputs (see the generation sketch after the guardrails below).
Quality guardrails
- Add no_repeat_ngram_size=3.
- Penalize length extremes with min_length/max_length.
- Filter training pairs with high overlap to avoid trivial copying.
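A generation sketch covering both recipes, assuming model, tokenizer, and a tokenized source batch from the earlier sketches; the parameter values mirror the bullets above (Hugging Face's generate calls beam_size num_beams):
# Sketch: beam search for faithful summaries vs. sampling for diverse paraphrases
summary_ids = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    num_beams=4,                    # "beam_size 4" above
    length_penalty=1.0,
    no_repeat_ngram_size=3,         # block repeated trigrams
    min_length=10, max_length=128,  # keep outputs in a sensible length band
)
paraphrase_ids = model.generate(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    do_sample=True,                 # sampling instead of beam search
    top_k=50, top_p=0.9, temperature=0.9,
    no_repeat_ngram_size=3,
    max_length=64,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))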
Example 3 — Low-resource translation with mT5 + LoRA
- Normalize and tokenize bilingual pairs; optionally tag language: "translate en to xx: [text]".
- Apply LoRA to attention projections; freeze base weights to reduce VRAM.
- Use label smoothing 0.1 and smaller learning rate for stability.
- Evaluate with BLEU and chrF; sample human spot checks for adequacy and fluency.
LoRA intuition
LoRA learns small low-rank matrices that nudge attention layers toward your task. You train far fewer parameters with minimal quality loss.
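A minimal sketch of that setup, assuming the Hugging Face peft library and an illustrative mT5 checkpoint; the module names "q" and "v" refer to the query/value projections in (m)T5 attention blocks, and the rank/alpha values are placeholders to tune:
# Sketch: wrap an mT5 checkpoint with LoRA adapters on the attention projections
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType
base = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")   # illustrative checkpoint
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8, lora_alpha=32, lora_dropout=0.05,    # illustrative hyperparameters
    target_modules=["q", "v"],                # query/value projections in (m)T5 blocks
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()            # confirms only a small fraction is trainable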
Practical setup checklist
- Data cleanliness: targets truly match inputs; remove duplicates and leakage.
- Tokenization: consistent special tokens; truncate thoughtfully.
- Batching: pad on the right; build attention masks (the collator sketch after this list handles both).
- Labels: shift the target right; replace padded label positions with -100.
- Monitoring: track training loss and validation ROUGE/BLEU.
- Decoding defaults: start with beam_size=4, no_repeat_ngram_size=3; then tune.
- Efficiency: mixed precision, gradient accumulation, LoRA/adapters when needed.
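A batching sketch for the checklist items above, assuming Hugging Face's DataCollatorForSeq2Seq and a placeholder tokenized_dataset whose examples hold unpadded input_ids and labels:
# Sketch: let DataCollatorForSeq2Seq handle padding, attention masks, and -100 labels
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq
collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)
dataloader = DataLoader(tokenized_dataset, batch_size=8, shuffle=True, collate_fn=collator)
batch = next(iter(dataloader))
print(batch["input_ids"].shape, batch["attention_mask"].shape, batch["labels"].shape)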
Exercises
Complete these before the quick test.
- Exercise 1 (Label shifting and masking)
Given target tokens (with ids): ["<s>"=1, "Sky"=101, "is"=102, "blue"=103, "."=104, "</s>"=2, "<pad>"=0, "<pad>"=0]. Build:
- decoder_input_ids (shifted right, starting with <s>)
- labels (next-token ids) with padding replaced by -100.
- Exercise 2 (Decoding configuration)
You need concise, factual summaries. Propose decoding settings to reduce repetition and hallucination while keeping readability. Include: strategy, beams or sampling params, length controls, repetition penalty or n-gram block.
Self-check
- Did you set -100 for all padding label positions?
- Are your min_length/max_length consistent with your dataset lengths?
- For summarization, did you use ROUGE on validation and not on the training set?
Common mistakes and how to self-check
- Forgetting label shift: Loss won’t drop as expected. Self-check: print first few decoder_input_ids and labels; they should be offset by one.
- Not masking pads in loss: Model learns to predict padding. Self-check: ensure labels are -100 at padded positions (see the sanity-check sketch after this list).
- Using greedy decoding for creative tasks: Outputs are dull or repetitive. Self-check: compare with beam or sampling on a small dev set.
- Data leakage: Validation ROUGE/BLEU suspiciously high. Self-check: hash pairs and confirm no overlap between train/val/test.
- Overlong outputs: Missing EOS or no length cap. Self-check: confirm eos_token_id is set and cap generation with max_length/max_new_tokens.
- Hallucinations in summarization: Model invents facts. Self-check: increase n-gram blocking, use constrained length, consider higher beams and coverage penalties (if available), and add factuality checks in evaluation.
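A quick sanity-check sketch for the first two mistakes, assuming a batch and tokenizer as in the earlier sketches:
# Sketch: spot-check one example for the two most common label bugs
labels = batch["labels"][0]
print("labels           :", labels[:12].tolist())
if "decoder_input_ids" in batch:
    print("decoder_input_ids:", batch["decoder_input_ids"][0][:12].tolist())
    # the two printed rows should be offset by exactly one token
# padding must be masked: the raw pad id should never survive in labels
assert (labels != tokenizer.pad_token_id).all(), "pad ids found in labels; replace them with -100"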
Practical projects
- Summarize customer tickets into 1–2 sentence action items; evaluate with ROUGE-L and human spot checks.
- Translate a small bilingual FAQ (2–5k pairs) with mT5 + LoRA; report BLEU and sample error analysis.
- Paraphrase product descriptions for A/B tests; compare engagement across variants.
Learning path
- Refine tokenization and batching for seq2seq.
- Train a baseline T5/BART on a tiny dataset; verify label shift and masks.
- Tune decoding (beam vs sampling) and length penalties.
- Add PEFT (LoRA/adapters) for efficiency.
- Evaluate with ROUGE/BLEU and write a 1-page error analysis.
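To close the loop on the evaluation step, a small metric sketch assuming the Hugging Face evaluate library; the prediction and reference strings are purely illustrative:
# Sketch: ROUGE for summarization, BLEU/chrF for translation, computed on decoded strings
import evaluate
preds = ["the cat sat on the mat"]              # illustrative decoded outputs
refs = ["a cat was sitting on the mat"]         # illustrative references
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=preds, references=refs))                  # rouge1/rouge2/rougeL
bleu = evaluate.load("sacrebleu")
print(bleu.compute(predictions=preds, references=[[r] for r in refs]))    # corpus-level BLEU
chrf = evaluate.load("chrf")
print(chrf.compute(predictions=preds, references=[[r] for r in refs]))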
Next steps
- Apply instruction-style prompts (e.g., task tags) to improve generalization.
- Add domain-specific terminology via small curated data or constrained decoding (if available).
- Automate evaluation + generation scripts for reproducible experiments.
Mini challenge
Take 100 articles from a public domain dataset. Fine-tune a small T5 to produce 1-sentence summaries. Compare greedy vs beam search (beam=4) with no_repeat_ngram_size=3. Report ROUGE-1/2/L and 5 qualitative examples with notes on faithfulness.