Why this matters
Nearly every NLP task depends on understanding order and context across tokens. As an NLP Engineer, you will:
- Predict the next word for autocomplete or text generation.
- Tag tokens with labels (e.g., POS, NER) where each decision depends on surrounding words.
- Map an input sequence to an output sequence (translation, summarization) where output choices depend on prior outputs and the entire input.
- Choose decoding methods (greedy vs beam search) that trade speed for quality.
Strong sequence intuition helps you design model inputs/outputs, pick the right architecture, and debug odd behaviors (like repetitive text or broken subject-verb agreement).
Concept explained simply
Sequence modeling predicts the next step using what has already happened. For text, it means modeling the probability of each next token given previous tokens (and sometimes the input sequence).
- Language modeling: predict the next word given previous words.
- Sequence labeling: assign a label to each token using context.
- Sequence-to-sequence: generate an output sequence conditioned on an input sequence.
Mental model
Imagine reading a sentence from left to right with a highlighter. At each new word, you predict what comes next. Your confidence changes based on what you have already seen. Sequence models do this, step by step.
Core building blocks
- Order matters: "bank of the river" vs "bank transfer" are disambiguated by surrounding words.
- Context window: models have a limit on how much context they effectively use; attention helps focus on relevant tokens.
- Alignment types:
- Many-to-one: whole sequence to one label (e.g., sentiment).
- Many-to-many aligned: input and output share length (e.g., POS, NER).
- Many-to-many unaligned (seq2seq): input and output lengths differ (e.g., translation, summarization).
- Probabilistic view: A sequence’s score is the product of its step-wise conditional probabilities. Lower perplexity means better predictions on average (see the sketch after this list).
- Decoding: Greedy picks the best next token now; beam search keeps the top-k partial sequences to consider better futures.
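To make the probabilistic view concrete, here is a minimal Python sketch that scores a sequence from made-up step-wise conditional probabilities and converts that score into perplexity (all numbers are illustrative, not from a real model):

```python
import math

# Made-up step-wise conditional probabilities for a 4-token sequence:
# P(w1), P(w2 | w1), P(w3 | w1, w2), P(w4 | w1, w2, w3)
step_probs = [0.20, 0.50, 0.40, 0.30]

# Sequence probability: the product of the step-wise conditionals.
sequence_prob = math.prod(step_probs)            # 0.012

# In practice we sum log-probabilities to avoid numerical underflow.
log_prob = sum(math.log(p) for p in step_probs)

# Perplexity: exp of the average negative log-probability per token.
# Lower perplexity means the sequence was less "surprising" to the model.
perplexity = math.exp(-log_prob / len(step_probs))

print(sequence_prob, perplexity)
```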
Try it: Spot the dependency
Which sentence contains a long-range dependency?
1) "The cat that the dogs chased was fast." ("was" must agree with "cat", several words back) — long-range
2) "He put the book on the table." — local context
Worked examples
Example 1 — Next-word prediction
Input: "I need to book a"
Reasoning: Likely nouns follow: "flight", "hotel". The verb "book" cues a reservation sense.
Output: Next token candidates: flight > hotel > table.
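In code, this step reduces to picking from a distribution over next tokens. A minimal sketch, assuming a toy model has already returned the distribution below (the probabilities are invented):

```python
# Invented next-token distribution after the prompt "I need to book a".
next_token_probs = {"flight": 0.42, "hotel": 0.31, "table": 0.09}

# Greedy choice: take the single most probable token.
best_token = max(next_token_probs, key=next_token_probs.get)
print(best_token)  # flight

# Ranked candidates, matching the worked example above.
ranked = sorted(next_token_probs, key=next_token_probs.get, reverse=True)
print(" > ".join(ranked))  # flight > hotel > table
```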
Example 2 — Sequence labeling (NER)
Input: "We met Sarah Connor in Boston"
Task: Tag each token as PER/LOC/O.
Reasoning: Capitalization and context help. "Sarah Connor" spans two tokens; "Boston" is a location.
Output: [We:O, met:O, Sarah:PER, Connor:PER, in:O, Boston:LOC]
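A minimal sketch of what this aligned output looks like in code; the tags here are written by hand, not produced by a model:

```python
tokens = ["We", "met", "Sarah", "Connor", "in", "Boston"]
tags   = ["O",  "O",   "PER",   "PER",    "O",  "LOC"]  # one tag per token (many-to-many aligned)

# The lists must stay the same length; zip pairs each token with its tag.
for token, tag in zip(tokens, tags):
    print(f"{token}:{tag}")
```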
Example 3 — Sequence-to-sequence (translation)
Source: "Je suis très content" (French)
Goal: English sentence. Output choices depend on earlier outputs: once you pick "I am", the next tokens like "very" and "happy" follow naturally.
Output: "I am very happy"
Example 4 — Decoding intuition
Greedy: choose the single best next token each time. Risk: early mistakes snowball.
Beam (k=2): keep the 2 best partial sequences at each step, so a slightly weaker choice now can still lead to a better sentence overall.
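A minimal sketch of both strategies on a tiny hand-written table of conditional probabilities (all numbers invented). Greedy commits to the locally best first token and gets stuck with a weak continuation; beam search with k=2 keeps the runner-up alive and finds the higher-probability sequence:

```python
import math

# Toy two-step "model": next-token probabilities conditioned on the tokens so far.
# The empty tuple () is the start of the sequence; all numbers are invented.
model = {
    (): {"late": 0.55, "very": 0.45},
    ("late",): {"night": 0.30, "show": 0.25, "fee": 0.20},
    ("very",): {"happy": 0.90, "sad": 0.05},
}

def greedy(model, steps=2):
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(model[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, math.exp(logp)

def beam_search(model, k=2, steps=2):
    beams = [((), 0.0)]  # (partial sequence, log-probability)
    for _ in range(steps):
        candidates = [
            (seq + (tok,), logp + math.log(p))
            for seq, logp in beams
            for tok, p in model[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    seq, logp = beams[0]
    return seq, math.exp(logp)

print(greedy(model))          # ('late', 'night') with probability ~0.165
print(beam_search(model, 2))  # ('very', 'happy') with probability ~0.405
```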
Practical heuristics
- Use many-to-one when you need one label for the whole text.
- Use aligned many-to-many for token-level tags.
- Use seq2seq when the output length differs from the input or the output rewrites it (translation, summarization, paraphrasing).
- Short outputs: greedy often works; longer outputs benefit from beam search (for accuracy-focused tasks) or nucleus sampling (for creative generation); see the sampling sketch after this list.
- When debugging, read outputs step-by-step to spot where the model drifts.
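For the sampling option mentioned above, here is a minimal nucleus (top-p) sampling sketch over an invented next-token distribution: keep only the smallest set of high-probability tokens whose total mass reaches p, then sample within it.

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Sample from the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break
    tokens, weights = zip(*nucleus)  # renormalization is implicit in weighted sampling
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented distribution for the next token after some prompt.
next_probs = {"flight": 0.45, "hotel": 0.30, "table": 0.15, "banana": 0.10}

random.seed(0)
print([nucleus_sample(next_probs, p=0.9) for _ in range(5)])
# The low-probability "banana" falls outside the nucleus, so it is never sampled.
```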
Practice: Exercises
Do these before the quick test. Open the solutions only after you commit to an answer.
Exercise 1 — Match tasks to sequence setups
Choose one per task: many-to-one, many-to-many aligned, many-to-many unaligned (seq2seq).
- Sentiment classification of a review
- POS tagging
- Named Entity Recognition
- Machine translation
- Next-word prediction (predict the very next token)
Solution
Sentiment: many-to-one
POS: many-to-many aligned
NER: many-to-many aligned
Machine translation: many-to-many unaligned (seq2seq)
Next-word prediction: many-to-one at each step (the whole context maps to one token); applied autoregressively, it generates a full sequence
Exercise 2 — Compare sequence probabilities
Given step-wise probabilities, pick the better 3-token output for the same context.
- Y1: tokens [i, like, apples] with probs [0.50, 0.60, 0.30]
- Y2: tokens [i, love, apples] with probs [0.40, 0.70, 0.25]
Which is more probable overall?
Solution
Multiply the step-wise probabilities:
Y1 = 0.50 × 0.60 × 0.30 = 0.090
Y2 = 0.40 × 0.70 × 0.25 = 0.070
Winner: Y1
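You can check this in a couple of lines of Python; summing log-probabilities gives the same ranking and is what you would do in practice:

```python
import math

y1 = [0.50, 0.60, 0.30]
y2 = [0.40, 0.70, 0.25]

print(math.prod(y1), math.prod(y2))                      # ~0.090 vs ~0.070
print(sum(map(math.log, y1)) > sum(map(math.log, y2)))   # True: Y1 wins either way
```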
Self-check checklist
- [ ] I can pick the right sequence setup for common NLP tasks.
- [ ] I can compare two candidate sequences using step-wise probabilities.
- [ ] I can explain greedy vs beam search in one sentence each.
Common mistakes and how to self-check
- Mistake: Treating word order as irrelevant. Self-check: Swap two words in a sentence; does the meaning change?
- Mistake: Using many-to-one for tagging tasks. Self-check: Do you need one label or one label per token?
- Mistake: Comparing sequences by averaging per-step probability. Self-check: Use the product (or the sum of log-probabilities); when lengths differ, compare length-normalized scores (see the sketch after this list).
- Mistake: Assuming greedy decoding is always fine. Self-check: Try beam k=2 or 4; do outputs improve coherently?
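A minimal sketch of a length-aware comparison (average log-probability per token), with invented numbers, showing how a raw product penalizes the longer sequence even when its per-token predictions are stronger:

```python
import math

def avg_log_prob(step_probs):
    """Average log-probability per token, comparable across different lengths."""
    return sum(math.log(p) for p in step_probs) / len(step_probs)

y_short = [0.50, 0.60]                # 2 tokens
y_long  = [0.50, 0.60, 0.70, 0.70]    # 4 tokens, with strong per-token predictions

# Raw products favor the shorter sequence: every extra factor is < 1.
print(math.prod(y_short), math.prod(y_long))   # 0.30 vs ~0.147

# Per-token averages favor the longer sequence here.
print(avg_log_prob(y_short), avg_log_prob(y_long))
```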
Mini challenge
Design inputs/outputs for a voicemail-to-action assistant:
- Input: "Hey, move our meeting to Friday 2pm and email the agenda."
- Define: Is this many-to-one, aligned, or seq2seq? What are reasonable outputs? How would you decode?
Example approach
Use seq2seq to generate a structured action sequence (e.g., JSON-like text with two actions). Decode with beam=3 for reliability; constrain tokens to a small schema vocabulary if possible.
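One possible target format, sketched as a hypothetical JSON-like schema (the field names are assumptions, not a standard); the key point is that the target is itself a token sequence the model generates:

```python
import json

# Hypothetical structured actions a seq2seq model could be trained to emit
# for the voicemail above; the schema is illustrative only.
actions = [
    {"action": "reschedule_meeting", "day": "Friday", "time": "14:00"},
    {"action": "send_email", "content": "agenda"},
]
print(json.dumps(actions, indent=2))
```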
Who this is for
Early-stage NLP engineers, data scientists, and ML practitioners who want a strong, practical grasp of sequences before diving into architectures.
Prerequisites
- Basic Python literacy (for future practice).
- Comfort with probabilities (products, logs).
- Familiarity with tokens and tokenization.
Learning path
- Grasp sequence setups (this lesson).
- Study language modeling basics (n-grams to modern models).
- Learn decoding strategies and their trade-offs.
- Move to attention and Transformers.
- Practice with real tasks: tagging and generation.
Practical projects
- Build a tiny next-word predictor using a small dataset; compare greedy vs top-k sampling (see the sketch after this list).
- Create a POS tagger baseline with a simple model; inspect errors around ambiguous words.
- Implement a toy seq2seq character-level translator for a synthetic mapping (e.g., reversing strings) to understand decoding.
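For the first project, a minimal bigram-based sketch (the toy corpus and counts are invented) that contrasts greedy selection with top-k sampling:

```python
import random
from collections import Counter, defaultdict

# Tiny toy corpus; any plain-text corpus works the same way.
corpus = "i need to book a flight . i want to book a flight . i need to book a hotel ."
tokens = corpus.split()

# Count bigram transitions: how often each word follows each context word.
transitions = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    transitions[prev][nxt] += 1

def greedy_next(word):
    """Deterministic: the most frequent continuation."""
    return transitions[word].most_common(1)[0][0]

def topk_next(word, k=2, rng=random):
    """Stochastic: sample among the k most frequent continuations, weighted by count."""
    words, counts = zip(*transitions[word].most_common(k))
    return rng.choices(words, weights=counts, k=1)[0]

random.seed(0)
print(greedy_next("a"))                   # always "flight" (it follows "a" most often)
print([topk_next("a") for _ in range(5)]) # a mix of "flight" and "hotel"
```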
Next steps
- Take the quick test to confirm intuition.
- Then move to attention mechanisms and positional encodings.
- Start a small tagging or generation project to solidify concepts.