Why this matters
Rule-based features are fast, transparent signals you handcraft from text. In real NLP work, they let you boost classical models (logistic regression, SVM, CRFs) and even improve neural systems with high-precision cues. You will use them to:
- Detect entities with consistent patterns (emails, URLs, dates, IDs).
- Flag behaviors in support tickets (complaints, refunds, escalation risk).
- Improve sentiment/intent accuracy with negation and intensifier rules.
- Handle compliance filters (PII detection) and spam indicators.
Professional scenarios
- Customer support routing: keyword+regex features for product names, order numbers, and refund intents.
- Moderation: all-caps shouting, profanity lexicon hits, repeated punctuation.
- Search/query classification: slot-like features for locations, times, and price mentions.
Concept explained simply
Rule-based features are yes/no or numeric flags computed by deterministic patterns: “Does the text contain a URL?”, “How many uppercase tokens?”, “Is there a month name followed by a number?”. You add them as columns to your feature matrix alongside bag-of-words, n-grams, or embeddings.
Mental model
Imagine a dashboard of tiny sensors. Each sensor lights up if a rule is met. Your model learns how much to trust each sensor. You keep sensors that are precise, robust, and complementary.
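To make the sensor idea concrete, here is a minimal sketch (feature names like has_url are illustrative choices, not a standard API) that computes a few flags and appends them as extra columns next to a toy bag-of-words vector:

```python
import re

def rule_features(text: str) -> dict:
    """Compute a few illustrative rule-based flags."""
    tokens = text.split()
    return {
        "has_url": int(bool(re.search(r"https?://\S+", text))),
        "num_upper_tokens": sum(1 for t in tokens if t.isupper() and len(t) > 1),
        "has_digit": int(any(ch.isdigit() for ch in text)),
    }

# Append the flags as extra columns next to a (toy) bag-of-words vector.
bow_vector = [2, 0, 1]  # pretend unigram counts
extra = rule_features("WIN BIG!!! Visit http://promo.example NOW.")
full_vector = bow_vector + list(extra.values())
print(extra)  # {'has_url': 1, 'num_upper_tokens': 3, 'has_digit': 0}
```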
Core building blocks
- Regex and text patterns: emails, URLs, dates, currency, IDs, repeated punctuation, word shapes (Aa, Aaaa, dddd; sketched in code after this list).
- Lexicons/gazetteers: curated lists (months, countries, product names, sentiment words).
- Token properties: is_capitalized, is_upper, is_titlecase, has_digit, prefix/suffix, length, punctuation-only.
- Context windows: features within ±k tokens of a keyword (e.g., refund within 5 tokens of order).
- Counts and ratios: count_uppercase_tokens, ratio_digits, number_of_exclamation_runs.
- Simple syntax tags (optional, when a tagger is available): POS-tag patterns such as ADJ before NOUN.
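A minimal sketch of the word-shape and token-property items above (the property names are illustrative):

```python
import re

def word_shape(token: str) -> str:
    """Map characters to classes: 'A' for upper, 'a' for lower, 'd' for digit."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"\d", "d", shape)

def token_properties(token: str) -> dict:
    return {
        "is_upper": int(token.isupper()),
        "is_titlecase": int(token.istitle()),
        "has_digit": int(any(c.isdigit() for c in token)),
        "prefix2": token[:2].lower(),
        "suffix2": token[-2:].lower(),
        "length": len(token),
    }

print(word_shape("March"), word_shape("2025"))  # Aaaaa dddd
print(token_properties("March"))
```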
Worked examples
Example 1: Spam detection signals
Text: "WIN BIG!!! Visit http://promo.example NOW."
- has_url = 1 (regex match for URL)
- exclamation_runs_ge2 = 1 ("!!!")
- has_all_caps_token = 1 ("WIN", "NOW")
- num_calls_to_action = 3 ("WIN", "Visit", "NOW" hit the lexicon {"win", "visit", "now", "click"})
Why it works
These features are high-precision spam markers. Even a simple logistic regression can separate spam vs. ham when these fire together.
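A minimal sketch of these signals (the regexes and the call-to-action lexicon are deliberately tiny; expand them for real data):

```python
import re

CTA_LEXICON = {"win", "visit", "now", "click"}  # tiny illustrative lexicon

def spam_features(text: str) -> dict:
    tokens = text.split()
    # Strip punctuation and lowercase for lexicon lookups only.
    words = [re.sub(r"\W+", "", t).lower() for t in tokens]
    return {
        "has_url": int(bool(re.search(r"https?://\S+", text))),
        "exclamation_runs_ge2": int(bool(re.search(r"!{2,}", text))),
        "has_all_caps_token": int(any(t.isupper() and len(t) >= 2 for t in tokens)),
        "num_calls_to_action": sum(1 for w in words if w in CTA_LEXICON),
    }

print(spam_features("WIN BIG!!! Visit http://promo.example NOW."))
# {'has_url': 1, 'exclamation_runs_ge2': 1, 'has_all_caps_token': 1, 'num_calls_to_action': 3}
```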
Example 2: Negation-aware sentiment
Text: "Not happy with the recent update."
- negation_present = 1 (lexicon: {not, never, no, n't})
- negated_positive = 1 (positive word "happy" within 3 tokens after negation)
- final_sentiment_hint = negative (a rule-only hint; encode it as a binary/numeric flag for the model)
Why it works
Pure bag-of-words might read "happy" as positive. The negation window corrects it.
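A minimal sketch of the negation window (the lexicons are tiny placeholders; note that matching "n't" assumes a tokenizer that splits contractions):

```python
NEGATORS = {"not", "never", "no", "n't"}            # "n't" needs contraction-splitting tokenization
POSITIVE = {"happy", "great", "good", "satisfied"}  # tiny placeholder lexicon
WINDOW = 3  # tokens after the negator

def negation_features(text: str) -> dict:
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    negated_positive = 0
    for i, tok in enumerate(tokens):
        if tok in NEGATORS:
            # Look at the next WINDOW tokens for a positive word.
            if any(w in POSITIVE for w in tokens[i + 1 : i + 1 + WINDOW]):
                negated_positive = 1
    return {
        "negation_present": int(any(t in NEGATORS for t in tokens)),
        "negated_positive": negated_positive,
    }

print(negation_features("Not happy with the recent update."))
# {'negation_present': 1, 'negated_positive': 1}
```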
Example 3: Date-like entity flag for NER
Text: "Schedule on 12 March 2025."
- month_gaz_hit = 1 ("March" in months list)
- day_number_before_month = 1 (\b\d{1,2}\b before month)
- year_four_digits_after = 1 (\b\d{4}\b after month)
- is_probable_date_span = 1 if all three above fire
Why it works
Combining simple cues yields a strong, human-readable candidate-date feature for a CRF or other sequence labeler.
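A minimal sketch of how the three cues combine (whitespace tokenization and a tiny month gazetteer are simplifying assumptions):

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}

def date_features(text: str) -> dict:
    tokens = [t.strip(".,") for t in text.split()]
    month_gaz_hit = day_before = year_after = 0
    for i, tok in enumerate(tokens):
        if tok.lower() in MONTHS:
            month_gaz_hit = 1
            if i > 0 and re.fullmatch(r"\d{1,2}", tokens[i - 1]):
                day_before = 1
            if i + 1 < len(tokens) and re.fullmatch(r"\d{4}", tokens[i + 1]):
                year_after = 1
    return {
        "month_gaz_hit": month_gaz_hit,
        "day_number_before_month": day_before,
        "year_four_digits_after": year_after,
        "is_probable_date_span": int(month_gaz_hit and day_before and year_after),
    }

print(date_features("Schedule on 12 March 2025."))  # all four flags fire
```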
How to build good rule-based features (step-by-step)
- List signals: Brainstorm 5–10 patterns tied to your label. Prioritize ones that are precise and common enough.
- Define rules: Write regex, lexicon checks, and token rules. Keep names explicit (e.g., has_url, exclamation_runs_ge2).
- Unit test: Create tiny test strings per rule to confirm expected firing (see the sketch after this list).
- Feature ablation: Train a baseline, add features one group at a time, and measure the impact.
- Harden: Add Unicode, case, and spacing variants. Reduce false positives with boundaries and context windows.
- Maintain: Keep lexicons versioned. Review drift quarterly.
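For the unit-test step, a minimal sketch of per-rule test strings, pairing one string that should fire with one that should not:

```python
import re

def has_url(text):
    return bool(re.search(r"https?://\S+", text))

# (rule, input, expected) triples: one passing and one failing example per rule.
TESTS = [
    (has_url, "see https://example.com", True),
    (has_url, "see example dot com", False),
]

for rule, text, expected in TESTS:
    assert rule(text) == expected, f"{rule.__name__} failed on {text!r}"
print("all rule tests passed")
```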
Implementation tips without external libraries
- Normalize text minimally: strip extra spaces, and keep a lowercased copy alongside the original so case-based features still work.
- Use conservative regex with word boundaries (\b) and anchors where helpful.
- For context windows, index token positions and check distance constraints, as sketched below.
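A minimal sketch of a distance-constrained window check (whitespace tokenization assumed):

```python
def within_window(tokens, anchor_word, target_word, k=5):
    """True if target_word occurs within +/- k token positions of anchor_word."""
    anchors = [i for i, t in enumerate(tokens) if t.lower() == anchor_word]
    targets = [i for i, t in enumerate(tokens) if t.lower() == target_word]
    return any(abs(a - b) <= k for a in anchors for b in targets)

tokens = "I want a refund for my order today".split()
print(within_window(tokens, "refund", "order", k=5))  # True
```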
Evaluation and ablation
- Start with a simple baseline (e.g., unigrams).
- Add one feature group at a time (regex, lexicons, context). Track validation F1/accuracy; a minimal harness sketch follows this list.
- Remove groups to confirm they truly help.
- Inspect top learned weights to confirm features align with intuition.
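A minimal ablation harness, assuming scikit-learn is available; the train/validation text and label variables are placeholders you supply:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def evaluate(train_texts, y_train, val_texts, y_val, extra_train=None, extra_val=None):
    vec = CountVectorizer()
    X_train = vec.fit_transform(train_texts).toarray()
    X_val = vec.transform(val_texts).toarray()
    if extra_train is not None:  # stack rule-based columns next to unigrams
        X_train = np.hstack([X_train, extra_train])
        X_val = np.hstack([X_val, extra_val])
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return f1_score(y_val, model.predict(X_val))

# baseline  = evaluate(train_texts, y_train, val_texts, y_val)
# with_rules = evaluate(train_texts, y_train, val_texts, y_val, extra_train, extra_val)
```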
Exercises
Do these hands-on tasks to solidify the concepts. Then take the Quick Test. Note: the test is available to everyone; only logged-in users get saved progress.
Exercise 1 — Multi-pattern spam cues
Given short messages, create binary features: has_url, has_phone, exclamation_runs_ge2, has_all_caps_token, call_to_action_hit. Use simple regex and small lexicons.
Sample inputs
- "Call NOW!!! 555-123-4567"
- "Update available at https://app.example"
- "Thanks for your help"
Exercise 2 — Negation window for sentiment
Build the features negation_present, pos_in_neg_scope, and neg_in_neg_scope, using a scope window of 3 tokens after the negator. Apply them to the 3 sentences below.
Sample inputs
- "I am not satisfied with delivery"
- "I am happy, not upset"
- "Never truly amazing"
Checklist before you move on
- Rules are named clearly and tested on minimal strings.
- Regexes use word boundaries to avoid partial matches.
- Negation scope uses a fixed, small window (e.g., 3).
- Each feature has at least one passing and one failing example.
Common mistakes and self-check
- Overfitting regexes to your sample. Self-check: run on fresh data; estimate false positives.
- Data leakage. Self-check: ensure rules do not directly encode the label or target metadata.
- Ignoring Unicode/case. Self-check: test with accented text and mixed case.
- Too many overlapping features. Self-check: remove highly correlated ones; watch stability.
- Negation windows too large. Self-check: keep small (2–4) and validate.
- No ablation. Self-check: measure incremental gains per feature group.
Practical projects
- Support intent classifier: refund/complaint/praise with 10–20 rule-based features + unigrams.
- Lightweight PII detector: flags for emails, phone numbers, order IDs, and names using gazetteers.
- Review sentiment booster: add negation and intensifier features to a baseline classifier.
Who this is for
- Aspiring NLP Engineers building classic models.
- Data Scientists needing quick, interpretable wins.
- ML practitioners improving downstream accuracy with precise cues.
Prerequisites
- Comfort with tokenization, bag-of-words, and basic ML (logistic regression/SVM/CRF).
- Basic regex knowledge and text preprocessing practices.
Learning path
- Master token features and regex patterns.
- Add lexicon and windowed context features.
- Integrate with classical models; run ablation and iterate.
Next steps
- Extend rules to handle edge cases and multilingual text.
- Blend with statistical features and compare performance.
- Prepare tidy feature docs so teammates can maintain them.
Mini challenge
Design 8–12 rule-based features to detect refund intent in messages. Include at least: amount mention (currency), order/reference pattern, refund/return lexicon, negation handling near "satisfied"/"happy". Evaluate on 30 labeled messages and report which three features contributed most.