
Rule-Based Features

Learn Rule-Based Features for free with explanations, exercises, and a quick test (for NLP Engineers).

Published: January 5, 2026 | Updated: January 5, 2026

Why this matters

Rule-based features are fast, transparent signals you handcraft from text. In real NLP work, they let you boost classical models (logistic regression, SVM, CRFs) and even improve neural systems with high-precision cues. You will use them to:

  • Detect entities with consistent patterns (emails, URLs, dates, IDs).
  • Flag behaviors in support tickets (complaints, refunds, escalation risk).
  • Improve sentiment/intent accuracy with negation and intensifier rules.
  • Handle compliance filters (PII detection) and spam indicators.

Professional scenarios

  • Customer support routing: keyword+regex features for product names, order numbers, and refund intents.
  • Moderation: all-caps shouting, profanity lexicon hits, repeated punctuation.
  • Search/query classification: slot-like features for locations, times, and price mentions.

Concept explained simply

Rule-based features are yes/no or numeric flags computed by deterministic patterns: “Does the text contain a URL?”, “How many uppercase tokens?”, “Is there a month name followed by a number?”. You add them as columns to your feature matrix alongside bag-of-words, n-grams, or embeddings.
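
As a minimal sketch of the idea, assuming plain Python and the standard-library re module (feature names like has_url are illustrative, not a fixed API):

```python
import re

def rule_features(text: str) -> dict:
    """Deterministic yes/no and count features for one document."""
    tokens = text.split()
    return {
        # "Does the text contain a URL?"
        "has_url": int(bool(re.search(r"https?://\S+|www\.\S+", text))),
        # "How many uppercase tokens?"
        "num_upper_tokens": sum(t.isupper() and len(t) >= 2 for t in tokens),
        # "Is there a month name followed by a number?"
        "month_then_number": int(bool(re.search(
            r"\b(?:January|February|March|April|May|June|July|August"
            r"|September|October|November|December)\s+\d{1,2}\b", text))),
    }

print(rule_features("Launch on March 12: see http://promo.example NOW"))
# {'has_url': 1, 'num_upper_tokens': 1, 'month_then_number': 1}
```

Each returned value becomes one extra column next to your n-gram or embedding features.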

Mental model

Imagine a dashboard of tiny sensors. Each sensor lights up if a rule is met. Your model learns how much to trust each sensor. You keep sensors that are precise, robust, and complementary.

Core building blocks

  • Regex and text patterns: emails, URLs, dates, currency, IDs, repeated punctuation, word shapes (Aa, Aaaa, dddd).
  • Lexicons/gazetteers: curated lists (months, countries, product names, sentiment words).
  • Token properties: is_capitalized, is_upper, is_titlecase, has_digit, prefix/suffix, length, punctuation-only (sketched in code after this list).
  • Context windows: features within ±k tokens of a keyword (e.g., "refund" within 5 tokens of "order").
  • Counts and ratios: count_uppercase_tokens, ratio_digits, number_of_exclamation_runs.
  • Simple syntax tags (optional if available): POS tags for patterns like ADJ before NOUN.
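
A hedged sketch of the token-property and word-shape blocks; the shape alphabet here (A for uppercase, a for lowercase, d for digit) is one common convention, not the only one:

```python
def word_shape(token: str) -> str:
    """Map character classes: 'March' -> 'Aaaaa', '2025' -> 'dddd'."""
    return "".join(
        "A" if c.isupper() else "a" if c.islower() else "d" if c.isdigit() else c
        for c in token
    )

def token_properties(token: str) -> dict:
    """Per-token flags of the kind listed above."""
    return {
        "is_upper": token.isupper(),
        "is_titlecase": token.istitle(),
        "has_digit": any(c.isdigit() for c in token),
        "prefix_3": token[:3],
        "suffix_3": token[-3:],
        "length": len(token),
        "shape": word_shape(token),
    }

print(token_properties("March"))  # is_titlecase True, shape 'Aaaaa'
print(token_properties("2025"))   # has_digit True, shape 'dddd'
```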

Worked examples

Example 1: Spam detection signals

Text: "WIN BIG!!! Visit http://promo.example NOW."

  • has_url = 1 (regex match for URL)
  • exclamation_runs_ge2 = 1 ("!!!")
  • has_all_caps_token = 1 ("WIN", "NOW")
  • num_calls_to_action >= 1 (lexicon: {"win", "visit", "now", "click"})

Why it works

These features are high-precision spam markers. Even a simple logistic regression can separate spam vs. ham when these fire together.
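
A sketch of how these four signals could be computed; the URL regex and the lexicon are deliberately simple and would need hardening in production:

```python
import re

CALLS_TO_ACTION = {"win", "visit", "now", "click"}  # lexicon from the example

def spam_signals(text: str) -> dict:
    tokens = re.findall(r"\w+", text)
    return {
        "has_url": int(bool(re.search(r"https?://\S+", text))),
        "exclamation_runs_ge2": int(bool(re.search(r"!{2,}", text))),
        "has_all_caps_token": int(any(t.isupper() and len(t) >= 2 for t in tokens)),
        "num_calls_to_action": sum(t.lower() in CALLS_TO_ACTION for t in tokens),
    }

print(spam_signals("WIN BIG!!! Visit http://promo.example NOW."))
# {'has_url': 1, 'exclamation_runs_ge2': 1,
#  'has_all_caps_token': 1, 'num_calls_to_action': 3}
```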

Example 2: Negation-aware sentiment

Text: "Not happy with the recent update."

  • negation_present = 1 (lexicon: {not, never, no, n't})
  • negated_positive = 1 (positive word "happy" within 3 tokens after negation)
  • final_sentiment_hint = negative (rule-only hint feature; keep binary/numeric for the model)

Why it works

Pure bag-of-words might read "happy" as positive. The negation window corrects it.
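
One possible implementation of the negation window, assuming simple regex tokenization (a real tokenizer would also split contractions so "n't" can fire):

```python
import re

NEGATORS = {"not", "never", "no", "n't"}
POSITIVE = {"happy", "great", "satisfied"}  # tiny illustrative lexicon

def negation_features(text: str, window: int = 3) -> dict:
    tokens = re.findall(r"[a-z']+", text.lower())
    neg_positions = [i for i, t in enumerate(tokens) if t in NEGATORS]
    # A positive word inside the window after any negator flips the cue.
    negated_positive = any(
        tokens[j] in POSITIVE
        for i in neg_positions
        for j in range(i + 1, min(i + 1 + window, len(tokens)))
    )
    return {
        "negation_present": int(bool(neg_positions)),
        "negated_positive": int(negated_positive),
    }

print(negation_features("Not happy with the recent update."))
# {'negation_present': 1, 'negated_positive': 1}
```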

Example 3: Date-like entity flag for NER

Text: "Schedule on 12 March 2025."

  • month_gaz_hit = 1 ("March" in months list)
  • day_number_before_month = 1 (\b\d{1,2}\b before month)
  • year_four_digits_after = 1 (\b\d{4}\b after month)
  • is_probable_date_span = 1 if all three above fire

Why it works

Combining simple cues yields a strong, human-readable candidate date feature for CRF or sequence labeling.
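
A sketch combining the three cues; the gazetteer and regexes mirror the bullets above:

```python
import re

MONTHS = {"january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"}

def date_cues(text: str) -> dict:
    tokens = re.findall(r"\w+", text)
    month_idx = [i for i, t in enumerate(tokens) if t.lower() in MONTHS]
    day_before = any(i > 0 and re.fullmatch(r"\d{1,2}", tokens[i - 1])
                     for i in month_idx)
    year_after = any(i + 1 < len(tokens) and re.fullmatch(r"\d{4}", tokens[i + 1])
                     for i in month_idx)
    feats = {
        "month_gaz_hit": int(bool(month_idx)),
        "day_number_before_month": int(bool(day_before)),
        "year_four_digits_after": int(bool(year_after)),
    }
    # Conjunction of simple cues -> strong candidate date span.
    feats["is_probable_date_span"] = int(all(feats.values()))
    return feats

print(date_cues("Schedule on 12 March 2025."))
# All four features fire: 1, 1, 1, 1
```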

How to build good rule-based features (step-by-step)

  1. List signals: Brainstorm 5–10 patterns tied to your label. Prioritize ones that are precise and common enough.
  2. Define rules: Write regex, lexicon checks, and token rules. Keep names explicit (e.g., has_url, exclamation_runs_ge2).
  3. Unit test: Create tiny test strings per rule to confirm expected firing (see the sketch after this list).
  4. Feature ablation: Train baseline, add features one group at a time, measure impact.
  5. Harden: Add Unicode, case, and spacing variants. Reduce false positives with boundaries and context windows.
  6. Maintain: Keep lexicons versioned. Review drift quarterly.
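
For step 3, plain assert statements on tiny strings are often enough; a minimal sketch using the rules from the worked examples:

```python
import re

def has_url(text: str) -> int:
    return int(bool(re.search(r"https?://\S+|www\.\S+", text)))

def exclamation_runs_ge2(text: str) -> int:
    return int(bool(re.search(r"!{2,}", text)))

# One passing and one failing string per rule, per step 3 above.
assert has_url("see https://example.com") == 1
assert has_url("no link here") == 0
assert exclamation_runs_ge2("WOW!!!") == 1
assert exclamation_runs_ge2("Nice!") == 0
print("all rule tests passed")
```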

Implementation tips without external libraries

  • Normalize text minimally: strip extra spaces; keep a lowercased copy alongside the original so case-based features survive.
  • Use conservative regex with word boundaries (\b) and anchors where helpful.
  • For windows, index token positions and check distance constraints, as sketched below.
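
The window check from the last tip takes only a few lines; a minimal sketch assuming pre-split tokens:

```python
def within_window(tokens, anchor, target, k=5):
    """True if `target` occurs within +/- k tokens of `anchor` (case-insensitive)."""
    toks = [t.lower() for t in tokens]
    anchor_pos = [i for i, t in enumerate(toks) if t == anchor]
    target_pos = [i for i, t in enumerate(toks) if t == target]
    return any(abs(i - j) <= k for i in anchor_pos for j in target_pos)

print(within_window("I want a refund for my last order".split(), "refund", "order"))
# True: "order" is 4 tokens after "refund"
```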

Evaluation and ablation

  • Start with a simple baseline (e.g., unigrams).
  • Add one feature group at a time (regex, lexicons, context). Track validation F1/accuracy (see the sketch after this list).
  • Remove groups to confirm they truly help.
  • Inspect top learned weights to confirm features align with intuition.
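
A hedged ablation sketch, assuming scikit-learn is available; the toy data and in-sample scores only show the mechanics (use a held-out split and F1 in practice):

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data; in real work, evaluate on a validation split.
texts = ["WIN cash now!!!", "meeting at 3pm", "click to claim your prize!!!",
         "lunch tomorrow?", "FREE offer, visit http://x.example", "see you soon"]
labels = np.array([1, 0, 1, 0, 1, 0])

def rule_matrix(texts):
    # One column per rule: exclamation runs, URLs.
    return np.array([[int(bool(re.search(r"!{2,}", t))),
                      int(bool(re.search(r"https?://\S+", t)))]
                     for t in texts])

X_bow = CountVectorizer().fit_transform(texts).toarray()

# Baseline first, then baseline + rule group: compare the scores.
for name, X in [("baseline: unigrams", X_bow),
                ("unigrams + regex rules", np.hstack([X_bow, rule_matrix(texts)]))]:
    score = LogisticRegression(max_iter=1000).fit(X, labels).score(X, labels)
    print(name, round(score, 3))
```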

Exercises

Do these hands-on tasks to solidify the concepts. Then take the Quick Test. Note: the test is available to everyone; only logged-in users get saved progress.

Exercise 1 — Multi-pattern spam cues

Given short messages, create binary features: has_url, has_phone, exclamation_runs_ge2, has_all_caps_token, call_to_action_hit. Use simple regex and small lexicons.

Sample inputs
  • "Call NOW!!! 555-123-4567"
  • "Update available at https://app.example"
  • "Thanks for your help"

Exercise 2 — Negation window for sentiment

Build the features negation_present, pos_in_neg_scope, and neg_in_neg_scope, using a scope window of 3 tokens after each negator. Apply them to the 3 sentences below.

Sample inputs
  • "I am not satisfied with delivery"
  • "I am happy, not upset"
  • "Never truly amazing"

Checklist before you move on

  • Rules are named clearly and tested on minimal strings.
  • Regexes use word boundaries to avoid partial matches.
  • Negation scope uses a fixed, small window (e.g., 3).
  • Each feature has at least one passing and one failing example.

Common mistakes and self-check

  • Overfitting regex to your sample. Self-check: run on fresh data; estimate false positives.
  • Data leakage. Self-check: ensure rules do not directly encode the label or target metadata.
  • Ignoring Unicode/case. Self-check: test with accented text and mixed case.
  • Too many overlapping features. Self-check: remove highly correlated ones; watch stability.
  • Negation windows too large. Self-check: keep small (2–4) and validate.
  • No ablation. Self-check: measure incremental gains per feature group.

Practical projects

  • Support intent classifier: refund/complaint/praise with 10–20 rule-based features + unigrams.
  • Lightweight PII detector: flags for emails, phone numbers, order IDs, and names using gazetteers.
  • Review sentiment booster: add negation and intensifier features to a baseline classifier.

Who this is for

  • Aspiring NLP Engineers building classic models.
  • Data Scientists needing quick, interpretable wins.
  • ML practitioners improving downstream accuracy with precise cues.

Prerequisites

  • Comfort with tokenization, bag-of-words, and basic ML (logistic regression/SVM/CRF).
  • Basic regex knowledge and text preprocessing practices.

Learning path

  • Master token features and regex patterns.
  • Add lexicon and windowed context features.
  • Integrate with classical models; run ablation and iterate.

Next steps

  • Extend rules to handle edge cases and multilingual text.
  • Blend with statistical features and compare performance.
  • Prepare tidy feature docs so teammates can maintain them.

Mini challenge

Design 8–12 rule-based features to detect refund intent in messages. Include at least: amount mention (currency), order/reference pattern, refund/return lexicon, negation handling near "satisfied"/"happy". Evaluate on 30 labeled messages and report which three features contributed most.


Practice Exercises


Instructions (Exercise 1)

Create the following binary features for each message:

  • has_url: matches http(s):// or www. with a domain
  • has_phone: matches patterns like 555-123-4567 or (555) 123 4567
  • exclamation_runs_ge2: two or more consecutive !
  • has_all_caps_token: any token of length >= 2 that is all uppercase
  • call_to_action_hit: lexicon hit in {buy, click, visit, win, now, free}

Apply to these texts:

  • "Call NOW!!! 555-123-4567"
  • "Update available at https://app.example"
  • "Thanks for your help"

Expected Output

A small table or list per message with 5 binary values indicating which rules fired.

Rule-Based Features — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

