
Naive Bayes Basics

Learn Naive Bayes Basics for free with explanations, exercises, and a quick test (for Data Scientists).

Published: January 1, 2026 | Updated: January 1, 2026

Why this matters

Naive Bayes gives you a fast, reliable baseline for classification. It shines when data is high-dimensional and you need something explainable and quick to deploy.

  • Spam filtering: classify emails by word presence/counts.
  • Sentiment tagging: positive vs. negative reviews.
  • Medical triage: risk flags from symptom checklists.
  • Support automation: route tickets by topic using text.
Real task snapshot

You receive thousands of customer messages daily. A simple Multinomial Naive Bayes can categorize messages into topics (billing, tech support, sales) with surprisingly strong accuracy and minimal compute.

Who this is for

  • Data Scientist learners who want a quick, explainable classifier.
  • Engineers needing a strong text baseline.
  • Anyone preparing for ML interviews and practical projects.

Prerequisites

  • Basic probability: conditional probability, Bayes' rule.
  • Understanding of features and classes.
  • Comfort with multiplication, logs, and simple ratios.
Nice to have
  • Text preprocessing basics (tokenization, stopwords).
  • Train/test split and cross-validation understanding.

Concept explained simply

Naive Bayes predicts a class by combining how likely each feature is under that class and multiplying by the class prior. The "naive" assumption is that features are conditionally independent given the class.

Decision rule (proportional form): P(C|x) ∝ P(C) × Π P(x_i | C)

Mental model

Imagine each feature votes for a class with a strength based on how typical it is for that class. Multiply all votes with the class prior; the strongest total wins. In practice we add logs of votes to avoid underflow: log P(C|x) = log P(C) + Σ log P(x_i | C) + constant.
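
To make the mental model concrete, here is a minimal Python sketch of the log-space scoring rule. The classes, features, and probabilities are made up purely for illustration.

import math

# "Votes in log space": log P(C) plus the log-likelihood of each observed feature.
# Classes, features, and probabilities below are made up for illustration.
priors = {"A": 0.3, "B": 0.7}
likelihoods = {
    "A": {"f1": 0.6, "f2": 0.2},
    "B": {"f1": 0.1, "f2": 0.5},
}
observed = ["f1", "f2"]

def log_score(cls):
    # log P(C) + sum_i log P(x_i | C); the shared normalizing constant is dropped
    return math.log(priors[cls]) + sum(math.log(likelihoods[cls][f]) for f in observed)

scores = {c: log_score(c) for c in priors}
print(scores)                       # the larger (less negative) log-score wins
print(max(scores, key=scores.get))  # -> "A" for these numbers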

Which variant to use? (a short code sketch of all three follows this list)
  • Multinomial NB: word counts in text (bag-of-words/TF).
  • Bernoulli NB: binary presence/absence features.
  • Gaussian NB: continuous features assumed normal.
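
If scikit-learn is available, the three variants map onto MultinomialNB, BernoulliNB, and GaussianNB. The tiny arrays below are made-up stand-ins for word counts, presence flags, and continuous measurements; this is a sketch of which class to reach for, not a tuned model.

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

y = np.array([0, 0, 1, 1])  # two made-up classes

# Word counts -> MultinomialNB
X_counts = np.array([[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 2, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 1]]))

# Binary presence/absence -> BernoulliNB
X_binary = (X_counts > 0).astype(int)
print(BernoulliNB().fit(X_binary, y).predict([[1, 0, 1]]))

# Continuous features -> GaussianNB
X_cont = np.array([[1.2, 0.1], [0.9, 0.3], [3.1, 2.2], [2.8, 1.9]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 0.2]]))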
About smoothing

Laplace/Lidstone smoothing adds a small pseudo-count to avoid zero probabilities for unseen features. This prevents an unseen word from zeroing the whole product.
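
A quick Python sketch of the smoothed estimate, with made-up counts and a four-word vocabulary. Note how words never seen in the class still get a small, nonzero probability.

alpha = 1.0  # Laplace smoothing; Lidstone uses a smaller value such as 0.1
vocab = ["free", "meeting", "offer", "agenda"]          # V = 4
counts_in_spam = {"free": 30, "offer": 10}              # "meeting", "agenda" unseen in Spam
total_words_in_spam = sum(counts_in_spam.values())      # 40

def p_word_given_spam(word):
    # (count(word in Spam) + alpha) / (total words in Spam + alpha * V)
    return (counts_in_spam.get(word, 0) + alpha) / (total_words_in_spam + alpha * len(vocab))

for w in vocab:
    print(w, round(p_word_given_spam(w), 4))
# free 0.7045, meeting 0.0227, offer 0.25, agenda 0.0227 — unseen words no longer zero the product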

Worked examples

Example 1: Spam vs. Ham (Multinomial NB)

P(Spam)=0.4, P(Ham)=0.6. Likelihoods for words: P(free|Spam)=0.7, P(meeting|Spam)=0.1; P(free|Ham)=0.05, P(meeting|Ham)=0.3. Email words: [free, free, meeting].

  • Score(Spam) ∝ 0.4 × 0.7 × 0.7 × 0.1 = 0.0196
  • Score(Ham) ∝ 0.6 × 0.05 × 0.05 × 0.3 = 0.00045

Normalize: total=0.02005 ⇒ P(Spam|x)≈0.978, P(Ham|x)≈0.022. Predict Spam.

Why do counts repeat?

Multinomial NB multiplies P(word|class) once per occurrence; repeated words increase influence proportionally.
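
The arithmetic of Example 1 can be checked with a few lines of Python, using exactly the probabilities stated above.

priors = {"Spam": 0.4, "Ham": 0.6}
likelihoods = {
    "Spam": {"free": 0.7, "meeting": 0.1},
    "Ham":  {"free": 0.05, "meeting": 0.3},
}
email = ["free", "free", "meeting"]  # each occurrence contributes one factor

scores = {}
for cls in priors:
    s = priors[cls]
    for word in email:
        s *= likelihoods[cls][word]
    scores[cls] = s

total = sum(scores.values())
print(scores)                                               # Spam ≈ 0.0196, Ham = 0.00045
print({c: round(s / total, 3) for c, s in scores.items()})  # Spam ≈ 0.978, Ham ≈ 0.022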

Example 2: Sentiment (Bernoulli NB)

P(Pos)=0.5, P(Neg)=0.5. For presence features {great, boring}: P(great|Pos)=0.6, P(boring|Pos)=0.1; P(great|Neg)=0.2, P(boring|Neg)=0.7. Review has both words present.

  • Score(Pos) ∝ 0.5 × 0.6 × 0.1 = 0.03
  • Score(Neg) ∝ 0.5 × 0.2 × 0.7 = 0.07

Predict Negative.

Absent features

Bernoulli NB can also include absent terms via (1 - P(word|class)). Here we considered presence-only for simplicity, which is common in practice.
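
The sketch below scores the review both ways: first the presence-only form used above, then the full Bernoulli form that also multiplies (1 − P(word|class)) for absent vocabulary words, demonstrated on a second, made-up review containing only "great".

priors = {"Pos": 0.5, "Neg": 0.5}
p_word = {
    "Pos": {"great": 0.6, "boring": 0.1},
    "Neg": {"great": 0.2, "boring": 0.7},
}
vocab = ["great", "boring"]

def bernoulli_score(cls, present, include_absent):
    s = priors[cls]
    for word in vocab:
        p = p_word[cls][word]
        if word in present:
            s *= p          # present: multiply by P(word|class)
        elif include_absent:
            s *= (1 - p)    # absent: multiply by 1 - P(word|class)
    return s

both = {"great", "boring"}
print({c: bernoulli_score(c, both, False) for c in priors})        # Pos ≈ 0.03, Neg ≈ 0.07 -> Negative
only_great = {"great"}
print({c: bernoulli_score(c, only_great, True) for c in priors})   # Pos ≈ 0.27, Neg ≈ 0.03 -> Positive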

Example 3: Medical triage (Bernoulli NB)

Classes: Flu vs Cold. Priors: P(Flu)=0.2, P(Cold)=0.8. Likelihoods (presence): P(fever|Flu)=0.9, P(cough|Flu)=0.7; P(fever|Cold)=0.3, P(cough|Cold)=0.8. Patient: fever=1, cough=1.

  • Score(Flu) ∝ 0.2 × 0.9 × 0.7 = 0.126
  • Score(Cold) ∝ 0.8 × 0.3 × 0.8 = 0.192

Predict Cold. Interpretation: despite fever strongly indicating Flu, the higher prior for Cold and strong cough likelihood tip the decision.
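
To see how much the prior drives this decision, the short sketch below rescores the same symptoms under the stated priors and then under hypothetical equal priors (0.5/0.5, used here purely for illustration).

likelihoods = {
    "Flu":  {"fever": 0.9, "cough": 0.7},
    "Cold": {"fever": 0.3, "cough": 0.8},
}

def scores(priors):
    # the patient has fever=1 and cough=1, so both presence likelihoods apply
    return {c: priors[c] * likelihoods[c]["fever"] * likelihoods[c]["cough"] for c in priors}

print(scores({"Flu": 0.2, "Cold": 0.8}))  # Flu ≈ 0.126, Cold ≈ 0.192 -> Cold
print(scores({"Flu": 0.5, "Cold": 0.5}))  # Flu ≈ 0.315, Cold ≈ 0.120 -> Flu (the prior flipped it)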

How to build a Naive Bayes classifier (step-by-step)

  1. Define the problem. Choose target classes and feature type (counts, binary, or continuous).
  2. Prepare data. Split into train/test. For text: tokenize, normalize, optional stopword removal.
  3. Estimate priors. P(C)=count(C)/N.
  4. Estimate likelihoods. Multinomial: P(word|C)=(count(word in C)+α)/(total words in C + α·V), where V is the vocabulary size. Bernoulli: presence rate of each feature per class. Gaussian: mean/variance per feature per class.
  5. Score. Use log-sum: log P(C) + Σ log P(x_i|C).
  6. Predict. Argmax over classes.
  7. Evaluate. Accuracy, precision/recall, F1; use cross-validation.
  8. Iterate. Tune α (smoothing), vocabulary, n-grams, or feature selection (see the scikit-learn sketch after this list).
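
A compact scikit-learn sketch of steps 2–7, fitted on a tiny made-up ticket corpus. The texts, labels, and split settings are illustrative only, not a benchmark.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Step 2: prepare data (toy corpus, repeated so a stratified split is possible)
texts = [
    "refund charged twice on my invoice", "update my billing address",
    "app crashes when I open settings", "error 500 after the latest update",
    "interested in the enterprise plan", "can I get a quote for 50 seats",
] * 5
labels = ["billing", "billing", "tech", "tech", "sales", "sales"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels)

# Fit the vocabulary on training data only (avoids leakage)
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Steps 3-5: priors and smoothed likelihoods are estimated inside fit(); alpha is the Laplace smoothing strength
clf = MultinomialNB(alpha=1.0)
clf.fit(X_train_counts, y_train)

# Steps 6-7: predict and evaluate with per-class precision/recall/F1
print(classification_report(y_test, clf.predict(X_test_counts)))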
Tip: use log space

Underflow is common when multiplying many small probabilities. Always compute in log space to keep numbers stable.
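
If you also need normalized posteriors from log-scores, avoid exponentiating them directly; use the log-sum-exp trick instead. A minimal sketch with made-up log-scores:

import math

# Made-up log-scores, e.g. sums over hundreds of tokens
log_scores = {"Spam": -1500.2, "Ham": -1508.9}

m = max(log_scores.values())
# Shift by the max before exponentiating; math.exp(-1500.2) on its own underflows to 0.0
log_norm = m + math.log(sum(math.exp(v - m) for v in log_scores.values()))
posteriors = {c: math.exp(v - log_norm) for c, v in log_scores.items()}
print(posteriors)  # ≈ {"Spam": 0.9998, "Ham": 0.0002}; sums to 1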

Exercises

Work through these on paper first, then compare with the solutions.

Exercise 1: Spam score comparison

Given P(Spam)=0.4, P(Ham)=0.6. P(free|Spam)=0.7, P(meeting|Spam)=0.1; P(free|Ham)=0.05, P(meeting|Ham)=0.3. Email words: [free, free, meeting]. Which class wins? Also estimate the normalized posterior for the winning class.

  • Expected: class label and approximate posterior.
Show solution

Score(Spam)=0.4×0.7×0.7×0.1=0.0196; Score(Ham)=0.6×0.05×0.05×0.3=0.00045. Normalize: total=0.02005 ⇒ P(Spam|x)≈0.978, P(Ham|x)≈0.022. Predict Spam.

Exercise 2: Symptom-based classification

Priors: P(Flu)=0.2, P(Cold)=0.8. Likelihoods (presence): P(fever|Flu)=0.9, P(cough|Flu)=0.7; P(fever|Cold)=0.3, P(cough|Cold)=0.8. Patient: fever=1, cough=1. Which class is predicted?

  • Expected: Flu or Cold.
Show solution

Score(Flu)=0.2×0.9×0.7=0.126; Score(Cold)=0.8×0.3×0.8=0.192. Predict Cold.

Hints
  • Multiply priors by each present-feature likelihood.
  • Normalize only if you need posteriors; argmax works with unnormalized scores.

Practice checklist

  • I can compute Naive Bayes scores and pick the argmax.
  • I understand when to use Multinomial vs Bernoulli vs Gaussian.
  • I know why smoothing is needed and how to apply it.
  • I can work in log space to avoid underflow.

Common mistakes and self-check

  • Zero probabilities. Forgetting smoothing makes any unseen feature zero the score. Self-check: does any test item with unseen words get probability zero? Add α (e.g., 1.0 or 0.1).
  • Wrong variant. Using Multinomial on binary presence or Gaussian on skewed counts hurts accuracy. Self-check: match variant to feature type.
  • Ignoring class imbalance. Priors matter. Self-check: compute P(C)=count(C)/N; verify impact on decisions.
  • Not using logs. Underflow leads to all zeros. Self-check: monitor min probability; switch to log-sum if very small.
  • Data leakage. Building vocabulary on full dataset. Self-check: fit vocab only on training data.
Quick audit
  • Did I compute priors from train only?
  • Is smoothing applied consistently across classes?
  • Are evaluation metrics reported per class (precision/recall)?

Practical projects

  1. Toy spam filter. Build a Multinomial NB on a small email-like dataset. Acceptance: >85% accuracy on a held-out set; inspect top indicative words for Spam vs Ham.
  2. News topic tagger. Classify short articles into 3–5 topics using unigrams + bigrams. Acceptance: F1 ≥ 0.75; show the top 10 words per topic.
  3. Medical symptom triage. Bernoulli NB over binary symptoms. Acceptance: Confusion matrix with per-class recall≥0.7; document effect of changing priors.
Stretch goals
  • Try different smoothing α and compare performance.
  • Use feature selection (chi-square) to prune vocabulary.
  • Compare NB vs logistic regression baseline.

Learning path

  • Right now: Naive Bayes basics, hand calculations, smoothing.
  • Next: Model evaluation (precision/recall, ROC), feature engineering for text.
  • Then: Logistic regression for linear decision boundaries; compare with NB.
  • Later: Regularization, SVMs, tree-based models; ensemble baselines.

Mini challenge

You have two classes (Bug, Feature request). Prior P(Bug)=0.7, P(Feature)=0.3. Words and likelihoods (Multinomial):

  • P(crash|Bug)=0.5, P(crash|Feature)=0.05
  • P(request|Bug)=0.02, P(request|Feature)=0.4
  • P(new|Bug)=0.03, P(new|Feature)=0.3

Ticket text tokens: [crash, request, new]. Which class wins? Compute scores and the winning posterior (approximate).

Peek answer

Score(Bug)=0.7×0.5×0.02×0.03=0.00021; Score(Feature)=0.3×0.05×0.4×0.3=0.0018 ⇒ Feature wins; posterior≈0.0018/(0.0018+0.00021)≈0.896.
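
A few lines of Python to verify the arithmetic, using the probabilities stated in the challenge:

priors = {"Bug": 0.7, "Feature": 0.3}
likelihoods = {
    "Bug":     {"crash": 0.5, "request": 0.02, "new": 0.03},
    "Feature": {"crash": 0.05, "request": 0.4, "new": 0.3},
}
tokens = ["crash", "request", "new"]

scores = {}
for cls in priors:
    s = priors[cls]
    for t in tokens:
        s *= likelihoods[cls][t]
    scores[cls] = s

total = sum(scores.values())
print(scores)                                               # Bug ≈ 0.00021, Feature ≈ 0.0018
print({c: round(s / total, 3) for c, s in scores.items()})  # Feature ≈ 0.896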

Next steps

  • Finish the exercises above, then take the quick test below.
  • Document assumptions (variant, α, preprocessing) for reproducibility.
  • Compare NB baseline against at least one alternative model on the same split.

Quick test

Take the Naive Bayes Basics — Quick Test below to check your understanding. Available to everyone; only logged-in users get saved progress.


Naive Bayes Basics — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.
