

API Ingestion Patterns

Learn API ingestion patterns with explanations, exercises, and a quick test (for Data Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Who this is for

You are a Data Engineer who needs to pull reliable data from third-party APIs into a warehouse, lake, or event system. You want patterns that scale, remain resilient under rate limits, and keep data correct and deduplicated.

Prerequisites

  • [ ] Comfort with HTTP basics (methods, headers, status codes)
  • [ ] Experience with JSON and pagination concepts
  • [ ] Basic understanding of batch vs. stream data ingestion
  • [ ] Ability to read logs and handle errors/retries

Why this matters

Real tasks you will face:

  • Sync customer, order, and usage data from SaaS APIs daily or hourly without missing or duplicating records.
  • Handle rate limits and pagination so you don’t get blocked mid-ingestion.
  • React to real-time events (webhooks) safely with at-least-once processing, while avoiding downstream duplicates.
  • Recover from failures by resuming from a checkpoint rather than starting over.

Concept explained simply

An API ingestion pattern is a repeatable way to pull or receive data from an API reliably. Think of it as a recipe that covers: how to fetch, how to pace requests, how to continue from where you left off, and how to avoid duplicates.

Mental model

Imagine a conveyor belt of records moving past a scanner:

  • The belt speed is the API rate limit.
  • The scanner’s last inspected position is your checkpoint (watermark).
  • Packages occasionally repeat on the belt (duplicate events) or arrive out of order; your scanner needs ID checks to skip duplicates and store only the newest version.

Common building blocks

  • Pagination: offset, page, or cursor (next token).
  • Incremental filters: updated_after, since, since_id.
  • Idempotency: dedupe by stable IDs or hashes to ensure upserts are safe.
  • Rate limiting: respect 429s, Retry-After, and exponential backoff.
  • Checkpointing: persist the last processed cursor/timestamp/ID.
  • Schema evolution: handle new fields without breaking pipelines.
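
Idempotency in particular deserves a concrete illustration: when an API provides no stable ID, a content hash of the normalized record can serve as the dedupe key. A minimal sketch (the function name and normalization choices are mine, not from any particular library):

```python
import hashlib
import json

def record_fingerprint(record):
    """Stable hash of a record's content, usable as a dedupe key when
    the API provides no reliable ID. Keys are sorted so logically
    identical payloads hash the same regardless of field order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two payloads with the same fields in a different order produce the same fingerprint, so a retried delivery can be recognized and skipped.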

Core patterns

1) Scheduled pull with pagination

Use for resources that change slowly or when the API has no push mechanism.

  • Run on a schedule (e.g., hourly).
  • Page through results until exhaustion.
  • Respect rate limits; store a checkpoint.

2) Incremental sync with watermarks

Use when the API supports updated_since, since_id, or similar filters.

  • Query only new/changed records since the last watermark.
  • Upsert into a destination to avoid duplicates.
  • Move the watermark forward only after successful write.

3) Webhook + queue for near real-time

Use when the provider can push events to your endpoint.

  • Receive the event, verify signature, enqueue.
  • Workers fetch full details (if needed) and upsert.
  • Use event IDs for idempotency; retry safely.

4) Change capture replay

If the API exposes event streams or historical cursors, consume in order, keeping a replay cursor. Ideal for event-driven platforms.

Worked examples

Example 1: Poll a REST API with cursor-based pagination and rate limits

  1. Start with last_cursor (or empty for full load).
  2. GET /items?limit=500&cursor=last_cursor
  3. If 429, sleep using Retry-After header or exponential backoff and retry.
  4. Write page to staging, upsert to final table.
  5. Update last_cursor only after successful upsert.
  6. Repeat until no next_cursor.
state = load_state()            # {"cursor": null} on a first full load
loop:
  resp = GET /items?limit=500&cursor=state.cursor
  if resp.status == 429:
    wait(resp.headers.get('Retry-After', backoff))
    continue
  write_to_staging(resp.data)
  upsert(resp.data, key='id', updated_at='updated_at')
  state.cursor = resp.next_cursor
  save_state(state)             # checkpoint only after the upsert succeeds
  if not state.cursor: break

Why it works

Cursor pagination is resilient to inserts/deletes between pages. Checkpoints protect from rework after failures.
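The loop above can be sketched as runnable Python. Here fetch_page, upsert, and save_state are caller-supplied stand-ins (illustrative names, not a real client library); the key point is the ordering: write first, then advance and persist the cursor.

```python
def ingest(fetch_page, upsert, save_state, state=None):
    """Drive cursor pagination to exhaustion, checkpointing the cursor
    only after each page has been durably written.

    fetch_page(cursor) -> (rows, next_cursor)
    upsert(rows)       -> writes rows idempotently
    save_state(state)  -> persists the checkpoint
    """
    state = state or {"cursor": None}
    while True:
        rows, next_cursor = fetch_page(state["cursor"])
        upsert(rows)                    # idempotent write first...
        state["cursor"] = next_cursor   # ...then advance the cursor
        save_state(state)
        if next_cursor is None:
            return state

# Usage with a fake two-page API:
pages = {None: ([1, 2], "c1"), "c1": ([3], None)}
written, saved = [], []
ingest(lambda c: pages[c], written.extend, lambda s: saved.append(dict(s)))
```

Because the checkpoint is saved after every page, a crash mid-run resumes from the last written page rather than restarting the full load.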

Example 2: Webhook ingestion with idempotency

  1. Expose /webhook endpoint that validates a shared secret or signature header.
  2. Push the raw payload to a durable queue.
  3. Worker pulls messages, dedupes by event_id, fetches full record if necessary, then upserts.
  4. On failure, retry with exponential backoff; dedupe prevents double-writes.
on_webhook(event):
  if not verify_signature(event): return 401
  enqueue(event)
  return 200

worker(msg):
  if seen(msg.event_id): ack(msg); return
  record = maybe_fetch_details(msg)
  upsert(record, key='id')
  mark_seen(msg.event_id)
  ack(msg)

Why it works

Queues smooth spikes, retries handle transient issues, and idempotency ensures safe reprocessing.
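The worker's idempotency check can be sketched as a small function. The seen set stands in for a durable dedupe table, and all names here are illustrative:

```python
def process(msg, seen, store, fetch_details=None):
    """Idempotent webhook worker: skip already-processed event IDs,
    optionally enrich the payload, then upsert keyed by record ID.
    `seen` is any set-like store of processed event IDs; in
    production this would be a durable table, not process memory."""
    if msg["event_id"] in seen:
        return False                  # duplicate delivery; ack and move on
    record = fetch_details(msg) if fetch_details else msg["payload"]
    store[record["id"]] = record      # upsert keyed by stable record ID
    seen.add(msg["event_id"])         # mark done only after the write
    return True
```

Reprocessing the same message is a no-op, so the queue can safely redeliver on failure.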

Example 3: Incremental sync using updated_at watermark

  1. Keep watermark T0 (ISO timestamp).
  2. GET /orders?updated_after=T0&limit=1000
  3. Process pages; track max(updated_at) seen, but do not commit it yet.
  4. Upsert all rows; commit watermark to T1 (the max) only after success.
  5. Next run uses T1.
watermark = load_watermark()
max_seen = watermark
for page in paginate('/orders', updated_after=watermark):
  upsert(page.rows, key='order_id', updated_at='updated_at')
  max_seen = max(max_seen, max(row.updated_at for row in page.rows))
commit_watermark(max_seen)    # only runs if every page succeeded

Why it works

Moving the watermark only after success prevents gaps. Upserts handle late-arriving updates.
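The same pattern, sketched as runnable Python with caller-supplied stand-ins (fetch_since, upsert, load_wm, commit_wm are illustrative names). ISO-8601 timestamps compare correctly as strings, which keeps the sketch simple:

```python
def incremental_sync(fetch_since, upsert, load_wm, commit_wm):
    """Pull rows updated after the stored watermark, upsert them,
    and commit the new watermark only after every page succeeds.
    Rows need an 'updated_at' field comparable with max()."""
    watermark = load_wm()
    max_seen = watermark
    for rows in fetch_since(watermark):
        upsert(rows)
        max_seen = max([max_seen] + [r["updated_at"] for r in rows])
    commit_wm(max_seen)   # reached only if all writes succeeded
    return max_seen

# Usage with two fake pages of orders:
pages = [
    [{"id": 1, "updated_at": "2025-01-02"}],
    [{"id": 2, "updated_at": "2025-01-05"}],
]
written, state = [], {"wm": "2025-01-01"}
incremental_sync(lambda w: iter(pages), written.extend,
                 lambda: state["wm"],
                 lambda w: state.update(wm=w))
```

If any page fails, commit_wm never runs, so the next run re-reads from the old watermark and the upserts absorb the overlap.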

Design checklists

Before you start

  • [ ] Identify pagination type: offset, page, or cursor
  • [ ] Identify rate limit rules and headers (429, Retry-After)
  • [ ] Find stable IDs for dedupe
  • [ ] Find incremental parameters (updated_after, since_id)
  • [ ] Decide on checkpoint storage (file, table, key-value store)
  • [ ] Define retries, backoff, and max attempts
  • [ ] Plan for schema changes (nullable new fields, versioning)

During implementation

  • [ ] Log request ID, page/cursor, and response status
  • [ ] Validate and sanitize payloads
  • [ ] Use upsert semantics in the destination
  • [ ] Treat 5xx as retryable; 4xx (except 429) as bugs/config issues
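
The last checklist item can be captured in a single predicate. A sketch (real clients may also retry 408 Request Timeout or network-level errors; that choice is yours):

```python
def is_retryable(status):
    """Treat server errors and rate limiting as transient; other
    client errors signal a bug or bad config and should fail fast."""
    return status == 429 or 500 <= status <= 599
```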

Step-by-step: Build a robust API ingestor

  1. Choose pattern: pull with pagination, incremental watermark, webhook, or a hybrid.
  2. Define schema and upsert keys (natural or API-provided IDs).
  3. Implement fetch with retry/backoff and rate-limit handling.
  4. Write to staging first; validate; then upsert to final tables.
  5. Checkpoint after successful batch.
  6. Monitor with metrics: rows_ingested, duration, error_rate.

Mini task: simulate a 429

Pretend the API returned 429. Decide how long to wait. If a Retry-After header of 5 is present, sleep 5 seconds; otherwise back off exponentially (e.g., 1s, 2s, 4s, ...).
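
One way to encode that decision as a helper (the function name, the 60-second cap, and the jitter factor are my assumptions, not from any specific client):

```python
import random

def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before the next retry.

    Prefers the server's Retry-After value when present; otherwise
    doubles the base delay per attempt (1s, 2s, 4s, ...) up to a cap,
    plus a little jitter so concurrent workers don't retry in lockstep.
    """
    if retry_after is not None:
        return float(retry_after)
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.1)
```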

Exercises

Exercise: Design a polling plan for a rate-limited API

You have an endpoint /events with cursor pagination and a limit of 1000 req/hour. Create a plan that handles pagination, checkpoints, dedupe, and 429 responses.

  • Deliverables: request flow, retry/backoff strategy, checkpoint format, upsert key(s).

Solution

Flow:

  1. Load state: last_cursor (nullable) and last_success_ts.
  2. GET /events?limit=500&cursor=last_cursor with headers for auth.
  3. If 429: if Retry-After present, sleep that many seconds; else exponential backoff starting at 1s up to 60s. Retry with jitter.
  4. Validate JSON, write to staging, upsert to final using event_id as key and updated_at for conflict resolution.
  5. Set last_cursor = response.next_cursor after successful upsert.
  6. Stop when next_cursor is null; save state atomically.

Checkpoint format:

{"last_cursor": "abc123", "last_success_ts": "2025-04-12T09:33:00Z"}

Dedupe:

  • Use event_id as idempotency key; ignore duplicates.
  • For late updates, compare updated_at to keep the newest.
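
Step 6 of the flow calls for saving state atomically. For a file-based checkpoint, one common approach is write-then-rename, sketched here (the function name is illustrative; a database checkpoint would use a transaction instead):

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write checkpoint state atomically: write to a temp file in the
    same directory, then rename over the target, so a crash mid-write
    never leaves a truncated or half-updated checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic replace on POSIX and Windows
    except Exception:
        os.remove(tmp)
        raise
```

Writing to a temp file in the same directory matters: os.replace is only atomic when source and target live on the same filesystem.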

Common mistakes and self-check

  • Mistake: Advancing watermark before writes finish. Self-check: Crash midway; can you resume without gaps?
  • Mistake: Ignoring rate-limit headers. Self-check: Simulate 429; do logs show backoff and eventual success?
  • Mistake: Using offset pagination on changing datasets. Self-check: Are rows skipped/duplicated between pages?
  • Mistake: No idempotency. Self-check: Can a retried webhook create duplicates?
  • Mistake: Hardcoding secrets in code. Self-check: Are keys loaded from environment variables or secure configuration?

Practical projects

  • Build a daily incremental sync for orders using updated_after, persisting a watermark table.
  • Implement a webhook receiver that validates a signature and enqueues events for processing.
  • Create a generic pagination module supporting offset and cursor, with automatic backoff on 429/5xx.
  • Add a validation step that rejects records missing required fields and logs counts by error type.

Learning path

  1. Master HTTP, JSON, and pagination patterns.
  2. Implement incremental sync with watermarks and upserts.
  3. Add webhooks for near real-time updates.
  4. Introduce queues and idempotency for reliability at scale.
  5. Harden with monitoring, schema evolution handling, and backpressure.

Next steps

  • Pick one API you use at work and convert a full-load job into an incremental sync with a proper watermark.
  • Add a retry/backoff utility you can reuse across all API clients.
  • Run the quick test below to check your understanding.

Mini challenge

You have a cursor-based API and observe occasional out-of-order updates. Propose a two-part solution that ensures both correctness and progress. Hint: upserts by ID with updated_at comparison, and commit the watermark only after successful writes.

Practice Exercises

1 exercise to complete

Instructions

You have an endpoint /events with cursor pagination and a limit of 1000 req/hour. Create a plan that handles pagination, checkpoints, dedupe, and 429 responses. Provide a short outline of request flow, retry/backoff, checkpoint format, and upsert keys.

Expected Output
A concise plan including pagination flow, backoff strategy for 429, a JSON example of checkpoint state, and a clear idempotency key.

API Ingestion Patterns — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

