Who this is for
You are a Data Engineer who needs to pull reliable data from third-party APIs into a warehouse, lake, or event system. You want patterns that scale, remain resilient under rate limits, and keep data correct and deduplicated.
Prerequisites
- [ ] Comfort with HTTP basics (methods, headers, status codes)
- [ ] Experience with JSON and pagination concepts
- [ ] Basic understanding of batch vs. stream data ingestion
- [ ] Ability to read logs and handle errors/retries
Why this matters
Real tasks you will face:
- Sync customer, order, and usage data from SaaS APIs daily or hourly without missing or duplicating records.
- Handle rate limits and pagination so you don’t get blocked mid-ingestion.
- Process real-time events (webhooks) safely with at-least-once delivery, while avoiding downstream duplicates.
- Recover from failures by resuming from a checkpoint rather than starting over.
Concept explained simply
An API ingestion pattern is a repeatable way to pull or receive data from an API reliably. Think of it as a recipe that covers: how to fetch, how to pace requests, how to continue from where you left off, and how to avoid duplicates.
Mental model
Imagine a conveyor belt of records moving past a scanner:
- The belt speed is the API rate limit.
- The scanner’s last inspected position is your checkpoint (watermark).
- Packages occasionally repeat on the belt (duplicate events) or arrive out of order; your scanner needs ID checks to skip duplicates and store only the newest version.
Common building blocks
- Pagination: offset, page, or cursor (next token).
- Incremental filters: updated_after, since, since_id.
- Idempotency: dedupe by stable IDs or content hashes so upserts are safe (see the hashing sketch after this list).
- Rate limiting: respect 429s, Retry-After, and exponential backoff.
- Checkpointing: persist the last processed cursor/timestamp/ID.
- Schema evolution: handle new fields without breaking pipelines.
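Where no stable ID exists, a content hash can stand in as the dedupe key. A minimal sketch in Python (canonical JSON is one convention among several; the function name is illustrative):

import hashlib
import json

def record_hash(record: dict) -> str:
    # Stable content hash for dedupe when the API exposes no reliable ID.
    # Sorting keys and fixing separators makes the serialization canonical.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()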
Core patterns
1) Scheduled pull with pagination
Use for resources that change slowly or when the API has no push mechanism.
- Run on a schedule (e.g., hourly).
- Page through results until exhaustion.
- Respect rate limits; store a checkpoint.
2) Incremental sync with watermarks
Use when the API supports updated_since, since_id, or similar filters.
- Query only new/changed records since the last watermark.
- Upsert into a destination to avoid duplicates (a sketch follows this list).
- Move the watermark forward only after successful write.
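As a sketch of those upsert semantics, here is an in-memory stand-in for a warehouse MERGE; the row shape and the dict-based store are assumptions for illustration:

def upsert(store, rows, key='id', updated_at='updated_at'):
    # Keep only the newest version of each record; reprocessing the same
    # rows, or receiving them out of order, leaves the store correct.
    for row in rows:
        current = store.get(row[key])
        if current is None or row[updated_at] >= current[updated_at]:
            store[row[key]] = row

store = {}
upsert(store, [{'id': 1, 'updated_at': '2025-01-02', 'status': 'paid'}])
upsert(store, [{'id': 1, 'updated_at': '2025-01-01', 'status': 'open'}])  # stale; ignored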
3) Webhook + queue for near real-time
Use when the provider can push events to your endpoint.
- Receive the event, verify signature, enqueue.
- Workers fetch full details (if needed) and upsert.
- Use event IDs for idempotency; retry safely.
4) Change capture replay
If the API exposes event streams or historical cursors, consume in order, keeping a replay cursor. Ideal for event-driven platforms.
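A minimal consumption loop for this pattern might look as follows, assuming a fetch_page(cursor) call that returns (events, next_cursor); the real interface varies by provider:

def consume_stream(fetch_page, load_cursor, save_cursor, handle):
    cursor = load_cursor()
    while True:
        events, next_cursor = fetch_page(cursor)
        if not events:
            break
        for event in events:      # process strictly in order
            handle(event)
        save_cursor(next_cursor)  # commit the replay cursor only after the page is handled
        cursor = next_cursor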
Worked examples
Example 1: Poll a REST API with cursor-based pagination and rate limits
- Start with last_cursor (or empty for full load).
- GET /items?limit=500&cursor=last_cursor
- If 429, sleep using Retry-After header or exponential backoff and retry.
- Write page to staging, upsert to final table.
- Update last_cursor only after successful upsert.
- Repeat until no next_cursor.
{"state": {"cursor": null}}
loop:
resp = GET /items?limit=500&cursor=state.cursor
if resp.status == 429:
wait(resp.headers.get('Retry-After', backoff))
continue
write_to_staging(resp.data)
upsert(resp.data, key='id', updated_at='updated_at')
state.cursor = resp.next_cursor
if not state.cursor: breakWhy it works
Cursor pagination is resilient to inserts/deletes between pages. Checkpoints protect from rework after failures.
Example 2: Webhook ingestion with idempotency
- Expose /webhook endpoint that validates a shared secret or signature header.
- Push the raw payload to a durable queue.
- Worker pulls messages, dedupes by event_id, fetches full record if necessary, then upserts.
- On failure, retry with exponential backoff; dedupe prevents double-writes.
def on_webhook(event):
    if not verify_signature(event):
        return 401
    enqueue(event)

def worker(msg):
    if seen(msg.event_id):  # idempotency: already processed
        ack(msg)
        return
    record = maybe_fetch_details(msg)
    upsert(record, key='id')
    mark_seen(msg.event_id)
    ack(msg)  # acknowledge only after the upsert succeeds
Why it works
Queues smooth spikes, retries handle transient issues, and idempotency ensures safe reprocessing.
Example 3: Incremental sync using updated_at watermark
- Keep watermark T0 (ISO timestamp).
- GET /orders?updated_after=T0&limit=1000
- Process pages; track max(updated_at) seen, but do not commit it yet.
- Upsert all rows; commit watermark to T1 (the max) only after success.
- Next run uses T1.
watermark = load_watermark()
max_seen = watermark
for page in paginate('/orders', updated_after=watermark):
    upsert(page.rows, key='order_id', updated_at='updated_at')
    max_seen = max(max_seen, max(row['updated_at'] for row in page.rows))
commit_watermark(max_seen)  # advance only after every page is upserted
Why it works
Moving the watermark only after success prevents gaps. Upserts handle late-arriving updates.
Design checklists
Before you start
- [ ] Identify pagination type: offset, page, or cursor
- [ ] Identify rate limit rules and headers (429, Retry-After)
- [ ] Find stable IDs for dedupe
- [ ] Find incremental parameters (updated_after, since_id)
- [ ] Decide on checkpoint storage (file, table, key-value store)
- [ ] Define retries, backoff, and max attempts
- [ ] Plan for schema changes (nullable new fields, versioning)
During implementation
- [ ] Log request ID, page/cursor, and response status
- [ ] Validate and sanitize payloads
- [ ] Use upsert semantics in the destination
- [ ] Treat 5xx as retryable; 4xx (except 429) as bugs/config issues (see the classifier sketch below)
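One way to encode that last rule as a reusable check (a sketch; some providers also use 408 or custom codes, so adjust the set to taste):

def is_retryable(status: int) -> bool:
    # 429 (rate limit) and 5xx (server trouble) deserve a retry;
    # any other 4xx points at a bug or misconfiguration on our side.
    return status == 429 or 500 <= status <= 599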
Step-by-step: Build a robust API ingestor
- Choose pattern: pull with pagination, incremental watermark, webhook, or a hybrid.
- Define schema and upsert keys (natural or API-provided IDs).
- Implement fetch with retry/backoff and rate-limit handling.
- Write to staging first; validate; then upsert to final tables.
- Checkpoint after successful batch.
- Monitor with metrics: rows_ingested, duration, error_rate. (A skeleton tying these steps together is sketched below.)
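A skeleton for the loop above might look like this; every helper name here is an assumption to fill in for your stack (error_rate tracking is omitted for brevity):

import time

def run_ingestion(client, checkpoint_store, metrics):
    state = checkpoint_store.load()
    started = time.monotonic()
    rows_ingested = 0
    for batch in client.fetch_incremental(state):  # retry/backoff lives in the client
        write_to_staging(batch)                    # land raw data first
        validate(batch)                            # reject bad payloads early
        upsert_final(batch)                        # idempotent write to final tables
        state = advance_checkpoint(state, batch)
        checkpoint_store.save(state)               # only after a successful write
        rows_ingested += len(batch)
    metrics.emit(rows_ingested=rows_ingested,
                 duration=time.monotonic() - started)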
Mini task: simulate a 429
Pretend the API returned 429 and decide how long to wait: if a Retry-After header of 5 is present, sleep 5 seconds; otherwise back off exponentially (1s, 2s, 4s, ...).
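One possible answer, as a small helper (the names are illustrative):

def retry_delay(headers, attempt):
    # Retry-After wins when the server provides it; otherwise double
    # the delay on each attempt: 1s, 2s, 4s, ...
    if 'Retry-After' in headers:
        return float(headers['Retry-After'])
    return float(2 ** attempt)

assert retry_delay({'Retry-After': '5'}, 0) == 5.0
assert [retry_delay({}, a) for a in range(3)] == [1.0, 2.0, 4.0]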
Exercises
Exercise: Design a polling plan for a rate-limited API
You have an endpoint /events with cursor pagination and a limit of 1000 req/hour. Create a plan that handles pagination, checkpoints, dedupe, and 429 responses.
- Deliverables: request flow, retry/backoff strategy, checkpoint format, upsert key(s).
Solution
Flow:
- Load state: last_cursor (nullable) and last_success_ts.
- GET /events?limit=500&cursor=last_cursor with headers for auth.
- If 429: sleep for Retry-After seconds when present; otherwise back off exponentially from 1s up to 60s. Retry with jitter.
- Validate JSON, write to staging, upsert to final using event_id as key and updated_at for conflict resolution.
- Set last_cursor = response.next_cursor after successful upsert.
- Stop when next_cursor is null; save state atomically (an atomic-save sketch follows the solution).
Checkpoint format:
{"last_cursor": "abc123", "last_success_ts": "2025-04-12T09:33:00Z"}Dedupe:
- Use event_id as idempotency key; ignore duplicates.
- For late updates, compare updated_at to keep the newest.
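The "save state atomically" step can be as simple as writing to a temp file and renaming it into place. A sketch, assuming the checkpoint lives in a local JSON file (a table or key-value store works the same way):

import json
import os
import tempfile

def save_state(path, state):
    # os.replace is atomic on POSIX, so a crash leaves either the old
    # checkpoint or the new one, never a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, 'w') as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

save_state('state.json', {'last_cursor': 'abc123',
                          'last_success_ts': '2025-04-12T09:33:00Z'})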
Common mistakes and self-check
- Mistake: Advancing watermark before writes finish. Self-check: Crash midway; can you resume without gaps?
- Mistake: Ignoring rate-limit headers. Self-check: Simulate 429; do logs show backoff and eventual success?
- Mistake: Using offset pagination on changing datasets. Self-check: Are rows skipped/duplicated between pages?
- Mistake: No idempotency. Self-check: Can a retried webhook create duplicates?
- Mistake: Hardcoding secrets in code. Self-check: Are keys loaded from environment variables or secure configuration?
Practical projects
- Build a daily incremental sync for orders using updated_after, persisting a watermark table.
- Implement a webhook receiver that validates a signature and enqueues events for processing.
- Create a generic pagination module supporting offset and cursor, with automatic backoff on 429/5xx.
- Add a validation step that rejects records missing required fields and logs counts by error type.
Learning path
- Master HTTP, JSON, and pagination patterns.
- Implement incremental sync with watermarks and upserts.
- Add webhooks for near real-time updates.
- Introduce queues and idempotency for reliability at scale.
- Harden with monitoring, schema evolution handling, and backpressure.
Next steps
- Pick one API you use at work and convert a full-load job into an incremental sync with a proper watermark.
- Add a retry/backoff utility you can reuse across all API clients.
- Check your understanding with the mini challenge below.
Mini challenge
You have a cursor-based API and observe occasional out-of-order updates. Propose a two-part solution that ensures both correctness and progress. Hint: upserts by ID with updated_at comparison, and commit the watermark only after successful writes.