Why this matters
As a Data Architect, you turn business and data needs into clear designs others can build safely and efficiently. Two core tools help you communicate and drive decisions: architecture diagrams and RFCs (Requests for Comments). Diagrams give a shared visual model of systems and data flows. RFCs document context, options, trade-offs, and decisions. Together they reduce misunderstandings, speed reviews, and create a durable record.
- Real tasks you will do: map ingestion pipelines, document trust boundaries for PII, present options for warehouse vs. lakehouse, align teams on migration plans, and justify tool choices with trade-offs.
- Outcome: faster approvals, less rework, and predictable delivery.
Concept explained simply
Diagrams show what exists and how it connects. RFCs explain why choices were made.
- Diagrams: pictures of components, data flows, and boundaries to align engineers, analysts, and stakeholders.
- RFCs: short, structured documents proposing a change, comparing options, and recording a decision and its consequences.
Mental model
Think of your work as a map and a travel note:
- The map (diagram) helps everyone navigate.
- The travel note (RFC) explains why you chose a route, what you traded off, and what to watch out for.
Core artifacts you will create
Diagrams
- C4 model levels (use what helps): System Context, Container, Component.
- Data flow diagrams: sources, transformations, sinks; directions and protocols.
- Lineage views: how a dataset is derived; include trust zones and PII tags.
- Supporting: sequence diagrams for interactions, ERDs for data structures.
RFCs
- Purpose: propose or change an architecture with clear trade-offs.
- Sections: Context, Problem, Goals/Non-goals, Constraints, Options, Decision, Consequences, Risks, Rollout, Open questions.
- Lifecycle: Draft → Review → Decision → Implement → Record learnings.
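The RFC section list and lifecycle above can be captured in a small template helper so every RFC starts from the same skeleton. This is a minimal sketch, not a standard tool: `RFC_SECTIONS`, `LIFECYCLE`, `render_skeleton`, `advance`, and `missing_sections` are hypothetical names, and the plain-text heading layout is an assumption.

```python
# Section names and lifecycle stages are taken from the list above.
RFC_SECTIONS = [
    "Context", "Problem", "Goals/Non-goals", "Constraints", "Options",
    "Decision", "Consequences", "Risks", "Rollout", "Open questions",
]
LIFECYCLE = ["Draft", "Review", "Decision", "Implement", "Record learnings"]


def render_skeleton(title: str, version: str, date: str) -> str:
    """Return an empty RFC with a header and every section heading in order."""
    lines = [
        f"RFC: {title}",
        f"Version: {version}  Date: {date}",
        f"Status: {LIFECYCLE[0]}",
        "",
    ]
    for section in RFC_SECTIONS:
        lines += [section, "TODO", ""]
    return "\n".join(lines)


def advance(status: str) -> str:
    """Move an RFC to the next lifecycle stage; stages cannot be skipped."""
    index = LIFECYCLE.index(status)  # raises ValueError for an unknown status
    if index == len(LIFECYCLE) - 1:
        raise ValueError("RFC is already in its final stage")
    return LIFECYCLE[index + 1]


def missing_sections(text: str) -> list[str]:
    """List the sections whose heading line does not appear in the RFC text."""
    present = {line.strip() for line in text.splitlines()}
    return [s for s in RFC_SECTIONS if s not in present]
```

A reviewer could run `missing_sections` on a draft before the Review stage to catch an RFC that skips, for example, Goals/Non-goals.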
Standards and conventions
- Naming: consistent product and dataset names; avoid ambiguous acronyms.
- Legend: shapes, colors, and line styles explained in a small legend box.
- Boundaries: draw trust zones (Public, Corp, Restricted) and note PII/sensitive tags.
- Arrows: show direction and protocol (e.g., JDBC, CDC, HTTPS). Label frequency and latency (e.g., hourly, sub-second).
- Versioning: store source diagrams and RFCs in version control; put a version and date on the document.
- Single source of truth: keep one canonical diagram per scope and link any scope-specific views from it.
- Accessibility: simple shapes, limited color palette, descriptive text; avoid tiny fonts.
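If a diagram is stored as data next to its rendered image, several of these conventions can be checked automatically. The sketch below is one possible approach under stated assumptions: the `Node` and `Arrow` classes and the `lint_diagram` function are hypothetical, and only the zone names come from the conventions above.

```python
from dataclasses import dataclass

TRUST_ZONES = {"Public", "Corp", "Restricted"}  # zone names from the conventions above


@dataclass
class Node:
    name: str
    zone: str


@dataclass
class Arrow:
    source: str
    target: str
    protocol: str = ""
    frequency: str = ""  # frequency or latency label, e.g. "hourly" or "sub-second"


def lint_diagram(nodes: list[Node], arrows: list[Arrow]) -> list[str]:
    """Return convention violations; an empty list means the diagram passes."""
    problems = []
    known = {n.name for n in nodes}
    for n in nodes:
        if n.zone not in TRUST_ZONES:
            problems.append(f"{n.name}: unknown trust zone '{n.zone}'")
    for a in arrows:
        label = f"{a.source} -> {a.target}"
        for end in (a.source, a.target):
            if end not in known:
                problems.append(f"{label}: '{end}' is not a declared node")
        if not a.protocol:
            problems.append(f"{label}: missing protocol")
        if not a.frequency:
            problems.append(f"{label}: missing frequency or latency")
    return problems
```

A check like this fits naturally into the same version-control review that covers the diagram source, so an unlabeled arrow is caught before the diagram is shared.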
Worked examples
Example 1: Batch warehouse ingestion
Goal: Nightly ingest from App DB to Warehouse.
- Containers: App DB (Postgres), Orchestrator (Airflow), Object Storage (S3), Compute (Spark), Warehouse (BigQuery/Redshift), BI Tool.
- Flow: App DB → Extract to S3 (CSV/Parquet) → Spark transform → Load to Warehouse → BI dashboards refresh.
- Labels: Frequency: nightly; SLT (service-level target): data ready by 6 AM; Protocols: JDBC, S3, Warehouse API.
- Trust zones: App DB and S3 in Corp zone, Warehouse in Restricted zone; PII masked in transform step.
Key trade-offs: Simple and cost-effective vs. not real-time.
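One way to keep this diagram in version control is to generate it from text. The sketch below emits Graphviz DOT for the flow in Example 1, with trust zones as dashed clusters. Treat it as an illustration under assumptions: the function name `to_dot` is hypothetical, the assignment of each protocol to a specific hop is a guess, the orchestrator is left out because the example does not state its control edges, and components whose zone is not stated are drawn outside any cluster.

```python
# Zone membership as stated in the example; other components stay unclustered.
ZONES = {
    "Corp": ["App DB", "S3"],
    "Restricted": ["Warehouse"],
}
UNZONED = ["Spark", "BI Tool"]

# (source, target, label). Which protocol sits on which hop is an assumption.
EDGES = [
    ("App DB", "S3", "JDBC extract, nightly"),
    ("S3", "Spark", "S3 read"),
    ("Spark", "Warehouse", "Warehouse API load, PII masked"),
    ("Warehouse", "BI Tool", "dashboard refresh"),
]


def to_dot(zones, unzoned, edges) -> str:
    """Render nodes, trust-zone clusters, and labeled arrows as Graphviz DOT text."""
    out = ["digraph batch_ingestion {", "  rankdir=LR;"]
    for i, (zone, members) in enumerate(zones.items()):
        # Graphviz draws a box around subgraphs whose name starts with "cluster".
        out.append(f"  subgraph cluster_{i} {{")
        out.append(f'    label="{zone} zone";')
        out.append("    style=dashed;")
        out += [f'    "{m}";' for m in members]
        out.append("  }")
    out += [f'  "{n}";' for n in unzoned]
    out += [f'  "{s}" -> "{t}" [label="{label}"];' for s, t, label in edges]
    out.append("}")
    return "\n".join(out)
```

Writing the result of `to_dot(ZONES, UNZONED, EDGES)` to a file such as `pipeline.dot` and running `dot -Tsvg pipeline.dot -o pipeline.svg` renders it. Because the source is plain text, a change to a trust zone or protocol shows up as a reviewable diff.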
Example 2: Real-time CDC with schema governance
Goal: Near-real-time ingestion for live metrics (end-to-end latency under 2 seconds).
- Containers: Source DB (MySQL), CDC (Debezium), Kafka, Schema Registry, Stream Processor (Flink), Lakehouse (Delta/Iceberg), Serving Store (Elasticsearch), Metrics Service.
- Flow: MySQL binlog → Debezium → Kafka (with Schema Registry) → Flink transforms → Lakehouse and Serving Store.
- Labels: Latency: < 2s; Data contracts enforced via schemas; PII hashing in Flink.
- Boundaries: Internet ingress isolated; Restricted zone for PII.
Trade-offs: Low latency and fresh data vs. higher ops complexity.
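To make "data contracts enforced via schemas" concrete, the sketch below compares two versions of an event schema and reports changes that would break consumers. The rule it applies (no removed fields, no type changes, new fields must be optional) is deliberately simplified for illustration and is not the compatibility algorithm of any particular schema registry. `Field` and `contract_violations` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    dtype: str
    required: bool = True


def contract_violations(old: dict[str, Field], new: dict[str, Field]) -> list[str]:
    """Compare two schema versions under a simplified, illustrative contract rule."""
    problems = []
    # Existing fields must survive with the same type.
    for name, field in old.items():
        if name not in new:
            problems.append(f"{name}: field removed")
        elif new[name].dtype != field.dtype:
            problems.append(f"{name}: type changed {field.dtype} -> {new[name].dtype}")
    # New fields must be optional so that older producers remain valid.
    for name, field in new.items():
        if name not in old and field.required:
            problems.append(f"{name}: new field must be optional")
    return problems
```

In an RFC, a rule like this belongs in the Constraints section, with the real registry's compatibility mode named explicitly.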
Example 3: Governance and PII controls
Goal: Ensure access control and masking for sensitive data.
- Add a Data Catalog + Policy Engine (e.g., tags, row/column-level policies).
- Diagram layers: ingest, storage, processing, access; overlay tags: PII, PCI.
- Data lineage: Raw → Clean → Curated; consumer access policies are enforced on Curated, so state on the diagram who can reach Raw and Clean.
Trade-offs: Strong compliance vs. added complexity in data access paths.
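The tag-driven policy idea can be sketched in a few lines: a catalog maps columns to tags, a policy maps roles to the tags they may read, and anything else is masked. Everything here is hypothetical (the column names, the roles, and the `apply_policy` function); only the PII and PCI tag names come from the example. Real policy engines enforce this inside the query layer rather than in application code.

```python
import hashlib

# Hypothetical catalog: column name -> sensitivity tags.
COLUMN_TAGS = {
    "order_id": set(),
    "email": {"PII"},
    "card_number": {"PCI"},
}

# Hypothetical policy: tags each role may read in clear text.
ROLE_CLEARANCE = {
    "analyst": set(),
    "fraud_ops": {"PII", "PCI"},
}


def mask(value: str) -> str:
    """Replace a sensitive value with a short, stable SHA-256 digest prefix."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_policy(row: dict[str, str], role: str) -> dict[str, str]:
    """Return a copy of the row, masking columns the role is not cleared for."""
    cleared = ROLE_CLEARANCE.get(role, set())  # unknown roles get no clearance
    return {
        # Columns missing from the catalog are treated as sensitive (deny by default).
        col: value if COLUMN_TAGS.get(col, {"UNTAGGED"}) <= cleared else mask(value)
        for col, value in row.items()
    }
```

The plain hash is only a stand-in for masking. Unsalted hashes of low-entropy values such as emails can be reversed by guessing, so a production design would use keyed hashing or tokenization.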
Example 4: RFC excerpt—Warehouse vs. Lakehouse
Context: Analytics team needs scalable storage and open file formats. Current warehouse has high cost at peak.
- Goals: lower cost, maintain SQL usability, support ML-ready files.
- Options: A) Stay on warehouse; B) Lakehouse with open table format; C) Hybrid (warehouse for BI, lakehouse for ML).
- Evaluation: Cost, performance, governance, complexity, migration risk.
- Decision: C) Hybrid for 12 months, then reevaluate against adoption metrics.
- Consequences: Two systems to operate; lower storage cost for ML; keep BI speed.
- Risks: Skill gaps; Mitigation: training and phased rollout.
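Comparing options against the same criteria is easier to audit when the evaluation is written down as a weighted matrix. The sketch below uses the five criteria from the excerpt. The weights and scores are placeholders chosen only so the code runs; they are not the evaluation behind the decision above, and the ranking depends entirely on them. `weighted_total` and `rank` are hypothetical names.

```python
# Placeholder weights; they should sum to 1.0.
CRITERIA_WEIGHTS = {
    "cost": 0.30,
    "performance": 0.25,
    "governance": 0.20,
    "complexity": 0.15,
    "migration_risk": 0.10,
}

# Placeholder scores from 1 (worst) to 5 (best). For complexity and
# migration risk, a higher score means less complexity and less risk.
SCORES = {
    "A: stay on warehouse": {"cost": 2, "performance": 4, "governance": 4, "complexity": 5, "migration_risk": 5},
    "B: lakehouse": {"cost": 5, "performance": 3, "governance": 3, "complexity": 2, "migration_risk": 2},
    "C: hybrid": {"cost": 4, "performance": 4, "governance": 4, "complexity": 3, "migration_risk": 4},
}


def weighted_total(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Sum score * weight over all criteria, rounded to two decimals."""
    if set(scores) != set(weights):
        raise ValueError("every option must be scored on exactly the same criteria")
    return round(sum(scores[c] * w for c, w in weights.items()), 2)


def rank(options: dict[str, dict[str, int]], weights: dict[str, float]) -> list[tuple[str, float]]:
    """Return (option, total) pairs sorted from highest to lowest total."""
    totals = {name: weighted_total(s, weights) for name, s in options.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

The `ValueError` check enforces the point made under common mistakes: an option scored on different criteria cannot be compared fairly. In a real RFC, each score should link to the evidence behind it.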
Exercises
Do these in order. They mirror the graded exercises below. Use the checklist to self-review.
Exercise 1: Draw a batch pipeline container and data flow diagram
Scenario: Move data nightly from an app database to a warehouse with transforms.
- Include: source, extract, staging, transform, load, access.
- Add labels: frequency, SLT, protocols, trust boundaries, PII handling.
- Deliverables: A one-page diagram and a 5-bullet rationale.
Exercise 2: Write a short RFC comparing two ingestion options
Scenario: Choose between self-managed Kafka and a managed streaming service for CDC.
- Fill sections: Context, Goals/Non-goals, Constraints, Options (A/B), Decision, Consequences, Risks, Rollout, Open questions.
- Keep it under 2 pages; use objective trade-offs (cost, latency, ops, vendor lock-in).
Exercise checklist
- [ ] Diagram has a legend and labeled arrows.
- [ ] Trust boundaries and PII are clearly marked.
- [ ] Frequency/latency labels match requirements.
- [ ] RFC states goals and non-goals explicitly.
- [ ] At least two options with trade-offs are compared with evidence.
- [ ] Decision and consequences are clear and testable.
- [ ] Every risk has a mitigation, and the rollout is phased.
Common mistakes and self-check
- Too much detail: Diagram shows every table/topic. Self-check: Can a new engineer grasp it in 60 seconds? If not, simplify or layer.
- No legend: Readers guess meanings. Add a small legend for shapes/colors/arrows.
- Missing boundaries: Security reviewers need trust zones. Draw them and tag sensitive data.
- Unstated non-goals: RFCs drift in scope. Add Non-goals to keep focus.
- Option bias: Only one option presented. Always include at least two and compare against the same criteria.
- No consequences: Decisions seem costless. List trade-offs so stakeholders accept impact.
- Stale artifacts: Diagrams diverge from reality. Put version/date and update when key changes happen.
Practical projects
- Project 1: Batch analytics stack. Create diagrams and an RFC to move nightly data from OLTP to a cloud warehouse with masking. Include a cost vs. latency trade-off.
- Project 2: Real-time metrics. Propose a CDC pipeline using streaming; include schema governance and a phased rollout.
- Project 3: Data governance overlay. Add access policies and lineage to an existing diagram; write an RFC for policy rollout with test plan.
Learning path
- Step 1: Learn diagram layers (context, container, data flow). Practice on a small system.
- Step 2: Add security and governance overlays (trust zones, PII tags).
- Step 3: Write short RFCs (1–2 pages). Start with a real change request.
- Step 4: Facilitate reviews. Invite feedback, capture decisions and open questions.
- Step 5: Version and maintain. Update artifacts as systems evolve.
Who this is for
- Data Architects and Senior Data Engineers who need to communicate designs.
- Analytics Engineers shaping data models and pipelines.
- Tech Leads coordinating across platform, data, and product teams.
Prerequisites
- Basic understanding of data platforms (OLTP vs. OLAP, batch vs. streaming).
- Familiarity with one cloud or data stack.
- Comfort writing concise technical documents.
Next steps
- Complete the exercises and then take the quick test below.
- After passing, pick a Practical project and get a peer review on your RFC.
Mini challenge
In one page, redesign a current pipeline diagram to show trust boundaries and PII flows. Add a 5-bullet RFC addendum listing risks, mitigations, and a 2-week rollout plan. Keep it crisp and decision-ready.