Why this matters
On a shared streaming platform, many teams publish and consume events at once. Without governance, a single noisy tenant can cause outages, sensitive data can leak, and costs can spiral. Data Platform Engineers use governance to keep the platform reliable, secure, and predictable while letting teams move fast.
- Real tasks you will handle: approving topic creation requests, setting quotas and ACLs, defining schema compatibility rules, auditing access, and guiding safe schema evolution.
- Outcomes: fewer incidents, easier onboarding, lower costs, and clear accountability.
Concept explained simply
Multi-tenant streaming governance is a set of guardrails that lets multiple teams share one platform safely. It defines who can create and use topics, how messages are structured, how long data is retained, how much traffic each team can produce/consume, and how changes are reviewed.
Mental model
Think of your streaming platform like an apartment building:
- Tenants (teams) rent units (namespaces/topics).
- House rules (policies) keep everyone safe: noise limits (quotas), keys (ACLs), trash schedule (retention), renovations (change process), and security cameras (audit logs).
- The building manager (you) ensures rules are clear, enforced, and continuously improved.
Core governance building blocks
Tenancy model
- Namespaces/projects per team or domain.
- Topic ownership recorded (team name, on-call, business purpose).
- Environment isolation: dev, stage, prod with stricter policies in prod.
Naming conventions
- Pattern: domain.team.event.vN (e.g., payments.core.transactions.v1)
- Convey environment via separate clusters/namespaces rather than in the topic name, where possible.
- Document owner, data classification, and SLA in metadata.
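A naming convention is easiest to keep honest when it is enforced as code. A minimal validation sketch in Python, assuming the domain.team.event.vN convention above (the exact regex is an illustrative choice, not a standard):

```python
import re

# Illustrative policy: lowercase, dot-separated domain.team.event segments
# followed by a vN version suffix.
TOPIC_NAME_PATTERN = re.compile(
    r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.v[0-9]+$"
)

def validate_topic_name(name: str) -> bool:
    """Return True if the topic name follows the domain.team.event.vN convention."""
    return bool(TOPIC_NAME_PATTERN.match(name))

assert validate_topic_name("payments.core.transactions.v1")
assert not validate_topic_name("Payments.Transactions")  # wrong case, no version
```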
Access control
- Identity: service accounts for apps, user groups for humans.
- Authorization: least privilege. Separate produce and consume roles.
- Audit: log who created topics, changed ACLs, and accessed data.
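As a concrete sketch of separate produce and consume roles, here is how the ACLs might be created with the confluent-kafka Python AdminClient; the broker address, principals, and topic name are illustrative assumptions:

```python
from confluent_kafka.admin import (
    AclBinding, AclOperation, AclPermissionType, AdminClient,
    ResourcePatternType, ResourceType,
)

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # illustrative address

# Grant the writer service account produce rights, and nothing more.
write_acl = AclBinding(
    ResourceType.TOPIC, "payments.core.transactions.v1",
    ResourcePatternType.LITERAL,
    "User:payments-writer", "*",
    AclOperation.WRITE, AclPermissionType.ALLOW,
)
# Grant the reader service account consume rights only.
read_acl = AclBinding(
    ResourceType.TOPIC, "payments.core.transactions.v1",
    ResourcePatternType.LITERAL,
    "User:payments-reader", "*",
    AclOperation.READ, AclPermissionType.ALLOW,
)

for future in admin.create_acls([write_acl, read_acl]).values():
    future.result()  # raises if the broker rejected the ACL
```

In practice the consumer also needs READ on its consumer group resource (ResourceType.GROUP), and possibly DESCRIBE on the topic, depending on your broker's authorizer.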
Quotas and limits
- Per-tenant produce/consume MB/s limits and connection caps.
- Per-topic partition count limits and retention caps.
- Burst policy and throttling behavior documented.
Retention and cleanup
- Time-based (e.g., 7 days) and/or size-based limits.
- Compaction for key-based latest state topics.
- Dead-letter topics with separate retention for debugging.
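Retention settings are ordinary topic configs, so they can be applied and reviewed programmatically. A sketch using confluent-kafka with illustrative values:

```python
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # illustrative address

# 7-day time-based retention plus a size cap, delete cleanup.
resource = ConfigResource(ConfigResource.Type.TOPIC, "payments.core.transactions.v1")
resource.set_config("retention.ms", str(7 * 24 * 60 * 60 * 1000))  # 7 days in ms
resource.set_config("retention.bytes", str(500 * 1024**3))         # 500 GiB cap
resource.set_config("cleanup.policy", "delete")

# Note: non-incremental alter_configs replaces the full config set; prefer
# incremental alteration where your client and brokers support it.
for future in admin.alter_configs([resource]).values():
    future.result()  # raises on failure
```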
Schema governance
- Registry required. Subject naming strategy documented.
- Compatibility mode: typically backward for values, full for critical contracts.
- Schema review for PII, required fields, and proper data types.
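Compatibility checks can run before any producer deploys. A sketch against a Confluent-style Schema Registry REST API, with an illustrative registry URL and subject; the candidate schema adds one optional field with a default, which backward compatibility permits:

```python
import json
import requests

REGISTRY = "http://localhost:8081"  # illustrative registry URL
SUBJECT = "payments.core.transactions.v1-value"

# Candidate Avro schema: adds an optional field with a default.
candidate = {
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "promo_code", "type": ["null", "string"], "default": None},
    ],
}

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(candidate)}),
)
resp.raise_for_status()
print(resp.json())  # e.g. {"is_compatible": true}
```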
Data classification & privacy
- Label topics: Public, Internal, Confidential, Restricted (PII).
- Restricted data requires extra controls: encryption at rest/in transit, limited ACLs, masked payloads in logs.
Change management
- Self-service with policy checks: naming, quotas, schema compatibility.
- Approval workflow for high-risk changes (PII, retention increases, cross-region links).
- Rollback procedures documented.
Observability & audit
- Per-tenant dashboards: throughput, lag, error rates, quota usage.
- Audit trails: who changed what and when.
- Alerting on SLA breach (e.g., lag, dropped messages).
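Per-partition consumer lag can be computed from committed offsets and the log-end watermark. A minimal sketch with the confluent-kafka Consumer; the group, topic, and broker address are illustrative:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # illustrative address
    "group.id": "payments-reader",          # the group being monitored
    "enable.auto.commit": False,
})

tp = TopicPartition("payments.core.transactions.v1", 0)
committed = consumer.committed([tp], timeout=10)[0]
_low, high = consumer.get_watermark_offsets(tp, timeout=10)

# If the group has never committed, treat the whole log as lag.
lag = high - committed.offset if committed.offset >= 0 else high
print(f"partition 0 lag: {lag}")
consumer.close()
```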
Reliability & DR
- Replication and cluster linking policies.
- Idempotent producers and exactly-once semantics where feasible.
- Disaster recovery runbooks tested regularly.
Cost management
- Tag resources by tenant and environment.
- Chargeback/showback based on partitions, storage, egress.
- Review retention and partition counts quarterly.
Worked examples
Example 1: Onboard a new tenant safely
Team: Payments. Event: transactions. Non-PII, high throughput, 7-day retention.
- Naming: payments.core.transactions.v1
- Partitions: target ~10 MB/s per partition; expected 80 MB/s produce -> 8–10 partitions (choose 10 for headroom).
- Retention: 7d, cleanup=delete.
- Quotas: 100 MB/s produce, 120 MB/s consume, 500 connections.
- ACLs: payments-writer (produce), payments-reader (consume).
- Schema: Avro/JSON with backward compatibility; include an event_id for idempotent processing.
- Monitoring: produce rate, consumer lag per group, DLQ rate.
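Putting the example's settings together, topic creation might look like this with the confluent-kafka AdminClient; the broker address and replication factor are assumptions:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # illustrative address

topic = NewTopic(
    "payments.core.transactions.v1",
    num_partitions=10,     # ~80 MB/s at ~10 MB/s per partition, plus headroom
    replication_factor=3,  # assumed prod default for durability
    config={
        "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # 7 days
        "cleanup.policy": "delete",
    },
)

for future in admin.create_topics([topic]).values():
    future.result()  # raises if creation failed
```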
Why these choices?
- 10 partitions balances throughput and consumer parallelism without excessive overhead.
- Backward compatibility enables safe consumer rollouts.
- Non-PII classification reduces access friction but still enforces least privilege.
Example 2: Safe schema evolution
Need to add optional field promo_code and keep consumers working.
- Compatibility: backward.
- Plan: add optional field with default. Do not rename existing fields in-place.
- Rollout: canary producers -> 10% -> 100%. Monitor consumer error rate and schema registry compatibility.
What if we must rename a field?
Use an additive approach: add the new field amount, keep total for one deprecation cycle. Consumers migrate, then a later version removes total within a planned window.
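A dual-write sketch for the deprecation window, assuming the payload shape from this example; the field names match the text, everything else is illustrative:

```python
import uuid
from typing import Optional

# During the deprecation window, producers populate both the old field ("total")
# and its replacement ("amount") so existing consumers keep working.
def build_event(total: float, promo_code: Optional[str] = None) -> dict:
    return {
        "event_id": str(uuid.uuid4()),   # idempotency key
        "total": total,                  # deprecated: remove in a later version
        "amount": total,                 # new canonical field, dual-written for now
        "promo_code": promo_code,        # optional; None when absent
    }
```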
Example 3: Handling PII in events
Team: Identity. Event: user_profile_updates with email.
- Classification: Restricted (PII).
- Controls: encryption in transit/at rest, restricted ACLs, masked logs, short retention (e.g., 3 days) unless justified.
- Schema: separate PII into a dedicated topic with stricter controls if possible.
- Audit: enable detailed access logging and quarterly reviews.
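Masking payload fields before they reach logs takes little code. An illustrative email-masking helper that keeps the first character and the domain for debuggability:

```python
import re

EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+)")

def mask_email(text: str) -> str:
    """Replace each email with its first character, ***, and the domain."""
    return EMAIL_RE.sub(r"\1***@\2", text)

print(mask_email("user_profile_updated email=jane.doe@example.com"))
# -> user_profile_updated email=j***@example.com
```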
Example 4: Quota breach response
Tenant exceeds produce quota.
- Platform throttles the tenant's produce requests per policy.
- Alert fires: "Tenant payments at 95% quota for 10m".
- Response: contact owner, analyze traffic, consider temporary burst allowance and long-term partition/throughput plan.
Governance policies you can copy
Topic naming policy
- Format: domain.team.event.vN (e.g., retail.catalog.product.v1)
- Must register owner, contact, classification, SLA before creation.
- New major version only when breaking changes are necessary.
Schema policy
- Default compatibility: backward for values, none or backward for keys depending on usage.
- Allowed changes: add optional fields with defaults; disallowed: remove required fields or change field types without aliases/compat plan.
- PII requires explicit approval and data minimization.
Retention policy
- Default: 7 days; max: 30 days without approval.
- Compaction: allowed for "latest-state" topics only; must include stable keys.
- DLQ retention: 14–30 days with access restricted to owning team and platform SREs.
Access & quota policy
- Service accounts per application; no shared credentials.
- Least privilege ACLs; humans read via approved tooling only.
- Default quotas per tenant; increases need usage justification and capacity review.
Step-by-step: Creating a topic safely
- Define purpose and data classification (Public/Internal/Confidential/Restricted).
- Choose name using convention domain.team.event.vN.
- Estimate throughput and set partitions (rule of thumb: target ~10 MB/s per partition; see the sketch after this list).
- Set retention and cleanup policy (delete or compact).
- Register schema and set compatibility mode.
- Create service accounts and ACLs (produce/consume).
- Configure quotas (produce/consume MB/s, connections).
- Add monitoring and alerts for throughput, lag, DLQ.
- Document owner, on-call, and SLA.
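The partition estimate from step 3 reduces to a one-line calculation. A sketch using the ~10 MB/s rule of thumb; the headroom factor is an illustrative default:

```python
import math

def estimate_partitions(expected_mb_per_s: float,
                        per_partition_mb_per_s: float = 10.0,
                        headroom: float = 1.2) -> int:
    """Partitions needed to carry expected throughput with headroom."""
    return max(1, math.ceil(expected_mb_per_s * headroom / per_partition_mb_per_s))

print(estimate_partitions(80))  # -> 10, matching the worked example
```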
Common mistakes and how to self-check
- Too few partitions leading to hot partitions. Self-check: if per-partition throughput is consistently above ~15 MB/s, add partitions.
- Breaking schema changes. Self-check: run registry compatibility checks and consumer contract tests before rollout.
- Overlong retention inflating costs. Self-check: compare retention to actual consumer lag and business need.
- Weak ACLs (broad wildcards). Self-check: review effective permissions per topic monthly.
- No DLQ or retry strategy. Self-check: error rate vs. DLQ volume; ensure replay plan exists.
Who this is for
- Data Platform Engineers building or operating shared streaming clusters.
- Data Engineers and SREs responsible for event pipelines.
- Security/Compliance partners defining data controls.
Prerequisites
- Basic understanding of event streaming concepts (topics, partitions, producers/consumers, consumer groups).
- Familiarity with schemas (Avro/JSON/Protobuf) and compatibility modes.
- Intro knowledge of RBAC/ACLs and service accounts.
Learning path
- Before this: Streaming fundamentals, topic/partition design, schema basics.
- This lesson: Multi-tenant governance guardrails and practical rollout.
- Next: Cross-cluster replication/DR, advanced cost/chargeback, policy-as-code automation.
Practical projects
- Build a tenant onboarding template and run 2 mock onboardings.
- Create a dashboard for per-tenant throughput, lag, and quota usage.
- Implement a schema evolution playbook with canary deployment steps.
Exercises
Do the exercises below.
Exercise 1: Design a tenant-safe topic lifecycle policy
Scenario: New team "FraudDetection" will publish non-PII fraud signals at up to 100 MB/s, 7-day retention. Draft a policy.
- Include: topic name, partitions, retention, cleanup, schema subject & compatibility, ACLs, service accounts, quotas, DLQ, monitoring KPIs, cost tag, and owner contact.
Exercise 2: Safe schema evolution request
Scenario: Topic orders.v1. You need to add optional field "promoCode" and effectively rename field "total" to "amount" without breaking consumers.
- Provide: compatibility mode, exact schema change, rollout steps (canary), deprecation plan, and rollback signals.
Exercise completion checklist
- Names follow convention and include version.
- Partition count justified by throughput.
- Retention and cleanup chosen with reasoning.
- Schema compatibility mode set and changes are additive.
- ACLs are least privilege and tied to service accounts.
- Quotas specified for produce/consume.
- Monitoring metrics and alerts listed.
Mini challenge
Your consumer team wants to increase retention from 7 to 21 days on a high-traffic topic. In one paragraph, decide yes/no and outline the conditions (cost impact, storage headroom, audit need, alternative of tiered storage, and sunset plan).
Next steps
- Automate checks (naming, quotas, schema compatibility) in your CI/CD.
- Schedule quarterly tenant reviews for quotas, retention, and ACLs.
- Create runbooks for quota breaches, schema rollbacks, and DR failover tests.
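Naming, retention, and ownership checks are simple to encode as a CI gate. A minimal policy-as-code sketch; the request format and thresholds mirror the policies above but are otherwise assumptions:

```python
import re
import sys

MAX_RETENTION_DAYS = 30  # policy default: above this needs approval
NAME_RE = re.compile(r"^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*\.v[0-9]+$")

def check_request(request: dict) -> list:
    """Return a list of policy violations for a topic creation request."""
    errors = []
    if not NAME_RE.match(request.get("name", "")):
        errors.append("name must follow domain.team.event.vN")
    if request.get("retention_days", 0) > MAX_RETENTION_DAYS:
        errors.append(f"retention exceeds {MAX_RETENTION_DAYS}d; needs approval")
    if not request.get("owner"):
        errors.append("owner contact is required")
    return errors

if __name__ == "__main__":
    errs = check_request({"name": "retail.catalog.product.v1",
                          "retention_days": 7, "owner": "catalog-team"})
    if errs:
        print("\n".join(errs))
        sys.exit(1)  # fail the pipeline on any violation
```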