Why this matters
Platform Service Catalogs give engineers a single place to discover services, understand ownership, find runbooks, and follow golden paths. As a Platform Engineer, you’ll use the catalog to reduce cognitive load, standardize service creation, and make production safer.
Real tasks you’ll handle:
- Define the service data model (name, owner, tier, lifecycle, SLOs, dependencies).
- Onboard teams via templates and golden paths.
- Automate checks: ownership set, on-call configured, runbooks linked, SLOs defined.
- Enable discoverability and self-service with accurate metadata.
Concept explained simply
A Service Catalog is the source of truth for everything running on your platform: services, data stores, jobs, and their owners. It’s like a company-wide directory where each service has a profile page and a health passport.
Mental model
Think of the catalog as a library:
- Book = Service (title, author/owner, edition/version, status).
- Library index = Metadata schema (consistent fields across all books).
- Librarian rules = Scorecards and checks (ensuring quality and findability).
- New book form = Templates/golden paths (onboard new services consistently).
Core components of a service catalog
- Entity types: service, website, job, library/package, database, team, system, domain.
- Identity: unique ID, name, description, tags, domain/system grouping.
- Ownership: team, on-call group, escalation policy.
- Lifecycle: draft → experimental → production → deprecated → retired.
- Tier/Criticality: P0/P1/P2 (affects SLO rigor and on-call).
- Operational metadata: runbooks, dashboards, alerts, SLOs/SLIs, error budget.
- Dependencies: upstream/downstream services and data stores.
- Compliance/security: data classification, auth model, PII handling, approval status.
- Scorecards: automated checks (e.g., owner set, SLO present, runbook link exists).
- Templates/Golden paths: pre-baked scaffolds that produce catalog-ready services.
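The components above can be sketched as a small typed data model. This is only an illustration, not a real catalog API: the class and field names (`ServiceEntity`, `Lifecycle`, `Tier`) are hypothetical, and representing dependencies as `kind:name` strings is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum


class Lifecycle(Enum):
    # Stages taken from the lifecycle list above.
    DRAFT = "draft"
    EXPERIMENTAL = "experimental"
    PRODUCTION = "production"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


class Tier(Enum):
    P0 = "P0"
    P1 = "P1"
    P2 = "P2"


@dataclass
class ServiceEntity:
    # Hypothetical model: identity, ownership, lifecycle, tier, plus optional metadata.
    name: str
    owner: str
    lifecycle: Lifecycle
    tier: Tier
    system: str = ""
    tags: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)
    dependencies: list[str] = field(default_factory=list)  # assumed "kind:name" strings

    def __post_init__(self) -> None:
        # Ownership and identity are treated as mandatory in this sketch.
        if not self.name or not self.owner:
            raise ValueError("name and owner are required")


payments = ServiceEntity(
    name="payments-api",
    owner="team-core-payments",
    lifecycle=Lifecycle("production"),  # an unknown value raises ValueError
    tier=Tier("P0"),
    system="payments-platform",
    dependencies=["service:fraud-checker", "database:payments-db"],
)
```

Using enums for lifecycle and tier means a typo such as `prod` fails when the entity is built rather than surfacing later in a report.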
Glossary
- Golden path: a guided, supported way to create/run a service with best practices baked in.
- Scorecard: a set of rules that grade a service’s metadata and operational readiness.
- Entity: any item in the catalog (service, system, team, etc.).
Worked examples
Example 1: Minimal service entity
# service.yaml (catalog manifest)
apiVersion: catalog/v1
kind: Service
metadata:
  name: payments-api
  description: Handles card authorizations and captures
  tags: [payments, api, critical]
spec:
  owner: team-core-payments
  lifecycle: production
  tier: P0
  system: payments-platform
  links:
    runbook: internal://runbooks/payments-api
    dashboard: internal://dashboards/payments-api
  dependencies:
    - service: fraud-checker
    - database: payments-db
  slos:
    availability: 99.9%
    latency_p95_ms: 200
Example 2: Scorecard checks
# scorecard.yaml
apiVersion: scorecards/v1
kind: Scorecard
metadata:
  name: production-readiness-v1
spec:
  checks:
    - id: owner-set
      rule: spec.owner is defined
    - id: lifecycle-valid
      rule: spec.lifecycle in [production, experimental, deprecated]
    - id: runbook-present
      rule: spec.links.runbook exists
    - id: slo-availability-present
      rule: spec.slos.availability exists
    - id: oncall-configured
      rule: annotations.oncall.rotation exists
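To make the idea of a rule concrete, here is a minimal sketch of how such checks could be evaluated against a parsed manifest. It assumes the manifest is already loaded into a Python dict and that the owner lives at `spec.owner`, as in Example 1. The helper names (`get_path`, `run_scorecard`) are hypothetical, and the on-call check is left out because the example manifest does not show where that annotation lives.

```python
from typing import Any, Callable


def get_path(doc: dict, dotted: str) -> Any:
    """Return the value at a dotted path such as 'spec.links.runbook', or None."""
    node: Any = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Lifecycle values mirror the lifecycle-valid rule in the example scorecard.
ALLOWED_LIFECYCLES = {"production", "experimental", "deprecated"}

# Each check maps an id to a predicate over the manifest.
CHECKS: dict[str, Callable[[dict], bool]] = {
    "owner-set": lambda m: bool(get_path(m, "spec.owner")),
    "lifecycle-valid": lambda m: get_path(m, "spec.lifecycle") in ALLOWED_LIFECYCLES,
    "runbook-present": lambda m: bool(get_path(m, "spec.links.runbook")),
    "slo-availability-present": lambda m: get_path(m, "spec.slos.availability") is not None,
}


def run_scorecard(manifest: dict) -> dict[str, bool]:
    """Evaluate every check and return a pass/fail result per check id."""
    return {check_id: check(manifest) for check_id, check in CHECKS.items()}


# A manifest that is missing its runbook link.
manifest = {
    "metadata": {"name": "payments-api"},
    "spec": {
        "owner": "team-core-payments",
        "lifecycle": "production",
        "links": {"dashboard": "internal://dashboards/payments-api"},
        "slos": {"availability": "99.9%"},
    },
}

results = run_scorecard(manifest)
failed = [check_id for check_id, ok in results.items() if not ok]
print(failed)  # ['runbook-present']
```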
Example 3: Golden path template fields
# template.yaml (inputs for a new service)
inputs:
  - name: service_name
    required: true
  - name: team_owner
    required: true
  - name: language
    enum: [go, java, node, python]
  - name: tier
    enum: [P0, P1, P2]
  - name: lifecycle
    default: experimental
outputs:
  - scaffolded repo with service + service.yaml
  - CI pipeline with basic checks
  - default SLO placeholders and runbook template
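As a rough sketch of what a template engine might do with these inputs, the code below applies defaults, enforces `required` and `enum`, and renders a starter manifest. The function names (`resolve_inputs`, `render_manifest`) are hypothetical, and the rendered fields are an assumption based on the manifest in Example 1; a real scaffolder would also generate the repo, CI pipeline, and runbook.

```python
# Input definitions copied from the template example above.
TEMPLATE_INPUTS = [
    {"name": "service_name", "required": True},
    {"name": "team_owner", "required": True},
    {"name": "language", "enum": ["go", "java", "node", "python"]},
    {"name": "tier", "enum": ["P0", "P1", "P2"]},
    {"name": "lifecycle", "default": "experimental"},
]


def resolve_inputs(answers: dict) -> dict:
    """Apply defaults and validate answers against the input definitions."""
    resolved = {}
    for spec in TEMPLATE_INPUTS:
        name = spec["name"]
        value = answers.get(name, spec.get("default"))
        if value is None:
            if spec.get("required"):
                raise ValueError(f"missing required input: {name}")
            continue  # optional input that was not provided
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name} must be one of {spec['enum']}, got {value!r}")
        resolved[name] = value
    return resolved


def render_manifest(values: dict) -> str:
    """Render a minimal service.yaml from resolved inputs."""
    lines = [
        "apiVersion: catalog/v1",
        "kind: Service",
        "metadata:",
        f"  name: {values['service_name']}",
        "spec:",
        f"  owner: {values['team_owner']}",
        f"  lifecycle: {values['lifecycle']}",
    ]
    if "tier" in values:  # tier is optional in the template
        lines.append(f"  tier: {values['tier']}")
    return "\n".join(lines) + "\n"


values = resolve_inputs(
    {"service_name": "orders-api", "team_owner": "team-orders", "tier": "P1"}
)
print(render_manifest(values))
```

Here `orders-api` and `team-orders` are made-up values; `lifecycle` is not supplied, so it falls back to the template default of `experimental`.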
Build a minimal service catalog (step-by-step)
- Define your entity schema: name, owner, lifecycle, tier, system, dependencies, links, SLOs.
- Choose a source of truth for manifests (e.g., store a service.yaml in each repo).
- Adopt clear IDs for teams and systems (e.g., team-core-payments, system-observability).
- Create a starter scorecard (5–8 checks). Keep it passable for most services.
- Ship a golden path template that generates a runnable service + filled service.yaml.
- Ingest manifests on PR merge; surface errors in CI to keep metadata fresh.
- Review 10 existing services; add missing owners, lifecycle, and runbook links.
- Announce: "New services must use the template; existing services get a grace period."
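The ingest step can be sketched as a function that validates each manifest, rejects duplicates, and collects readable errors for CI to print. This assumes manifests are already parsed into dicts; treating only `owner` and `lifecycle` as required is an assumption chosen to keep the starter bar low, and the names `validate_manifest` and `ingest` are hypothetical.

```python
# Assumed minimal set of required fields under spec.
REQUIRED_SPEC_FIELDS = ("owner", "lifecycle")


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of problems found in a single manifest."""
    errors = []
    if not manifest.get("metadata", {}).get("name"):
        errors.append("metadata.name is missing")
    spec = manifest.get("spec", {})
    for field_name in REQUIRED_SPEC_FIELDS:
        if not spec.get(field_name):
            errors.append(f"spec.{field_name} is missing")
    return errors


def ingest(manifests: list[dict]) -> tuple[dict, list[str]]:
    """Build a catalog keyed by service name; invalid manifests are skipped and reported."""
    catalog: dict[str, dict] = {}
    errors: list[str] = []
    for manifest in manifests:
        name = manifest.get("metadata", {}).get("name") or "<unnamed>"
        problems = validate_manifest(manifest)
        if name in catalog:
            problems.append("duplicate service name")
        if problems:
            errors.extend(f"{name}: {problem}" for problem in problems)
            continue
        catalog[name] = manifest
    return catalog, errors


catalog, errors = ingest([
    {"metadata": {"name": "payments-api"},
     "spec": {"owner": "team-core-payments", "lifecycle": "production"}},
    {"metadata": {"name": "orders-api"},  # hypothetical service with no owner
     "spec": {"lifecycle": "experimental"}},
])
print(sorted(catalog))  # ['payments-api']
print(errors)           # ['orders-api: spec.owner is missing']
```

In a CI job, a non-empty `errors` list would typically translate into a non-zero exit code so the pull request fails fast.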
Exercises
Do these now. You can compare with the solutions below each exercise.
Exercise 1 — Model a production service entity
Create a catalog manifest for a service named payments-api with:
- owner: team-core-payments
- lifecycle: production; tier: P0; system: payments-platform
- dependencies: fraud-checker (service), payments-db (database)
- links: runbook and dashboard placeholders
- SLOs: availability 99.9%, p95 latency 200ms
Solution
apiVersion: catalog/v1
kind: Service
metadata:
  name: payments-api
  description: Handles card authorizations and captures
  tags: [payments, api, critical]
spec:
  owner: team-core-payments
  lifecycle: production
  tier: P0
  system: payments-platform
  links:
    runbook: internal://runbooks/payments-api
    dashboard: internal://dashboards/payments-api
  dependencies:
    - service: fraud-checker
    - database: payments-db
  slos:
    availability: 99.9%
    latency_p95_ms: 200
Exercise 2 — Draft a scorecard
Write five checks that encourage good metadata hygiene for production services.
- At least one ownership check
- At least one observability/runbook check
- At least one SLO check
Solution
checks:
  - id: owner-set
    rule: spec.owner is defined
  - id: lifecycle-prod
    rule: spec.lifecycle == "production"
  - id: runbook-present
    rule: spec.links.runbook exists
  - id: dashboard-present
    rule: spec.links.dashboard exists
  - id: slo-availability
    rule: spec.slos.availability exists
Self-check checklist
- Does each service have a unique name and an explicit owner?
- Is lifecycle set from the allowed set?
- Are runbook and dashboard links present and reachable internally?
- Do dependencies reference existing catalog entities?
- Are SLOs present and measurable (availability or latency)?
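The dependency question in this checklist is easy to automate. The sketch below assumes dependencies use the `- kind: name` shape from the worked example and that you can supply the set of known `(kind, name)` pairs from your catalog; the function name `dangling_dependencies` is hypothetical.

```python
def dangling_dependencies(manifest: dict, known: set[tuple[str, str]]) -> list[str]:
    """Return dependencies that do not match any known catalog entity."""
    missing = []
    for dependency in manifest.get("spec", {}).get("dependencies", []):
        # Each entry is a one-key mapping such as {"service": "fraud-checker"}.
        for kind, name in dependency.items():
            if (kind, name) not in known:
                missing.append(f"{kind}:{name}")
    return missing


manifest = {
    "metadata": {"name": "payments-api"},
    "spec": {
        "dependencies": [
            {"service": "fraud-checker"},
            {"database": "payments-db"},
        ]
    },
}

# Suppose only the fraud-checker service is registered so far.
known_entities = {("service", "fraud-checker")}
print(dangling_dependencies(manifest, known_entities))  # ['database:payments-db']
```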
Common mistakes and how to avoid them
- Optional ownership: Make owner mandatory in templates and scorecards.
- Stale metadata: Validate manifests in CI; fail fast on missing critical fields.
- Overcomplicated schemas: Start minimal; add fields only when used.
- Ignoring non-services: Catalog jobs, databases, and libraries for full lineage.
- No feedback loop: Show scorecard results where engineers work (PR comments or dashboards).
Practical projects
- Project 1: Migrate 10 legacy services into the catalog. Add owners, lifecycle, and runbooks. Report the before/after scorecard grades.
- Project 2: Build a golden path that scaffolds a service with a manifest, CI, and default SLO placeholders. Measure adoption.
- Project 3: Implement a dependency map using the catalog data to visualize upstream/downstream impact for one system.
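For Project 3, the core of an impact view is a traversal over reversed dependency edges. The sketch below is one possible approach, assuming you have already extracted a plain mapping from each entity to the entities it depends on; `checkout-web` is a made-up service added to show a transitive dependent, and `impacted_by` is a hypothetical helper name.

```python
from collections import defaultdict, deque


def impacted_by(entity: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every entity that directly or transitively depends on `entity`."""
    # Reverse the edges: for each dependency, record who depends on it.
    dependents: dict[str, set[str]] = defaultdict(set)
    for source, targets in depends_on.items():
        for target in targets:
            dependents[target].add(source)

    # Breadth-first walk over the reversed graph.
    seen: set[str] = set()
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(entity)  # ignore self-reference if the graph has a cycle
    return seen


depends_on = {
    "payments-api": ["fraud-checker", "payments-db"],
    "checkout-web": ["payments-api"],  # hypothetical consumer
    "fraud-checker": [],
}

print(sorted(impacted_by("payments-db", depends_on)))
# ['checkout-web', 'payments-api']
```

The same data can be fed into a graph visualization tool; the traversal answers the operational question of which services are affected if one entity fails.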
Who this is for
- Platform Engineers introducing or improving an internal developer platform.
- Backend Engineers contributing to service metadata and reliability.
- Tech Leads who want better ownership and visibility.
Prerequisites
- Basic understanding of microservices and CI/CD.
- Familiarity with YAML or JSON configuration.
- Access to example services to practice on.
Learning path
- Learn the entity schema and required fields.
- Create a minimal scorecard and run it against a few services.
- Introduce a golden path template for new services.
- Automate validation in CI and display scorecard results.
- Expand to systems/domains and non-service entities.
Mini challenge
Pick one critical service and raise its scorecard grade by one level this week. Add the missing owner, runbook, and SLOs. Document before/after.
Next steps
- Roll out your first scorecard to one team and gather feedback.
- Iterate on the golden path template to reduce manual fields.
- Plan your next subskill: service reliability automation and SLO governance.