Who this is for
This lesson is for MLOps engineers, platform engineers, and security-minded data scientists who need to keep ML training and inference traffic safe, predictable, and compliant.
- You deploy or manage ML systems (training clusters, feature stores, model registries, inference services).
- You must minimize data leakage and control which services can talk to each other and to the internet.
- You collaborate with security and compliance teams and need clear, auditable network boundaries.
Prerequisites
- Basic networking: IPs, CIDR, subnets, routing.
- Know what your ML components are: data sources, training jobs, model registry, CI/CD, inference endpoints.
- High-level familiarity with container orchestration (e.g., Kubernetes).
Why this matters
Real MLOps tasks that depend on network isolation:
- Let training clusters pull datasets from a secure data lake without exposing them to the public internet.
- Stop inference services from calling unknown external APIs that could leak PII or model parameters.
- Ensure only CI/CD runners can push models to the registry via a private path.
- Demonstrate compliance (e.g., restricting data egress, logging, separation of environments) during audits.
Concept explained simply
Network isolation means drawing clear boundaries around your ML components so only the minimum, approved network traffic can flow in or out. You choose who can talk to whom, and how.
Mental model
Imagine your ML platform as a building with rooms:
- Subnets are rooms. Only specific doors connect rooms.
- Firewalls and security groups are the door policies (who can enter/exit).
- Private endpoints are internal doors to shared utilities (like a private elevator to storage).
- Default-deny means every door stays locked until you issue a key to an approved person.
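The building analogy can be sketched as a tiny rule evaluator. This is only an illustration of the default-deny idea, not any cloud provider's API; the room names, ports, and the `ALLOWED_DOORS` set are hypothetical.

```python
# Hypothetical "doors" between rooms (subnets): (source, destination, port).
# Anything not listed here stays locked, which is what default-deny means.
ALLOWED_DOORS = {
    ("training", "data-lake", 443),
    ("ci-cd", "model-registry", 443),
    ("gateway", "inference", 8080),
}


def is_allowed(source: str, destination: str, port: int) -> bool:
    """Return True only if an explicit allow rule exists for this flow."""
    return (source, destination, port) in ALLOWED_DOORS


if __name__ == "__main__":
    print(is_allowed("training", "data-lake", 443))   # an explicit key exists
    print(is_allowed("inference", "data-lake", 443))  # no rule, so the door stays locked
```

Note that there is no "deny" list at all: a flow is refused simply because nobody added a key for it.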
Key components you will use
- VPC/VNet and Subnets: define isolated IP ranges and segments (e.g., private vs public).
- Routing + NAT/Egress Gateways: control outbound paths; allow private resources to reach approved external services without being directly reachable from the internet.
- Firewalls / Security Groups / Network ACLs: permit/deny specific traffic by source/destination/port.
- Private Endpoints/Service Endpoints: connect to managed services (storage, databases, registries) privately.
- Kubernetes NetworkPolicies / Service Mesh Policies: restrict pod-to-pod and namespace-to-namespace traffic.
- Proxy/Allowlist: egress proxy that allows only explicitly permitted domains/IPs.
- DNS Control: resolve only approved internal names; prevent exfiltration via rogue DNS.
Common isolation patterns for ML
- Air-gapped training: no internet; artifacts mirrored internally; data ingress via controlled channels.
- Hub-and-spoke: central shared services in a hub; environments (dev/test/prod) in spokes with controlled peering.
- Zero Trust for services: identity-aware access, default deny between services, explicit allow by identity.
- Kubernetes namespace isolation: per-team or per-stage namespaces with NetworkPolicies and RBAC.
Worked examples
Example 1: Isolated training cluster with controlled egress
- Create a private training subnet with no inbound from the internet.
- Attach an egress-only path (NAT or approved proxy).
- Allow outbound only to: artifact registry, model registry, internal data lake.
- Use a private endpoint to reach the data lake; block all other storage endpoints.
- Add NetworkPolicies so training pods can reach only node-local services and the approved endpoints.
- Log all allowed egress for audit; alert on any denied attempts.
Result: training can fetch data and artifacts without exposing the cluster to unapproved internet access.
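As a rough sketch of the egress rules in this example, the check below allows a destination only if it falls inside an approved private range, and separates allowed flows (to log) from denied ones (to alert on). The CIDR ranges and names are assumed values for illustration; in practice a firewall, security group, or proxy would enforce this, not application code.

```python
import ipaddress

# Assumed private ranges for the approved destinations (illustrative values).
EGRESS_ALLOWLIST = {
    "artifact-registry": ipaddress.ip_network("10.20.1.0/28"),
    "model-registry": ipaddress.ip_network("10.20.2.0/28"),
    "data-lake-private-endpoint": ipaddress.ip_network("10.20.3.0/28"),
}


def check_egress(dest_ip: str) -> tuple[bool, str]:
    """Return (allowed, reason). Anything outside the allowlist is denied."""
    addr = ipaddress.ip_address(dest_ip)
    for name, network in EGRESS_ALLOWLIST.items():
        if addr in network:
            return True, name
    return False, "default-deny"


def audit(dest_ips: list[str]) -> dict[str, list[str]]:
    """Split attempted destinations into allowed (log) and denied (alert)."""
    result: dict[str, list[str]] = {"allowed": [], "denied": []}
    for ip in dest_ips:
        allowed, _ = check_egress(ip)
        result["allowed" if allowed else "denied"].append(ip)
    return result


if __name__ == "__main__":
    print(audit(["10.20.3.5", "8.8.8.8"]))
```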
Example 2: Model registry reachable only from CI/CD
- Place the registry behind a private endpoint or in a subnet with no public route.
- Assign a service identity to CI/CD runners and allow inbound to the registry only from that identity, coming from the runners' subnet or IPs.
- Block all inbound from other subnets; require TLS and mutual authentication.
- Set egress allowlist on CI/CD to the registry only; deny general internet egress from runners.
Result: only your pipeline can push/promote models; workstations and training nodes cannot directly modify the registry.
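The access decision in this example combines three conditions: the network path, the caller's identity, and mutual TLS. A minimal sketch of that logic follows; the subnet, the identity name `svc-cicd-runner`, and the `PushRequest` type are hypothetical and stand in for whatever your registry or gateway actually evaluates.

```python
import ipaddress
from dataclasses import dataclass

# Assumed values for illustration only.
CICD_SUBNET = ipaddress.ip_network("10.30.0.0/24")
CICD_IDENTITY = "svc-cicd-runner"


@dataclass(frozen=True)
class PushRequest:
    source_ip: str
    identity: str      # identity presented by the client certificate
    mutual_tls: bool   # True if the client certificate was verified


def may_push(request: PushRequest) -> bool:
    """Allow a push only when network path, identity, and mTLS all match."""
    in_subnet = ipaddress.ip_address(request.source_ip) in CICD_SUBNET
    return in_subnet and request.identity == CICD_IDENTITY and request.mutual_tls


if __name__ == "__main__":
    print(may_push(PushRequest("10.30.0.12", "svc-cicd-runner", True)))
    print(may_push(PushRequest("10.40.0.7", "svc-cicd-runner", True)))
```

Requiring all three conditions means a stolen credential used from a workstation subnet still fails, and so does a runner that skips client authentication.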
Example 3: Inference service exposed safely
- Deploy inference pods in a private subnet.
- Expose via a load balancer or API gateway placed at the edge; only the gateway is public.
- Gateway forwards traffic to the private service on approved ports; no direct inbound to pods.
- Default-deny egress from pods; allow only to feature store, telemetry endpoint, and model registry (if required).
- Use NetworkPolicies to limit namespace-to-namespace traffic.
Result: clients reach the gateway, not your pods; egress is tightly controlled to prevent data leakage.
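The egress rules for the inference pods could be expressed as Kubernetes NetworkPolicy objects. The sketch below builds them as Python dictionaries and prints JSON, which `kubectl apply -f` accepts alongside YAML. The namespace names, the `app` label, and the port are assumptions; the namespace selector relies on the `kubernetes.io/metadata.name` label that recent Kubernetes versions set on every namespace.

```python
import json


def default_deny_egress(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows no egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-egress", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Egress"]},
    }


def allow_egress_to(namespace: str, app_label: str, target_namespace: str, port: int) -> dict:
    """NetworkPolicy that lets pods labelled app=<app_label> reach one namespace on one TCP port."""
    selector = {"matchLabels": {"kubernetes.io/metadata.name": target_namespace}}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{app_label}-to-{target_namespace}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": app_label}},
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{"namespaceSelector": selector}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespaces and port, chosen only for this example.
    policies = [
        default_deny_egress("inference"),
        allow_egress_to("inference", "inference-api", "feature-store", 443),
    ]
    print(json.dumps(policies, indent=2))
```

Two caveats: NetworkPolicies take effect only when the cluster's network plugin enforces them, and a default-deny egress policy also blocks DNS lookups, so a real setup usually needs an additional rule allowing traffic to the cluster DNS service.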
Minimal viable isolation plan (hands-on)
- List ML components: training, feature store, data lake, registry, CI/CD, inference.
- Mark which need internet access and why. Challenge every item.
- Set default-deny inbound and egress for each component.
- Add only the smallest allow rules needed for operations.
- Introduce private endpoints for data and registry access.
- Enable flow logs/firewall logs and review weekly.
Checklist (tick as you go):
- Training subnet is private; egress is via approved gateway/proxy.
- Inference reachable only through gateway/load balancer; no direct pod ingress.
- Private endpoints for data and model registry configured.
- Kubernetes NetworkPolicies applied per namespace.
- Egress allowlist documented and enforced.
- Network logging enabled and reviewed.
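Part of the log review can be automated. The sketch below counts denied flows per source and destination so the noisiest offenders surface first. The record format is a simplified, hypothetical one; real flow-log formats differ by provider and would need their own parser.

```python
from collections import Counter

# Hypothetical, simplified flow records; real flow-log formats differ by provider.
SAMPLE_FLOWS = [
    {"src": "inference", "dst": "feature-store", "action": "ACCEPT"},
    {"src": "inference", "dst": "203.0.113.10", "action": "REJECT"},
    {"src": "inference", "dst": "203.0.113.10", "action": "REJECT"},
    {"src": "training", "dst": "data-lake", "action": "ACCEPT"},
]


def denied_summary(flows: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Count rejected flows per (source, destination), most frequent first."""
    counts = Counter(
        (flow["src"], flow["dst"]) for flow in flows if flow["action"] == "REJECT"
    )
    return counts.most_common()


if __name__ == "__main__":
    for (src, dst), count in denied_summary(SAMPLE_FLOWS):
        print(f"{src} -> {dst}: {count} denied attempts")
```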
Common mistakes and how to self-check
- Mistake: Allowing “temporary” broad egress rules that never get tightened. Self-check: scan for any 0.0.0.0/0 egress allows.
- Mistake: Public endpoints for storage/registry. Self-check: verify private endpoints and DNS resolve to private IPs.
- Mistake: No pod-level policies. Self-check: confirm NetworkPolicies exist and are enforced in each namespace.
- Mistake: Exposing training nodes directly to the internet. Self-check: inbound rules deny all except internal sources.
- Mistake: Missing logs. Self-check: ensure firewall/VPC flow logs are enabled, retained, and accessible to auditors.
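Two of these self-checks are easy to script. The sketch below flags egress rules that allow traffic to anywhere, and checks that the addresses a service name resolves to all sit inside your private range. The rule format and the `VPC_CIDR` value are assumptions; in practice the rules would come from an export of your firewall or security group configuration, and the addresses from a DNS lookup (for example via `socket.getaddrinfo`).

```python
import ipaddress

# Assumed private address space for this environment (illustrative).
VPC_CIDR = ipaddress.ip_network("10.0.0.0/8")
OPEN_RANGES = {"0.0.0.0/0", "::/0"}


def find_open_egress(rules: list[dict]) -> list[dict]:
    """Return egress allow rules whose destination is the whole internet."""
    return [
        rule for rule in rules
        if rule["direction"] == "egress"
        and rule["action"] == "allow"
        and rule["cidr"] in OPEN_RANGES
    ]


def resolves_privately(resolved_ips: list[str]) -> bool:
    """True only if at least one address was returned and all are inside VPC_CIDR."""
    return bool(resolved_ips) and all(
        ipaddress.ip_address(ip) in VPC_CIDR for ip in resolved_ips
    )


if __name__ == "__main__":
    rules = [
        {"name": "to-registry", "direction": "egress", "action": "allow", "cidr": "10.20.2.0/28"},
        {"name": "temp-debug", "direction": "egress", "action": "allow", "cidr": "0.0.0.0/0"},
    ]
    print([rule["name"] for rule in find_open_egress(rules)])
    print(resolves_privately(["10.20.3.5"]))
```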
Exercises
Do the exercise below.
Exercise 1: Design a minimal isolated topology for an ML inference service
Goal: expose an inference API to customers while preventing data exfiltration from pods.
- Draw or describe the subnets, gateway/load balancer, and private service placement.
- List inbound rules: who can call the API and on which ports.
- List egress allowlist for the service (feature store, telemetry, registry if needed).
- Write 2–3 NetworkPolicy rules that enforce the above.
- Describe how you will log and audit flows.
Deliverable: a short architecture note (5–10 bullet points) and the allow/deny rules.
Checklist (tick as you go):
- I created the architecture note.
- I defined inbound rules with least privilege.
- I defined an egress allowlist.
- I specified NetworkPolicies.
- I included logging/auditing steps.
Practical projects
- Build a sandbox: one VPC/VNet with three subnets (ingress, app, data). Deploy a toy inference app and prove that only the gateway reaches it.
- Private data path: configure a private endpoint to object storage; show that public endpoints are blocked and training still runs.
- Egress proxy: deploy an allowlist proxy and demonstrate that calls to unapproved domains fail while approved domains succeed.
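For the egress proxy project, the core decision is whether a requested host is on the allowlist. A minimal sketch of that matching logic follows, assuming a hypothetical set of approved domains; it is not a proxy by itself, only the check a proxy would run before forwarding a request.

```python
# Hypothetical approved domains; subdomains of these are also accepted.
ALLOWED_DOMAINS = {"registry.internal.example", "telemetry.internal.example"}


def host_allowed(host: str) -> bool:
    """Allow exact matches and true subdomains; deny everything else."""
    host = host.lower().rstrip(".")
    return any(
        host == domain or host.endswith("." + domain)
        for domain in ALLOWED_DOMAINS
    )


if __name__ == "__main__":
    for candidate in [
        "registry.internal.example",
        "api.registry.internal.example",
        "evilregistry.internal.example",
        "analytics.example.com",
    ]:
        print(candidate, host_allowed(candidate))
```

The suffix check includes the leading dot on purpose: without it, a lookalike host such as `evilregistry.internal.example` would pass a naive "ends with" comparison.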
Learning path
- Start with VPC/VNet, subnets, routing, and NAT/egress fundamentals.
- Learn firewall/security group and NetworkPolicy basics.
- Add private endpoints and identity-aware access between services.
- Integrate logs and monitoring; define alert thresholds.
- Harden with zero-trust principles and routine policy reviews.
Next steps
- Run the minimal isolation plan in a dev environment.
- Schedule a joint review with security to validate allowlists.
- Automate policy deployment via infrastructure as code to avoid drift.
Mini challenge
You discover a new outbound connection from inference pods to an analytics site. In one paragraph, describe how you would: block it immediately, investigate who initiated it, and prevent similar issues in the future.