Series · Cloud Computing · Chapter 5

Cloud Computing (5): Cloud Network Architecture and SDN

VPC, load balancers, CDN, SDN/NFV, and BGP -- a deep tour of cloud networking from packet to planet, with the production knobs that matter.

Cloud Computing (5): Cloud Network Architecture and SDN — Chapter overview

A cloud platform is essentially a network with attached computers. The compute layer scales by adding servers; the storage layer scales by adding disks; the network layer integrates these into a single, coherent system. Get the network right, and the rest of the stack feels effortless. Get it wrong — a missing route, a 5-tuple mismatch in a security group, or an under-provisioned load balancer — and the whole platform goes dark.

This article maps the cloud networking stack from the packet up: how a VPC carves an isolated network from shared infrastructure, what changes when load balancers move from L4 to L7, how a CDN converts geography into latency savings, why SDN reshaped the data center, and how BGP stitches it all together across regions.


What You Will Learn#

  1. VPC internals — subnets, route tables, gateways, endpoints, and the encapsulation that makes them isolated
  2. Load balancing — L4 vs L7, algorithms, health checks, sticky sessions, draining
  3. CDN architecture — edge PoPs, cache hierarchies, TLS termination, dynamic acceleration
  4. SDN — control / data plane split, OpenFlow / P4Runtime, NFV and service chains
  5. VPN, Direct Connect, and Transit Gateway — when to choose which connectivity model
  6. Security — security groups vs NACLs, flow logs, zero-trust micro-segmentation
  7. BGP and global routing — AS-paths, ECMP, anycast, multi-region failover

Prerequisites#

  • IP addressing and CIDR notation
  • Familiarity with at least one cloud console (AWS / GCP / Aliyun)
  • Parts 1-3 of this series

Virtual Private Cloud (VPC)#

A VPC is a software-defined slice of the cloud provider’s physical network that behaves like your own private datacentre: you choose the IP space, draw the subnets, install the gateways and write the firewall rules. Behind the scenes the provider implements isolation through VXLAN (or a proprietary equivalent) — every packet carries a tenant tag so two customers can both use 10.0.0.0/16 without ever seeing each other’s traffic.

Multi-AZ VPC architecture

Anatomy of a production VPC#

The diagram above is a textbook 3-tier deployment. The pieces:

ComponentLayerWhat it doesFree?
VPCnetworkThe 10.0.0.0/16 envelope. One per environment per region.yes
SubnetnetworkA CIDR carved out of the VPC, pinned to one AZ. Public/Private/Isolated by route table, not by name.yes
Internet Gateway (IGW)edgeBidirectional NAT-free path to the public internet. One per VPC.yes
NAT GatewayegressLets private instances reach the internet outbound; blocks inbound. Per-AZ for HA.hourly + GB
Route TablecontrolMaps destination CIDR -> target (IGW, NAT, VPCe, peering, TGW). One per subnet.yes
Security Groupinstance FWStateful, allow-only, attached to ENIs.yes
Network ACLsubnet FWStateless, allow + deny, ordered. Belt-and-braces with SGs.yes
VPC Endpointdata planePrivate path from VPC to a cloud service (S3, DynamoDB, Secrets Manager, …).gateway free / interface hourly
Transit GatewayhubN-to-N VPC + VPN + Direct Connect interconnect.hourly + GB

The “public vs private subnet” distinction is purely about routes. A subnet whose route table contains 0.0.0.0/0 -> igw-... is public; a subnet whose default route is 0.0.0.0/0 -> nat-... is private; a subnet with no internet route at all is isolated. This sounds trivial but is the source of half of all “I cannot reach the internet from my Lambda” tickets.

Terraform: a real, multi-AZ VPC#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
locals {
  azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "prod-vpc" }
}

# One public + one private subnet per AZ
resource "aws_subnet" "public" {
  for_each                = toset(local.azs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, index(local.azs, each.key))
  availability_zone       = each.key
  map_public_ip_on_launch = true
  tags = { Name = "public-${each.key}", Tier = "public" }
}

resource "aws_subnet" "private" {
  for_each          = toset(local.azs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, index(local.azs, each.key) + 10)
  availability_zone = each.key
  tags = { Name = "private-${each.key}", Tier = "private" }
}

resource "aws_internet_gateway" "main" { vpc_id = aws_vpc.main.id }

# One NAT GW per AZ for HA (a single NAT GW is a per-AZ SPOF)
resource "aws_eip"         "nat" { for_each = aws_subnet.public }
resource "aws_nat_gateway" "nat" {
  for_each      = aws_subnet.public
  allocation_id = aws_eip.nat[each.key].id
  subnet_id     = each.value.id
}

# Routes
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route { cidr_block = "0.0.0.0/0"  gateway_id = aws_internet_gateway.main.id }
}

resource "aws_route_table" "private" {
  for_each = aws_subnet.private
  vpc_id   = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat[each.key].id
  }
}

# Gateway endpoint for S3 -- avoids NAT GW data-processing fees
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-east-1.s3"
  route_table_ids = [for rt in aws_route_table.private : rt.id]
}

Three production-relevant choices encoded above:

  • One NAT GW per AZ. A single NAT GW is a per-AZ failure boundary; a region-wide NAT GW is a region-wide outage waiting to happen.
  • Gateway endpoint for S3 / DynamoDB. Without it, every byte to S3 from a private subnet flows through the NAT GW and bills at $0.045/GB. The endpoint is free.
  • map_public_ip_on_launch only on public subnets. Stops you from accidentally giving a private DB a public IP.
PatternTopologyBest forCaveats
VPC peeringpoint-to-point2-5 VPCs, simple intra-regionnon-transitive: A↔B and B↔C does not give A↔C
Transit Gatewayhub-and-spoke5+ VPCs, VPN + DX integration, central inspectionhourly + per-GB charge
PrivateLinkservice exposureone VPC publishes a service to many consumer VPCsone-way; no need to peer entire networks
Cloud WANglobal meshmulti-region, multi-account meshesnewest, simplest at scale, costs accordingly

The decision driver is mostly the topology you need, not the bandwidth — all of these can handle 10s of Gbps when sized correctly.


Load Balancing#

A load balancer turns a fleet into a service. It hides individual instance failures, spreads load, terminates TLS, and — in its L7 form — lets you reshape traffic by path, host or header without touching the apps.

L4 vs L7 load balancers

L4 vs L7#

FeatureNetwork LB (L4)Application LB (L7)
Operates onTCP / UDP / TLSHTTP / HTTPS / gRPC
Routing decision5-tuple (src IP/port, dst IP/port, protocol)path, host, header, cookie, JWT claim
Latency addedtens of µslow ms
Throughput per LB100s of Gbps10s of Gbps
TLS terminationoptional (passthrough or terminate)almost always terminate
WebSocket / HTTP/2 / gRPCpassthrough onlyfirst-class
Source IP preservationyes (with proxy protocol)via X-Forwarded-For
Best forGame servers, IoT, MQTT, low-level TCPWeb apps, microservices, public APIs

A practical pattern in modern stacks: L4 (NLB) -> L7 (Envoy/ALB) -> services. The L4 layer absorbs DDoS and gives you a stable anycast IP; the L7 layer does smart routing and authentication. Each layer does one thing well.

Algorithms#

1
2
3
4
5
6
7
Round robin           A, B, C, A, B, C, ...
Weighted RR           A(3), B(1)  ->  A, A, A, B, A, A, A, B, ...
Least connections     pick the backend with fewest in-flight requests (best for variable-cost requests)
Least response time   pick the backend with the lowest EWMA of response latency
Power of two choices  pick 2 backends at random; route to the lighter one (90% of "perfect" load balance, O(1) cost)
Consistent hash       hash(client IP / path) -> backend (sticky for cache locality)
Maglev hash           Google's bounded-disruption consistent hash (used in Cloud Load Balancing)

For most stateless web traffic, power-of-two-choices (P2C) is empirically near-optimal and trivial to implement. For cache-sensitive workloads (anything in front of a key-value store), use a consistent or Maglev hash so the same key keeps hitting the same backend.

ALB with path + host routing (Terraform)#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
resource "aws_lb" "app" {
  name                             = "app-alb"
  load_balancer_type               = "application"
  security_groups                  = [aws_security_group.alb.id]
  subnets                          = [for s in aws_subnet.public : s.id]
  enable_cross_zone_load_balancing = true
  enable_deletion_protection       = true
  idle_timeout                     = 60
  drop_invalid_header_fields       = true     # mitigate request smuggling
}

resource "aws_lb_target_group" "web" {
  name        = "web-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip"      # pods/Fargate-friendly

  health_check {
    path                = "/healthz"
    interval            = 15
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200-299"
  }

  deregistration_delay = 30   # match graceful shutdown window
  stickiness {
    type            = "app_cookie"
    cookie_name     = "JSESSIONID"
    cookie_duration = 3600
    enabled         = true
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.main.arn
  default_action { type = "forward"  target_group_arn = aws_lb_target_group.web.arn }
}

# Path-based routing: /api/* to a different target group
resource "aws_lb_listener_rule" "api" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 100
  action       { type = "forward"  target_group_arn = aws_lb_target_group.api.arn }
  condition    { path_pattern { values = ["/api/*"] } }
}

The non-obvious knobs:

  • deregistration_delay should match your application’s graceful-shutdown window. The default 300 s blocks deploys; setting it lower than your in-flight request timeout drops connections.
  • drop_invalid_header_fields = true mitigates HTTP request-smuggling attacks (CVE-2019-18860 class).
  • enable_cross_zone_load_balancing ensures even spread when AZs have unequal target counts. Off by default on NLB, on by default on ALB.

Health checks that do not lie#

A health check that hits / and runs a database query does tell you when the database is down — but it also marks the entire fleet unhealthy on the first DB blip, taking the whole site offline. The pattern that survives production:

  • /livez — “the process is alive” (no dependencies). LB checks this.
  • /readyz — “this instance can serve traffic right now” (cache warmed, DB pool open). Kubernetes checks this.
  • Application metrics, traces, and logs check the rest — not the load balancer.

Content Delivery Networks (CDN)#

A CDN is geography turned into a cache. Static content is replicated to edge PoPs near users; the origin (often an S3 bucket) is hit only on a cache miss. Latency drops from 200 ms to under 30 ms; origin egress drops by 10x or more.

CDN edge distribution

What actually happens on a request#

  1. DNS — the user resolves cdn.example.com to an anycast IP that lands at the closest healthy PoP.
  2. TLS termination — the TLS handshake completes at the edge (much shorter RTT) and the PoP holds a session ticket.
  3. Cache lookup — the PoP keys on URL + Vary headers. On HIT, return immediately.
  4. MISS path — the PoP fetches from a parent (regional cache) which may itself fetch from origin (the hierarchical/tiered cache reduces origin load even for unique content).
  5. Cache fill — the PoP stores the response per its TTL (Cache-Control: max-age, s-maxage, stale-while-revalidate).

Cache headers that work#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# Static, content-addressed assets (hash in filename) -- cache forever
location ~* \.[a-f0-9]{8,}\.(js|css|woff2|png|jpg|svg)$ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}

# API responses -- short TTL, allow stale on origin trouble
location /api/products/ {
    add_header Cache-Control "public, max-age=60, s-maxage=300, stale-while-revalidate=86400";
}

# User-specific responses -- never cache
location /api/me {
    add_header Cache-Control "no-store";
}

Three rules of thumb:

  • immutable is the single biggest win on static assets (no revalidation traffic at all).
  • stale-while-revalidate lets the CDN serve a stale copy while it asynchronously refreshes — origin pain becomes invisible to users.
  • Vary on the bare minimum. Vary: Accept-Encoding is fine; Vary: User-Agent blows up the cache key space and your hit ratio.

Beyond static: dynamic acceleration#

Modern CDNs also accelerate uncacheable traffic. The mechanisms:

  • TLS / TCP termination at the edge cuts the client-side handshake from 4 × RTT_to_origin to 4 × RTT_to_edge.
  • Pre-warmed long-haul connections between PoPs and origin reuse keepalive (no re-handshake per request).
  • Anycast DNS + BGP optimisation routes the client to the lowest-latency PoP, not just the closest by mileage.

The combination — CloudFront’s “CloudFront Functions”, Aliyun DCDN, Cloudflare Workers — moves the line between “static” and “dynamic” further into the application than most teams realise.


Software-Defined Networking (SDN)#

Traditional networks bound the control (routing, ACLs, QoS) tightly to each device. SDN separates the control plane (decisions, programmable from above) from the data plane (line-rate forwarding). The result is a network that you can manage like software: declarative, version-controlled, observable.

SDN control plane vs data plane

The split#

PlaneResponsibilityImplementation
Control planeBuild the topology, compute paths, enforce policy, expose APIsLogically centralised controller (ONOS, OpenDaylight, hyperscaler proprietary)
Data planeMatch -> action on every packet at line rateASIC switches with a flow table, or eBPF/XDP/DPDK in software
Southbound APIController pushes flow rules to switchesOpenFlow 1.3+, NETCONF/YANG, gNMI, P4Runtime
Northbound APIApps program the controllerREST / gRPC, GraphQL

Inside a hyperscaler’s region, everything is SDN: Hyper-V Virtual Switch, Open vSwitch, Cisco ACI, AWS’s “Mapping Service” + “Hyperplane” implement what you experience as a VPC. Your security-group rules are compiled into ACL entries pushed to the host’s vswitch on the path your packet takes — not enforced at the destination instance.

Why operators love it#

  • Centralised intent. “All traffic from PCI-tagged subnets must pass through inspection” is one policy, applied everywhere, instead of 200 device configs.
  • Traffic engineering. A controller seeing the whole topology can route around congestion that distributed routing protocols (OSPF, IS-IS) react to with a 30-second delay.
  • Programmable packet processing. P4 lets the data plane parse new protocols (e.g. SRv6, custom in-band telemetry) without an ASIC respin.
  • Fast failure recovery. SDN-precomputed alternate paths + BFD failure detection reduces MTTR to milliseconds.

Network Functions Virtualisation (NFV)#

NFV is SDN’s sibling: replace dedicated network appliances with software running on commodity x86 servers. A firewall, a load balancer, a WAN optimiser — each becomes a VNF that you can spin up, scale, and chain like any other workload.

1
2
Traditional:   [Router HW] -> [Firewall HW] -> [LB HW] -> [WAN-opt HW]
NFV:           x86 server: [Router VNF] -> [Firewall VNF] -> [LB VNF] -> [WAN-opt VNF]

A representative VNF — HAProxy as the load balancer in a service chain:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
frontend public_https
    bind *:443 ssl crt /etc/haproxy/star.example.com.pem alpn h2,http/1.1
    http-request set-header X-Forwarded-Proto https
    default_backend api_pool

backend api_pool
    balance               leastconn
    option                httpchk GET /healthz HTTP/1.1\r\nHost:\ api.example.com
    http-check expect     status 200
    default-server        check inter 2s fall 3 rise 2 maxconn 256
    server api1 10.0.10.11:8080
    server api2 10.0.11.11:8080
    server api3 10.0.12.11:8080

Service chaining (sometimes called service function chaining, SFC) is the topology by which traffic walks through the VNFs in order. In a Kubernetes mesh, this is what an Envoy sidecar + mesh policy does for east-west traffic.


Hybrid connectivity: VPN, Direct Connect, Transit Gateway#

Most enterprises live in hybrid topologies: cloud workloads need to reach on-prem databases, identity providers, or partner networks. There are three layers of connectivity tooling.

VPN vs Direct Connect#

FeatureSite-to-site VPNDirect Connect / Express Connect / Cloud Interconnect
PathEncrypted tunnel over the public internetDedicated physical fibre to a provider POP
Bandwidthup to ~10 Gbit/s, jitter-prone1 / 10 / 100 Gbit/s, deterministic
Latencyvariable (tens of ms + jitter)low, single-digit ms within metro, stable
Setup timeminutesweeks (carrier provisioning)
Costhourly + per-GB egressport hourly + per-GB egress (cheaper at scale)
EncryptionIPsec mandatoryoptional MACsec; usually you run IPsec on top for confidentiality
Best forlow / variable volume, dev environmentsproduction data flows, low-latency sync, regulatory needs

A common layered approach: DX as the primary with a VPN as the encrypted backup that activates on DX failure. Both terminate on the same Virtual Private Gateway / Transit Gateway.

Transit Gateway#

A Transit Gateway (TGW) is the regional hub: every VPC, every VPN, every DX connection attaches once and gains reachability to everything else, governed by route tables on the TGW itself. It replaces the N²-peering problem with N attachments.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
resource "aws_ec2_transit_gateway" "main" {
  description                     = "prod-tgw"
  default_route_table_association = "disable"
  default_route_table_propagation = "disable"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "prod" {
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = aws_vpc.prod.id
  subnet_ids         = [for s in aws_subnet.tgw_private : s.id]   # one /28 per AZ
}

# Two route tables: prod and shared services
resource "aws_ec2_transit_gateway_route_table" "prod"   { transit_gateway_id = aws_ec2_transit_gateway.main.id }
resource "aws_ec2_transit_gateway_route_table" "shared" { transit_gateway_id = aws_ec2_transit_gateway.main.id }

# Prod can reach shared, shared can reach prod, prod cannot reach prod
resource "aws_ec2_transit_gateway_route" "prod_to_shared" {
  destination_cidr_block         = "10.10.0.0/16"
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
  transit_gateway_attachment_id  = aws_ec2_transit_gateway_vpc_attachment.shared.id
}

The pattern — disable default association/propagation, then write the routes you want — is how you turn a TGW from a flat-mesh footgun into an enforceable segmentation boundary.


Security in the network#

Security Groups vs Network ACLs#

FeatureSecurity GroupNetwork ACL
Attached toENI / instanceSubnet
StateStateful (return traffic auto-allowed)Stateless (must allow both directions)
RulesAllow onlyAllow + Deny
DefaultDeny all inbound, allow all outboundAllow all both ways
Rule evaluationAll rules unionFirst match wins (numeric order)
Quota (AWS)60 inbound + 60 outbound, 5 SGs per ENI20 inbound + 20 outbound rules per NACL
Best forApp-to-app authorisationBlast-radius limits, deny-listing IPs

The robust pattern is SG-to-SG references: instead of opening port 3306 from a CIDR, open port 3306 from aws_security_group.app.id. The rule remains correct as the app fleet scales and IPs churn; auditors see intent, not infrastructure.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
resource "aws_security_group" "db" {
  vpc_id = aws_vpc.main.id
  ingress {
    from_port       = 3306
    to_port         = 3306
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]   # not a CIDR
    description     = "MySQL from app tier"
  }
  egress {
    from_port = 0  to_port = 0  protocol = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

VPC Flow Logs#

Flow Logs record every accepted/rejected 5-tuple to S3 or CloudWatch. Three queries you will run within a week of enabling them:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
-- 1. Top REJECTed source IPs in the last hour (probably scanners / misconfigured clients)
SELECT srcaddr, COUNT(*) c
FROM "vpc_flow_logs"
WHERE action = 'REJECT' AND start_time > now() - interval '1' hour
GROUP BY srcaddr ORDER BY c DESC LIMIT 20;

-- 2. Egress to unexpected destinations (data exfiltration hunting)
SELECT dstaddr, dstport, SUM(bytes) bytes_out
FROM "vpc_flow_logs"
WHERE direction = 'egress' AND dstaddr NOT LIKE '10.%'
GROUP BY dstaddr, dstport ORDER BY bytes_out DESC LIMIT 20;

-- 3. Cross-AZ traffic (often unintentional, always billed)
SELECT srcsubnet, dstsubnet, SUM(bytes) bytes
FROM "vpc_flow_logs"
WHERE az_id_src != az_id_dst
GROUP BY srcsubnet, dstsubnet ORDER BY bytes DESC LIMIT 20;

Zero trust at the network layer#

The “soft chewy interior” model — a hard perimeter, trusted internal network — is gone. Modern designs assume the network is hostile and enforce identity per request. Practical building blocks:

  • mTLS everywhere (service-to-service via a mesh: Istio, Linkerd, App Mesh).
  • Short-lived workload identity (SPIFFE/SPIRE, IAM Roles for Service Accounts, GCP Workload Identity).
  • Per-flow policy evaluated by the SDN (security groups + Cilium NetworkPolicies + service-mesh L7 RBAC layered together).

BGP and global routing#

When a request leaves your region and crosses the public internet — or when you fail over from one region to another — the system that gets it there is BGP, the Border Gateway Protocol. BGP is the routing protocol between autonomous systems (ASes); within a region your provider runs OSPF or IS-IS, but the moment you step out, BGP is in charge.

BGP across multiple regions

The route-selection algorithm (simplified)#

Given multiple paths to the same prefix, BGP picks one by walking down a long tiebreak list. The first five rungs cover 99% of cases:

  1. Highest LOCAL_PREF — “internal preference”; how we want to leave our network. Operator-set.
  2. Shortest AS_PATH — fewer ASes to traverse.
  3. Lowest MED — “I’d prefer you enter my AS this way”; honoured between peers, not always between providers.
  4. eBGP > iBGP — prefer routes learned from external peers over internal.
  5. Lowest IGP cost to the next hop — shortest internal path to the chosen exit.

ECMP and anycast#

  • ECMP (Equal-Cost Multi-Path) — when several paths tie on all the above, hash the 5-tuple of each flow and spread them across the equal-cost neighbours. ECMP is how a 100 Gbit/s logical link is built from ten 10 Gbit/s physical links, and how a multi-region anycast service spreads load across its PoPs.
  • Anycast — announce the same IP from many locations; BGP delivers each user to the topologically closest one. Anycast is the foundation of every CDN, every DNS root, every public DoH resolver.

Multi-region failover#

A common pattern for global services:

  • Primary region serves writes; secondary regions serve reads.
  • The DNS name is anycast or a Route 53 / Aliyun GTM failover record set keyed off health checks.
  • Database replication is async cross-region; on failover the standby is promoted (RPO of seconds to minutes).
  • On recovery, traffic is manually shifted back — automatic flap-back has burned almost everyone who has tried it.

The BGP details rarely intrude on the application — but when something is mis-announced upstream (the Facebook 2021 outage or any of the hundreds of route leaks per year) every layer above goes dark. RPKI route-origin validation and a careful prefix list at the ingress edge are mandatory hygiene at any meaningful scale.


Troubleshooting in production#

The triage order#

When traffic is broken, walk up from the wire:

  1. Reachability — can the source ARP/ND the gateway? ip neigh, arping. If not, it is L2 / SG / NACL.
  2. Routing — does the route table have a path? aws ec2 describe-route-tables, ip route.
  3. Filters — is the SG / NACL allowing the 5-tuple? Check both directions if NACL is involved.
  4. DNS — is the name resolving to the IP you think? dig +short, beware of split-horizon DNS.
  5. TLSopenssl s_client -connect host:443 -servername host. Half of “API down” is a stale cert.
  6. Applicationcurl -vvv https://.... Log shows 200 OK? It’s not the network.

The Swiss-army CLI#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
# Where are packets actually going?
mtr -rwn -c 100 example.com           # combine traceroute + ping with stats
ss -tunap                              # what is listening / connected and to whom
ss -i                                  # per-socket congestion window, RTT
tcpdump -ni any -s 96 'port 443 and tcp[tcpflags] & (tcp-syn|tcp-fin) != 0'   # SYN/FIN summary
nft list ruleset                       # current netfilter rules (modern iptables)

# DNS
dig +trace example.com
dig @8.8.8.8 example.com               # bypass local resolver to isolate split-horizon

# Path MTU
ip -6 route get 2606:4700::6810:84e5    # confirm next hop + MTU
ping -M do -s 1472 8.8.8.8              # detect black-holed PMTUD

The five most expensive misconfigurations#

SymptomUsual causeFix
Lambda in VPC cannot reach the internetDefault route missing or NAT GW down0.0.0.0/0 -> nat-... on the function’s subnet
Sudden NAT GW bill spikeApp pulled large object from S3 over NAT instead of via Gateway EndpointAdd aws_vpc_endpoint.s3
Random TCP resets after secondsSG / NACL allows new flows but stateful tracking dropped (idle timeout)Tune keepalive, increase tcp-keepalive-time
Cross-AZ data transfer cost dwarfs computeLB cross-zone off, or services not AZ-awareEnable cross-zone LB; topology-aware routing in K8s
Asymmetric routing on multi-homed VPCSG stateful tracking sees only one directionRoute both directions through the same TGW attachment

FAQ#

Q: Should we run our own load balancer (HAProxy / NGINX) or use the cloud LB?

Use the cloud LB unless you have a specific reason not to (custom routing logic, sticky on a header the cloud LB does not support, tight latency budget). Cloud LBs are HA, auto-scaling, integrated with TLS / WAF / IAM; you get back the engineer who would have run HAProxy.

Q: NLB or ALB?

ALB for HTTP/gRPC services (path/host routing, OIDC auth, redirects). NLB for non-HTTP TCP/UDP, for static-IP requirements (each NLB has a per-AZ EIP), or when you need extreme PPS.

Q: How many subnets per VPC?

At least one public + one private + one DB subnet per AZ. A 3-AZ VPC therefore has 9 subnets. More if you separate by tier (web/app/db/mgmt) or by sensitivity (PCI/non-PCI). Keep /16 per VPC and /22-/24 per subnet so you have room to grow.

Q: When does my workload need its own SDN controller (vs the cloud’s)?

Almost never on the cloud — you are renting the controller and you cannot replace it. On-prem with OpenStack or VMware NSX, yes. Edge / 5G operators who build their own DC fabric, yes. Most application teams should not even think about SDN below the VPC abstraction.

In this series

Cloud Computing 8 parts

  1. 01 Cloud Computing (1): Fundamentals and Architecture
  2. 02 Cloud Computing (2): Virtualization Technology Deep Dive
  3. 03 Cloud Computing (3): Cloud-Native and Container Technologies
  4. 04 Cloud Computing (4): Cloud Storage Systems and Distributed Architecture
  5. 05 Cloud Computing (5): Cloud Network Architecture and SDN you are here
  6. 06 Cloud Computing (6): Cloud Security and Privacy Protection
  7. 07 Cloud Computing (7): Cloud Operations and DevOps Practices
  8. 08 Cloud Computing (8): Multi-Cloud and Hybrid Architecture

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub