
Understanding AWS Networking: A Guide for Network Engineers

AWS networking presents unique challenges, even for experienced network professionals transitioning to cloud environments. This post provides a simplified abstraction of core AWS network elements and reveals insights into AWS's underlying network architecture. By examining these components and design principles, network engineers can better understand AWS service behaviors, inherent limitations, and key features. This knowledge is crucial for effectively implementing and managing connectivity within and to AWS environments.

The following blog post presents an abstracted model of AWS networking services, mapped to a network topology. This conceptual representation is designed to aid in understanding AWS networking principles and does not reflect the actual implementation of AWS infrastructure. Network engineers should use this abstraction as a mental model to comprehend AWS networking concepts, service interactions, and design considerations, while recognizing that the underlying AWS architecture may differ significantly from traditional on-premises network topologies.

VPC and Internet connectivity

I want to begin with a simple setup. The Virtual Private Cloud (VPC) forms the foundational network construct in AWS, analogous to a data center network. Even though the first picture below looks simple, it contains a lot of complexity under the hood. Each VPC is assigned to an AWS Region, and each region has at least three Availability Zones (AZs). You can think of an AZ as a physical vessel containing compute and network infrastructure within a small geographical area. AWS sticks to these laws of physics and has segmented the VPC into subnets, each living inside a single AZ. However, since a VPC is a regional construct and the network is purely software-defined, you will encounter additional components that are logically stretched across multiple AZs. One example is a route table, which can be attached to one or more subnets inside a VPC.
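
To make the regional-versus-zonal split concrete, here is a minimal boto3 sketch of the setup described above. It is illustrative only: the region, CIDR ranges, and AZ names are assumptions, not a prescription.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# The VPC is a regional construct.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]

# Subnets are zonal: each one lives in exactly one AZ.
subnet_a = ec2.create_subnet(
    VpcId=vpc["VpcId"], CidrBlock="10.0.1.0/24", AvailabilityZone="eu-central-1a"
)["Subnet"]
subnet_b = ec2.create_subnet(
    VpcId=vpc["VpcId"], CidrBlock="10.0.2.0/24", AvailabilityZone="eu-central-1b"
)["Subnet"]

# A route table is a regional object and can be associated with
# subnets in different AZs -- the "logically stretched" component.
rtb = ec2.create_route_table(VpcId=vpc["VpcId"])["RouteTable"]
for subnet in (subnet_a, subnet_b):
    ec2.associate_route_table(
        RouteTableId=rtb["RouteTableId"], SubnetId=subnet["SubnetId"]
    )
```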

While AWS provides its own set of symbols for network representation, this post adopts a different approach. Given the assumed networking background of the audience, AWS networking concepts have been translated into a topology using traditional network components such as routers or switches. This translation serves two primary purposes:

  1. Facilitates easier comprehension for those with conventional networking expertise.

  2. Enables a more effective explanation of AWS network features and limitations by mapping them to familiar networking constructs.

This approach allows for a direct comparison between AWS networking principles and traditional network architectures, highlighting both similarities and key differences.

The VPC is abstracted as a "VPC Router," while subnets are represented as switching devices. Since AWS controls all of the underlying infrastructure and implements its own networking stack, users have no CLI access to the network devices and are limited to the configuration options AWS exposes. Let me summarize the most important facts:


  1. VPC CIDR Ranges: AWS will not stop you from provisioning two VPCs with the same CIDR range. You need to take care of non-overlapping networks yourself when VPCs are interconnected.

  2. Subnet as VRF: A VRF can be seen as a VLAN working at OSI layer 3; in this model, each AWS subnet has its own routing table instance.

  3. Each subnet must fall within one of the CIDR range(s) assigned to the VPC. VPC CIDR ranges are always propagated into each subnet (VRF).

  4. The VPC router always provides the following services for each subnet:

    1. First host address inside the subnet: default gateway

    2. Second host address: DNS server (forwarding requests to a Route 53 Resolver)

    3. Third host address: reserved for future use

    4. Last address inside the subnet: broadcast address (reserved even though broadcast is not supported)

  5. The DNS service is configurable

    1. you can enable/disable the DNS server

    2. you can enable/disable the automatic management of public DNS entries for workloads inside your VPC.

  6. By default, AWS has a source/destination check active on each ENI. This means AWS will drop a packet if neither its source nor its destination matches one of the IP addresses assigned to your workload. If you want to run a routing device inside your VPC, you must disable this check on its ENIs (a minimal sketch follows this list).
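
A minimal boto3 sketch of disabling the check for a routing appliance; the ENI and instance IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Disable the source/destination check on the appliance's ENI.
# Without this, AWS drops transit packets whose source and destination
# both differ from the ENI's own addresses.
ec2.modify_network_interface_attribute(
    NetworkInterfaceId="eni-0123456789abcdef0",
    SourceDestCheck={"Value": False},
)

# Alternatively, per instance (applies to its primary ENI):
ec2.modify_instance_attribute(
    InstanceId="i-0123456789abcdef0",
    SourceDestCheck={"Value": False},
)
```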


Whenever a virtual machine (an EC2 instance on AWS) is deployed, it always communicates over a virtual network interface called an "Elastic Network Interface" (ENI). Even though most people associate the ENI with a network interface card, it is more accurately conceptualized as a plug on a switching device. I've chosen the plug analogy because it's possible to move an ENI from one EC2 instance to another (if it's not the primary ENI) running inside the same subnet.
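
To illustrate the "plug" analogy, the following boto3 sketch moves a secondary ENI from one instance to another in the same subnet (all IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Look up the current attachment of the secondary ENI.
eni = ec2.describe_network_interfaces(
    NetworkInterfaceIds=["eni-0123456789abcdef0"]
)["NetworkInterfaces"][0]

# "Unplug" it from its current instance ...
ec2.detach_network_interface(AttachmentId=eni["Attachment"]["AttachmentId"])
ec2.get_waiter("network_interface_available").wait(
    NetworkInterfaceIds=[eni["NetworkInterfaceId"]]
)

# ... and "plug" it into another instance in the same subnet.
ec2.attach_network_interface(
    NetworkInterfaceId=eni["NetworkInterfaceId"],
    InstanceId="i-0fedcba9876543210",
    DeviceIndex=1,
)
```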


Great - so far we're able to run a network on AWS. Now let us explore how this can be extended by introducing internet connectivity:

This picture introduces more complex elements of AWS networking. Let's break it down for network engineers. The Internet Gateway (IGW) attached to a VPC can be seen as a local internet breakout for the VPC. The NAT Gateway service implements what the name indicates: you can choose between a NAT instance (a Linux machine running iptables) or the managed AWS service (where only an ENI, without any additional configuration options, is visible). Also, the number of routing tables has increased from one to three. This routing configuration creates the distinction between private subnets (blue) and public subnets (green).

As you can see, the IGW acts as a router connected to the VPC router, while the NAT Gateway is a routing device connected to a subnet switch. The Internet Gateway router has a 1-to-1 NAT configured for each of our NAT Gateway VPC IPs. I personally like this concept as it allows us to introduce flexible public instances powered by Elastic IPs:
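
The following boto3 sketch wires all of this together: an IGW as the VPC's internet breakout, a public route table pointing at it, and a private route table pointing at a NAT Gateway (the VPC and subnet IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")
vpc_id = "vpc-0123456789abcdef0"            # placeholder
public_subnet = "subnet-0123456789abcdef0"  # placeholder

# Local internet breakout: attach an Internet Gateway to the VPC.
igw = ec2.create_internet_gateway()["InternetGateway"]
ec2.attach_internet_gateway(InternetGatewayId=igw["InternetGatewayId"], VpcId=vpc_id)

# Public route table: default route points at the IGW.
public_rtb = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]
ec2.create_route(
    RouteTableId=public_rtb["RouteTableId"],
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId=igw["InternetGatewayId"],
)

# The managed NAT Gateway lives in a public subnet and needs an EIP.
eip = ec2.allocate_address(Domain="vpc")
natgw = ec2.create_nat_gateway(
    SubnetId=public_subnet, AllocationId=eip["AllocationId"]
)["NatGateway"]

# Private route table: default route points at the NAT Gateway.
private_rtb = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]
ec2.create_route(
    RouteTableId=private_rtb["RouteTableId"],
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=natgw["NatGatewayId"],
)
```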

An Elastic IP (EIP) is a public IP attached to an ENI. The assignment of the public IP can be seen as a 1-to-1 NAT configuration on the Internet Gateway router. Because this mapping lives on the IGW rather than on the workload itself, it's easy to move an EIP between any of our instances running inside the same AWS region, even across AZ boundaries.
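
A short boto3 sketch of that 1-to-1 NAT view, moving an EIP from one ENI to another (IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# Allocate an EIP; it is held at the region level.
eip = ec2.allocate_address(Domain="vpc")

# Associating it with an ENI conceptually programs a 1-to-1 NAT
# entry on the Internet Gateway router.
assoc = ec2.associate_address(
    AllocationId=eip["AllocationId"],
    NetworkInterfaceId="eni-0123456789abcdef0",
)

# Moving the EIP -- even to an instance in another AZ -- is just
# a re-association.
ec2.disassociate_address(AssociationId=assoc["AssociationId"])
ec2.associate_address(
    AllocationId=eip["AllocationId"],
    NetworkInterfaceId="eni-0fedcba9876543210",
)
```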

Flow control - Introducing Load Balancing, Network Access Control Lists (NACL) and Security Groups (SG)

AWS provides several services to manage flow control and achieve the elasticity expected from a cloud provider. Key among these is Load Balancing and automated horizontal scaling, typically implemented through Auto Scaling Groups (ASGs) or integrated within managed services like ECS. AWS offers three main types of load balancers - Network Load Balancer (NLB) at Layer 4, Application Load Balancer (ALB) at Layer 7, and Gateway Load Balancer (GWLB) at Layer 3 - each tailored for different use cases but fundamentally serving the same purpose. This blog post will focus on the ALB. Additionally, AWS provides stateless Network Access Control Lists (NACLs) for managing traffic between subnets within a VPC and stateful Security Groups (SGs) bound to Elastic Network Interfaces (ENIs) for more granular traffic control. Here's the topology:

Application Load Balancer: The Application Load Balancer (ALB) is not owned by your VPC; it is a managed AWS service with access to your VPC via one or more dedicated Elastic Network Interfaces (ENIs). This setup means that AWS manages part of the infrastructure, with a segment integrated into your network. While some configuration options are available to you, much of the underlying complexity is managed by AWS. Key functions of the ALB include Layer 7 load balancing for HTTP traffic, processing HTTP headers to optimize load balancing, supporting Server Name Indication (SNI), enabling TLS termination at the edge, and dynamically scaling ENIs within your VPC based on traffic load.
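
For illustration, here is a minimal boto3 sketch of an internet-facing ALB with TLS termination at the edge; the names, subnet/SG/VPC IDs, and certificate ARN are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Internet-facing ALB; AWS places one ENI per listed subnet into your VPC.
alb = elbv2.create_load_balancer(
    Name="demo-alb",
    Subnets=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    SecurityGroups=["sg-0123456789abcdef0"],
    Scheme="internet-facing",
    Type="application",
)["LoadBalancers"][0]

# Layer-7 target group for the backends.
tg = elbv2.create_target_group(
    Name="demo-targets",
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",
    TargetType="instance",
)["TargetGroups"][0]

# HTTPS listener: TLS terminates on the ALB.
elbv2.create_listener(
    LoadBalancerArn=alb["LoadBalancerArn"],
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:eu-central-1:111122223333:certificate/placeholder"}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)
```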

Did you know that your NLB can preserve your clients source IP when forwarding traffic to a backend service?

Network Access Control Lists: NACLs work much like ACLs in traditional networking. You could say they are a configuration made on the VPC router. As they can be configured per VRF (i.e. per AWS subnet), they are ideal for restricting the default any-to-any reachability created by VPC route propagation. Due to their stateless nature, NACLs require rules for both inbound and outbound traffic.
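
Because of that statelessness, every flow needs a matching pair of rules. A boto3 sketch with placeholder IDs and ranges:

```python
import boto3

ec2 = boto3.client("ec2")
nacl_id = "acl-0123456789abcdef0"  # placeholder

# Inbound: allow HTTPS from a neighboring subnet's range.
ec2.create_network_acl_entry(
    NetworkAclId=nacl_id,
    RuleNumber=100,
    Protocol="6",  # TCP
    RuleAction="allow",
    Egress=False,
    CidrBlock="10.0.2.0/24",
    PortRange={"From": 443, "To": 443},
)

# Outbound: explicitly allow the return traffic on ephemeral ports.
ec2.create_network_acl_entry(
    NetworkAclId=nacl_id,
    RuleNumber=100,
    Protocol="6",
    RuleAction="allow",
    Egress=True,
    CidrBlock="10.0.2.0/24",
    PortRange={"From": 1024, "To": 65535},
)
```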


Security Groups: Security Groups are essential for dynamic scaling, acting as "identity attributes" that can define policies beyond just IP addresses. For example, when an auto-scaling service launches a new resource and attaches it to a load balancer target group, you only need to assign a Security Group that permits traffic from the load balancer's Security Group. Since load balancers can deploy one or more ENIs with non-deterministic IPs in your VPC, Security Groups provide the necessary flexibility for secure communication.
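
A minimal boto3 sketch of this SG-references-SG pattern; the group IDs and port are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# The backend SG admits traffic from anything carrying the ALB's SG.
# No IP addresses are involved, so the load balancer can scale its
# ENIs (with non-deterministic IPs) without any rule changes.
ec2.authorize_security_group_ingress(
    GroupId="sg-0backend00000000000",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        "UserIdGroupPairs": [{"GroupId": "sg-0alb0000000000000"}],
    }],
)
```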

In comparison to other software-defined solutions such as Cisco SDA for branch networking, multiple security groups can be attached to one ENI. This is why I see the security group more as an attribute of an identity rather than the identity itself. Cisco's approach with Security Group Tags and policies (SGACLs) is more limited, as it only allows one group per resource.

AWS PrivateLink - How to connect a VPC to services provided inside the AWS Cloud

AWS hosts more than 200 services on a global footprint. Traffic destined for an AWS service, or for a service hosted inside the AWS network, can be delivered without any connection to the internet. While this sounds logical, it still needs to be implemented inside a VPC. For this, AWS offers the "PrivateLink" service, which allows you to securely access AWS services or your own services hosted in AWS without traversing the internet. A possible implementation is shown here:

AWS offers two primary methods to interact with the PrivateLink network: Gateway Endpoints and Interface Endpoints.


Gateway Endpoints: Suitable for services like S3 or DynamoDB, which use deterministic public IP ranges. These endpoints connect your VPC to AWS PrivateLink via the VPC router, using AWS-maintained prefix lists (routes). You simply attach the prefix list to your VPC router. However, this approach offers less fine-grained control compared to Interface Endpoints, as it doesn't allow for Layer 7 properties like filtering based on the origin ENI or AWS User.
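
A boto3 sketch of a Gateway Endpoint for S3; the VPC ID, route table ID, and region are assumptions:

```python
import boto3

ec2 = boto3.client("ec2")

# AWS installs the S3 prefix list as routes into the listed route
# tables -- "attaching the prefix list to the VPC router".
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-central-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```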

Interface Endpoints: These act like bastion hosts, routing traffic to specific AWS services within the same region. For instance, an Interface Endpoint could route traffic to "secretsmanager.eu-central-1.amazonaws.com." You must provision at least one endpoint per AWS service. For high availability, you can deploy endpoints across multiple Availability Zones; since each endpoint also gets an AZ-specific DNS entry, clients can stay within their own AZ, which reduces cross-AZ traffic costs.

Interface Endpoints for existing services (i.e. all AWS services) depend on Route 53. AWS hosts a private DNS zone for each deployed VPC, which is only resolvable locally. If you decide to use a DHCP option set and override the AWS default DNS server, you need to provide matching DNS entries to preserve the same routing behavior. Explore the picture above for more details.
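
A boto3 sketch of an Interface Endpoint for Secrets Manager, spread over two AZs for high availability (all IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

# PrivateDnsEnabled makes "secretsmanager.eu-central-1.amazonaws.com"
# resolve to the endpoint ENIs via the VPC-private Route 53 zone.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-central-1.secretsmanager",
    VpcEndpointType="Interface",
    SubnetIds=["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
```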

AWS Transit Gateway - How to interconnect VPCs and external networks

We've covered key elements of AWS connectivity, but larger organizations often need to connect hundreds of VPCs and external networks, such as branch offices and remote workers. AWS Transit Gateway serves as a central hub, functioning similarly to a route reflector. It enables you to connect various networks, including VPCs, VPNs, and Direct Connect (a Layer 2 connection between AWS and a customer or cloud exchange provider), through Transit Gateway attachments. Understanding two core functions is crucial:


Transit Gateway Association: This is similar to connecting your network device to the Transit Gateway. Each Transit Gateway attachment can be associated with one routing table or none. The association determines which routing table within the Transit Gateway will handle traffic from the peered network device.

Transit Gateway Propagation: Transit Gateway supports multiple routing tables, allowing you to manage route visibility. Propagation advertises routes learned from a Transit Gateway attachment into a routing table. You can configure propagations for multiple routing tables within the same Transit Gateway, enabling granular control over route distribution.
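
The following boto3 sketch shows both operations against a single Transit Gateway routing table (the route table and attachment IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")
tgw_rtb = "tgw-rtb-0123456789abcdef0"        # placeholder
attachment = "tgw-attach-0123456789abcdef0"  # placeholder

# Association: traffic arriving from this attachment is looked up
# in exactly this routing table.
ec2.associate_transit_gateway_route_table(
    TransitGatewayRouteTableId=tgw_rtb,
    TransitGatewayAttachmentId=attachment,
)

# Propagation: the attachment's routes are advertised into the
# routing table; repeat against other tables for granular visibility.
ec2.enable_transit_gateway_route_table_propagation(
    TransitGatewayRouteTableId=tgw_rtb,
    TransitGatewayAttachmentId=attachment,
)
```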


To further illustrate how Transit Gateway works, the following topology is provided:

The Transit Gateway functions as a highly available logical router. VPN, Direct Connect, and Connect attachments exchange routes with it via BGP, while VPC attachments propagate their CIDR ranges statically. The connected network learns all routes from the associated Transit Gateway route table; the Transit Gateway learns all routes you advertise or propagate.

BGP properties cannot be changed on the Transit Gateway side. However, it is still possible to configure your side; for example, AS-path prepending can be used to steer traffic. If you run a bigger network (especially with the Direct Connect service), I recommend reviewing the quotas to prevent issues. Example: AWS will tear down a BGP session if you advertise more routes than the quota allows.

For VPN connections, AWS requires some on-premises information such as your public IP and BGP ASN. This information is stored in the "customer gateway" object in your AWS console. In 2020, AWS launched a new attachment type called the "Connect attachment". Since it serves a special use case, I want to describe the implementation with the following topology:

Hosting firewalls or routing instances, such as SD-WAN routers, within AWS is common. This raises the question: why use a VPN to connect AWS-hosted appliances to the Transit Gateway? A VPN requires IPsec encryption, which can be costly (more compute power and more expensive traffic). The Connect attachment addresses this by using an existing VPC attachment as transport. This enables a transition from IPsec to GRE and makes encryption optional. The more direct connection to the Transit Gateway improves router performance and network capacity, offering higher throughput: Connect attachments support up to 5 Gbps per GRE tunnel, compared to the 1.25 Gbps limit of VPN tunnels.
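
A boto3 sketch of a Connect attachment layered on an existing VPC attachment, plus one GRE/BGP Connect peer; the IDs, addresses, and ASN are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# The Connect attachment rides on an existing VPC attachment as
# transport and speaks GRE instead of IPsec.
connect = ec2.create_transit_gateway_connect(
    TransportTransitGatewayAttachmentId="tgw-attach-0123456789abcdef0",
    Options={"Protocol": "gre"},
)["TransitGatewayConnect"]

# The Connect peer terminates the GRE tunnel on the SD-WAN/router
# appliance and runs BGP over the inside CIDR block.
ec2.create_transit_gateway_connect_peer(
    TransitGatewayAttachmentId=connect["TransitGatewayAttachmentId"],
    PeerAddress="10.0.1.10",                # appliance's VPC IP
    InsideCidrBlocks=["169.254.100.0/29"],  # link-local /29 for BGP
    BgpOptions={"PeerAsn": 65010},
)
```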


Direct Connect - A Layer 2 service into the AWS Backbone

In the past, I've used AWS Direct Connect to link corporate networks with the AWS backbone through third-party cloud exchange providers like Equinix. Direct Connect offers highly available, high-throughput connections, but comes with a significant cost. Given the price, enterprises should consider alternative solutions unless they require 100 Gbps or more. The complexity of implementing Direct Connect is better suited for advanced use cases, which I may cover in a future post. For now, let's not delve deeper into Direct Connect.


How AWS Networking works under the hood (optional content for interested readers)

With all the knowledge above, you may wonder how this "logical" topology maps to reality. The good news: AWS has shared some knowledge about its network in re:Invent sessions. Let me try to explain the most important parts to you.

First of all, the AWS network consists of a vast number of physical servers per AZ. These servers also have embedded network elements. You could see the AWS backbone as a huge distributed system. Within this framework, the "VPC" is a logical construct that spans all available AWS hardware. Consequently, EC2 instances within the same VPC are typically distributed across different physical hosts.

AWS uses plain IP packets for transport across physical hosts but has developed its own networking stack and packet encapsulation to enhance routing and switching efficiency. This custom approach enables optimized packet handling. AWS's networking architecture includes four key components, which I’ve illustrated in the following picture:


Physical Host and Mapping Service: The physical host is the heart of AWS; it is the device on which, for example, EC2 instances are hosted. Every physical host needs to be aware of AWS-native constructs such as VPCs, so it is necessary to "share" state between all physical devices within the same AZ or even Region. The mapping service maintains and shares this state information, serving as the single source of truth within the AWS ecosystem.


HyperPlane Node: These nodes are highly performant NATting devices and are also hosted on the physical hosts explained above. HyperPlane nodes are used for various AWS services, including NAT Gateway, Network Load Balancer (NLB), and Elastic File System (EFS), among others.


Blackfoot Edge Device: Since AWS uses its own network stack, the Blackfoot edge devices connect all outside networks with the AWS network. They are responsible for routing traffic from/to the internet, Direct Connect, or services like S3.


Closing Statement

This post described the basic elements of AWS networking and their functions. In addition to the usual AWS architecture views, I have also provided an abstraction in the form of a traditional network topology. The combination of service explanations and diagrams should help any network engineer entering the AWS world with their first projects. Beyond the basic elements explained here, AWS offers further network-related services such as CloudFront, a content delivery network. The last section showed how AWS works under the hood to enhance the learning experience even more. I hope you have enjoyed reading this post - feel free to give me a thumbs up if you like what I have written.
