02a-AWS

AWS (Amazon Web Services):

AWS Cloud Computing:

Six advantages of cloud computing:

Trade capital expense (CAPEX) for operational expense (OPEX) (Trade fixed expense for variable expense):
- Pay On-Demand: don’t own hardware;
- Reduced Total Cost of Ownership (TCO) & Operational Expense (OPEX);
Benefit from massive economies of scale:
- Prices are reduced as AWS is more efficient due to large scale;
Stop guessing capacity:
- Scale based of actual measured usage;
Increase speed and agility;
Stop spending money running and maintaining data centers;
Go global in minutes: leverage the AWS global infrastructure.

Problems solved by the cloud:

Flexibility: change resource types when needed;
Cost-Effectiveness: pay as you go, for what you use;
Scalability: accommodate larger loads by making hardware stronger or adding additional nodes;
Elasticity: ability to scale out and scale-in when needed;
High-availability and fault-tolerance: build across data centers;
Agility: rapidly develop, test and launch software applications.

Well-Architected Framework:

General Guiding Principles:

Stop guessing your capacity needs;
Test systems at production scale;
Automate to make architectural experimentation easier;
Allow for evolutionary architectures;
- Design based on changing requirements;
Drive architectures using data;
Improve through game days;
- Simulate applications for flash sales days.

AWS Cloud Best Practices - Design Principles:

Scalability: vertical & horizontal;
Disposable Resourcs: servers should be disposable & easily configured;
Automation: Serverless, Infrastructure as a Service, Auto Scaling..;
Loose Coupling: change or failure in one component should not cascade to other;
Services, not Servers: use managed services, databases and serverless, instead of just EC2.

Well-Architected Framework (6 Pillars):

Operational Excellence: Prepare, Operate, Evolve. Includes the ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures:
- Perform operations as code (IaaS);
- Make frequent, small, reversible changes: So in a case of failure, it can be reversed;
- Refine operations procedures frequently: Ensure that team members are familiar with it;
- Anticipate failure;
- Learn from all operational failures;
- Use managed services: to reduce operational burden;
- Implement observability for actionable insights: performance, reliability, cost, etc;
Security: Identity and Access Management, Detective Controls, Infrastructure Protection, Data Protection, Incident Response. Includes the ability to protect information, systems and assets while delivering business value through risk assessments and mitigation strategies:
- Implement a strong identity foundation: Centralize privilege management and reduce reliance on long-term credentials. Principle of least privilege;
- Enable traceability: Integrate logs and metrics with systems to automatically respond and take action;
- Apply security at all layers: Like edge network, VPC, subnet, load balancer, every instance, operating system and application;
- Automate security best practices;
- Protect data in transit and at rest: Encryption, tokenization and access control;
- Keep people away from data: Reduce incident response simulations and use tools with automation to increase your speed for detection, investigation and recovery;
- Shared Responsibility Model;
Reliability: Foundations, Change Management, Failure Management. Ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand and mitigate disruptions such as misconfigurations or transient network issues:
- Test recovery procedures: Use automation to simulate different failures or to recreate scenarios that led to failures before;
- Automatically recover from failure: Anticipate and remediate failures before they occur;
- Scale horizontally to increase aggregate system availability: Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure;
- Stop guessing capacity: Maintain the optimal level to satisfy demand without over or under provisioning: use auto scaling;
- Manage change in automation: Use automation to make changes to infrastructure;
Performance Efficiency: Selection, Review, Monitoring, Tradeoffs. Includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve:
- Democratize advanced technologies: Advance technologies become services and hence you can focus more on product development;
- Go global in minutes: Easy deployment in multiple regions;
- Use serverless architectures: Avoid burden of managing servers;
- Experiment more often: Easy to carry our comparative testing;
- Mechanical sympathy: Be aware of all AWS services;
Cost Optimization: Expenditure Awareness, Cost Efective Resources, Matching supply and demand, Optimizing Over Time. Includes the ability to run systems to deliver business value at the lowest price point:
- Adopt a consumption mode: Pay only for what you use;
- Measure overall efficiency: use CloudWatch;
- Stop spending money on data center operations: AWS does the infrastructure part and enables customer to focus on organization projects;
- Analyze and attribute expenditure: Accurate identification of system usage and costs, helps measure return on investment (ROI): Make sure to use tags;
- Use managed and application level services to reduce cost of ownership: As managed services operate at cloud scale, they can offer a lower cost per transaction or service;
Sustainability: EC2 Auto Scaling, Lambda and Fargate, Cost Explorer, EC2 Spot Instances, EFS-IA, Amazon S3 Glacier, Amazon Data Lifecycle Manager, Read Local, Write Global (RDS Replical, DynamoDB Global table, CloudFront). The sustainability pillar focuses on minimizing the environmental impact of running cloud workloads:
- Understand your impact: establish performance indicators, evaluate improvements;
- Establish sustainability goals: Set long-term goals for each workload, model return on investments (ROI);
- Maximize utilization: Right size each workload to maximize the energy efficiency of the underlying hardware and minimize idle resources;
- Anticipate and adopt new, more efficient hardware and software offering: and design for flexibility to adopt new technologies over time;
- Use managed services: Shared services reduce the amount of infrastructure; Managed services help automate sustainability best practices as moving infrequent accessed data to cold storage and adjusting compute capacity;
- Reduce the downstream impact of your cloud workloads: Reduce the amount of energy or resources required to use your services and reduce the need for your customers to upgrade their devices. They are not something to balance or trade-offs, they are a synergy.

AWS Customer Carbon Footprint Tool: track, measure, review and forecast the Carbon emissions generated from your AWS usage. Helps you meet your own sustainability goals.

AWS Cloud Adoption Framework (AWS CAF) helps you build and then execute a comprehensive plan for your digital transformation through innovating use of AWS.

AWS Cloud Adoption Framework (AWS CAF)

The AWS Cloud Adoption Framework (AWS CAF) leverages AWS experience and best practices to help you digitally transform and accelerate your business outcomes through innovative use of AWS. AWS CAF identifies specific organizational capabilities that underpin successful cloud transformations. These capabilities provide best practice guidance that helps you improve your cloud readiness. Six perspectives:

Business Perspective: helps ensure that your cloud investments accelerate your digital transformation abitions and business outcomes;
People Perspective: serve as a bridge between technology and business, accelerating the cloud journey to help organizations more rapidly evolve to a culture of continuous growth, learning and where change becomes business-as-normal, with focus on culture, organizational structure, leadership and workforce;
Governance Perspective: helps you orchestrate your cloud initiatives while maximizing organizational benefits and minimizing transformation-related risks;
Platform Perspective: helps you build an enterprise-grade, sclable, hybrid cloud platform; modernize existing workloads; and implement new cloud-native solutions;
Security Perspective: helps you achive the confidentiality, integrity and availability of your data and cloud workloads;
Operations Perspective: helps ensure that your cloud services are delivered at a level that meets the needs of your business.

AWS CAF - Transformation Domains:

Technology: using the cloud to migrate and modernize legacy infrastructure, aplications, data and analyticas platforms;
Process: digitizing, automating and optimizing your business operations:
- Leveraging new data ad analytics platforms to create actionable insights;
- Using machine learning (ML) to improve your customer service experience;
Organization: Reimagining your operating model:
- Organizing your teams around products and value streams;
- leveraging agile methods to rapidly iterate and evolve;
Product: reimagining your business model by creating new value propositions (products & services) and revenue models.

AWS Right sizing: is the process of matching instance types and sizes to your workload performance and capacity requirements at lowest possible cost.

AWS Professional Services & Partner Network

APN Technology Partners: Independent Software Vendors (ISVs), tools providers, platform providers, and others;
APN Consulting Partners: System Integrators (SIs), agencies, consultancies, Managed Service Providers (MSPs), and others;
APN Training Partners: a breadth of AWS Training options for learners of all levels, also provides classroom and digital offerings, and live instructors, on-demand courses. Finds who can help you learn AWS.

AWS IQ: quickly find professioal help for your AWS projects. Engage and pay AWS Certified third-party experts for on-demand project work. Video-conferencing, contract management, secure collaboration, integrated billing.

AWS re:Post: AWS-managed Q&A service.

AWS Managed Services (AMS) provides infrastructure and application support on AWS. Offers a team of AWS experts who manage and operate your infrastructure for security, reliability and abailability.

AWS Global Infrastructure:

AWS Regions: is a physical location around the world where we cluster data centers. We call each group of logical data centers an Availability Zone. Each AWS Region consists of a minimum of three, isolated, and physically separate AZs within a geographic area.

AWS has Regions all around the world;
Names can be us-east-1, eu-west-2..;
A region is a cluster of data centers;
Most AWS services are region-scoped.

How to choose an AWS Region:

Compliance with data governnance and legal requirements: data never leaves a region without your explicit permission;
Proximity to customers: reduced latency;
Available services with a Region: new services and new features aren’t available in every Region;
Pricing: pricing varies region to region and is transparent in the service pricing page.

AWS Availability Zones (AZ): is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. AZs give customers the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center.

Each region has many availability zones (min 3, max 6): ap-southeast-2a, ap-southeast-2b, ap-southeast-2c;
Each availability zone (AZ) is one or more discrete data centers with redundant power, networking and connectivity;
They’re separate from each other, so that they’re isolated from disasters;
They’re connected with high bandwidth, ultra-low latency networking.

AWS Local Zone location is an extension of an AWS Region where you can run your latency sensitive applications using AWS services such as Amazon Elastic Compute Cloud, Amazon Virtual Private Cloud, Amazon Elastic Block Store, Amazon File Storage, and Amazon Elastic Load Balancing in geographic proximity to end-users.

AWS Edge Locations (Point of Presence): is a site that Amazon CloudFront uses to store cached copies of your content closer to your customers for faster delivery.

Amazon has 400+ Points of Presence (400+ Edge Locations & 10+ Regional Caches) in 90+ cities across 40+ countries;
Content is delivered to end users with lower latency.

AWS WaveLenght are infrastructure deplaoyments, embedded within the telecommunications providers’ datacenters at the edge of the 5G networks.

AWS Outposts are “server racks” that offers the same AWS infrastructure, services, APIs & tools to build your own applications on-premises just as in the cloud. AWS will setup and manage “Outpost racks” within your on-premises infrastructure. Customer is responsible of the Outposts Rack physical security.

Low-latency access to on-premises systems;
Local data processing;
Data residency;
Easier migration from on-premises to the cloud;
Fully managed service.
Supported services on Outposts: EC2, EBS, S3, EKS, ECS, RDS, EMR

Service: AWS offers a broad set of global cloud-based products including compute, storage, database, analytics, networking, machine learning and AI, mobile, developer tools, IoT, security, enterprise applications, and much more.

Tools to access AWS Services:

AWS Management Console: WEB-UI management tool (protected by password + MFA);
AWS CLI: direct access to the public APIs of AWS services via command-line shell;
AWS SDK: language specific APIs (set of libraries), enables access and manage AWS services programmatically (embedded in application).

AWS Shared Responsibility Model:

Customer responsibility for the security IN the cloud: Customers are responsible for the security of everything that they create and put in the AWS Cloud;
- EC2 instances:
  - Customer is responsible for management of the guest OS (including security patches and updates), firewall & network configuration, IAM;
  - Encrypting application data.
- S3:
  - Bucket configuration;
  - Bucket policy / public setting;
  - IAM user and roles;
  - Enabling encryption.
AWS responsibility for the security OF the cloud: AWS operates, manages, and controls the components at all layers of infrastructure;
- Protecting infrastructure (hardware, software, facilities and networking):
- Managed services (like S3, DynamoDB, RDS, etc);
- S3:
  - Guarantee unlimited storage;
  - Guarantee encryption;
  - Ensure separation of the data between different cutomers;
  - Ensure AWS employees can’t access customer’s data.
Shared controls:
- Patch management, Configuration Management, Awareness & Training.

AWS Identity and Access Management:

IAM (Identity and Access Management) enables you to securely control access to Amazon Web Services services and resources for your users.
Users are people within your organization, and can be grouped. Users don’t have to belong to a group, and user can belong to multiple groups. Groups only contain users, not other groups. Root privileges has complete access to all AWS services and resources. Root account created by default.

Actions that can be performed only by the root user:

Change account settings (account name, email address, root user password, root user access keys);
View certain tax invoices;
Close your AWS account;
Restore IAM user permissions;
Change or cancel AWS Support plan;
Register as a seller in the Reserved Instance Marketplace;
Configure an Amazon S3 bucket to enable MFA;
Edit or delete an Amazon S3 bucket policy that includes an invalid VPC ID or VPC endpoint ID;
Sign up for GovCloud.

Policies define the permissions of the users and groups described in JSON documents.

IAM Roles for services set of permission attached to some AWS services to perform actions on your behalf (EC2, Lambda, CloudFormation).

IAM Policy is a JSON document that defines permissions. ┌─────────────────────────────────────────────────────────────┐ │ IAM POLICY │ ├─────────────────────────────────────────────────────────────┤ │ Version (Required) - Policy language version │ │ Id (Optional) - Policy identifier │ │ Statement (Required) - Array of permission blocks │ │ ├── Sid (Optional) - Statement ID │ │ ├── Effect (Required) - “Allow” or “Deny” │ │ ├── Principal (Required*) - Who the policy applies to │ │ ├── Action (Required) - What actions are permitted │ │ ├── Resource (Required) - Which resources are affected │ │ └── Condition (Optional) - When the policy applies │ └─────────────────────────────────────────────────────────────┘

Principal is required for resource-based policies, not identity-based

Policy Types:

Identity-based - Attached to users, groups, or roles (no Principal needed)
Resource-based - Attached to resources like S3 buckets (Principal required)

Best IAM practices:

Root account shouldn’t be used or shared;
In IAM Policies apply the least privilege principle: don’t give more permissions than a user needs;
Create individual IAM users for each person who needs to access AWS. One user - one physical user;
Assign users to groups and assign permissions to Groups;
IAM roles are ideal for situations in which access to services or resources needs to be granted temporarily, instead of long-term.

IAM Credentials Report (account-level) a report that lists all your account’s users and the status of their various; IAM Access Advisor (user-level) - access advisor shows the service permissions granted to a user and when those services were last accessed. Can be used to revise policies.

AWS Resource Access Manager (AWS RAM) helps you securely share your resources across AWS accounts, within your organization or organizational units (OUs) and with IAM roles and users for supported resource types (Aurora, VPC Subnets, Transit Gateway, Route53, EC2 Dedicated Hosts, License Manager Configurations, etc). Avoid resource duplication.

AWS Service Catalog self-portal to launch a set of authorized products pre-defined by admins.

AWS STS (Security Token Service): enables you to create temporary, limited-privileges credentials to access your AWS resources.

STS API	Use Case
`AssumeRole`	Cross-account access, or same-account role assumption
`AssumeRoleWithSAML`	Users logged in with SAML (corporate IdP)
`AssumeRoleWithWebIdentity`	Users logged in with IdP (Facebook, Google, OIDC) — prefer Cognito instead
`GetSessionToken`	MFA for root or IAM user
`GetFederationToken`	Temporary credentials for federated user

Session Policies: Optional policy passed when calling AssumeRole — further restricts the role’s permissions for that session only.

Amazon Cognito:

Component	Purpose
User Pools	User directory for sign-up/sign-in, returns JWT tokens
Identity Pools	Exchange tokens for temporary AWS credentials (access AWS services)

User → Cognito User Pool → JWT Token → Cognito Identity Pool → AWS Credentials → AWS Services

⚠️ Exam trap: User Pools = authentication (who are you?), Identity Pools = authorization (AWS access)

IAM Access Analyzer:

Identifies resources shared with external entities (S3, IAM roles, KMS, Lambda, SQS)
Validates IAM policies against best practices
Generates policies based on access activity (least privilege)
Zone of trust = Organization or Account

AWS Directory Services:

Service	Users Stored	On-Prem Connection	Use Case
AWS Managed Microsoft AD	In AWS	Two-way trust	Full AD features, MFA, trust with on-prem
AD Connector	On-prem only	Proxy (no trust)	Keep users on-prem, redirect auth
Simple AD	In AWS	❌ Cannot	Basic AD, standalone, no on-prem

┌─────────────────────────────────────────────────────────────────────────────┐
│                       AWS Directory Services                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  1. AWS Managed Microsoft AD (two-way trust)                                │
│                                                                             │
│       ┌──────────┐      trust       ┌──────────────────┐                    │
│  auth │          │◄────────────────►│  AWS Managed AD  │ auth               │
│  ◄────┤ On-Prem  │                  │       [MS]       ├────►               │
│       │    AD    │                  └──────────────────┘                    │
│       └──────────┘                                                          │
│                                                                             │
│  2. AD Connector (proxy only - NO users stored in AWS)                      │
│                                                                             │
│       ┌──────────┐      proxy       ┌──────────────────┐                    │
│       │          │◄────────────────►│   AD Connector   │ auth               │
│       │ On-Prem  │                  │       [⚡]        ├────►               │
│       │    AD    │                  └──────────────────┘                    │
│       └──────────┘                                                          │
│                                                                             │
│  3. Simple AD (standalone - NO on-prem connection)                          │
│                                                                             │
│                                     ┌──────────────────┐                    │
│                          ❌         │    Simple AD     │ auth               │
│              (no on-prem)           │       [DB]       ├────►               │
│                                     └──────────────────┘                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

⚠️ Exam trap: AD Connector is just a proxy — it does NOT store users, only redirects authentication to on-prem AD.

AWS Organizations (Global service):

┌─────────────────────────────────────────────────────────────────────┐
│                    Root Organizational Unit (OU)                    │
│  ┌────────────────┐                                                 │
│  │  Management    │  ← Full admin power, SCPs do NOT apply here     │
│  │  Account       │                                                 │
│  └────────────────┘                                                 │
│                                                                     │
│  ┌──────────────────────┐      ┌──────────────────────────────────┐ │
│  │     OU (Dev)         │      │          OU (Prod)               │ │
│  │  ┌────┐  ┌────┐      │      │  ┌────┐  ┌────┐                  │ │
│  │  │Acct│  │Acct│      │      │  │Acct│  │Acct│                  │ │
│  │  └────┘  └────┘      │      │  └────┘  └────┘                  │ │
│  │   Member Accounts    │      │  ┌────────────┐ ┌──────────────┐ │ │
│  └──────────────────────┘      │  │  OU (HR)   │ │ OU (Finance) │ │ │
│                                │  │ ┌──┐ ┌──┐  │ │  ┌──┐ ┌──┐   │ │ │
│                                │  │ │  │ │  │  │ │  │  │ │  │   │ │ │
│                                │  │ └──┘ └──┘  │ │  └──┘ └──┘   │ │ │
│                                │  └────────────┘ └──────────────┘ │ │
│                                └──────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Member accounts can only be part of ONE organization
Consolidated Billing - single payment method, volume discounts
Shared Reserved Instances and Savings Plans across accounts
API available to automate account creation

Consolidated Billing Benefits:

Combined Usage across all accounts → volume pricing discounts
One Bill for all accounts in the Organization
Pooling of Reserved EC2 instances for optimal savings

Multi-Account Strategies:

Account per department / cost center / dev-test-prod
Regulatory restrictions enforcement (via SCP)
Resource isolation (separate VPCs)
Separate per-account service limits
Isolated logging account (CloudTrail → central S3, CloudWatch → central account)

Service Control Policies (SCP):

Whitelist or blacklist IAM actions at OU or Account level
Must have explicit Allow from root → through each OU → to target account
Affects ALL Users and Roles in account, including Root user
Does NOT apply to Management Account (full admin power always)
Does NOT affect service-linked roles

⚠️ Exam trap: SCPs don’t affect Management Account — if question asks “restrict ALL accounts”, Management Account is still unrestricted!

⚠️ Exam trap: Service-linked roles are NOT affected by SCPs — they always work!

What SCPs CANNOT do:

❌ Restrict the Management Account
❌ Affect service-linked roles
❌ Grant permissions (only restrict what’s already allowed)
❌ Affect actions in the Management Account itself

AWS Organizations – Tag Policies:

Standardize tags across resources in an AWS Organization
Define allowed tag keys and values
Prevent non-compliant tagging operations (no effect on resources without tags)
Generate compliance reports for tagged/non-compliant resources
Use EventBridge to monitor non-compliant tags
Helps with Cost Allocation Tags and Attribute-Based Access Control

IAM Conditions - restrict API calls based on:

Condition Key	Purpose	Example
`aws:SourceIp`	Restrict by client IP	Only allow from corporate IP range
`aws:RequestedRegion`	Restrict by region	Only allow eu-west-1 API calls
`ec2:ResourceTag`	Restrict based on tags	Only manage EC2 with tag “Env=Dev”
`aws:MultiFactorAuthPresent`	Force MFA	Require MFA for sensitive actions

⚠️ Exam trap: Fake condition keys! Only these are real:

✅ aws:RequestedRegion, aws:SourceIp, aws:SourceVpc, aws:SourceVpce
❌ aws:SourceRegion, aws:Region, ec2:SourceRegion — DON’T EXIST

S3 Bucket Policies vs IAM Policies:

Aspect	IAM Policy	S3 Bucket Policy
Attached to	User/Group/Role	S3 Bucket
Cross-account	Requires role assumption	Direct access via Principal
Use case	User-centric permissions	Resource-centric, public access, cross-account

S3 Access Decision Logic:

IAM Policy ALLOWS  +  S3 Bucket Policy ALLOWS  →  ACCESS ✅
IAM Policy ALLOWS  +  S3 Bucket Policy (silent) →  ACCESS ✅
IAM Policy (silent) +  S3 Bucket Policy ALLOWS  →  ACCESS ✅  (if same account)
IAM Policy DENIES   OR  S3 Bucket Policy DENIES →  DENIED ❌

Cross-account: BOTH must explicitly Allow

Common S3 Policy Conditions:

Condition	Purpose
`aws:SourceIp`	Restrict by IP range
`aws:SourceVpce`	Restrict to specific VPC endpoint
`aws:SourceVpc`	Restrict to specific VPC
`s3:x-amz-acl`	Control ACL settings
`s3:x-amz-server-side-encryption`	Require encryption
`aws:SecureTransport`	Require HTTPS (deny HTTP)

Example - Require HTTPS:

{
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": "arn:aws:s3:::bucket/*",
  "Condition": {
    "Bool": { "aws:SecureTransport": "false" }
  }
}

⚠️ Exam trap: "Principal": "*" = anonymous access. "Principal": {"AWS": "*"} = any authenticated AWS user.

⚠️ Exam trap: Cross-account S3 access — bucket policy must explicitly allow the external principal AND the external account needs IAM permissions.

⚠️ Exam trap: S3 ARN patterns matter!

arn:aws:s3:::bucket → Bucket-level actions (ListBucket, GetBucketLocation)
arn:aws:s3:::bucket/* → Object-level actions (GetObject, PutObject, DeleteObject)
Missing /* = Access Denied for object operations!

IAM Roles vs Resource-Based Policies (Cross-Account Access):

Two ways to access S3 in another account:

Option 1: Role as Proxy (AssumeRole)
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│    User      │─────►│    Role      │─────►│   Amazon S3  │
│  Account A   │      │  Account B   │      │  Account B   │
└──────────────┘      └──────────────┘      └──────────────┘
                      (become this role,
                       lose Account A perms)

Option 2: Resource-Based Policy (S3 Bucket Policy)
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│    User      │─────►│  S3 Bucket   │─────►│   Amazon S3  │
│  Account A   │      │   Policy     │      │  Account B   │
└──────────────┘      └──────────────┘      └──────────────┘
                      (grants access to
                       Account A user directly,
                       keeps Account A perms)

Aspect	Assume Role	Resource-Based Policy
Permissions	Give up original, take role’s	Keep original + gain resource access
Use case	Need full different identity	Need BOTH source and target access

Example: User in Account A needs to scan DynamoDB in Account A AND dump to S3 in Account B → Use resource-based policy on S3 (keeps DynamoDB permissions)

EventBridge Target Permissions:

Target	Policy Type	Why
Lambda	Resource-based	Lambda can define “who invokes me”
SNS	Resource-based	SNS can define “who publishes to me”
SQS	Resource-based	SQS can define “who sends to me”
S3	Resource-based	S3 can define “who writes to me”
API Gateway	Resource-based	API GW can define “who calls me”
Kinesis	IAM Role	No invoke policy — need role
EC2 Auto Scaling	IAM Role	No invoke policy — need role
ECS Task	IAM Role	No invoke policy — need role
SSM Run Command	IAM Role	No invoke policy — need role

Memory trick: “Can the target say WHO is allowed to invoke it?”

YES → Resource-based policy (Lambda, SNS, SQS, S3, API Gateway)
NO → EventBridge needs IAM Role to assume

Memory hook: “SLSS + API GW” = Resource-based (SNS, Lambda, SQS, S3, API Gateway) Memory hook: “KEES” = IAM Role needed (Kinesis, EC2 Auto Scaling, ECS, SSM)

⚠️ Exam trap: Lambda = resource-based, Kinesis = IAM role. Don’t mix them up!

IAM Permission Boundaries:

Supported for users and roles only (NOT groups)
Sets MAXIMUM permissions an IAM entity can get
Effective permissions = intersection of Identity Policy ∩ Permission Boundary ∩ SCP

┌───────────────────────────────────────────────────────────┐
│                                                           │
│      ┌─────────────┐         ┌─────────────────┐          │
│      │Organizations│         │   Permissions   │          │
│      │    SCP      │         │    Boundary     │          │
│      │      ✓      │    ✓    │       ✓         │          │
│      │         ┌───┴─────────┴───┐             │          │
│      │         │                 │             │          │
│      └─────────┤   Effective     ├─────────────┘          │
│                │   Permissions   │                        │
│      ┌─────────┤       ✓         ├─────────────┐          │
│      │         │                 │             │          │
│      │         └───┬─────────────┘             │          │
│      │  Identity   │         ✓                 │          │
│      │   Policy    │                           │          │
│      │      ✓      │                           │          │
│      └─────────────┘                           │          │
│                                                           │
└───────────────────────────────────────────────────────────┘

Use Cases:

Delegate IAM user creation to non-admins (within boundaries)
Allow developers to self-assign policies without privilege escalation
Restrict specific user (vs SCP restricts whole account)

⚠️ Exam trap: Permission Boundaries do NOT apply to groups! Only users and roles.

IAM Policy Evaluation Logic (order matters):

                         ┌─────────────────┐
                         │  Start: DENY    │
                         └────────┬────────┘
                                  ▼
                    ┌─────────────────────────┐
                    │   Explicit Deny?        │──── YES ────► DENY ❌
                    └─────────────┬───────────┘
                                  │ NO
                                  ▼
                    ┌─────────────────────────┐
                    │   In Org with SCP?      │──── NO ─────► Skip to Resource-Based
                    └─────────────┬───────────┘
                                  │ YES
                                  ▼
                    ┌─────────────────────────┐
                    │   SCP Allows?           │──── NO ─────► DENY ❌ (implicit)
                    └─────────────┬───────────┘
                                  │ YES
                                  ▼
                    ┌─────────────────────────┐
                    │   Resource-Based        │──── ALLOW + same account ──► ALLOW ✅
                    │   Policy Allows?        │
                    └─────────────┬───────────┘
                                  │ (continue if cross-account or no resource policy)
                                  ▼
                    ┌─────────────────────────┐
                    │   Identity Policy       │──── NO ─────► DENY ❌ (implicit)
                    │   Allows?               │
                    └─────────────┬───────────┘
                                  │ YES
                                  ▼
                    ┌─────────────────────────┐
                    │   Permission Boundary   │──── NO ─────► DENY ❌ (implicit)
                    │   Allows? (if exists)   │
                    └─────────────┬───────────┘
                                  │ YES
                                  ▼
                    ┌─────────────────────────┐
                    │   Session Policy        │──── NO ─────► DENY ❌ (implicit)
                    │   Allows? (if exists)   │
                    └─────────────┬───────────┘
                                  │ YES
                                  ▼
                         ┌─────────────────┐
                         │   ALLOW ✅      │
                         └─────────────────┘

Key Rules:

Explicit Deny ALWAYS wins (checked first)
Must have Allow at EVERY applicable level
Same-account: Resource-based Allow alone can grant access
Cross-account: BOTH sides must Allow

AWS IAM Identity Center (successor to AWS SSO):

One login for: AWS accounts, Business apps (Salesforce, Box, M365), SAML2.0 apps, EC2 Windows
Identity providers: Built-in store, Active Directory, OneLogin, Okta

Active Directory Integration:

Option 1: AWS Managed Microsoft AD (out-of-box integration)
┌──────────────────┐              ┌─────────────────────┐
│  IAM Identity    │───connect───►│ AWS Managed         │
│  Center          │              │ Microsoft AD        │
└──────────────────┘              └─────────────────────┘

Option 2: Self-Managed AD (two approaches)
┌──────────────────┐     ┌─────────────────┐    two-way trust    ┌────────────┐
│  IAM Identity    │─────│ AWS Managed     │◄──────────────────►│ On-Prem AD │
│  Center          │     │ Microsoft AD    │                     └────────────┘
└────────┬─────────┘     └─────────────────┘
         │
         │               ┌─────────────────┐       proxy          ┌────────────┐
         └───────────────│  AD Connector   │◄────────────────────►│ On-Prem AD │
                         └─────────────────┘                      └────────────┘

Permission Sets: Collection of IAM Policies assigned to users/groups for AWS access ABAC: Fine-grained permissions based on user attributes (cost center, title, locale)

AWS Control Tower - multi-account governance:

Runs on top of AWS Organizations
Automates environment setup and ongoing policy management

Guardrails (ongoing governance):

Type	Implementation	Example
Preventive	SCPs	Restrict regions across all accounts
Detective	AWS Config	Identify untagged resources

Detective Guardrail Flow:

┌─────────────────────────────────────────────────────────────────────┐
│  AWS Control Tower                                                  │
│  ┌─────────────┐                                                    │
│  │ Guardrail   │ trigger                                            │
│  │ (Detective) ├───────────►┌─────┐ notify  ┌───────┐               │
│  │             │(NON_COMPLIANT)│ SNS │────────►│ Admin │             │
│  │ AWS Config  │             └──┬──┘         └───────┘              │
│  └──────┬──────┘                │                                   │
│         │ monitor               │ invoke                            │
│         ▼                       ▼                                   │
│  ┌────────────────┐       ┌──────────┐                              │
│  │ Member Accounts│◄──────│  Lambda  │ remediate (add tags)         │
│  └────────────────┘       └──────────┘                              │
└─────────────────────────────────────────────────────────────────────┘

⚠️ IAM Exam Traps Summary

Trap	Reality
“IAM is regional”	❌ IAM is GLOBAL — no region selection
“SCPs restrict Management Account”	❌ Management Account has full power always
“SCPs affect service-linked roles”	❌ Service-linked roles are NOT affected
“Permission Boundaries work on groups”	❌ Only users and roles
“AssumeRole keeps original permissions”	❌ You give up original, take role’s
“Resource-based policy requires role assumption”	❌ No role needed — keep original permissions
“Cognito User Pools give AWS access”	❌ User Pools = auth only; Identity Pools = AWS credentials
“S3 GetObject on bucket ARN”	❌ Need `bucket/*` for object actions, `bucket` alone = Access Denied
“`aws:SourceRegion` condition key”	❌ Doesn’t exist — use `aws:RequestedRegion`
“EventBridge → Kinesis uses resource policy”	❌ Kinesis needs IAM Role (no resource-based policy)

🎯 IAM Quick Decision Table

Scenario	→ Solution
Restrict entire AWS account	SCP (at OU or Account level)
Restrict specific user/role (not whole account)	Permission Boundary
Cross-account, keep original permissions	Resource-based policy
Cross-account, need different identity	Assume Role
Centralized login for multiple AWS accounts	IAM Identity Center
External users (millions, mobile/web)	Cognito
Temporary AWS credentials	STS
Corporate IdP integration (SAML)	IAM Identity Center or AssumeRoleWithSAML
Social login (Google, Facebook)	Cognito Identity Pools
Share resources across accounts	AWS RAM
Find externally shared resources	IAM Access Analyzer
Multi-account governance with guardrails	AWS Control Tower
Standardize tags across Organization	Tag Policies

🎯 MASTER SUMMARY: IAM & Organizations Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Implicit Deny by Default

Everything starts DENIED. You must explicitly Allow. This applies to:

IAM policies
SCPs
Permission Boundaries

Why? Security principle — if you forget something, it’s denied (safe default).

Derive: If SCP only allows EC2 → everything else is denied. No need to memorize “deny lists.”

Principle 2: Explicit Deny ALWAYS Wins

No matter how many Allows exist, one Deny = blocked.

Why? Prevents privilege escalation — you can always restrict, never override a restriction.

Derive: To block an action, just add Deny anywhere in the chain. Order doesn’t matter for Deny.

Principle 3: Permissions = Intersection, Not Union

Effective permissions = what ALL layers allow together.

SCP ∩ Permission Boundary ∩ Identity Policy = Effective Permissions

Why? Each layer is a guardrail — you can only narrow, never expand beyond any layer.

Derive: If SCP allows S3+EC2, but Identity Policy only allows S3 → only S3 works.

Principle 4: Management Account is Untouchable

SCPs never apply to Management Account. It always has full power.

Why? Someone must be able to fix things if SCPs lock everyone out.

Derive: “Restrict ALL accounts” questions — Management Account is the exception.

Principle 5: Scope Determines Tool

Whole account restriction → SCP
Specific user/role restriction → Permission Boundary
Action-level control → IAM Policy + Conditions

Why? Different granularity needs different tools.

Derive: “Prevent developers from escalating privileges” → Permission Boundary (user-level, not account-level).

Principle 6: Cross-Account = Two Choices

Assume Role → Give up original identity, become the role
Resource-based Policy → Keep original identity + access target

Why? Sometimes you need both source AND target access (e.g., DynamoDB scan → S3 dump).

Derive: “Access DynamoDB in Account A AND S3 in Account B” → Resource-based policy on S3.

Principle 7: Temporary Credentials > Long-term

STS provides temporary, auto-expiring credentials.

Why? Reduces blast radius of compromise — credentials expire.

Derive: Cross-account access, federation, MFA → all use STS behind the scenes.

Principle 8: Authentication ≠ Authorization

Cognito User Pools = Authentication (who are you?)
Cognito Identity Pools = Authorization (what can you access?)
IAM Identity Center = Both for AWS accounts

Why? Separation of concerns — different systems handle different problems.

Derive: “Millions of mobile users need S3 access” → User Pool (auth) + Identity Pool (AWS creds).

Principle 9: Service-Linked Roles are Special

SCPs don’t affect them. They’re created BY AWS FOR AWS services.

Why? AWS services need guaranteed permissions to function.

Derive: “SCP blocks everything but service still works” → It’s using a service-linked role.

Principle 10: IAM is Global

No region selection. Users, roles, policies exist everywhere.

Why? Identity should be consistent — you’re YOU regardless of region.

Derive: “Create IAM user in us-east-1” → Trick question, IAM has no region.

Part 2: Decision Trees

Cross-Account Access Decision

Need cross-account access?
    │
    ├─► Need BOTH source + target permissions?
    │       │
    │       └─► YES → Resource-based policy
    │
    └─► Need different identity/permissions?
            │
            └─► YES → Assume Role

Restriction Scope Decision

What to restrict?
    │
    ├─► Entire account(s)?
    │       │
    │       └─► SCP (attach to OU or Account)
    │
    ├─► Specific user/role?
    │       │
    │       └─► Permission Boundary
    │
    └─► Specific actions with conditions?
            │
            └─► IAM Policy with Conditions

Identity Provider Decision

Who needs access?
    │
    ├─► Internal employees to AWS accounts?
    │       │
    │       └─► IAM Identity Center
    │
    ├─► External users (millions, mobile/web)?
    │       │
    │       └─► Cognito
    │
    └─► Corporate IdP (SAML)?
            │
            ├─► To AWS Console → IAM Identity Center
            └─► Programmatic → AssumeRoleWithSAML

The “CANNOT” List

What	Cannot
SCPs	Affect Management Account
SCPs	Affect service-linked roles
Permission Boundaries	Apply to groups
Cognito User Pools	Give AWS credentials directly
Simple AD	Join with on-premises AD
AD Connector	Store users (it’s just a proxy)

Part 3: Scenario Pattern Recognition

Pattern: “Restrict ALL member accounts from using a service”

Keywords: all accounts, prevent, organization-wide, block service Answer: SCP at Root OU level Why: SCPs cascade down OUs. Root OU = all member accounts (not Management).

Pattern: “Allow developers to create IAM users but prevent privilege escalation”

Keywords: delegate, self-service, prevent escalation, limit what they can create Answer: Permission Boundary Why: Boundary limits max permissions of created entities.

Pattern: “User needs to access resources in two accounts simultaneously”

Keywords: scan + dump, read from A write to B, both accounts Answer: Resource-based policy (on target resource) Why: AssumeRole would lose access to source account.

Pattern: “Millions of mobile app users need S3 access”

Keywords: mobile, web app, millions, external users, S3/DynamoDB access Answer: Cognito User Pools + Identity Pools Why: User Pools authenticate, Identity Pools give temporary AWS credentials.

Pattern: “Corporate employees need SSO to multiple AWS accounts”

Keywords: SSO, single sign-on, multiple accounts, employees, SAML, Active Directory Answer: IAM Identity Center Why: Built for this — integrates with AD, manages permission sets across accounts.

Pattern: “Detect untagged resources across organization”

Keywords: detect, compliance, untagged, non-compliant, monitor Answer: Control Tower Detective Guardrail (uses AWS Config) Why: Detective = monitoring (not blocking). Uses Config rules.

Pattern: “Prevent creating resources in unapproved regions”

Keywords: prevent, block, region restriction, all accounts Answer: SCP with aws:RequestedRegion condition (or Control Tower Preventive Guardrail) Why: Preventive = blocking. SCPs stop the action.

Keywords: share, subnets, cross-account, Transit Gateway, avoid duplication Answer: AWS RAM (Resource Access Manager) Why: RAM shares resources without duplication.

Pattern: “Find resources shared with external accounts”

Keywords: identify, find, external access, shared externally, audit Answer: IAM Access Analyzer Why: Analyzes policies to find external principal access.

Pattern: “Temporary credentials for cross-account access”

Keywords: temporary, cross-account, assume, programmatic Answer: STS AssumeRole Why: Returns temporary credentials for the target role.

Pattern: “Require MFA for sensitive operations”

Keywords: MFA, multi-factor, sensitive, delete, critical Answer: IAM Policy Condition: aws:MultiFactorAuthPresent Why: Condition key checks MFA status.

Pattern: “Standardize tag format across organization”

Keywords: standardize, tags, enforce, format, organization-wide Answer: Tag Policies Why: Define allowed tag keys/values, prevent non-compliant tags.

Pattern: “Connect IAM Identity Center to on-premises AD”

Keywords: on-premises, Active Directory, Identity Center, trust Answer: Two-way trust with AWS Managed Microsoft AD, OR AD Connector (proxy) Why: Can’t connect directly — need AWS AD service in between.

Part 4: Quick Reference Tables

SCP vs Permission Boundary vs IAM Policy

Aspect	SCP	Permission Boundary	IAM Policy
Scope	Account/OU	User/Role	User/Group/Role
Applies to groups?	N/A (account level)	❌ NO	✅ YES
Affects root user?	✅ YES	❌ NO (root has no boundary)	❌ NO
Affects Management Account?	❌ NO	✅ YES	✅ YES
Default	Implicit Deny	Implicit Deny	Implicit Deny

Directory Services Comparison

Service	Users Stored	On-Prem Connection	Use Case
AWS Managed Microsoft AD	In AWS	Two-way trust	Need AD features in AWS
AD Connector	On-prem only	Proxy	Keep users on-prem
Simple AD	In AWS	❌ Cannot	Basic AD, no on-prem

STS API Quick Reference

API	When to Use
AssumeRole	Cross-account, same-account role switch
AssumeRoleWithSAML	Corporate IdP (SAML) login
AssumeRoleWithWebIdentity	Social login (prefer Cognito)
GetSessionToken	MFA for IAM user
GetFederationToken	Custom federation

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“restrict ALL member accounts”	SCP at Root OU
“Management Account” + “restrict”	❌ Can’t — SCPs don’t apply
“service-linked role” + “blocked”	❌ Can’t — SCPs don’t affect
“prevent privilege escalation”	Permission Boundary
“groups” + “permission boundary”	❌ Not supported
“both accounts” / “source and target”	Resource-based policy
“give up permissions”	Assume Role
“millions of users” / “mobile app”	Cognito
“User Pools” + “AWS access”	❌ Need Identity Pools too
“SSO multiple accounts”	IAM Identity Center
“on-premises AD” + “Identity Center”	AWS Managed AD + trust, or AD Connector
“Simple AD” + “on-prem”	❌ Cannot connect
“share resources” / “avoid duplication”	AWS RAM
“find external access”	IAM Access Analyzer
“temporary credentials”	STS
“MFA required”	`aws:MultiFactorAuthPresent` condition
“region restriction”	`aws:RequestedRegion` condition or SCP
“tag standardization”	Tag Policies
“detect” + “compliance”	Detective Guardrail (AWS Config)
“prevent” + “organization-wide”	Preventive Guardrail (SCP)
“Control Tower” + “remediate”	Lambda (triggered by SNS from Config)
“landing zone”	Control Tower
“SAML” + “programmatic”	AssumeRoleWithSAML
“social login”	Cognito Identity Pools
“IAM” + “regional”	❌ Trick — IAM is GLOBAL

Part 6: Elimination Checklist

□ Is it about restricting accounts?
  → Yes = Think SCP
  → But Management Account? = SCP won't work

□ Is it about restricting a specific user/role?
  → Yes = Think Permission Boundary
  → Is it a group? = Permission Boundary won't work

□ Is it cross-account access?
  → Need both source + target access? = Resource-based policy
  → Need different identity? = Assume Role

□ Is it millions of external users?
  → Yes = Cognito (not IAM users)
  → Need AWS credentials? = Identity Pools required

□ Is it corporate employees to AWS?
  → Yes = IAM Identity Center

□ Is it about compliance/detection?
  → Yes = Detective Guardrail / AWS Config

□ Is it about blocking/prevention?
  → Yes = Preventive Guardrail / SCP

🏆 The Golden Rules

Implicit Deny — Everything denied until explicitly allowed
Explicit Deny Wins — One Deny overrides all Allows
Intersection Rule — Effective = SCP ∩ Boundary ∩ Policy
Management Account Exception — SCPs don’t touch it
Service-Linked Roles Exception — SCPs don’t affect them
Permission Boundaries ≠ Groups — Only users and roles
AssumeRole = Identity Switch — You become the role, lose original
Resource-Based = Keep Both — Keep original + access target
User Pools ≠ AWS Access — Need Identity Pools for credentials
IAM = Global — No region, ever
SCP Allow = Everything Else Denied — Like IAM, implicit deny default
FullAWSAccess SCP — Default SCP that allows everything (remove carefully!)

Amazon VPC:

VPC (Virtual Private Cloud) is a service that lets you launch AWS resources in a logically isolated virtual network that you define.

Max 5 VPCs per region (soft limit)
Max 5 CIDRs per VPC
CIDR range: min /28 (16 IPs) → max /16 (65,536 IPs)
Only private IPv4 ranges allowed:
- 10.0.0.0/8 (10.0.0.0 – 10.255.255.255)
- 172.16.0.0/12 (172.16.0.0 – 172.31.255.255)
- 192.168.0.0/16 (192.168.0.0 – 192.168.255.255)
VPC CIDR should NOT overlap with other networks (e.g., corporate)

CIDR – IPv4:

CIDR (Classless Inter-Domain Routing) — method for allocating IP addresses. Used in Security Groups, NACLs, and all AWS networking.

CIDR	IPs	Use Case
`/32`	1	Single IP (e.g., SSH from your IP)
`/28`	16	Smallest VPC/subnet
`/27`	32	Small subnet
`/26`	64	Medium subnet
`/24`	256	Common subnet size
`/16`	65,536	Largest VPC
`/0`	All	`0.0.0.0/0` = open to internet

⚠️ Exam trap: “Need 29 IPs for EC2” → /27 (32 IPs) is NOT enough! AWS reserves 5 IPs per subnet (first 4 + last 1) → 32 - 5 = 27 < 29. Use /26 (64 - 5 = 59 ✓)

AWS Reserved IPs per subnet (5):

.0 — Network Address
.1 — VPC Router
.2 — Amazon DNS
.3 — Future use
.255 — Broadcast (not supported, but reserved)

IP Addresses:

IPv4 (4.3 billion addresses)
IPv6 (3.4 × 10³⁸ addresses) — every IPv6 address in AWS is public and internet-routable (no private range)
Elastic IP: fixed public IPv4 address attached to EC2 instance

IPv6 in VPC:

IPv4 cannot be disabled for VPC and subnets
Enable IPv6 to operate in dual-stack mode (IPv4 + IPv6)
EC2 gets at least a private IPv4 + a public IPv6
Communicate via either IPv4 or IPv6 to internet through IGW

⚠️ Exam trap: “Can’t launch EC2 in subnet” → NOT because of IPv6 (space is huge). It’s because no available IPv4 in the subnet → solution: create a new IPv4 CIDR in your subnet.

Subnets:

Subnets partition your network inside VPC.

Public subnet — accessible from the internet (has route to IGW)
Private subnet — NOT accessible from the internet
Route Tables — define access to the internet and between subnets

Internet Gateway (IGW):

Internet gateway helps VPC instances connect with the internet (public subnets have a route to IGW).

Scales horizontally, highly available and redundant
Created separately from VPC
One VPC ↔ One IGW (1:1 mapping)
IGW alone does NOT allow internet access → Route Tables must be edited!

NAT (Network Address Translation):

NAT allows instances in private subnets to access the internet while remaining private.

Private Subnet EC2 ──► NAT (Public Subnet) ──► IGW ──► Internet
       10.0.0.20          EIP: 12.34.56.78         ▲
                           (translates src IP)      │
                                                    │
                    Response comes back to NAT ◄────┘
                    NAT forwards to 10.0.0.20

NAT Instance (self-managed, outdated but still on exam):

EC2 instance with special AMI in public subnet
Must disable Source/Destination Check on EC2
Must have Elastic IP attached
Must configure Route Tables (private subnet → NAT Instance)
You manage Security Groups, patching, HA (ASG + multi-AZ)
Can be used as Bastion Host

NAT Gateway (AWS-managed):

Created in specific AZ, uses Elastic IP
5 Gbps → auto-scales to 100 Gbps
Can’t be used by instances in the same subnet
Requires IGW (Private Subnet → NATGW → IGW)
No Security Groups to manage

NAT Gateway HA: Resilient within single AZ only → create one NATGW per AZ for fault tolerance. No cross-AZ failover needed.

NAT Gateway vs NAT Instance:

Feature	NAT Gateway	NAT Instance
Availability	HA within AZ (create in each AZ)	Manual failover (ASG + script)
Bandwidth	Up to 100 Gbps	Depends on instance type
Maintenance	AWS-managed	You manage (patching, OS)
Cost	Per hour + data transferred	Per hour + instance type + network
Public IPv4	✅	✅
Private IPv4	✅	✅
Security Groups	❌ No	✅ Yes
Bastion Host	❌ No	✅ Yes
Port Forwarding	❌ No	✅ Yes (iptables)

⚠️ Exam trap: “NAT + Security Groups” → NAT Instance (NAT Gateway has NO SGs). “NAT + Bastion Host” → also NAT Instance.

⚠️ Exam trap: “Private instances need internet, managed, HA” → NAT Gateway. NAT Instance is legacy — only pick it if question says “existing NAT Instance” or needs SG/Bastion.

Bastion Host:

Bastion Host = EC2 instance in public subnet used to SSH into private instances.

Users ──SSH (port 22)──► Bastion Host (Public Subnet) ──SSH──► EC2 (Private Subnet)
                         BastionHost-SG                        LinuxInstance-SG
                         Inbound: port 22                      Inbound: port 22
                         from corp CIDR                        from BastionHost-SG

Bastion SG: allow inbound port 22 from restricted CIDR (e.g., corporate public IP)
Private EC2 SG: allow inbound port 22 from Bastion Host SG or Bastion’s private IP

⚠️ Exam trap: “SSH into private EC2” → Bastion Host (or SSM Session Manager for no-SSH approach). NOT NAT Gateway — NAT is for outbound internet only.

Security Groups vs NACLs:

Network ACL (NACL) — firewall at subnet level. Can have ALLOW and DENY rules. Rules only include IP addresses. Automatically applies to all instances in the subnet. STATELESS: Return traffic must be explicitly allowed. Checks packets both ways.

Default NACL: allows ALL inbound and outbound traffic
Custom NACL: denies ALL traffic by default until you add rules

Security Groups — firewall at instance (ENI) level. Can have only ALLOW rules. Rules include IP addresses and other security groups. STATEFUL: Return traffic is automatically allowed.

Default SG: denies all inbound, allows all outbound
Inbound/Outbound rules are separate — they control who initiates the connection, not who responds
Applies to any service with an ENI: EC2, RDS, Aurora, ElastiCache, ECS (awsvpc), Lambda (in VPC), EFS mount targets, ALB, NLB, VPC Endpoints (Interface)

Inbound port 22 allowed:   Outside ──SSH──► Your EC2  ✅ (response auto-allowed out)
Outbound port 22 allowed:  Your EC2 ──SSH──► Outside  ✅ (response auto-allowed in)

⚠️ Exam trap: “Default NACL” → allows all traffic. “Custom NACL” → denies all by default. Don’t confuse them!

Feature	Security Group	NACL
Level	Instance (ENI)	Subnet
Rules	Allow only	Allow AND Deny
State	Stateful (return auto-allowed)	Stateless (must allow both directions)
Rule Evaluation	All rules evaluated together	Rules processed in order (lowest # first, first match wins)
Association	Manually assigned to instance	Automatically applies to all instances in subnet
Default	Deny all inbound, allow all outbound	Allow ALL traffic

Ephemeral Ports:

Client connects to a defined port (e.g., 443), response comes back on a random ephemeral port
NACL must allow ephemeral port range for return traffic (because stateless!)
Windows: 49152 – 65535 | Linux: 32768 – 60999

Client (11.22.33.44)                          Web Server (55.66.77.88)
    ──► Src Port: 50105, Dest Port: 443 ──►      (fixed port 443)
    ◄── Dest Port: 50105, Src Port: 443 ◄──      (response to ephemeral port)

NACL with Ephemeral Ports — Example (Web → DB):

  Web Subnet (Public)                              DB Subnet (Private)
  ┌──────────────────┐                             ┌─────────────────┐
  │    EC2 (Web)     │                             │ RDS (port 3306) │
  │                  │                             │                 │
  └────────┬─────────┘                             └────────┬────────┘
           │                                                │
        Web-NACL                                         DB-NACL
     ┌───────────────┐                           ┌───────────────┐
     │ OUTBOUND:     │    ── request ──►         │ INBOUND:      │
     │  port 3306    │                           │  port 3306    │
     │  to DB CIDR   │                           │  from Web CIDR│
     │               │                           │               │
     │ INBOUND:      │    ◄── response ──        │ OUTBOUND:     │
     │  port 1024-   │                           │  port 1024-   │
     │  65535         │                           │  65535        │
     │  from DB CIDR │                           │  to Web CIDR  │
     └───────────────┘                           └───────────────┘

4 NACL rules needed (because stateless = each direction, each NACL):

NACL	Direction	Port	CIDR	Why
Web-NACL	Outbound	3306	DB Subnet CIDR	Web initiates DB connection
Web-NACL	Inbound	1024-65535	DB Subnet CIDR	DB response on ephemeral port
DB-NACL	Inbound	3306	Web Subnet CIDR	Accept DB connection from Web
DB-NACL	Outbound	1024-65535	Web Subnet CIDR	Send response on ephemeral port

⚠️ Exam trap: With SGs you’d only need 2 rules (allow 3306 each side) — stateful handles the rest. With NACLs you need 4 rules — don’t forget the ephemeral port rules for return traffic!

⚠️ Exam trap: “NACL blocking return traffic” → you forgot to allow ephemeral ports outbound (server side) or inbound (client side). SGs don’t have this problem (stateful).

VPC Flow Logs:

VPC Flow Logs capture information about IP traffic going into your interfaces:

VPC Flow Logs
Subnet Flow Logs
Elastic Network Interface (ENI) Flow Logs
Captures network info from AWS managed interfaces: ELB, RDS, ElastiCache, Redshift, WorkSpaces, NATGW, Transit Gateway
Data can go to S3, CloudWatch Logs, Kinesis Data Firehose

Flow Log Syntax:

version account-id interface-id srcaddr dstaddr srcport dstport packets bytes start end protocol action log-status
2 123456789010 eni-1235b8ca srcIP dstIP 20641 22 6 20 4249 ... ACCEPT OK
2 123456789010 eni-1235b8ca srcIP dstIP 49761 3389 6 20 4249 ... REJECT OK

Key fields:

srcaddr / dstaddr — identify problematic IPs
srcport / dstport — identify problematic ports
action — ACCEPT or REJECT (due to SG or NACL)

Troubleshoot SG vs NACL using Flow Logs (ACTION field):

Scenario	Inbound	Outbound	Blocked by
Incoming blocked	REJECT	—	NACL or SG
Incoming allowed, response blocked	ACCEPT	REJECT	NACL (SG is stateful → would auto-allow)
Outgoing blocked	—	REJECT	NACL or SG
Outgoing allowed, response blocked	ACCEPT	REJECT	NACL

Memory trick: “ACCEPT then REJECT” = always NACL (stateless blocks return traffic). SG would never block return traffic (stateful).

Flow Logs Architectures:

Flow Logs → CloudWatch Logs → Contributor Insights → Top-10 IP addresses
Flow Logs → CloudWatch Logs → Metric Filter (SSH, RDP) → CW Alarm → SNS alert
Flow Logs → S3 → Athena (SQL queries) → QuickSight (visualization)

VPC Peering:

VPC Peering — privately connect two VPCs using AWS’ network, behave as if same network.

Must NOT have overlapping CIDRs
NOT transitive — must create peering for each pair (A↔B, A↔C, B↔C)
Must update Route Tables in each VPC’s subnets
Works across different accounts and regions
Can reference a Security Group in peered VPC (cross-account, same region only)

VPC-A ◄──Peering──► VPC-B ◄──Peering──► VPC-C
  │                                        │
  └──────────Peering (A↔C needed!)─────────┘
         (B does NOT relay traffic)

⚠️ Exam trap: “VPC A peers with B, B peers with C, can A talk to C?” → NO! Not transitive. Need separate A↔C peering. If you need many VPCs connected → use Transit Gateway instead.

VPC Endpoints:

VPC Endpoints connect to AWS services using private network instead of public internet.

Redundant and scale horizontally
Remove the need for IGW, NATGW to access AWS services
Troubleshooting: check DNS Resolution in VPC + check Route Tables

Two types:

Feature	Interface Endpoint	Gateway Endpoint
How	Provisions an ENI (private IP)	Target in Route Table
Security Group	✅ Must attach	❌ No
Services	Most AWS services	S3 and DynamoDB only
Cost	$ per hour + $ per GB	Free
Access from on-prem	✅ (via VPN/DX)	❌ No
Powered by	AWS PrivateLink	Route Table entry

Option 1 (costly):     Lambda (VPC) ──► NAT GW ──► IGW ──► DynamoDB (public)
Option 2 (free/better): Lambda (VPC) ──► Gateway Endpoint ──► DynamoDB (private)

⚠️ Exam trap: “S3 or DynamoDB access from VPC” → Gateway Endpoint (free, preferred on exam). Interface Endpoint only when access needed from on-premises (VPN/Direct Connect), different VPC, or different region.

⚠️ Exam trap: “Lambda in VPC can’t reach DynamoDB” → either add NAT GW + IGW, or (better) use VPC Gateway Endpoint for DynamoDB.

⚠️ Exam trap - “VPC resources access SQS/SNS/KMS privately (no internet)”:

✅ VPC Interface Endpoint (PrivateLink) — private ENI, traffic stays on AWS network
❌ VPN = connects on-prem ↔ AWS, doesn’t solve VPC → AWS service
❌ NAT instance/gateway = still routes through public internet
❌ Internet Gateway = is the public internet — exactly what they want to avoid

AWS PrivateLink (VPC Endpoint Services):

AWS PrivateLink — expose a service in your VPC to other VPCs privately.

3rd party VPC: Network Load Balancer (service provider)
Customer VPC: ENI (Interface Endpoint, consumer)
No VPC peering, IGW, NAT, or route tables needed
Works across accounts

Site-to-Site VPN:

Site-to-Site VPN — encrypted connection between on-premises and AWS over the public internet.

Components:

Virtual Private Gateway (VGW) — VPN concentrator on AWS side, attached to VPC
- Can customize ASN (Autonomous System Number)
Customer Gateway (CGW) — software or physical device on customer side

Setup:

CGW IP: use public IP of your device (or public IP of NAT device if behind NAT-T)
Enable Route Propagation for VGW in route table associated with your subnets
To ping EC2 from on-prem → add ICMP protocol to inbound Security Group rules

AWS VPN CloudHub:

Secure communication between multiple on-prem sites via multiple VPN connections
Low-cost hub-and-spoke model over public internet
Setup: multiple VPN connections on same VGW + dynamic routing + route tables

⚠️ Exam trap: “Ping EC2 from on-premises doesn’t work” → check ICMP allowed in SG inbound + Route Propagation enabled.

Direct Connect (DX):

Direct Connect — dedicated private physical connection from on-premises to AWS.

Setup at AWS Direct Connect Location (co-location facility)
Requires Virtual Private Gateway on VPC
Access both public (S3) and private (EC2) resources on same connection
Supports IPv4 and IPv6
Lead time: often > 1 month to establish

Virtual Interfaces (VIFs):

Private VIF → access VPC resources (EC2 in private subnet)
Public VIF → access public AWS services (S3, Glacier)
Transit VIF → access VPCs via Transit Gateway

Corporate DC ──► Customer Router ──► DX Endpoint ──► VPG ──► VPC
                                     (DX Location)
                                      VLAN 1 (Private VIF) ──► EC2
                                      VLAN 2 (Public VIF)  ──► S3, Glacier

Connection Types:

Type	Speed	Details
Dedicated	1 / 10 / 100 Gbps	Physical port dedicated to you. Request via AWS first
Hosted	50 Mbps – 10 Gbps	Via AWS Direct Connect Partners. Capacity on demand

Direct Connect Gateway:

Connect DX to VPCs in multiple regions (same account)
One DX connection → DX Gateway → multiple VPCs across regions

Resiliency:

High resiliency: one connection at multiple DX locations
Maximum resiliency: separate connections on separate devices at multiple locations

Backup:

DX fails → backup with another DX (expensive) or Site-to-Site VPN (cheaper)

⚠️ Exam trap: “Private, dedicated, consistent connection” → Direct Connect. “Encrypted over internet” → Site-to-Site VPN. DX is NOT encrypted by default (add VPN on top for encryption).

⚠️ Exam trap: “Improve connection within days/1 week” → NOT Direct Connect (takes > 1 month). Use Site-to-Site VPN for quick setup. DX is only the answer when time is not a constraint.

AWS Client VPN:

AWS Client VPN — connect end-devices (laptops) to AWS or on-premises via OpenVPN over the internet. Access EC2 using private IP.

Transit Gateway:

Transit Gateway — transitive peering hub for thousands of VPCs and on-premises (hub-and-spoke / star topology).

Regional resource, can work cross-region (peering)
Share cross-account using Resource Access Manager (RAM)
Route Tables: control which VPC can talk to which
Works with: Direct Connect Gateway, VPN connections
Only AWS service that supports IP Multicast

                    ┌──► VPC-A
                    │
Corporate DC ──► Transit Gateway ──► VPC-B
  (VPN/DX)          │
                    ├──► VPC-C
                    │
                    └──► VPC-D

Without TGW: complex mesh of VPC peering + VPN connections (N² connections) With TGW: single hub, all spokes connect to it (N connections)

ECMP (Equal-Cost Multi-Path Routing):

Forward packets over multiple best paths to increase bandwidth
Use case: multiple Site-to-Site VPN connections to TGW for more bandwidth

Setup	Throughput
VPN → VGW (1 connection, 2 tunnels)	1.25 Gbps
VPN → TGW (1 connection, ECMP)	2.5 Gbps (both tunnels used)
2× VPN → TGW (ECMP)	5.0 Gbps
3× VPN → TGW (ECMP)	7.5 Gbps

Share DX across accounts:

Transit Gateway + Direct Connect Gateway + Transit VIF
Use RAM to share Transit Gateway with other accounts

⚠️ Exam trap: “Connect many VPCs + on-premises, simplify topology” → Transit Gateway. NOT VPC Peering (not transitive, mesh complexity).

⚠️ Exam trap: “Increase VPN bandwidth to AWS” → multiple VPN connections + Transit Gateway with ECMP. VGW limited to 1.25 Gbps.

VPC Traffic Mirroring:

Traffic Mirroring — capture and inspect network traffic in your VPC.

Source: ENIs → Target: ENI or NLB
Can filter traffic, optionally truncate packets
Source and target can be in same or different VPCs (via peering)
Use cases: content inspection, threat monitoring, troubleshooting

Source A (ENI) ──┐
                 ├──► Traffic Mirroring ──► NLB ──► ASG (Security Appliances)
Source B (ENI) ──┘    (filter optional)

Egress-only Internet Gateway:

Egress-only IGW — like a NAT Gateway, but for IPv6.

Allows instances outbound connections over IPv6
Prevents the internet from initiating inbound IPv6 connections
Must update Route Tables

IPv4 outbound: Private EC2 ──► NAT Gateway ──► IGW ──► Internet
IPv6 outbound: EC2 ──► Egress-only IGW ──► Internet  (no inbound initiated)

⚠️ Exam trap: “IPv6 instances need outbound internet but block inbound” → Egress-only IGW (NOT NAT Gateway — NAT is for IPv4 only).

Networking Costs:

Core principle: Ingress is free, egress costs money. Keep traffic inside AWS to minimize costs.

EC2 Data Transfer Costs (per GB):

Traffic Path	Cost
Traffic in to EC2 (ingress)	Free
Same AZ, private IP	Free
Same AZ, public/Elastic IP	$0.02
Cross-AZ, private IP	$0.01
Cross-region	$0.02

Cost optimization tips:

Use private IP instead of public IP → saves money + better network performance
Use same AZ for maximum savings (trade-off: less HA)
DX locations co-located in same region → lower egress cost
Keep heavy processing (DB queries) inside AWS, send only results out

⚠️ Exam trap: “Lowest egress cost” with Direct Connect available

DX egress < Internet egress (~$0.02/GB vs ~$0.09/GB)
Strategy: keep big data transfers inside AWS (free/cheap), send only small results to users via DX (not internet)
❌ “Access over internet” → always more expensive egress than DX
❌ “Deploy on-premises + query AWS” → large query results cross the boundary on EVERY request (expensive even via DX)

NAT Gateway vs VPC Gateway Endpoint (for S3):

Path	Cost
EC2 → NAT GW → IGW → S3	$0.045/hr + $0.045/GB + $0.09/GB cross-region
EC2 → Gateway Endpoint → S3	Free (endpoint) + $0.01/GB same-region

Subnet 1: EC2 ──► NAT GW ──► IGW ──► S3         (costly: ~$0.09/GB)
Subnet 2: EC2 ──► VPC Gateway Endpoint ──► S3    (free endpoint, ~$0.01/GB)

⚠️ Exam trap: “Reduce cost of S3 access from VPC” → Gateway Endpoint (free, no NAT GW charges). Route table entry with pl-id for Amazon S3 → vpce-id.

S3 Data Transfer Pricing (USA):

Path	Cost/GB
S3 ingress (upload)	Free
S3 → Internet	$0.09
S3 Transfer Acceleration	+$0.04 to $0.08 on top
S3 → CloudFront	Free
CloudFront → Internet	$0.085 (slightly cheaper than S3 direct)
S3 Cross-Region Replication	$0.02

⚠️ Exam trap: “Deliver S3 content to users cheaply” → CloudFront ($0.085/GB vs $0.09/GB direct) + caching + 7x cheaper S3 request pricing.

AWS Network Firewall:

AWS Network Firewall — protect your entire VPC, Layer 3 to Layer 7.

Inspect traffic in any direction:
- VPC to VPC
- Outbound to internet
- Inbound from internet
- To/from Direct Connect & Site-to-Site VPN
Internally uses AWS Gateway Load Balancer
Rules centrally managed cross-account by AWS Firewall Manager

Fine-Grained Controls:

1000s of rules: IP/port filtering (10,000s of IPs), protocol blocking (e.g., block SMB outbound)
Stateful domain list: allow only *.mycorp.com outbound
Regex pattern matching
Traffic filtering: Allow, Drop, or Alert
Active flow inspection (intrusion prevention)
Logs → S3, CloudWatch Logs, Kinesis Data Firehose

⚠️ Exam trap: “Sophisticated VPC-wide network protection, Layer 3-7, inspect all traffic directions” → AWS Network Firewall. NOT just NACLs/SGs (those are basic). NOT WAF (WAF is Layer 7 HTTP only).

🎯 MASTER SUMMARY: VPC & Networking Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Everything Starts with Routing

Traffic doesn’t flow just because resources exist — Route Tables are the backbone. IGW, NAT, VPC Peering, Endpoints — none work without correct route table entries. If connectivity fails, check routes first.

Key insight: Most “can’t connect” troubleshooting answers involve Route Tables, SGs, or NACLs.

Principle 2: Public vs Private = Route to IGW

A subnet is “public” only because its route table has 0.0.0.0/0 → igw-id. There’s no checkbox. No route to IGW = private subnet, regardless of what you call it.

Principle 3: Stateful vs Stateless = The Fundamental Security Split

SG (Stateful): Allow outbound → return automatically allowed inbound. Only control who initiates.
NACL (Stateless): Must explicitly allow BOTH directions, including ephemeral ports for responses.

Derivation: If the exam says “ACCEPT then REJECT in flow logs” → always NACL. SGs never block return traffic.

Principle 4: Private Subnet Internet Access = NAT (IPv4) or Egress-only IGW (IPv6)

Private instances can’t reach internet directly. They need a “translator” in a public subnet:

IPv4 → NAT Gateway (managed) or NAT Instance (legacy)
IPv6 → Egress-only IGW (all IPv6 is public, so NAT isn’t needed — just block inbound)

Principle 5: AWS Services from VPC = Endpoints (Stay Private)

Instead of routing through IGW to reach AWS services, use VPC Endpoints:

Gateway Endpoint (free) → S3 and DynamoDB only
Interface Endpoint (paid) → everything else, accessible from on-prem

Derivation: “Reduce cost of S3 access” or “private access to S3” → Gateway Endpoint.

Principle 6: On-Premises Connectivity = Speed vs Cost vs Time

Three options, each with trade-offs:

VPN: Fast to set up, encrypted, over internet, up to 1.25 Gbps (or more with TGW+ECMP)
Direct Connect: Takes >1 month, private physical line, up to 100 Gbps, NOT encrypted
Client VPN: For individual users (laptops), not site-to-site

Principle 7: Transitivity Doesn’t Exist in VPC Peering

VPC Peering is point-to-point. A↔B and B↔C does NOT mean A↔C. For hub connectivity → Transit Gateway.

Principle 8: Transit Gateway = The Universal Hub

TGW solves three problems: (1) transitive routing, (2) VPN bandwidth scaling (ECMP), (3) sharing DX across accounts. If question mentions “many VPCs” or “simplify network” → TGW.

Principle 9: Network Protection is Layered

Layer 3-4:  NACLs (subnet) → Security Groups (ENI)
Layer 7:    WAF (HTTP/HTTPS only)
Layer 3-7:  AWS Network Firewall (entire VPC, all directions)
Cross-acct: AWS Firewall Manager (centralize rules)

Principle 10: Egress Costs Money, Ingress is Free

All AWS networking pricing follows this: data IN = free, data OUT = costs. Minimize egress by keeping processing inside AWS and using private IPs.

Part 2: Decision Tree (Follow Keywords → Find Answer)

Connectivity Decision Tree

Need to connect to AWS?
│
├─ From on-premises SITE?
│  ├─ Need it NOW (days)? ──► Site-to-Site VPN
│  ├─ Need dedicated/private/consistent? ──► Direct Connect
│  ├─ Need encryption on DX? ──► VPN on top of DX
│  ├─ Multiple sites to connect? ──► VPN CloudHub
│  └─ Need DX to multiple regions? ──► DX Gateway
│
├─ From individual LAPTOP?
│  └─► AWS Client VPN (OpenVPN)
│
├─ VPC to VPC?
│  ├─ Just 2 VPCs? ──► VPC Peering
│  ├─ Many VPCs (hub-and-spoke)? ──► Transit Gateway
│  └─ Expose specific service? ──► PrivateLink (NLB + ENI)
│
└─ VPC to AWS Service (S3, DynamoDB, etc.)?
   ├─ S3 or DynamoDB? ──► Gateway Endpoint (free)
   ├─ Other service? ──► Interface Endpoint
   └─ Need on-prem access too? ──► Interface Endpoint

Security Decision Tree

Need to control traffic?
│
├─ At instance/ENI level? ──► Security Group (ALLOW only, stateful)
├─ At subnet level? ──► NACL (ALLOW + DENY, stateless)
├─ Block specific IPs (Layer 3)? ──► NACL (has DENY rules)
├─ Block HTTP patterns/SQL injection? ──► WAF (Layer 7)
├─ VPC-wide, all directions, L3-L7? ──► AWS Network Firewall
└─ Centralize across accounts? ──► AWS Firewall Manager

The CANNOT List

You CANNOT…	Why
Disable IPv4 in VPC	VPC requires IPv4; IPv6 is optional dual-stack
Use NAT Gateway as Bastion	NAT GW doesn’t support SSH — use NAT Instance
Attach >1 IGW per VPC	1:1 mapping only
Use Gateway Endpoint from on-prem	Gateway Endpoint = route table only; use Interface Endpoint
Make VPC Peering transitive	Need separate peering per pair, or use TGW
Encrypt DX natively	Add VPN on top for encryption
Set up DX in under 1 month	Lead time >1 month; use VPN for quick setup
Have VPC CIDR larger than /16	Max VPC size = /16 (65,536 IPs)
Attach SG to NAT Gateway	NAT GW has no SGs — only NAT Instance does
Use VGW for >1.25 Gbps VPN	Need TGW + ECMP for higher throughput

Part 3: Scenario Pattern Recognition

Pattern: “Private instances need internet access, managed, scalable”

Keywords: private subnet, internet, managed, scales Answer: NAT Gateway Why: AWS-managed, auto-scales to 100 Gbps, no SG/patching needed

Pattern: “SSH into private EC2 instances”

Keywords: SSH, private subnet, access, developers Answer: Bastion Host (or SSM Session Manager) Why: Bastion in public subnet acts as SSH jump box. SG: port 22 from corporate public CIDR

Pattern: “ACCEPT then REJECT in flow logs”

Keywords: flow logs, allowed then blocked, return traffic Answer: NACL is blocking (not SG) Why: SGs are stateful — they never block return traffic. Only NACLs (stateless) do this

Pattern: “Can’t launch EC2 in subnet”

Keywords: launch failure, subnet, no capacity Answer: No available IPv4 addresses → add new CIDR Why: IPv6 space is huge; the bottleneck is always IPv4

Pattern: “VPC A peers with B, B peers with C, can A reach C?”

Keywords: VPC peering, transitive, multiple VPCs Answer: NO — VPC Peering is not transitive Why: Need A↔C peering, or use Transit Gateway

Pattern: “Connect many VPCs + on-premises, simplify”

Keywords: many VPCs, hub-and-spoke, simplify, on-premises Answer: Transit Gateway Why: Single hub, N connections instead of N² mesh

Pattern: “Increase VPN bandwidth beyond 1.25 Gbps”

Keywords: VPN throughput, scale bandwidth, more than 1.25 Answer: Transit Gateway with ECMP + multiple VPN connections Why: VGW uses only 1 tunnel (1.25 Gbps); TGW uses both (2.5 Gbps) and stacks connections

Pattern: “Private, dedicated, consistent connection to AWS”

Keywords: dedicated, private, consistent, not internet Answer: Direct Connect Why: Physical private connection, doesn’t traverse internet

Pattern: “Improve connectivity within days/1 week”

Keywords: quickly, fast setup, days, immediately Answer: Site-to-Site VPN (NOT Direct Connect) Why: DX takes >1 month to establish

Pattern: “DX backup, cost-effective”

Keywords: Direct Connect fails, backup, cheap Answer: Site-to-Site VPN as backup Why: Second DX is expensive; VPN is cheap and quick

Pattern: “Access S3/DynamoDB from VPC privately”

Keywords: S3, DynamoDB, private access, no internet, reduce cost Answer: VPC Gateway Endpoint (free) Why: Free, route table entry, no NAT GW charges

Pattern: “Access AWS service from on-premises via DX/VPN”

Keywords: on-premises, AWS service, private access, VPN, Direct Connect Answer: VPC Interface Endpoint (not Gateway) Why: Gateway Endpoints can’t be accessed from on-prem

Pattern: “Connect multiple on-prem sites, backup over internet”

Keywords: multiple sites, hub-and-spoke, VPN, backup Answer: AWS VPN CloudHub Why: Multiple VPN connections on same VGW, over public internet

Pattern: “Expose service from one VPC to another privately”

Keywords: expose, service, private, cross-VPC, cross-account Answer: AWS PrivateLink (NLB + ENI) Why: No peering, no IGW, no routes needed

Pattern: “IPv6 outbound internet, block inbound”

Keywords: IPv6, outbound, prevent inbound, internet Answer: Egress-only Internet Gateway Why: NAT is IPv4 only; Egress-only IGW is the IPv6 equivalent

Pattern: “Capture IP traffic information/metadata”

Keywords: capture, IP traffic, information, logs, metadata Answer: VPC Flow Logs Why: Flow Logs = metadata (IPs, ports, action). Traffic Mirroring = full packet capture

Pattern: “Inspect actual network traffic content”

Keywords: inspect, deep packet, content, security appliance Answer: VPC Traffic Mirroring Why: Copies actual packets to ENI/NLB for analysis

Pattern: “500 Mbps Direct Connect”

Keywords: 500 Mbps, DX, connection Answer: Hosted connection Why: Dedicated = 1/10/100 Gbps only. Anything in between = Hosted

Pattern: “VPC-wide network protection, Layer 3-7”

Keywords: sophisticated, entire VPC, Layer 3-7, all directions Answer: AWS Network Firewall Why: NACLs/SGs are basic, WAF is HTTP-only. Network Firewall covers L3-L7 in all directions

Pattern: “DX to VPCs in multiple regions”

Keywords: Direct Connect, multiple regions, VPCs Answer: Direct Connect Gateway Why: One DX → DX Gateway → VPCs across regions

Part 4: Quick Reference Tables

On-Premises Connectivity Comparison:

Feature	Site-to-Site VPN	Direct Connect	Client VPN
Speed	Up to 1.25 Gbps (VGW) or more (TGW+ECMP)	50 Mbps – 100 Gbps	N/A
Path	Public internet	Private physical line	Public internet
Encrypted	✅ Yes (IPsec)	❌ No (add VPN on top)	✅ Yes (OpenVPN)
Setup time	Minutes/hours	>1 month	Minutes
Cost	Low	High	Low
Use case	Quick setup, backup for DX	Large bandwidth, consistent	Individual users
AWS side	VGW or TGW	VGW + DX Location	Client VPN Endpoint
On-prem side	CGW	Customer Router	OpenVPN client

VPC Endpoint Comparison:

Feature	Gateway Endpoint	Interface Endpoint
Services	S3, DynamoDB	Everything else
Cost	Free	$/hr + $/GB
How	Route Table entry	ENI (private IP)
SG	No	Yes
On-prem access	❌	✅

Security Layers:

Layer	Tool	Scope	Rules
Instance/ENI	Security Group	Per ENI	Allow only, stateful
Subnet	NACL	Per subnet	Allow + Deny, stateless
HTTP/HTTPS	WAF	CloudFront/ALB/API GW	Web ACL rules
Entire VPC (L3-L7)	Network Firewall	Per VPC	Allow/Drop/Alert
Cross-account	Firewall Manager	Organization	Centralized management

Key Numbers:

What	Value
Max VPCs per region	5 (soft limit)
Max CIDRs per VPC	5
VPC CIDR range	/28 (16 IPs) – /16 (65,536 IPs)
Reserved IPs per subnet	5
VPN throughput (VGW)	1.25 Gbps
VPN throughput (TGW, 1 conn)	2.5 Gbps (ECMP)
NAT Gateway bandwidth	Up to 100 Gbps
DX Dedicated speeds	1 / 10 / 100 Gbps
DX Hosted speeds	50 Mbps – 10 Gbps
DX setup time	>1 month
Ephemeral ports (Linux)	32768 – 60999
Ephemeral ports (Windows)	49152 – 65535

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“Private subnet internet IPv4, managed”	NAT Gateway
“NAT + Security Groups”	NAT Instance
“NAT + Bastion Host”	NAT Instance
“SSH into private EC2”	Bastion Host (or SSM)
“Bastion Host SG, which port/CIDR?”	Port 22, company public CIDR
“Default NACL behavior”	Allow ALL traffic
“Custom NACL behavior”	Deny ALL traffic
“ACCEPT then REJECT in flow logs”	NACL blocking (not SG)
“Return traffic blocked”	NACL (stateless)
“Ephemeral ports”	NACL outbound/inbound rules needed
“Top-10 IP addresses in flow logs”	CloudWatch Contributor Insights
“Analyze flow logs with SQL”	S3 + Athena
“VPC Peering transitive?”	NO — need TGW
“Route tables updated one side only”	Update BOTH VPCs
“S3/DynamoDB private access from VPC”	Gateway Endpoint (free)
“AWS service access from on-prem”	Interface Endpoint
“Lambda can’t reach DynamoDB”	VPC Gateway Endpoint
“Expose service cross-VPC privately”	PrivateLink (NLB + ENI)
“Ping EC2 from on-prem fails”	ICMP in SG + Route Propagation
“Multiple on-prem sites, VPN backup”	VPN CloudHub
“Private, dedicated, consistent connection”	Direct Connect
“Encrypted connection over internet”	Site-to-Site VPN
“Improve connection in days/1 week”	VPN (NOT DX — >1 month)
“DX backup, cost-effective”	Site-to-Site VPN
“500 Mbps DX connection”	Hosted (Dedicated = 1/10/100 only)
“DX to multiple regions”	DX Gateway
“Share DX across accounts”	TGW + DX GW + Transit VIF + RAM
“VPN bandwidth >1.25 Gbps”	TGW + ECMP
“Many VPCs + on-prem, simplify”	Transit Gateway
“IP Multicast”	Transit Gateway
“IPv6 outbound, block inbound”	Egress-only IGW
“Can’t launch EC2 in subnet”	IPv4 exhausted → new CIDR
“Reduce S3 access cost from VPC”	Gateway Endpoint (free)
“Capture IP traffic metadata”	VPC Flow Logs
“Deep packet inspection”	VPC Traffic Mirroring
“VPC-wide L3-L7 protection”	AWS Network Firewall
“Centralize firewall rules cross-account”	AWS Firewall Manager
“ALB → EC2 SG, most secure”	Reference ALB’s SG (not CIDR)

Part 6: Elimination Checklist

Connectivity Questions

□ Is it on-premises → AWS?
  → Yes: VPN, DX, or Client VPN
    □ Need it fast (days)?
      → Yes = Site-to-Site VPN
      → No (can wait months) = Direct Connect
    □ Individual user (laptop)?
      → Yes = Client VPN
    □ Multiple on-prem sites?
      → Yes = VPN CloudHub
  → No: VPC-to-VPC or VPC-to-service

□ Is it VPC → VPC?
  → 2 VPCs = VPC Peering
  → Many VPCs = Transit Gateway
  → Expose single service = PrivateLink

□ Is it VPC → AWS Service?
  → S3 or DynamoDB = Gateway Endpoint
  → Anything else = Interface Endpoint
  → Needs on-prem access = Interface Endpoint

Security Questions

□ What layer?
  → L3-L4 per instance = Security Group
  → L3-L4 per subnet = NACL
  → L7 HTTP only = WAF
  → L3-L7 entire VPC = Network Firewall

□ Need DENY rules?
  → Yes = NACL (SGs only have ALLOW)

□ Stateful or stateless matters?
  → "Return traffic blocked" = NACL (stateless)
  → "Ephemeral ports needed" = NACL

Cost Questions

□ Private IP or Public IP?
  → Private = cheaper (free same-AZ, $0.01 cross-AZ)
  → Public = $0.02 even same-AZ

□ S3 access path?
  → NAT GW → IGW = expensive
  → Gateway Endpoint = free

□ Content delivery?
  → S3 direct = $0.09/GB
  → CloudFront = $0.085/GB + caching

🏆 The Golden Rules

Route Tables are the backbone (no routes = no connectivity, regardless of gateways)
SG = stateful, NACL = stateless (derive all firewall behavior from this)
“ACCEPT then REJECT” = always NACL (SGs never block return traffic)
Gateway Endpoint for S3/DynamoDB (free, always preferred on exam)
VPC Peering is NOT transitive (many VPCs → Transit Gateway)
DX takes >1 month (quick fix → VPN, DX backup → VPN)
Dedicated DX = 1/10/100 Gbps (anything else → Hosted)
NAT Gateway has NO Security Groups (SGs/Bastion → NAT Instance)
VPN max 1.25 Gbps via VGW (more → TGW + ECMP)
IPv4 outbound = NAT, IPv6 outbound = Egress-only IGW (don’t mix them)
5 IPs reserved per subnet (always add 5 to your requirement)
Private IP = cheaper + better performance (always prefer over public)
Ingress = free, egress = costs money (keep processing inside AWS)
Reference SGs in rules (more secure and dynamic than CIDR-based rules)
DX is NOT encrypted (add VPN on top for encryption)

Amazon Route53:

┌──────────┐   example.com?   ┌─────────────┐
│  Client  │ ───────────────→ │  Route 53   │
│          │ ←─────────────── │             │
└────┬─────┘   54.22.33.44    └─────────────┘
     │
     │  54.22.33.44
     ▼
┌─────────────────────────────────────┐
│            AWS Cloud                │
│     ┌──────────────────────┐        │
│     │    EC2 Instance      │        │
│     │  Public IP:          │        │
│     │  54.22.33.44         │        │
│     └──────────────────────┘        │
└─────────────────────────────────────┘

AWS Route53 is a managed DNS (Domain Name System), collection of rules and records which helps clients understand how to reach a server through URLs.

Feature	Details
Type	Highly available, scalable, fully managed Authoritative DNS
Authoritative	You (customer) can update DNS records
Domain Registrar	Yes — can register domains directly
Health Checks	Monitor health of your resources
SLA	100% availability (only AWS service with this!)
Scope	Global service (not regional)
Why “53”?	Traditional DNS port number

⚠️ Exam trap: Route 53 is a global service — no region selection needed!

DNS Terminologies:

        http://api.www.example.com.
               │   │       │     │ │
               │   │       │     │ └── Root (.)
               │   │       │     └──── TLD (.com, .gov, .org)
               │   │       └────────── SLD (example.com)
               │   └────────────────── Sub Domain (www)
               └────────────────────── Sub Domain (api)
               
        └────────────────────────────┘
            FQDN (Fully Qualified Domain Name)

Term	Description
Domain Registrar	Amazon Route 53, GoDaddy, etc.
DNS Records	A, AAAA, CNAME, NS, etc.
Zone File	Contains DNS records
Name Server	Resolves DNS queries (Authoritative or Non-Authoritative)
TLD	.com, .us, .gov, .org
SLD	amazon.com, google.com

DNS Resolution Flow:

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Browser │───→│   OS    │───→│   ISP   │───→│  Root   │───→│   TLD   │───→│  Name   │
│  Cache  │    │  Cache  │    │  Cache  │    │ Server  │    │ Server  │    │ Server  │
└─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘    └────┬────┘
                                                                                │
                                              ┌─────────────────────────────────┘
                                              ▼
                                        IP Address
                                     (cached on way back)

Browser cache → OS cache → ISP DNS Resolver
Root Server → “Go ask .com TLD”
TLD Server → “Go ask example.com Name Server”
Name Server → Returns IP address
Caches populated on the way back

Route 53 – Records:

Each record contains:

Field	Description
Domain/subdomain Name	e.g., example.com
Record Type	A, AAAA, CNAME, NS, etc.
Value	e.g., 12.34.56.78
Routing Policy	How Route 53 responds to queries
TTL	Time record is cached at DNS Resolvers

Record Type	Must Know	Description
A	✅	Maps hostname to IPv4
AAAA	✅	Maps hostname to IPv6
CNAME	✅	Maps hostname → another hostname (target must have A/AAAA)
NS	✅	Name Servers for the Hosted Zone — controls traffic routing
CAA, DS, MX, PTR, SOA, TXT, SPF, SRV	Advanced	Less common record types

⚠️ Exam trap: CNAME cannot be used for Zone Apex (example.com) — only for subdomains (www.example.com). Use Alias for apex!

Route 53 – Hosted Zones:

A container for records that define how to route traffic to a domain and its subdomains.

Type	Access	Example	Use Case
Public	Internet	example.com → 54.22.33.44	S3, CloudFront, EC2 (Public IP), ALB
Private	Within VPC(s)	api.example.internal → 10.0.0.10	Internal EC2, RDS, microservices

    PUBLIC HOSTED ZONE                    PRIVATE HOSTED ZONE
    ──────────────────                    ───────────────────
         
    ┌────────┐  example.com?              ┌─────────────────────────────────┐
    │ Client │ ──────────────┐            │              VPC                │
    └────────┘               │            │  ┌─────────────────────────┐    │
         ▲                   ▼            │  │  Private Hosted Zone    │    │
         │            ┌────────────┐      │  └───────────┬─────────────┘    │
         │            │  Public    │      │              │                  │
         └────────────│  Hosted    │      │   ┌──────────┴──────────┐       │
       54.22.33.44    │  Zone      │      │   ▼                     ▼       │
                      └─────┬──────┘      │ api.example     db.example      │
                            │             │ .internal?      .internal?      │
                            ▼             │   │                     │       │
                    ┌───────────────┐     │   ▼                     ▼       │
                    │ S3, CloudFront│     │ 10.0.0.10          10.0.0.35    │
                    │ EC2, ALB      │     │ (EC2)              (RDS)        │
                    └───────────────┘     └─────────────────────────────────┘

Cost: $0.50/month per hosted zone

Route 53 – TTL (Time To Live):

              myapp.example.com? 
┌────────┐ ─────────────────────→ ┌──────────┐
│ Client │ ←───────────────────── │ Route 53 │
└───┬────┘                        └──────────┘
    │  A 12.34.56.78 (TTL) 
    │
    │  Client caches result for TTL duration
    │
    │        HTTP Request
    └──────────────────────────→ ┌────────────┐
    ←─────────────────────────── │ Web Server │
             HTTP Response       └────────────┘

TTL	Traffic to Route 53	Record Freshness	Cost	Use Case
High (24 hr)	Less	Possibly outdated	Lower	Stable records
Low (60 sec)	More	Always fresh	Higher $$	Before migrations/changes

TTL is mandatory for each DNS record (except Alias records)
Strategy: Lower TTL before planned changes → make change → raise TTL back

⚠️ Exam trap: Changed DNS record but users still go to old IP? → TTL caching! Clients cache until TTL expires.

Route 53 – CNAME vs Alias:

AWS resources expose ugly hostnames (e.g., lb1-1234.us-east-2.elb.amazonaws.com) — you want myapp.mydomain.com

Feature	CNAME	Alias
Points to	Any hostname	AWS resources only
Zone Apex (root domain)	❌ NO	✅ YES
Cost	Standard DNS charges	Free
Health Check	❌	✅ Native
TTL	You set it	Auto-managed by Route 53
Record Type	CNAME	A or AAAA

⚠️ Exam trap: Need to point mydomain.com (root) to an ALB? → Alias (CNAME won’t work!)

Route 53 – Alias Records:

AWS extension to DNS that maps a hostname to an AWS resource. Automatically recognizes IP changes on the target.

┌───────────────────────────────────────────────┐
│  Route 53 Alias Record                        │
│  ┌────────────────────────────────────────┐   │
│  │ Record: example.com                    │   │
│  │ Type: A                                │   │
│  │ Value: MyALB-123456789.us-east-1...    │   │
│  └────────────────────────────────────────┘   │
└───────────┬───────────────────────────────────┘
            │ AWS-Managed (IP changes tracked)
            ▼
    ┌──────────────────┐
    │ Application      │
    │ Load Balancer    │
    │ (MyALB-1234...)  │
    └──────────────────┘

Characteristic	Detail
Works at Zone Apex	✅ YES (example.com)
Cost	Free (unlike CNAME)
Health Checks	✅ Native support
TTL	❌ Not settable (auto-managed)
Record Type	A or AAAA only
Auto IP tracking	✅ YES (AWS manages)

⚠️ Exam trap: Alias records — you cannot set TTL (Route 53 manages it automatically)

Targets:

Elastic Load Balancers (ALB, NLB, Classic LB)
CloudFront Distributions
API Gateway
Elastic Beanstalk environments
S3 Websites
VPC Interface Endpoints
Global Accelerator
Route 53 records in same hosted zone

⚠️ Exam trap: Cannot use Alias for EC2 DNS names — use regular A record or CNAME instead!

Route 53 – Routing Policies:

DNS responds to queries (does NOT route traffic like a load balancer).

Route53 policies:

Simple routing policy: Use for a single resource that performs a given function for your domain, for example, a web server that serves content for the example.com website. You can use simple routing to create records in a private hosted zone;
Weighted routing policy: Use to route traffic to multiple resources in proportions that you specify. You can use weighted routing to create records in a private hosted zone.
Failover routing policy: Use when you want to configure active-passive failover. You can use failover routing to create records in a private hosted zone;
Latency routing policy: Use when you have resources in multiple AWS Regions and you want to route traffic to the Region that provides the best latency. You can use latency routing to create records in a private hosted zone;
Geolocation routing policy: Use when you want to route traffic based on the location of your users. You can use geolocation routing to create records in a private hosted zone;
Geoproximity routing policy: Use when you want to route traffic based on the location of your resources and, optionally, shift traffic from resources in one location to resources in another location. You can use geoproximity routing to create records in a private hosted zone;
Multivalue answer routing policy: Use when you want Route 53 to respond to DNS queries with up to eight healthy records selected at random. You can use multivalue answer routing to create records in a private hosted zone;
IP-based routing policy: Use when you want to route traffic based on the location of your users, and have the IP addresses that the traffic originates from.

Policy	Use Case	Key Feature
Simple	Single resource	Randomly chosen if multiple values
Weighted	Load balancing	Control traffic % distribution
Failover	Active-passive HA	Primary + standby resource
Latency	Multi-region	Routes to lowest latency region
Geolocation	Location-based	Route by user geography
Geoproximity	Resource location bias	Route by resource location + bias
Multivalue	Multiple IPs	Up to 8 random healthy records
IP-based	Client IP routing	Route by CIDR blocks

⚠️ Exam trap: “Routing” in Route 53 ≠ Load Balancer routing. DNS responds to queries; it doesn’t route actual traffic!

Routing Policy – Simple:

SINGLE VALUE                    MULTIPLE VALUES
─────────────                   ──────────────────

    foo.example.com                foo.example.com
         │                              │
         │ A 11.22.33.44               │ A 11.22.33.44
┌─────────────────┐            ┌─────────────────────┐
│  Client         │            │  Client chooses     │
│  Gets 1 value   │            │  a random value     │
└─────────────────┘            └─────────────────────┘
                                   │ A 55.66.77.88
                                   │ A 99.11.22.33

Use for single resource (typical case)
Can return multiple values in same record — client picks one randomly
❌ Cannot use with Health Checks
When Alias enabled → only one AWS resource allowed

⚠️ Exam trap: Simple policy with multiple values ≠ load balancing! No health checks, no failover.

Routing Policy – Weighted:

Control % of traffic to each resource via relative weights
Formula: traffic % = weight for record / sum of all weights
Weights don’t need to sum to 100
Records must have same name and type
✅ Can use with Health Checks
Use cases: Load balancing between regions, A/B testing, canary deployments (e.g., 5% to new Elastic Beanstalk env)
Weight = 0 → stops traffic to that resource
All weights = 0 → all records returned equally

⚠️ Exam trap: Weighted ≠ round-robin! It’s percentage-based distribution, not sequential rotation. ⚠️ Exam trap: Weight = 0 stops all traffic to that resource (useful for maintenance)

Routing Policy – Latency-based:

Routes to resource with lowest latency (best response time) to user
Latency measured between user and AWS Regions
✅ Can use with Health Checks (failover capability)
Use case: App in multiple regions → minimize response time for users

⚠️ Exam trap: Latency ≠ Geography! German user may be directed to US if that has lowest latency. ⚠️ Exam trap: “Best user experience” / “minimize response time” → Latency, not Geolocation!

Route 53 – Health Checks:

Multi-region failover architecture:

                    ┌───────────┐
                    │ Route 53  │
                    │ DNS Record│
                    └─────┬─────┘
                          │
            ┌─────────────┴─────────────┐
            ▼                           ▼
       ❤ Health Check              ❤ Health Check
            │                           │
   ┌────────┴────────┐         ┌────────┴────────┐
   │    us-east-1    │         │    eu-west-1    │
   │  ┌───────────┐  │         │  ┌───────────┐  │
   │  │    ALB    │  │         │  │    ALB    │  │
   │  └─────┬─────┘  │         │  └─────┬─────┘  │
   │        ▼        │         │        ▼        │
   │  ┌───────────┐  │         │  ┌───────────┐  │
   │  │    ASG    │  │         │  │    ASG    │  │
   │  └───────────┘  │         │  └───────────┘  │
   └─────────────────┘         └─────────────────┘

How endpoint monitoring works:

  ❤ Health Checker   ❤ Health Checker   ❤ Health Checker
     (us-east-1)        (us-west-1)        (sa-east-1)
          │                  │                  │
          └──────────────────┼──────────────────┘
                             │ HTTP request to /health
                             ▼ 200 code
                    ┌─────────────────┐
                    │    eu-west-1    │
                    │  ┌───────────┐  │
                    │  │    ALB    │──┼── Must allow Route 53
                    │  └─────┬─────┘  │   Health Checker IPs!
                    │        ▼        │
                    │  ┌───────────┐  │
                    │  │  EC2/ASG  │  │
                    │  └───────────┘  │
                    └─────────────────┘

IP ranges: https://ip-ranges.amazonaws.com/ip-ranges.json

Setting	Value
Global health checkers	~15
Threshold (healthy/unhealthy)	3 (default)
Interval	30 sec (10 sec = higher cost)
Protocols	HTTP, HTTPS, TCP
Healthy if	>18% checkers report healthy
Pass codes	2xx and 3xx only
Text match	First 5120 bytes of response

⚠️ Exam trap: Must configure firewall/security group to allow Route 53 Health Checker IPs!

HTTP Health Checks work only for public resources
Integrated with CloudWatch metrics

Health Check Type	What It Monitors	Use Case
Endpoint	Application, server, AWS resource	Direct resource monitoring
Calculated	Other health checks	Aggregate multiple checks
CloudWatch Alarm	CW Alarms (DynamoDB throttles, RDS, custom)	Private resources

⚠️ Exam trap: Only 3 health check types! No direct SQS, SNS, or other service monitoring — use CloudWatch Alarm instead.

⚠️ Exam trap: Private resources → use CloudWatch Alarm health checks (HTTP checks can’t reach them!)

Health Checks for Private Resources:

                              ┌─────────────────────────────────┐
                              │              VPC                │
┌─────────────────┐           │  ┌───────────────────────────┐  │
│ Health Checker  │           │  │     Private subnet        │  │
│  (us-east-1)    │           │  │    ┌─────────────┐        │  │
└────────┬────────┘           │  │    │  EC2 (T2)   │        │  │
         │                    │  │    └──────┬──────┘        │  │
         │ ✖ Can't reach!     │  │           │ monitor       │  │
         │                    │  │           ▼               │  │
         │    monitor         │  │    ┌─────────────┐        │  │
         └────────────────────┼──┼───→│ CloudWatch  │        │  │
                              │  │    │   Alarm     │        │  │
                              │  │    └─────────────┘        │  │
                              │  └───────────────────────────┘  │
                              └─────────────────────────────────┘

Route 53 health checkers are outside VPC — can’t reach private endpoints
Solution: Create CloudWatch Metric → CloudWatch Alarm → Health Check monitors the alarm

Routing Policy – Geolocation:

Routes based on user location (not latency!)
Specify: Continent → Country → US State (most precise wins)
Must create “Default” record for unmatched locations
✅ Can use with Health Checks
Use cases: Website localization, content/access restrictions by country, regional load balancing

⚠️ Exam trap: Geolocation ≠ Latency! Geolocation = user’s geography; Latency = network performance. ⚠️ Exam trap: “Legal requirement” / “restrict access by country” → Geolocation (not Latency!)

Routing Policy – Geoproximity:

Routes based on geographic location of users AND resources
Use bias to shift traffic between resources:
- Expand (1 to 99) → more traffic to resource
- Shrink (-1 to -99) → less traffic to resource
Resources: AWS (specify region) or Non-AWS (specify lat/long)
Requires Route 53 Traffic Flow

⚠️ Exam trap: Geoproximity requires Traffic Flow (paid feature). Geolocation does NOT!

Routing Policy – IP-based:

  User B              User A
(200.5.4.100)      (203.0.113.56)
      │                  │
      └────────┬─────────┘
               ▼
          ┌─────────┐
          │Route 53 │
          └────┬────┘
               │
     ┌─────────┴─────────┐
     │  CIDR Collection  │
     ├───────────────────┤
     │ location-1: 203.0.113.0/24 │
     │ location-2: 200.5.4.0/24   │
     └─────────┬─────────┘
               │
     ┌─────────┴─────────┐
     │      Records      │
     ├───────────────────┤
     │ example.com → 1.2.3.4 (location-1) │
     │ example.com → 5.6.7.8 (location-2) │
     └─────────┬─────────┘
               │
       ┌───────┴───────┐
       ▼               ▼
   EC2 (5.6.7.8)   EC2 (1.2.3.4)
    User B →         User A →

Routes based on client IP address (CIDR blocks)
You define: CIDR → Location → Endpoint mappings
Use cases: Optimize performance, reduce network costs, route specific ISP users

Routing Policy – Multi-Value:

Returns multiple values/resources (up to 8 healthy records)
✅ Can use with Health Checks — returns only healthy resources
Client chooses one from the returned values

⚠️ Exam trap: Multi-Value is NOT a substitute for ELB! It’s client-side selection, not load balancing.

Domain Registrar vs. DNS Service:

Concept	Description
Domain Registrar	Where you buy/register domain (GoDaddy, Amazon Registrar, etc.) — annual fee
DNS Service	Where you manage DNS records (can be different from registrar!)

Registrar usually provides DNS service, but you can use a different DNS provider
Example: Buy domain from GoDaddy → Use Route 53 to manage DNS records
To use Route 53 with 3rd party registrar: Create Public Hosted Zone → Update NS records at the registrar (not in Route 53!)

⚠️ Exam trap: Update NS records at the registrar (GoDaddy), not in Route 53! And use Public Hosted Zone for internet-facing domains.

Route 53 – Hybrid DNS:

Route 53 Resolver automatically answers DNS queries for:

Local domain names for EC2 instances
Records in Private Hosted Zones
Records in public Name Servers

Hybrid DNS = Resolving DNS queries between VPC (Route 53 Resolver) and your networks (other DNS Resolvers)

Network Type	Connection
VPC / Peered VPC	Native
On-premises	Direct Connect or AWS VPN

Route 53 – Resolver Endpoints:

Inbound Endpoint — On-premises DNS resolvers can query Route 53 Resolver for AWS resources

                                    ┌─────────────────────────────────────────┐
                                    │               us-east-1                 │
   On-Premises Data Center          │  ┌───────────────────────────────────┐  │
  ┌──────────────────────┐          │  │              VPC                  │  │
  │                      │          │  │     Private Hosted Zone           │  │
  │  ┌────────────────┐  │          │  │       (aws.private)               │  │
  │  │ DNS Resolvers  │  │          │  │  ┌─────────────────────────────┐  │  │
  │  │(onpremise.     │  │          │  │  │     Private Subnet          │  │  │
  │  │  private)      │──┼── DNS Query: app.aws.private? ──────────────→│  │  │
  │  └────────────────┘  │          │  │  │  ┌────────────┐  ┌────────┐ │  │  │
  │         ▲            │          │  │  │  │    EC2     │  │Resolver│ │  │  │
  │         │            │          │  │  │  │(app.aws.   │←─│Inbound │ │  │  │
  │  ┌──────┴───────┐    │          │  │  │  │  private)  │  │Endpoint│ │  │  │
  │  │    Server    │    │          │  │  │  └────────────┘  └───┬────┘ │  │  │
  │  │ (web.onprem  │    │◀═══VPN or DX═══════════════════════════╝     │  │  │
  │  │  .private)   │    │          │  │  └─────────────────────────────┘  │  │
  │  └──────────────┘    │          │  └──────────────┬────────────────────┘  │
  └──────────────────────┘          │                 │ lookup               │
                                    │                 ▼                      │
                                    │           Route 53 Resolver            │
                                    └─────────────────────────────────────────┘

Endpoint	Direction	Use Case
Inbound	On-prem → AWS	On-prem resolves AWS Private Hosted Zone records
Outbound	AWS → On-prem	AWS resources resolve on-premises DNS records

⚠️ Exam trap: Inbound = queries coming IN to AWS. Outbound = queries going OUT from AWS. Think from AWS perspective!

Outbound Endpoint — Route 53 Resolver forwards DNS queries to on-premises DNS Resolvers

                                    ┌─────────────────────────────────────────┐
                                    │               us-east-1                 │
   On-Premises Data Center          │  ┌───────────────────────────────────┐  │
  ┌──────────────────────┐          │  │              VPC                  │  │
  │                      │          │  │     Private Hosted Zone           │  │
  │  ┌────────────────┐  │          │  │       (aws.private)               │  │
  │  │ DNS Resolvers  │  │          │  │  ┌─────────────────────────────┐  │  │
  │  │(onpremise.     │←─┼── DNS Query: web.onpremise.private? ────────│  │  │
  │  │  private)      │  │          │  │  │  ┌────────────┐  ┌────────┐ │  │  │
  │  └────────────────┘  │          │  │  │  │    EC2     │─→│Resolver│ │  │  │
  │         │            │          │  │  │  │(app.aws.   │  │Outbound│ │  │  │
  │         ▼            │          │  │  │  │  private)  │  │Endpoint│─┼──┼──┘
  │  ┌──────────────┐    │          │  │  │  └────────────┘  └────────┘ │  │
  │  │    Server    │    │◀═══VPN or DX═════════════════════════════════╝  │
  │  │ (web.onprem  │    │          │  │  └─────────────────────────────┘  │
  │  │  .private)   │    │          │  └──────────────┬────────────────────┘
  │  └──────────────┘    │          │                 │                    │
  └──────────────────────┘          │                 ▼                    │
                                    │           Route 53 Resolver          │
                                    └──────────────────────────────────────┘

Route 53 – Resolver Rules

Resolver Rules = define how DNS queries are forwarded from Outbound Endpoints

Rule Type	Description
Conditional Forwarding	Forward queries for specific domains to target DNS servers
System	Default rules (auto-created for Private Hosted Zones, VPC DNS)
Recursive	Forward all unmatched queries to Route 53 Resolver

Rules can be shared across accounts via AWS RAM
Use case: Centralized DNS management in multi-account setup

Resolver Rules Example:

Query: db.corp.local           Query: api.example.com
         │                              │
         ▼                              ▼
    ┌──────────────────────────────────────┐
    │         Resolver Rules               │
    │  ┌────────────────────────────────┐  │
    │  │ *.corp.local → 10.0.0.53       │──┼──▶ On-prem DNS
    │  │ *.example.com → System Rule    │──┼──▶ Route 53
    │  │ * (default) → Recursive        │──┼──▶ Public DNS
    │  └────────────────────────────────┘  │
    └──────────────────────────────────────┘

⚠️ Exam trap: “Share DNS resolution across accounts” → Resolver Rules + AWS RAM

Route 53 – DNSSEC

DNSSEC = DNS Security Extensions — protects against DNS spoofing/cache poisoning

Feature	Details
Purpose	Cryptographically sign DNS records to verify authenticity
Route 53 support	✅ DNSSEC signing for public hosted zones
How it works	Uses KMS to manage keys (KSK), Route 53 manages ZSK
Chain of trust	Root → TLD → Your domain (DS records link them)

Setup steps:

Enable DNSSEC signing in Route 53
Create KSK (Key Signing Key) in KMS — must be in us-east-1
Establish chain of trust with parent zone (add DS record at registrar)

⚠️ Exam trap: “Prevent DNS spoofing” or “verify DNS response authenticity” → DNSSEC ⚠️ Exam trap: DNSSEC KMS key must be in us-east-1 (like CloudFront certificates)

🎯 MASTER SUMMARY: Route 53 Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: DNS ≠ Load Balancer

Route 53 responds to DNS queries — it returns IP addresses. It does NOT route actual network traffic.

“Routing policy” = how Route 53 answers DNS queries
Client receives IP(s), then connects directly to the resource
This is why Multi-Value is NOT a replacement for ELB

Principle 2: Alias = AWS’s DNS Superpower

CNAME has limitations (can’t use at Zone Apex, costs money). AWS invented Alias to solve this:

Works at Zone Apex (example.com) ✅
Free of charge ✅
Auto-tracks IP changes ✅
Native health check integration ✅

Rule: If pointing to AWS resource → use Alias. If pointing to non-AWS → use CNAME (or A record).

Principle 3: Zone Apex = Root Domain Problem

example.com (no subdomain) = Zone Apex = Root Domain

Record Type	Zone Apex?	Example
CNAME	❌ NO	Cannot use for example.com
Alias	✅ YES	Can use for example.com
A Record	✅ YES	Can use for example.com

DNS standard forbids CNAME at apex. AWS Alias bypasses this limitation.

Principle 4: Health Checks = Failover Enabler

Health checks are the foundation of high availability in Route 53:

Without health check → no automatic failover
Health checks are public — they run from outside your VPC
Private resources → use CloudWatch Alarm health checks

Principle 5: TTL = Caching Control

TTL determines how long clients cache DNS responses:

High TTL = less Route 53 traffic, lower cost, stale records risk
Low TTL = more traffic, higher cost, fresh records
Strategy: Lower TTL before changes → make change → raise TTL

Principle 6: Latency ≠ Geography

Two commonly confused policies:

Latency: Routes to region with best network performance (may cross continents!)
Geolocation: Routes based on user’s physical location (for legal/content restrictions)

German user might be routed to US-East if that has lower latency than EU-West.

Principle 7: Resource Policies for Cross-Service/Cross-Account

When another AWS service or another account needs access:

Update NS records at the registrar (not in Route 53)
Use Public Hosted Zone for internet-facing domains

Principle 8: Hybrid DNS = Inbound + Outbound

Think from AWS’s perspective:

Inbound Endpoint: Queries coming IN to AWS (on-prem → AWS)
Outbound Endpoint: Queries going OUT from AWS (AWS → on-prem)

Part 2: Decision Tree (Follow Keywords → Find Answer)

Step 1: What type of record do you need?

                    What are you pointing to?
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
   IPv4 Address         AWS Resource          Another Hostname
        │                     │                     │
        ▼                     ▼                     ▼
    A Record              Alias A/AAAA          Is it Zone Apex?
                          (FREE!)                   │
                                            ┌───────┴───────┐
                                            ▼               ▼
                                           Yes              No
                                            │               │
                                            ▼               ▼
                                         Alias           CNAME ok
                                       (required)

Step 2: Which routing policy?

                        What's the requirement?
                              │
    ┌────────────┬────────────┼────────────┬────────────┬────────────┐
    ▼            ▼            ▼            ▼            ▼            ▼
 Single      Traffic %    Best User    User's      Failover     Client IP
 Resource    Control      Experience   Country     HA Setup     Routing
    │            │            │            │            │            │
    ▼            ▼            ▼            ▼            ▼            ▼
 Simple      Weighted     Latency    Geolocation  Failover     IP-based

Step 3: Feature-Based Decision Table

If question mentions…	Answer is…
“Zone Apex” / “root domain” + AWS resource	Alias record
“example.com” (not www.) + ALB/CloudFront	Alias record
“free DNS queries”	Alias record
“minimize response time” / “best user experience”	Latency routing
“legal requirement” / “restrict by country”	Geolocation routing
“content localization by region”	Geolocation routing
“A/B testing” / “canary deployment”	Weighted routing
“traffic percentage” / “gradual migration”	Weighted routing
“active-passive” / “disaster recovery”	Failover routing
“primary and secondary”	Failover routing
“shift traffic between locations” / “bias”	Geoproximity routing
“on-premises resolves AWS domains”	Inbound Resolver Endpoint
“AWS resolves on-premises domains”	Outbound Resolver Endpoint
“DNS spoofing” / “verify authenticity”	DNSSEC
“private resource health check”	CloudWatch Alarm health check
“share DNS rules across accounts”	Resolver Rules + AWS RAM
“users still see old IP after change”	TTL caching issue

The “NOT” Rules (Eliminate Wrong Answers Fast)

Statement	Why It’s Wrong
CNAME for Zone Apex	CNAME cannot be used at root domain
Alias for EC2 DNS name	Alias doesn’t support EC2 DNS — use A/CNAME
Alias with custom TTL	Alias TTL is auto-managed, cannot be set
Health check for private EC2	Health checkers can’t reach private subnets
Simple policy with health check	Simple routing doesn’t support health checks
Geolocation for best performance	Geolocation = geography, not network performance
Multi-Value replaces ELB	Multi-Value is DNS-level, not true load balancing
Route 53 routes traffic	Route 53 answers DNS queries, doesn’t route traffic

The “CANNOT” List

Cannot…	Instead…
Use CNAME at Zone Apex	Use Alias
Set TTL on Alias records	TTL is auto-managed
Create Alias to EC2 DNS name	Use A record or CNAME
Health check private resources directly	Use CloudWatch Alarm
Use Geoproximity without Traffic Flow	Traffic Flow is required
Have health checks with Simple policy	Use Weighted/Failover/Multi-Value

Part 3: Scenario Pattern Recognition

Pattern: “Point root domain to AWS resource”

Keywords: example.com (no www), Zone Apex, ALB, CloudFront, root domain

Answer: Alias record (A type)

Why: CNAME cannot be used at Zone Apex. Alias can.

Pattern: “Minimize response time / Best user experience”

Keywords: lowest latency, best performance, fastest response, multi-region app

Answer: Latency-based routing

Why: Routes to AWS region with best network performance, regardless of geography.

Pattern: “Restrict access by country / Legal compliance”

Keywords: country restrictions, content localization, legal requirement, GDPR

Answer: Geolocation routing

Why: Routes based on user’s physical location, not network performance.

Pattern: “Gradual migration / Canary deployment”

Keywords: A/B testing, percentage of traffic, gradual rollout, 10% to new version

Answer: Weighted routing

Why: Control exact percentage of traffic to each resource.

Pattern: “Active-passive / Disaster recovery”

Keywords: primary and secondary, failover, standby, DR site

Answer: Failover routing policy + Health checks

Why: Automatically switches to secondary when primary fails health check.

Pattern: “On-premises needs to resolve AWS private domains”

Keywords: hybrid cloud, on-premises DNS, resolve Private Hosted Zone from datacenter

Answer: Inbound Resolver Endpoint

Why: Allows on-prem DNS servers to query Route 53 for AWS resources.

Pattern: “AWS resources need to resolve on-premises domains”

Keywords: EC2 needs to reach on-prem by hostname, resolve corp.local from VPC

Answer: Outbound Resolver Endpoint + Forwarding Rules

Why: Forwards DNS queries from VPC to on-premises DNS servers.

Pattern: “Users still seeing old IP after DNS change”

Keywords: DNS not updating, old IP, change not propagating

Answer: TTL caching issue

Solution: Wait for TTL to expire, or lower TTL before making changes.

Pattern: “Health check for private/internal resource”

Keywords: private subnet, internal EC2, RDS health, can’t reach from internet

Answer: CloudWatch Alarm-based health check

Why: Route 53 health checkers are public — can’t reach private resources.

Pattern: “Prevent DNS spoofing / Verify DNS authenticity”

Keywords: DNS security, cache poisoning, MITM, verify DNS response

Answer: DNSSEC

Remember: KMS key must be in us-east-1.

Keywords: multi-account, centralized DNS, share resolver rules

Answer: Resolver Rules + AWS RAM

Why: Resolver Rules can be shared across accounts via Resource Access Manager.

Pattern: “Buy domain elsewhere, use Route 53 for DNS”

Keywords: GoDaddy, third-party registrar, use Route 53

Answer: Create Public Hosted Zone → Update NS records at the registrar

Why: NS records tell the internet where to find your DNS. Update at registrar, not Route 53.

Part 4: Quick Reference Tables

Routing Policy Comparison

Policy	Health Check?	Use Case	Key Feature
Simple	❌ No	Single resource	Returns all values, client picks
Weighted	✅ Yes	A/B testing, migration	Traffic % control
Failover	✅ Yes (required)	DR, active-passive	Primary + secondary
Latency	✅ Yes	Multi-region apps	Best network performance
Geolocation	✅ Yes	Country restrictions	User’s physical location
Geoproximity	✅ Yes	Shift traffic by location	Bias values (-99 to +99)
Multi-Value	✅ Yes	Multiple healthy IPs	Up to 8 healthy records
IP-based	✅ Yes	Route by client CIDR	Client IP → location mapping

Record Type Quick Reference

Record	Maps To	Zone Apex?	AWS Extension?
A	IPv4	✅ Yes	No
AAAA	IPv6	✅ Yes	No
CNAME	Hostname	❌ No	No
Alias	AWS Resource	✅ Yes	✅ Yes (AWS-only)
NS	Name Servers	✅ Yes	No

Alias Targets (What Can Alias Point To?)

✅ Can Alias To	❌ Cannot Alias To
ALB, NLB, Classic LB	EC2 DNS name
CloudFront Distribution	Non-AWS resources
API Gateway	RDS endpoint
Elastic Beanstalk	Other CNAMEs
S3 Website Endpoint
VPC Interface Endpoint
Global Accelerator
Another Route 53 record

Health Check Types

Type	Monitors	Use Case
Endpoint	HTTP/HTTPS/TCP to public IP	Public resources
Calculated	Other health checks (AND/OR)	Aggregate multiple checks
CloudWatch Alarm	CloudWatch metric state	Private resources

Key Numbers to Remember

Item	Value
Hosted Zone cost	$0.50/month
Health check interval	30 sec (10 sec = extra cost)
Health checkers globally	~15
Healthy threshold	3 consecutive
% checkers for healthy	>18%
Multi-Value max records	8
Weighted max value	Any number (relative)
Geoproximity bias range	-99 to +99
TTL recommendation before changes	Low (60 sec)

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“Zone Apex” / “root domain” + AWS	Alias record
“example.com to ALB”	Alias record
“free DNS queries to AWS”	Alias record
“CNAME at root”	❌ Not possible → use Alias
“lowest latency” / “best performance”	Latency routing
“country restriction” / “legal”	Geolocation routing
“localization by region”	Geolocation routing
“A/B test” / “canary”	Weighted routing
“percentage of traffic”	Weighted routing
“active-passive” / “DR”	Failover routing
“primary/secondary”	Failover routing
“shift traffic” / “bias”	Geoproximity routing
“private resource health”	CloudWatch Alarm health check
“on-prem → AWS DNS”	Inbound Resolver Endpoint
“AWS → on-prem DNS”	Outbound Resolver Endpoint
“DNS spoofing” / “DNSSEC”	DNSSEC (KMS key in us-east-1)
“share DNS across accounts”	Resolver Rules + AWS RAM
“old IP still showing”	TTL caching
“GoDaddy + Route 53”	Update NS at registrar
“100% availability SLA”	Route 53 (only AWS service!)

Part 6: Elimination Checklist

When stuck between options, eliminate systematically:

□ Is it Zone Apex (root domain)?
  → Yes = eliminate CNAME, must use Alias or A
  → No = CNAME is acceptable

□ Do they need health checks?
  → Yes = eliminate Simple routing
  → Failover REQUIRES health checks

□ Is it about USER LOCATION?
  → Physical location = Geolocation
  → Network performance = Latency

□ Is the resource PRIVATE?
  → Yes = eliminate direct HTTP health check
  → Use CloudWatch Alarm instead

□ Is it pointing to AWS resource?
  → Yes = prefer Alias (free, auto-tracking)
  → No = use CNAME or A record

□ Do they need traffic PERCENTAGE control?
  → Yes = Weighted routing
  → Just failover = Failover routing

□ Is it HYBRID (on-prem + AWS)?
  → On-prem queries AWS = Inbound Endpoint
  → AWS queries on-prem = Outbound Endpoint

□ Is it about DNS SECURITY?
  → Spoofing/authenticity = DNSSEC
  → KMS key must be in us-east-1

🏆 The Golden Rules

Zone Apex + AWS = Alias (CNAME doesn’t work at root)
Alias is FREE (CNAME costs money)
Alias TTL = auto-managed (you can’t set it)
Latency ≠ Geography (latency = network speed, geolocation = physical location)
Private resources = CloudWatch Alarm (health checkers can’t reach them)
Failover REQUIRES health checks (no health check = no failover)
Route 53 = DNS responses, not traffic routing (it returns IPs, doesn’t route packets)
Third-party registrar = update NS at registrar (not in Route 53)
Inbound = INTO AWS, Outbound = OUT OF AWS (from AWS perspective)
DNSSEC KMS key = us-east-1 (like CloudFront certificates)
Route 53 = 100% SLA (only AWS service with this guarantee)
Weight = 0 stops ALL traffic (useful for maintenance)

AWS CloudFront & Global Accelerator:

AWS Services: Global vs Regional:

Understanding which services are global vs regional is critical for:

Certificate placement (ACM)
Data residency requirements
Disaster recovery planning
Cross-region access patterns

Always Global Services (no region selection):

Service	Why Global	Key Implication
IAM	Identity is account-wide	Users, roles, policies work everywhere
Route 53	DNS is global	Hosted zones accessible from any region
CloudFront	CDN with edge locations	Certs must be in us-east-1
WAF (for CloudFront)	Attached to global CF	WAF rules in us-east-1
Global Accelerator	Anycast IPs, global routing	Entry point is global
AWS Organizations	Multi-account management	SCPs apply across all regions
Artifact	Compliance documents	Account-level access

Regional Services (must select region):

Service	Regional Scope	Cross-Region Options
EC2	Instances in one region	AMI copy, snapshots
S3	Bucket in one region	Cross-Region Replication (CRR)
RDS	DB in one region	Read Replicas, snapshots
Lambda	Functions in one region	Deploy to each region
API Gateway	API in one region	Edge-Optimized uses CF
DynamoDB	Table in one region	Global Tables (multi-region)
Aurora	Cluster in one region	Global Database
KMS	Keys in one region	Multi-Region Keys (mrk-)
Secrets Manager	Secrets in one region	Multi-region replication
CloudHSM	HSM in one region	No cross-region option!
ELB	Load balancer in one region	Use Global Accelerator for global
VPC	Network in one region	VPC Peering, Transit Gateway

Certificate (ACM) Placement Rules:

Scenario	ACM Certificate Region
CloudFront distribution	us-east-1 (always)
Edge-Optimized API Gateway	us-east-1 (uses CloudFront)
Regional API Gateway	Same region as API
ALB/NLB	Same region as load balancer

Memory trick: “Where does TLS terminate?”

CloudFront terminates TLS → us-east-1
Regional service terminates TLS → same region

Global Services That “Feel” Regional:

Service	Global?	Gotcha
S3 bucket names	Globally unique	But bucket lives in ONE region
Lambda@Edge	Runs at edge	Must be authored in us-east-1
WAF for ALB	Regional	WAF for CloudFront = global (us-east-1)

Cross-Region Capabilities Summary:

Need	Solution
Global static content	S3 + CloudFront
Global API	API Gateway (Edge-Optimized) or Global Accelerator + ALB
Global database (NoSQL)	DynamoDB Global Tables
Global database (SQL)	Aurora Global Database
Global encryption keys	KMS Multi-Region Keys
Global secrets	Secrets Manager replication
Global fixed IPs	Global Accelerator

⚠️ Exam trap: “CloudHSM multi-region” → IMPOSSIBLE. CloudHSM is single-region only, no replication.

⚠️ Exam trap: “Same KMS key in two regions” → Possible with Multi-Region Keys (mrk- prefix). Regular keys are regional.

⚠️ Exam trap: “Lambda@Edge in eu-west-1” → Wrong. Lambda@Edge must be created in us-east-1, CloudFront replicates it.

AWS CloudFront is a Content Delivery Network (CDN), improves read performance, content is cached at the edge.

Hundreds of Points of Presence globally (edge locations, caches)
DDoS protection (worldwide), integrates with Shield + WAF

⚠️ Exam trap: CloudFront SSL/TLS certificates must be in us-east-1 (even if origin is in another region)

CloudFront Origins:

Origin Type	Use Case	Notes
S3 Bucket	Distribute files, cache at edge	Secured with OAC (Origin Access Control)
VPC Origin	Private apps in VPC subnets	ALB / NLB / EC2 — no public exposure needed
Custom Origin (HTTP)	Any public HTTP backend	S3 static website, custom servers

⚠️ Exam trap: “Restrict S3 access to CloudFront only” → OAC + S3 Bucket Policy

Create OAC in CloudFront, update S3 bucket policy to allow only CloudFront
Wrong: S3 Access Points = simplify access management, not redirect
OAI (Origin Access Identity) = legacy, replaced by OAC

CloudFront with VPC Origin (Private Resources):

                                    ┌─────────────────────────────────────┐
                                    │ VPC                                 │
                                    │  ┌─────────────────────────────┐    │
Users ──▶ CloudFront ──▶ VPC Origin │  │ Private Subnet              │    │
          (Edge)                    │  │  ├─▶ ALB                    │    │
                                    │  │  ├─▶ NLB                    │    │
                                    │  │  └─▶ EC2                    │    │
                                    │  └─────────────────────────────┘    │
                                    └─────────────────────────────────────┘

CloudFront vs S3 Cross-Region Replication:

Feature	CloudFront	S3 CRR
Scope	Global edge network	Per-region setup
Updates	Cached with TTL	Near real-time
Access	Read/Write (upload via CF)	Read-only
Best for	Static content, global availability	Dynamic content, low-latency in few regions

CloudFront Origin Groups (Failover)

Origin Group = primary origin + secondary origin for failover
CloudFront automatically fails over when primary returns error (5xx, 4xx, timeout)
Use case: High availability, disaster recovery

CloudFront Origin Groups (Failover):

                         ┌─────────────────────┐
                         │   Origin Group      │
                         │                     │
Users ──▶ CloudFront ───▶│  Primary: S3 (us-east-1)
                         │      │              │
                         │      ▼ (on error)   │
                         │  Secondary: S3 (eu-west-1)
                         │                     │
                         └─────────────────────┘

⚠️ Exam trap: “CloudFront high availability” or “origin failover” → Origin Groups

Not Route 53 failover (that’s DNS-level, not CDN-level)

CloudFront Cache Invalidations

Origin updated → CloudFront doesn’t know until TTL expires
Invalidation = force cache refresh, bypass TTL
Invalidate all files (/*) or specific path (/images/*)

Cache Invalidation Flow:

Admin ──▶ Invalidate /images/* ──▶ CloudFront ──▶ Edge Locations
                                                   │
                                        ┌──────────┴──────────┐
                                        ▼                     ▼
                                   [Cache]               [Cache]
                                   index.html ✓          index.html ✓
                                   /images/ ✗            /images/ ✗
                                   (invalidated)         (invalidated)

CloudFront Behaviors & Path Patterns

Behaviors = rules that define how CloudFront handles requests for different paths

Each distribution has a default behavior (matches all paths /*)
Add custom behaviors for specific paths (e.g., /api/*, /images/*)
Order matters: most specific path wins

Setting	Options
Path Pattern	`/api/`, `/images/`, `*.jpg`, etc.
Origin	Which origin to route to
Cache Policy	TTL, headers/cookies to cache by
Viewer Protocol	HTTP only, HTTPS only, Redirect HTTP→HTTPS
Allowed Methods	GET/HEAD, GET/HEAD/OPTIONS, ALL
Edge Functions	CloudFront Functions, Lambda@Edge

CloudFront Behaviors Example:

Request Path          Behavior              Origin
─────────────────────────────────────────────────────
/api/*            ──▶ API Behavior     ──▶ ALB (no cache)
/images/*         ──▶ Images Behavior  ──▶ S3 (long TTL)
/static/*         ──▶ Static Behavior  ──▶ S3 (long TTL)
/* (everything)   ──▶ Default Behavior ──▶ ALB (short TTL)

⚠️ Exam traps:

“Different cache settings for /api vs /images” → Behaviors with path patterns
“Forward cookies only for /api/” → Custom behavior for /api/
“Redirect HTTP to HTTPS” → Viewer Protocol Policy in behavior

CloudFront Signed URLs & Signed Cookies

Distribute private content to authorized users
Attach policy: URL expiration, allowed IP ranges, trusted signers

Feature	Signed URL	Signed Cookie
Access scope	1 file per URL	Multiple files (entire path)
Use case	Individual file download	Video streaming, multi-file access
URL change	Yes (unique per file)	No (cookie sent with all requests)

Signed URL vs S3 Pre-Signed URL:

Feature	CloudFront Signed URL	S3 Pre-Signed URL
Access via	CloudFront edge (cached)	Direct to S3
Use when	CloudFront in front of S3	Direct S3 access needed
Features	Caching, filtering by IP/path/date	Simple, S3-only

⚠️ Exam trap: “Private content via CloudFront” → Signed URL/Cookie

1 file → Signed URL
Multiple files → Signed Cookie
Direct S3 access (no CF) → S3 Pre-Signed URL

CloudFront Functions vs Lambda@Edge

Both run code at edge locations, but different scale/capabilities:

Feature	CloudFront Functions	Lambda@Edge
Language	JavaScript only	Node.js, Python
Execution time	< 1 ms	Up to 5-10 sec
Max memory	2 MB	128-3008 MB
Scale	Millions req/sec	Thousands req/sec
Triggers	Viewer Request/Response only	Viewer + Origin Request/Response
Network/File access	❌	✅
Cost	1/6th of Lambda@Edge	Higher

CloudFront Request Flow:

                    CloudFront           CloudFront
                     Functions            Functions
                        │                     │
User ──▶ Viewer Request ▼ ──▶ Cache ──▶ Origin Request ──▶ Origin (S3/ALB)
              │                              │
              │         Lambda@Edge      Lambda@Edge
              │              │                │
         Viewer Response ◀───┘ ◀── Origin Response ◀──────┘

Use Cases:

Use Case	Best Choice
URL rewrites, header manipulation	CloudFront Functions
A/B testing (simple)	CloudFront Functions
Authentication (JWT validation)	CloudFront Functions
Complex auth (DB lookup)	Lambda@Edge
Image resizing	Lambda@Edge
Call external APIs	Lambda@Edge

⚠️ Exam traps:

“Lightweight, high-scale” → CloudFront Functions
“Network access, longer execution” → Lambda@Edge
“Viewer-only triggers” → both work; “Origin triggers” → Lambda@Edge only

CloudFront Geo Restriction

Restrict access by country (Allowlist or Blocklist)
Country determined by 3rd party Geo-IP database
Use case: Copyright laws, regional content licensing

⚠️ Exam trap: “Block/allow by country” → Geo Restriction

Wrong: OAC = S3 origin access (not geo blocking)
Wrong: Security Groups = can’t attach to CloudFront
Wrong: Route 53 Latency = routes to nearest, doesn’t block

CloudFront Pricing & Price Classes

Cost varies by edge location (US/EU cheapest → India most expensive)
Reduce cost by limiting edge locations via Price Classes

Price Class	Regions Included	Cost
All	All regions	Best performance, highest cost
200	Most regions (excludes South America, Australia/NZ)	Balanced
100	US, Mexico, Canada, Europe, Israel only	Lowest cost

⚠️ Exam trap: “Reduce CloudFront costs” → use Price Class 100/200 (fewer edge locations)

AWS Global Accelerator

Problem: Global users → public internet → many hops → high latency

Without Global Accelerator (Public Internet):

America ───┐
           │    ┌───┬───┬───┬───┐
Europe ────┼───▶│hop│hop│hop│hop│───▶ Public ALB (India)
           │    └───┴───┴───┴───┘
Australia ─┘         (latency)

Solution: Use AWS internal network via Anycast IPs

Unicast IP: one server = one IP
Anycast IP: all servers share same IP → client routed to nearest

How it works:

2 static Anycast IPs created for your app
Traffic → nearest Edge Location → AWS private network → your app
Up to 60% improvement in latency

With Global Accelerator:

Users ──▶ Anycast IP ──▶ Edge Location ──▶ AWS Private Network ──▶ ALB/NLB/EC2
          (static)       (nearest)         (fast, optimized)

Supported Targets: Elastic IP, EC2, ALB, NLB (public or private)

Features:

Feature	Details
Performance	Intelligent routing, lowest latency, fast regional failover
Health Checks	Failover < 1 min for unhealthy endpoints, great for DR
Security	Only 2 IPs to whitelist, DDoS protection via AWS Shield
Caching	No client cache issues (IPs never change)

Endpoint Weights & Traffic Dial:

Endpoint weights: distribute traffic % between endpoints in same group (0-255)
Traffic dial: % of traffic to send to an endpoint group (0-100%)
Use case: Blue/green deployments, gradual rollouts

Global Accelerator vs CloudFront

Feature	CloudFront	Global Accelerator
Content	Cacheable + dynamic content	TCP/UDP applications
Caching	✅ At edge	❌ No caching (proxies packets)
Use cases	Images, videos, APIs, websites	Gaming (UDP), IoT (MQTT), VoIP
Static IPs	❌	✅ 2 Anycast IPs
Failover	TTL-based	< 1 min (health checks)

⚠️ Exam traps:

“Non-HTTP” (gaming, IoT, VoIP) → Global Accelerator
“Static IP required” → Global Accelerator
“Fast regional failover” → Global Accelerator
“Cache at edge” → CloudFront
“Static IP + host-based routing + global” → Global Accelerator + ALB

Global Accelerator vs ELB vs Route 53

Service	Scope	Routing Level	Health Checks	Use Case
ELB (ALB/NLB)	Single region	Layer 4/7	✅ Targets	Distribute traffic across instances in 1 region
Route 53	Global (DNS)	DNS level	✅ Endpoints	DNS-based routing (latency, geo, failover)
Global Accelerator	Global (network)	Network level	✅ Endpoints	Fast global routing via AWS backbone

Scenario-Based Selection:

Scenario	Answer	Why
Distribute traffic in 1 region	ELB	Regional load balancing
Route users to nearest region via DNS	Route 53 (latency routing)	DNS resolves to closest endpoint
Instant failover across regions (<1 min)	Global Accelerator	Network-level, no DNS TTL delay
Need static IPs for global app	Global Accelerator	2 Anycast IPs
Non-HTTP (gaming, IoT, VoIP)	Global Accelerator	TCP/UDP support
Cost-sensitive global routing	Route 53	Cheaper, but slower failover (DNS TTL)

Failover Speed:

Route 53:         DNS TTL (30s - 5min+) before clients see change
Global Accel:     < 1 minute (health check driven, no DNS caching)

⚠️ Exam traps:

“Fastest failover” → Global Accelerator (not Route 53)
“DNS-based routing” → Route 53
“Static IP + global” → Global Accelerator
“Single region balancing” → ELB

⚠️ Exam trap — Blue-green deployment + DNS caching + tight deadline:

Route 53 weighted routing seems logical but mobile devices cache DNS → users won’t see new deployment for hours/days
Global Accelerator uses static Anycast IPs (no DNS change) → adjust endpoint weights to shift traffic instantly, no client caching issue
ELB = single region only, can’t do cross-deployment traffic splitting
CodeDeploy = deployment tool, doesn’t control traffic routing at network level
Key trigger words: “DNS caching”, “mobile phones”, “tight timeframe”, “blue-green global” → Global Accelerator

CloudFront Field-Level Encryption

Encrypt sensitive fields at edge (e.g., credit card numbers)
Data stays encrypted through entire request flow → only your app can decrypt
Uses asymmetric encryption (public key at edge, private key in app)

⚠️ Exam trap: “Encrypt specific form fields at edge” → Field-Level Encryption

Different from HTTPS (encrypts entire payload, not specific fields)

Quick Reference: Service Comparison Matrix

Scenario	CloudFront	Global Accelerator	Route 53	ELB
Cache static content	✅	❌	❌	❌
Non-HTTP (gaming, IoT)	❌	✅	❌	NLB only
Static IPs	❌	✅	❌	NLB only
Fastest failover (<1 min)	❌	✅	❌ (TTL)	❌
DNS-based routing	❌	❌	✅	❌
Single region balancing	❌	❌	❌	✅
Edge compute (Lambda)	✅	❌	❌	❌
Origin failover	✅ (Origin Groups)	✅	✅	❌
WebSocket support	✅	✅	N/A	✅ (ALB)

Decision Tree:

Question	Yes →
Need to cache content at edge?	CloudFront
Non-HTTP protocol (UDP, TCP raw)?	Global Accelerator
Need static IPs for whitelisting?	Global Accelerator (or NLB)
Need <1 min failover globally?	Global Accelerator
DNS-level routing (geo, latency)?	Route 53
Load balance within 1 region only?	ELB
Run code at edge locations?	CloudFront (Functions/Lambda@Edge)

🎯 MASTER SUMMARY: CloudFront & Global Accelerator Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Global vs Regional — Know Where Services Live

Understanding which services are global vs regional is fundamental for certificate placement, data residency, and cross-region patterns.

Always Global (no region selection):

IAM, Route 53, CloudFront, Global Accelerator, WAF (for CloudFront), Organizations

Regional with Multi-Region Options:

DynamoDB (Global Tables), Aurora (Global Database), KMS (Multi-Region Keys), Secrets Manager

Regional Only (no cross-region):

CloudHSM, VPC, EC2, ELB

Derive: “Where does TLS terminate?” = where certificate must be

CloudFront terminates TLS → cert in us-east-1
Regional service terminates TLS → cert in same region

Principle 2: CloudFront = Content, Global Accelerator = Connections

Two different problems, two different solutions:

CloudFront: Cache content at edge → reduce latency for static/dynamic content (HTTP/HTTPS)
Global Accelerator: Route connections through AWS backbone → reduce latency for any TCP/UDP traffic

CloudFront caches. Global Accelerator proxies (no caching).

Principle 2: Edge Locations = AWS’s Global Presence

Both services use AWS’s 400+ edge locations worldwide:

CloudFront: Caches content at edge, serves from nearest location
Global Accelerator: Entry point to AWS private network, routes to your endpoints

Edge = closer to users = lower latency.

Principle 3: Anycast vs Unicast IPs

Unicast: One IP = one server (traditional)
Anycast: One IP = many servers, routed to nearest (Global Accelerator magic)

Global Accelerator gives you 2 static Anycast IPs → users connect to same IPs worldwide, routed to nearest edge.

Principle 4: TTL = Cache Control

CloudFront caches based on TTL (Time To Live):

High TTL = better cache hit ratio, but stale content risk
Low TTL = fresher content, but more origin requests
Invalidation = force refresh, bypass TTL

Origin updates don’t propagate until TTL expires (or you invalidate).

Principle 5: Behaviors = Path-Based Routing

CloudFront Behaviors let you:

Route different paths to different origins (/api/* → ALB, /images/* → S3)
Apply different cache policies per path
Set different viewer protocols (HTTP, HTTPS, redirect)

More specific path patterns take precedence.

Principle 6: Origin Access Control (OAC) = S3 Security

OAC ensures only CloudFront can access your S3 bucket:

Create OAC in CloudFront
Update S3 bucket policy to allow only CloudFront
Users can’t bypass CloudFront to access S3 directly

OAI (Origin Access Identity) is legacy → use OAC.

Principle 7: Edge Compute = CloudFront Functions vs Lambda@Edge

Two options for running code at edge:

CloudFront Functions: Lightweight, JavaScript, <1ms, millions req/sec, viewer triggers only
Lambda@Edge: Full Lambda, Node.js/Python, up to 10s, network access, all triggers

Simple = CloudFront Functions. Complex = Lambda@Edge.

Principle 8: Failover Speed Matters

Different services, different failover speeds:

Route 53: DNS TTL delay (30 sec to minutes)
Global Accelerator: <1 minute (health check driven, no DNS caching)
CloudFront Origin Groups: Automatic on origin error

Need instant failover? → Global Accelerator.

Part 2: Decision Tree (Follow Keywords → Find Answer)

Step 0: Is the service Global or Regional?

Which service?
│
├─► IAM, Route 53, CloudFront, Global Accelerator
│   └─► GLOBAL (no region, but CloudFront certs in us-east-1)
│
├─► DynamoDB, Aurora, KMS, Secrets Manager
│   └─► REGIONAL but has MULTI-REGION options
│
├─► CloudHSM
│   └─► REGIONAL ONLY (no cross-region!)
│
└─► EC2, ELB, VPC, Lambda, API Gateway
    └─► REGIONAL (deploy per region)

Step 1: HTTP or non-HTTP?

                    What protocol?
                          │
            ┌─────────────┴─────────────┐
            ▼                           ▼
      HTTP/HTTPS                   TCP/UDP (non-HTTP)
            │                           │
            ▼                           ▼
    Need caching?              Global Accelerator
            │                   (gaming, IoT, VoIP)
     ┌──────┴──────┐
     ▼             ▼
    Yes            No
     │             │
     ▼             ▼
CloudFront    Global Accelerator
              (if static IPs needed)

Step 2: What’s the main requirement?

                    What's the goal?
                          │
    ┌──────────┬──────────┼──────────┬──────────┐
    ▼          ▼          ▼          ▼          ▼
  Cache     Static     Fast      Edge      Block
 Content     IPs     Failover   Compute   Country
    │          │          │          │          │
    ▼          ▼          ▼          ▼          ▼
CloudFront  Global    Global    CloudFront  CloudFront
           Accel     Accel     Functions/   Geo
                               Lambda@Edge  Restriction

Step 3: Feature-Based Decision Table

If question mentions…	Answer is…
“cache at edge”	CloudFront
“static content” / “CDN”	CloudFront
“gaming” / “UDP”	Global Accelerator
“IoT” / “MQTT”	Global Accelerator
“VoIP” / “real-time”	Global Accelerator
“static IP” / “whitelist IP”	Global Accelerator
“fast failover” / “<1 min failover”	Global Accelerator
“origin failover” / “HA for origin”	CloudFront Origin Groups
“restrict S3 to CloudFront only”	OAC + S3 Bucket Policy
“private content” / “authenticated access”	Signed URL/Cookie
“block by country”	Geo Restriction
“different cache per path”	Behaviors
“redirect HTTP to HTTPS”	Viewer Protocol Policy
“force cache refresh”	Invalidation
“run code at edge”	CloudFront Functions or Lambda@Edge
“lightweight edge compute”	CloudFront Functions
“complex edge compute” / “DB lookup”	Lambda@Edge
“encrypt specific fields”	Field-Level Encryption
“reduce CloudFront costs”	Price Class 100/200
“SSL certificate for CloudFront”	ACM in us-east-1

The “NOT” Rules (Eliminate Wrong Answers Fast)

Statement	Why It’s Wrong
Global Accelerator caches content	GA proxies packets, no caching
CloudFront for UDP/gaming	CloudFront = HTTP/HTTPS only
Route 53 for instant failover	Route 53 = DNS TTL delay (not instant)
Security Groups on CloudFront	Can’t attach SGs to CloudFront
OAC for non-S3 origins	OAC is S3-only; use auth headers for ALB/custom
CloudFront Functions for DB access	No network access — use Lambda@Edge
CloudFront Functions at origin triggers	Viewer triggers only — use Lambda@Edge
Signed URL for multiple files	Use Signed Cookie for multiple files
Global Accelerator for caching	No caching — use CloudFront

The “CANNOT” List

Cannot…	Instead…
Use CloudFront for UDP	Use Global Accelerator
Attach Security Groups to CloudFront	Use Geo Restriction, WAF, or Signed URLs
Use OAI (deprecated)	Use OAC (Origin Access Control)
Run CloudFront Functions at origin	Use Lambda@Edge for origin triggers
Access network in CloudFront Functions	Use Lambda@Edge
Use CloudFront cert from other regions	ACM certificate must be in us-east-1
Get static IPs from CloudFront	Use Global Accelerator for static IPs

Part 3: Scenario Pattern Recognition

Pattern: “Cache static content globally”

Keywords: CDN, cache, static files, images, videos, global distribution

Answer: CloudFront

Why: CloudFront caches at 400+ edge locations. Global Accelerator doesn’t cache.

Pattern: “Gaming / IoT / VoIP application”

Keywords: UDP, TCP, gaming, real-time, MQTT, non-HTTP

Answer: Global Accelerator

Why: CloudFront = HTTP/HTTPS only. Global Accelerator supports any TCP/UDP.

Pattern: “Need static IPs for whitelisting”

Keywords: static IP, firewall whitelist, fixed IP addresses

Answer: Global Accelerator (2 Anycast IPs)

Why: CloudFront uses dynamic IPs. Global Accelerator provides 2 static Anycast IPs.

Pattern: “Fastest possible failover”

Keywords: instant failover, <1 minute, DR, disaster recovery

Answer: Global Accelerator

Why: Route 53 = DNS TTL delay. Global Accelerator = health-check driven, <1 min.

Pattern: “Origin failover for CloudFront”

Keywords: CloudFront HA, origin fails, backup origin

Answer: CloudFront Origin Groups

Why: Primary + secondary origin. Automatic failover on 4xx/5xx errors.

Pattern: “Restrict S3 access to CloudFront only”

Keywords: S3 only via CloudFront, prevent direct S3 access, secure S3 origin

Answer: OAC (Origin Access Control) + S3 Bucket Policy

Why: OAC creates CloudFront identity. S3 policy allows only that identity.

Pattern: “Private content via CloudFront”

Keywords: authenticated users, premium content, temporary access

Answer: Signed URL (1 file) or Signed Cookie (multiple files)

Why: Signed URLs/Cookies include expiration, IP restrictions, trusted signers.

Pattern: “Different settings for different paths”

Keywords: /api/, /images/, path-based, different cache, different origin

Answer: CloudFront Behaviors

Why: Each behavior = path pattern + origin + cache policy + settings.

Pattern: “Block users by country”

Keywords: geo blocking, country restriction, copyright, regional licensing

Answer: CloudFront Geo Restriction

Why: Allowlist or blocklist countries. Based on Geo-IP database.

Pattern: “Run code at edge (simple)”

Keywords: URL rewrite, header manipulation, JWT validation, lightweight

Answer: CloudFront Functions

Why: <1ms execution, JavaScript, millions req/sec, 1/6 cost of Lambda@Edge.

Pattern: “Run code at edge (complex)”

Keywords: database lookup, external API call, image resize, origin trigger

Answer: Lambda@Edge

Why: Up to 10s execution, network access, Node.js/Python, all 4 triggers.

Pattern: “Force cache refresh after update”

Keywords: stale content, cache not updating, force refresh

Answer: CloudFront Invalidation

Why: Bypass TTL, force edge locations to fetch new content from origin.

Pattern: “Reduce CloudFront costs”

Keywords: cost optimization, cheaper CDN, reduce edge locations

Answer: Price Class 100 or 200

Why: Fewer edge locations = lower cost (but potentially higher latency for excluded regions).

Pattern: “Encrypt specific form fields”

Keywords: credit card encryption, PII at edge, field-level security

Answer: CloudFront Field-Level Encryption

Why: Encrypts specific fields at edge → stays encrypted through entire flow.

Pattern: “SSL certificate for CloudFront”

Keywords: HTTPS, custom domain, SSL/TLS certificate

Answer: ACM certificate in us-east-1

Why: CloudFront is global but requires certificates in us-east-1 region.

Part 4: Quick Reference Tables

Global vs Regional Services

Service	Scope	Certificate/Key Location	Cross-Region Option
IAM	Global	N/A	N/A (account-wide)
Route 53	Global	N/A	N/A (global DNS)
CloudFront	Global	us-east-1	N/A (already global)
Global Accelerator	Global	N/A	N/A (already global)
API Gateway (Edge)	Regional*	us-east-1	Uses CloudFront
API Gateway (Regional)	Regional	Same region	Deploy per region
Lambda	Regional	N/A	Deploy per region
Lambda@Edge	Global*	N/A	Author in us-east-1
DynamoDB	Regional	N/A	Global Tables
Aurora	Regional	N/A	Global Database
KMS	Regional	Same region	Multi-Region Keys (mrk-)
CloudHSM	Regional	Same region	❌ None!
Secrets Manager	Regional	N/A	Multi-region replication
S3	Regional	N/A	Cross-Region Replication
ALB/NLB	Regional	Same region	Use Global Accelerator

*Edge-Optimized API Gateway lives in one region but uses CloudFront for routing

CloudFront vs Global Accelerator

Feature	CloudFront	Global Accelerator
Purpose	Cache content at edge	Route traffic via AWS backbone
Protocols	HTTP/HTTPS only	Any TCP/UDP
Caching	✅ Yes	❌ No (proxies packets)
Static IPs	❌ No	✅ 2 Anycast IPs
Use cases	Websites, APIs, streaming	Gaming, IoT, VoIP
Failover	Origin Groups (TTL-based)	<1 min (health checks)
Edge compute	✅ Functions/Lambda@Edge	❌ No
DDoS protection	✅ Shield	✅ Shield

CloudFront Functions vs Lambda@Edge

Feature	CloudFront Functions	Lambda@Edge
Language	JavaScript only	Node.js, Python
Execution time	<1 ms	Up to 5-10 sec
Memory	2 MB	128-3008 MB
Scale	Millions req/sec	Thousands req/sec
Triggers	Viewer only	Viewer + Origin
Network access	❌ No	✅ Yes
Cost	1/6th of Lambda@Edge	Higher

Feature	Signed URL	Signed Cookie	S3 Pre-Signed URL
Access scope	1 file	Multiple files	1 file
Access via	CloudFront	CloudFront	Direct S3
Use case	Single download	Streaming, multi-file	Direct S3 access
Caching	✅ Yes	✅ Yes	❌ No (S3 direct)

Service Comparison: When to Use What

Scenario	Service
Cache static content globally	CloudFront
Cache + origin failover	CloudFront + Origin Groups
Non-HTTP (gaming, IoT)	Global Accelerator
Static IPs for whitelisting	Global Accelerator
Fastest failover (<1 min)	Global Accelerator
DNS-based routing	Route 53
Single region load balancing	ELB (ALB/NLB)
Edge compute (simple)	CloudFront Functions
Edge compute (complex)	Lambda@Edge

Failover Speed Comparison

Service	Failover Speed	Mechanism
Route 53	DNS TTL (30s - 5min+)	DNS resolution
Global Accelerator	<1 minute	Health checks, network-level
CloudFront Origin Groups	Immediate on error	Origin error triggers
ELB	Seconds	Target health checks

CloudFront Origin Types

Origin Type	Use Case	Security
S3 Bucket	Static files	OAC + Bucket Policy
S3 Website	Static website	Public bucket or signed URLs
ALB	Dynamic content	Security Group, custom headers
VPC Origin	Private resources	No public exposure needed
Custom HTTP	Any HTTP server	Auth headers, IP whitelist

Key Numbers to Remember

Item	Value
Edge locations	400+ globally
Global Accelerator static IPs	2 Anycast IPs
CloudFront Functions execution	<1 ms
Lambda@Edge max execution	5-10 seconds
CloudFront Functions memory	2 MB
Lambda@Edge max memory	3008 MB
ACM certificate region	us-east-1 (required)
Global Accelerator failover	<1 minute

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“global service” / “no region”	IAM, Route 53, CloudFront, Global Accelerator
“multi-region encryption”	KMS Multi-Region Keys (mrk-)
“multi-region database (NoSQL)”	DynamoDB Global Tables
“multi-region database (SQL)”	Aurora Global Database
“CloudHSM multi-region”	IMPOSSIBLE (single-region only)
“Lambda@Edge region”	Author in us-east-1
“Edge-Optimized API cert”	us-east-1
“Regional API cert”	Same region as API
“cache at edge” / “CDN”	CloudFront
“static content globally”	CloudFront
“gaming” / “UDP” / “IoT”	Global Accelerator
“VoIP” / “real-time TCP”	Global Accelerator
“static IP” / “whitelist”	Global Accelerator
“<1 min failover”	Global Accelerator
“origin failover”	CloudFront Origin Groups
“S3 only via CloudFront”	OAC + Bucket Policy
“OAI”	Legacy → use OAC
“private content” (1 file)	Signed URL
“private content” (many files)	Signed Cookie
“block by country”	Geo Restriction
“path-based settings”	Behaviors
“HTTP → HTTPS”	Viewer Protocol Policy
“stale cache” / “force refresh”	Invalidation
“simple edge code”	CloudFront Functions
“complex edge code” / “DB”	Lambda@Edge
“viewer triggers only”	CloudFront Functions (or Lambda@Edge)
“origin triggers”	Lambda@Edge only
“encrypt form fields”	Field-Level Encryption
“reduce CloudFront cost”	Price Class 100/200
“CloudFront SSL cert”	ACM in us-east-1
“no caching, just faster”	Global Accelerator
“DNS routing”	Route 53 (not CloudFront/GA)
“single region LB”	ELB (not CloudFront/GA)

Part 6: Elimination Checklist

When stuck between options, eliminate systematically:

□ Is it HTTP/HTTPS?
  → Yes = CloudFront or Global Accelerator
  → No (UDP, raw TCP) = Global Accelerator only

□ Do they need CACHING?
  → Yes = CloudFront
  → No = Global Accelerator (or neither)

□ Do they need STATIC IPs?
  → Yes = Global Accelerator (or NLB)
  → No = CloudFront is fine

□ What's the FAILOVER requirement?
  → Instant (<1 min) = Global Accelerator
  → DNS-based = Route 53
  → Origin failover = CloudFront Origin Groups

□ Is it about EDGE COMPUTE?
  → Simple (headers, rewrites) = CloudFront Functions
  → Complex (network, DB) = Lambda@Edge
  → Origin triggers = Lambda@Edge only

□ Is it about PRIVATE CONTENT?
  → 1 file = Signed URL
  → Multiple files = Signed Cookie
  → Direct S3 = S3 Pre-Signed URL

□ Is it about S3 ORIGIN SECURITY?
  → Restrict to CloudFront = OAC + Bucket Policy
  → OAI mentioned = legacy, use OAC

□ Is it about COUNTRY RESTRICTION?
  → Block/allow by country = Geo Restriction
  → Not Security Groups (can't attach to CF)

□ What REGION for SSL cert?
  → CloudFront = us-east-1 (always)
  → ALB = same region as ALB

🏆 The Golden Rules

Global services = IAM, Route 53, CloudFront, Global Accelerator — no region selection
CloudHSM = regional ONLY — no cross-region replication (unlike KMS Multi-Region)
CloudFront cert MUST be in us-east-1 — even if origin elsewhere
Lambda@Edge authored in us-east-1 — CloudFront replicates globally
Edge-Optimized API Gateway cert in us-east-1 — uses CloudFront behind scenes
Regional API Gateway cert in same region — no CloudFront involved
CloudFront = caching, Global Accelerator = routing — different purposes
Non-HTTP (gaming, IoT, VoIP) = Global Accelerator — CloudFront is HTTP only
Static IPs = Global Accelerator — 2 Anycast IPs
Fastest failover = Global Accelerator — <1 min, no DNS TTL delay
Origin failover = Origin Groups — primary + secondary origin
OAC replaces OAI — use OAC for S3 origin security
Signed URL = 1 file, Signed Cookie = many files
CloudFront Functions = lightweight, Lambda@Edge = complex
Origin triggers = Lambda@Edge only — CloudFront Functions = viewer only
Price Class = cost control — fewer regions = lower cost
Invalidation = force refresh — bypass TTL for immediate updates
Global Accelerator doesn’t cache — it proxies packets through AWS backbone

AWS EC2 (Elastic Compute Cloud)

EC2 (Elastic Compute Cloud) is virtual computer (instance) in the cloud.
EC2 consists:

Renting virtual machine (EC2);
Storing data on virtual machine (EBS);
Distributing load across machines (ELB);
Scaling the services using an auto-scaling group (ASG).

EC2 configuration options:

Operating System (OS): Linux, Windows or Mac OS;
How much compute power & cores (CPU);
How much random-access memory (RAM);
How much storage space:
- Network-attached (EBS & EFS);
- Hardware (EC2 Instance Store).
Network card: speed of the card, Public IP address;
Bootstrap script (launching commands when a machine starts for the first time): EC2 User Data - Installing updates, Downloading files and etc.

EC2 Instance Types:

General Purpose: provides a balance of compute, memory, and networking resources:
- application servers;
- gaming servers;
- backend servers for enterprise applications;
- small and medium databases;
Compute Optimized: processing workloads, media transcoding, high-performance web servers, machine learning and etc;
- high-performance web servers;
- compute-intensive applications servers;
- dedicated gaming servers
- batch processing workloads (that require processing many transactions in a single group);
Memory Optimized: real-time processing of large unstructured in-memory data sets (preloaded in memory before running an application);
- high-performance database;
- real-time processing of a large amount of unstructured data;
Accelerated Computing: use directly on hardware accelerators, or coprocessors, which is more efficiently than is possible in software running on CPUs:
- functions with floating-point number calculations;
- graphics processing;
- data pattern matching;
Storage Optimized: designed for workloads that require high, sequential read and write access to large datasets on local storage;
- SQl and NoSQL databases;
- data warehousing applications;
- distributed file systems;
- high-frequency online transaction processing (OLTP) systems;
HPC Optimized: built to offer the best price performance for running HPC workloads at scale on AWS, for applications that benefit from high-performance processors;
- large, complex simulations;
- deep learning workloads.

T/M → General (Typical, Moderate) C → Compute (CPU) R → Memory (RAM) P/G → Accelerated (Processing/GPU) I/D → Storage (I/O, Disk)

r6i.2xlarge │││ └─── Size within the instance class ││└────── Additional capabilities (i = Intel) │└─────── Generation of hardware (6th) └──────── Family - instance class (R = Memory Optimized)

EC2 Placement group: control over the EC2 Instance placement strategy. Placement group strategies:

Cluster—clusters instances into a low-latency group in a single Availability Zone (10 Gbps bandwidth). AZ fail causes full outage;
Spread—spreads instances across underlying hardware (max 7 instances per group per AZ). Maximizes high availability via isolation;
Partition—spreads instances across many different partitions (which rely on different sets of racks) within an AZ. Scales to 100s of EC2 instances per group (Hadoop, Cassandra, Kafka).

CLUSTER (same rack)        SPREAD (diff racks)       PARTITION (diff racks)
┌─────────────────┐        ┌───┐ ┌───┐ ┌───┐        ┌─────┐ ┌─────┐ ┌─────┐
│ ┌──┐┌──┐┌──┐┌──┐│        │EC2│ │EC2│ │EC2│        │Part1│ │Part2│ │Part3│
│ │  ││  ││  ││  ││        └───┘ └───┘ └───┘        │┌──┐ │ │┌──┐ │ │┌──┐ │
│ └──┘└──┘└──┘└──┘│        Rack1 Rack2 Rack3        ││  │ │ ││  │ │ ││  │ │
└─────────────────┘        (max 7 per AZ)           │└──┘ │ │└──┘ │ │└──┘ │
  10 Gbps, 1 AZ                                     └─────┘ └─────┘ └─────┘
  Low latency              High availability        100s instances (Kafka)

⚠️ Exam trap: Cluster = same rack (low latency, high risk); Spread = different racks (max 7/AZ); Partition = different racks (100s instances, Kafka/Cassandra)

AMI (Amazon Machine Image) - customization of an EC2 instance, added ext. software.

A Public AMI: Amazon provided;
Your own AMI: you make and maintain them yourself;
An AWS Marketplace AMI: an AMI someone else made (and potentially sells).

⚠️ Exam trap: AMIs are region-specific. Cannot launch EC2 from AMI in another region — must copy AMI to target region first (creates new AMI ID)

EC2 Image Builder service to automate the creation, maintain, validate and test of Virtual Machine or container images. Can be run on schedule.

Elastic Network Interfaces (ENI): logical component in a VPC that represents a virtual network card. Bound to a specific AZ.

ENI attributes:

Primary private IPv4, one or more secondary IPv4;
One Elastic IP (IPv4) per private IPv4;
One Public IPv4;
One or more security groups to each ENI;
MAC address to each ENI.

NOTE: You can create ENI independently and attach them on the fly (move them) on EC2 instances for failover.

Security Groups vs NACLs:

Feature	Security Groups	NACLs
Level	Instance (ENI)	Subnet
State	Stateful (return traffic auto-allowed)	Stateless (must allow both directions)
Rules	Allow only	Allow AND Deny
Rule Order	All rules evaluated	Rules processed in order (lowest first)
Default	Deny all inbound, allow all outbound	Allow all (default NACL)
Association	Assigned to instances	Assigned to subnets

┌─────────────────────────────────────────────────────────┐
│                         VPC                             │
│  ┌───────────────────────────────────────────────────┐  │
│  │           Subnet (with NACL)                      │  │
│  │  ┌─────────────────────┐  ┌─────────────────────┐ │  │
│  │  │ Security Group      │  │ Security Group      │ │  │
│  │  │  ┌───────────────┐  │  │  ┌───────────────┐  │ │  │
│  │  │  │      EC2      │  │  │  │      EC2      │  │ │  │
│  │  │  └───────────────┘  │  │  └───────────────┘  │ │  │
│  │  └─────────────────────┘  └─────────────────────┘ │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘

⚠️ Exam trap: Security Groups = stateful (allow inbound → outbound auto-allowed). NACLs = stateless (must explicitly allow both directions)

EC2 Instance Lifecycle:

Stop: data on EBS disk kept intact for next start
Terminate: root EBS volumes destroyed (unless configured otherwise)
Start behavior:
- First start: OS boots + EC2 User Data script runs
- Following starts: OS boots + applications start + caches warm up (takes time)

EC2 Hibernate:

In-memory (RAM) state is preserved → much faster boot (OS not stopped/restarted)
How it works: RAM state written to file in root EBS volume
Root EBS volume must be encrypted
Use cases: long-running processing, saving RAM state, services with slow initialization

EC2 Hibernate - Requirements & Limits:

Instance RAM: must be < 150 GB
Root volume: must be EBS, encrypted, not instance store, and large enough
Supported purchasing options: On-Demand, Reserved, and Spot Instances
NOT supported: Dedicated Hosts, bare metal instances
Max hibernation period: 60 days

⚠️ Exam trap: Root EBS must be encrypted for hibernate. Max 60 days, max 150GB RAM

EC2 Purchasing Options:

Option	Description	Savings	Use Case
On-Demand	Pay by second (Linux/Win) or hour	None	Short-term, unpredictable workloads
Reserved	1 or 3 year commitment	Up to 72%	Steady-state usage (databases)
Savings Plans	Commit to $/hour for 1-3 years	Up to 72%	Flexible across instance types
Spot	Bid on unused capacity	Up to 90%	Fault-tolerant, flexible workloads
Dedicated Hosts	Physical server for your use	-	Compliance, licensing (per-socket)
Dedicated Instances	Hardware dedicated to you	-	Compliance (no server control)
Capacity Reservations	Reserve capacity in specific AZ	None	Ensure availability, no discount

Spot Instances:

Can be interrupted with 2-minute warning
Define max spot price — terminated if current price > max
Spot Block (deprecated): 1-6 hour uninterrupted blocks
Spot Fleet: collection of Spot + On-Demand instances

⚠️ Exam trap: Spot = cheapest but can be terminated. Use for batch jobs, data analysis, CI/CD, NOT databases

Reserved Instances:

Standard RI: Up to 72% discount, can change AZ, instance size (same family)
Convertible RI: Up to 66% discount, can change instance family, OS, tenancy
Scheduled RI (deprecated): Reserve for specific time windows

⚠️ Exam trap: Reserved = commit for 1-3 years. Convertible RI = less discount but more flexibility

Connect to EC2:
ssh -i /<path>/<key_pair_name>.pem <instance_user_name>@<instance_public_dns_name/IP>
Example:
ssh -i /home/kali/Downloads/aws.pem ubuntu@51.20.123.211

🎯 MASTER SUMMARY: EC2 Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: EC2 is Just a Virtual Computer — Everything Else Solves Its Limitations

EC2 alone is just a VM. The entire ecosystem exists to solve its inherent problems:

Single point of failure → ELB (distribute load), ASG (replace failed instances)
Data loss on termination → EBS (persistent storage), EFS (shared storage)
Manual scaling → ASG (automatic scaling)
No redundancy → Placement groups, Multi-AZ deployments

Deriving answers: When you see a limitation scenario, think: “What problem is this solving?” The answer maps to the appropriate service.

Principle 2: Instance Types = Optimize for the Bottleneck

Every workload has a bottleneck. Match the instance family to it:

Bottleneck	Family	Memory Aid
Nothing specific	T/M	Typical, Moderate (General)
CPU/Processing	C	CPU = Compute
Memory/RAM	R	RAM = Memory
GPU/AI/ML	P/G	Processing/GPU
Disk/IOPS	I/D	I/O, Disk = Storage

Deriving answers: “Processing large datasets in-memory” → Bottleneck is RAM → R family. “Batch processing” → Bottleneck is CPU → C family.

Principle 3: Placement Groups Trade-off = Latency vs. Availability

You can’t have both maximum performance AND maximum availability. Choose your priority:

Priority	Strategy	Trade-off
Lowest latency	Cluster	All in one rack → rack fails = all fail
Maximum isolation	Spread	Different racks → max 7 instances per AZ
Partition isolation + scale	Partition	Different racks → 100s of instances

Deriving answers:

“10 Gbps network” or “lowest latency” → Cluster (same rack = fastest network)
“Critical application, high availability” → Spread (isolation = safety)
“Hadoop, Kafka, Cassandra” → Partition (needs both scale AND isolation)

Principle 4: AMI = Snapshot of Everything, But Region-Locked

An AMI is a complete image (OS + software + config). The trade-off: specificity vs. portability.

AMIs are region-specific — they can’t cross region boundaries without copying
Copy creates a NEW AMI ID in the target region

Deriving answers: “Launch instance in another region from existing AMI” → Must copy first (can’t use original AMI ID)

Principle 5: Security Groups vs NACLs = Instance vs. Subnet, Stateful vs. Stateless

WHY stateful matters: If you allow traffic IN, the response OUT is automatic. You don’t need to think about return traffic.

WHY stateless matters: You must explicitly allow BOTH directions. More control, more work.

Question	SG	NACL
“Where does it apply?”	Instance (ENI)	Subnet
“Can it DENY traffic?”	No (allow only)	Yes
“Do I need to allow return traffic?”	No (stateful)	Yes (stateless)
“Which is evaluated first?”	Rules are combined	Rules processed in order

Deriving answers: “Block specific IP” → NACLs (SGs can’t deny). “Allow port 443 inbound” → Both work, but SGs don’t need outbound rule.

Principle 6: Hibernate = RAM Preserved, But Has Constraints

WHY hibernate exists: Cold boots are slow because RAM is empty. Hibernate saves RAM state.

WHY encryption is required: RAM contains sensitive data — writing it to disk unencrypted = security risk.

Deriving answers:

“Reduce startup time” + “preserve state” → Hibernate
Root volume NOT encrypted? → Hibernate won’t work
RAM > 150GB? → Hibernate won’t work
Hibernation > 60 days? → Not supported

Principle 7: Purchasing Options = Trade Money for Flexibility

The fundamental trade-off: commitment = savings. More flexibility = more cost.

Most Expensive                                    Cheapest
(Most Flexible)                                   (Least Flexible)
     │                                                  │
     ▼                                                  ▼
On-Demand ─→ Capacity Res ─→ Savings Plans ─→ Reserved ─→ Spot
   100%         100%            ~72%            ~72%      ~90%

But Spot has a catch: AWS can take it back. Use only for work that can be interrupted.

The Mental Model:

“I need it now, unpredictable duration” → On-Demand
“I’ll commit to using X amount” → Reserved/Savings Plans
“I can handle interruption” → Spot
“I need the capacity but can’t commit” → Capacity Reservation (no discount)

Principle 8: Dedicated = Compliance, Not Performance

Dedicated Hosts and Dedicated Instances exist for compliance, not speed.

Need	Solution
Per-socket/per-core licensing	Dedicated Host (you see the physical server)
Regulatory: “no shared hardware”	Either works (Dedicated Instance = simpler)
Just want better performance	Neither (use instance optimization instead)

Deriving answers: “Bring your own license” or “socket-based licensing” → Dedicated Host

Principle 9: Spot Instances = Interruptible, 2-Minute Warning

Spot is AWS’s “leftover capacity” at a discount. The trade-off: they can take it back.

Critical rules:

2-minute warning before termination
Define max price — terminated if current price exceeds max
Spot Fleet = pool of Spot + On-Demand (for reliability)

Good for: Batch processing, CI/CD, data analysis, anything that can restart Bad for: Databases, user-facing apps, anything that can’t handle interruption

Principle 10: EC2 Lifecycle = What Happens to Your Data?

Action	EBS Root	Instance Store	RAM
Stop	✅ Preserved	❌ Lost	❌ Lost
Terminate	❌ Deleted (default)	❌ Lost	❌ Lost
Hibernate	✅ Preserved + RAM saved	❌ Lost	✅ Saved to EBS

Deriving answers: “Data survives restart?” → EBS only. “RAM survives?” → Hibernate only.

Part 2: Decision Tree (Follow Keywords → Find Answer)

Instance Type Selection

What's the bottleneck?
        │
        ├─→ "Nothing specific" ─────────────→ T/M (General Purpose)
        │
        ├─→ "CPU" / "batch" / "compute" ───→ C (Compute)
        │
        ├─→ "RAM" / "in-memory" / "cache" ─→ R (Memory)
        │
        ├─→ "GPU" / "ML" / "AI" ───────────→ P/G (Accelerated)
        │
        └─→ "IOPS" / "database" / "OLTP" ──→ I/D (Storage)

Placement Group Selection

What's the priority?
        │
        ├─→ "Lowest latency" / "10 Gbps" ──→ Cluster
        │
        ├─→ "High availability" / "isolation" ─→ Spread (max 7/AZ)
        │
        └─→ "Kafka" / "Hadoop" / "Cassandra" ─→ Partition

Purchasing Option Selection

Can it be interrupted?
        │
        ├─→ YES ─────────────────────────────→ Spot (90% savings)
        │
        └─→ NO
             │
             └─→ How long do you need it?
                      │
                      ├─→ "Hours/days" ──────→ On-Demand
                      │
                      ├─→ "1-3 years" ───────→ Reserved/Savings Plans
                      │
                      └─→ "Guaranteed capacity, no commit" ─→ Capacity Res

The “CANNOT” List

You CANNOT…	Because…
Launch EC2 from AMI in different region	AMIs are region-locked (copy first)
Have > 7 instances in Spread placement group (per AZ)	Spread = different rack per instance, racks limited
Hibernate with unencrypted root EBS	RAM data written to disk = security risk
Hibernate with > 150GB RAM	Storage/write time constraint
Hibernate for > 60 days	AWS limitation
Block traffic with Security Group	SGs can only ALLOW (use NACLs to deny)

Part 3: Scenario Pattern Recognition

Pattern: “Need lowest network latency between instances”

Keywords: low latency, 10 Gbps, HPC, tightly coupled Answer: Cluster placement group Why: Same rack = same network switch = lowest latency. Trade-off is single point of failure.

Pattern: “Critical application, maximize availability”

Keywords: high availability, fault tolerance, critical, isolated Answer: Spread placement group Why: Different racks = different failure domains. Limit: 7 instances per AZ.

Pattern: “Kafka/Hadoop/Cassandra with 100s of instances”

Keywords: Kafka, Hadoop, Cassandra, distributed, large scale, partitions Answer: Partition placement group Why: Partition-aware applications distribute replicas across partitions. Scales to 100s.

Pattern: “Processing large in-memory datasets”

Keywords: in-memory, real-time analytics, caching, SAP HANA Answer: Memory Optimized (R family) Why: Bottleneck is RAM. R = RAM.

Pattern: “Batch processing, video encoding”

Keywords: batch, transcoding, compute-intensive, scientific modeling Answer: Compute Optimized (C family) Why: Bottleneck is CPU. C = CPU.

Pattern: “Cost-effective, fault-tolerant workload”

Keywords: cost-effective, can tolerate interruption, batch, CI/CD Answer: Spot Instances Why: 90% savings, but can be interrupted with 2-min warning. OK for resilient workloads.

Pattern: “Steady-state database, long-term”

Keywords: database, steady, 24/7, long-term, predictable Answer: Reserved Instances Why: 72% savings for 1-3 year commitment. Databases run continuously.

Pattern: “Bring your own license (BYOL)”

Keywords: BYOL, per-socket, per-core, software license Answer: Dedicated Host Why: You need visibility into physical server (sockets/cores) for licensing.

Pattern: “Reduce startup time, preserve application state”

Keywords: fast boot, preserve RAM, reduce initialization time Answer: EC2 Hibernate Why: RAM saved to EBS, no cold boot. Must have encrypted root volume.

Pattern: “Block specific IP address”

Keywords: block IP, deny traffic, blacklist Answer: NACL (not Security Group) Why: Security Groups can only ALLOW. NACLs can DENY.

Pattern: “Launch instance in another region from existing AMI”

Keywords: cross-region, AMI, different region Answer: Copy AMI to target region first Why: AMIs are region-specific. Cannot use AMI ID from another region.

Pattern: “Compliance requires dedicated hardware, don’t need server visibility”

Keywords: compliance, dedicated, isolated hardware Answer: Dedicated Instance Why: Simpler than Dedicated Host when you don’t need socket/core visibility.

Pattern: “Need guaranteed capacity in specific AZ, no long-term commitment”

Keywords: capacity, guarantee, specific AZ, no discount needed Answer: On-Demand Capacity Reservation Why: Reserves capacity immediately. No commitment required, but no discount either.

Pattern: “Flexible commitment across instance types”

Keywords: flexible, multiple instance types, Savings Plans Answer: Compute Savings Plans Why: Commit $/hour, use across any instance type/region. More flexible than Reserved.

Part 4: Quick Reference Tables

Instance Type Families

Family	Optimized For	Use Cases	Memory Aid
T, M	Balance	Web servers, small DBs	Typical, Moderate
C	CPU	Batch, video encoding	CPU
R	RAM	In-memory DBs, caching	RAM
P, G	GPU	ML, graphics	Processing, GPU
I, D	Disk IOPS	Databases, data warehouses	I/O, Disk

Placement Group Comparison

Strategy	Same Rack?	Max Instances	Use Case
Cluster	Yes	No limit	Low latency, HPC
Spread	No	7 per AZ	High availability
Partition	No	100s	Hadoop, Kafka, Cassandra

Purchasing Options Quick Comparison

Option	Savings	Commitment	Interruption?
On-Demand	0%	None	No
Reserved	72%	1-3 years	No
Savings Plans	72%	$/hour for 1-3 years	No
Spot	90%	None	YES (2-min warning)
Dedicated Host	Varies	Optional	No
Capacity Res	0%	None	No

Hibernate Requirements

Requirement	Limit
RAM	< 150 GB
Root Volume	EBS, encrypted, large enough
Max Duration	60 days
NOT Supported	Dedicated Hosts, bare metal

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“lowest latency between instances”	Cluster placement group
“10 Gbps bandwidth”	Cluster placement group
“spread across racks”	Spread or Partition
“max 7 instances”	Spread placement group
“Kafka, Hadoop, Cassandra”	Partition placement group
“in-memory” / “real-time analytics”	R family (Memory)
“batch processing”	C family (Compute)
“video transcoding”	C family (Compute)
“GPU” / “ML training”	P/G family (Accelerated)
“high IOPS” / “OLTP”	I/D family (Storage)
“90% savings”	Spot Instances
“2-minute warning”	Spot Instances
“can be interrupted”	Spot Instances
“steady-state” + “database”	Reserved Instances
“1-3 year commitment”	Reserved Instances
“flexible across instance types”	Savings Plans
“per-socket licensing”	Dedicated Host
“BYOL”	Dedicated Host
“compliance + dedicated hardware”	Dedicated Host or Instance
“guarantee capacity” + “no commitment”	Capacity Reservation
“fast boot” / “preserve RAM”	EC2 Hibernate
“reduce startup time”	EC2 Hibernate
“encrypted root volume” + “hibernate”	Required for Hibernate
“block IP”	NACL (not SG)
“stateless”	NACL
“stateful”	Security Group
“deny traffic”	NACL
“allow only”	Security Group
“cross-region AMI”	Copy AMI first
“AMI different region”	Copy AMI first
“automate AMI creation”	EC2 Image Builder
“ENI failover”	Move ENI to standby instance

Part 6: Elimination Checklist

Choosing Instance Type

□ Is the workload CPU-bound?
  → Yes = C family
  → No = continue

□ Does it need lots of RAM?
  → Yes = R family
  → No = continue

□ Does it need GPU?
  → Yes = P/G family
  → No = continue

□ Does it need high disk IOPS?
  → Yes = I/D family
  → No = T/M (General Purpose)

Choosing Purchasing Option

□ Can workload tolerate interruption?
  → Yes = Consider Spot (90% savings)
  → No = continue

□ Is usage predictable for 1-3 years?
  → Yes = Reserved or Savings Plans
  → No = continue

□ Do you need flexibility across instance types?
  → Yes = Savings Plans
  → No = Reserved Instances

□ Short-term, unpredictable?
  → Yes = On-Demand

Choosing Placement Group

□ Need lowest possible latency?
  → Yes = Cluster
  → No = continue

□ Need maximum isolation/availability?
  → Yes = Spread (max 7/AZ)
  → No = continue

□ Running Kafka/Hadoop/Cassandra at scale?
  → Yes = Partition
  → No = No placement group needed

Hibernate Eligibility

□ Is root volume EBS and encrypted?
  → No = Hibernate NOT available
  
□ Is RAM < 150GB?
  → No = Hibernate NOT available
  
□ Is it Dedicated Host or bare metal?
  → Yes = Hibernate NOT available
  
□ All above passed?
  → Hibernate available

🏆 The Golden Rules

Instance family = bottleneck (C=CPU, R=RAM, I/D=Disk, P/G=GPU)
Cluster = same rack = fastest but risky (one failure = all fail)
Spread = different racks = safest (max 7 instances per AZ)
Partition = different racks + scale (Kafka, Hadoop, Cassandra)
AMIs are region-locked (copy to use in another region)
Security Groups = allow only, stateful (NACLs = allow+deny, stateless)
Hibernate = RAM preserved (needs encrypted EBS, <150GB RAM, max 60 days)
Spot = cheapest but interruptible (90% savings, 2-min warning)
Reserved = commit for savings (72%, 1-3 years)
Dedicated Host = you see the server (for BYOL, socket licensing)
NACL to DENY, SG to ALLOW (SGs can’t block specific IPs)
Savings Plans = flexible Reserved (commit $/hour, not instance type)

Storage:

Amazon EC2 Instance store provides temporary block-level storage for your instance. This storage is located on disks that are physically attached to the host computer. Instance store is ideal for temporary storage of information that changes frequently, such as buffers, caches, scratch data, and other temporary content. It can also be used to store temporary data that you replicate across a fleet of instances, such as a load-balanced pool of web servers.

Amazon EBS (Elastic Block Store) volume is a block-level storage, it’s a network drive, not physical drive - uses the network to communicate the instance, has a bit of latency. EBS has a provisioned capacity (size in GBs, and IOPS) that can be increased over time (billed by provisioned, not used).

Can be attached to instances quickly and while they run. It allows instances to persist data, even after termination;
Can be mounted only to one instance at a time. And locked to a specific availability zone. (To move to another AZ use EBS Snapshot).

EBS Delete on Termination attribute controls the EBS behaviour when an EC2 instance terminates. Root EBS volme is going to be deleted by default, any other attached EBS volume will get disabled Termination attribute.

EBS Snapshot backup (snapshot) of your EBS volume at a point in time. Not necessary to detach volume, but recommended. Snapshots consume IO - avoid during high traffic. Possible to copy snapshots across AZ or Regions. EBS Snapshot Archive EBS Snapshots could be moved to Archive (that is 75% cheaper, but it takes within 24 to 72 hours for restoring the archive).* Recycle Bin rules to retain deleted snapshots to recover them after an accidental deletion (from 1 day to 1 year) Fast Snapshot Restore (FSR) - eliminates latency on first use of EBS volume created from snapshot by pre-initializing all data blocks. Without FSR, volumes load data lazily from S3 causing performance penalty until “warmed up”. Enabled per snapshot per AZ. Expensive - use for critical workloads needing immediate full performance (databases, time-sensitive apps).

EBS Snapshot - Cross-Region & Encryption Flow:

┌─────────┐   snapshot   ┌───────────┐   copy      ┌───────────┐
│   EBS   │ ───────────→ │  Snapshot │ ─────────→  │  Snapshot │
│ (AZ-A)  │              │ (Region A)│  (encrypt)  │ (Region B)│
└─────────┘              └─────┬─────┘             └─────┬─────┘
                               │ restore                 │ restore
                               ▼                         ▼
                         ┌─────────┐               ┌─────────┐
                         │   EBS   │               │   EBS   │
                         │ (AZ-A)  │               │ (AZ-X)  │
                         └─────────┘               └─────────┘

Snapshots are stored in S3 (region-level, not AZ-locked)
Can copy to another region for DR
Can enable encryption during copy (encrypt unencrypted volume)

Local EC2 Instance Store a high-performance hardware disk (better I/O performance than network drives - EBS volumes). Good for buffer / cache / scratch data / temporary content (Risk of data loss if hardware fails). Backups and Replication are your responsibility

⚠️ Exam trap: Instance Store = ephemeral (data lost on stop/terminate). Best I/O performance but no persistence

Instance Store Limitations:

Ephemeral: data lost on stop/terminate/hardware failure;
Cannot detach/reattach: tied to specific instance;
No snapshots: manual backups required;
Fixed size: varies by instance type, cannot resize (e.g., i3.large: 475 GB, i4i.32xlarge: 30 TB);
Cannot add after launch: must be specified at instance creation.

EBS Volume Types

General Purpose SSD (1 GiB - 16 TiB): General purpose SSD volume that balances price (cost effective storage) and performance (low-latency) for a wide variety of workloads: System boot volumes, Virtual desktops, Development and test environments;
- gp2 (older): 3,000 - 16,000 IOPS (linked to size - 3 IOPS per GB, means max IOPS at 5,334 GB);
- gp3 (newer): 3,000 - 16,000 IOPS, 125 - 1000 MiB/s (independent from IOPS size);

⚠️ Exam trap: gp2 IOPS linked to size (3 IOPS/GB); gp3 IOPS independent — know the difference!

Provisioned IOPS (PIOPS) SSD - Highest-performance SSD volume for mission-critical low-latency or high-throughput workloads: System boot volumes, databases workloads. Supports EBS Multi-attach; - io1: 4 GiB - 16 TiB, up to 64,000 IOPS (linked to size - 50 IOPS per 1 GiB, max IOPS at 1,280 GB); - io2 (higher durability - 99.999%): 4 GiB - 16 TiB, up to 64,000 IOPS (linked to size - 50 IOPS per 1 GiB, max IOPS at 1,280 GB); - io2 Block Express (sub-ms latency): 4 GiB - 64 TiB, up to 256,000 IOPS, (linked to size - 50 IOPS per 1 GiB, max IOPS at 1,280 GB);

EBS Multi-Attach: Achieve higher application availability in clustered Linux applications (ex: Teradata) by connecting the same EBS volume to multiple (up to 16) EC2 Instances at a time. Must be in the same AZ and only cluster-aware (GFS2, OCFS2, and NOT EXT4/XFS) file system is supported.

Feature	gp2	gp3	io1	io2	io2 Block Express
Type	General Purpose SSD	General Purpose SSD	Provisioned IOPS SSD	Provisioned IOPS SSD	Provisioned IOPS SSD
Size	1 GiB - 16 TiB	1 GiB - 16 TiB	4 GiB - 16 TiB	4 GiB - 16 TiB	4 GiB - 64 TiB
Max IOPS	16,000	16,000	64,000*	64,000*	256,000
Baseline IOPS	3 IOPS/GiB (min 100)	3,000	Provisioned	Provisioned	Provisioned
IOPS:GiB Ratio	3:1 (linked)	Independent	50:1	500:1	1,000:1
Max Throughput	250 MiB/s	1,000 MiB/s	1,000 MiB/s	1,000 MiB/s	4,000 MiB/s
Durability	99.8% - 99.9%	99.8% - 99.9%	99.8% - 99.9%	99.999%	99.999%
Latency	Single-digit ms	Single-digit ms	Single-digit ms	Single-digit ms	Sub-millisecond
Boot Volume	✅ Yes	✅ Yes	✅ Yes	✅ Yes	✅ Yes
Multi-Attach*	❌ No	❌ No	✅ Yes	✅ Yes	✅ Yes
Use Case	Dev/test, boot volumes	General workloads	Databases, critical apps	Databases, critical apps	Highest performance

*64,000 IOPS on Nitro instances, 32,000 on others

HDD Volume Types (Cannot be boot volumes)

st1 (HDD): 125 GiB - 16 TiB, up to 500 IOPS, max 500 MiB/s (independent from IOPS size); Low cost HDD volume designed for frequently accessed, throughput-intensive workloads: Big Data, Data Warehouses, Log Processing;
sc1 (HDD): 125 GiB - 16 TiB, up to 250 IOPS, max to 250 MiB/s (independent from IOPS size); Lowest cost HDD volume designed for less frequently accessed workloads: Infrequent access, lowest cost;

Feature	st1 (Throughput Optimized)	sc1 (Cold HDD)
Size	125 GiB - 16 TiB	125 GiB - 16 TiB
Max Throughput	500 MiB/s	250 MiB/s
Max IOPS	500	250
Boot Volume	❌ No	❌ No
Use Case	Big Data, Data Warehouses, Log Processing	Infrequent access, lowest cost
Cost	Low	Lowest

⚠️ Exam trap: HDD (st1/sc1) cannot be boot volumes. Only SSD (gp2/gp3/io1/io2) can boot

EBS Encryption: Fully managed, transparent encryption using KMS (AES-256) with minimal latency impact. Encrypts:

Data at rest inside the volume;
Data in flight between instance and volume;
All snapshots created from the volume;
All volumes created from encrypted snapshots.

*Encrypt unencrypted EBS volume: Create snapshot → Copy snapshot with encryption enabled → Create volume from encrypted snapshot → Attach to instance.

Amazon EFS (Elastic File System) - managed NFS that can be mounted on many EC2 instances and on-premises (multi-AZ). Highly available, auto-scaling (petabytes, no capacity planning), expensive (~3x gp2 cost, pay-per-use). Use cases: content management, web serving, data sharing, WordPress.

NFSv4.1 protocol; access controlled via security groups;
Linux only (POSIX file system); encryption at rest using KMS.

⚠️ Exam trap: EFS = Linux only (POSIX). Performance Mode cannot be changed after creation; Throughput Mode can

EFS Performance & Throughput Modes:

Performance Mode (Set at creation, CAN NOT be changed later):
- General Purpose (default): latency-sensitive (web server, CMS);
- Max I/O: higher latency, higher throughput, highly parallel (big data, media processing);
Throughput Mode (CAN be changed anytime):
- Bursting: 1 TB = 50 MiB/s + burst up to 100 MiB/s;
- Provisioned: set throughput regardless of size (e.g., 1 GiB/s for 1 TB);
- Elastic: auto-scales (up to 3 GiB/s reads, 1 GiB/s writes) - for unpredictable workloads.

EFS Storage Classes (lifecycle policies move files after N days):

Storage Tiers:
- Standard: frequently accessed files;
- Infrequent Access (EFS-IA): lower storage cost, retrieval fee (up to 92% cheaper);
- Archive: rarely accessed (few times/year), 50% cheaper than IA;
Availability:
- Standard: Multi-AZ for production;
- One Zone: single AZ, for dev (90%+ cost savings), backup enabled by default.

Amazon FSx for Windows File Server fully managed, highly reliable and scalable Windows native shared file system based on SMB protocol and Windows NTFS (Integrated with Microsoft Active Directory).

Amazon FSx for Lustre (Linux cluster) a fully managed high-performance, scalable file system for High Performance Computing (HPC): machine learning, analytics, video processing and financial modeling.

Instance Store vs EBS vs EFS:

Feature	Instance Store	EBS	EFS
Type	Block storage (local)	Block storage (network)	File storage (NFS)
Instances	1	1 (except io1/io2 Multi-Attach)	100s across AZs
AZ	Locked to instance	Locked to one AZ	Multi-AZ (regional)
Persistence	Ephemeral (lost on stop)	Persists independently	Persists independently
Performance	Best (hardware attached)	Good, network latency	Good, higher latency
OS	Linux & Windows	Linux & Windows	Linux only (POSIX)
Cost	Included with instance	Provisioned capacity	~3x EBS, pay-per-use
Use case	Cache, temp data	Boot volumes, databases	Shared files, WordPress

┌─────────────┐     ┌─────────────┐     ┌─────────────────────────┐
│ Instance    │     │    EBS      │     │          EFS            │
│   Store     │     │  (Network)  │     │     (Multi-AZ NFS)      │
├─────────────┤     ├─────────────┤     ├─────────────────────────┤
│ ┌─────────┐ │     │   ┌─────┐   │     │   ┌───┐ ┌───┐ ┌───┐    │
│ │   EC2   │ │     │   │ EC2 │   │     │   │EC2│ │EC2│ │EC2│    │
│ │ ┌─────┐ │ │     │   └──┬──┘   │     │   └─┬─┘ └─┬─┘ └─┬─┘    │
│ │ │Disk │ │ │     │      │      │     │     └─────┼─────┘      │
│ │ └─────┘ │ │     │   ┌──┴──┐   │     │       ┌───┴───┐        │
│ └─────────┘ │     │   │ EBS │   │     │       │  EFS  │        │
└─────────────┘     │   └─────┘   │     └───────┴───────┴────────┘
  Ephemeral         └─────────────┘         Shared across AZs
  Best I/O            Single AZ             Linux only, pay-per-use

AWS Storage Gateway: bridge between on-premise data and cloud data in S3, hybrid storage service to allow on-premise to seamlessly use the AWS Cloud. Use cases: disaster recovery, backup & restore, tiered storage.

Types of Storage Gateway:

File Gateway:
- Amazon S3: a file interface that enables you to store files as objects in Amazon S3 using the industry-standard NFS and SMB file protocols;
- Amazon FSx fully managed, highly reliable, and scalable file shares in the cloud using the industry-standard NFS and SMB protocols;
Volume Gateway: applications’ block storage volumes using the iSCSI protocol. Data written to these volumes can be asynchronously backed up as point-in-time snapshots of your volumes, and stored in the cloud as Amazon EBS snapshots. It’s possible to back up on-premises Volume Gateway volumes using the service’s native snapshot scheduler or by using the AWS Backup service;
Tape Gateway: an iSCSI-based virtual tape library (VTL) of virtual tape drives.

        On-Premises                              AWS Cloud
┌─────────────────────────┐              ┌─────────────────────────┐
│                         │              │                         │
│  ┌───────────────────┐  │              │  ┌─────────────────┐    │
│  │   File Gateway    │──┼──────────────┼─→│   S3 / FSx      │    │
│  └───────────────────┘  │   NFS/SMB    │  └─────────────────┘    │
│                         │              │                         │
│  ┌───────────────────┐  │              │  ┌─────────────────┐    │
│  │  Volume Gateway   │──┼──────────────┼─→│  EBS Snapshots  │    │
│  └───────────────────┘  │   iSCSI      │  └─────────────────┘    │
│                         │              │                         │
│  ┌───────────────────┐  │              │  ┌─────────────────┐    │
│  │   Tape Gateway    │──┼──────────────┼─→│ S3 Glacier/Deep │    │
│  └───────────────────┘  │   VTL        │  └─────────────────┘    │
└─────────────────────────┘              └─────────────────────────┘

⚠️ Exam trap: File Gateway = S3/FSx (NFS/SMB); Volume Gateway = EBS snapshots (iSCSI); Tape Gateway = Glacier (VTL)

AWS S3:

S3 (Simple Storage Service) provides object storage through a web service interface — “infinitely scaling” storage.
Amazon S3: allows to store objects (files) in ‘buckets’ (directories).
Amazon S3 offers unlimited storage space. The maximum file size for an object in Amazon S3 is 5 TB.

Use Cases: Backup/storage, Disaster Recovery, Archive, Hybrid Cloud storage, Media hosting, Data lakes & big data analytics, Static websites, Software delivery

Buckets:

Must have globally unique name (across all regions, all accounts)
Defined at the region level (looks global in console, but created in a region)

⚠️ Exam trap: “Can’t create bucket” + correct IAM permissions → name already taken globally

Naming convention:

No uppercase, No underscore;
3-63 characters long;
Not an IP;
Must start with lowercase letter or number;
Must NOT start with the prefix xn–;
Must NOT end with the suffix -s3alias.

Objects have a Key, which is a full path to them (s3://<bucket_name>/<folder_name>/<file-name>). Max size of an Object is 5TB (5000GB), if uploading more than 5GB, should be used “multi-part upload”.

No real directories — just keys with slashes / (UI tricks you)
Metadata: system/user key-value pairs
Tags: up to 10 Unicode key-value pairs (useful for security/lifecycle)
Version ID: if versioning enabled

S3 Consistency Model:

Since Dec 2020, S3 is strongly consistent for all operations (read-after-write)
PUT overwrite → immediate read returns latest version (not old data)
DELETE → immediate read returns 404 (not stale object)
No eventual consistency, no extra cost — built-in

⚠️ Exam trap: “Overwrite object, immediately read” → S3 always returns the latest version. Old “eventual consistency” behavior is gone. Distractors mentioning “might return previous data” or “might return new data” are wrong.

Amazon S3 Versioning protects against unintended deletes. It is enabled at the bucket level.

Same key overwrite increments version: 1, 2, 3…
Files before versioning enabled have version “null”
Suspending versioning does not delete previous versions
Easy rollback to any previous version

Amazon S3 Replication:

Cross-Region Replication (CRR) — compliance, lower latency, cross-account
Same-Region Replication (SRR) — log aggregation, prod-to-test sync
Must enable Versioning in source AND destination
Buckets can be in different AWS accounts
Copying is asynchronous
Only new objects replicated after enabling (use S3 Batch Replication for existing)
DELETE: delete markers can be replicated (optional), version ID deletes are NOT replicated
No chaining: bucket1 → bucket2 → bucket3 won’t replicate bucket1 objects to bucket3

┌───────────┐
│ S3 Bucket │ (eu-west-1)
└─────┬─────┘
      │ asynchronous
      │ replication
      ▼
┌───────────┐
│ S3 Bucket │ (us-east-2)
└───────────┘

S3 Security:

User-Based: IAM Policies (which API calls allowed for specific user)
Resource-Based:
- Bucket Policies — bucket-wide JSON rules, allows cross-account
- Object ACL — finer grain (can be disabled)
- Bucket ACL — less common (can be disabled)

S3 Access Scenarios:

Scenario	Use
IAM User → S3	IAM Policy attached to user
EC2 Instance → S3	IAM Role attached to EC2
Cross-Account → S3	Bucket Policy (resource-based)
Public/Anonymous → S3	Bucket Policy with `Principal: "*"`

1. IAM User Access          2. EC2 Instance Access       3. Cross-Account Access
   ┌──────────┐                ┌──────────┐                 ┌──────────┐
   │IAM Policy│                │ IAM Role │                 │  Bucket  │
   └────┬─────┘                └────┬─────┘                 │  Policy  │
        │                           │                       └────┬─────┘
   ┌────▼─────┐                ┌────▼─────┐                      ▼
   │ IAM User │───────────────▶│   EC2    │─────────────▶  ┌───────────┐
   └──────────┘                └──────────┘                │ S3 Bucket │
        │                                                  └───────────┘
        ▼                                                        ▲
   ┌──────────┐                                            ┌─────┴─────┐
   │ S3 Bucket│              4. Public Access              │ IAM User  │
   └──────────┘                 ┌──────────┐               │Other Acct │
                                │  Bucket  │               └───────────┘
                                │  Policy  │
                                │Principal:│
                                │   "*"    │
                                └────┬─────┘
                                     ▼
                           ┌───────────────────┐
                           │ Anonymous Visitor │───▶ S3 Bucket
                           └───────────────────┘

Bucket Policy (JSON): Resources, Effect (Allow/Deny), Actions (API calls), Principal (account/user)

Use cases: grant public access, force encryption at upload, cross-account access

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicRead",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::examplebucket/*"]
  }]
}

Block Public Access — prevent data leaks:

Leave ON if bucket should never be public
Can be set at account level (applies to all buckets)

Access granted if: (IAM permissions ALLOW it OR resource policy ALLOWS it) AND no explicit DENY

⚠️ Exam trap: Bucket policy ALLOWS but user can’t access → check for explicit DENY in IAM policy (DENY always wins)

Encryption: encrypt objects using encryption keys

S3 Static Website Hosting:

URL: http://bucket-name.s3-website-<region>.amazonaws.com or http://bucket-name.s3-website.<region>.amazonaws.com
403 Forbidden? → bucket policy must allow public reads

S3 Durability & Availability:

Durability: 99.999999999% (11 9’s) across all storage classes — lose 1 object per 10,000 years if storing 10M objects
Availability: varies by class (S3 Standard: 99.99% = ~53 min/year downtime)

S3 Storage Classes:

Amazon S3 Standard - General Purpose: used for frequently access data, low latency and high throughput, sustain 2 concurrent facility failures (Big Data analytics, CDN);
Amazon S3 Standard-Infrequent Access (IA) (99.9% availability): for data that is less frequently accessed, but requires rapid access. Lower cost than S3 Standard (Disaster Recovery, backups);
Amazon S3 One Zone-Infrequent Access (99.5% availability): data will be lost when AZ is destroyed (storing secondary backup copies, data you can recreate);
Amazon S3 Glacier Instant Retrieval: (milisecond retrieval, but once a quarter, minimum storage duration 90 days). Low-cost object storage with pricing for storage and retrieval cost;
Amazon S3 Glacier Flexible Retrieval (minimum storage duration of 90 days):
- Expedited (1 to 5 minutes);
- Standard (3 to 5 hours);
- Bulk (5 to 12 hours) - free.
Amazon S3 Glacier Deep Archive: minimum storage duration of 180 days:
- Standard (12 hours);
- Bulk (48 hours).
Amazon S3 Intelligent Tiering: moves objects automatically between Access Tiers based on usage, small monthly monitoring and auto-tiering fee, but there are no retrieval charges. Ideal for data with unknown or changing access patterns. Requires a small monthly monitoring and automation fee per object.
- Frequent Access tier (automatic): default tier;
- Infrequent Access tier (automatic): object not accessed for 30 days;
- Archive Instant Access tier (automatic): objects not accessed for 90 days;
- Archive Access tier (optional): configurable from 90 days to 700+ days;
- Deep Archive Access tier (optional): configurable from 180 days to 700+ days.
S3 Express One Zone (99.95% availability): high-performance single-AZ class using Directory Buckets. 10x faster than Standard (single-digit ms latency), 50% lower cost. Co-locate compute + storage in same AZ. Use cases: AI/ML training, HPC, financial modeling. Integrates with SageMaker, Athena, EMR, Glue.

Move between classes manually or using S3 Lifecycle configurations

⚠️ Exam traps - Storage Classes:

“Unknown access pattern” → Intelligent-Tiering (auto-moves, no retrieval fee)
“Single AZ” → One Zone-IA or Express One Zone (data lost if AZ destroyed)
“Millisecond retrieval from archive” → Glacier Instant (not Flexible!)
“Cheapest archive” → Glacier Deep Archive (but 12-48hr retrieval)
“Lowest latency” → Express One Zone (10x faster than Standard)
Default choice without usage details → Intelligent-Tiering (most cost-effective)
Glacier Flexible modes: Expedited / Standard / Bulk only (NO “Instant” mode — that’s a separate class)
Glacier Deep Archive modes: Standard / Bulk only (NO Expedited)

S3 Storage Classes Comparison:

Class	Avail.	AZs	Min Duration	Retrieval	Use Case
Standard	99.99%	≥3	None	Instant, free	Frequently accessed
Intelligent-Tiering	99.9%	≥3	None	Instant, free	Unknown access patterns
Standard-IA	99.9%	≥3	30 days	Instant, per GB	Infrequent but rapid access
One Zone-IA	99.5%	1	30 days	Instant, per GB	Secondary backups, recreatable
Glacier Instant	99.9%	≥3	90 days	ms, per GB	Once/quarter access
Glacier Flexible	99.99%	≥3	90 days	1-5 min / 3-5 hr / 5-12 hr	Archive, flexible retrieval
Glacier Deep Archive	99.99%	≥3	180 days	12 hr / 48 hr	Long-term archive
Express One Zone	99.95%	1	None	<10ms	AI/ML, HPC, low-latency

Durability: 99.999999999% (11 9’s) for ALL classes

⚠️ Exam trap: Lifecycle transition timing must respect minimum storage duration

IA classes (Standard-IA, One Zone-IA) = 30-day minimum charge
Glacier Instant/Flexible = 90-day minimum charge
Glacier Deep Archive = 180-day minimum charge
Transitioning before the minimum → you pay for both the old class AND the minimum of the new class = more expensive
Example: Standard → One Zone-IA after 7 days = you still pay 30 days of IA → wasteful. Transition at 30 days instead
“Re-creatable data” = One Zone-IA is fine (single AZ risk acceptable = cheaper than Standard-IA)

S3 Performance:

Latency: 100-200 ms
Per prefix limits:
- 3,500 PUT/COPY/POST/DELETE requests/sec
- 5,500 GET/HEAD requests/sec
No limit on number of prefixes in a bucket
Prefix = path between bucket and file name (e.g., bucket/folder1/sub1/file → prefix: /folder1/sub1/)
Spread across N prefixes → N × 5,500 GET/sec (e.g., 4 prefixes = 22,000 GET/sec)
S3 Transfer Acceleration: use CloudFront edge locations for faster uploads (long distance)
S3 Multi-Part Upload: parallelize uploads, retry failed parts only (required >5GB, recommended >100MB)
- Single PUT limit = 5GB (error if exceeded without multi-part)
S3 Byte-Range Fetches: request specific byte ranges in parallel
- Speed up downloads (parallel parts)
- Retrieve partial data (e.g., just the file header)
- Better resilience (retry only failed parts)

⚠️ Exam traps:

“Read first X bytes” / “file header/metadata” → Byte-Range Fetch
“Large files + unstable connection” → Multi-Part Upload + Transfer Acceleration
Multi-Part = resilience (retry parts), Transfer Acceleration = speed (edge locations)
“Faster S3 uploads” → S3 Transfer Acceleration (NOT Global Accelerator — GA is for ALB/NLB/EC2 endpoints, not S3)
“Cost-effective faster uploads” → S3TA + Multipart (NOT Direct Connect — expensive, months to set up; NOT VPN — no speed improvement)

S3 Batch Operations:

Bulk operations on existing objects with a single request
Use cases: modify metadata, copy between buckets, encrypt objects, modify ACLs/tags, restore from Glacier, invoke Lambda per object
Job = object list + action + optional parameters
Manages retries, tracks progress, sends notifications, generates reports
Use S3 Inventory to get object list, Athena to filter objects

⚠️ Exam trap: “Encrypt existing objects” / “change encryption on all files” → S3 Batch Operations

Lifecycle Rules = transition/delete (NOT encrypt)
CRR = replication (NOT in-place encryption)
Access Points = access control (NOT encryption)

S3 Inventory ──▶ Athena (filter) ──▶ S3 Batch Operations ──▶ Processed Objects
     │                                      ▲
     └── Objects List Report                │
                                   User: operation + params

S3 Lifecycle Rules:

Automate transitions between storage classes
Transition Actions: move objects to another class after X days
Expiration Actions: delete objects/versions/incomplete uploads after X days
Rules can target: prefix (s3://bucket/mp3/*) or object tags (Department: Finance)

⚠️ Exam trap:

“Delete old versions to reduce costs” → Expiration Actions (not Transition!)
“Delete incomplete multipart uploads” → Expiration Actions (Lifecycle Rule)
Transition = move to cheaper storage class
Expiration = permanently delete (objects, versions, or incomplete uploads)

Storage Class Transitions (allowed paths):

Standard ──┬──▶ Standard-IA ──┬──▶ Intelligent-Tiering ──┬──▶ One Zone-IA
           │                  │                          │
           │                  ▼                          ▼
           ├──▶ Glacier Instant ◀────────────────────────┤
           │                                             │
           ▼                                             ▼
      Glacier Flexible ◀─────────────────────────────────┤
           │                                             │
           ▼                                             ▼
      Glacier Deep Archive ◀─────────────────────────────┘

(All classes can transition DOWN, never UP)

Lifecycle Scenarios:

Scenario	Solution
Thumbnails recreatable, needed 60 days, then delete	One Zone-IA + expire after 60 days
Source images: immediate access 60 days, then 6hr retrieval OK	Standard → Glacier after 60 days
Recover deleted objects immediately for 30 days, then 48hr OK for 365 days	Versioning + noncurrent → Standard-IA → Glacier Deep Archive

S3 Analytics - Storage Class Analysis:

Recommendations for Standard and Standard-IA only (NOT One Zone-IA/Glacier)
Report updated daily, 24-48 hours to start seeing data
Good first step to create/improve Lifecycle Rules

⚠️ Exam trap: “Optimal days to transition” / “Lifecycle recommendations” → S3 Analytics (not Inventory!)

S3 Requester Pays:

Normally bucket owner pays for storage + data transfer
With Requester Pays: requester pays for requests + data download
Use case: share large datasets with other accounts
⚠️ Requester must be authenticated (cannot be anonymous)

S3 Event Notifications:

Events: S3:ObjectCreated, S3:ObjectRemoved, S3:ObjectRestore, S3:Replication
Object name filtering possible (*.jpg)
Delivery: typically seconds, can take a minute+
Destinations: SNS, SQS, Lambda Function
Requires resource policies on destination (SNS/SQS/Lambda must allow S3 to invoke)

           ┌──▶ SNS
           │
S3 Events ─┼──▶ SQS
           │
           └──▶ Lambda

S3 Event Notifications with EventBridge:

S3 → EventBridge → 18+ AWS services as destinations
Advanced filtering: JSON rules (metadata, object size, name)
EventBridge capabilities: Archive, Replay Events, Reliable delivery

⚠️ Exam trap: “Get notified on object upload” → Event Notifications (NOT Access Logs, Analytics, or Select)

Access Logs = audit logging (no notifications)
Analytics = storage class recommendations
S3 Select = query data inside objects

S3 Storage Lens

Overview:

Analyze and optimize storage across entire AWS Organization
30 days of usage & activity metrics
Aggregate by: Organization, accounts, regions, buckets, or prefixes
Export metrics daily to S3 (CSV, Parquet)

                    ┌─ Organization
                    ├─ Accounts
S3 Storage Lens ───▶├─ Regions        ───▶ Aggregate ───▶ Dashboard ───▶ ┌─ Summary Insights
   (Configure)      └─ Buckets                           (Analyze)       ├─ Data Protection
                                                                         └─ Cost Efficiency
                                                                            (Optimize)

Default Dashboard:

Multi-Region and Multi-Account data
Preconfigured by AWS, can’t be deleted (only disabled)

Metrics Categories:

Category	Key Metrics	Use Cases
Summary	StorageBytes, ObjectCount	Identify fastest-growing or unused buckets
Cost-Optimization	NonCurrentVersionStorageBytes, IncompleteMultipartUploadStorageBytes	Find incomplete multipart uploads >7 days, transition candidates
Data-Protection	VersioningEnabledBucketCount, MFADeleteEnabledBucketCount, SSEKMSEnabledBucketCount	Audit data protection best practices
Access-Management	ObjectOwnershipBucketOwnerEnforcedBucketCount	Check Object Ownership settings
Event	EventNotificationEnabledBucketCount	Identify buckets with Event Notifications
Performance	TransferAccelerationEnabledBucketCount	Find buckets with Transfer Acceleration
Activity	AllRequests, GetRequests, PutRequests, BytesDownloaded	Understand storage request patterns
Status Code	200OKStatusCount, 403ForbiddenErrorCount, 404NotFoundErrorCount	Monitor HTTP response distribution

Free vs Paid:

Feature	Free	Advanced (Paid)
Metrics	~28 usage metrics	+ Activity, Cost Optimization, Data Protection, Status Code
Retention	14 days	15 months
CloudWatch Publishing	❌	✅
Prefix Aggregation	❌	✅

S3 Encryption

4 Encryption Methods:

Method	Key Management	Header	Notes
SSE-S3	AWS-managed	`"x-amz-server-side-encryption": "AES256"`	Default for new buckets, AES-256
SSE-KMS	AWS KMS	`"x-amz-server-side-encryption": "aws:kms"`	Audit via CloudTrail, KMS quota limits
DSSE-KMS	AWS KMS (double)	`"x-amz-server-side-encryption": "aws:kms:dsse"`	Two layers of encryption, compliance
SSE-C	Customer-managed (outside AWS)	Key in every HTTP header	HTTPS required, S3 doesn’t store key
Client-Side	Customer encrypts before upload	N/A	Full control, use S3 Encryption Library

⚠️ Exam trap: “Customer manages keys” + “never store keys in AWS” → SSE-C or Client-Side

SSE-C: encryption happens in S3, but YOU send the key with each request (S3 discards after use)
Client-Side: encryption happens on YOUR side before upload, S3 never sees unencrypted data
Both = keys never stored in AWS; difference = WHERE encryption happens

⚠️ Exam trap: “Keys in AWS OK” + “control rotation policy” → SSE-KMS

SSE-S3 = AWS manages everything (no rotation control)
SSE-KMS = keys in AWS, but YOU control rotation (automatic yearly or on-demand)

⚠️ Exam trap: “Encrypt all objects by default” → Do nothing (SSE-S3 is automatic since Jan 2023)

All new objects encrypted with SSE-S3 by default
HTTPS = encryption in transit (not at rest)
Versioning = version history (not encryption)

Encryption Evaluation Order:

Bucket Policy evaluated first (can deny/require specific encryption)
Default Encryption applied if no encryption header in request

SSE-S3 (Server-Side Encryption with S3-Managed Keys):

User ──── HTTP(S) + Header ────▶ ┌─────────────────────────────────┐
          (upload object)        │           Amazon S3             │
                                 │  Object + S3 Owned Key          │
                                 │         ↓                       │
                                 │    [Encryption]                 │
                                 │         ↓                       │
                                 │    S3 Bucket (encrypted)        │
                                 └─────────────────────────────────┘

SSE-KMS Limitation:

Upload calls GenerateDataKey API, download calls Decrypt API
Counts toward KMS quota: 5,500 / 10,000 / 30,000 req/s (region-dependent)
Request quota increase via Service Quotas Console

SSE-KMS (Server-Side Encryption with KMS Keys):

User ──── HTTP(S) + Header ────▶ ┌─────────────────────────────────┐
          (upload object)        │           Amazon S3             │
                                 │  Object + KMS Key (API call)    │
                                 │         ↓                       │
                                 │    [Encryption]                 │
                                 │         ↓                       │
                                 │    S3 Bucket (encrypted)        │
                                 └─────────────────────────────────┘
                                            ▲
              ┌─────────┐                   │ API call
              │ KMS Key │───────────────────┘
              └─────────┘
        (GenerateDataKey / Decrypt)

⚠️ Exam trap: “High-throughput S3” + “encryption” → SSE-S3 (not SSE-KMS!)

SSE-KMS has API quota limits + extra cost per request
SSE-S3 = no limits, no extra cost, still AES-256

SSE-C (Server-Side Encryption with Customer-Provided Keys):

User ──── HTTPS ONLY ──────────▶ ┌─────────────────────────────────┐
     (object + key in header)    │           Amazon S3             │
                                 │  Object + Client-Provided Key   │
                                 │         ↓                       │
                                 │    [Encryption]                 │
                                 │         ↓                       │
                                 │    S3 Bucket (encrypted)        │
                                 └─────────────────────────────────┘
                                 (S3 discards key after use)

Client-Side Encryption:

┌──────┐   ┌────────────┐   ┌──────────────┐            ┌───────────┐
│ File │ + │ Client Key │ → │ [Encryption] │ → HTTP(S) → │ S3 Bucket │
└──────┘   └────────────┘   │ (client-side)│            │(encrypted)│
                            └──────────────┘            └───────────┘
           (Customer manages keys + encryption cycle)

Force Encryption in Transit (HTTPS):

Use bucket policy with aws:SecureTransport condition
Deny requests where aws:SecureTransport: false

{
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-bucket/*",
  "Condition": {
    "Bool": { "aws:SecureTransport": "false" }
  }
}

Origin = scheme (protocol) + host (domain) + port
- Example: https://www.example.com (port 443 implied for HTTPS)
Same origin: http://example.com/app1 & http://example.com/app2
Different origins: http://www.example.com & http://other.example.com
Browser-based mechanism to allow requests to other origins
Requests blocked unless cross-origin server allows via CORS Headers

CORS Flow (Preflight Request):

┌────────────────┐                           ┌────────────────┐
│  Web Server    │                           │  Web Server    │
│   (Origin)     │                           │ (Cross-Origin) │
│ example.com    │                           │  other.com     │
└───────┬────────┘                           └───────▲────────┘
        │                                            │
        │ HTTPS Request                              │
        ▼                                            │
   ┌─────────────┐    1. OPTIONS (Preflight)         │
   │ Web Browser │──────────────────────────────────▶│
   │             │    Host: other.com                │
   │             │    Origin: example.com            │
   │             │◀──────────────────────────────────│
   │             │    2. Preflight Response          │
   │             │    Access-Control-Allow-Origin:   │
   │             │      https://example.com          │
   │             │    Access-Control-Allow-Methods:  │
   │             │      GET, PUT, DELETE             │
   │             │──────────────────────────────────▶│
   └─────────────┘    3. GET / (actual request)      │
                      Host: other.com                │
                      Origin: example.com            │

S3 CORS:

If client makes cross-origin request to S3 bucket, must enable correct CORS headers
Configure CORS on the cross-origin bucket (the one being requested)
Allow specific origin or * (all origins)

S3 CORS Example (Static Website with Assets in Different Bucket):

┌─────────────┐   1. GET /index.html                    ┌─────────────────────┐
│ Web Browser │────────────────────────────────────────▶│ S3: my-bucket-html  │
│             │◀────────────────────────────────────────│ (Static Website)    │
│             │   index.html                            │ Origin bucket       │
│             │                                         └─────────────────────┘
│             │   2. GET /images/coffee.jpg
│             │      Host: my-bucket-assets.s3-website...
│             │      Origin: my-bucket-html.s3-website...
│             │────────────────────────────────────────▶┌─────────────────────┐
│             │◀────────────────────────────────────────│ S3: my-bucket-assets│
│             │   Access-Control-Allow-Origin:          │ (Static Website)    │
└─────────────┘     my-bucket-html.s3-website...        │ Cross-origin bucket │
                                                        │ ← CORS config here  │
                                                        └─────────────────────┘

⚠️ Exam trap: CORS errors on S3 → configure CORS on the target bucket (the one being requested), not the origin

S3 MFA Delete

Requires MFA code before critical S3 operations

MFA required for:
- Permanently delete an object version
- Suspend versioning on bucket
MFA NOT required for:
- Enable versioning
- List deleted versions
Prerequisites: Versioning must be enabled
Only root account can enable/disable MFA Delete

S3 Access Logs

Log all requests to S3 bucket (authorized AND denied)
Logs stored in another S3 bucket (same region)
Use for audit, security analysis
Analyze logs with Amazon Athena (serverless SQL queries on S3 data)

⚠️ Exam trap: Never set logging bucket = monitored bucket → creates infinite loop, bucket grows exponentially

⚠️ Exam trap: “Audit who accessed/tried to access S3” → S3 Access Logs + Athena

Access Logs = captures all requests (including denied)
Athena = serverless SQL to analyze log files in S3
Wrong: CloudTrail = API calls only (not data access patterns)
Wrong: Bucket Policy = controls access, doesn’t log attempts

S3 Pre-Signed URLs

Generate via Console, CLI, or SDK
User inherits permissions of URL generator for GET/PUT

Expiration:

Method	Default	Max
S3 Console	-	720 min (12 hours)
AWS CLI	3600 sec (1 hour)	604800 sec (168 hours / 7 days)

Use Cases:

Allow logged-in users to download premium content
Dynamically generate download URLs for changing user list
Temporary upload access to specific S3 location

S3 Access Points

Simplify security management for S3 buckets at scale
Each Access Point has:
- Own DNS name (Internet Origin or VPC Origin)
- Access Point policy (like bucket policy)
Use case: different teams need access to different prefixes

S3 Access Points:

Users (Finance) ───▶ Finance Access Point ───┐
                     (R/W to /finance/*)     │
                                             ▼
Users (Sales) ─────▶ Sales Access Point ────▶ S3 Bucket ◀── Simple
                     (R/W to /sales/*)       │               Bucket
                                             │               Policy
Users (Analytics) ─▶ Analytics Access Point ─┘
                     (R to entire bucket)

VPC Origin Access Points:

Access Point accessible only from within VPC
Requires VPC Endpoint (Gateway or Interface)
VPC Endpoint Policy must allow access to bucket AND Access Point

VPC Origin:

┌─────────────────────────────────────────────────────────────────────┐
│ VPC                                                                 │
│  EC2 ──▶ VPC Endpoint ──▶ Access Point (VPC Origin) ──▶ S3 Bucket  │
│          (Endpoint       (Access Point                 (Bucket     │
│           Policy)          Policy)                      Policy)    │
└─────────────────────────────────────────────────────────────────────┘

S3 Object Lambda

Use Lambda to transform objects before retrieval
One S3 bucket + multiple Object Lambda Access Points

S3 Object Lambda:

                                    ┌─────────────────────────────────────┐
E-Commerce App ──▶ Original Object ─┤ S3 Access Point ──▶ S3 Bucket       │
                                    │                                     │
Analytics App ───▶ Redacted Object ─┤ Object Lambda AP ──▶ Redacting λ ───┤
                                    │                                     │
Marketing App ───▶ Enriched Object ─┤ Object Lambda AP ──▶ Enriching λ ◀──┼── Customer DB
                                    └─────────────────────────────────────┘

Use Cases:

Redact PII for analytics/non-prod environments
Convert data formats (XML → JSON)
Resize/watermark images on-the-fly with caller-specific details

⚠️ Exam trap: “Transform/redact data before retrieval” → S3 Object Lambda

Wrong: Copy to another bucket (extra storage cost, data duplication)
Wrong: Object Lock (prevents deletion, doesn’t transform data)
Object Lambda = on-the-fly transformation, no data duplication

S3 WORM Protection (Vault Lock & Object Lock)

Feature	Glacier Vault Lock	S3 Object Lock
Applies to	Glacier Vaults only	Any S3 storage class
Requires Versioning	❌	✅
Lock Level	Entire vault	Per object version
Policy Immutable	Yes (after lock)	Depends on mode

S3 Object Lock Modes:

Mode	Who Can Delete?	Change Settings?	Use Case
Compliance	No one (including root)	❌	Regulatory requirements
Governance	Special permission users	✅	Internal policies

Lock Reversal & Override Details:

Lock Type	Can Shorten?	Can Remove?	Can Delete Object?	Who Can Override?
Compliance Retention	❌ Never	❌ Never	❌ Until expires	No one — wait for expiry
Governance Retention	✅	✅	✅	Users with `s3:BypassGovernanceRetention` + header `x-amz-bypass-governance-retention:true`
Legal Hold	N/A	✅	❌ While active	Users with `s3:PutObjectLegalHold` permission
Vault Lock	❌ Never	❌ Never	❌ Per policy	No one — delete vault to remove (loses all data)

Object Lock Features:

Retention Period: fixed duration, can be extended (never shortened in Compliance)
Legal Hold: indefinite, independent of retention (s3:PutObjectLegalHold)

⚠️ Exam traps:

“Immutable archive for compliance” + Glacier → Vault Lock
“Prevent deletion for 7 years” + any S3 class → Object Lock (Compliance)
“Allow admin override” → Governance mode (not Compliance!)
Compliance mode = truly immutable — no one (root, admin, AWS Support) can delete or shorten retention
Legal Hold ≠ Retention Period — can have both on same object
Object Lock requires versioning enabled before activation

🎯 MASTER SUMMARY: S3 Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: S3 is Object Storage, Not File System

S3 stores objects (files) in buckets (containers). There are no real directories — just keys with slashes.

Key = full path: s3://bucket/folder/subfolder/file.txt
“Folders” are UI illusion — S3 uses flat namespace
Prefix = everything before the filename (important for performance & policies)

Principle 2: Durability vs Availability

Two different concepts:

Durability (11 9’s = 99.999999999%): Will I lose my data? → Almost never
Availability (varies by class): Can I access it right now? → Depends on class

All storage classes have same durability. Availability differs.

Principle 3: Access Control Hierarchy

Access granted if: (IAM allows OR Resource policy allows) AND no explicit DENY

Who’s accessing?	Use…
IAM User in same account	IAM Policy
EC2/Lambda	IAM Role
Cross-account	Bucket Policy
Public/Anonymous	Bucket Policy with `Principal: "*"`

DENY always wins — if any policy denies, access is denied.

Principle 4: Encryption is Automatic (Since Jan 2023)

SSE-S3 is default for all new objects — you don’t need to enable it
SSE-S3 = AWS manages keys, no cost, no limits
SSE-KMS = you control keys, but has API quota limits
SSE-C = you provide keys with every request, S3 never stores them
Client-Side = you encrypt before upload, S3 never sees unencrypted data

Principle 5: Lifecycle = Move or Delete, Batch = Transform

Two different tools for different jobs:

Lifecycle Rules: Automate transitions (Standard → Glacier) or deletions
Batch Operations: One-time bulk actions (encrypt, copy, tag, invoke Lambda)

Lifecycle cannot encrypt. Batch Operations can.

Principle 6: Replication ≠ Backup

Replication is asynchronous copy to another bucket
Only new objects replicated (use Batch Replication for existing)
Requires versioning on both buckets
Delete markers can be replicated; version ID deletes cannot

Principle 7: Performance = Prefixes

S3 scales per prefix:

3,500 PUT/sec and 5,500 GET/sec per prefix
More prefixes = more throughput (linear scaling)
Transfer Acceleration = faster uploads via CloudFront edge
Multi-Part = parallel upload, required >5GB

Principle 8: WORM = Write Once Read Many

Two mechanisms:

Object Lock: Per-object, requires versioning, Compliance or Governance mode
Vault Lock: Glacier only, entire vault, truly immutable

Compliance mode = NO ONE can delete (not even root or AWS Support)

Part 2: Decision Tree (Follow Keywords → Find Answer)

Step 1: What storage class?

                    What's the access pattern?
                              │
    ┌─────────────┬───────────┼───────────┬─────────────┬─────────────┐
    ▼             ▼           ▼           ▼             ▼             ▼
 Frequent     Unknown/     Infrequent  Archive      Archive       Lowest
 Access       Changing     Access      (instant)    (flexible)    Latency
    │             │           │           │             │             │
    ▼             ▼           ▼           ▼             ▼             ▼
Standard   Intelligent-  Standard-IA  Glacier      Glacier       Express
           Tiering       or One Zone  Instant      Flexible/     One Zone
                                                   Deep Archive

Step 2: Encryption decision

                    Who manages the keys?
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
   AWS manages           You control          Keys never in AWS
   (no work)             (audit/rotate)           │
        │                     │           ┌───────┴───────┐
        ▼                     ▼           ▼               ▼
     SSE-S3              SSE-KMS      SSE-C         Client-Side
   (default)           (CloudTrail)  (key in       (encrypt before
                                     header)        upload)

Step 3: Feature-Based Decision Table

If question mentions…	Answer is…
“unknown access pattern”	Intelligent-Tiering
“millisecond retrieval from archive”	Glacier Instant
“cheapest archive” / “rarely accessed”	Glacier Deep Archive
“lowest latency” / “AI/ML training”	Express One Zone
“recreatable data” + single AZ OK	One Zone-IA
“encrypt existing objects”	S3 Batch Operations
“transition to cheaper storage”	Lifecycle Rules
“delete old versions”	Lifecycle Expiration Actions
“delete incomplete multipart uploads”	Lifecycle Expiration Actions
“audit object access”	S3 Access Logs + Athena
“customer manages keys outside AWS”	SSE-C or Client-Side
“high throughput + encryption”	SSE-S3 (not KMS — quota limits)
“prevent deletion for X years”	Object Lock (Compliance)
“allow admin override”	Object Lock (Governance)
“cross-account access”	Bucket Policy
“generate temporary download link”	Pre-Signed URL
“different access per team/prefix”	S3 Access Points
“transform data before retrieval”	S3 Object Lambda
“read first X bytes” / “file header”	Byte-Range Fetch
“large file + unreliable network”	Multi-Part Upload
“faster uploads over long distance”	Transfer Acceleration
“analyze storage costs”	S3 Storage Lens
“lifecycle recommendations”	S3 Analytics
“replicate existing objects”	S3 Batch Replication
“CORS error”	Configure CORS on target bucket

The “NOT” Rules (Eliminate Wrong Answers Fast)

Statement	Why It’s Wrong
Lifecycle Rules encrypt objects	Lifecycle = transition/delete only, not encrypt
SSE-KMS for high-throughput	SSE-KMS has API quota limits — use SSE-S3
Replication for existing objects	Only new objects — use Batch Replication for existing
Access Logs for real-time alerts	Access Logs = audit, not notifications — use Event Notifications
CloudTrail for data access patterns	CloudTrail = API calls, not object-level access — use Access Logs
Object Lock without versioning	Versioning is required before enabling Object Lock
Compliance mode with admin override	Compliance = no one can override — use Governance for admin override
Glacier Flexible for instant access	Flexible = hours — use Glacier Instant for milliseconds
Standard-IA for archive	IA = infrequent access, not archive — use Glacier for archive

The “CANNOT” List

Cannot…	Instead…
Create bucket with existing name	Names are globally unique — choose different name
Encrypt with Lifecycle Rules	Use Batch Operations for encryption
Shorten Compliance retention	Wait for expiry (truly immutable)
Delete in Compliance mode	No one can — not even root or AWS Support
Replicate to bucket without versioning	Enable versioning on both buckets
Chain replications (A→B→C)	Set up direct replication from A to C
Use SSE-C without HTTPS	HTTPS is mandatory for SSE-C
Set Object Lock on bucket without versioning	Enable versioning first

Part 3: Scenario Pattern Recognition

Pattern: “Unknown or changing access patterns”

Keywords: unpredictable access, varies over time, don’t know access frequency

Answer: Intelligent-Tiering

Why: Auto-moves objects between tiers, no retrieval fees, small monitoring fee.

Pattern: “Archive with occasional instant access”

Keywords: archive, quarterly access, millisecond retrieval, compliance archive

Answer: Glacier Instant Retrieval

Why: Archive pricing + instant access. Glacier Flexible = hours, not milliseconds.

Pattern: “Cheapest possible storage”

Keywords: rarely accessed, years of retention, 12+ hour retrieval OK

Answer: Glacier Deep Archive

Why: Cheapest class, 12-48 hour retrieval. Use Standard/Bulk retrieval.

Pattern: “Encrypt existing objects”

Keywords: encrypt all current files, change encryption, bulk encrypt

Answer: S3 Batch Operations

Why: Lifecycle Rules can’t encrypt. CRR creates copies. Batch Operations modifies in-place.

Pattern: “Delete old versions / incomplete uploads”

Keywords: reduce costs, clean up, delete versions older than X days, incomplete multipart

Answer: Lifecycle Expiration Actions

Why: Transition = move to cheaper class. Expiration = delete permanently.

Pattern: “Audit who accessed objects”

Keywords: audit access, security analysis, who accessed, access attempts

Answer: S3 Access Logs + Amazon Athena

Why: Access Logs capture all requests (including denied). Athena queries logs with SQL.

Pattern: “Customer manages encryption keys”

Keywords: customer-managed keys, keys not stored in AWS, full key control

Answer: SSE-C (if encryption in S3) or Client-Side (if encryption before upload)

Why: SSE-S3/SSE-KMS store keys in AWS. SSE-C/Client-Side = keys never stored in AWS.

Pattern: “Prevent anyone from deleting for X years”

Keywords: regulatory compliance, immutable, prevent deletion, WORM, SEC 17a-4

Answer: Object Lock in Compliance mode

Why: Compliance mode = truly immutable. No one (root, admin, AWS) can delete until retention expires.

Pattern: “Prevent deletion but allow admin override”

Keywords: internal policy, admin can override, flexible protection

Answer: Object Lock in Governance mode

Why: Users with s3:BypassGovernanceRetention permission can override. Compliance mode has no override.

Pattern: “Temporary download/upload link”

Keywords: temporary access, time-limited URL, download link for logged-in users

Answer: Pre-Signed URL

Why: User inherits permissions of URL generator. Expires after set time (max 7 days via CLI).

Pattern: “Different teams need different bucket access”

Keywords: multiple teams, different prefixes, simplify access management

Answer: S3 Access Points

Why: Each Access Point has own policy, simplifies per-team access vs complex bucket policy.

Pattern: “Transform/redact data before retrieval”

Keywords: redact PII, convert format, resize images, enrich data on-the-fly

Answer: S3 Object Lambda

Why: Lambda transforms during GET request. No data duplication, no extra storage.

Pattern: “Faster uploads over long distances”

Keywords: global users, long-distance upload, slow uploads

Answer: S3 Transfer Acceleration

Why: Uses CloudFront edge locations. Combine with Multi-Part for large files.

Pattern: “Large file upload with unreliable network”

Keywords: large files, unstable connection, retry on failure

Answer: Multi-Part Upload

Why: Parallel upload, retry only failed parts. Required for >5GB files.

Pattern: “Read only beginning of file”

Keywords: file header, first N bytes, metadata extraction

Answer: Byte-Range Fetch

Why: Request specific byte ranges. Efficient for partial data retrieval.

Keywords: cross-origin, CORS error, browser blocking, different domain

Answer: Configure CORS on the target bucket (the one being requested)

Why: CORS is configured where the data is, not where the request originates.

Pattern: “Replicate existing objects”

Keywords: existing objects, current files, replicate everything

Answer: S3 Batch Replication

Why: Normal replication only copies new objects. Batch Replication handles existing.

Part 4: Quick Reference Tables

Storage Class Comparison

Class	Avail.	AZs	Min Duration	Retrieval	Use Case
Standard	99.99%	≥3	-	Instant	Frequently accessed
Intelligent-Tiering	99.9%	≥3	-	Instant	Unknown patterns
Standard-IA	99.9%	≥3	30 days	Instant	Infrequent, rapid access
One Zone-IA	99.5%	1	30 days	Instant	Recreatable data
Glacier Instant	99.9%	≥3	90 days	ms	Once/quarter access
Glacier Flexible	99.99%	≥3	90 days	1min-12hr	Archive, flexible
Glacier Deep Archive	99.99%	≥3	180 days	12-48hr	Long-term archive
Express One Zone	99.95%	1	-	<10ms	AI/ML, lowest latency

Encryption Comparison

Method	Keys Managed By	Keys Stored In AWS?	HTTPS Required?	Quota Limits?
SSE-S3	AWS	✅ Yes	No	❌ No
SSE-KMS	Customer (via KMS)	✅ Yes	No	✅ Yes (API quota)
DSSE-KMS	Customer (via KMS)	✅ Yes	No	✅ Yes
SSE-C	Customer (external)	❌ No	✅ Yes (mandatory)	❌ No
Client-Side	Customer (external)	❌ No	No	❌ No

Object Lock Modes

Mode	Who Can Delete?	Shorten Retention?	Override?	Use Case
Compliance	No one	❌ Never	❌ Never	Regulatory (SEC, FINRA)
Governance	Special permission	✅ Yes	✅ With permission	Internal policies
Legal Hold	No one while active	N/A	✅ Remove hold	Litigation, investigations

Performance Limits

Metric	Limit
Requests per prefix (PUT/POST/DELETE)	3,500/sec
Requests per prefix (GET/HEAD)	5,500/sec
Single PUT max size	5 GB
Object max size	5 TB
Multi-Part Upload max parts	10,000
Multi-Part Upload min part size	5 MB (except last)

Pre-Signed URL Expiration

Method	Default	Maximum
S3 Console	1-720 minutes	12 hours
AWS CLI	3600 seconds	604800 seconds (7 days)

Key APIs to Remember

API/Header	Purpose
`x-amz-server-side-encryption: AES256`	SSE-S3
`x-amz-server-side-encryption: aws:kms`	SSE-KMS
`x-amz-bypass-governance-retention: true`	Override Governance mode
`aws:SecureTransport`	Condition for HTTPS enforcement

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“unknown access pattern”	Intelligent-Tiering
“millisecond from archive”	Glacier Instant
“cheapest archive”	Glacier Deep Archive
“lowest latency” / “single-digit ms”	Express One Zone
“recreatable” + “single AZ OK”	One Zone-IA
“encrypt existing objects”	S3 Batch Operations
“transition to cheaper”	Lifecycle Transition Actions
“delete old versions”	Lifecycle Expiration Actions
“incomplete multipart”	Lifecycle Expiration Actions
“audit access” / “who accessed”	S3 Access Logs + Athena
“keys never in AWS”	SSE-C or Client-Side
“high throughput + encrypt”	SSE-S3 (not KMS!)
“prevent deletion” + “compliance”	Object Lock Compliance
“admin can override”	Object Lock Governance
“temporary link”	Pre-Signed URL
“per-team access”	S3 Access Points
“transform before GET”	S3 Object Lambda
“read first N bytes”	Byte-Range Fetch
“large file upload”	Multi-Part Upload
“faster long-distance”	Transfer Acceleration
“storage cost analysis”	S3 Storage Lens
“lifecycle recommendations”	S3 Analytics
“replicate existing”	S3 Batch Replication
“CORS error”	Configure CORS on target bucket
“cross-account”	Bucket Policy
“event on upload”	Event Notifications (SNS/SQS/Lambda)
“infinite loop” + logging	Logging bucket ≠ monitored bucket
“can’t create bucket”	Name already taken globally
“global but regional”	Bucket created in region
“overwrite + immediate read”	Always latest (strong consistency)
“eventual consistency S3”	❌ Outdated — S3 is strongly consistent since 2020

Part 6: Elimination Checklist

When stuck between options, eliminate systematically:

□ Is it about ENCRYPTION?
  → Lifecycle Rules = ❌ Can't encrypt
  → Batch Operations = ✅ Can encrypt existing objects

□ Is it about DELETION PROTECTION?
  → Compliance mode = No one can delete/override
  → Governance mode = Admin can override with permission
  → Legal Hold = Indefinite, removable

□ Is it about ACCESS CONTROL?
  → Same account user = IAM Policy
  → EC2/Lambda = IAM Role
  → Cross-account = Bucket Policy
  → Different teams = Access Points

□ Is it about PERFORMANCE?
  → More throughput = Spread across prefixes
  → Large files = Multi-Part Upload
  → Long distance = Transfer Acceleration
  → Partial data = Byte-Range Fetch

□ Is it about STORAGE CLASS?
  → Unknown pattern = Intelligent-Tiering
  → Infrequent = Standard-IA or One Zone-IA
  → Archive (instant) = Glacier Instant
  → Archive (flexible) = Glacier Flexible
  → Archive (cheapest) = Glacier Deep Archive
  → Lowest latency = Express One Zone

□ Is it about AUDITING?
  → Who accessed objects = S3 Access Logs + Athena
  → API calls to S3 = CloudTrail
  → Storage analysis = S3 Storage Lens or Analytics

□ Is it about REPLICATION?
  → New objects = Standard Replication
  → Existing objects = Batch Replication
  → Versioning required = ✅ On both buckets

🏆 The Golden Rules

SSE-S3 is default — all new objects encrypted automatically (since Jan 2023)
SSE-KMS has limits — for high-throughput, use SSE-S3
Lifecycle cannot encrypt — use Batch Operations for encryption
Compliance mode = truly immutable — no one can delete, not even root
Governance mode = admin override — with special permission
Replication only copies new objects — use Batch Replication for existing
Versioning required for Object Lock — enable versioning first
DENY always wins — explicit deny in any policy blocks access
Bucket names are globally unique — across all accounts, all regions
Prefixes = parallelism — spread across prefixes for more throughput
Multi-Part required >5GB — single PUT limited to 5GB
Access Logs ≠ Monitored bucket — avoid infinite loop
CORS on target bucket — configure where data is, not where request originates
Pre-Signed URL max = 7 days (via CLI) — 12 hours via console
Object Lambda = transform on GET — no data duplication

AWS Snow Family:

AWS Snow Family: highly-secure, portable devices to collect and process data at the edge, and migrate data into and out of AWS. Trying to resolve challenges like:

Limited connectivity;
Limited bandwidth;
High network cost (limited bandwidth, bad connection stability).

Data migration and Edge computing:

Snowcone: 2 CPUs, 4 GiB RAM and 8 TB of HDD Storage or 14 TB of SSD Storage (smallest, portable)
Snowball Edge:
- Snowball Edge Storage Optimized: 40 vCPUs, 80 GiB RAM, 80 TB of HDD capacity
- Snowball Edge Compute Optimized: 104 vCPUs, 416 GiB RAM, 42 TB of HDD or 28 TB NVMe capacity
~~Snowmobile~~: 100 PB per truck — discontinued (use multiple Snowball Edge instead)

AWS OpsHub — GUI application to manage Snow Family devices (installed on your computer)

Unlock and configure devices
Transfer files (drag & drop)
Launch/manage EC2 instances running on device
Monitor device metrics
Replaces CLI-only management (previously required CLI commands)

AWS OpsHub Management:

┌─────────────────────────────────────────────────────────┐
│  Your Computer                                          │
│  ┌───────────────────────────────────────────────────┐  │
│  │              AWS OpsHub (GUI)                     │  │
│  │  ┌─────────────┬─────────────┬─────────────────┐  │  │
│  │  │ Unlock &    │ Transfer    │ Launch EC2      │  │  │
│  │  │ Configure   │ Files       │ Manage Storage  │  │  │
│  │  └─────────────┴─────────────┴─────────────────┘  │  │
│  └───────────────────────────────────────────────────┘  │
│         │                                               │
│         │ Local connection (USB/Network)                │
│         ▼                                               │
│  ┌───────────────┐                                      │
│  │ Snow Device   │                                      │
│  │ (Snowcone /   │                                      │
│  │ Snowball Edge)│                                      │
│  └───────────────┘                                      │
└─────────────────────────────────────────────────────────┘

When to Use Snowball

Rule of thumb: If network transfer takes > 1 week → use Snowball

Data Size	100 Mbps	1 Gbps	10 Gbps
10 TB	12 days	30 hours	3 hours
100 TB	124 days	12 days	30 hours
1 PB	3 years	124 days	12 days

Direct Upload vs Snowball:

Direct:    Client ──── www (10Gbit/s) ────▶ S3 Bucket

Snowball:  Client ──▶ Snowball ──▶ [ship] ──▶ AWS ──▶ S3 Bucket
                      (local)                 (import)

Snowball Edge Computing

Edge Computing = process data where it’s created (before sending to cloud)

Locations with limited/no internet: trucks, ships, mining stations
Run EC2 instances or Lambda functions locally

Device	Use Case
Snowball Edge Storage Optimized	Large data + some compute
Snowball Edge Compute Optimized	Heavy processing (ML, transcoding)

Use cases: Preprocess data, machine learning at edge, media transcoding

⚠️ Exam trap: “Large data + process while in transit” → Snowball Edge (not Snowcone)

Snowcone = small (8-14 TB), limited compute
Snowball Edge = large (42-80 TB), full EC2/Lambda compute

Snowball to Glacier

⚠️ Exam trap: Snowball cannot import to Glacier directly

Must import to S3 first, then use S3 Lifecycle Policy to transition to Glacier

Snowball ──▶ Amazon S3 ──▶ (Lifecycle Policy) ──▶ Amazon Glacier

Amazon FSx

Amazon FSx = Launch 3rd party high-performance file systems on AWS (fully managed)

FSx for Windows File Server

Fully managed Windows file system share (SMB protocol + NTFS)
Active Directory integration, ACLs, user quotas
Can be mounted on Linux EC2 instances
Supports DFS Namespaces (group files across multiple FS)
Multi-AZ available, daily backups to S3
Access from on-premises via VPN or Direct Connect

FSx for Lustre

Parallel distributed file system for HPC (Linux + cluster)
Use cases: ML, HPC, video processing, financial modeling
Scales to 100s GB/s, millions of IOPS, sub-ms latency
S3 integration: read S3 as file system, write results back to S3
Access from on-premises via VPN or Direct Connect

Lustre Deployment Options:

Option	Replication	Performance	Use Case
Scratch	❌ No (data lost if fails)	6x faster (200 MBps/TiB)	Short-term processing, cost optimized
Persistent	✅ Within same AZ	Standard	Long-term processing, sensitive data

FSx Lustre Deployment Options:

Scratch File System:                    Persistent File System:
┌─────────────────────────────┐         ┌─────────────────────────────┐
│ Region                      │         │ Region                      │
│  ┌─────────┐   ┌─────────┐  │         │  ┌─────────┐   ┌─────────┐  │
│  │  AZ 1   │   │  AZ 2   │  │         │  │  AZ 1   │   │  AZ 2   │  │
│  │Compute  │   │Compute  │  │         │  │Compute  │   │Compute  │  │
│  └────┬────┘   └────┬────┘  │         │  └────┬────┘   └────┬────┘  │
│       └─────┬───────┘       │         │       └─────┬───────┘       │
│            ENI              │         │            ENI              │
│             │               │         │             │               │
│        ┌────▼────┐          │         │        ┌────▼────┐          │
│        │  FSx    │──▶ S3    │         │        │  FSx    │──▶ S3    │
│        │(Scratch)│ (optional)│        │        │(Persist)│ (optional)│
│        └─────────┘          │         │        └─────────┘          │
│     (No replication)        │         │   (Replicated in AZ)        │
└─────────────────────────────┘         └─────────────────────────────┘

FSx for NetApp ONTAP

Managed NetApp ONTAP on AWS
Multi-protocol: NFS, SMB, iSCSI
Migrate ONTAP/NAS workloads to AWS
Auto-scaling storage, snapshots, replication, compression, deduplication
Point-in-time cloning (great for testing)

FSx for OpenZFS

Managed OpenZFS on AWS
NFS protocol (v3, v4, v4.1, v4.2)
Migrate ZFS workloads to AWS
Up to 1,000,000 IOPS, <0.5ms latency
Snapshots, compression, point-in-time cloning

FSx for NetApp ONTAP / OpenZFS - Compatible Clients:

                    ┌─────────────────────────┐
                    │ FSx NetApp ONTAP        │
                    │  (NFS, SMB, iSCSI)      │
                    │ ─────────────────────── │
                    │ FSx OpenZFS             │
                    │  (NFS v3/v4 only)       │
                    └───────────┬─────────────┘
                                │
       ┌────────────────────────┼────────────────────────┐
       ▼                        ▼                        ▼
┌─────────────┐     ┌─────────────────────┐     ┌──────────────┐
│EC2/ECS/EKS  │     │VMware/AppStream/    │     │On-premises   │
│             │     │WorkSpaces           │     │Server        │
└─────────────┘     └─────────────────────┘     └──────────────┘
  Linux/Win/Mac

FSx Comparison:

FSx Type	Protocol	Best For	Key Feature
Windows	SMB, NTFS	Windows workloads	AD integration, Multi-AZ
Lustre	POSIX	HPC, ML, Linux	S3 integration, sub-ms latency
NetApp ONTAP	NFS, SMB, iSCSI	Multi-OS, NAS migration	Auto-scaling, cloning
OpenZFS	NFS	ZFS migration	1M IOPS, <0.5ms latency, cloning

FSx Use Case Decision Tree:

Scenario	Answer
Windows app needs shared storage + Active Directory	FSx for Windows
HPC cluster needs fast shared storage + read from S3	FSx for Lustre
ML training with large datasets in S3	FSx for Lustre
Migrate existing Windows file server to AWS	FSx for Windows
Migrate NetApp/NAS to AWS	FSx for NetApp ONTAP
Need NFS + SMB + iSCSI on same file system	FSx for NetApp ONTAP
Migrate ZFS-based workloads to AWS	FSx for OpenZFS
Need point-in-time cloning for testing	NetApp ONTAP or OpenZFS
Short-term compute job, optimize cost	FSx Lustre Scratch
Long-term processing, data must survive failure	FSx Lustre Persistent

⚠️ Exam traps:

“Windows file share + AD” → FSx for Windows
“HPC / ML + Linux + S3 integration” → FSx for Lustre
“Multi-protocol (NFS + SMB + iSCSI)” → FSx for NetApp ONTAP
“Migrate ZFS workloads” → FSx for OpenZFS
“Short-term HPC, cost optimized” → FSx Lustre Scratch
“Long-term, data must persist” → FSx Lustre Persistent
NetApp ONTAP vs OpenZFS: both have cloning, but ONTAP = multi-protocol, OpenZFS = NFS only

AWS Storage Gateway

Bridge between on-premises and AWS cloud storage

Use cases: DR, backup/restore, tiered storage, on-prem cache
Deployed as VM (VMware, Hyper-V, KVM)

AWS Storage Gateway Overview:

On-Premises                                         AWS Cloud
┌─────────────────────────────────────┐    ┌────────────────────────────────┐
│                                     │    │                                │
│ File Shares ──NFS/SMB──▶ File GW ───┼────┼──▶ S3 (excl. Glacier) ──▶ Glacier
│                         (cache)     │    │                                │
│                                     │    │                                │
│ App Server ──iSCSI────▶ Volume GW ──┼────┼──▶ S3 ──▶ EBS Snapshots       │
│                         (cache)     │    │                                │
│                                     │    │                                │
│ Backup App ──iSCSI VTL─▶ Tape GW ───┼────┼──▶ S3 (Tape Library) ──▶ Glacier
│                         (cache)     │    │                                │
└─────────────────────────────────────┘    └────────────────────────────────┘
              Encryption in Transit (Internet or Direct Connect)

Gateway Type	Protocol	Backend	Use Case
S3 File Gateway	NFS, SMB	S3 (Standard, IA, One Zone, Intelligent)	Access S3 via file protocols, cached locally
FSx File Gateway	SMB	FSx for Windows	Low-latency access to FSx from on-prem
Volume Gateway	iSCSI	S3 + EBS snapshots	Block storage backed by S3
Tape Gateway	iSCSI (VTL)	S3 + Glacier	Replace physical tapes with cloud

S3 File Gateway:

On-Premises                              AWS Cloud
┌────────────────────┐          ┌─────────────────────────────────┐
│ App Server         │          │  S3 Standard / IA / One Zone-IA │
│      │             │   HTTPS  │  S3 Intelligent-Tiering         │
│      ▼             │          │           │                     │
│ S3 File Gateway ───┼──────────┼──────────▶│                     │
│  (NFS or SMB)      │          │           ▼ (Lifecycle Policy)  │
│  (local cache)     │          │      S3 Glacier                 │
└────────────────────┘          └─────────────────────────────────┘

Volume Gateway:

On-Premises                              AWS Cloud
┌────────────────────┐          ┌─────────────────────────────────┐
│ App Server         │   HTTPS  │                                 │
│      │             │          │      S3 Bucket                  │
│      ▼ iSCSI       │          │         │                       │
│ Volume Gateway ────┼──────────┼────────▶│                       │
│  (local cache)     │          │         ▼                       │
└────────────────────┘          │    EBS Snapshots                │
                                └─────────────────────────────────┘

Tape Gateway:

On-Premises                              AWS Cloud
┌────────────────────────┐      ┌─────────────────────────────────┐
│ Backup Server          │      │                                 │
│      │ iSCSI           │HTTPS │  Virtual Tapes ──▶ Archived Tapes
│      ▼                 │      │  (S3)              (Glacier)    │
│ ┌──────────┬─────────┐ │      │                                 │
│ │Media     │Tape     │ │      │                                 │
│ │Changer   │Drive    │─┼──────┼──────────────────────────────▶  │
│ └──────────┴─────────┘ │      │                                 │
│     Tape Gateway       │      │                                 │
└────────────────────────┘      └─────────────────────────────────┘

Volume Gateway Modes:

Cached: Most recent data cached locally, full dataset in S3
Stored: Full dataset on-prem, scheduled backups to S3

⚠️ Exam traps:

“Expose S3 to on-premises via NFS/SMB” → S3 File Gateway
“S3 File Gateway + reduce costs + Glacier” → S3 Lifecycle Policy (File Gateway can’t write to Glacier directly)

AWS Transfer Family

Managed file transfers into/out of S3 or EFS using FTP protocols
Protocols: FTP (VPC only), FTPS, SFTP
Integrates with AD, LDAP, Okta, Cognito for authentication
Use cases: file sharing, public datasets, CRM/ERP integration

⚠️ Exam trap: TLS is NOT a supported protocol

TLS = encryption layer, not a file transfer protocol
FTPS uses TLS for encryption, but “TLS” alone is not a transfer protocol
Only FTP, FTPS, SFTP are valid answers

AWS Transfer Family:

                     MS Active Directory / LDAP
                              │ authenticate
                              ▼
Users ──▶ Route 53 ──▶ ┌─────────────────────┐      ┌─────────────┐
(FTP      (optional)   │ Transfer for SFTP   │      │             │
client)                │ Transfer for FTPS   │──────▶  Amazon S3  │
                       │ Transfer for FTP    │      │             │
                       │ (VPC only)          │      │  Amazon EFS │
                       └─────────────────────┘      └─────────────┘
                                    │
                               IAM Role

AWS DataSync

Move large data to and from AWS
On-prem → AWS: needs agent (NFS, SMB, HDFS, S3 API)
AWS → AWS: no agent needed
Destinations: S3 (all classes incl. Glacier), EFS, FSx (all types)
Schedule: hourly, daily, weekly
Preserves file permissions and metadata
Up to 10 Gbps per agent task
Snowcone has DataSync agent pre-installed

DataSync: On-Premises to AWS

On-Premises                                    AWS Region
┌────────────────────────┐          ┌─────────────────────────────────┐
│                        │          │  AWS Storage Resources          │
│ NFS/SMB Server         │   TLS    │  ┌─────────┬─────────┬────────┐ │
│      │                 │          │  │S3       │S3 IA    │S3      │ │
│      ▼ NFS/SMB         │          │  │Standard │         │One Zone│ │
│ DataSync Agent ────────┼──────────┼─▶├─────────┼─────────┼────────┤ │
│                        │          │  │S3       │S3       │S3 Deep │ │
│ (or Snowcone with      │          │  │Intell.  │Glacier  │Archive │ │
│  agent pre-installed)  │          │  ├─────────┴─────────┴────────┤ │
└────────────────────────┘          │  │    EFS    │    FSx         │ │
                                    │  └───────────┴────────────────┘ │
                                    └─────────────────────────────────┘

DataSync: AWS to AWS (no agent needed)

┌─────────────┐                              ┌─────────────┐
│  Amazon S3  │                              │  Amazon S3  │
├─────────────┤         ┌──────────┐         ├─────────────┤
│  Amazon EFS │◀───────▶│ DataSync │◀───────▶│  Amazon EFS │
├─────────────┤         └──────────┘         ├─────────────┤
│  Amazon FSx │    (copy data + metadata)    │  Amazon FSx │
└─────────────┘                              └─────────────┘

⚠️ Exam traps:

“Scheduled data sync on-prem to AWS” → DataSync (not Storage Gateway)
“Bad/limited network connectivity” → Snowcone (has DataSync agent pre-installed, ship physically)
“Migrate S3 → EFS” or “S3 → FSx” → DataSync (AWS-to-AWS, no agent)
“Migrate on-prem NFS to S3” → DataSync (with agent)
EBS is NOT a DataSync destination (block storage, not file/object)
Don’t confuse with Transfer Family (FTP access) or Snowball (physical, on-prem only)

DataSync vs Storage Gateway

Aspect	DataSync	Storage Gateway
Purpose	One-time or scheduled migration/sync	Ongoing hybrid access (bridge)
Direction	On-prem → AWS, AWS → AWS	On-prem ↔ AWS (bidirectional access)
Use case	“Move data to cloud”	“Extend on-prem storage to cloud”
Agent	Yes (on-prem), No (AWS-to-AWS)	VM appliance (always)
Caching	No local cache	Yes, local cache for low latency
Protocol	NFS, SMB, HDFS, S3 API	NFS, SMB, iSCSI

⚠️ Exam trap decision:

“Migrate 50TB from on-prem NFS to S3” → DataSync
“On-prem apps need ongoing access to S3 via NFS” → Storage Gateway
“Sync files weekly to S3 for backup” → DataSync
“Replace tape backup with cloud” → Tape Gateway

Storage Services Comparison

AWS Storage Cloud Native Options:

┌─────────────────┬─────────────────┬─────────────────┐
│     Block       │      File       │     Object      │
├─────────────────┼─────────────────┼─────────────────┤
│  Amazon EBS     │  Amazon EFS     │  Amazon S3      │
│  EC2 Instance   │  Amazon FSx     │  Amazon Glacier │
│  Store          │                 │                 │
└─────────────────┴─────────────────┴─────────────────┘

Service	Type	Use Case
S3	Object	General object storage
S3 Glacier	Object	Archival
EBS	Block	Single EC2 instance storage
Instance Store	Block	Ephemeral, high IOPS
EFS	File (NFS)	Linux shared file system
FSx Windows	File (SMB)	Windows shared file system
FSx Lustre	File (POSIX)	HPC, ML, Linux
FSx NetApp ONTAP	File (multi)	Multi-OS, NAS migration
FSx OpenZFS	File (NFS)	ZFS migration
Storage Gateway	Hybrid	On-prem ↔ AWS bridge
Transfer Family	Hybrid	FTP/SFTP to S3/EFS
DataSync	Migration	Scheduled sync to AWS
Snow Family	Migration	Physical data transfer

Migration & Hybrid Services Decision Tree

Scenario	Answer
Large data (>1 week to transfer), limited bandwidth	Snowball Edge
Large data + need to process at edge	Snowball Edge Compute Optimized
Small data + limited connectivity + edge compute	Snowcone
One-time migration from on-prem NFS/SMB to S3	DataSync (with agent)
Scheduled/recurring sync from on-prem to AWS	DataSync
Migrate S3 → EFS or S3 → FSx	DataSync (no agent)
On-prem apps need ongoing NFS/SMB access to S3	S3 File Gateway
On-prem apps need low-latency access to FSx Windows	FSx File Gateway
On-prem apps need iSCSI block storage backed by S3	Volume Gateway
Replace physical tape backup with cloud	Tape Gateway
External users upload via FTP/SFTP to S3	Transfer Family
Import data to Glacier	Snowball → S3 → Lifecycle Policy

⚠️ Key differentiators:

DataSync = move/sync data (migration tool)
Storage Gateway = access data (hybrid bridge)
Transfer Family = FTP access (external users)
Snowball = physical transfer (bad network)

AWS OpsHub is a software to manage Snow Family Devices.

🎯 MASTER SUMMARY: Storage Migration & Hybrid Services Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Network Transfer Time = Decision Point

The fundamental question: How long to transfer over network?

1 week → Consider Snowball (physical transfer)
< 1 week → Use DataSync or direct transfer

100 TB at 1 Gbps = 12 days. Snowball wins.

Principle 2: Migration vs Ongoing Access

Two fundamentally different needs:

Migration = move data once (or scheduled sync) → DataSync, Snowball
Ongoing access = continuous hybrid access → Storage Gateway

“Move to cloud” = migration. “Extend to cloud” = hybrid access.

Principle 3: Protocol Determines Service

What protocol do your applications use?

Protocol	AWS Service
NFS/SMB (file)	Storage Gateway, DataSync, EFS, FSx
iSCSI (block)	Volume Gateway, Tape Gateway
FTP/SFTP/FTPS	Transfer Family
S3 API (object)	Direct S3, DataSync

Principle 4: FSx = Third-Party File Systems on AWS

FSx is NOT a generic file system — it’s specific file system software:

Windows = Windows Server + AD + SMB
Lustre = HPC/ML + Linux + S3 integration
NetApp ONTAP = Multi-protocol (NFS + SMB + iSCSI)
OpenZFS = ZFS migration + NFS only

Principle 5: Snowball → S3 First, Then Glacier

Snowball cannot import directly to Glacier.

Import to S3 → Lifecycle Policy → Glacier

This is a common exam trap.

Principle 6: Edge Computing = Process Where Data Lives

Snowball Edge isn’t just for transfer — it’s for computing at the edge:

Run EC2 instances locally
Run Lambda functions locally
Process data before shipping to AWS

Principle 7: Gateways Have Local Cache

Storage Gateway provides low-latency local access with cloud backing:

Frequently accessed data cached locally
Full dataset in AWS (S3, FSx)
Transparent to applications

Principle 8: DataSync Preserves Metadata

DataSync keeps file permissions and metadata intact:

Timestamps, ownership, permissions
Good for migration where these matter
Scheduled: hourly, daily, weekly

Part 2: Decision Tree (Follow Keywords → Find Answer)

Step 1: Physical or Network Transfer?

                    Network quality?
                          │
            ┌─────────────┴─────────────┐
            ▼                           ▼
    Good/Adequate                  Limited/Bad
    (< 1 week)                     (> 1 week)
            │                           │
            ▼                           ▼
   DataSync / Direct              Snow Family
                                        │
                              ┌─────────┴─────────┐
                              ▼                   ▼
                         Small data          Large data
                         (< 14 TB)           (up to 80 TB)
                              │                   │
                              ▼                   ▼
                          Snowcone         Snowball Edge

Step 2: Migration or Ongoing Access?

                    What's the need?
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
   One-time           Scheduled        Ongoing
   Migration           Sync            Access
        │                 │                 │
        ▼                 ▼                 ▼
   DataSync           DataSync        Storage Gateway
   Snowball                                 │
                              ┌─────────────┴─────────────┐
                              ▼                           ▼
                         File access              Block storage
                         (NFS/SMB)                  (iSCSI)
                              │                           │
                              ▼                           ▼
                    S3 File Gateway              Volume Gateway
                    FSx File Gateway              Tape Gateway

Step 3: Feature-Based Decision Table

If question mentions…	Answer is…
“> 1 week transfer” / “limited bandwidth”	Snowball Edge
“limited network + small data”	Snowcone
“process data at edge” / “edge computing”	Snowball Edge Compute Optimized
“migrate to S3/EFS/FSx” (one-time)	DataSync
“scheduled sync” / “weekly backup to S3”	DataSync
“on-prem NFS access to S3” (ongoing)	S3 File Gateway
“on-prem access to FSx Windows”	FSx File Gateway
“on-prem iSCSI block storage”	Volume Gateway
“replace tape backup”	Tape Gateway
“FTP/SFTP access to S3”	Transfer Family
“Windows file share + AD”	FSx for Windows
“HPC / ML + Linux + S3”	FSx for Lustre
“multi-protocol (NFS + SMB + iSCSI)”	FSx for NetApp ONTAP
“migrate ZFS workloads”	FSx for OpenZFS
“import to Glacier”	Snowball → S3 → Lifecycle
“short-term HPC, cost optimized”	FSx Lustre Scratch
“data must persist + HPC”	FSx Lustre Persistent
“point-in-time cloning”	FSx NetApp ONTAP or OpenZFS

The “NOT” Rules (Eliminate Wrong Answers Fast)

Statement	Why It’s Wrong
Snowball imports to Glacier directly	Must go to S3 first, then Lifecycle to Glacier
DataSync for ongoing hybrid access	DataSync = migration/sync, not ongoing access
Storage Gateway for one-time migration	Storage Gateway = ongoing access, not migration tool
Transfer Family for internal apps	Transfer Family = FTP for external users
TLS as Transfer Family protocol	TLS is encryption, not a protocol — use SFTP/FTPS
FSx Lustre for Windows apps	Lustre = Linux/POSIX only
FSx for Windows without AD	Windows File Server integrates with AD
OpenZFS for SMB access	OpenZFS = NFS only
DataSync to EBS	EBS not supported — only S3, EFS, FSx
Snowcone for 50 TB	Snowcone max = 14 TB — use Snowball Edge

⚠️ Exam trap — DataSync over Direct Connect (NFS → EFS):

DataSync can transfer directly NFS → EFS (no S3 intermediary needed)
Over DX: use private VIF + PrivateLink interface VPC endpoint (EFS is a VPC resource)
❌ “DataSync → S3 → Lambda → EFS” = unnecessary complexity, NOT operationally efficient
❌ “VPC peering endpoint” ≠ PrivateLink. DataSync uses PrivateLink, not VPC peering
❌ “Public VIF for EFS” — public VIF = public services (S3, DynamoDB); EFS needs private VIF

The “CANNOT” List

Cannot…	Instead…
Import Snowball to Glacier directly	Snowball → S3 → Lifecycle → Glacier
Use DataSync to EBS	Use EBS snapshots or block-level replication
Use Transfer Family with TLS protocol	Use SFTP (SSH-based) or FTPS (FTP over TLS)
Access FSx Lustre from Windows	Use FSx for Windows or NetApp ONTAP
Use Snowcone for >14 TB	Use Snowball Edge (up to 80 TB)
Run EC2 on Snowcone	Limited compute — use Snowball Edge Compute
Use FTP without VPC (Transfer Family)	FTP = VPC only; use SFTP/FTPS for public

Part 3: Scenario Pattern Recognition

Pattern: “Large data + limited/bad network”

Keywords: petabytes, limited bandwidth, weeks to transfer, offline, remote location

Answer: Snowball Edge

Why: Physical transfer bypasses network limitations. > 1 week transfer → Snowball.

Pattern: “Process data at remote location”

Keywords: edge computing, process before upload, ML at edge, trucks, ships, mining

Answer: Snowball Edge Compute Optimized

Why: Run EC2/Lambda locally, process data, then ship to AWS.

Pattern: “Small data + limited connectivity”

Keywords: small dataset, remote, portable, <14 TB

Answer: Snowcone

Why: Smallest Snow device (8-14 TB), portable, has DataSync agent pre-installed.

Pattern: “Migrate on-prem NFS/SMB to S3”

Keywords: migrate, one-time transfer, move to cloud, NFS to S3

Answer: DataSync (with agent)

Why: DataSync = migration tool. Preserves metadata. Scheduled or one-time.

Pattern: “Scheduled backup to S3/EFS/FSx”

Keywords: weekly sync, daily backup, recurring, scheduled

Answer: DataSync

Why: DataSync supports hourly/daily/weekly schedules.

Pattern: “On-prem apps need ongoing access to S3”

Keywords: hybrid, continuous access, extend storage, NFS/SMB to S3

Answer: S3 File Gateway

Why: Storage Gateway = ongoing hybrid access with local cache.

Pattern: “Replace physical tape backup”

Keywords: tape, VTL, virtual tape library, backup to cloud

Answer: Tape Gateway

Why: Presents virtual tapes via iSCSI, stores in S3/Glacier.

Pattern: “External users upload via FTP”

Keywords: FTP, SFTP, file transfer, external partners

Answer: AWS Transfer Family

Why: Managed FTP/SFTP/FTPS service to S3 or EFS.

Keywords: Windows, SMB, NTFS, Active Directory, DFS

Answer: FSx for Windows File Server

Why: Fully managed Windows file system with AD integration.

Pattern: “HPC / ML with Linux cluster”

Keywords: HPC, high-performance computing, ML training, Linux, Lustre

Answer: FSx for Lustre

Why: Parallel file system, S3 integration, sub-ms latency, 100s GB/s.

Pattern: “Read from S3 as file system for HPC”

Keywords: S3 integration, lazy load, HPC reads from S3

Answer: FSx for Lustre

Why: Can mount S3 as file system, lazy-load data on access.

Pattern: “Multi-protocol (NFS + SMB + iSCSI)”

Keywords: NFS and SMB, multi-OS, migrate NAS

Answer: FSx for NetApp ONTAP

Why: Only FSx that supports all three protocols.

Pattern: “Migrate ZFS workloads to AWS”

Keywords: ZFS, OpenZFS, migrate ZFS

Answer: FSx for OpenZFS

Why: Managed OpenZFS, NFS protocol, snapshots, cloning.

Pattern: “Import data to Glacier”

Keywords: Snowball to Glacier, archive imported data

Answer: Snowball → S3 → S3 Lifecycle Policy → Glacier

Why: Snowball cannot import directly to Glacier.

Pattern: “Short-term HPC job, optimize cost”

Keywords: temporary processing, cost optimized, short-term

Answer: FSx for Lustre (Scratch)

Why: Scratch = no replication, 6x faster, cheaper. Data lost if fails.

Pattern: “Long-term processing, data must survive”

Keywords: persistent, data durability, long-term HPC

Answer: FSx for Lustre (Persistent)

Why: Replicated within AZ, data survives failures.

Part 4: Quick Reference Tables

Snow Family Comparison

Device	Storage	Compute	Use Case
Snowcone	8-14 TB	2 vCPU, 4 GB	Small data, portable, DataSync agent
Snowball Edge Storage	80 TB	40 vCPU, 80 GB	Large data + some compute
Snowball Edge Compute	42-80 TB	104 vCPU, 416 GB	Heavy processing at edge
~~Snowmobile~~	~~100 PB~~	-	Discontinued

FSx Comparison

FSx Type	Protocol	OS	Best For
Windows	SMB, NTFS	Windows	Windows apps, AD integration
Lustre	POSIX	Linux	HPC, ML, S3 integration
NetApp ONTAP	NFS, SMB, iSCSI	Multi-OS	NAS migration, multi-protocol
OpenZFS	NFS	Linux/Unix	ZFS migration, cloning

Storage Gateway Types

Gateway Type	Protocol	Backend	Use Case
S3 File Gateway	NFS, SMB	S3	File access to S3
FSx File Gateway	SMB	FSx Windows	Low-latency FSx access
Volume Gateway	iSCSI	S3 + EBS	Block storage to S3
Tape Gateway	iSCSI (VTL)	S3 + Glacier	Replace physical tapes

DataSync vs Storage Gateway

Aspect	DataSync	Storage Gateway
Purpose	Migration / Sync	Ongoing hybrid access
Use case	“Move to cloud”	“Extend to cloud”
Caching	No	Yes (low latency)
Direction	One-way or scheduled	Bidirectional access
Agent	Yes (on-prem)	VM appliance

Transfer Family Protocols

Protocol	Encryption	Access
SFTP	SSH-based	Public or VPC
FTPS	TLS-based	Public or VPC
FTP	None	VPC only

⚠️ TLS is NOT a protocol — it’s encryption layer used BY FTPS

Migration Service Selection

Scenario	Service
> 1 week transfer time	Snowball Edge
< 14 TB + limited network	Snowcone
One-time NFS/SMB → S3 migration	DataSync
Scheduled sync to S3/EFS/FSx	DataSync
S3 → EFS or S3 → FSx migration	DataSync (no agent)
Ongoing NFS/SMB access to S3	S3 File Gateway
FTP/SFTP uploads to S3	Transfer Family
Replace tape backup	Tape Gateway
iSCSI block storage to S3	Volume Gateway

Key Numbers to Remember

Item	Value
Snowcone storage	8 TB HDD / 14 TB SSD
Snowball Edge Storage	80 TB
Snowball Edge Compute	42 TB HDD / 28 TB NVMe
DataSync throughput	Up to 10 Gbps per agent
FSx Lustre throughput	100s GB/s
FSx OpenZFS IOPS	1,000,000 IOPS
Volume Gateway cache	Local + S3

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“> 1 week transfer” / “bad network”	Snowball Edge
“small data + remote”	Snowcone
“edge computing” / “process at edge”	Snowball Edge Compute
“migrate NFS/SMB to S3”	DataSync
“scheduled sync to AWS”	DataSync
“S3 → EFS” or “S3 → FSx”	DataSync (no agent)
“on-prem NFS access to S3” (ongoing)	S3 File Gateway
“on-prem access to FSx Windows”	FSx File Gateway
“iSCSI block storage to cloud”	Volume Gateway
“replace tape backup”	Tape Gateway
“FTP/SFTP to S3”	Transfer Family
“TLS protocol”	❌ Wrong — use SFTP/FTPS
“Windows file share + AD”	FSx for Windows
“HPC + Linux + S3”	FSx for Lustre
“multi-protocol (NFS+SMB+iSCSI)”	FSx for NetApp ONTAP
“migrate ZFS”	FSx for OpenZFS
“Snowball → Glacier”	S3 first → Lifecycle
“short-term HPC, cheap”	FSx Lustre Scratch
“HPC data must persist”	FSx Lustre Persistent
“point-in-time cloning”	FSx NetApp ONTAP or OpenZFS
“Snowmobile”	Discontinued — use multiple Snowball

Part 6: Elimination Checklist

When stuck between options, eliminate systematically:

□ Is network transfer > 1 week?
  → Yes = Snow Family (Snowball/Snowcone)
  → No = DataSync or direct transfer

□ Is it MIGRATION or ONGOING ACCESS?
  → Migration = DataSync, Snowball
  → Ongoing = Storage Gateway

□ What PROTOCOL do apps use?
  → NFS/SMB (file) = File Gateway, DataSync, FSx
  → iSCSI (block) = Volume Gateway, Tape Gateway
  → FTP/SFTP = Transfer Family

□ Is it WINDOWS or LINUX?
  → Windows + AD = FSx for Windows
  → Linux + HPC = FSx for Lustre
  → Both = FSx for NetApp ONTAP

□ Do they need EDGE COMPUTING?
  → Yes + small = Snowcone (limited)
  → Yes + heavy = Snowball Edge Compute

□ Is data going to GLACIER?
  → Via Snowball = S3 first → Lifecycle → Glacier
  → Direct = S3 Lifecycle Policy

□ Is it SCHEDULED SYNC?
  → Yes = DataSync (hourly/daily/weekly)
  → No = One-time DataSync or Snowball

□ Do they need LOCAL CACHE?
  → Yes = Storage Gateway
  → No = DataSync or direct

□ Is it for EXTERNAL USERS?
  → FTP/SFTP = Transfer Family
  → Internal apps = Storage Gateway

🏆 The Golden Rules

> 1 week transfer = Snowball — physical beats network
Migration = DataSync, Ongoing = Storage Gateway — different tools for different needs
Snowball can’t import to Glacier directly — S3 first, then Lifecycle
TLS is NOT a Transfer Family protocol — SFTP/FTPS use TLS, but “TLS” isn’t a protocol
FTP = VPC only — SFTP/FTPS can be public
FSx for Windows = AD integration — always mention AD for Windows file shares
FSx for Lustre = HPC + Linux + S3 — the HPC file system
FSx NetApp ONTAP = multi-protocol — only one with NFS + SMB + iSCSI
OpenZFS = NFS only — no SMB, no iSCSI
Lustre Scratch = temporary, fast, cheap — data lost on failure
Lustre Persistent = durable — replicated within AZ
Snowcone max = 14 TB — use Snowball Edge for larger
Snowmobile is discontinued — use multiple Snowball Edge instead
DataSync preserves metadata — timestamps, permissions, ownership
Storage Gateway has local cache — low-latency hybrid access
EBS is NOT a DataSync destination — only S3, EFS, FSx

Databases:

        ┌─────┐ ┌─────┐ ┌─────┐
        │User │ │User │ │User │
        └──┬──┘ └──┬──┘ └──┬──┘
           │       │       │
           └───────┼───────┘
                   ▼
        ┌─────────────────────┐
        │    Application      │
        └──────────┬──────────┘
                   │ Read/Write
                   ▼
            ┌────────────┐
            │ Amazon RDS │
            └──────┬─────┘
                   │
                   ▼
     <────────── Storage ──────────>

RDS (Relational Database Service) is a distributed relational database service (SQL).

Supported Engines
PostgreSQL, MySQL, MariaDB, Oracle, MS SQL Server, IBM DB2, Aurora

                      ┌───────────────────┐
                      │    Application    │
                      └────────┬─┬────────┘
                        writes ↓ ↑ reads
                           ┌────┴────┐
                           │    M    │  ← Master (writes + reads)
                           └────┬────┘
                    ASYNC       │       ASYNC
               replication ←────┴────→ replication
              ┌─────┴─────┐       ┌─────┴─────┐
              │     R     │       │     R     │  ← Read Replicas
              └─────┬─────┘       └─────┬─────┘
                    ↑ reads             ↑ reads

Read Replicas: Up to 15 replicas, ASYNC replication (eventually consistent), can be cross-AZ/cross-Region.

Can be promoted to standalone DB (breaks replication)
App must update connection string to use replicas

⚠️ Exam trap: ASYNC = eventual consistency = replication lag

“Users don’t see updated data right away” → this is expected Read Replica behavior
Need strong consistency? Read from master (not replica)
“Analytics/reporting slowing down production” → offload to Read Replica

Read Replica Network Cost:

┌─────────────────────────────┐      ┌────────────────────────────┐
│ Same Region / Different AZ  │      │        Cross-Region        │
│  us-east-1a    us-east-1b   │      │ us-east-1a    eu-west-1b   │
│   ┌───┐   ASYNC  ┌───┐      │  vs  │   ┌───┐   ASYNC  ┌───┐     │
│   │ M │ ───────→ │ R │      │      │   │ M │ ───────→ │ R │     │
│   └───┘          └───┘      │      │   └───┘          └───┘     │
│    FREE (same region)       │      │     $$$ (cross-region)     │
└─────────────────────────────┘      └────────────────────────────┘

⚠️ Exam trap: Same region replication = FREE, Cross-region = costs $$$

RDS Cross-Region DR Strategy:

Cross-region Read Replica + Multi-AZ on the replica = HA DR
On disaster: promote replica → becomes read/write Master
RDS has no “Multi-Region option” (that’s Aurora Global DB)

RDS Multi-AZ (Disaster Recovery):

              ┌───────────────────┐
              │    Application    │
              └────────┬─┬────────┘
                writes ↓ ↑ reads
    ┌─────────────────────────────────────┐
    │   One DNS name – automatic failover │
    └──────────────────┬──────────────────┘
                       │
         ┌─────────────┴─────────────┐
         ▼                           │
    ┌─────────┐       SYNC      ┌────┴────┐
    │    M    │ ──────────────→ │    S    │
    └─────────┘   replication   └─────────┘
    Master (AZ A)              Standby (AZ B)

SYNC replication (data always consistent)
Automatic failover on: AZ loss, network/instance/storage failure
Not for scaling — standby cannot serve reads
Read Replicas can also be Multi-AZ for DR

⚠️ Exam trap: Multi-AZ = High Availability (failover), Read Replicas = Scalability (read performance)

 READ REPLICA (ASYNC)                MULTI-AZ (SYNC)
 ┌───┐         ┌───┐               ┌───┐         ┌───┐
 │ M │ ──────→ │ R │               │ M │ ──────→ │ S │
 └───┘  async  └───┘               └───┘  sync   └───┘
       (lag OK)                         (no lag!)
"eventually consistent"            "always consistent"

Read Replica = ASYNC (eventual consistency), Multi-AZ = SYNC (always consistent)
Multi-AZ standby cannot serve reads — only for failover
Read scaling? → Read Replicas or ElastiCache
Multi-AZ: same connection string (DNS auto-failover)
Read Replicas: different endpoint (app must update connection string)
Watch for “NOT” questions — they flip the logic!

Single-AZ → Multi-AZ Migration (zero downtime):

┌─────────┐   SYNC replication   ┌─────────┐
│    M    │ ──────────────────→  │    S    │
└────┬────┘                      └─────────┘
     │                           Standby DB
     ↓ snapshot
┌─────────-┐
│    DB    |
| snapshot │ ← restore to new AZ
└─────────-┘

Click “modify” on DB (no downtime)
Snapshot taken automatically
New standby restored from snapshot in different AZ
SYNC replication established

Use Case: Reporting without impacting production

┌────────────────┐               ┌────────────────┐
│   Production   │               │    Reporting   │
│   Application  │               │   Application  │
└─────-┬─┬───────┘               └───────┬────────┘
       ↓ ↑                               ↑ reads
   writes/reads                          │
        │                                │
   ┌────┴────┐   ASYNC replication  ┌----┴────┐
   │    M    │ ───────────────────→ │    R    │
   └─────────┘                      └─────────┘
   RDS Master                      Read Replica

Read replicas for SELECT only (not INSERT/UPDATE/DELETE)
Production app unaffected by reporting workload

RDS Storage Auto Scaling:

Automatically increases storage when free space <10%
Must set Maximum Storage Threshold
Triggers after: low-storage 5+ min AND 6+ hours since last change
Supports all RDS engines

Why RDS over EC2-hosted DB?

RDS Manages For You	You Still Control
OS patching	Database schema
Automated backups (Point in Time Restore)	Application queries
Monitoring dashboards	Security groups
Hardware provisioning	Parameter groups
Read replicas & Multi-AZ setup
Storage scaling (EBS-backed)

⚠️ Exam trap: You can’t SSH into RDS instances (except RDS Custom for Oracle/SQL Server).

RDS Custom (Oracle & MS SQL Server only):

        ┌───────┐
        │ User  │
        └───┬───┘
   apply    │    SSH / SSM
   customs  │
            ▼
    ┌───────────────┐
    │ EC2 Instance  │
    ├───────────────┤
    │  Amazon RDS   │  Automation Mode: DISABLED
    └───────────────┘

RDS	RDS Custom
AWS manages OS + DB	Full admin access to OS + DB
No SSH	SSH / SSM Session Manager
No custom patches	Install patches, configure settings

⚠️ Disable Automation Mode before customizing. Take snapshot first!

⚠️ Exam trap: “Full customization of Oracle/SQL Server” + “benefit from AWS services” = RDS Custom

Amazon Aurora is AWS cloud optimized (5x faster than MySQL, 3x faster than PostgreSQL on RDS) an enterprise-class relational database, proprietary technology from AWS (not open source). Automatically growing storage. Costs more than RDS on 20%, but it’s more efficient, Amazon Aurora helps to reduce your database costs by reducing unnecessary input/output (I/O) operations, while ensuring that your database resources remain reliable and available.
Amazon Aurora replicates six copies of your data across three Availability Zones and continuously backs up your data to Amazon S3.

Feature	Details
Engines	PostgreSQL, MySQL (compatible drivers)
Performance	5x MySQL, 3x PostgreSQL on RDS
Storage	Auto-grows 10GB → 128TB
Replicas	Up to 15, <10ms replica lag
Failover	Instantaneous (HA native)
Cost	20% more than RDS, but more efficient
Durability	6 copies across 3 AZs, continuous backup to S3

⚠️ Exam trap: “OLTP” + “auto-scaling storage” + “maximum replicas” = Aurora

OLTP = relational DB (not DynamoDB/NoSQL)
Aurora storage auto-scales (RDS requires manual provisioning)
Both RDS and Aurora have 15 replicas, but Aurora’s storage is self-healing/auto-expanding

Aurora High Availability:

       AZ 1           AZ 2           AZ 3
    ┌───┐ ┌───┐    ┌───┐ ┌───┐    ┌───┐ ┌───┐
    │ M │ │ R │    │ R │ │ R │    │ R │ │ R │
    └─┬─┘ └─┬─┘    └─┬─┘ └─┬─┘    └─┬─┘ └─┬─┘
      ↓W    ↑R       ↑R    ↑R       ↑R    ↑R
    ══════════════════════════════════════════
         Shared Storage Volume (100s of volumes)
         Replication + Self Healing + Auto Expanding
    ══════════════════════════════════════════

6 copies across 3 AZs: 4/6 needed for writes, 3/6 for reads
Self-healing with peer-to-peer replication
Failover <30 seconds (automatic)
Up to 15 Read Replicas, Cross-Region supported

Aurora Quorum (failure tolerance):

Scenario	Writes	Reads
1 AZ down (2 copies lost)	✅ Works (4 remaining)	✅ Works
3 copies lost	✅ Works	✅ Works
4+ copies lost	❌ Write outage	❌ Read outage

Aurora DB Cluster Endpoints:

                    ┌──────────┐
                    │  Client  │
                    └────┬─────┘
           ┌─────────────┴──────--───────┐
           ▼                             ▼
┌─────────────────────┐    ┌────────────────────────────┐
│  Writer Endpoint    │    │     Reader Endpoint        │
│ (points to master)  │    │ (load balances to replicas)│
└──────────┬──────────┘    └─────────────┬──────────────┘
           │                    ┌────────┼────────┐
           ▼                    ▼        ▼        ▼
       ┌───────┐           ┌───────┐ ┌───────┐ ┌───────┐
       │   M   │←──────────│   R   │ │   R   │ │   R   │ ← Auto Scaling
       └───┬───┘           └───┬───┘ └───┬───┘ └───┬───┘
           ↓W                  ↑R        ↑R        ↑R
    ════════════════════════════════════════════════════
           Shared Storage (10GB → 128TB auto-expanding)
    ════════════════════════════════════════════════════

Writer Endpoint: Always points to master (auto-updates on failover)
Reader Endpoint: Load balances across all replicas
Replicas auto-scale based on demand

⚠️ Exam trap — “Separate reads from writes” in Aurora:

✅ Set up Aurora Read Replica + point app to Reader Endpoint (built-in, shared storage, no data copy)
❌ “Provision another Aurora database as read replica” — unnecessary separate DB, extra cost, not how Aurora works
❌ “Read from Multi-AZ standby” — standby CANNOT serve reads (failover only)
❌ “Activate read-through caching” — Aurora has no built-in read-through cache (use ElastiCache/DAX externally)

Aurora Features:

Automatic fail-over (<30s)
Automated patching with zero downtime
Backtrack: Restore to any point in time without using backups (rewind DB in-place)

Aurora Replicas Auto Scaling:

                         ┌──────────┐
                         │  Client  │
                         └────┬─────┘
              ┌───────────────┴───────────────┐
              ▼                               ▼ Many Requests
   ┌─────────────────────┐       ┌────────────────────────────┐
   │  Writer Endpoint    │       │     Reader Endpoint        │
   └──────────┬──────────┘       └─────────────┬──────────────┘
              │                       ┌────────┼────────┐
              ▼                       ▼        ▼        ▼
          ┌───────┐              ┌───────┐ ┌───────┐ ┌───────┐
          │   M   │              │   R   │ │   R   │ │   R   │ ← Added by
          └───┬───┘   CPU↑  CPU↑ └───┬───┘ └───┬───┘ └───┬───┘   Auto Scaling
              ↓W                     ↑R        ↑R        ↑R
    ════════════════════════════════════════════════════════════
           Shared Storage (10GB → 128TB auto-expanding)
    ════════════════════════════════════════════════════════════

Replicas scale based on CPU/connections metrics
Reader Endpoint auto-extends to new replicas

Aurora Custom Endpoints:

                            ┌──────────┐
                            │  Client  │
                            └────┬─────┘
         ┌───────────────--──────┼──────────────────────┐
         ▼                       ▼                      ▼
┌─────────────────┐     ┌─────────────────┐    ┌──────────────────┐
│ Writer Endpoint │     │ Reader Endpoint │    │ Custom Endpoint  │
└────────┬────────┘     └────────┬────────┘    │(Analytical Query)│
         │                       │             └────────┬─────────┘
         ▼                       ▼                      ▼
     ┌───────┐          ┌───────┐ ┌───────┐    ┌───────┐ ┌───────┐
     │   M   │          │   R   │ │   R   │    │   R   │ │   R   │
     └───┬───┘          └───────┘ └───────┘    └───────┘ └───────┘
         ↓W             db.r3.large (small)    db.r5.2xlarge (large)
    ════════════════════════════════════════════════════════════════
                      Shared Storage Volume
    ════════════════════════════════════════════════════════════════

Define subset of replicas for specific workloads (e.g., analytics on larger instances)
Reader Endpoint generally not used after defining Custom Endpoints

Aurora Serverless:

                    ┌──────────┐
                    │  Client  │
                    └────┬─────┘
                         │
           ┌─────────────────────────────┐
           │       Proxy Fleet           │
           │    (managed by Aurora)      │
           └──────────────┬──────────────┘
                ┌────┬────┼────┬────┐
                ▼    ▼    ▼    ▼    ▼
              ┌──┐ ┌──┐ ┌──┐ ┌──┐ ┌──┐  ← Auto-scales
              │DB│ │DB│ │DB│ │DB│ │DB│    based on load
              └──┘ └──┘ └──┘ └──┘ └──┘
    ════════════════════════════════════════════
              Shared Storage Volume
    ════════════════════════════════════════════

Auto-scales compute based on actual usage (no capacity planning)
Use case: Infrequent, intermittent, unpredictable workloads
Pay per second — can be more cost-effective

⚠️ Exam trap: “Dev/test environment” + “unused most of time” + “minimize costs” = Aurora Serverless

Aurora Global Database:

┌──────────────────────────────────────────┐
│        us-east-1 (PRIMARY REGION)        │
│  ┌─────────────┐       ┌─────────────┐   │
│  │ Application │       │   Aurora    │   │
│  │ Read/Write  │ ────→ │   Primary   │   │
│  └─────────────┘       └──────┬──────┘   │
└───────────────────────────────┼──────────┘
                                │ replication
                                │ (<1 second)
┌───────────────────────────────┼──────────┐
│        eu-west-1 (SECONDARY REGION)      │
│  ┌─────────────┐       ┌──────┴──────┐   │
│  │ Application │       │   Aurora    │   │
│  │  Read Only  │ ←──── │  Secondary  │   │
│  └─────────────┘       └─────────────┘   │
└──────────────────────────────────────────┘

Feature	Details
Primary Region	1 (read/write)
Secondary Regions	Up to 5 (read-only)
Replicas per Region	Up to 16
Replication Lag	<1 second
DR Promotion RTO	<1 minute

Use case: Global reads with low latency, disaster recovery

⚠️ Exam trap: “Cross-region Disaster Recovery” or “replica in another region” = Aurora Global Database

RDS Multi-AZ = same region only (AZ ≠ Region!)
Aurora Read Replicas = same region (within cluster)
RDS Read Replicas can be cross-region but not designed for easy Disaster Recovery.

Aurora Machine Learning:

              ┌─────────────┐
              │ Application │
              └──────┬──────┘
       SQL query     │     query results
  (recommendations?) │  (red shirt, blue...)
                     ▼
              ┌─────────────┐
              │   Aurora    │
              └──────┬──────┘
          data       │        predictions
   (user profile,    │     (red shirt,
    shopping...)     │      blue pants...)
            ┌────────┴────────┐
            ▼                 ▼
     ┌────────────┐    ┌─────────────┐
     │ SageMaker  │    │ Comprehend  │
     │ (any ML)   │    │ (sentiment) │
     └────────────┘    └─────────────┘

Add ML predictions to apps via SQL (no ML experience needed)
Use cases: fraud detection, ads targeting, sentiment analysis, product recommendations

Babelfish for Aurora PostgreSQL:

 ┌─────────────────┐            ┌─────────────────┐
 │   Application   │            │   Application   │
 │    SQL Server   │            │    PostgreSQL   │
 │  Client Driver  │            │      Driver     │
 └────────┬────────┘            └────────┬────────┘
          │ T-SQL                        │ PL/pgSQL
          │                              │
          │    ┌────────────────────┐    │
          │    │ Aurora PostgreSQL  │    │
          │    ├─────────┬──────────┤    │
          └───→│Babelfish│PostgreSQL│←───┘
               └─────────┴──────────┘
                         ↑
                      migrate
                         │
                 ┌───────────────┐
                 │     MS SQL    │
                 │     Server    │
                 └───────────────┘

Aurora PostgreSQL understands T-SQL (MS SQL Server commands)
Migrate from SQL Server with no/little code changes (same client driver)
Use with AWS SCT + DMS for migration

RDS & Aurora Backups:

	RDS	Aurora
Automated Backups	1-35 days (0 = disable retention)	1-35 days (cannot disable)
Transaction Logs	Every 5 min	Continuous
Point-in-Time Recovery	Up to 5 min ago	Within retention window
Manual Snapshots (On-Demand)	Unlimited retention	Unlimited retention

Automated: Daily full backup + transaction logs
Restore creates a NEW database (not in-place)

⚠️ Exam traps:

Stopped RDS still pays for storage. For long stops: snapshot → delete DB → restore DB from snapshot later.
“Long-term backup” + “audit/compliance” = Manual Snapshots (unlimited retention)
Automated backups = max 35 days only!

RDS & Aurora Restore Options:

Restoring backup/snapshot → creates NEW database
MySQL RDS from S3: On-prem backup → S3 → restore to new RDS MySQL
Aurora MySQL from S3: On-prem backup (Percona XtraBackup) → S3 → new Aurora cluster

Aurora Database Cloning:

Creation of a new Aurora cluster from existing one

CLONING (instant)                    SNAPSHOT/RESTORE (slow)
┌─────────────┐                      ┌─────────────┐
│  Production │                      │  Production │
└──────┬──────┘                      └──────┬──────┘
       │ shared storage                     │ snapshot
       ▼ (no copy!)                         ▼ (copy all!)
┌─────────────┐                      ┌─────────────┐
│    Clone    │                      │   New DB    │
└─────────────┘                      └─────────────┘
  Only new writes                      Full duplicate
  use extra storage                    storage cost

Copy-on-write: Initially shares same data volume (instant, no copying)
Storage allocated only when changes are made
Use case: Create staging/test from production without impacting prod
Faster & cheaper than snapshot/restore

⚠️ Exam trap: “Need production data ASAP” + “read/write tests” = Aurora Cloning (instant)

Snapshot/Restore = too slow (copies all data)
Read Replica = can’t do write tests

RDS & Aurora Security:

Security Layer	Details
At-rest encryption	AWS KMS, must enable at launch time
In-flight encryption	TLS by default
IAM Authentication	IAM roles instead of username/password
Security Groups	Control network access
Audit Logs	Send to CloudWatch Logs

⚠️ Exam traps:

Master not encrypted → replicas cannot be encrypted
To encrypt unencrypted DB: snapshot → restore as encrypted
“Many developers need DB access” → IAM Database Authentication (no individual DB users)
- Works for: MySQL, PostgreSQL, MariaDB (RDS & Aurora)
- NOT supported: Oracle, SQL Server, DB2
- IAM users do NOT have DB access by default!

⚠️ Exam trap - “End-to-end security for data-in-transit to RDS”:

✅ SSL/TLS = encrypts data between EC2 and RDS (the correct answer for in-transit)
❌ IAM authentication = solves who can connect, not encryption of data flowing
❌ NACL/SG blocking SSH = SSH is for server admin, not app ↔ DB traffic (port 5432/3306)
❌ KMS = encryption at rest, not in transit

Amazon RDS Proxy:

┌─────────────────────────────────────────────────┐
│                      VPC                        │
│  ┌───────────────────────────────────────────┐  │
│  │         Lambda functions                  │  │
│  │    λ    λ    λ    λ    λ    ...           │  │
│  └───────────────────┬───────────────────────┘  │
│                      │ IAM Authentication       │
│  ┌───────────────────┼───────────────────────┐  │
│  │           Private subnet                  │  │
│  │                   ▼                       │  │
│  │           ┌─────────────┐                 │  │
│  │           │  RDS Proxy  │ ← Connection    │  │
│  │           └──────┬──────┘   Pooling       │  │
│  │                  ▼                        │  │
│  │           ┌─────────────┐                 │  │
│  │           │ RDS / Aurora│                 │  │
│  │           └─────────────┘                 │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘

Feature	Details
Connection Pooling	Reduces DB stress (CPU, RAM, connections)
Failover	Reduces RDS/Aurora failover by 66%
Supports	RDS (MySQL, PostgreSQL, MariaDB, MS SQL), Aurora
Security	IAM Auth, credentials in Secrets Manager
Access	Never publicly accessible (VPC only)

No code changes required
Great for Lambda (many short-lived connections)

⚠️ Exam trap: “Many EC2s” + “slow reconnection after failover” = RDS Proxy

Reduces failover time by 66%
Connection pooling keeps apps connected

RDS & Aurora Lambda Integration:

Two Different Ways to Connect Lambda with RDS/Aurora:

Aspect	RDS Event Notifications	Invoke Lambda from RDS/Aurora
Setup	AWS Console (RDS settings)	Inside the database (SQL)
Access to DB Data	❌ No (metadata only)	✅ Yes (full data access)
Trigger Source	DB instance events	Data changes (triggers)
Use Case	DB state changes (failover, snapshot)	React to data (new row, update)
Engines	All RDS engines	Aurora MySQL, Aurora PostgreSQL

RDS Event Notifications:

Configured in AWS Console (RDS → Event Subscriptions)
Sends to SNS → can trigger Lambda
Events: DB instance state, snapshots, parameter groups, security groups
No access to actual data — only metadata/events
Example: “DB stopped”, “Snapshot created”, “Failover completed”

RDS Event Notifications Flow:
RDS Instance ──► RDS Event ──► SNS Topic ──► Lambda
(state change)   Subscription               (no DB data)

Invoke Lambda from RDS/Aurora:

Configured inside the database using SQL
Aurora only (MySQL/PostgreSQL stored procedures)
Lambda function has access to DB data passed as parameters
Use case: trigger external processing when data changes

Invoke Lambda from Aurora:
App ──► Aurora ──► Trigger/Stored Proc ──► Lambda ──► External Service
        (data)    (calls Lambda)           (has data)  (notifications, etc)

⚠️ Exam trap: “React to DB failover/snapshot events” → RDS Event Notifications (via SNS). ⚠️ Exam trap: “Process data when inserted/updated” → Invoke Lambda from Aurora (configured in DB).

Amazon ElastiCache managed Redis or Memcached in-memory databases with high performance and low latency.

Reduce load off databases for read-intensive workloads
Makes applications stateless (session storage)
AWS manages: patching, setup, monitoring, backups, failure recovery

⚠️ Exam trap: Using ElastiCache requires heavy application code changes

ElastiCache - DB Cache Pattern:

                      ┌───────────────────┐
                      │   ElastiCache     │
           Cache hit  │                   │
         ←────────────│   ┌───────────┐   │
         ─────────────│──→│   Cache   │   │
                      │   └───────────┘   │
┌─────────────┐       └─────────┬─────────┘
│ Application │                 │ Cache miss
└──────┬──────┘                 │
       │                        ▼
       │ Read from DB    ┌─────────────┐
       └────────────────→│  Amazon RDS │
       ←─────────────────└─────────────┘
       │
       └──→ Write to cache

App queries cache first → if miss, read from RDS → store in cache
Relieves load on RDS
Must implement cache invalidation strategy

ElastiCache - User Session Store:

        ┌──────┐
        │ User │
        └──┬───┘
           │
     ┌─────┴─────┬────────────┐
     ▼           ▼            ▼
┌─────────┐ ┌─────────┐  ┌─────────┐
│   App   │ │   App   │  │   App   │
└────┬────┘ └────┬────┘  └────┬────┘
     │           │            │
     │  Write    │  Retrieve  │
     │  session  │  session   │
     │           │            │
     └───────────┴────────────┘
                 │
                 ▼
          ┌─────────────┐
          │ ElastiCache │
          └─────────────┘

User logs in → app writes session to ElastiCache
User hits different app instance → retrieves session from cache
User stays logged in across all instances (stateless app)

⚠️ Exam trap: “Users keep logging out” + ALB + Auto Scaling = ElastiCache for sessions

NOT Sticky Sessions (uneven load)
NOT RDS (too slow for sessions)
NOT EBS (can’t share across instances)

ElastiCache - Redis vs Memcached:

Feature	Redis	Memcached
High Availability	Multi-AZ with Auto-Failover	❌ No HA
Read Replicas	✅ Yes (scale reads)	❌ No
Persistence	✅ AOF (durable)	❌ Non-persistent
Backup/Restore	✅ Yes	Serverless only
Data Structures	Sets, Sorted Sets	Simple key-value
Architecture	Replication	Sharding (multi-node)
Threading	Single-threaded	Multi-threaded

    REDIS (HA + Durability)          MEMCACHED (Sharding)
    ┌───┐  Replication  ┌───┐        ┌───┐    +    ┌───┐
    │ R │ ────────────→ │ R │        │ M │ shards  │ M │
    └───┘               └───┘        └───┘         └───┘

⚠️ Exam trap:

Need HA/persistence/backups → Redis
Need simple sharding/multi-threaded → Memcached

ElastiCache - Security:

┌───────────────────────┐
│  EC2 Security Group   │
│       ┌─────┐         │
│       │ EC2 │ Client  │
│       └──┬──┘         │
└──────────┼────────────┘
           │ SSL encryption
           │ Redis AUTH
           ▼
┌───────────────────────┐
│  Redis Security Group │
│        ┌─────┐        │
│        │Redis│        │
│        └─────┘        │
└───────────────────────┘

Engine	Authentication	Notes
Redis	IAM Authentication	For Redis only
Redis	Redis AUTH	Password/token at cluster creation
Redis	SSL/TLS	In-flight encryption
Memcached	SASL-based	Advanced auth

⚠️ Exam trap:

“Use IAM identities to access Redis” → IAM Authentication (Redis only!)
IAM policies on ElastiCache = AWS API-level only (create/delete clusters, not data access)
Security Groups = network access, not user identity

ElastiCache - Caching Patterns:

   LAZY LOADING                         WRITE THROUGH
   ┌─────────┐                          ┌─────────┐
   │   App   │                          │   App   │
   └────┬────┘                          └────┬────┘
        │ 1. Cache hit? ←───┐                │
        ▼                   │                │ 1. Write to DB
   ┌─────────┐         ┌────┴────┐      ┌────┴────┐
   │  Cache  │         │  Cache  │      │   RDS   │
   └────┬────┘         └─────────┘      └────┬────┘
        │ 2. Miss                            │ 2. Write to cache
        ▼                                    ▼ 
   ┌─────────┐                          ┌─────────┐
   │   RDS   │                          │  Cache  │
   └────┬────┘                          └─────────┘
        │ 3. Write to cache
        ▼
   ┌─────────┐
   │  Cache  │
   └─────────┘

Pattern	Description	Trade-off
Lazy Loading	Cache on read (miss → fetch → cache)	Data can become stale
Write Through	Cache on write (DB + cache updated together)	No stale data, more writes
Session Store	Store temp session data with TTL	Sessions auto-expire

ElastiCache - Redis Use Case (Gaming Leaderboards):

                ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐       ┌─────────────────────┐
                  ElastiCache Redis         │ Real-time           │
  ┌─────────┐   │                   │       │ Leaderboard         │
  │ Clients │──→   ┌─────┐ ┌─────┐   ──────→│ ┌─────────────┐     │
  └─────────┘   │  │Redis│ │Redis│  │       │ │ 1. Player A │     │
                   └─────┘ └─────┘          │ │ 2. Player B │     │
                │  ┌─────┐          │       │ │ 3. Player C │     │
                   │Redis│                  │ └─────────────┘     │
                │  └─────┘          │       └─────────────────────┘
                └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘

Redis Sorted Sets: Guarantee uniqueness + element ordering
New score → auto-ranked in real-time → inserted in correct order

⚠️ Exam trap: “Real-time leaderboard” computationally complex without Redis Sorted Sets (not Memcached — no sorted sets!)

DynamoDB fully managed highly available (with replication across 3 AZ), NoSQL (key/value) database that scales to massive workloads and single-digit millisecond latency.
DynamoDB Accelerator - DAX fully managed in-memory cache for DynamoDB (x10 performance improvement). Like ElastiCache, but only for DynamoDB.

Amazon Redshift is a fully managed OLAP (Online Analytical Processing) data warehouse for PB-scale analytics.

Based on PostgreSQL, but NOT for OLTP (transactional workloads)
10x better performance than other data warehouses
Columnar storage + parallel query engine
Two modes: Provisioned cluster or Serverless
SQL interface, integrates with QuickSight, Tableau

Redshift Cluster Architecture:

Query (JDBC/ODBC)
       │
       ▼
┌─────────────────────────────┐
│   Amazon Redshift Cluster   │
│  ┌───────────────────────┐  │
│  │     Leader Node       │  │  ← Query planning, results aggregation
│  └───────────┬───────────┘  │
│       ┌──────┼──────┐       │
│       ▼      ▼      ▼       │
│  ┌────────┐┌────────┐┌────────┐
│  │Compute ││Compute ││Compute │  ← Perform queries, send to leader
│  │ Node   ││ Node   ││ Node   │
│  └────────┘└────────┘└────────┘
└─────────────────────────────┘

Redshift Modes:

Mode	Description	Cost Model
Provisioned	Choose instance types upfront	Reserved instances for savings
Serverless	Auto-scales, no management	Pay per use

Loading Data into Redshift:

Method	Description	Best For
Kinesis Firehose	Stream → S3 → Redshift (COPY)	Real-time streaming
S3 COPY command	Bulk load from S3	Large batch imports
EC2 JDBC driver	Insert via application	Small batches (less efficient)

⚠️ Exam trap: “Load data into Redshift” → Large inserts are MUCH better. Use S3 COPY or Firehose, not row-by-row JDBC inserts.

Enhanced VPC Routing:

Forces COPY and UNLOAD traffic through your VPC
Without: S3 → Redshift traffic goes over public internet
With: S3 → Redshift traffic stays in VPC (more secure, use VPC features like security groups, NACLs, VPC endpoints)

⚠️ Exam trap: “COPY/UNLOAD through VPC” or “Redshift traffic stays in VPC” → Enhanced VPC Routing. “Improved VPC Routing” doesn’t exist!

Redshift Spectrum:

Query data directly in S3 without loading it
Requires Redshift cluster to start query (not serverless like Athena)
Query submitted to thousands of Spectrum nodes
Useful for querying cold/historical data without loading

Redshift Spectrum:
Query ──► Redshift Cluster ──► Spectrum Nodes (1000s) ──► S3 Bucket
          (Leader + Compute)      (query S3 directly)

Redshift vs Athena:

Aspect	Redshift	Athena
Type	Data warehouse	Query service
Infrastructure	Cluster (Provisioned/Serverless)	Fully serverless
Best for	Complex joins, aggregations, dashboards	Ad-hoc queries on S3
Performance	Faster (indexes, columnar)	Slower (full S3 scan)
Data location	Loaded into Redshift	Stays in S3
Cost model	Cluster time	$5/TB scanned

⚠️ Exam trap: “Faster joins/aggregations” or “BI dashboards on data warehouse” → Redshift. “Serverless ad-hoc S3 queries” → Athena.

Redshift Snapshots & DR:

Multi-AZ mode available for some clusters
Snapshots = point-in-time backups stored in S3 (incremental)
Restore creates a new cluster

Snapshot Type	Frequency	Retention
Automated	Every 8 hours or 5 GB	1-35 days
Manual	On-demand	Until you delete

Cross-Region DR:

Configure Redshift to automatically copy snapshots to another region
Restore snapshot in DR region → new cluster

Cross-Region Snapshot Copy:
Region A                          Region B
┌─────────────┐   Auto Copy   ┌──────────────┐
│  Redshift   │──────────────►│   Snapshot   │
│  Cluster    │               │   (copied)   │
└──────┬──────┘               └──────┬───────┘
       │ Snapshot                    │ Restore
       ▼                             ▼
┌─────────────┐               ┌──────────────┐
│  Snapshot   │               │  New Cluster │
│  (original) │               │  (DR region) │
└─────────────┘               └──────────────┘

⚠️ Exam trap: “Redshift cross-region DR” → Cross-region snapshot copy. Restore snapshot in target region.

⚠️ Exam trap: “Redshift Global cluster” → Doesn’t exist! Aurora has Global Database, Redshift uses cross-region snapshot copy instead.

⚠️ Exam trap: Redshift vs Athena vs EMR:

Redshift = data warehouse, complex analytics, BI dashboards
Athena = serverless ad-hoc queries on S3
EMR = big data processing (Hadoop, Spark), ML training

Amazon Elastic MapReduce (EMR) = managed Hadoop clusters for big data processing.

Clusters of hundreds of EC2 instances
Bundled tools: Apache Spark, HBase, Presto, Flink
Auto-scaling + Spot Instances integration
Use cases: data processing, ML, web indexing, big data analytics

EMR Node Types:

Node Type	Purpose	Lifecycle
Master Node	Manage cluster, coordinate, health	Long-running
Core Node	Run tasks + store data	Long-running
Task Node	Run tasks only (no storage)	Usually Spot

EMR Purchasing Options:

Option	Use Case
On-Demand	Reliable, won’t be terminated
Reserved	Cost savings (min 1 year), auto-used if available
Spot	Cheaper, can be terminated (for Task Nodes)

Cluster Types:

Long-running cluster = always on, for continuous workloads
Transient cluster = temporary, terminate after job completes (cost-effective)

⚠️ Exam trap: “Cost-optimize EMR” → Use Spot for Task Nodes (can lose them), Reserved/On-Demand for Master/Core (need reliability).

⚠️ Exam trap: EMR vs Athena vs Redshift:

EMR = big data processing (Hadoop/Spark jobs, ML training)
Athena = serverless ad-hoc SQL queries on S3
Redshift = data warehouse, complex analytics, BI dashboards

Amazon Athena serverless SQL query service to analyze data stored in Amazon S3.

Uses standard SQL (built on Presto)
Supports: CSV, JSON, ORC, Avro, Parquet
Pricing: $5.00 per TB scanned
Commonly paired with QuickSight for dashboards

Athena Use Cases:

Business intelligence / analytics / reporting
Analyze VPC Flow Logs, ELB Logs, CloudTrail trails
Ad-hoc queries on S3 data

Athena Architecture:
Users ──► S3 Bucket ──► Amazon Athena ──► Amazon QuickSight
          (data)       (Query & Analyze) (Reporting & Dashboards)

Athena Federated Query:

Query data across multiple sources (not just S3)
Uses Data Source Connectors (Lambda functions)
Sources: CloudWatch Logs, DynamoDB, RDS, Aurora, Redshift, ElastiCache, DocumentDB, on-premises DBs, HBase on EMR
Results stored back in S3

Federated Query:
                        ┌─► S3 Bucket
                        ├─► ElastiCache
                        ├─► DocumentDB
Amazon Athena ◄─────────┼─► DynamoDB        ◄── Lambda (Data Source Connector)
                        ├─► Redshift
                        ├─► Aurora/RDS
                        ├─► HBase in EMR
                        └─► On-Premises DB

Athena Performance Optimization:

Optimization	Why
Columnar format (Parquet/ORC)	Scan less data → lower cost
Glue ETL	Convert CSV/JSON to Parquet/ORC
Compress data	Smaller scans (gzip, snappy, lz4, zstd)
Partition datasets	Query specific partitions only
Large files (> 128 MB)	Minimize overhead

S3 Partitioning Example:

s3://bucket/table/year=1991/month=1/day=1/data.parquet
                   └── partition columns as virtual columns

⚠️ Exam trap: “Analyze data in S3 using serverless SQL” → Athena. Not Redshift (requires provisioning), not EMR (requires cluster).

⚠️ Exam trap: “Reduce Athena costs” → Parquet/ORC (columnar = scan less). Glue can convert formats.

⚠️ Exam trap: “Query multiple data sources with SQL” → Athena Federated Query (uses Lambda connectors).

Amazon QuickSight = serverless ML-powered BI service for interactive dashboards.

Fast, auto-scalable, embeddable, per-session pricing
SPICE engine = in-memory computation (when data imported into QuickSight)
Enterprise edition: Column-Level Security (CLS)

⚠️ Exam trap: “Column-level security” services:

QuickSight Enterprise = CLS for dashboard data
Lake Formation = CLS + Row-level for data lake (Athena, Redshift, EMR)
Redshift = column-level grants via SQL Know which service provides CLS for the given scenario!

QuickSight Use Cases:

Business analytics & visualizations
Ad-hoc analysis
Business insights from data

QuickSight Data Sources:

Source Type	Examples
AWS Services	RDS, Aurora, Redshift, Athena, S3, OpenSearch, Timestream
On-Premises	Databases via JDBC (Teradata)
SaaS	Salesforce, Jira
File Imports	XLSX, CSV, JSON, TSV, ELF/CLF (log formats)

QuickSight Integrations:
┌─────────────────────────────────────────────────────────┐
│                   Amazon QuickSight                     │
└────────────────────────┬────────────────────────────────┘
                         │
    ┌────────────────────┼────────────────────────────────┐
    │                    │                                │
    ▼                    ▼                                ▼
AWS Services      On-Premises/SaaS                  File Imports
RDS, Aurora,      Teradata (JDBC),                  XLSX, CSV,
Redshift, Athena, Salesforce, Jira                  JSON, TSV,
S3, OpenSearch,                                     Log files
Timestream

QuickSight Users & Sharing:

Users (Standard) and Groups (Enterprise) exist only within QuickSight (NOT IAM!)
Dashboard = read-only snapshot of an analysis
- Preserves: filtering, parameters, controls, sort
- Must publish before sharing
- Users who see dashboard can see underlying data

⚠️ Exam trap: “BI dashboards from multiple AWS sources” → QuickSight. Integrates with Athena, Redshift, RDS, S3, etc.

⚠️ Exam trap: QuickSight users/groups ≠ IAM. They are QuickSight-specific identities.

Amazon OpenSearch Service (successor to ElasticSearch) = managed search and analytics engine.

Search any field, even partial matches (unlike DynamoDB: primary key/indexes only)
Common pattern: complement to another database (DynamoDB for storage, OpenSearch for search)
Two modes: Managed cluster or Serverless
Does NOT natively support SQL (plugin available)
Ingestion: Kinesis Data Firehose, AWS IoT, CloudWatch Logs
Security: Cognito & IAM, KMS encryption, TLS
Visualization: OpenSearch Dashboards

OpenSearch Ingestion Patterns:

Source	Path	Latency
Kinesis Data Streams	→ Firehose → Lambda (transform) → OpenSearch	Near real-time
Kinesis Data Streams	→ Lambda → OpenSearch	Real-time
CloudWatch Logs	→ Subscription Filter → Lambda → OpenSearch	Real-time
CloudWatch Logs	→ Subscription Filter → Firehose → OpenSearch	Near real-time
DynamoDB	→ DynamoDB Streams → Lambda → OpenSearch	Real-time

DynamoDB + OpenSearch Pattern:

CRUD ──► DynamoDB ──► DynamoDB Stream ──► Lambda ──► OpenSearch
              │                                          │
              │                                          │
              └─── API to retrieve items ◄── App ──► API to search items ───┘

DynamoDB = store data (fast key-value access)
OpenSearch = search data (full-text, partial match, any field)
App uses both APIs: retrieve by key from DynamoDB, search from OpenSearch

⚠️ Exam trap: “Search any field” or “partial text match” or “full-text search” → OpenSearch. DynamoDB only queries by primary key or indexes.

⚠️ Exam trap: “Real-time” vs “Near real-time” ingestion:

Real-time = Lambda directly (Kinesis → Lambda → OpenSearch)
Near real-time = via Firehose (buffers data, slight delay)

DocumentDB is a document database (NoSQL) service that supports MongoDB workloads, proprietary fully managed and highly available across 3 AZ. Automatically grows and scales to workloads with millions of requests per second.

⚠️ Exam trap: DynamoDB vs DocumentDB — “MongoDB migration” doesn’t always mean DocumentDB!

Requirement	Answer
MongoDB compatibility + no code changes	DocumentDB
Serverless + Global Tables + no server management	DynamoDB

Key decision point:

“Migrate MongoDB” + “no code changes” + “same drivers” → DocumentDB (MongoDB-compatible API)
“Migrate MongoDB” + “serverless” + “global” → DynamoDB (different API, requires code changes)

DocumentDB requires provisioned instances (not truly serverless), but preserves MongoDB compatibility.

Amazon Neptune is a fully managed graph database. Usually for graph data sets like social network, knowledge graphs (Wikipedia), recommendation engines and fraud detection. Highly available across 3 AZ, with up to 15 replicas.

⚠️ Exam trap: Graph queries = Neptune. Classic example:

“Friends of Mike who liked posts by friends of Mike” → multi-hop relationship traversal
RDS/DynamoDB would require complex JOINs or multiple queries
Neptune handles this natively with graph traversals (Gremlin/SPARQL)

Neptune Use Cases: Social networks, recommendation engines, fraud detection, knowledge graphs

Amazon Timestream fully managed, fast, scalable, serverless time series database. Built-in time series analytics functions (helps you identify patterns in your data in near real-time).

Timestream Use Cases: IoT sensors (temperature, humidity, pressure), application metrics, DevOps monitoring, industrial telemetry

⚠️ Exam trap: “Thousands of sensors” + “readings per second” + “fast analytics” = Timestream

NOT S3 (storage only, no built-in time-series analytics)
NOT DynamoDB (not optimized for time-series queries)
Timestream = 1000x faster, 1/10th cost vs relational DBs for time-series

Amazon Keyspaces (for Apache Cassandra) is a fully managed, serverless, Cassandra-compatible database. Highly available and scalable with no servers to manage.

⚠️ Exam trap: Cassandra migration → Keyspaces (not DynamoDB!)

Keyspaces = Cassandra Query Language (CQL) compatible → no code changes
DynamoDB = different API → requires code rewrite

Amazon QLDB (Quantum Ledger Database) is a fully managed, serverless, highly available book recording financial transactions. (Unlike Amazon Managed Blockchain there is no decentralization component).

Amazon Managed Blockchain is managed blockchain service to join public blockchain networks or create your own scalable private network, without the need for a trusted, central authority. Compatible with Hyperledger Fabric and Ethereum.

AWS Glue = fully serverless managed ETL (Extract, Transform, Load) service.

Prepare and transform data for analytics
Used by: Athena, Redshift, EMR

Glue Components:

Component	Purpose
Glue Data Crawler	Scans data sources, writes metadata to Data Catalog
Glue Data Catalog	Central metadata repository (databases, tables)
Glue ETL Jobs	Transform and load data
Glue Job Bookmarks	Prevent re-processing old data
Glue DataBrew	Clean/normalize data with pre-built transformations
Glue Studio	GUI to create, run, monitor ETL jobs
Glue Streaming ETL	Real-time ETL (Spark Streaming) for Kinesis, Kafka, MSK

Glue Data Catalog Architecture:

Data Sources                     Glue Data Catalog              Consumers
┌─────────────┐                 ┌─────────────────┐
│ Amazon S3   │                 │   Databases     │            ┌─────────────┐
│ Amazon RDS  │──► Glue ───────►│   Tables        │──────────►│ Athena      │
│ DynamoDB    │   Crawler       │   (Metadata)    │            │ Redshift    │
│ JDBC        │  (writes        └─────────────────┘            │ EMR         │
└─────────────┘   metadata)            ▲                       └─────────────┘
                                       │
                                Glue ETL Jobs

Glue ETL Pattern — Convert to Parquet:

S3 Put ──► Input S3 ──► Glue ETL ──► Output S3 ──► Athena
           (CSV)        (transform)   (Parquet)    (analyze)
              │
              ▼
        S3 Event ──► Lambda ──► Trigger Glue Job
                (or EventBridge)

Common Glue Use Cases:

Convert CSV → Parquet for Athena cost savings
ETL to Redshift from S3/RDS sources
Data Catalog for unified metadata across analytics tools

⚠️ Exam trap: “Convert CSV to Parquet for Athena” → Glue ETL. Glue can be triggered by S3 events via Lambda or EventBridge.

⚠️ Exam trap: “Centralized metadata catalog” or “data discovery” → Glue Data Catalog. Used by Athena, Redshift Spectrum, EMR.

⚠️ Exam trap: “Streaming ETL” → Glue Streaming ETL (Spark Streaming). Compatible with Kinesis, Kafka, MSK.

⚠️ Exam trap: “Prevent re-processing old data” or “incremental ETL” → Glue Job Bookmarks. Tracks what’s already processed, only processes new data.

AWS Lake Formation = fully managed service to set up a data lake in days.

Data lake = central place for ALL data (structured + unstructured) for analytics
Built on top of AWS Glue
Automates: collecting, cleansing, moving, cataloging, de-duplicating (ML Transforms)

Lake Formation Features:

Feature	Description
Source Blueprints	Pre-built connectors for S3, RDS, Aurora, on-premises DBs
ETL and Data Prep	Transform and prepare data
Data Catalog	Central metadata repository
Fine-grained Access Control	Row-level and Column-level security
Security Settings	Centralized permissions management

Lake Formation Architecture:

Data Sources                    Lake Formation              Consumers
┌─────────────┐              ┌────────────────────┐
│ Amazon S3   │              │ • Source Crawlers  │       ┌─────────────┐
│ RDS/Aurora  │──► ingest ──►│ • ETL & Data Prep  │──────►│ Athena      │
│ On-Premises │              │ • Data Catalog     │       │ Redshift    │
│ (SQL/NoSQL) │              │ • Access Control   │       │ EMR/Spark   │
└─────────────┘              │   (row/column)     │       └─────────────┘
                             └─────────┬──────────┘              │
                                       │                         ▼
                               ┌───────▼───────┐              Users
                               │   Data Lake   │
                               │ (stored in S3)│
                               └───────────────┘

Lake Formation vs Glue:

Aspect	Glue	Lake Formation
Focus	ETL + Data Catalog	Complete data lake management
Security	Basic IAM	Fine-grained (row/column-level)
Scope	ETL jobs	End-to-end data lake
Built on	-	AWS Glue

⚠️ Exam trap: “Data lake” + “fine-grained access control” or “row/column-level security” → Lake Formation. Not just Glue (Glue = ETL only, no fine-grained permissions).

⚠️ Exam trap: “Centralized permissions for data lake” → Lake Formation. Manages access across Athena, Redshift, EMR in one place.

Amazon MSK (Managed Streaming for Apache Kafka) = fully managed Apache Kafka on AWS.

Alternative to Kinesis Data Streams for streaming
Creates & manages Kafka brokers + Zookeeper nodes
Deploy in VPC, Multi-AZ (up to 3 for HA)
Data stored on EBS volumes (as long as you want)
Automatic recovery from Kafka failures
MSK Serverless = auto-provisions resources, scales compute & storage

MSK Architecture:

Producers               MSK Cluster                    Consumers
(Kinesis, IoT,    ┌─────────────────────────┐
 RDS, etc.)       │     ┌──────────┐        │    ┌──────────────────┐
       │          │     │ Broker 1 │◄──┐    │    │ Kinesis Data     │
       ▼          │     └──────────┘   │    │    │ Analytics (Flink)│
┌──────────┐      │          │    replication  ──►│ Glue Streaming   │
│ Your     │──────┼──►┌──────────┐   │    │    │ Lambda           │
│ Code     │      │   │ Broker 2 │◄──┤    │    │ EC2/ECS/EKS      │
└──────────┘      │   └──────────┘   │    │    └──────────────────┘
                  │          │       │    │
                  │     ┌──────────┐ │    │
                  │     │ Broker 3 │◄┘    │
                  │     └──────────┘      │
                  └─────────────────────────┘

Kinesis Data Streams vs Amazon MSK:

Aspect	Kinesis Data Streams	Amazon MSK
Message size	1 MB limit	1 MB default, configurable to 10 MB
Data structure	Shards	Kafka Topics with Partitions
Scaling	Shard splitting & merging	Can only add partitions
In-flight encryption	TLS only	PLAINTEXT or TLS
At-rest encryption	KMS	KMS
Retention	1-365 days	Unlimited (EBS)

MSK Consumers:

Kinesis Data Analytics for Apache Flink (now: Amazon Managed Service for Apache Flink)
AWS Glue Streaming ETL (Spark Streaming)
Lambda
Applications on EC2, ECS, EKS

Amazon Managed Service for Apache Flink (previously: Kinesis Data Analytics for Apache Flink)

Flink = framework for processing data streams (Java, Scala, SQL)
Reads from: Kinesis Data Streams or Amazon MSK
⚠️ Does NOT read from Firehose!
Managed cluster, auto-scaling, parallel computation
Backups via checkpoints and snapshots

Flink Sources:
Kinesis Data Streams ──┐
                       ├──► Amazon Managed Service ──► (destinations)
Amazon MSK ────────────┘    for Apache Flink

⚠️ Exam trap: “Kafka on AWS” or “migrate Kafka” → Amazon MSK. Kinesis is AWS-native, MSK is Kafka-compatible.

⚠️ Exam trap: “Message > 1 MB” streaming → MSK (configurable up to 10 MB). Kinesis = hard 1 MB limit.

⚠️ Exam trap: “Apache Flink” or “real-time stream analytics” → Amazon Managed Service for Apache Flink. Note: Flink does NOT read from Firehose!

⚠️ Exam trap: Kinesis vs MSK decision:

Kinesis = AWS-native, simpler, auto-scaling shards
MSK = Kafka-compatible, existing Kafka apps, larger messages, unlimited retention

Big Data Ingestion Pipeline (Serverless)

Requirements: Real-time collection → Transform → SQL query → Reports in S3 → Warehouse + Dashboards

IoT Devices
    │
    ▼ (real-time)
┌─────────────────┐     Every 1 min    ┌─────────────┐
│ Kinesis Data    │───────────────────►│ Ingestion   │
│ Streams         │                    │ Bucket (S3) │
└─────────────────┘                    └──────┬──────┘
         │                                    │
    ┌────┴────┐                          (optional)
    ▼         │                               │
┌─────────┐   │                          ┌────▼────┐
│ Kinesis │   │                          │   SQS   │
│ Firehose│◄──┘                          └────┬────┘
└────┬────┘                                   │
     │                                   ┌────▼────┐    Pull data
     │ Lambda                            │ Lambda  │◄──────────┐
     │ (transform)                       └────┬────┘           │
     ▼                                        │                │
                                         ┌────▼────┐     ┌─────┴─────┐
                                         │ Athena  │────►│ Reporting │
                                         │ (SQL)   │     │ Bucket    │
                                         └─────────┘     └─────┬─────┘
                                                               │
                                              ┌────────────────┼────────────────┐
                                              ▼                ▼                ▼
                                        QuickSight      Redshift          (other BI)
                                        (dashboards)    Serverless

Pipeline Components:

Stage	Service	Why
Ingest real-time	Kinesis Data Streams	Real-time data collection
Buffer + Deliver	Kinesis Firehose	Near real-time delivery to S3 (1 min)
Transform	Lambda + Firehose	Data transformations during delivery
Store	S3 (Ingestion Bucket)	Durable storage, triggers events
Decouple	SQS (optional)	Buffer between S3 and processing
Query	Athena	Serverless SQL on S3
Output	S3 (Reporting Bucket)	Query results storage
Visualize	QuickSight / Redshift	Dashboards and analytics

Key Points:

Fully serverless = no servers to manage
Firehose = near real-time (1 minute buffer), NOT real-time
S3 → SQS → Lambda or S3 → Lambda directly (both work)
Athena results stored in S3 automatically
Reporting bucket feeds QuickSight or Redshift for dashboards

⚠️ Exam trap: “Serverless” + “real-time ingestion” + “SQL query” + “dashboards” → This full pipeline. Know each component’s role!

Database Selection Guide:

Need	Use
SQL, ACID, complex queries	RDS / Aurora
Key-value, massive scale, single-digit ms	DynamoDB
Key-value, large objects (100MB+ files)	S3
Caching, sessions, leaderboards	ElastiCache (Redis/Memcached)
Data warehouse, analytics (PB scale)	Redshift
Graph relationships (social, fraud)	Neptune
Time series (IoT, metrics)	Timestream
Document store (MongoDB compatible)	DocumentDB
Immutable ledger (financial)	QLDB
ETL / Data catalog	Glue

⚠️ Exam traps:

S3 is a key-value store! Key = object path, Value = object content. For large files (100MB+), use S3 not DynamoDB (400KB item limit)
Aurora Backtrack is Aurora-only (not RDS), works in-place (no new DB created)
DAX vs ElastiCache: DAX = DynamoDB only, no code changes; ElastiCache = any DB, requires code changes

⚠️ Exam trap - “In-memory + caching SQL queries + HIPAA”:

✅ ElastiCache = caches results from any DB including SQL (RDS/Aurora), HIPAA eligible
❌ DAX = DynamoDB only → “SQL queries” keyword instantly eliminates DAX (DynamoDB is NoSQL)
❌ DynamoDB / DocumentDB = not in-memory caches

🎯 MASTER SUMMARY: Database Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: RDBMS vs NoSQL = Structure vs Flexibility

RDBMS (RDS/Aurora): Structured data, complex joins, ACID transactions, fixed schema NoSQL (DynamoDB): Flexible schema, massive scale, key-value access, millisecond latency

Rule: Need JOINs or transactions? → RDS/Aurora. Need scale + flexibility? → DynamoDB.

Principle 2: Read Replica = ASYNC, Multi-AZ = SYNC

This is THE most tested concept:

Read Replica: ASYNC replication → eventual consistency → replication lag is EXPECTED
Multi-AZ: SYNC replication → always consistent → standby CANNOT serve reads

Key insight: Multi-AZ standby is for failover ONLY. It cannot be read from.

Principle 3: Aurora = RDS with Superpowers

Aurora is AWS’s cloud-optimized relational DB. Same concept as RDS, but:

5x faster than MySQL, 3x faster than PostgreSQL
6 copies across 3 AZs, auto-healing storage
Aurora-exclusive features: Backtrack, Cloning, Serverless, Global Database

If question mentions RDS + wants better performance/features → think Aurora.

Principle 4: Caching = Code Changes Required (Except DAX)

Cache	Code Changes?	Works With
ElastiCache	✅ Required	Any application
DAX	❌ Not required	DynamoDB only

DAX uses the same DynamoDB API. ElastiCache requires application modifications.

Principle 5: Encryption = Launch Time Decision

You cannot encrypt an existing unencrypted database directly. Solution: Snapshot → Restore as encrypted → Switch applications

Same applies: Master not encrypted → Replicas CANNOT be encrypted.

Principle 6: Cross-Region = Different Services, Different Behaviors

Service	Cross-Region Feature	Behavior
RDS	Read Replica (cross-region)	Manual promotion, costs $$$
Aurora	Global Database	<1s replication, <1min failover
DynamoDB	Global Tables	Active-active (writes anywhere!)

Key insight: Only DynamoDB Global Tables allows writes in multiple regions.

Principle 7: Restore = NEW Database

Restoring from backup/snapshot ALWAYS creates a new database instance:

RDS snapshot restore → new RDS instance
Aurora restore → new Aurora cluster
DynamoDB PITR → new DynamoDB table

Exception: Aurora Backtrack → in-place rewind (no new DB created).

Principle 8: Right Tool for the Data Type

Match the data type to the purpose-built database:

Graph data (relationships, social) → Neptune
Time series (IoT, metrics) → Timestream
Immutable ledger (financial, audit) → QLDB
Document/JSON (MongoDB workload) → DocumentDB
Wide-column (Cassandra workload) → Keyspaces

Part 2: Decision Tree (Follow Keywords → Find Answer)

Step 1: What type of data/workload?

                        What's the requirement?
                              │
    ┌─────────────┬───────────┼───────────┬─────────────┬────────────┐
    ▼             ▼           ▼           ▼             ▼            ▼
 SQL + Joins   Key-Value   Caching    Analytics    Specialized   Big Objects
    │             │           │           │             │            │
    ▼             ▼           ▼           ▼             ▼            ▼
 RDS/Aurora   DynamoDB   ElastiCache  Redshift     See Step 3      S3
                           /DAX       /Athena

Step 2: Which RDS/Aurora variant?

                    Need relational database?
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
   Standard needs      Cloud-optimized        Full OS access?
        │                     │                     │
        ▼                     ▼                     ▼
      RDS               Aurora                RDS Custom
        │                     │             (Oracle/SQL only)
        │                     │
        ▼                     ▼
   Cross-region?        Unpredictable
        │               workload?
        ▼                     │
   Read Replica              ▼
   (manual failover)   Aurora Serverless

Step 3: Specialized Database Selection

If the data is…	Use…
Graph relationships (social, fraud)	Neptune
Time series (IoT, metrics, logs)	Timestream
Immutable ledger (financial, compliance)	QLDB
MongoDB-compatible JSON documents	DocumentDB
Cassandra-compatible wide-column	Keyspaces
Free-text search	OpenSearch
Blockchain (decentralized)	Managed Blockchain

Feature-Based Decision Table

If question mentions…	Answer is…
“Users don’t see updated data” + Read Replica	Expected behavior (ASYNC lag)
“Analytics slowing production”	Offload to Read Replica
“Cross-region disaster recovery” + Aurora	Aurora Global Database
“Dev/test” + “unused most of time”	Aurora Serverless
“Production data ASAP” + “read/write tests”	Aurora Cloning
“Full OS customization” + Oracle/SQL Server	RDS Custom
“Lambda” + “DB connections” + “failover”	RDS Proxy
“Users keep logging out” + Auto Scaling	ElastiCache (sessions)
“Real-time leaderboard” + “ranked”	Redis Sorted Sets
“DynamoDB” + “microsecond reads”	DAX
“Multi-region active-active writes”	DynamoDB Global Tables
“Social network” + “relationships”	Neptune
“IoT” + “time-series”	Timestream
“Immutable” + “financial audit”	QLDB

Kinesis Family Decision Tree

What do you need to do with streaming data?
                    │
    ┌───────────────┼───────────────┬───────────────────┐
    ▼               ▼               ▼                   ▼
 INGEST         DELIVER          ANALYZE            KAFKA
 (collect)      (to S3/etc)      (real-time)        (compatible)
    │               │               │                   │
    ▼               ▼               ▼                   ▼
 Kinesis        Kinesis         Kinesis Data        Amazon
 Data Streams   Firehose        Analytics/Flink     MSK

Service	Purpose	Key Feature
Kinesis Data Streams	Ingest real-time data	Custom consumers, 1-365 day retention
Kinesis Firehose	Deliver to destinations	Near real-time (1 min buffer), auto-scaling
Kinesis Data Analytics	Real-time analytics	Apache Flink, SQL on streams
Amazon MSK	Managed Kafka	Kafka-compatible, 10 MB messages, unlimited retention

The “CANNOT” List

Cannot…	Instead…
Read from Multi-AZ standby	Use Read Replica for read scaling
Write to Read Replica	Promote it first (breaks replication)
Encrypt existing DB directly	Snapshot → Restore as encrypted
Use IAM Auth with Oracle/SQL Server	Only MySQL, PostgreSQL, MariaDB
Use Backtrack on RDS	Aurora-only feature
Use DAX with non-DynamoDB	Use ElastiCache instead
Use ElastiCache without code changes	Use DAX for DynamoDB (same API)
Cross-region failover with Multi-AZ	Multi-AZ = same region only
Use “Redshift Global cluster”	Doesn’t exist! Use cross-region snapshot copy
Read from Firehose with Flink	Flink reads from Streams or MSK only

Part 3: Scenario Pattern Recognition

Pattern: “OLTP with auto-scaling storage”

Keywords: OLTP, auto-scaling, maximum replicas, transactional

Answer: Aurora

Why: OLTP = relational (not NoSQL). Aurora has auto-scaling storage (10GB→128TB) + 15 replicas. RDS storage requires manual provisioning.

Pattern: “Analytics queries slowing down production”

Keywords: reporting, analytics, BI tools, production performance

Answer: Create Read Replica for analytics workload

Why: Read Replicas are ASYNC, so heavy queries won’t affect the master.

Pattern: “Users don’t see updated data immediately”

Keywords: stale data, eventually consistent, lag, Read Replica

Answer: This is expected behavior (ASYNC replication)

Why: Read Replica uses ASYNC replication. If strong consistency needed → read from master.

Pattern: “Cross-region disaster recovery with fast failover”

Keywords: cross-region, DR, RTO <1 minute, Aurora

Answer: Aurora Global Database

Why: <1 second replication, <1 minute RTO. RDS cross-region Read Replica = manual promotion.

Pattern: “Dev/test environment, unused most of the time”

Keywords: development, testing, intermittent, unpredictable, minimize costs

Answer: Aurora Serverless

Why: Scales to zero, pay per second. Provisioned = pay even when idle.

Pattern: “Need production data immediately for testing”

Keywords: clone production, read/write tests, staging environment, fast copy

Answer: Aurora Cloning (instant copy-on-write)

Why: Snapshot/restore = slow (copies all data). Read Replica = read-only. Cloning = instant + writable.

Pattern: “Full OS access for Oracle/SQL Server”

Keywords: customize OS, install patches, SSH access, Oracle/SQL Server

Answer: RDS Custom

Why: Standard RDS = no SSH. EC2 = no AWS management. RDS Custom = both.

Pattern: “Key-value store for large files”

Keywords: key-value, large files, 100MB, store files, durable storage

Answer: S3 (NOT DynamoDB!)

Why: S3 IS a key-value store (key = path, value = object). DynamoDB has 400KB item limit. For files 100MB+ → S3.

Pattern: “Lambda functions + database connections + slow failover”

Keywords: Lambda, connection pooling, many connections, failover time

Answer: RDS Proxy

Why: Connection pooling reduces DB load. 66% faster failover. Works great with Lambda.

Pattern: “Users keep getting logged out across instances”

Keywords: sessions, logged out, ALB, Auto Scaling, stateless

Answer: ElastiCache (session store) or DynamoDB with TTL

Why: Sessions stored in shared cache → any instance can retrieve. NOT sticky sessions (uneven load).

Pattern: “Real-time gaming leaderboard with rankings”

Keywords: leaderboard, ranking, sorted scores, real-time

Answer: Redis Sorted Sets

Why: Redis Sorted Sets guarantee uniqueness + ordering. Memcached has no sorted sets.

Pattern: “DynamoDB with microsecond read latency”

Keywords: DynamoDB, faster reads, microsecond, cache

Answer: DAX (DynamoDB Accelerator)

Why: 10x faster reads, no code changes (same API). ElastiCache = different API.

Pattern: “Multi-region active-active with writes anywhere”

Keywords: active-active, write to any region, global users

Answer: DynamoDB Global Tables

Why: Aurora Global = read-only replicas. Only DynamoDB Global Tables = writes in any region.

Pattern: “MongoDB migration + no code changes”

Keywords: MongoDB, migrate, no code changes, same drivers, existing application

Answer: DocumentDB

Why: DocumentDB is MongoDB-compatible (same API/drivers). Application code works unchanged. Note: NOT RDS — there’s no “RDS for MongoDB”!

Pattern: “MongoDB migration + serverless + global”

Keywords: MongoDB, NoSQL, serverless, global, no server management

Answer: DynamoDB (NOT DocumentDB!)

Why: DocumentDB requires provisioned instances (not serverless). DynamoDB = truly serverless + Global Tables. “MongoDB” in question is a distractor — focus on requirements.

Keywords: friends of friends, social graph, relationships, connections, likes, multi-hop queries

Answer: Neptune (Graph database)

Why: Graph databases are optimized for relationship traversals. Example: “likes on posts by friends of Mike” = multi-hop graph query. RDS would need complex JOINs; DynamoDB can’t do JOINs at all.

Pattern: “Cassandra migration to AWS”

Keywords: Cassandra, migrate, CQL, wide-column, no code changes

Answer: Amazon Keyspaces

Why: Keyspaces is Cassandra-compatible (CQL). Existing Cassandra code works unchanged. Fully managed, serverless, highly available.

Pattern: “IoT sensors with readings over time”

Keywords: IoT, sensors, time-series, metrics, trends, readings per second, temperature, humidity, pressure, fast analytics, predict

Answer: Timestream

Why: Purpose-built for time-series data. 1000x faster + 1/10th cost vs relational. Built-in analytics functions for pattern detection.

Pattern: “Financial transactions with immutable audit trail”

Keywords: immutable, ledger, financial, compliance, audit, cannot modify

Answer: QLDB

Why: Cryptographically verifiable history. Note: QLDB ≠ blockchain (centralized, no decentralization).

Pattern: “Encrypt existing unencrypted database”

Keywords: encrypt, existing database, unencrypted, enable encryption

Answer: Snapshot → Restore as encrypted

Why: Cannot enable encryption on existing DB. Must create new encrypted DB from snapshot.

Pattern: “Full-text search on DynamoDB data”

Keywords: search any field, partial match, full-text search, DynamoDB + search

Answer: DynamoDB + OpenSearch

Why: DynamoDB only queries by primary key/indexes. OpenSearch enables full-text search. Use DynamoDB Streams → Lambda → OpenSearch to sync data.

Pattern: “Real-time log analytics and search”

Keywords: logs, search, CloudWatch, real-time, analytics, dashboards

Answer: CloudWatch Logs → OpenSearch (via Lambda or Firehose)

Why: OpenSearch provides search + OpenSearch Dashboards for visualization. Lambda = real-time, Firehose = near real-time.

Pattern: “Serverless SQL on logs in S3”

Keywords: logs in S3, serverless, quick analysis, SQL, ad-hoc

Answer: Amazon Athena

Why: Athena = serverless SQL directly on S3. No infrastructure to manage. Pay $5/TB scanned.

Pattern: “Columnar analytics + BI dashboards”

Keywords: data warehouse, OLAP, columnar, analytics, QuickSight, Tableau, dashboards

Answer: Amazon Redshift + QuickSight

Why: Redshift = OLAP data warehouse (columnar storage). QuickSight = native BI integration for dashboards.

Pattern: “Convert file format for Athena”

Keywords: convert JSON/CSV to Parquet, optimize Athena, reduce costs

Answer: AWS Glue ETL

Why: Glue transforms data formats. Parquet = columnar = Athena scans less = cheaper.

Pattern: “Real-time analytics on streaming data”

Keywords: real-time analytics, stream processing, Kinesis, SQL on streams

Answer: Kinesis Data Analytics (Amazon Managed Service for Apache Flink)

Why: Flink processes streams in real-time. Reads from Kinesis Data Streams or MSK. NOT Firehose!

Pattern: “Migrate Kafka with no code changes”

Keywords: Apache Kafka, migrate, Kafka-compatible, existing application

Answer: Amazon MSK

Why: MSK = managed Kafka. Same APIs, no code changes. Kinesis requires code changes.

Pattern: “Redshift cross-region disaster recovery”

Keywords: Redshift, cross-region, DR, disaster recovery

Answer: Cross-region snapshot copy

Why: Enable automated snapshots + configure cross-region copy. Restore in DR region. “Redshift Global” doesn’t exist!

Part 4: Quick Reference Tables

RDS vs Aurora Comparison

Feature	RDS	Aurora
Engines	PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, DB2	PostgreSQL, MySQL
Performance	Standard	5x MySQL, 3x PostgreSQL
Storage	EBS-backed, auto-scaling	6 copies across 3 AZ, 128TB max
Read Replicas	Up to 15	Up to 15, <10ms lag
Failover	Slower	<30 seconds
Backtrack	❌ No	✅ Yes (in-place rewind)
Serverless	❌ No	✅ Yes
Global Database	Read Replica only	<1s replication, <1min RTO
Cloning	Snapshot only	Instant copy-on-write

Replication & HA Quick Reference

Feature	Read Replica	Multi-AZ	Aurora Global
Purpose	Read scaling	HA/Failover	Cross-region DR
Replication	ASYNC	SYNC	ASYNC (<1 sec)
Serve reads?	✅ Yes	❌ No	✅ Yes
Auto failover?	❌ Manual	✅ Auto	✅ Auto (<1 min)
Cross-region?	✅ Yes	❌ No	✅ Yes

Caching Options

Feature	ElastiCache Redis	ElastiCache Memcached	DAX
Works with	Any app	Any app	DynamoDB only
Code changes	✅ Required	✅ Required	❌ Not required
HA	Multi-AZ + failover	❌ No	Multi-AZ
Persistence	✅ AOF	❌ No	N/A
Sorted Sets	✅ Yes	❌ No	N/A

NoSQL Database Selection

Service	Data Model	Compatible With	Use Case
DynamoDB	Key-value	-	Serverless, sessions
DocumentDB	Document	MongoDB	MongoDB workloads
Keyspaces	Wide-column	Cassandra	Cassandra workloads
Neptune	Graph	Gremlin, SPARQL	Social, fraud
Timestream	Time series	SQL	IoT, metrics
QLDB	Ledger	SQL	Immutable audit

Key Numbers to Remember

Item	Value
Read Replicas max	15
Aurora storage max	128 TB
Aurora copies	6 across 3 AZ
Aurora failover	<30 seconds
Aurora Global replication	<1 second
Aurora Global RTO	<1 minute
DynamoDB item size limit	400 KB
S3 object size max	5 TB
Automated backup retention	1-35 days
Manual snapshot retention	Unlimited
RDS Proxy failover improvement	66% faster

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
SQL + Joins + Transactions	RDS / Aurora
“OLTP” + “auto-scaling storage”	Aurora
“5x MySQL performance”	Aurora
“cross-region” + “RTO <1 min”	Aurora Global
“intermittent workload”	Aurora Serverless
“in-place rewind”	Aurora Backtrack
“clone production instantly”	Aurora Cloning
“OS access” + Oracle/SQL Server	RDS Custom
“analytics slowing production”	Read Replica
“can’t read from standby”	Multi-AZ (expected)
“Lambda + connections”	RDS Proxy
“66% faster failover”	RDS Proxy
“key-value” + “large files” (MB+)	S3 (not DynamoDB!)
“serverless NoSQL”	DynamoDB
“serverless” + “global” + NoSQL	DynamoDB (not DocumentDB)
“microsecond DynamoDB reads”	DAX
“active-active writes”	DynamoDB Global Tables
“sessions across instances”	ElastiCache / DynamoDB TTL
“leaderboard + rankings”	Redis Sorted Sets
“HA + persistence cache”	Redis
“MongoDB” + “no code changes”	DocumentDB
“MongoDB compatible” (same drivers)	DocumentDB
“MongoDB” + “serverless” + “global”	DynamoDB (trap!)
“RDS for MongoDB”	Doesn’t exist! (trap)
“graph + relationships”	Neptune
“social network analysis”	Neptune
“friends of friends” queries	Neptune
“likes on posts by friends”	Neptune
“fraud detection patterns”	Neptune
“IoT + time-series”	Timestream
“sensors” + “readings per second”	Timestream
“temperature/humidity/pressure”	Timestream
“immutable financial ledger”	QLDB
“Cassandra compatible”	Keyspaces
“free-text search”	OpenSearch
“partial match” + “any field”	OpenSearch
“search DynamoDB data”	DynamoDB + OpenSearch
“logs to dashboards”	CloudWatch → OpenSearch
“ETL + data catalog”	Glue
“convert CSV to Parquet”	Glue ETL
“centralized metadata”	Glue Data Catalog
“streaming ETL”	Glue Streaming ETL
“prevent re-processing”	Glue Job Bookmarks
“serverless SQL on S3”	Athena
“PB-scale analytics”	Redshift
“BI dashboards”	QuickSight
“visualizations from Athena/Redshift”	QuickSight
“embeddable analytics”	QuickSight
“data lake”	Lake Formation
“row/column-level security”	Lake Formation (data lake) or QuickSight Enterprise (dashboards)
“centralized data lake permissions”	Lake Formation
“column-level security” + “dashboards”	QuickSight Enterprise
“Kafka on AWS”	Amazon MSK
“migrate Kafka”	Amazon MSK
“message > 1 MB streaming”	MSK (up to 10 MB)
“unlimited stream retention”	MSK
“Apache Flink”	Managed Service for Apache Flink
“real-time stream analytics”	Managed Service for Apache Flink
“Flink + Kinesis or MSK”	Managed Service for Apache Flink
“logs in S3” + “quick analysis”	Athena
“OLAP” + “columnar” + “warehouse”	Redshift
“Redshift Global cluster”	Doesn’t exist! (trap)
“Redshift cross-region DR”	Cross-region snapshot copy
“COPY/UNLOAD through VPC”	Enhanced VPC Routing
“Spark/Hive/Presto” + “big data”	EMR
“open source big data frameworks”	EMR
“deliver to S3/Redshift” + “near real-time”	Kinesis Firehose
“ingest real-time data”	Kinesis Data Streams

Part 6: Elimination Checklist

When stuck between options, eliminate systematically:

□ Is it about READING from standby?
  → Multi-AZ standby can't serve reads
  → Use Read Replica for read scaling

□ Is it CROSS-REGION?
  → Multi-AZ = same region only (eliminate it)
  → Aurora Global or DynamoDB Global Tables

□ Does it need WRITES in multiple regions?
  → Aurora Global = read-only replicas (eliminate it)
  → DynamoDB Global Tables = active-active writes

□ Is it about CACHING without code changes?
  → ElastiCache requires code changes (eliminate it)
  → DAX works with DynamoDB API (no changes)

□ Does it mention ORACLE or SQL SERVER customization?
  → Standard RDS = no SSH (eliminate it)
  → RDS Custom allows full access

□ Is it asking for INSTANT clone?
  → RDS Snapshot = slow (eliminate it)
  → Aurora Cloning = instant

□ Is it GRAPH data?
  → DynamoDB/RDS = complex (eliminate them)
  → Neptune is purpose-built

□ Is it TIME SERIES?
  → DynamoDB/RDS = not optimized (eliminate them)
  → Timestream is purpose-built

□ Is it IMMUTABLE ledger?
  → DynamoDB = mutable (eliminate it)
  → QLDB is immutable

🏆 The Golden Rules

ASYNC = Eventually Consistent (Read Replica lag is expected behavior)
SYNC = Always Consistent (Multi-AZ, but can’t read from standby)
Multi-AZ = HA, Read Replica = Scaling (different purposes!)
Aurora = RDS++ (same engines, better everything)
DAX = no code changes, ElastiCache = code changes required
Redis = HA + features, Memcached = simple + sharding
Aurora Global = read replicas, DynamoDB Global = write anywhere
Backtrack, Cloning, Serverless = Aurora-only features
RDS Custom = Oracle/SQL Server customization only
RDS Proxy = Lambda + 66% faster failover
Restore = NEW database (never overwrites existing)
Encryption at launch (later = snapshot → restore encrypted)
35 days automated backup, unlimited manual snapshots
QLDB ≠ Blockchain (QLDB is centralized)
Athena = serverless SQL on S3 ($5/TB, use Parquet for cost savings)
Redshift = OLAP, Athena = ad-hoc (Redshift faster for complex joins/BI)
OpenSearch = full-text search (complement DynamoDB for search capabilities)
EMR = Hadoop/Spark big data (Spot for Task Nodes, Reserved for Master/Core)
QuickSight = BI dashboards (SPICE for in-memory, users ≠ IAM)
Glue = serverless ETL + Data Catalog (CSV→Parquet, metadata for Athena/Redshift)
Lake Formation = data lake + fine-grained security (row/column-level, built on Glue)
MSK = managed Kafka (alternative to Kinesis for Kafka workloads, larger messages, unlimited retention)
Flink reads from Kinesis or MSK (NOT Firehose — real-time stream processing

Amazon Elastic Container Service (ECS):

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that helps you easily deploy, manage, and scale containerized applications. Amazon ESC intergrated with Application Load Balancer (ALB).

Types of provisioning:

EC2 instances based provisioning, where customer is maintaining the infrastructure and Amazon managing starting and stopping of containers.
Fargate: serverless managed service to run containes just based on the required CPU/RAM. Fargate works with both Amazon ECS and Amazon EKS.

Amazon Elastic Container Registry (ECR) is a public or private registry to store container images, so they can be run by ECS.

ECS task execution role is capabilities of ECS agent (and container instance), e.g:

Pulling a container image from Amazon ECR;
Using the awslogs log driver; ECS task role is specific capabilities within the task itself, e.g:
When your actual code runs.

Amazon Lambda and Batch:

Amazon Lambda is serverless, autoscaled, event-driven service to run on-demand virual functions. Supports many programming languages.

Amazon API Gateway fully managed, serverless and scalable service for developers to easily create, publish, maintain and monitor APIs. Support RESTful APIs and WebSocket APIs.

AWS Batch fully managed batch processing at any scale. Batch will dynaicatlly launch EC2 instances or Spot Instances. Batch jobs are defined as Docker images and run on ECS. AWS Batch has no time limits unlike AWS Lambda, not limited by runtimes as long as it packaged in Docker container and relies on EBS or instance store for disk space.

Serverless Overview:

Serverless = paradigm where developers don’t manage servers — just deploy code/functions.

Initially: Serverless = FaaS (Function as a Service)
Now: includes ANY managed service (databases, messaging, storage)
Serverless ≠ no servers — means you don’t manage/provision/see them

AWS Serverless Services:

Service	Type
AWS Lambda	Compute (FaaS)
DynamoDB	Database (NoSQL)
Aurora Serverless	Database (SQL)
API Gateway	API management
S3	Object storage
SNS & SQS	Messaging
Kinesis Data Firehose	Streaming
Step Functions	Workflow orchestration
Fargate	Serverless containers
Cognito	Authentication
CloudFront	CDN

AWS Lambda:

Lambda vs EC2:

Aspect	Lambda	EC2
Management	Virtual functions — no servers	Virtual servers to manage
Duration	Limited by time (15 min max)	Continuously running
Execution	On-demand, event-driven	Always on
Scaling	Automatic	Manual intervention
RAM/CPU	Limited (up to 10GB RAM)	Choose instance type

Lambda Benefits:

Pricing: Pay per request + compute time
- Free tier: 1M requests/month + 400K GB-seconds compute
Integrated with entire AWS ecosystem
Auto-scaling (no capacity planning)
Easy monitoring via CloudWatch
More RAM = more CPU + network (coupled)

⚠️ Exam trap: “Which service has NO built-in caching?” → Lambda. Lambda is stateless by design. API Gateway has response caching, DynamoDB has DAX. Lambda needs external cache (ElastiCache, DAX).

Lambda Language Support:

Runtime	Languages
Native	Node.js, Python, Java, C#/.NET, PowerShell, Ruby
Custom Runtime API	Rust, Golang (community-supported)
Container Image	Any language (must implement Lambda Runtime API)

⚠️ Exam trap: Lambda Container Image ≠ arbitrary Docker. Must implement Lambda Runtime API. For arbitrary Docker → ECS/Fargate.

Lambda Integrations (Main ones):

API Gateway — RESTful APIs
DynamoDB — NoSQL triggers
S3 — Object events (upload, delete)
CloudWatch Events/EventBridge — Scheduled (cron) or event-driven
CloudWatch Logs — Log processing
SNS — Pub/sub messaging
SQS — Queue processing
Kinesis — Stream processing
Cognito — Authentication events
CloudFront — Lambda@Edge

Lambda Use Cases (from screenshots):

Serverless CRON Job: EventBridge (every 1 hour) → Lambda
Serverless Thumbnail Creation: S3 (new image) → Lambda → creates thumbnail → S3 + DynamoDB metadata

Lambda SnapStart:

Improves function performance up to 10x at no extra cost (Java, Python, .NET)
Function invoked from pre-initialized state (no cold start)
When you publish a new version:
- Lambda initializes the function
- Takes a snapshot of memory/disk state
- Snapshot is cached for low-latency access

SnapStart Enabled:          SnapStart Disabled:
invoke                      invoke
  ↓                          ↓
Lambda (pre-initialized)    Lambda
  ↓                          ↓ Init
Invoke                      Invoke
  ↓                          ↓
Shutdown                    Shutdown

⚠️ Exam trap: “Reduce Lambda cold start” → SnapStart (or Provisioned Concurrency).

Lambda Concurrency:

Concurrency limit: Up to 1000 concurrent executions per region (default)
Reserved Concurrency: Set a limit at function level (guarantees capacity)
Provisioned Concurrency: Pre-initialized functions (reduces cold starts, costs extra)

Type	Purpose	Cold Start	Cost
Unreserved	Default pool	Yes	Pay per use
Reserved	Guarantee capacity for function	Yes	Pay per use
Provisioned	Pre-warm instances	No	Pay for provisioned + invocations

Throttling Behavior:

If no concurrency available → Throttle (429 error)
Synchronous invocation: Returns ThrottleError (429) immediately
Asynchronous invocation: Retries automatically, then goes to DLQ
- Retry for up to 6 hours
- Retry interval: exponential backoff (1s → max 5 min)

Lambda Concurrency Issue Example:

Many users → ALB → Lambda (1000 executions) ✓
Few users → API Gateway → Lambda → THROTTLE! ❌
SDK/CLI → Lambda → THROTTLE! ❌

Without reserved concurrency, one source can consume all capacity.

⚠️ Exam trap: “Lambda throttling from one service” → Use Reserved Concurrency to limit/isolate capacity per function.

⚠️ Exam trap: The 1000 concurrent limit is shared across ALL functions in the account/region. One busy function can starve others → use Reserved Concurrency to isolate.

Amazon API Gateway:

Amazon API Gateway = fully managed, serverless service to create, publish, maintain, and monitor APIs.

Supports RESTful APIs and WebSocket APIs
Integrates with Lambda, HTTP backends, AWS services
Features: authentication, rate limiting, caching, monitoring

Lambda Layers & Destinations:

Lambda Layers:

Reusable packages of libraries, dependencies, custom runtimes
Up to 5 layers per function
Reduces deployment size (dependencies separate from code)
Share across multiple functions

⚠️ Exam trap: “Share code/libraries between Lambda functions” → Lambda Layers. Not copying code into each function.

Lambda Destinations:

Route async invocation results to other services
On Success: SNS, SQS, Lambda, EventBridge
On Failure: SNS, SQS, Lambda, EventBridge
Better than DLQ (more options, includes success routing)

⚠️ Exam trap: “Route Lambda async result on success” → Destinations. DLQ only handles failures.

AWS Batch:

AWS Batch = fully managed batch processing at any scale.

Dynamically launches EC2 or Spot Instances
Batch jobs = Docker images running on ECS
No time limits (unlike Lambda’s 15 min)
Not limited by runtime (any Docker image)
Uses EBS or instance store for disk space
Serverless option: Batch on Fargate (no EC2 management)

AWS Batch Architecture:

Job Queue ──► Compute Environment ──► Job Execution
                     │
        ┌────────────┼────────────┐
        ▼            ▼            ▼
    On-Demand     Spot         Fargate
      EC2         EC2       (serverless)

Batch Components:

Component	Description
Job Definition	How to run: Docker image, vCPU, memory, IAM role
Job Queue	Where jobs wait; priority-based
Compute Environment	Managed EC2/Spot/Fargate instances

AWS Batch Use Cases:

ETL jobs (data transformation, hours-long)
Video/media transcoding (variable length)
Financial modeling (high compute, memory)
Scientific simulations (HPC, ML training)
Log/data processing (large files)
Anything > 15 min or > 10 GB RAM/disk

Lambda vs Batch:

Aspect	Lambda	AWS Batch
Time limit	15 minutes	No limit
RAM	10 GB max	Up to 100s GB
Disk	10 GB /tmp	EBS volumes (TBs)
Runtime	Limited languages	Any Docker image
Invocation	Event-driven, sync/async	Job queue, scheduled
Scaling	Instant (1000 concurrent)	Launches instances (minutes)
Pricing	Per request + duration	Per EC2/Spot/Fargate time
Use case	Short, event-driven	Long-running batch jobs

When to Choose Batch over Lambda:

Scenario	Why Batch
Job > 15 min	Lambda hard limit
Needs > 10 GB RAM	Lambda hard limit
Needs > 10 GB disk	Lambda hard limit
GPU required	Lambda has no GPU
Large file processing	EBS storage available
Cost optimization	Spot instances (up to 90% savings)
Complex dependencies	Full Docker flexibility

⚠️ Exam trap: “Batch job > 15 minutes” or “needs Docker flexibility” or “> 10 GB memory/disk” → AWS Batch. “Event-driven, quick tasks” → Lambda.

⚠️ Exam trap: “Cost-optimize long-running batch jobs” → AWS Batch with Spot Instances. Up to 90% savings vs On-Demand. Lambda has no Spot option.

⚠️ Exam trap: “Serverless batch processing” → AWS Batch on Fargate (no EC2 to manage). Still not Lambda if > 15 min.

Lambda Limits & Pricing:

Lambda Pricing:

Component	Free Tier	After Free Tier
Requests	1M requests/month	$0.20 per 1M requests
Duration	400K GB-seconds/month	$1.00 per 600K GB-seconds

Invocation vs Duration:

Metric	What It Is	Depends On	Cost Impact
Invocation	1 call = 1 request	Number of triggers	$0.20 per 1M
Duration	Time function runs	Code complexity, I/O	GB-seconds

Cost = (Invocations × $0.20/1M) + (GB-seconds × $1.00/600K)
        └── count only ──┘         └── complexity matters ──┘

Example: Simple vs complex function

Both = 1 invocation (same request cost)
Simple (100ms, 128MB) = 0.0128 GB-sec
Complex (5s, 1GB) = 5 GB-sec (40x more duration cost)

Duration examples:

400K seconds if function is 1GB RAM
3.2M seconds if function is 128MB RAM
Billed in 1ms increments

Lambda Limits (per region):

Limit	Value
Memory	128 MB – 10 GB (1 MB increments)
Max execution time	900 seconds (15 minutes)
Environment variables	4 KB
/tmp disk	512 MB – 10 GB
Concurrency	1000 (can increase via support ticket)
Deployment (zip)	50 MB compressed
Deployment (uncompressed)	250 MB (code + dependencies)

⚠️ Exam trap: Lambda limit questions — know the key numbers: 15 min timeout, 10 GB RAM, 1000 concurrency, 250 MB uncompressed.

⚠️ Exam trap — Lambda Disqualifiers: If question mentions ANY of these → Lambda is WRONG answer:

> 10 GB RAM (e.g., “30 GB memory”, “needs 20 GB RAM”)
> 15 min execution (e.g., “takes 1 hour”, “30 min job”, “2 hour process”, “25 min video encoding”)
> 10 GB disk (e.g., “process 15 GB file”)
GPU required (Lambda has no GPU support)
Typical question pattern: “Code works locally but Lambda times out after 3 seconds” + long job → increase timeout won’t help if job > 15 min
→ Best alternatives: AWS Batch (batch jobs), ECS/Fargate (containers), EC2 (custom)

Lambda vs Alternatives Decision:

Scenario	Best Choice	Why
Event-driven, < 15 min	Lambda	Instant scaling, pay per use
Batch job > 15 min	AWS Batch	No time limit, Docker
Long-running + cost optimize	AWS Batch + Spot	90% savings
Containers, always running	ECS/Fargate	Long-running services
GPU, HPC, ML training	AWS Batch	GPU instances available

⚠️ Exam trap: “Long job + retry + can pause/resume days later” → SQS + AWS Batch or SQS + EC2. SQS retains messages up to 14 days. SNS has no retention (push and forget). Lambda has 15 min limit.

⚠️ Exam trap: Default Lambda timeout = 3 seconds. “Timeout error after 3 seconds” = default wasn’t changed, but if job needs > 15 min, Lambda is wrong choice entirely.

Cold Starts & Provisioned Concurrency:

Cold Start: New instance → code loaded + init runs → higher latency on first request
Provisioned Concurrency: Pre-allocate instances before invocation → no cold starts
- Can use Application Auto Scaling (schedule or target utilization)
VPC cold starts significantly improved (Oct 2019)

Solution	Cold Start?	Cost
Default	Yes	Pay per use
SnapStart	No (for Java/Python/.NET)	No extra cost
Provisioned Concurrency	No	Pay for provisioned capacity

⚠️ Exam trap: “Eliminate cold starts” → Provisioned Concurrency or SnapStart. SnapStart is free but limited to Java/Python/.NET.

Lambda in VPC:

Default Lambda Deployment:

By default, Lambda runs outside your VPC (in AWS-owned VPC)
Can access public internet and AWS services (DynamoDB, S3)
Cannot access private VPC resources (RDS, ElastiCache, internal ELB)

Default Lambda Deployment:
                          ┌─────────────────────────────────┐
                          │          AWS Cloud              │
   Internet ◄────────────►│  Lambda ──────► DynamoDB   ✓    │
   (Public)               │    │                            │
                          │    │   ┌──────────────────────┐ │
                          │    └──►│ VPC & Private Subnet │ │
                          │        │  Private RDS    ✗    │ │
                          │        └──────────────────────┘ │
                          └─────────────────────────────────┘

Lambda in VPC Configuration:

Define: VPC ID, Subnets, Security Groups
Lambda creates ENI (Elastic Network Interface) in your subnets
Can then access private resources (RDS, ElastiCache, internal ALB)
Needs NAT Gateway + IGW to access public internet from private subnet

⚠️ Exam trap: “Lambda access RDS in private subnet” → Must configure Lambda in VPC. Default Lambda cannot reach private resources.

⚠️ Exam trap: “Lambda can read DynamoDB but can’t write to SQS” → IAM Role missing permissions (needs sqs:SendMessage). Not security groups — SQS is accessed via API, not network. SQS doesn’t have security groups.

Edge Functions (CloudFront):

Customization at the Edge:

Execute logic at edge locations (close to users = low latency)
Attach code to CloudFront distributions
Fully serverless, globally deployed, pay-per-use
Use case: Customize CDN content before/after origin

Two Types:

CloudFront Functions — lightweight, JavaScript only
Lambda@Edge — more powerful, Node.js/Python

Use Cases:

Website security & privacy
Dynamic web application at the edge
SEO, A/B testing
Bot mitigation
Real-time image transformation
User authentication & authorization
Intelligent routing across origins

CloudFront Functions vs Lambda@Edge:

Aspect	CloudFront Functions	Lambda@Edge
Runtime	JavaScript only	Node.js, Python
Scale	Millions req/sec	Thousands req/sec
Triggers	Viewer Request/Response only	Viewer + Origin Request/Response
Max Execution	< 1 ms	5–10 seconds
Max Memory	2 MB	128 MB – 10 GB
Package Size	10 KB	1 MB – 50 MB
Network Access	No	Yes
File System Access	No	Yes
Request Body Access	No	Yes
Pricing	Free tier, 1/6 price of @Edge	No free tier, per request + duration
Managed In	CloudFront console	Lambda (us-east-1 only)

CloudFront Request/Response Flow:

User ──► Viewer Request ──► Origin Request ──► Origin
                │                   │
                │                   │
         CloudFront Func      Lambda@Edge
         or Lambda@Edge       only

Origin ──► Origin Response ──► Viewer Response ──► User
                │                    │
          Lambda@Edge          CloudFront Func
          only                 or Lambda@Edge

When to Use Which:

Use Case	Best Choice
Cache key normalization	CloudFront Functions
Header manipulation	CloudFront Functions
URL rewrites/redirects	CloudFront Functions
JWT validation (simple)	CloudFront Functions
Needs AWS SDK	Lambda@Edge
Access request body	Lambda@Edge
External API calls	Lambda@Edge
Complex processing	Lambda@Edge

⚠️ Exam trap: “Millions of requests, simple manipulation” → CloudFront Functions. “Need network/file access or origin manipulation” → Lambda@Edge.

⚠️ Exam trap: “Authenticate at CloudFront Edge” or “auth before reaching origin” → Lambda@Edge (or CloudFront Functions for simple JWT). Not API Gateway — it lives in one region, not at edge.

⚠️ Exam trap: Lambda@Edge must be authored in us-east-1 — CloudFront replicates globally.

📌 RDS & Aurora Lambda Integration — see Database section above for details (RDS Event Notifications vs Invoke Lambda from Aurora).

Amazon DynamoDB:

Overview:

Fully managed NoSQL database (not relational) with transaction support
Serverless — no instances to provision, patch, or manage
Highly available with replication across multiple AZs
Scales to massive workloads: millions req/sec, trillions of rows, 100s TB storage
Single-digit millisecond performance (fast and consistent)
Integrated with IAM for security, authorization, administration
Low cost, auto-scaling, no maintenance/patching
Table classes: Standard and Infrequent Access (IA)

⚠️ Exam trap: “Provision EC2 for DynamoDB” = False. DynamoDB is serverless — no servers/instances. Unlike RDS where you choose instance type.

DynamoDB Basics:

Concept	Details
Structure	Tables → Items (rows) → Attributes (columns)
Primary Key	Must be decided at creation time
Items	Infinite number per table, max 400 KB per item
Schema	Flexible — attributes can be added over time

DynamoDB Indexes:

Index Type	When Created	Key	Separate Throughput
LSI (Local Secondary Index)	Table creation only	Same partition key, different sort key	No (uses table’s)
GSI (Global Secondary Index)	Anytime	Different partition key	Yes (own RCU/WCU)

⚠️ Exam trap: “Query by different attribute” → add GSI. “Alternative sort key, same partition” → LSI (must define at creation).

Data Types:

Scalar: String, Number, Binary, Boolean, Null
Document: List, Map
Set: String Set, Number Set, Binary Set

⚠️ Exam trap: “Schema must rapidly evolve” or “flexible schema” → DynamoDB (NoSQL). RDS requires schema migrations.

Read/Write Capacity Modes:

Mode	Capacity Planning	Pricing	Best For
Provisioned (default)	Specify RCU/WCU upfront	Pay for provisioned	Predictable workloads
On-Demand	Automatic, instant scaling	2-3x more expensive	Unpredictable, steep spikes

Provisioned supports auto-scaling — but scales gradually (not instant)
On-Demand scales instantly — handles 0 → millions in seconds
RCU and WCU are decoupled — adjust each independently (read-heavy? increase RCU only)

⚠️ Exam trap: “Load increases from thousands to millions in < 1 minute” or “unpredictable steep spikes” → On-Demand Mode. Provisioned auto-scaling is too slow for sudden bursts.

⚠️ Exam trap: “Cost-effective” + mixed workloads → Match mode to pattern:

Production (predictable, sustained) → Provisioned (cheaper)
Development (unpredictable, variable) → On-Demand (no wasted capacity)

DynamoDB Accelerator (DAX):

What is DAX?

Fully managed, highly available in-memory cache for DynamoDB
Solves read congestion by caching frequently accessed data
Microseconds latency for cached reads
No code changes — compatible with existing DynamoDB APIs
Default TTL: 5 minutes

DAX Architecture:
┌─────────────┐
│ Application │
└──────┬──────┘
       ▼
┌─────────────────────────┐
│     DAX Cluster         │
│ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │Cache│ │Cache│ │Cache│ │
│ └─────┘ └─────┘ └─────┘ │
└──────────┬──────────────┘
           ▼
┌─────────────────────────┐
│   Amazon DynamoDB       │
│   ┌───┐ ┌───┐ ┌───┐     │
│   │Tbl│ │Tbl│ │Tbl│     │
│   └───┘ └───┘ └───┘     │
└─────────────────────────┘

DAX vs ElastiCache:

Use Case	Solution
Cache individual objects, Query/Scan results	DAX
Store aggregation results (computed data)	ElastiCache

Application
    │
    ├── Aggregation Results ──────► ElastiCache
    │
    └── Individual objects ───────► DAX ──► DynamoDB
        Query & Scan cache

⚠️ Exam trap: “Cache DynamoDB reads” → DAX. “Store computed/aggregated results” → ElastiCache.

⚠️ Exam trap: “ProvisionedThroughputExceededException” + “hot keys/popular items” → DAX. Caches hot keys, offloads reads, prevents throughput errors. Increasing RCU alone won’t fix hot partition problem.

⚠️ Exam trap: “Migrate to Aurora/RDS” vs “Add DAX” → Choose DAX. Migration = dev effort, downtime risk, loses serverless benefits. DAX = no code changes, immediate fix, stays serverless.

DynamoDB Streams:

What are Streams?

Ordered stream of item-level modifications (create/update/delete)
24 hours retention (DynamoDB Streams) or 1 year (Kinesis Data Streams)

Use Cases:

React to changes in real-time (welcome email to new users)
Real-time usage analytics
Insert into derivative tables
Cross-region replication (powers Global Tables)
Invoke Lambda on changes

⚠️ Exam trap: “React to DynamoDB changes” (e.g., “send email when user signs up”) → DynamoDB Streams + Lambda. Never poll/scan — use event-driven streams.

DynamoDB Streams Architecture:

App ──► Table ──► DynamoDB Streams ──┬──► Lambda/KCL ──► SNS (notifications)
                         │           │                  ──► DDB Table (filtering)
                         │           │
                         ▼           │
                  Kinesis Data   ────┴──► Kinesis Firehose ──► S3 (archiving)
                  Streams                                  ──► Redshift (analytics)
                                                           ──► OpenSearch (indexing)

DynamoDB Streams vs Kinesis Data Streams:

Feature	DynamoDB Streams	Kinesis Data Streams
Retention	24 hours	1 year
Consumers	Limited (2 simultaneous)	High # of consumers
Processing	Lambda Triggers, KCL Adapter	Lambda, Analytics, Firehose, Glue
Ordering	Per-item ordered	Per-shard ordered
Cost	Included (no extra charge)	Pay for shards

When to Use Which:

Scenario	Best Choice
Simple Lambda trigger on DDB changes	DynamoDB Streams
Need > 2 consumers reading same stream	Kinesis Data Streams
Retention > 24 hours needed	Kinesis Data Streams
Archive to S3/Redshift/OpenSearch	Kinesis Data Streams → Firehose
Real-time analytics on changes	Kinesis Data Streams → Analytics
Just trigger notifications/updates	DynamoDB Streams → Lambda

⚠️ Exam trap: “Multiple consumers” or “long retention” or “replay” or “analytics pipeline” or “GB/sec real-time” → Kinesis Data Streams. “Simple Lambda trigger” → DynamoDB Streams. SQS/SNS have no replay.

DynamoDB Global Tables:

What are Global Tables?

Multi-region, active-active replication (Read + Write in ALL regions)
Low latency access in multiple regions
Like Read/Write replicas (not just read replicas like RDS)

                    GLOBAL TABLE
    ┌─────────────────────────────────────────┐
    │                                         │
    │   ┌──────────┐  two-way  ┌──────────┐   │
    │   │  Table   │◄────────►│  Table   │   │
    │   │US-EAST-1 │replication│AP-SE-2  │   │
    │   └──────────┘           └──────────┘   │
    │    Read+Write            Read+Write     │
    └─────────────────────────────────────────┘

Key Points:

Pre-requisite: Must enable DynamoDB Streams first
Applications can READ and WRITE to table in any region
Automatic two-way replication (eventually consistent across regions)

Global Tables vs RDS Read Replicas:

Aspect	DynamoDB Global Tables	RDS Read Replicas
Write	Any region (active-active)	Primary only
Read	Any region	Any replica
Replication	Two-way (bi-directional)	One-way (primary → replica)
Use case	Global apps, DR	Read scaling

⚠️ Exam trap: “Low latency global access to DynamoDB” → Global Tables. Requires Streams enabled (Streams provide changelog for replication). Not DAX (caching), not Backups (recovery), not “Versioning” (doesn’t exist). ⚠️ Exam trap: Global Tables = active-active (write anywhere). RDS Read Replicas = active-passive (write to primary only).

DynamoDB Additional Features:

Time To Live (TTL):

Automatically delete items after expiry timestamp
Use cases: session handling, reduce storage, regulatory compliance

⚠️ Exam trap: “Web session handling” + “auto-expire” → DynamoDB with TTL. Sessions stored in DynamoDB, TTL auto-cleans expired sessions.

Backups:

Type	Details
PITR (Point-in-Time Recovery)	Last 35 days, continuous, creates new table
On-Demand	Manual, long-term retention, no performance impact
AWS Backup	Cross-region copy support

S3 Integration:

Operation	Details
Export to S3	Requires PITR, last 35 days, DynamoDB JSON or ION format, no RCU consumed
Import from S3	CSV/JSON/ION, creates new table, no write capacity consumed

⚠️ Exam trap: “Export DynamoDB for analytics” → Export to S3 (native feature). Not Lambda — Export uses PITR backup, no RCU, no code. Transfer Family/DataSync are for files, not databases.

AWS API Gateway:

Overview:

Fully managed service to create, publish, maintain, monitor APIs
Lambda + API Gateway = No infrastructure to manage
Supports REST APIs and WebSocket APIs

Features:

API versioning (v1, v2…)
Environment handling (dev, test, prod)
Security (authentication, authorization)
API keys, request throttling
Swagger/OpenAPI import
Request/response transformation & validation
SDK generation, API specifications
Response caching

Integrations:

Integration Type	Use Case	Example
Lambda	Serverless backend	REST API → Lambda
HTTP	Existing HTTP endpoints	On-prem API, ALB
AWS Service	Direct AWS API exposure	Start Step Function, post to SQS

⚠️ Exam trap: “Serverless REST API” → API Gateway + Lambda. Why others fail:

ALB + EC2 = servers to manage (not serverless)
ECS + EBS = containers + block storage (not serverless, EBS is for EC2/ECS)
CloudFront + S3 = static content only (not REST API, no compute)

API Gateway → Kinesis Data Streams Example:

Client ──► API Gateway ──► Kinesis Data ──► Kinesis Data ──► S3
           (requests)      Streams          Firehose         (.json files)

Endpoint Types:

Type	Description	CloudFront
Edge-Optimized (default)	Global clients, routed via CloudFront edge	Built-in
Regional	Same-region clients	Optional (manual)
Private	VPC only, via Interface Endpoint (ENI)	N/A

⚠️ Exam trap: “Edge-Optimized API Gateway lives in all regions” = False. Requests route through global CloudFront edges, but API Gateway itself stays in ONE region.

Security:

Method	Use Case
IAM Roles	Internal applications
Cognito	External users (mobile apps)
Custom Authorizer	Your own auth logic (Lambda)

API Gateway Limits:

Limit	Value
Throttling	10,000 req/sec (account level, can increase)
Burst	5,000 concurrent requests
Timeout	29 seconds max (Lambda can run 15 min, but API GW times out at 29s)
Payload	10 MB max

⚠️ Exam trap: “API Gateway timeout” = 29 seconds (not Lambda’s 15 min). Long-running → use async pattern (API GW → SQS → Lambda).

HTTPS/Certificates:

Integration with AWS Certificate Manager (ACM)
Edge-Optimized → certificate must be in us-east-1
Regional → certificate in API Gateway region
Requires CNAME or A-alias in Route 53

⚠️ Exam trap: “API Gateway + global users” → Edge-Optimized. Certificate must be in us-east-1.

AWS Step Functions:

Overview:

Build serverless visual workflows to orchestrate Lambda functions
Features: sequence, parallel, conditions, timeouts, error handling
Integrates with: EC2, ECS, on-premises, API Gateway, SQS, etc.
Human approval feature available

Use Cases:

Order fulfillment
Data processing pipelines
Web applications
Any multi-step workflow

Step Functions Workflow Types:

Type	Duration	Execution	Pricing	Use Case
Standard	Up to 1 year	Exactly-once	Per state transition	Long-running, audit
Express	Up to 5 min	At-least-once	Per execution + duration	High-volume, short

⚠️ Exam trap: “Serverless workflow” + “human approval” → Step Functions. Only service with built-in human approval feature.

⚠️ Exam trap: “High-volume, short-lived workflows” → Express Workflows. “Long-running, exactly-once” → Standard Workflows.

Amazon Cognito:

Overview:

Give users an identity to interact with web/mobile applications

Cognito vs IAM:

Aspect	Cognito	IAM
Users	Hundreds/thousands/millions	Handful (employees, services)
Type	External users (customers)	Internal users (admins, devs)
Scale	Web/mobile app users	AWS account management
Federation	SAML, social (Google, FB)	SAML, OIDC (for roles)

⚠️ Exam trap keywords → Cognito:

“Hundreds of users” / “millions of users”
“Mobile users” / “web application users”
“Authenticate with SAML” (for app users)
“Social login” (Google, Facebook)
“External users” / “customers”

Two Components:

Component	Purpose	Key Feature
User Pools (CUP)	Authentication (sign-in)	Serverless user database
Identity Pools	Authorization (AWS credentials)	Temporary AWS access

Cognito User Pools (CUP):

Features:

Serverless database for web/mobile app users
Username/email + password login
Password reset, email/phone verification
Multi-factor authentication (MFA)
Federated identities: Facebook, Google, SAML

⚠️ Exam trap: “Easiest/best way to add authentication” to serverless app → Cognito User Pools. Not DynamoDB/S3 + KMS (DIY auth = complex), not Secrets Manager (for app secrets, not user auth).

Integrations:

      [CUP + API Gateway]                      [CUP + ALB]

      Cognito User Pools                    Cognito User Pools   
   (authenticate, get token)                  (authenticate)
              ▲                                     ▲ 
              │                                     │
              ▼                                     ▼
User ──► API Gateway ──► Lambda        User ─────► ALB ──► Target Group
     (REST API + token)                       (authenticate)
  (evaluate Cognito token)

⚠️ Exam trap: CUP integrates with API Gateway and ALB for authentication.

Cognito Identity Pools (Federated Identity):

Purpose:

Provide temporary AWS credentials to users for direct AWS access
Users can access AWS services directly or through API Gateway

Identity Sources:

Cognito User Pools
Social providers (Google, Facebook)
SAML 2.0
OpenID Connect

Cognito Identity Pools Flow:

Web/Mobile App ──► Identity Provider ──► Cognito Identity Pools ──► AWS Services
                  (Google, Facebook,      (validate, exchange     (S3, DynamoDB)
                   SAML, CUP)             for AWS credentials)
                                                │
                                          IAM policies define
                                          what user can access

Key Points:

IAM policies applied to credentials are defined in Cognito
Can customize based on user_id for fine-grained control
Default IAM roles for authenticated and guest users
Use IAM policy variables (${cognito-identity.amazonaws.com:sub}) for per-user S3 folders

⚠️ Exam trap: “Mobile app needs direct access to S3/DynamoDB” → Cognito Identity Pools (provides temporary AWS credentials).

⚠️ Exam trap: “Per-user personal space in S3” → Cognito Identity Pools + IAM policy variables. Not IAM users (doesn’t scale), not public bucket (no security).

⚠️ Exam trap: User Pools = WHO you are (authentication). Identity Pools = WHAT you can access (authorization/credentials).

Serverless Architecture Use Case: Mobile App

Requirements → Solution Mapping:

Requirement	AWS Solution
REST API with HTTPS	API Gateway
Serverless architecture	Lambda, DynamoDB, Cognito, S3
Users interact with own S3 folder	Cognito Identity Pools (per-user IAM policy)
Managed serverless authentication	Cognito User Pools
Mostly reads, some writes	DAX (caching layer for read throughput)
Database scales, high read throughput	DynamoDB + DAX

Complete Architecture:

                                    ┌──────────┐
                   Store/retrieve   │    S3    │
                   files ──────────►│ (files)  │
                        │           └──────────┘
                   Permissions
                   (Cognito)
                        │
┌────────────┐    REST HTTPS    ┌─────────────┐      ┌────────┐     ┌─────┐     ┌──────────┐
│   Mobile   │◄────────────────►│ API Gateway │─────►│ Lambda │────►│ DAX │────►│ DynamoDB │
│   Client   │                  │  (caching)  │      │        │     │cache│     │          │
└────────────┘                  └──────┬──────┘      └────────┘     └─────┘     └──────────┘
       │                               │
       │ authenticate                  │ verify auth
       ▼                               ▼
                              ┌─────────────────┐
                              │ Amazon Cognito  │
                              │ (User Pools +   │
                              │  Identity Pools)│
                              └─────────────────┘

Why each service:

API Gateway — REST HTTPS + response caching at API level
Cognito User Pools — managed auth (signup/login)
Cognito Identity Pools — temporary AWS creds for direct S3 access
Lambda — serverless compute
DynamoDB — serverless, scalable database
DAX — caching layer for high read throughput (mostly reads)
S3 — per-user file storage

⚠️ Exam trap: “Read-heavy workload” + “DynamoDB” → add DAX for caching. “Per-user S3 access” → Cognito Identity Pools.

Serverless Architecture Use Case: Global Website

Requirements → Solution Mapping:

Requirement	AWS Solution
Scale globally	CloudFront (CDN, edge locations)
Rarely written, often read	DynamoDB + DAX (caching)
Static files	S3 + CloudFront
Dynamic REST API	API Gateway + Lambda
Caching where possible	CloudFront (static) + DAX (DB reads)
Welcome email on signup	DynamoDB Streams + Lambda + SES
Thumbnail on photo upload	S3 trigger + Lambda

Architecture Overview:

STATIC CONTENT (Global):
                                     OAC: Origin Access Control
Client ◄───────► CloudFront ◄──────────────────► S3 (static files)
            (edge locations)                     Bucket policy: only CloudFront

DYNAMIC API:
Client ◄──REST──► API Gateway ──► Lambda ──► DAX ──► DynamoDB
                                              cache

PHOTO UPLOAD + THUMBNAIL:
Client ──► CloudFront ──► S3 (photos) ──► Lambda (trigger) ──► S3 (thumbnails)
      (Transfer Acceleration)    OAC              │
                                                  ▼ optional
                                              SQS / SNS

WELCOME EMAIL:
DynamoDB ──► DynamoDB Streams ──► Lambda ──► SES (send email)
(new user)                        (trigger)

Key Patterns:

Pattern	Implementation
Static hosting	S3 + CloudFront + OAC (bucket only allows CloudFront)
Global distribution	CloudFront edge locations
Read-heavy DB	DynamoDB + DAX caching
Event-driven processing	S3 trigger → Lambda (thumbnails)
React to DB changes	DynamoDB Streams → Lambda (welcome email)
Fast uploads	CloudFront + S3 Transfer Acceleration

OAC (Origin Access Control):

Restricts S3 access to CloudFront only
Bucket policy denies direct S3 access
Replaces legacy OAI (Origin Access Identity)

⚠️ Exam trap: “Static website + global” → S3 + CloudFront. “Secure S3 from direct access” → OAC (Origin Access Control).

⚠️ Exam trap: “Generate thumbnail on upload” → S3 event → Lambda. “Welcome email on signup” → DynamoDB Streams → Lambda → SES.

Summary — Serverless Website Key Points:

Component	Purpose
CloudFront + S3	Static content distribution
API Gateway + Lambda	Serverless REST API (public, no Cognito needed)
DynamoDB Global Tables	Global data serving (alternative: Aurora Global)
DynamoDB Streams → Lambda	React to DB changes (new user → welcome email)
Lambda + SES	Serverless email sending (Lambda needs IAM role for SES)
S3 Events	Trigger SQS / SNS / Lambda on upload

⚠️ Exam trap: “Public API” → no Cognito needed, just API Gateway + Lambda. “Global database” → DynamoDB Global Tables or Aurora Global Database.

Microservices Architecture

Why Microservices?

Leaner development lifecycle per service
Each service can have different architecture
Services interact via REST API
Independent scaling and deployment

Communication Patterns:

Pattern	Services	Use Case
Synchronous	API Gateway, Load Balancers	Direct request/response
Asynchronous	SQS, Kinesis, SNS, Lambda triggers (S3)	Decoupled, event-driven

Architecture Example:

                          Route 53 (DNS)
                               │
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
   service1.example.com  service2.example.com  service3.example.com
           │                   │                   │
           ▼                   ▼                   ▼
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │     ELB     │     │ API Gateway │     │     ELB     │
    └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
           ▼                   ▼                   ▼
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │     ECS     │     │   Lambda    │     │ EC2 + ASG   │
    └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
           ▼                   ▼                   ▼
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │  DynamoDB   │     │ ElastiCache │     │     RDS     │
    └─────────────┘     └─────────────┘     └─────────────┘

Microservices Challenges:

Challenge	Description
Repeated overhead	Creating each new microservice requires setup
Server utilization	Hard to optimize density across services
Version complexity	Running multiple versions simultaneously
Client SDK proliferation	Clients need to integrate with many services

How Serverless Helps:

Challenge	Serverless Solution
Overhead	API Gateway + Lambda = minimal setup
Scaling	Automatic scaling, pay per usage
Environments	Clone API, reproduce environments easily
Client SDKs	Generate SDK through Swagger/OpenAPI integration

⚠️ Exam trap: “Reduce microservices overhead” → API Gateway + Lambda. “Generate client SDK” → API Gateway + Swagger/OpenAPI.

Software Updates Offloading

Problem:

EC2 application distributes software updates
New releases = massive traffic spikes
High cost (network, CPU, scaling)
Don’t want to change application architecture

Solution: Add CloudFront

BEFORE (expensive):
Users ──► EC2 (ASG) ──► distributes updates
          scales up    high CPU, network cost

AFTER (optimized):
                                     ┌────────────────────────────────┐
                                     │      Auto Scaling group        │
                                     │  ┌────────────────────────┐    │
                                     │  │   Availability Zone 1  │    │
                                     │  │   ┌────┐    ┌────┐     │    │
                                     │  │   │ M5 │    │ M5 │     │    │
Users ──► CloudFront ──► ALB ────────┼──┤   └────┘    └────┘     │    │──► EFS
          (edge cache)   (AZ 1-3)    │  ├────────────────────────┤    │  (shared storage)
          handles load               │  │   Availability Zone 2  │    │
                                     │  │   ┌────┐    ┌────┐     │    │
                                     │  │   │ M5 │    │ M5 │     │    │
                                     │  │   └────┘    └────┘     │    │
                                     │  ├────────────────────────┤    │
                                     │  │   Availability Zone 3  │    │
                                     │  │   ┌────┐               │    │
                                     │  │   │ M5 │               │    │
                                     │  │   └────┘               │    │
                                     │  └────────────────────────┘    │
                                     └────────────────────────────────┘

Why CloudFront Works:

Benefit	Explanation
No architecture changes	Just add CloudFront in front
Edge caching	Software files cached globally
Static content	Update files don’t change = perfect for CDN
EC2 not serverless, CloudFront is	CloudFront scales automatically
Cost savings	Less ASG scaling, less EC2, less bandwidth

EFS for Multi-AZ Shared Storage:

EC2 instances across multiple AZs share EFS (Elastic File System)
Software update files stored once, accessible from all instances

⚠️ Exam trap: “Reduce EC2 load for static file distribution” + “no architecture changes” → CloudFront. Works with existing EC2, caches at edge, reduces origin load. ALB has no caching feature.

⚠️ Exam trap: “Multi-AZ EC2 + shared filesystem” → EFS. Not EBS (single AZ), not S3 (object storage, not filesystem).

🎯 MASTER SUMMARY: Serverless Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Serverless = No Server Management, Not “No Servers”

Serverless means AWS manages infrastructure. You don’t provision, patch, or scale servers.

Lambda = compute without EC2
DynamoDB = database without RDS instances
API Gateway = API without ALB/nginx
Fargate = containers without EC2

Derive: If question asks “provision instance” for DynamoDB/Lambda → Wrong answer.

Principle 2: Lambda Has Hard Limits — Know the Disqualifiers

Lambda is NOT suitable for every workload. Hard limits exist:

15 min max execution time
10 GB RAM max
10 GB disk max (/tmp)
No GPU support

Derive: Any job > 15 min, > 10 GB RAM/disk, or needs GPU → Lambda is wrong. AWS Batch is the natural alternative for batch workloads.

Principle 3: Stateless vs Stateful — Lambda Has No Memory

Lambda functions are stateless. Each invocation is independent.

No built-in caching (unlike API Gateway, DynamoDB/DAX)
/tmp is ephemeral (only persists within same execution environment)

Derive: “Cache results between invocations” → needs external cache (DAX, ElastiCache).

Principle 4: IAM for APIs, Security Groups for Networks

SQS, DynamoDB, S3, SNS = accessed via AWS API → IAM permissions
RDS, ElastiCache, EC2 = network resources → Security Groups + VPC

Derive: “Lambda can’t write to SQS” → IAM role issue, not security groups. SQS has no SG.

Principle 5: Sync vs Async — Changes Everything

Pattern	Behavior	Retry	Use Case
Sync	Wait for response	No auto-retry	API calls, user-facing
Async	Fire and forget	Auto-retry 6 hrs	Background jobs, events

Derive: “Retry on failure” → Async invocation or SQS. “Immediate response” → Sync.

Principle 6: Retention Determines Service Choice

Service	Retention	Replay
SNS	None (push & forget)	❌
SQS	14 days max	❌ (deleted after read)
Kinesis	1-365 days	✅ Multiple consumers
DynamoDB Streams	24 hours	✅

Derive: “Pause for a day, resume later” → SQS. “Replay events” → Kinesis. “Multiple consumers” → Kinesis.

Principle 7: Edge vs Region — Where Code Runs

CloudFront Functions = edge locations, < 1ms, simple logic
Lambda@Edge = edge, 5-10s, needs network/SDK
API Gateway = one region (even Edge-Optimized routes via CF, but API is regional)
Lambda = one region

Derive: “Auth at edge before reaching origin” → Lambda@Edge. “Millions req/sec simple” → CloudFront Functions.

Principle 8: Caching Layers Stack

User → CloudFront (edge cache) → API Gateway (response cache) → Lambda → DAX → DynamoDB

Each layer reduces load on the next. Know which service provides which cache.

Principle 9: Event-Driven = Streams + Triggers

React to changes without polling:

S3 events → Lambda (thumbnails, processing)
DynamoDB Streams → Lambda (welcome emails, replication)
Kinesis → Lambda (real-time analytics)

Derive: “Send email when user signs up” → DynamoDB Streams + Lambda + SES. Not polling.

Principle 10: Authentication vs Authorization

Cognito Component	What It Does
User Pools	WHO you are (authentication, tokens)
Identity Pools	WHAT you can access (temporary AWS creds)

Derive: “Mobile app login” → User Pools. “Direct S3 access from mobile” → Identity Pools.

Part 2: Decision Tree (Follow Keywords → Find Answer)

"Serverless" mentioned?
├── REST API → API Gateway + Lambda
├── Database → DynamoDB (NoSQL) or Aurora Serverless (SQL)
├── Workflow → Step Functions
├── Auth → Cognito
└── Containers → Fargate

"Long-running job" (> 15 min)?
├── Batch workload → AWS Batch (Docker, Spot, no time limit)
├── Always-on service → ECS/Fargate
├── Custom/legacy → EC2
└── < 15 min → Lambda OK

"Cold start" problem?
├── Java/Python/.NET → SnapStart (free)
└── All languages → Provisioned Concurrency (costs $)

"Cache DynamoDB reads"?
├── Individual objects → DAX
└── Aggregated/computed → ElastiCache

"Global distribution"?
├── Static content → S3 + CloudFront
├── Dynamic API → API Gateway (Edge-Optimized) + Lambda
├── Database → DynamoDB Global Tables or Aurora Global

"React to changes"?
├── S3 upload → S3 Event → Lambda
├── DynamoDB insert → DynamoDB Streams → Lambda
├── Multiple consumers/replay → Kinesis

"Per-user S3 folders"?
└── Cognito Identity Pools + IAM policy variables

The CANNOT List:

What	Why
Lambda > 15 min	Hard limit
Lambda > 10 GB RAM	Hard limit
Lambda > 10 GB disk	Hard limit
Lambda GPU	Not supported → AWS Batch
Lambda arbitrary Docker	Must implement Runtime API
API Gateway > 29 sec	Timeout limit
DynamoDB change LSI after creation	LSI defined at table creation
SNS message replay	No retention
SQS message replay	Deleted after processing
ALB caching	ALB has no cache
SQS security groups	API-based, no network access control

Part 3: Scenario Pattern Recognition

Pattern: “Video encoding takes 25+ minutes”

Keywords: video, encoding, > 15 min, long-running Answer: SQS + EC2 (or ECS/Batch) Why: Lambda max 15 min. SQS provides retry + retention up to 14 days.

Pattern: “Send welcome email when user signs up”

Keywords: welcome email, new user, react to signup Answer: DynamoDB Streams → Lambda → SES Why: Event-driven. Streams capture new items, Lambda sends email via SES.

Pattern: “Thumbnail generation on image upload”

Keywords: thumbnail, upload, S3, image processing Answer: S3 Event → Lambda → S3 (thumbnails) Why: S3 triggers Lambda on PutObject. Lambda processes and saves.

Pattern: “Reduce EC2 load for static file distribution”

Keywords: static files, reduce load, no architecture changes Answer: CloudFront (CDN) Why: Edge caching, no origin changes needed. ALB has no caching.

Pattern: “Mobile app needs direct S3/DynamoDB access”

Keywords: mobile, direct access, temporary credentials Answer: Cognito Identity Pools Why: Provides temporary AWS credentials with IAM policies.

Pattern: “Per-user personal folder in S3”

Keywords: per-user, personal space, S3 folders Answer: Cognito Identity Pools + IAM policy variables Why: ${cognito-identity.amazonaws.com:sub} in policy restricts to user’s folder.

Pattern: “Read-heavy workload with DynamoDB”

Keywords: read-heavy, DynamoDB, cache, hot keys Answer: DAX Why: In-memory cache, microsecond latency, no code changes.

Pattern: “ProvisionedThroughputExceededException on hot keys”

Keywords: throughput exceeded, hot partition, popular items Answer: DAX Why: Caches hot keys, offloads reads. RCU increase alone doesn’t fix hot partition.

Pattern: “Unpredictable, steep traffic spikes (0 to millions)”

Keywords: unpredictable, millions, instant scaling, spikes Answer: DynamoDB On-Demand mode Why: Instant scaling. Provisioned auto-scaling is gradual.

Pattern: “Global low-latency database access”

Keywords: global, multi-region, low latency, DynamoDB Answer: DynamoDB Global Tables Why: Active-active replication. Requires Streams enabled.

Pattern: “Human approval in workflow”

Keywords: human approval, manual step, workflow Answer: Step Functions Why: Built-in human approval feature. No other service has it.

Pattern: “Generate client SDK for API”

Keywords: client SDK, API, mobile/web developers Answer: API Gateway + Swagger/OpenAPI Why: API Gateway generates SDKs from OpenAPI specs.

Pattern: “Authenticate at CloudFront edge”

Keywords: edge authentication, before origin, CDN auth Answer: Lambda@Edge Why: Runs at edge, can validate JWT/tokens before hitting origin.

Pattern: “Multiple consumers need same stream data”

Keywords: multiple consumers, replay, analytics pipeline Answer: Kinesis Data Streams Why: Multiple consumers, 1-365 day retention, replay capability.

Pattern: “Millions requests/sec, simple header manipulation”

Keywords: millions, simple, headers, URL rewrite Answer: CloudFront Functions Why: Sub-millisecond, JavaScript only, cheaper than Lambda@Edge.

Pattern: “Long-running batch job, cost optimization”

Keywords: batch, long-running, hours, cost-effective, Spot Answer: AWS Batch with Spot Instances Why: No time limit, Docker flexibility, up to 90% savings with Spot.

Pattern: “Video/media transcoding at scale”

Keywords: video, transcoding, encoding, media processing Answer: AWS Batch (or Elastic Transcoder/MediaConvert) Why: Variable duration (could be hours), Docker flexibility, Spot for cost.

Part 4: Quick Reference Tables

Lambda Limits:

Limit	Value
Timeout	15 min (900 sec)
RAM	128 MB - 10 GB
/tmp disk	512 MB - 10 GB
Deployment (zip)	50 MB compressed, 250 MB uncompressed
Concurrency	1000 default (regional)
Layers	5 per function

AWS Batch Capabilities (vs Lambda):

Capability	AWS Batch	Lambda
Time limit	Unlimited	15 min
RAM	100s of GB	10 GB
Disk	EBS (TBs)	10 GB
GPU	✅ Yes	❌ No
Spot pricing	✅ Yes (90% savings)	❌ No
Docker	Any image	Runtime API required
Startup time	Minutes	Milliseconds

API Gateway Limits:

Limit	Value
Timeout	29 seconds
Throttle	10,000 req/sec (account)
Payload	10 MB

DynamoDB Numbers:

Metric	Value
Item size max	400 KB
Streams retention	24 hours
On-Demand cost	2-3x Provisioned
DAX TTL default	5 minutes
PITR window	35 days

Retention Comparison:

Service	Retention
SNS	0 (immediate delivery)
SQS	1 min - 14 days
Kinesis	1 - 365 days
DynamoDB Streams	24 hours

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“Serverless REST API”	API Gateway + Lambda
“Job > 15 minutes”	NOT Lambda → EC2/ECS/Batch
“Cold start”	SnapStart or Provisioned Concurrency
“Cache DynamoDB”	DAX
“Cache aggregated results”	ElastiCache
“React to DynamoDB changes”	DynamoDB Streams + Lambda
“React to S3 upload”	S3 Event + Lambda
“Global static website”	S3 + CloudFront
“Global DynamoDB”	Global Tables (needs Streams)
“Send email serverless”	Lambda + SES
“Per-user S3 folders”	Cognito Identity Pools
“Mobile app auth”	Cognito User Pools
“Mobile direct AWS access”	Cognito Identity Pools
“Workflow with human approval”	Step Functions
“Generate client SDK”	API Gateway + Swagger
“Edge authentication”	Lambda@Edge
“Millions req/sec simple”	CloudFront Functions
“Multiple stream consumers”	Kinesis Data Streams
“Replay events”	Kinesis Data Streams
“Pause/resume days later”	SQS (14 day retention)
“Reduce EC2 load, no changes”	CloudFront
“Share code between Lambdas”	Lambda Layers
“Route Lambda success/failure”	Lambda Destinations
“High-volume short workflows”	Step Functions Express
“Long audit workflows”	Step Functions Standard
“Query by different attribute”	DynamoDB GSI
“Steep instant scaling”	DynamoDB On-Demand
“Predictable steady load”	DynamoDB Provisioned
“Lambda timeout 3 sec”	Default not changed
“Lambda can’t reach RDS”	Configure Lambda in VPC
“Lambda can’t write to SQS”	IAM Role missing permissions
“Long-running batch job”	AWS Batch
“Cost-optimize batch processing”	AWS Batch + Spot
“GPU required”	AWS Batch (not Lambda)
“> 10 GB RAM/disk”	AWS Batch (not Lambda)
“Video/media transcoding”	AWS Batch or MediaConvert
“ETL, data processing hours”	AWS Batch

Part 6: Elimination Checklist

□ Does it need > 15 min execution?
  → Yes = Eliminate Lambda → AWS Batch preferred for batch jobs
  → No = Lambda possible

□ Does it need > 10 GB RAM or disk?
  → Yes = Eliminate Lambda → AWS Batch
  → No = Lambda possible

□ Does it need GPU?
  → Yes = Eliminate Lambda → AWS Batch (GPU instances)

□ Is it "serverless REST API"?
  → API Gateway + Lambda (not ALB+EC2)

□ Does it mention "cache"?
  → DynamoDB reads = DAX
  → Aggregated data = ElastiCache
  → Static content = CloudFront
  → API responses = API Gateway caching
  
□ Does it mention "global"?
  → Static = CloudFront
  → Database = Global Tables / Aurora Global
  
□ Does it need "replay" or "multiple consumers"?
  → Kinesis (not SQS/SNS)

□ Does it mention "edge"?
  → Simple/fast = CloudFront Functions
  → Complex/network = Lambda@Edge

□ "Security group" for SQS/SNS/DynamoDB?
  → Wrong answer (API services, not network)

□ "Provision instance" for DynamoDB/Lambda?
  → Wrong answer (serverless)

🏆 The Golden Rules

15/10/10 Rule — Lambda max: 15 min, 10 GB RAM, 10 GB disk → AWS Batch if exceeded
29 sec API Gateway — API Gateway times out before Lambda (use async for long jobs)
DAX for reads, ElastiCache for compute — know which cache layer
Streams enable replication — Global Tables require DynamoDB Streams
User Pools = Auth, Identity Pools = Creds — Cognito split
Edge-Optimized cert in us-east-1 — API Gateway + CloudFront
Lambda@Edge authored in us-east-1 — replicated globally
SNS = no retention, SQS = 14 days, Kinesis = 365 days
IAM for APIs, SG for networks — SQS/SNS/DynamoDB = no security groups
CloudFront + existing EC2 = no refactor — just add CDN in front
On-Demand = 2-3x cost but instant scale — DynamoDB mode trade-off
SnapStart = free cold start fix — but only Java/Python/.NET
Step Functions = only human approval — no other serverless workflow has it
S3 + CloudFront + OAC — secure static hosting pattern
DynamoDB Streams → Lambda → SES — serverless email pattern
AWS Batch = Lambda alternative — when limits exceeded (time, RAM, disk, GPU)
Batch + Spot = 90% savings — for cost-optimized batch processing

Amazon Lightsail:

Amazon Lightsail simplified alternative version of AWS services, used for simple web applications (has templates for LAMP, Nginx, MEAN, Node.js..), websites (templates for Wordpress, Magento, Plesk, Joomla), Dev/Test environment. Has high availability but no auto-scaling, limited AWS integrations.

Pricing:

Pricing Models in AWS:

Pay as you go: pay for what you use, remain agile, responsive, meet scale demands;
Save when you reserve: minimize risks, predictably manage budgets, comply with long-terms requirements;
Pay less by using more: volume-based discounts;
Pay less as AWS grows: get discount for long-term loyalty to AWS.

Examples of spending categories:

Compute: Pay for compute time;
Storage: Pay for data stored in the Cloud;
Data transers OUT of the Cloud: Data transfer IN is free.

Free services & free tier in AWS:

IAM;
VPC;
Consolidated Billing;
Elastic Beanstalk (pay for the resources created);
CloudFormation (pay for the resources created);
Auto Scaling Groups (pay for the resources created);
EC2 Image Builder (pay for the resources created);
ECS with EC2 Launch Type Model (pay for the resources created);
Free Tier:
- EC2 (t2.micro);
- S3, EBS, ELB, AWS Data transfer.

EC2 Instances Purchasing Options:

On-Demand Instances: (Pay for what you use) short-term and un-interrupted workload, no upfront payment - predictable pricing, highest cost - pay by second;
Reserved instances (up to 75% discount, with 1 OR 3 years):
- Standard Reserved Instances (fixed Instance Type, Region, Tenancy, OS): long steady-state workloads (like database); can sell unused on RI Marketplace;
- Convertible Reserved Instances: (mutable Instance type, Region, Tenancy, OS): long workloads with flexible instances; CANNOT sell on Marketplace;
Saving Plans (up to 72%, with 1 OR 3 years): commited to an amount of usage (ex. 10$/hour) of Instance Type in Region, long workload;
Compute Saving Plan (up to 66% discount): regardless of Instance Type, Region, Size, OS, Tenancy, Compute options; applies to EC2, Fargate, Lambda;
Spot Instances (up to 90% discount): cost-efficient short-term workload, resilient to failure and less reliable instances (randomly dies with 2 min warning);
- Spot Requests: One-time (provision once, request disappears) or Persistent (auto-provisions NEW instance after termination - no migration, old instance gone, data lost unless EBS preserved); States: open, active, disabled, cancelled; must cancel request first, then terminate instances;
- Spot Block: request instances for 1-6 hours without interruptions (deprecated, may appear in older exam versions);
- Spot Fleets: set of Spot + optional On-Demand instances to meet target capacity with price constraints; stops launching when capacity or max cost reached;
  - Launch Pools: define instance type, OS, AZ combinations; fleet chooses from multiple pools;
  - Allocation Strategies: lowestPrice (cost optimization, short workload), diversified (across all pools, availability, long workload), capacityOptimized (optimal capacity), priceCapacityOptimized (highest capacity then lowest price, recommended for most);
- Max Price: set maximum price per hour; if spot price > max price, instance terminated; 2-min warning via EC2 metadata & CloudWatch Events;
Dedicated Host: book an entire physical server, control instance placement to address compliance requirements and use your existing server-bound software licenses; socket/core visibility for BYOL;
- On-demand (most expensive option): pay per second for active Dedicated Host;
- Reserved (with 1 or 3 years): with No Upfront, Partial Upfront and All Upfront payment options.
Dedicated Instances: no other customers will share dedicated to you hardware. But there is no control over hardware placement; per-instance billing (vs per-host for Dedicated Host);
Capacity Reservations: reserve capacity (Instance Type, Tenancy, and OS) in a specific AZ for any duration; no billing discount, combines with RI/Savings Plan; cancel anytime.

EC2 Image Builder only pay for the underlying resources.

EBS Storage billed:

For all provisioned capacity;
For IOPS if SSD and for number of request if HDD;
Snapshots added data coast per GB per month;
Outbound traffic are tiered for volume discounts.

EFS (Elastic File System):

Pay per use;
Storage class;
Lifecycle rules.

S3 Pricing:

Storage class: S3 Standard, S3 Glacier Deep Archive, etc;
Number and size of objects: Price can be tiered (based on volume);
Number and type of requests;
Data transfer OUT of the S3 region;
S3 Transfer Acceleration;
Lifecycle transitions.

ECS pricing:

EC2 Launch Type Model: No additional fees, you pay AWS resources stored and created in your application;
Fargate Launch Type Model: Pay for vCPU and memory resources allocated to your applications in your containers.

Lambda pricing:

Pay per call;
Performance (RAM);
Pay per duration.

Snowball Family Pricing: AWS Snowball offers significantly discounted pricing (up to 62%) for 1-year usage and 3-year usage commitments for Edge compute use cases.

Database pricing - RDS:

Per hour billing;
Database characteristics: Engine, Size, Memory class;
Purchase types: On-demand, Reserved instances;
Backup storage: no additional charge for backup storage up to 100% of your total database storage for a region.
Additional storage (per GB per month);
Number of input and output requests per month;
Deployment type (storage and I/O are variable): Single or Multiple AZ;
Data transfer:
- Outbound traffic are tiered for volume discounts;
- Inbound is free.

CloudFront pricing:

Pricing is different across different geographic regions;
Aggregated for each edge location, then applied to your bill;
Data Transfer Out (volume discount);
Number of HTTP/HTTPS requests.

Billing and Costing Tools:

Pricing Calculator: estimate the cost for your solution architecture.

AWS Billing Dashboard: home page for an overview of your AWS cloud financial management data and to help you make faster and more informed decisions. AWS Free Tier Dashboard: tracking AWS Free Tier usage.

Cost Allocation Tags: use cost allocation tags to track AWS costs on a detailed level.

Tagging and Resource Groups:

Tags are used for organizing resources;
Resource Groups are used for creating, maintaining and viewing a collection of resources that share common tags.

Cost and Usage Reports: lists AWS usage for each service category used by an account and its IAM users in hourly or daily line items, as well as any tags that customer activated for cost allocation purposes, including additional metadata about AWS services, pricing and reservations.

Cost Explorer: visualize, understand, and manage your AWS costs and usage over time. Create custom reports that analyze cost and usage data.

Analyze at high level (total costs across all accounts) or monthly, hourly, resource level granularity
Choose an optimal Savings Plan to lower costs
Forecast usage up to 12 months based on previous usage

Billing Alarms in CloudWatch: intended simple alarm for actual cost, not for projected costs, based on billing data metric stored in CloudWatch.

Create billing alert for free tier (Details):

(Change region to N.Virginia) [ Alert preferences ] > [ Edit ] > Receive CloudWatch billing alerts [x] > [ Save ];
https://us-east-1.console.aws.amazon.com/billing/home#/preferences
Open the CloudWatch console » Alarms > All alarms > Create alarm > Select metric > Billing > Total Estimated Charge.
https://console.aws.amazon.com/cloudwatch/

AWS Budgets: set custom budgets to track your costs and usage, and respond quickly to alerts received from email or SNS notifications if you exceed your threshold.

Usage;
Cost;
Reservation;
Saving Plans.

AWS Cost Anomaly Detection: continuously monitor your cost and usage using ML to detect unusual spends. It learns your unique, historic spend patterns to detect one-time cost spike and/or continuous cost increases — no need to define thresholds (ML does it). Monitor by: AWS services, member accounts, cost allocation tags, or cost categories. Sends anomaly detection report with root-cause analysis. Get notified with individual alerts or daily/weekly summary via SNS.

AWS Service Quotas: notifies you when you’re close to a service quota value threshold. Create CloudWatch Alarms. Request a quota increase from AWS Service Quotas or shutdown resources before limit is reached.

AWS Trusted Advisor: analyze your AWS accounts and provides recommendation on 6 categories:

Cost optimization: under-utilization;
Performance;
Security;
Fault tolerance;
Service limits;
Operational Excellence.
Full set of Checks (Only for Business & Enterprise Support plan);
Programmatic Access using AWS Support API (Business & Enterprise Support plan).

AWS Compute Optimizer: uses ML to analyze existing resources’ configurations and their utilization CloudWatch metrics, helps to choose optimal configurations and right-size your workloads (over/under provisioned). Supports: EC2 Instances, EC2 Auto Scaling Groups, EBS volumes, Lambda functions. Recomendations can be exported to S3.

⚠️ Exam trap — Cost Explorer vs Compute Optimizer:

Compute Optimizer = recommends instance types (right-sizing). Does NOT recommend purchasing options (RI/SP)
Cost Explorer = identifies idle/underutilized resources + recommends RI/Savings Plans purchases
Trusted Advisor = flags idle resources but does NOT take action (no auto-renew, no auto-terminate)

AWS Support Plans Pricing:

AWS Basic Support (free):

Customer Service & Communities: 24/7 access to customer service, documentation and support forums;
AWS Trusted Advisor: Access to the 7 core Trusted Advisor checks and guidance to provision your resources following best practices to increase performance and improve security;
AWS Person Health Dashboard: A personalized view of the health of AWS services, and alerts when your resources are impacted.

AWS Developer Support Plan:

All Basic Support Plans features included;
Business hours email access to Cloud Support Associates;
Unlimited cases (1 primary contact);
Case severity / response times:
- General guidance: <24 business hours;
- System impaired: <12 business hours.

AWS Business Support Plan (24/7):

Intended to be used on production workloads;
Trusted Advisor: full set of checks + API access;
24/7 phone, email and chat access to Cloud Support Engineers;
Unlimited cases / unlimited contacts;
Access to Infrastructure Event Management for additional fee;
Case severity / response times:
- General guidance: <24 business hours;
- System impaired: <12 business hours;
- Production system impaired: <4 hours;
- Production system down: <1 hour.

AWS Enterprise On-Ramp Support Plan (24/7):

Intended to be used if you have production or business crtical workloads;
All of Business Suport Plan features included;
Access to a pool of Technical Account Managers (TAM);
Concierge Support Team (for billing and account best practices);
Infrastructure Event Management, Well-Archiected & Operations Reviews;
Case severity / response times:
- General guidance: <24 business hours;
- System impaired: <12 business hours;
- Production system impaired: <4 hours;
- Production system down: <1 hour;
- Business-critical system down: <30 minutes.

AWS Enterprise Support Plan (24/7):

Intended to be used if you have mission critical workloads:
All of Business Support Plan features included;
Access to designated Technical Account Manager (TAM);
Concierge Support Team (for billing and account best practices);
Infrastructure Event Management, Well-Architected & Operations Reviews;
Case severity / response times:
- General guidance: <24 business hours;
- System impaired: <12 business hours;
- Production system impaired: <4 hours;
- Production system down: <1 hour.
- Business-critical system down: <15 minutes.

🎯 MASTER SUMMARY: Billing, Costing & Support Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Cost Tools Differ by WHEN You Need Them

Before deployment: Pricing Calculator (estimate)
During operation: Cost Explorer (analyze), Budgets (alert), Cost Anomaly Detection (ML detect)
After the fact: Cost and Usage Reports (detailed audit)

Principle 2: Alerts Have Different Triggers

CloudWatch Billing Alarms = simple, actual cost only (not projected), one threshold
AWS Budgets = cost, usage, reservation, Savings Plans — alerts via email/SNS when approaching OR exceeding threshold
Cost Anomaly Detection = ML-based, no threshold needed — learns patterns automatically

Principle 3: Trusted Advisor = 6 Categories, Tier-Gated

Free tier gets 7 core checks only. Full checks require Business or Enterprise support plan. Categories: Cost optimization, Performance, Security, Fault tolerance, Service limits, Operational Excellence.

Principle 4: Support Plans Are a Spectrum

Basic (free) → Developer → Business → Enterprise On-Ramp → Enterprise. Key differentiators: response time, TAM access, Trusted Advisor access. Business = first plan with 24/7 phone/chat + full Trusted Advisor + API access.

Principle 5: Tags Drive Cost Visibility

Cost Allocation Tags → track costs per project/team/environment. Resource Groups → view resources sharing common tags. Without tags, you can’t do granular cost analysis.

Part 2: Decision Trees

What cost question are you answering?
│
├─ "Estimate cost BEFORE building" → Pricing Calculator
├─ "Visualize/analyze PAST costs" → Cost Explorer
├─ "Set budget ALERTS" → AWS Budgets
├─ "Detect UNUSUAL spending (ML)" → Cost Anomaly Detection
├─ "Detailed cost AUDIT report" → Cost and Usage Reports
├─ "Right-size resources" → Compute Optimizer
└─ "Check best practices" → Trusted Advisor

Which Support Plan?
│
├─ "Just documentation + forums" → Basic (free)
├─ "Email support, business hours" → Developer
├─ "24/7 phone + full Trusted Advisor" → Business
├─ "Pool of TAMs, <30 min critical" → Enterprise On-Ramp
└─ "Designated TAM, <15 min critical" → Enterprise

Part 3: Scenario Pattern Recognition

Pattern: “Detect unexpected cost spikes without setting thresholds”

Keywords: unusual spending, ML, automatic detection Answer: AWS Cost Anomaly Detection Why: ML learns patterns — no manual thresholds. Sends root-cause analysis via SNS.

Pattern: “Get alerted when cost exceeds $X”

Keywords: budget, threshold, alert, notification Answer: AWS Budgets Why: Budgets support cost/usage/reservation thresholds with email/SNS alerts.

Pattern: “Forecast next 12 months of AWS spending”

Keywords: forecast, predict, future cost Answer: Cost Explorer (forecast feature)

Pattern: “Right-size EC2/Lambda/EBS resources”

Keywords: over-provisioned, under-utilized, right-size Answer: AWS Compute Optimizer

Pattern: “Need 24/7 phone support + full Trusted Advisor”

Keywords: production workloads, 24/7, phone support Answer: Business Support Plan (minimum for this)

Pattern: “Need a designated TAM”

Keywords: TAM, Technical Account Manager, designated Answer: Enterprise Support Plan (On-Ramp has a pool, not designated)

Part 4: Quick Reference Tables

Tool	Purpose	Trigger
Pricing Calculator	Estimate cost before building	Manual
Cost Explorer	Visualize past costs, forecast 12mo	On-demand
AWS Budgets	Alert when approaching/exceeding threshold	Threshold-based
Cost Anomaly Detection	ML detects unusual spending	Automatic (ML)
Cost & Usage Reports	Detailed line-item audit	Scheduled
Compute Optimizer	Right-size recommendations	ML analysis
Trusted Advisor	Best practice checks (6 categories)	Continuous

Support Plan	Response (Critical)	Trusted Advisor	TAM
Basic	—	7 core checks	❌
Developer	12h (business hrs)	7 core checks	❌
Business	<1h	Full + API	❌
Enterprise On-Ramp	<30 min	Full + API	Pool
Enterprise	<15 min	Full + API	Designated

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“Estimate cost before building”	Pricing Calculator
“Visualize past costs”	Cost Explorer
“Forecast future spending”	Cost Explorer (12 mo)
“Set budget alert”	AWS Budgets
“Detect unusual spending (ML)”	Cost Anomaly Detection
“No thresholds, automatic detection”	Cost Anomaly Detection
“Detailed cost audit per service”	Cost & Usage Reports
“Right-size EC2/EBS/Lambda”	Compute Optimizer
“Best practices check”	Trusted Advisor
“Track costs by project/team”	Cost Allocation Tags
“24/7 phone support”	Business plan (minimum)
“Full Trusted Advisor + API”	Business plan (minimum)
“Designated TAM”	Enterprise plan
“Pool of TAMs”	Enterprise On-Ramp
“<15 min response critical”	Enterprise plan
“Service quota approaching limit”	Service Quotas
“Stop dev instances after hours”	Instance Scheduler

Part 6: Elimination Checklist

□ Is it about ESTIMATING cost before building?
  → Yes = Pricing Calculator
  → No = analyzing existing costs

□ Is it about DETECTING unusual spending automatically?
  → Yes + no thresholds = Cost Anomaly Detection (ML)
  → Yes + specific threshold = AWS Budgets

□ Is it about VISUALIZING past costs or FORECASTING?
  → Visualize/forecast = Cost Explorer
  → Detailed line-item audit = Cost & Usage Reports

□ Do they need 24/7 PHONE support?
  → Yes = Business plan (minimum)
  → Email only = Developer plan

□ Do they need a TAM?
  → Designated = Enterprise
  → Pool = Enterprise On-Ramp
  → None = Business or lower

□ Do they need FULL Trusted Advisor?
  → Yes = Business plan (minimum)
  → 7 core checks only = Basic/Developer

□ Is it about RIGHT-SIZING resources?
  → Yes = Compute Optimizer
  → Cost visualization = Cost Explorer (different!)

□ Is it about TRACKING costs per project/team?
  → Yes = Cost Allocation Tags first
  → No tags = can't do granular tracking

🏆 The Golden Rules

Pricing Calculator = before, Cost Explorer = after (estimate vs analyze)
Budgets = you set threshold, Anomaly Detection = ML finds it (manual vs automatic)
Business plan = minimum for 24/7 + full Trusted Advisor (common exam question)
Enterprise = designated TAM, On-Ramp = pool of TAMs (key distinction)
Tags first, then allocate costs (no tags = no granular cost tracking)
Compute Optimizer ≠ Cost Explorer (right-size vs visualize)
CloudWatch Billing Alarm = simple actual cost only (Budgets is more powerful)

Scalability and High Availability

Scalability means that an application or system can handle greater loads by adapting.

Vertical scalability: means increasing the size of the instance, common for non-distributed systems (databases). There’s usually a limit to how much you can vertically scale (hardware limit). Achieved via:
- Auto Scaling Group: Multi-AZ;
- Load Balancer: Multi-AZ;
Horizontal scalability (elasticity): increasing the numbers of instances or systems for your application, implies distributed systems (modern web applications). Achieved via:
- Auto Scaling Group: scale out - Horizontal expansion or scale in - Reduces costs by removing unused capacity;
- Load Balancer: Distribute traffic across instances.

High Availability: survivability of a data center loss (disaster). Running application or system in at least two AZs.

Fault-tolerant systems emphasize maintaining continuous operation during unexpected failures, while high-availability infrastructures prioritize keeping services up and running despite scheduled maintenance or potential bottlenecks.

Scalability vs Elasticity vs Agility:

Scalability: ability to accommodate a larger load by making the hardware stronger (scale up), or by adding nodes (scale out);
Elasticity: once a system is scalable, elasticity means that there will be some “auto-scale” so that the system can scale based on the load (pay-per-use, match demand, optimize costs);
Agility: reduced time to make IT resources available.

Elastic Load Balancer (ELB) - managed load balancer that automatically distributes incoming application traffic across multiple resources, such as Amazon EC2 instances.

Spread load across multiple downstream instances;
Expose a single point of access (only DNS, no static IP) to your application;
Seamlessly handle failure of downstream instance;
Do regular health checks to instances (Protocol: HTTP; Port: 4567; Endpoint: /health);
Provide SSL termination (HTTPS) for your website;
Enforce stickiness with cookies;
Separate public traffic from private traffic;
High availability across zones.

Load Balancer Flows:

ALB Flow (Layer 7):
                         ┌─────────────────┐
    Internet             │       ALB       │
   ─────────────────────►│  SSL Termination│
    HTTPS :443           └────────┬────────┘
                                  │ HTTP :80
                    ┌─────────────┼─────────────┐
                    ▼             ▼             ▼
              ┌─────────┐   ┌─────────┐   ┌─────────┐
              │  EC2    │   │  EC2    │   │  EC2    │
              └─────────┘   └─────────┘   └─────────┘
                        Target Group

ALB Routing Rules:
              ┌──────────────────────────────────────┐
              │     ALB Listener :443 (HTTPS+SNI)    │
              └──────────────────┬───────────────────┘
          ┌──────────────────────┼──────────────────────┐
          │ /api/*               │ /images/*            │ default
          ▼                      ▼                      ▼
   ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
   │  API Servers │      │  S3/Lambda   │      │  Web Servers │
   └──────────────┘      └──────────────┘      └──────────────┘

NLB with Static IP (Layer 4):
   Client (needs static IP for firewall whitelist)
                    │
                    ▼
         ┌─────────────────────┐
         │        NLB          │
         │  Elastic IP: 1.2.3.4│  ◄── Static IP per AZ
         │     (Layer 4)       │
         └──────────┬──────────┘
                    │ TCP passthrough
                    │ (Client IP preserved)
                    ▼
            ┌───────────────┐
            │  Target Group │
            └───────────────┘


GLB for Security Appliances:
                    ┌──────────┐
   Traffic ────────►│   GLB    │
                    │(Layer 3) │
                    └────┬─────┘
                         │ GENEVE :6081
                         ▼
              ┌─────────────────────┐
              │  Security Appliance │
              │  (Firewall/IDS/IPS) │
              │     Inspect ───────►│──► Allow/Block
              └─────────────────────┘
                         │
                         ▼
                   Your Application

Types of load balancers:

Feature	ALB	NLB	GLB	CLB
Layer	7 (HTTP/S)	4 (TCP/UDP)	3 (IP)	4 & 7
Use Case	Web apps, microservices	Ultra-low latency, static IP	Firewalls, IDS/IPS	Legacy (deprecated)
Performance	Moderate	Millions req/sec	High throughput	Moderate
Static IP	❌ DNS only	✅ Elastic IP per AZ	❌	❌
SNI (multi-cert)	✅	✅	N/A	❌
Cross-Zone Default	✅ Enabled (free)	❌ Disabled (paid)	❌ Disabled (paid)	❌ Disabled (free)
Hostname	XXX.region.elb.amazonaws.com	XXX.region.elb.amazonaws.com	XXX.region.elb.amazonaws.com	Fixed hostname

Target Group Support:

Target Type	ALB	NLB	GLB	CLB
EC2 Instances	✅	✅	✅	✅
IP Addresses (private)	✅	✅	✅	❌
Lambda Functions	✅ (HTTP→JSON)	❌	❌	❌
ALB	❌	✅	❌	❌
ECS Tasks	✅	✅	❌	✅

When to Use:

Scenario	Choose
HTTP routing (path/host/headers/query string)	ALB
WebSockets, HTTP/2	ALB
Containers with dynamic ports	ALB
Need static/Elastic IP (IP whitelisting)	NLB
Millions req/sec, ultra-low latency	NLB
TCP/UDP non-HTTP traffic	NLB
3rd party security appliances	GLB
Deep packet inspection	GLB

Details:

Classic Load Balancers [TCP (Layer 4), HTTP/HTTPS (Layer 7)]: Deprecated;
- Fixed hostname;
- Health Checks: TCP, HTTP protocols;
Application Load Balancer [HTTP/HTTPS - Layer 7]: Micro services and container-based applications;
- Routing based on path/hostname/headers/query string in URL;
- Dynamic port mapping: supports multiple containers per instance (hostPort: 0, ALB auto-discovers ephemeral ports);
- Target Groups: EC2 instances, ECS tasks, Lambda functions (HTTP→JSON), IP Addresses (private IPs); can route to multiple target groups;
- Fixed hostname: XXX.region.elb.amazonaws.com;
- X-Forwarded Headers: ALB terminates connection, adds headers for original client info:
  - X-Forwarded-For: Client’s real IP address;
  - X-Forwarded-Port: Original port (80 or 443);
  - X-Forwarded-Proto: Original protocol (http or https).
- Health Checks: HTTP, HTTPS only (Layer 7);
Network Load Balancer [TCP/UDP - Layer 4]: Ultra-high performance;
- TCP & UDP traffic: millions of requests per second, ultra-low latency;
- Static IP per AZ: supports Elastic IP (helpful for IP whitelisting);
- Target Groups: EC2 instances, IP Addresses (private IPs), ALB;
- Health Checks: TCP, HTTP, HTTPS protocols;

⚠️ Exam trap: ELB target registration — instance ID vs IP address

Register By	Routing Behavior	Use Case
Instance ID	Routes to primary private IP on primary ENI	Default, simplest
IP Address	Routes to the specific IP you chose	Multiple IPs per instance, non-EC2 targets (on-prem, containers)

❌ Public IP / Elastic IP → never used for target routing (ELB routes within VPC via private IPs)
❌ Instance ID as routable address → instance ID is a reference, NLB resolves it to primary private IP
Gateway Load Balancer [Layer 3 - IP Packets]: Deploy/scale 3rd party network virtual appliances;
- Use cases: Firewalls, Intrusion Detection/Prevention, Deep Packet Inspection;
- Functions: Transparent Network Gateway (single entry/exit) + Load Balancer;
- Protocol: GENEVE on port 6081;
- Target Groups: EC2 instances, IP Addresses (private IPs);
- Health Checks: TCP, HTTP, HTTPS protocols;

Load Balancer Details:

Feature	CLB	ALB	NLB	GLB
Layer	4 & 7 (deprecated)	7 (HTTP/S)	4 (TCP/UDP)	3 (IP)
Use Case	Legacy	Microservices, containers	Ultra-low latency, static IP	Firewalls, IDS/IPS
Routing	Basic	Path/host/headers/query	-	-
Target Groups	EC2 only	EC2, ECS, Lambda, IPs	EC2, IPs, ALB	EC2, IPs
Static IP	❌	❌ DNS only	✅ Elastic IP/AZ	❌
Health Checks	TCP, HTTP	HTTP, HTTPS	TCP, HTTP, HTTPS	TCP, HTTP, HTTPS
Dynamic Port Mapping	❌	✅	❌	❌
Client Info	Preserved	X-Forwarded-* headers	Preserved	Preserved
Protocol	TCP/HTTP	HTTP/HTTPS	TCP/UDP	GENEVE (port 6081)

Sticky Sessions (Session Affinity): client always redirected to same instance behind load balancer.

Supported by: CLB, ALB, NLB;
Use case: preserve session data (user login state);
Trade-off: may cause load imbalance across instances.

Cookie Types:

Application-based:
- Custom cookie: generated by target app, name specified per target group (avoid reserved: AWSALB, AWSALBAPP, AWSALBTG);
- Application cookie: generated by load balancer, name: AWSALBAPP;
Duration-based: generated by load balancer, expiration controlled; name: AWSALB (ALB), AWSELB (CLB).

Cross-Zone Load Balancing: distributes traffic evenly across all registered instances in all AZs (not just per-node).

Load Balancer	Default	Inter-AZ Data Charges
ALB	✅ Enabled	No charges
NLB & GLB	❌ Disabled	Charges if enabled
CLB	❌ Disabled	No charges

SSL/TLS: encrypts traffic in transit (in-flight encryption) between clients and load balancer.

SSL (Secure Sockets Layer) → TLS (Transport Layer Security) is newer version;
Uses X.509 certificates (public certs from CAs: Let’s Encrypt, Comodo, etc.);
Certificates have expiration date - must be renewed.

Load Balancer - SSL Certificates:

Uses X.509 certificate; manage via ACM or upload your own;
HTTPS listener: must specify default cert, can add list for multiple domains;
SNI (Server Name Indication): client specifies hostname → LB returns correct cert;
Security policies: can support older SSL/TLS versions for legacy clients.

HTTP → HTTPS Redirect:

Configure ALB listener rule on port 80 to redirect to HTTPS (port 443);
ALB handles redirect automatically - returns 301/302 to client;
Setup: ALB → Listeners → HTTP:80 → Add rule → Redirect to HTTPS:443.

⚠️ Exam trap: DNS cannot redirect HTTP→HTTPS (DNS only resolves names to IPs, no protocol handling).

SNI (Server Name Indication): allows multiple SSL certs on one server (multiple websites).

Client indicates hostname in SSL handshake → server returns correct cert;
ALB auto-selects optimal certificate based on hostname;
Use case: single HTTPS listener serving users.example.com + checkout.example.com with different certs;
Supported: ALB, NLB, CloudFront;
NOT supported: CLB (need separate CLB per domain).

⚠️ Exam traps:

HTTP→HTTPS redirect ≠ multi-domain certs (redirect is for port 80→443, not SNI);
Security Groups ≠ SSL certs (SGs are firewalls, not certificate managers).

Load Balancer	SSL Certificates	SNI Support
CLB	1 only (need multiple CLBs for multiple domains)	❌ No
ALB	Multiple (via multiple listeners)	✅ Yes
NLB	Multiple (via multiple listeners)	✅ Yes

Connection Draining / Deregistration Delay: time to complete in-flight requests while instance is de-registering or unhealthy.

Naming: Connection Draining (CLB) / Deregistration Delay (ALB & NLB);
Stops sending new requests to de-registering instance;
Duration: 1-3600 seconds (default: 300); set to 0 to disable;
Tip: use low value for short requests.

Auto Scaling Group (ASG): ensures optimal capacity by automatically scaling EC2 instances.

Scale out: add instances when load increases;
Scale in: remove instances when load decreases;
Self-healing: replace unhealthy instances automatically.

Launch Template vs Launch Configuration:

Feature	Launch Configuration (legacy)	Launch Template (recommended)
Multiple instance types	❌ Single type only	✅ Multiple types
Mixed On-Demand + Spot	❌	✅
Versioning	❌	✅
Capacity Reservations	❌	✅
Status	Legacy (deprecated)	Recommended

⚠️ Exam trap: “Mix On-Demand + Spot across multiple instance types in ASG” → Launch Template only. Launch Configuration supports single instance type, single purchase option. AWS recommends Launch Templates for all new ASGs.

ASG + ALB Integration:
┌─────────────────────────────────────────────────┐
│            Auto Scaling Group                   │
│  ┌───────┐   ┌───────┐   ┌───────┐             │
│  │  EC2  │   │  EC2  │   │  EC2  │   ...       │
│  └───┬───┘   └───┬───┘   └───┬───┘             │
│      └───────────┴───────────┘                 │
│              ▲ Health Checks                    │
│  ALB ────────┘                                  │
│                                                 │
│  Scale Out ◄── CloudWatch Alarm (CPU>70%)      │
│  Scale In  ◄── CloudWatch Alarm (CPU<30%)      │
└─────────────────────────────────────────────────┘

ASG Health Check Types:

Type	What it checks	Status
EC2	Instance running (hardware/hypervisor)	Always on
ELB	App responds on health endpoint	Optional (additive)

When ELB enabled: must pass both EC2 + ELB to be healthy.

Unhealthy instance behavior: ASG terminates instance → launches new one.

⚠️ Exam trap: ASG never “restarts the app” or “detaches and leaves running” - always terminates + replaces.

Auto Scaling Groups - Capacity characteristics:

Minimum capacity (number of instances);
Desired capacity (number of instances);
Scale as needed (required by demand capacity);
Maximum capacity (number of instances).

Auto Scaling Groups - Scaling Strategies: Manual Scaling: Update the size of an ASG manually; Dynamic Scaling: Respond to changing demand: - Simple / Step Scaling: threshold-based, you define actions; - Example: CPU > 70% → add 2 units; CPU < 30% → remove 1 unit; - Target Tracking Scaling: “keep metric at X” - ASG auto-adjusts (like thermostat); - Example: avg 1000 connections/instance, 70% CPU, 50 requests/target; - Scheduled Scaling: time-based, for predictable patterns; - Example: scale to 10 instances every Monday 9am, scale down Friday 6pm; - Predictive Scaling: ML-based, proactive; - Uses Machine Learning to predict future traffic ahead of time;

Custom Metrics for Scaling:

Built-in metrics: CPU, Network, ALB request count per target;
Custom metrics: anything else (DB connections, queue depth, app-specific);
How: App → CloudWatch PutMetricData API → Custom Metric → Alarm → ASG.

⚠️ Exam trap: “Detailed Monitoring” only increases EC2 metric frequency (1min vs 5min) - does NOT add new metric types. For app-specific metrics like “DB requests/min” → Custom Metric required.

Decoupling Services (Microservices Approach):

Why decouple? Synchronous communication can be problematic with sudden traffic spikes.

Example: Need to encode 1000 videos but usually it’s 10 → synchronous = bottleneck

Application Communication Patterns:

1) Synchronous (app-to-app):        2) Asynchronous (app-to-queue-to-app):

┌─────────┐      ┌──────────┐       ┌─────────┐    ┌───────┐    ┌──────────┐
│ Buying  │◀────▶│ Shipping │       │ Buying  │───▶│ Queue │───▶│ Shipping │
│ Service │      │ Service  │       │ Service │    │       │    │ Service  │
└─────────┘      └──────────┘       └─────────┘    └───────┘    └──────────┘
   (tight coupling)                    (decoupled, scales independently)

Decoupling services:

SQS: Queue model (point-to-point)
SNS: Pub/Sub model (one-to-many)
Kinesis: Real-time streaming model

These services scale independently from your application!

⚠️ Exam trap - “Services that buffer or throttle traffic spikes”:

✅ API Gateway = throttling (rate limits, burst limits, returns 429)
✅ SQS = buffering (queues absorb spikes, consumers process at own pace)
✅ Kinesis = buffering (stream absorbs spikes, consumers read at own rate)
❌ SNS = push only, no buffering (delivers immediately or fails)
❌ ELB = distributes traffic, doesn’t throttle or buffer
❌ Lambda = compute, not a buffer/throttle service
❌ Gateway Endpoints = VPC private access to S3/DynamoDB, not throttling

Amazon SQS (Simple Queue Service) is fully managed, serverless messaging service that is used to decouple aplications. Send, store, and receive messages between software components, without losing messages or requiring other services to be available. In Amazon SQS, an application sends messages into a queue. A user or service retrieves a message from the queue, processes it, and then deletes it from the queue.

Amazon SNS is a fully managed, serverless, publish/subscribe notification service. Using Amazon SNS topics, a publisher publishes messages to subscribers:

Web servers;
Email addresses;
AWS Lambda functions;
Amazon SQS;
Amazon Kinesis Data Firehose. And more.

Amazon MQ is a managed message broker service for RabbitMQ and ActiveMQ.

Amazon Kinesis is a managed service to collect, process and analyze real-time streaming data at any scale.

Kinesis Data Streams: low latency streaming to ingest data at scale from hundreds of thousands of source;
Kinesis Data Firehose: load stream into S3, Redshift, ElasticSearch, etc;
Kinesis Data Analytics: perform real-time analytics on streams using SQL;
Kinesis Video Streams: monitor real-time video streams for analytics or ML.

Amazon SQS – Standard Queue

• Oldest offering (over 10 years old) • Fully managed service, used to decouple applications • Attributes: • Unlimited throughput, unlimited number of messages in queue • Default retention of messages: 4 days, maximum of 14 days • Low latency (<10 ms on publish and receive) • Limitation of 256KB per message sent • Can have duplicate messages (at least once delivery, occasionally) • Can have out of order messages (best effort ordering)

SQS Queue - Multiple Producers & Consumers:

┌──────────┐                              ┌──────────┐
│ Producer │──┐                     ┌────▶│ Consumer │
└──────────┘  │                     │     └──────────┘
┌──────────┐  │    ┌───────────┐    │     ┌──────────┐
│ Producer │──┼───▶│ SQS Queue │────┼────▶│ Consumer │
└──────────┘  │    └───────────┘    │     └──────────┘
┌──────────┐  │   (Send messages)   │     ┌──────────┐
│ Producer │──┘                     └────▶│ Consumer │
└──────────┘                              └──────────┘
                                    (Poll messages)

Producing Messages:

Use SDK (SendMessage API)
Message persisted in SQS until consumer deletes it
Message size: up to 256 KB
Can include any attributes (order ID, customer ID, etc.)

Consuming Messages:

Consumers: EC2, Lambda, on-prem servers
Poll SQS for messages (receive up to 10 messages at a time)
Process the message (e.g., insert into RDS)
Delete using DeleteMessage API

SQS Message Flow:

                Poll/Receive              Process
SQS Queue ─────────────────▶ Consumer ─────────────▶ RDS
    ▲                            │
    │         DeleteMessage      │
    └────────────────────────────┘

Multiple Consumers (Horizontal Scaling):

Consumers receive and process messages in parallel
Each message goes to one consumer (different messages to different consumers)
Scale consumers horizontally to improve throughput

SQS with Auto Scaling Group:

                  SendMessage                    ReceiveMessages
┌────────────┐        │        ┌───────────┐         │        ┌────────────┐
│ Front-end  │────────┼───────▶│ SQS Queue │─────────┼───────▶│ Back-end   │
│ Web App    │        │        │(infinitely│         │        │ Processing │
│  (ASG)     │        │        │ scalable) │         │        │   (ASG)    │
└────────────┘        │        └─────┬─────┘         │        └────────────┘
                                     │
                    ┌────────────────┴────────────────┐
                    ▼                                 ▼
          CloudWatch Metric                   CloudWatch Alarm
    (ApproximateNumberOfMessages)                    │
                                                     ▼
                                              Scale ASG

⚠️ Exam trap: “Scale consumers based on queue depth” → use CloudWatch Alarm on ApproximateNumberOfMessages

At-Least-Once Delivery: SQS prioritizes never losing messages over exactly-once delivery. Duplicates can occur when:

Consumer takes too long → visibility timeout expires → message reappears → another consumer gets it
Internal replication timing (SQS stores across multiple servers)

Solution: Make consumers idempotent (processing same message twice = same result) or use FIFO queue (exactly-once)

⚠️ Exam trap: “Prevent duplicate processing in SQS” → use FIFO queue or idempotent consumers

[NEW INFO]

Amazon SQS – Security

Security Layer	Options
In-flight encryption	HTTPS API
At-rest encryption	KMS keys (SSE-KMS)
Client-side encryption	Customer manages encrypt/decrypt
Access Controls	IAM policies for SQS API access
SQS Access Policies	Resource-based (like S3 bucket policies)

SQS Access Policies use cases:

Cross-account access to SQS queues
Allow other services (SNS, S3, etc.) to write to SQS queue

⚠️ Exam trap: “Allow S3 to send notifications to SQS” → SQS Access Policy (not IAM policy)

Amazon SQS – Long Polling

Consumer can wait for messages if queue is empty (instead of returning immediately)
Reduces API calls, increases efficiency, reduces latency
Wait time: 1 to 20 seconds (20 sec recommended)
Long Polling > Short Polling (always prefer long polling)

Polling Type	Behavior	API Calls	Cost
Short Polling	Returns immediately (even if empty)	High	Higher (pay per request!)
Long Polling	Waits up to 20 sec for messages	Low	Lower

Why Long Polling saves $$$: SQS pricing = per 1 million requests. Short polling = constant empty responses = wasted money.

Enable Long Polling:

Queue level: set ReceiveMessageWaitTimeSeconds
API level: use WaitTimeSeconds parameter

⚠️ Exam trap: “Reduce SQS costs” or “reduce empty responses” → enable Long Polling

Amazon SQS – Message Visibility Timeout

After message is polled, it becomes invisible to other consumers
Default: 30 seconds (consumer has 30 sec to process & delete)
If not processed in time → message becomes visible again → duplicate processing

Visibility Timeout Timeline:

  ReceiveMessage     ReceiveMessage     ReceiveMessage        ReceiveMessage
      │                   │                  │                      │
      ▼                   ▼                  ▼                      ▼
──────┬───────────────────────────────────────┬────────────────────────▶ Time
      │         Visibility Timeout            │
      │◄─────────────────────────────────────▶│
      │                                       │
   Message              Not returned          Message returned
   returned            (invisible)              (again!)

Visibility Timeout	Risk
Too low (seconds)	Duplicates (message reappears before processing done)
Too high (hours)	Long wait if consumer crashes (re-processing delayed)

Extend timeout: Call ChangeMessageVisibility API to get more time

⚠️ Exam trap: “Consumer needs more time” → use ChangeMessageVisibility API

Amazon SQS – FIFO Queue

FIFO = First In First Out (guaranteed ordering)
Queue name must end with .fifo

FIFO Queue Flow:

┌──────────┐   Send messages    ┌─────────────┐   Poll messages   ┌──────────┐
│ Producer │───────────────────▶│ FIFO Queue  │──────────────────▶│ Consumer │
└──────────┘   [4][3][2][1]     │   ▶|||◀     │   [4][3][2][1]    └──────────┘
                                └─────────────┘
                              (same order in/out)

Feature	Standard Queue	FIFO Queue
Throughput	Unlimited	300 msg/s (3000 with batching)
Ordering	Best effort	Guaranteed (by Message Group ID)
Delivery	At-least-once	Exactly-once (Deduplication ID)
Duplicates	Possible	Removed via Deduplication ID

FIFO Required Parameters:

Parameter	Type	Purpose	Example
Message Group ID	Tag	Groups messages for ordering. Same group = processed in order	`customer_123` or `order_456`
Message Deduplication ID	Token	Prevents duplicates. Same ID within 5 min = rejected	`txn_789` or hash of message body

Use Cases:

Message Group ID: E-commerce orders → group by order_id so all updates for same order are processed in sequence
Deduplication ID: Payment processing → same payment_id won’t be processed twice if retry happens

⚠️ Exam trap: “Need ordering” → FIFO queue. “Need exactly-once” → FIFO queue with Deduplication ID

SQS as a Buffer to Database Writes

Problem: Direct writes to DB under heavy load → transactions lost

Without SQS (transactions may be lost):

requests ───▶ ┌─────────────┐ ─── Insert ───▶ ┌─────────────┐
              │ Application │   transactions   │   RDS /     │
              │   (ASG)     │ ────────────────▶│   Aurora /  │
              └─────────────┘                  │   DynamoDB  │
                    │                          └─────────────┘
              (overwhelmed)

With SQS Buffer (no data loss):

              ┌─────────────┐                  ┌─────────────┐
requests ───▶ │ Enqueue App │  SendMessage     │ Dequeue App │   insert
              │    (ASG)    │ ───────────────▶ │    (ASG)    │ ──────────▶ DB
              └─────────────┘    ┌───────┐     └─────────────┘
                                 │  SQS  │
                                 │Queue  │
                                 │(buffer)│
                                 └───────┘
                            (infinitely scalable)

Use case: Protect database from write spikes, decouple producers from consumers

⚠️ Exam trap: “Database overwhelmed by writes” → use SQS as buffer

[NEW INFO]

Amazon SNS – Simple Notification Service

Pub/Sub model: One message to many receivers (vs SQS point-to-point)

Direct Integration vs Pub/Sub:

Direct (tight coupling):              Pub/Sub (decoupled):

┌─────────┐ ──▶ Email                 ┌─────────┐         ┌─────────────┐
│ Buying  │ ──▶ Fraud Service         │ Buying  │──▶ SNS ─┼──▶ Email    │
│ Service │ ──▶ Shipping              │ Service │   Topic │──▶ Fraud    │
│         │ ──▶ SQS Queue             └─────────┘         │──▶ Shipping │
└─────────┘                           (1 publish)         │──▶ SQS Queue│
(4 integrations to maintain)                              └─────────────┘
                                                         (add subscribers easily)

SNS Limits:

Up to 12,500,000 subscriptions per topic
Up to 100,000 topics

Subscribers: SQS, Lambda, Kinesis Data Firehose, Email, SMS, HTTP(S) endpoints

AWS Services → SNS (built-in integrations):

┌─────────────────────────────────────────────┐
│ CloudWatch Alarms  │  AWS Budgets  │ Lambda │      publish
│ ASG (Notifications)│  S3 (Events)  │ DynamoDB│ ───────────▶  SNS
│ CloudFormation     │  AWS DMS      │ RDS    │
│ (State Changes)    │ (New Replica) │ Events │
└─────────────────────────────────────────────┘

Amazon SNS – How to Publish

Method	Use Case	Steps
Topic Publish (SDK)	Standard notifications	Create topic → Create subscription(s) → Publish
Direct Publish (Mobile SDK)	Mobile push	Create platform app → Create endpoint → Publish

Mobile Push Platforms: Google GCM, Apple APNS, Amazon ADM

⚠️ Exam trap: “Send notification to multiple services at once” → SNS (not SQS)

SNS + SQS: Fan Out Pattern

Push once to SNS, receive in all SQS queues that are subscribers

SNS + SQS Fan Out:

┌─────────┐         ┌───────────┐         ┌───────────┐         ┌─────────────┐
│ Buying  │────────▶│ SNS Topic │────────▶│ SQS Queue │────────▶│ Fraud       │
│ Service │         └─────┬─────┘         └───────────┘         │ Service     │
└─────────┘               │                                     └─────────────┘
                          │               ┌───────────┐         ┌─────────────┐
                          └──────────────▶│ SQS Queue │────────▶│ Shipping    │
                                          └───────────┘         │ Service     │
                                                                └─────────────┘

Benefits:

Fully decoupled, no data loss (SQS persists messages)
SQS enables: data persistence, delayed processing, retries
Add more SQS subscribers anytime
Cross-Region Delivery: works with SQS queues in other regions

Required: SQS queue access policy must allow SNS to write

SNS + SQS FIFO: Fan Out with Ordering

Need fan out + ordering + deduplication? Use SNS FIFO + SQS FIFO

SNS FIFO + SQS FIFO Fan Out:

┌─────────┐         ┌────────────┐         ┌────────────┐         ┌─────────┐
│ Buying  │────────▶│ SNS FIFO   │────────▶│ SQS FIFO   │────────▶│ Fraud   │
│ Service │         │ Topic      │         │ Queue      │         │ Service │
└─────────┘         └─────┬──────┘         └────────────┘         └─────────┘
                          │                ┌────────────┐         ┌─────────┐
                          └───────────────▶│ SQS FIFO   │────────▶│Shipping │
                                           │ Queue      │         │ Service │
                                           └────────────┘         └─────────┘

SNS FIFO Topic:

Same features as SQS FIFO: Message Group ID (ordering), Deduplication ID
Subscribers: SQS Standard or FIFO queues only
Throughput: Limited (same as SQS FIFO: 300/3000 msg/s)

SNS – Message Filtering

JSON policy to filter messages per subscription (subscribers only get what they need)

SNS Message Filtering:

                    Message:
                    Order: 1036
┌─────────┐         Product: Pencil        ┌─────────────────────────────────────┐
│ Buying  │──────▶  State: Placed  ──────▶ │           SNS Topic                 │
│ Service │                                └──────────────┬──────────────────────┘
└─────────┘                                               │
                                                          │
                    ┌─────────────────────────────────────┼─────────────────────┐
                    │                                     │                     │
            Filter: State=Placed              Filter: State=Cancelled    No Filter
                    │                                     │                     │
                    ▼                                     ▼                     ▼
            ┌───────────────┐                     ┌───────────────┐     ┌───────────────┐
            │ SQS (Placed)  │                     │ SQS (Cancelled)│    │ SQS (All)     │
            └───────────────┘                     └───────────────┘     └───────────────┘

No filter policy = receives ALL messages

⚠️ Exam trap: “Route different message types to different queues” → SNS Filter Policy

Application: S3 Events to Multiple Queues

Problem: S3 allows only one event rule per combination of event type + prefix

Solution: S3 → SNS → Fan out to multiple SQS queues

S3 Events Fan Out:

┌───────────┐   events   ┌───────────┐   fan-out   ┌───────────┐
│ S3 Object │───────────▶│ SNS Topic │────────────▶│ SQS Queue │
│ Created   │            └─────┬─────┘             └───────────┘
└───────────┘                  │                   ┌───────────┐
                               ├──────────────────▶│ SQS Queue │
                               │                   └───────────┘
                               │                   ┌───────────┐
                               └──────────────────▶│ Lambda    │
                                                   └───────────┘

⚠️ Exam trap: “S3 event to multiple destinations” → S3 → SNS → Fan out

SNS can send to Kinesis Data Firehose → then to any KDF destination

SNS → Kinesis Data Firehose → S3:

┌─────────┐         ┌───────────┐         ┌─────────────────┐         ┌────────┐
│ Buying  │────────▶│ SNS Topic │────────▶│ Kinesis Data    │────────▶│   S3   │
│ Service │         └───────────┘         │ Firehose        │         └────────┘
└─────────┘                               └─────────────────┘
                                          (or any KDF destination)

Amazon SNS – Security

Security Layer	Options
In-flight encryption	HTTPS API
At-rest encryption	KMS keys
Client-side encryption	Customer manages encrypt/decrypt
Access Controls	IAM policies for SNS API access
SNS Access Policies	Resource-based (like S3 bucket policies)

SNS Access Policies use cases:

Cross-account access to SNS topics
Allow other services (S3, etc.) to write to SNS topic

⚠️ Exam trap: “Allow S3 to publish to SNS” → SNS Access Policy (not IAM policy)

Feature	SQS	SNS	Kinesis
Model	Pull (consumers poll)	Push (to subscribers)	Pull (standard) / Push (enhanced fan-out)
Data persistence	Deleted after consumed	Not persisted (lost if not delivered)	Retained up to 365 days
Replay capability	No	No	Yes
Consumers/Subscribers	Unlimited workers	12.5M subscribers, 100K topics	2 MB/shard (standard), 2 MB/shard/consumer (enhanced)
Throughput	No provisioning needed	No provisioning needed	Provisioned or On-demand
Ordering	FIFO queues only	FIFO topics (for SQS FIFO)	Per shard (Partition ID)
Delay	Individual message delay	No	No
Use case	Decouple apps, buffer	Fan-out notifications	Real-time big data, analytics, ETL

Amazon Kinesis Data Streams

Collect and store streaming data in real-time

Kinesis Data Streams Flow:

┌─────────────────┐                                      ┌──────────────────┐
│ Click Streams   │                                      │ Application      │
│ IoT Devices     │──┐    ┌──────────────────────┐   ┌──▶│ Lambda           │
│ Metrics & Logs  │  │    │ Kinesis Data Streams │   │   │ Data Firehose    │
└─────────────────┘  │    │  ┌────┬────┬────┐    │   │   │ Apache Flink     │
                     ├───▶│  │Shard│Shard│Shard│  │───┘   └──────────────────┘
┌──────────────────┐ │    │  └────┴────┴────┘    │              Consumers
│ Producers:       │ │    └──────────────────────┘
│ - Applications   │─┘
│ - Kinesis Agent  │
└──────────────────┘

Key Features:

Retention: 1 day (default) up to 365 days
Replay capability (reprocess data)
Data is immutable (can’t delete until expiry)
Max record size: 1 MB
Ordering by Partition ID (same partition = same shard = ordered)
Encryption: KMS (at-rest), HTTPS (in-flight)

Libraries:

KPL (Kinesis Producer Library): optimized producer
KCL (Kinesis Client Library): optimized consumer

Kinesis Data Streams – Capacity Modes

Mode	Provisioning	Throughput	Scaling	Pricing
Provisioned	Choose # of shards	1 MB/s in, 2 MB/s out per shard	Manual	Per shard/hour
On-Demand	Automatic	Default 4 MB/s in	Auto (based on last 30 days peak)	Per stream/hour + data in/out

Switching modes: Console or CLI, no downtime, but limited to 2 switches per 24 hours

ProvisionedThroughputExceeded: Add more shards or switch to On-Demand mode

⚠️ Exam trap: “Unpredictable traffic spikes in Kinesis” → On-demand mode

⚠️ Exam trap: “ProvisionedThroughputExceeded in Kinesis” → Add shards or use On-Demand mode

⚠️ Exam trap: Why NOT “SQS as buffer to Kinesis”? Seems logical (SQS handles any spike, buffers for Kinesis). But adds latency (no longer real-time), complexity, and the bottleneck just MOVES to where SQS writes to Kinesis Solution: Scale Kinesis directly (add shards) — don’t work around it

⚠️ Exam trap: “Need to replay streaming data” → Kinesis Data Streams (not Firehose, not SQS)

Amazon Data Firehose

Load streaming data into destinations (fully managed, no code)

Data Firehose Flow:

┌─────────────────┐                                     ┌─────────────────────┐
│ Producers:      │                                     │ AWS Destinations:   │
│ - Kinesis Streams│     ┌─────────────────────┐        │ - S3               │
│ - CloudWatch    │     │                     │        │ - Redshift         │
│ - AWS IoT       │────▶│  Data Firehose      │───────▶│ - OpenSearch       │
│ - SNS           │     │  (batch writes)     │        ├─────────────────────┤
│ - SDK/Agent     │     │       │             │        │ 3rd Party:         │
└─────────────────┘     │       ▼             │        │ - Splunk, Datadog  │
                        │  Lambda Transform   │        │ - MongoDB, NewRelic│
    Record up to 1MB    └─────────────────────┘        ├─────────────────────┤
                              │                        │ Custom: HTTP endpoint│
                              ▼                        └─────────────────────┘
                        S3 Backup Bucket
                        (all or failed data)

Note: SQS is NOT a Firehose producer (SQS → Firehose requires Lambda in between)

Key Features:

Fully managed, serverless, auto-scaling
Near real-time (buffering by size/time)
Supports: CSV, JSON, Parquet, Avro, Raw, Binary
Transformations: CSV→JSON, Parquet/ORC conversion, compression (gzip/snappy)
Lambda: Custom data transformation

Kinesis Data Streams vs Data Firehose

Feature	Kinesis Data Streams	Data Firehose
Purpose	Streaming data collection	Load data to destinations
Management	Producer/Consumer code needed	Fully managed
Latency	Real-time (~200ms)	Near real-time (buffering)
Scaling	Provisioned / On-Demand	Automatic
Data Storage	Up to 365 days	No storage
Replay	✅ Yes	❌ No
Destinations	Custom consumers	S3, Redshift, OpenSearch, 3rd party, HTTP
Data Transformation	❌ No (raw data)	✅ Yes (Lambda, format conversion)

⚠️ Exam trap: “Real-time streaming” → Kinesis Data Streams. “Near real-time” → Data Firehose

⚠️ Exam trap: “Transform data while streaming to S3” → Data Firehose (only service with built-in transformation)

⚠️ Exam trap: “Load streaming data directly to S3” → Data Firehose (not Kinesis Data Streams)

Amazon MQ

Managed message broker for RabbitMQ and ActiveMQ (migration path for on-prem apps)

When to use Amazon MQ vs SQS/SNS:

SQS/SNS: Cloud-native, proprietary AWS protocols, scales massively
Amazon MQ: Migrating on-prem apps using open protocols (MQTT, AMQP, STOMP, OpenWire, WSS)

Feature	SQS/SNS	Amazon MQ
Protocols	AWS proprietary	MQTT, AMQP, STOMP, OpenWire, WSS
Scaling	Serverless, unlimited	Runs on servers, limited scaling
Use case	New cloud-native apps	Migrate existing on-prem apps
Features	Queue (SQS) OR Topic (SNS)	Both queue AND topic features

Amazon MQ Supported Protocols:

MQTT — lightweight IoT protocol (publish/subscribe)
AMQP — Advanced Message Queuing Protocol
STOMP — Simple Text Oriented Messaging Protocol
OpenWire — ActiveMQ native protocol
WSS — WebSocket Secure
JMS/NMS — Java/.NET Messaging Service APIs

⚠️ Exam trap: “Migrate on-prem app using MQTT/AMQP/STOMP” → Amazon MQ (SQS/SNS don’t support these protocols)

Amazon MQ High Availability (Multi-AZ):

                          Region (us-east-1)
                    ┌─────────────────────────────────┐
                    │      AZ (us-east-1a)            │
                    │    ┌─────────────────┐          │
           ┌───────▶│    │ ACTIVE Broker   │◀────┐    │
           │        │    └─────────────────┘     │    │
           │        ├────────────────────────────┼────┤
┌────────┐ │        │      AZ (us-east-1b)       │    │
│ Client │─┤        │    ┌─────────────────┐     │    │    ┌─────────┐
└────────┘ │        │    │ STANDBY Broker  │◀────┼────┼───▶│ Amazon  │
           │        │    └─────────────────┘     │    │    │ EFS     │
           └───────▶│         (failover)         │    │    │(storage)│
                    └─────────────────────────────────┘    └─────────┘

High Availability:

Active/Standby brokers across 2 AZs
Amazon EFS for shared storage (messages persist across failover)
Automatic failover

⚠️ Exam trap: “Migrate on-prem RabbitMQ/ActiveMQ to AWS” → Amazon MQ (not SQS/SNS)

⚠️ Exam trap: “SNS FIFO topic subscribers” → SQS queues only (Standard or FIFO)

🎯 MASTER SUMMARY: Messaging Services Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Decoupling = Scaling Independence

The entire point of messaging services is breaking tight coupling. When you see:

“Traffic spikes” → decouple with queue
“Service overwhelmed” → buffer in front of it
“Transactions lost” → queue absorbs, processes at safe pace

SQS is the universal buffer. Infinite throughput, never loses messages. When anything is overwhelmed (database, API, service), put SQS in front.

Two fundamental patterns:

Pull (SQS): Consumer decides when to get messages. Good for: rate limiting, batch processing, handling backpressure.
Push (SNS): Producer broadcasts, subscribers receive immediately. Good for: notifications, fan-out, event-driven.

Fan Out = SNS + SQS combined. SNS pushes to multiple SQS queues. Each queue processes independently. Best of both worlds.

Principle 3: Persistence = Can You Replay?

Service	Stores Data?	Replay?	Retention
SQS	Until consumed	❌ No	4 days default, 14 max
SNS	Never	❌ No	None (deliver or lose)
Kinesis Streams	Up to 365 days	✅ Yes	1 day default, 365 max
Firehose	Never	❌ No	None (pass-through)

If they need to reprocess old data → Kinesis Data Streams is the ONLY option.

Principle 4: Real-time vs Near Real-time

Real-time (~200ms): Kinesis Data Streams. You write consumers, you manage shards.
Near real-time (buffered): Data Firehose. Fully managed, batches writes, transforms data.

Firehose is “lazy Kinesis” — easier but slower. If latency matters, use Streams.

Principle 5: FIFO = Ordering + Exactly-Once

Standard queues/topics are fast but messy (duplicates possible, order not guaranteed).

FIFO queues/topics trade throughput (300 msg/s) for guarantees:

Message Group ID = ordering within group
Deduplication ID = no duplicates within 5 min window

Principle 6: Access Control = Who’s Calling?

IAM Policy: Controls what YOUR users/roles can do
Resource Policy (Access Policy): Controls what OTHER accounts/services can do

Cross-account? Other AWS service (S3, SNS)? → Resource-based policy.

Principle 7: Protocol = Cloud-Native vs Legacy

SQS/SNS = AWS proprietary SDKs. Great for new apps. Amazon MQ = Open protocols (MQTT, AMQP, STOMP). For migrating existing apps.

Keyword “migrate” + “existing broker” + “no code changes” = Amazon MQ

Principle 8: Scaling Strategy

Service	How to Scale
SQS	Automatic (just add consumers)
SNS	Automatic
Kinesis	Add shards (Provisioned) or use On-Demand
Firehose	Automatic

ProvisionedThroughputExceeded = add shards or switch to On-Demand.

Part 2: Decision Tree (Follow Keywords → Find Answer)

Step 1: What’s the communication pattern?

                        What's the pattern?
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
   One-to-One            One-to-Many         Continuous Stream
        │                     │                     │
        ▼                     ▼                     ▼
      SQS                   SNS              Need replay?
                       (or Fan Out)               │
                                          ┌───────┴───────┐
                                          ▼               ▼
                                         Yes              No
                                          │               │
                                          ▼               ▼
                                    Kinesis DS      Firehose

Step 2: Feature-Based Decision Table

If question mentions…	Answer is…
“ordering” or “sequence”	SQS/SNS FIFO
“exactly-once” or “no duplicates”	FIFO + Deduplication ID
“replay” or “reprocess”	Kinesis Data Streams
“transform while streaming”	Data Firehose + Lambda
“load directly to S3/Redshift”	Data Firehose
“multiple destinations from one event”	SNS Fan Out
“filter messages per subscriber”	SNS Filter Policy
“MQTT/AMQP/STOMP protocol”	Amazon MQ
“cross-account access”	Resource-based policy (SQS/SNS Access Policy)
“reduce costs” + SQS	Long Polling
“database overwhelmed”	SQS as buffer
“unpredictable traffic” + Kinesis	On-Demand mode
“ProvisionedThroughputExceeded”	Add shards or On-Demand
“consumer needs more time”	`ChangeMessageVisibility` API
“scale based on queue depth”	CloudWatch Alarm on `ApproximateNumberOfMessages`

The “NOT” Rules (Eliminate Wrong Answers Fast)

Statement	Why It’s Wrong
SNS for replay	SNS does NOT persist messages
SQS pushes to consumers	SQS is pull-only (consumers poll)
Kinesis Streams transforms data	Streams = raw data only, Firehose transforms
Firehose for replay	Firehose does NOT store (pass-through)
SQS/SNS with MQTT	Use Amazon MQ for MQTT/AMQP/STOMP
Lambda subscribes to SNS FIFO	SNS FIFO → SQS queues ONLY
SQS as Firehose producer	SQS needs Lambda to feed Firehose
Kinesis Streams loads to S3 directly	Use Firehose for S3/Redshift loading

Part 3: Scenario Pattern Recognition

Pattern: “Decouple / Buffer / Protect Backend”

Keywords: overwhelmed, spikes, lost transactions, protect database, decouple

Answer: SQS as buffer

requests ──▶ [Front-end ASG] ──▶ [SQS Queue] ──▶ [Back-end ASG] ──▶ Database
                                 (absorbs spike)   (processes safely)

Pattern: “One Event → Multiple Actions”

Keywords: notify multiple services, fan out, broadcast, S3 event to multiple queues

Answer: SNS (or SNS + SQS Fan Out)

[Producer] ──▶ [SNS Topic] ──┬──▶ [SQS Queue 1] ──▶ Service A
                             ├──▶ [SQS Queue 2] ──▶ Service B
                             └──▶ [Lambda]      ──▶ Service C

Pattern: “Need to Replay / Reprocess Data”

Keywords: replay, reprocess, audit trail, re-analyze, multiple consumers read same data

Answer: Kinesis Data Streams

Why: Only service that stores data (up to 365 days) and allows multiple reads.

Pattern: “Stream Data to S3/Redshift/OpenSearch”

Keywords: load streaming data, store in S3, analytics destination, transform while streaming

Answer: Data Firehose

[Any Source] ──▶ [Firehose] ──▶ (optional Lambda) ──▶ S3/Redshift/OpenSearch

Pattern: “Real-time Processing Required”

Keywords: real-time, sub-second, immediate, ~200ms, IoT, clickstream

Answer: Kinesis Data Streams (NOT Firehose — it buffers)

Pattern: “Order Matters / No Duplicates”

Keywords: ordering, sequence, exactly-once, financial transactions, no duplicates

Answer: FIFO queue/topic

Remember: Queue name ends with .fifo. Throughput = 300-3000 msg/s.

Pattern: “Migrate Existing Message Broker”

Keywords: migrate, existing application, RabbitMQ, ActiveMQ, MQTT, AMQP, no code changes

Answer: Amazon MQ

Why: Supports open protocols. SQS/SNS require AWS SDK = code changes.

Pattern: “Reduce Costs / Empty Responses”

Keywords: reduce API calls, empty responses, cost optimization, SQS

Answer: Long Polling (set WaitTimeSeconds up to 20 sec)

Pattern: “Consumer Processing Takes Too Long”

Keywords: timeout, visibility, need more time, duplicate processing

Answer: Increase Visibility Timeout or call ChangeMessageVisibility API

Pattern: “Unpredictable Traffic in Kinesis”

Keywords: unpredictable, variable load, spikes, promotional campaign

Answer: On-Demand mode (auto-scales based on last 30 days peak)

Pattern: “ProvisionedThroughputExceeded Error”

Keywords: throughput exceeded, throttling, Kinesis errors

Answer: Add more shards OR switch to On-Demand

Pattern: “Cross-Account / Allow Other Service”

Keywords: cross-account, allow S3 to write, allow SNS to write

Answer: Resource-based policy (SQS/SNS Access Policy)

Pattern: “Route Different Message Types”

Keywords: filter, route by attribute, different processing per type

Answer: SNS Filter Policy (JSON policy per subscription)

Pattern: “S3 Event to Multiple Destinations”

Keywords: S3 notification, multiple queues, multiple Lambda

Answer: S3 → SNS → Fan Out (S3 allows only one rule per event+prefix combo)

Part 4: Quick Reference Tables

Service Comparison At-a-Glance

Feature	SQS	SNS	Kinesis Streams	Firehose	Amazon MQ
Model	Pull	Push	Pull/Push	Push	Pull/Push
Throughput	Unlimited	Unlimited	Per shard	Auto	Limited
Ordering	FIFO only	FIFO only	Per shard	No	Yes
Persistence	Until consumed	No	Up to 365 days	No	Yes
Replay	❌	❌	✅	❌	❌
Transform	❌	❌	❌	✅	❌
Protocols	AWS SDK	AWS SDK	AWS SDK	AWS SDK	MQTT/AMQP/STOMP

Throughput Numbers

Service	Throughput
SQS Standard	Unlimited
SQS FIFO	300 msg/s (3000 batched)
SNS Standard	Unlimited (12.5M subscribers/topic)
SNS FIFO	300 msg/s (3000 batched)
Kinesis Provisioned	1 MB/s in, 2 MB/s out per shard
Kinesis On-Demand	Auto (default 4 MB/s in)
Data Firehose	Auto-scales

Message Size Limits

Service	Max Message/Record Size
SQS	256 KB
SNS	256 KB
Kinesis	1 MB
Firehose	1 MB

Key APIs to Remember

API	Service	Purpose
`SendMessage`	SQS	Send message to queue
`ReceiveMessage`	SQS	Poll messages (up to 10)
`DeleteMessage`	SQS	Remove processed message
`ChangeMessageVisibility`	SQS	Extend processing time
`Publish`	SNS	Send to topic
`PutRecord` / `PutRecords`	Kinesis	Send to stream

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“replay” / “reprocess”	Kinesis Data Streams
“fan out” / “multiple destinations”	SNS (+ SQS for persistence)
“buffer” / “overwhelmed” / “protect DB”	SQS
“MQTT” / “AMQP” / “migrate broker”	Amazon MQ
“ordering” / “sequence” / “exactly-once”	FIFO
“transform while streaming”	Data Firehose
“load to S3” (from stream)	Data Firehose
“real-time” + streaming	Kinesis Data Streams
“near real-time” + streaming	Data Firehose
“reduce SQS costs”	Long Polling
“empty responses”	Long Polling
“unpredictable Kinesis traffic”	On-Demand mode
“ProvisionedThroughputExceeded”	Add shards / On-Demand
“cross-account” / “allow S3/SNS”	Resource-based policy
“filter per subscriber”	SNS Filter Policy
“S3 event multiple destinations”	S3 → SNS → Fan Out
“consumer needs more time”	ChangeMessageVisibility
“scale on queue depth”	CloudWatch ApproximateNumberOfMessages
“no code changes” + “migrate”	Amazon MQ
“RabbitMQ/ActiveMQ to AWS”	Amazon MQ

Part 6: Elimination Checklist

When stuck between options, eliminate systematically:

□ Do they need REPLAY?
  → No = eliminate Kinesis Data Streams
  → Yes = Kinesis Data Streams is likely answer

□ Do they need PUSH to multiple?
  → No = eliminate SNS
  → Yes = SNS or Fan Out pattern

□ Do they need ORDERING?
  → No = eliminate FIFO options
  → Yes = must be FIFO

□ Do they need REAL-TIME?
  → No = Firehose acceptable
  → Yes = must be Kinesis Data Streams

□ Do they mention OPEN PROTOCOLS (MQTT/AMQP)?
  → No = eliminate Amazon MQ
  → Yes = Amazon MQ is likely answer

□ Do they need DATA TRANSFORMATION?
  → No = Kinesis Streams acceptable
  → Yes = must be Firehose (with Lambda)

□ Is it CROSS-ACCOUNT or OTHER SERVICE access?
  → No = IAM policy
  → Yes = Resource-based policy

🏆 The Golden Rules

Replay = Kinesis Data Streams (only option)
Fan Out = SNS (optionally + SQS)
Buffer = SQS (infinite, never loses)
Open Protocols = Amazon MQ (migrate without code changes)
Order/Exactly-Once = FIFO (trade throughput for guarantees)
Stream to S3 = Firehose (not Kinesis Streams directly)
Transform = Firehose (only streaming service with built-in transform)
Real-time = Kinesis Streams (Firehose buffers = near real-time)
Other Account/Service = Resource Policy (not IAM)
Reduce SQS cost = Long Polling (always)

Solution Architecture

Stateless Web App Evolution: WhatIsTheTime.com A simple app that returns current time — no database needed.

Growth Steps:

Step	Architecture	Problem Solved	New Problem
1	EC2 + Public IP	Works!	IP changes on restart
2	EC2 + Elastic IP	Static IP	Single point of failure, no scaling
3	EC2 + Route 53 (A record)	DNS-based, no Elastic IP needed	Still single instance
4	ELB + multiple EC2	Horizontal scaling, health checks	Manual instance management
5	ELB + ASG	Auto-scaling, self-healing	Single AZ failure risk
6	ELB + ASG + Multi-AZ	High availability across AZs	✅ Production ready!

       ┌──────────┐
       │ Route 53 │ Alias Record
       │   DNS    │ api.whatisthetime.com
       └────┬─────┘
            │
            ▼
       ┌─────────┐      AZ 1-3
       │   ELB   │◄─── Health Checks
       │Multi-AZ │     + Multi-AZ
       └────┬────┘
            │
    ┌───────┼───────┐
    ▼       ▼       ▼
  ┌───┐   ┌───┐   ┌───┐
  │M5 │   │M5 │   │M5 │  ◄── Auto Scaling Group
  │AZ1│   │AZ2│   │AZ3│      (spans 3 AZs)
  └───┘   └───┘   └───┘

Key Concepts Covered:

Public vs Private IP, Elastic IP
Route 53 TTL, A records, Alias Records
ELB Health Checks + Multi-AZ
Manual EC2 vs Auto Scaling Groups
Security Group Rules
Reserved Instances for cost savings

⚠️ Exam trap - Cost Optimization with ASG:

If ASG min capacity = 2 (for HA), those 2 instances run 24/7
Use Reserved Instances for the baseline (min capacity) → up to 72% savings
On-Demand/Spot only needed for instances above min capacity
Don’t confuse: reducing min to 0 or 1 breaks HA requirement!

Stateful Web App Evolution: MyClothes.com (Session State) E-commerce app with shopping cart — needs to maintain user state across requests.

The Problem: With multiple EC2 instances behind ELB, user may hit different server each request → loses shopping cart!

Growth Steps:

Step	Solution	How It Works	Trade-off
1	ELB Sticky Sessions	Cookie ties user to same EC2	Instance failure = lost cart
2	User Cookies	Store cart in browser cookie	Limited size, security risk
3	ElastiCache (Sessions)	Store session in Redis/Memcached	Sub-ms latency, shared state
4	DynamoDB (Sessions)	Alternative to ElastiCache	Serverless, auto-scaling
5	RDS (User Data)	Persist user details, addresses	Need read replicas for scale
6	ElastiCache (Caching)	Cache RDS queries	Reduce DB load
7	Multi-AZ Everything	RDS + ElastiCache Multi-AZ	✅ Production ready!

                        ┌──────────┐
                        │ Route 53 │
                        └────┬─────┘
                             │
    ┌────────────────────────┴──────────────────────┐
    │                    Multi-AZ                   │
    │  ┌─────────┐                                  │
    │  │   ELB   │◄── Open HTTP/HTTPS to 0.0.0.0/0  │
    │  └────┬────┘                                  │
    │       │ Restrict to ELB SG only               │
    │  ┌────┴────┬─────────┐     Auto Scaling Group │
    │  ▼         ▼         ▼                        │
    │┌────┐    ┌────┐    ┌────┐                     │
    ││ M5 │    │ M5 │    │ M5 │  AZ1, AZ2, AZ3      │
    │└──┬─┘    └─┬──┘    └──┬─┘                     │
    └──┼─────────┼──────────┼───────────────────────┘
       │         │          │
       │  Restrict to EC2 SG only
       ▼         ▼          ▼
  ┌─────────┐        ┌─────────┐
  │Elasti-  │        │   RDS   │
  │Cache    │        │Multi-AZ │
  │(sessions│        │+Replicas│
  │+caching)│        └─────────┘
  └─────────┘

3-Tier Security (SG Chaining):

Layer	Security Group Rule
ELB	Inbound: HTTP/HTTPS from `0.0.0.0/0`
EC2	Inbound: Only from ELB SG
RDS/ElastiCache	Inbound: Only from EC2 SG

Key Concepts:

Sticky Sessions — quick fix, but not fault-tolerant
ElastiCache — session storage (Redis) + query caching
DynamoDB — serverless alternative for sessions
RDS Read Replicas — scale reads, Multi-AZ for DR
SG Chaining — reference SGs instead of IPs for tight security

⚠️ Exam trap - Stateless Session Storage:

Storage	Stateless?	Why
ElastiCache	✅ Yes	Shared across all EC2s
RDS/DynamoDB	✅ Yes	Shared across all EC2s
HTTP Cookies	✅ Yes	Client carries state
EBS	❌ No	Single AZ, single EC2 only

EBS makes app stateful — user hitting different EC2 loses session!

Typical 3-Tier Web App Architecture Reference diagram showing production-ready AWS web app with all components:

                              ┌──────────┐
                              │ Route 53 │
                              └────┬─────┘
                                   │
┌──────────────────────────────────┴──────────────────────────────────────┐
│ PUBLIC SUBNET                                                           │
│    ┌─────────────────────────────────────────────────────────────────┐  │
│    │                     ELB (Multi-AZ)                              │  │
│    │               ◄─ Open HTTP/HTTPS to 0.0.0.0/0                   │  │
│    └─────────────────────────┬───────────────────────────────────────┘  │
└──────────────────────────────┼──────────────────────────────────────────┘
                               │
┌──────────────────────────────┴──────────────────────────────────────────┐
│ PRIVATE SUBNET            Auto Scaling Group                            │
│         ┌─────────────┬─────────────┬─────────────┐                     │
│         │   ┌─────┐   │   ┌─────┐   │   ┌─────┐   │                     │
│         │   │ M5  │   │   │ M5  │   │   │ M5  │   │                     │
│         │   │ AZ1 │   │   │ AZ2 │   │   │ AZ3 │   │                     │
│         │   └──┬──┘   │   └──┬──┘   │   └──┬──┘   │                     │
│         └─────────────┴─────────────┴─────────────┘                     │
└─────────────────┼───────────┼───────────┼───────────────────────────────┘
                  │           │           │
┌─────────────────┴───────────┴───────────┴───────────────────────────────┐
│ DATA SUBNET                                                             │
│    ┌─────────────────────┐        ┌─────────────────────┐               │
│    │     ElastiCache     │        │      Amazon RDS     │               │
│    │   ─────────────     │        │   ─────────────     │               │
│    │  Session storage    │        │  Read/write data    │               │
│    │  + Query cache      │        │  (Multi-AZ)         │               │
│    └─────────────────────┘        └─────────────────────┘               │
└─────────────────────────────────────────────────────────────────────────┘

Subnet	Contains	Access
Public	ELB	Open to internet (0.0.0.0/0)
Private	EC2 (ASG)	Only from ELB SG
Data	RDS, ElastiCache	Only from EC2 SG

Stateful Web App Evolution: MyWordPress.com (Shared File Storage) Scalable WordPress with image uploads and MySQL database.

The Problem: Images uploaded to one EC2 won’t be visible from other EC2 instances!

Growth Steps:

Step	Solution	Problem Solved	Limitation
1	Single EC2 + EBS	Simple, works	Single AZ, no scaling
2	Multi EC2 + EBS each	Scaling	Images not shared across instances!
3	Multi EC2 + EFS	Shared storage across AZs	✅ All instances see all images
4	Aurora MySQL	Multi-AZ + Read Replicas built-in	✅ Production ready!

EBS vs EFS for Distributed Apps:

Storage	Scope	Use Case
EBS	Single EC2 in single AZ	Single instance apps
EFS	Shared across EC2s + AZs	Distributed apps (WordPress, CMS)

       ┌──────────┐
       │ Route 53 │
       └────┬─────┘
            │
       ┌────┴────┐
       │   ELB   │ Multi-AZ
       └────┬────┘
            │
    ┌───────┴───────┐
    ▼               ▼
┌────────┐     ┌────────┐
│   M5   │     │   M5   │
│  AZ 1  │     │  AZ 2  │
└───┬────┘     └───┬────┘
    │   ENI        │   ENI
    │              │
    └──────┬───────┘
           │
           ▼
       ┌───────┐
       │  EFS  │ ◄── Shared storage
       │       │     (images visible
       └───────┘      from all EC2s)

Key Concepts:

EBS — single instance only, locked to one AZ
EFS — NFS shared across instances and AZs (via ENI)
Aurora — MySQL/PostgreSQL with easy Multi-AZ + Read Replicas

⚠️ Exam trap: “Shared file storage across multiple EC2 instances” → EFS (not EBS!)

⚠️ Exam trap - Software updates on 100s of EC2s:

❌ EBS snapshots → heavy operations, not dynamic
❌ EBS + replication → complex, single AZ per volume
❌ RDS → databases, not file storage
✅ EFS → mount as network drive, all instances see updates instantly

Instantiating Applications Quickly Launching a full stack (EC2, EBS, RDS) can be slow — install apps, configure, insert data. Use these strategies to speed up:

Golden AMI = AMI standardized through configuration, consistent security patching, and hardening. Contains pre-approved agents for logging, security, and performance monitoring. In Beanstalk, you can specify a custom AMI instead of the standard platform AMI to improve provisioning times.

Resource	Fast Launch Strategy	What It Does
EC2	Golden AMI	Pre-baked image with OS, apps, dependencies
EC2	User Data	Bootstrap script for dynamic config at launch
EC2	Hybrid	Golden AMI + User Data (Elastic Beanstalk approach)
RDS	Restore from Snapshot	DB with schemas + data ready instantly
EBS	Restore from Snapshot	Pre-formatted disk with data

Golden AMI vs User Data:

Approach	Speed	Flexibility	Use Case
Golden AMI	⚡ Fastest	Low (requires rebuild)	Stable configs, rarely change
User Data	Slower	High (scripts)	Dynamic config, secrets
Hybrid	Balanced	Medium	Best of both worlds

⚠️ Exam trap - “Speed up EC2 launch / scale-out”:

Golden AMI = correct answer (pre-baked, boots in seconds)
User Data = runs scripts at boot (slow for 1hr+ installs)
EFS = file storage, not faster boot
RDS = database, irrelevant

⚠️ Exam trap — EC2 User Data facts:

Runs with root privileges (sudo not needed)
By default runs only on first boot (initial launch), NOT on stop/start
Must stop instance to modify user data (can’t change while running)
To run on every boot: configure [scripts-user, always] in cloud-init (non-default)

⚠️ Exam trap - “Static + dynamic installation, reduce boot time”:

✅ Golden AMI (static parts pre-baked) + User Data (dynamic config at boot) = Hybrid approach
❌ “Beanstalk deployment caching” = not a real feature — invented wrong answer
❌ “Store files in S3” = downloading at boot still takes time, doesn’t pre-bake
❌ “User Data to install everything” = defeats the purpose (still 45 min boot)

Elastic Beanstalk

Developer-centric view of deploying apps on AWS — just upload code, Beanstalk handles the rest.

Feature	Details
What it manages	EC2, ASG, ELB, RDS, CloudWatch, etc.
Your responsibility	Application code only
Control	Full control over configuration if needed
Cost	Free (pay only for underlying resources)

Workflow:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Create    │───→│   Upload    │───→│   Launch    │───→│   Manage    │
│ Application │    │   Version   │    │ Environment │    │ Environment │
└─────────────┘    └──────┬──────┘    └─────────────┘    └──────┬──────┘
                          │                                     │
                          │◄────── deploy new version ──────────┘
                          │
                          └──────── update version ─────────────→

Components:

Component	Description
Application	Container for environments, versions, configs
Application Version	Iteration of your code (stored in S3)
Environment	AWS resources running ONE version at a time
Environment Tier	Web Server or Worker

Environment Tiers:

Tier	Use Case	Components
Web Server	HTTP requests	ELB + ASG + EC2
Worker	Background tasks	SQS + ASG + EC2

┌─────────────────────────────────────┐   ┌─────────────────────────────────────┐
│         Web Server Tier             │   │           Worker Tier               │
│  (myapp.us-east-1.elasticbeanstalk) │   │                                     │
├─────────────────────────────────────┤   ├─────────────────────────────────────┤
│              ┌─────┐                │   │            ┌───────────┐            │
│              │ ELB │                │   │            │ SQS Queue │            │
│              └──┬──┘                │   │            └─────┬─────┘            │
│                 │                   │   │          pull messages              │
│     ┌───────────┴───────────┐       │   │       ┌───────────┴───────────┐     │
│     ▼                       ▼       │   │       ▼                       ▼     │
│ ┌────────┐ ASG        ┌────────┐    │   │   ┌────────┐ ASG        ┌────────┐  │
│ │  EC2   │            │  EC2   │    │   │   │  EC2   │            │  EC2   │  │
│ │(WebSrv)│            │(WebSrv)│    │   │   │(Worker)│            │(Worker)│  │
│ └────────┘            └────────┘    │   │   └────────┘            └────────┘  │
│   AZ 1                  AZ 2        │   │     AZ 1                  AZ 2      │
└─────────────────────────────────────┘   └─────────────────────────────────────┘

Worker Tier Details:

Scales based on number of SQS messages in queue
Web Server tier can push messages to Worker’s SQS queue
Use case: long-running tasks, video processing, email sending

Deployment Modes:

Mode	Components	Use Case
Single Instance	Elastic IP + EC2 + RDS	Dev/test
High Availability	ALB + ASG + Multi-AZ RDS	Production

┌─────────────────────────┐   ┌─────────────────────────────────────────────┐
│    Single Instance      │   │     High Availability with Load Balancer   │
│    (Great for dev)      │   │     (Great for prod)                       │
├─────────────────────────┤   ├─────────────────────────────────────────────┤
│      Elastic IP         │   │                  ┌─────┐                   │
│          │              │   │                  │ ALB │                   │
│          ▼              │   │                  └──┬──┘                   │
│     ┌────────┐          │   │        ┌───────────┴───────────┐           │
│     │  EC2   │          │   │        ▼                       ▼           │
│     └────────┘          │   │   ┌────────┐  ASG         ┌────────┐       │
│          │              │   │   │  EC2   │              │  EC2   │       │
│          ▼              │   │   └────────┘              └────────┘       │
│     ┌────────┐          │   │      AZ 1                   AZ 2           │
│     │  RDS   │          │   │        │                       │           │
│     │ Master │          │   │        ▼                       ▼           │
│     └────────┘          │   │   ┌────────┐              ┌────────┐       │
│       AZ 1              │   │   │  RDS   │              │  RDS   │       │
│                         │   │   │ Master │              │Standby │       │
└─────────────────────────┘   │   └────────┘              └────────┘       │
                              │      AZ 1                   AZ 2           │
                              └─────────────────────────────────────────────┘

Supported Platforms: Go, Java SE, Java/Tomcat, .NET Core/Linux, .NET/Windows, Node.js, PHP, Python, Ruby, Docker (Single/Multi-container), Packer Builder

⚠️ Exam trap: Beanstalk is free — you pay for EC2, RDS, ELB, etc. that it provisions!

⚠️ Exam trap - Slow Beanstalk deployments:

If dependencies resolve on every deploy → use Golden AMI with dependencies pre-installed
Configure Beanstalk to use custom AMI instead of default

Deployment Strategies:

Strategy	Downtime	Deploy Time	Rollback	Use Case
All-at-once	Yes ⚠️	⚡ Fastest	Redeploy	Dev/test
Rolling	No	Slow	Redeploy	Prod, cost-conscious
Rolling with batch	No	Slower	Redeploy	Prod, maintain capacity
Immutable	No	Slowest	Terminate new ASG	Prod, safest
Blue/Green	No	Fast	Swap URL	Prod, instant rollback

Deployment Details:

Strategy	How It Works
All-at-once	Deploy to all at same time — brief outage
Rolling	Deploy to batches, old instances serve while updating
Rolling with batch	Like rolling, but spins up NEW instances first (maintains capacity)
Immutable	New ASG with new instances → swap → terminate old ASG
Blue/Green	New environment → Route 53/ELB swap → terminate old env

⚠️ Exam trap - Deployment strategies:

“Zero downtime” → NOT All-at-once
“Fastest rollback” → Blue/Green (just swap back) or Immutable (terminate new)
“Rolling” keeps some old instances running during deploy

.ebextensions:

Folder in app root: .ebextensions/*.config (YAML/JSON)
Customize EC2, ELB, ASG, environment variables, packages
Runs during deployment

Saved Configurations:

Save environment config to S3
Use to clone/recreate environments

Monitoring:

Amazon CloudWatch is a service that monitors applications, responds to performance changes, optimizes resource use, and provides insights into operational health. By collecting data across AWS resources, CloudWatch gives visibility into system-wide performance and allows users to set alarms, automatically react to changes, and gain a unified view of operational health.

Important metrics:

EC2 instances: CPU Utilization, Status Checks, Network (not RAM);
- Default metric every 5 minutes;
- Option for Detailed Monitoring (every 1 minute) costs additional money.
EBS volumes: Disk Read/Writes;
S3 buckets: BucketSizeBytes, NumberOfObjects, AllRequests;
Billing: Total Estimated Charge (only us-east-1);
Service Limits: how much you’ve been using a service API;
Custom metrics: push your own metrics.

Amazon CloudWatch Alarms are used to trigger notifications for any metric.

Auto scaling: increase or decrease EC2 instances “desired” count;
EC2 Actions: stop, terminate, reboot or recover an EC2 instance;
SNS notifications: send a notification into n SNS topic.

Amazon CloudWatch Logs:

Elastic Beanstalk: collection of logs from application;
ECS: collection from container;
AWS Lambda: collections from function logs;
CloudTrail based on filter;
CloudWatch Log Agents: on EC2 machines or on-premises servers;
Route53: log DNS queries.

Amazon EventBridge is a serverless event bus that ingests data from your own apps, SaaS apps, and AWS services and routes that data to targets.

Schedule: Cron jobs (scheduled scripts);
Event Pattern: Event rules to react to a service doing something;
Trigger Lambda functions, send SQS/SNS messages. EventBridge Scheduler is ideal for simple scheduling tasks, providing a straightforward interface and cost-effective solution for triggering events at specific times.

Amazon CloudTrail is an AWS service that helps you enable operational and risk auditing, governance, and compliance of your AWS account. Actions taken by a user, role, or an AWS service are recorded as events in CloudTrail. Get an history of events / API calls made in the AWS Management Console, AWS Command Line Interface, and AWS SDKs and APIs. Can put logs from CloudTrail into CloudWatch Logs or S3. Audit of all users’ events and activities.

AWS X-Ray provides a complete view of requests as they travel through your application and filters visual data across payloads, functions, traces, services, APIs, and more with no-code and low-code motions.

Analyze and debug applications Receive trace data from your simple and complex applications, whether they are in development or production;
Generate a detailed service map: Compile data from your AWS resources to determine bottlenecks in your cloud architecture and improve application performance;
View performance analytics: Compare trace sets with different conditions for root cause analysis purposes;
Audit your data securely: Configure X-Ray to meet your security and compliance objectives. Aimed at analyzing distributed applications.

Amazon CodeGuru is a static application security testing (SAST) tool that combines machine learning (ML) and automated reasoning to identify vulnerabilities in your code, provide recommendations on how to fix the identified vulnerabilities, and track the status of the vulnerabilities until closure.

Amazon CodeGuru Reviewer: automated code reviews for static code analysis (development).
Amazon CodeGuru Profiler: helping to understand the runtime behavior of applications, identify and remove code inefficiencies, improve performance, and significantly decrease compute costs.

AWS Health Dashboard view the overall status and health of AWS services. AWS Health Dashboard - Your Account provides alerts and remediation guidance when AWS is experiencing events that may impact you.

CloudWatch Metrics

CloudWatch provides metrics for every service in AWS
Metric = variable to monitor (CPUUtilization, NetworkIn, etc.)
Metrics belong to namespaces
Dimension = attribute of a metric (instance id, environment, etc.)
- Up to 30 dimensions per metric
Metrics have timestamps
Can create CloudWatch Dashboards of metrics
Can create Custom Metrics (e.g., RAM — not provided by default)

CloudWatch Logs

Log Structure:

Log groups: arbitrary name, usually representing an application
Log stream: instances within application / log files / containers
Expiration policies: never expire, 1 day to 10 years

Log Sources:

SDK, CloudWatch Logs Agent, CloudWatch Unified Agent
Elastic Beanstalk (application logs)
ECS (container logs)
AWS Lambda (function logs)
VPC Flow Logs
API Gateway
CloudTrail (filtered)
Route53 (DNS queries)

Log Destinations:

Amazon S3 (exports)
Kinesis Data Streams
Kinesis Data Firehose
AWS Lambda
OpenSearch

Encryption:

Logs encrypted by default
Can use KMS with your own keys

CloudWatch Logs Insights

Search and analyze log data stored in CloudWatch Logs
Purpose-built query language
Auto-discovers fields from AWS services and JSON log events
Examples: find specific IP, count “ERROR” occurrences
Save queries and add to CloudWatch Dashboards
Query multiple Log Groups in different AWS accounts
⚠️ Query engine, NOT real-time

CloudWatch Logs — S3 Export vs Subscriptions

Method	Latency	Use Case
S3 Export	Up to 12 hours	Batch archive, compliance
Subscriptions	Real-time / Near real-time	Live processing, streaming

S3 Export:

API call: CreateExportTask
Not real-time — use Subscriptions instead

CloudWatch Logs Subscriptions

Get real-time log events from CloudWatch Logs for processing and analysis.

Subscription Filter = filter which logs are delivered to destination

CloudWatch Logs ──► Subscription Filter ──┬──► Lambda (real-time) ──► OpenSearch
                                          │
                                          ├──► Kinesis Firehose (near real-time) ──► S3
                                          │
                                          └──► Kinesis Data Streams ──► KDF/KDA/EC2/Lambda

Subscription Destinations:

Destination	Latency	Use Case
Lambda	Real-time	Transform, send to OpenSearch
Kinesis Firehose	Near real-time	Deliver to S3, Redshift, OpenSearch
Kinesis Data Streams	Real-time	Custom consumers, analytics

CloudWatch Logs Aggregation (Multi-Account & Multi-Region)

Aggregate logs from multiple accounts and regions into a central location:

Account A / Region 1 ──► Subscription Filter ──┐
                                               │
Account B / Region 2 ──► Subscription Filter ──┼──► Kinesis Data Streams ──► Firehose ──► S3
                                               │                            (near real-time)
Account B / Region 3 ──► Subscription Filter ──┘

⚠️ Exam trap: “Aggregate logs from multiple accounts/regions” → Subscription Filters to central Kinesis Data Streams, then Firehose to S3.

CloudWatch Metric Streams

Continually stream CloudWatch metrics to a destination with near-real-time delivery.

CloudWatch Metrics ──► Kinesis Data Firehose ──┬──► S3 ──► Athena
                      (near real-time)         │
                                               ├──► Redshift
                                               │
                                               └──► OpenSearch

Destinations:

Kinesis Data Firehose → S3, Redshift, OpenSearch
3rd party: Datadog, Dynatrace, New Relic, Splunk, Sumo Logic

Features:

Option to filter metrics to stream only a subset
Near-real-time, low latency

⚠️ Exam trap: “Stream metrics to S3/Redshift/3rd party” → CloudWatch Metric Streams via Firehose.

CloudWatch Agent for EC2

By default, NO logs from EC2 go to CloudWatch!

Must run a CloudWatch Agent on EC2 to push logs
Requires proper IAM permissions
Works on on-premises servers too

Agent Types:

Agent	Metrics	Logs	Notes
CloudWatch Logs Agent	❌	✅	Old version, logs only
CloudWatch Unified Agent	✅	✅	Recommended, more metrics

CloudWatch Unified Agent:

Collects system-level metrics (not available by default)
Collects logs and sends to CloudWatch Logs
Centralized configuration using SSM Parameter Store

Unified Agent Metrics (Linux/EC2):

Category	Metrics
CPU	active, guest, idle, system, user, steal
Disk	free, used, total
Disk IO	writes, reads, bytes, iops
RAM	free, inactive, used, total, cached
Netstat	TCP/UDP connections, net packets, bytes
Processes	total, dead, blocked, idle, running, sleep
Swap	free, used, used %

⚠️ Exam trap: “Monitor RAM on EC2” or “EC2 memory usage” → CloudWatch Unified Agent required! RAM is NOT a default EC2 metric.

⚠️ Exam trap: Default EC2 metrics = CPU, Disk, Network (high-level). For RAM, processes, detailed disk IO → Unified Agent.

CloudWatch Alarms

Alarm States:

State	Meaning
OK	Metric within threshold
ALARM	Metric breached threshold
INSUFFICIENT_DATA	Not enough data yet

Period:

Length of time (seconds) to evaluate metric
High resolution custom metrics: 10 sec, 30 sec, or multiples of 60 sec

Alarm Targets:

Target	Action
EC2	Stop, Terminate, Reboot, or Recover
Auto Scaling	Trigger scaling action (scale out/in)
SNS	Send notification (then trigger Lambda, etc.)

Composite Alarms:

Standard alarms = single metric
Composite Alarms = monitor states of multiple alarms
Use AND and OR conditions
Reduces “alarm noise” by creating complex conditions

CloudWatch Alarms from Logs (Metric Filters)

Create alarms based on CloudWatch Logs using Metric Filters:

CW Logs ──► Metric Filter ──► CW Metric ──► CW Alarm ──► SNS (alert)
            (pattern match)   (count)       (threshold)

How it works:

Logs arrive in CloudWatch Logs (e.g., RDS, Lambda, application logs)
Metric Filter scans for pattern (e.g., “Error”, “Exception”)
Each match increments a custom metric (count)
Alarm monitors metric → triggers when threshold exceeded

Example: RDS Error Alerting

RDS DB Logs ──► CloudWatch Logs ──► Metric Filter ──► Metric ──► Alarm ──► SNS
                                    ("Error")         (count)    (>0)

⚠️ Exam trap: “Alert on keyword in logs” (Error, Exception, etc.) → Metric Filter + Alarm.

⚠️ Exam trap: Don’t use Lambda polling (expensive, not real-time). Don’t use Config (monitors resource config, not log content).

EC2 Instance Recovery

Status Checks:

Check	What it monitors
Instance status	EC2 VM (software)
System status	Underlying hardware
Attached EBS status	EBS volumes

Recovery with CloudWatch Alarm:

EC2 Instance ◄── monitor ── CloudWatch Alarm ──► alert ──► SNS Topic
      │                    (StatusCheckFailed_System)
      │
      └── EC2 Instance Recovery

What’s preserved after recovery:

Same Private IP, Public IP, Elastic IP
Same metadata, placement group

⚠️ Exam trap: “Auto-recover EC2 on hardware failure” → CloudWatch Alarm on StatusCheckFailed_System → EC2 Recovery action.

⚠️ Exam trap: “Most cost-optimal way to auto-reboot/stop/recover EC2” → CloudWatch Alarm → EC2 Action (direct). NOT CW Alarm → SNS → Lambda → EC2 API (over-engineered, 3 services). NOT EventBridge → Lambda (unnecessary compute). CW Alarms have built-in EC2 actions (Stop, Terminate, Reboot, Recover) — no Lambda needed.

Testing CloudWatch Alarms

Test alarms manually using CLI:

aws cloudwatch set-alarm-state \
  --alarm-name "myalarm" \
  --state-value ALARM \
  --state-reason "testing purposes"

CloudWatch Network Synthetic Monitor

Monitor network issues between AWS and on-premises data center.

AWS Cloud
┌────────────────────────────┐
│  ┌──────────────────────┐  │
│  │   Private Subnet     │  │
│  │   ┌────────────┐     │  │
│  │   │ EC2 Instance│     │  │
│  │   └────────────┘     │  │
│  └──────────────────────┘  │
│                            │
│  CloudWatch Metrics ◄──────┼──── DX Connection ──┬──► Corporate Data Center
│                            │         or          │         │
└────────────────────────────┘    VPN Connection   │      Server
                                                   │

Features:

Detect network performance degradation (packet loss, latency, jitter)
No agents required on targets
Tests ICMP or TCP traffic to IPv4/IPv6 on-premises destinations
Works through Direct Connect or Site-to-Site VPN
Publishes data to CloudWatch Metrics

⚠️ Exam trap: “Monitor network connectivity to on-premises” or “detect packet loss/latency over DX/VPN” → CloudWatch Network Synthetic Monitor.

CloudWatch Insights (4 Types)

Insight Type	Target	Use Case
Container Insights	ECS, EKS, K8s on EC2, Fargate	Metrics + logs from containers
Lambda Insights	Lambda functions	Cold starts, memory, CPU, shutdowns
Contributor Insights	CloudWatch Logs	Find top-N talkers, bad hosts, heavy users
Application Insights	EC2 apps (Java, .NET, IIS)	Auto-dashboard for app troubleshooting

CloudWatch Container Insights

Collect, aggregate, summarize metrics and logs from containers
Supported platforms:
- Amazon ECS
- Amazon EKS
- Kubernetes on EC2
- Fargate (ECS and EKS)
For EKS/Kubernetes: uses containerized CloudWatch Agent to discover containers

CloudWatch Lambda Insights

Monitoring and troubleshooting for serverless applications
Collects system-level metrics: CPU time, memory, disk, network
Collects diagnostic info: cold starts, Lambda worker shutdowns
Provided as a Lambda Layer

CloudWatch Contributor Insights

Analyze log data → create time series of contributor data
Find top-N contributors and their usage
Use cases:
- Find bad hosts
- Identify heaviest network users
- Find URLs generating most errors
Works with any AWS-generated logs (VPC, DNS, etc.)
Use sample rules from AWS or build custom rules

CloudWatch Application Insights

Automated dashboards showing potential problems with monitored apps
Supported: EC2 with Java, .NET, IIS, databases
Also works with: EBS, RDS, ELB, ASG, Lambda, SQS, DynamoDB, S3, ECS, EKS, SNS, API Gateway
Powered by SageMaker (ML-based detection)
Findings/alerts sent to EventBridge and SSM OpsCenter

⚠️ Exam trap: “Find top talkers” or “heaviest network users from logs” → Contributor Insights.

⚠️ Exam trap: “Monitor Lambda cold starts” or “Lambda memory/CPU” → Lambda Insights (Lambda Layer).

⚠️ Exam trap: “Auto-dashboard for .NET/Java app issues” → Application Insights (SageMaker-powered).

AWS X-Ray

Distributed tracing for analyzing and debugging applications.

Client ──► API Gateway ──► Lambda ──► DynamoDB
              │              │            │
              └──────────────┴────────────┘
                    X-Ray collects traces
                           │
                           ▼
                    ┌─────────────┐
                    │ Service Map │ ◄── Visual representation
                    │  (latency,  │     of request flow
                    │   errors)   │
                    └─────────────┘

Key Concepts:

Concept	Description
Segments	Data about work done by a service
Subsegments	More granular timing (e.g., DB calls)
Trace	End-to-end path of a request
Annotations	Key-value pairs for filtering traces (indexed)
Metadata	Key-value pairs for additional data (NOT indexed)

Sampling Rules:

Default: 1st request/second + 5% of additional requests
Reservoir = fixed requests/second guaranteed
Rate = percentage of additional requests after reservoir
Custom rules to sample specific paths/methods differently

X-Ray Daemon:

Runs on EC2/ECS/Elastic Beanstalk
Listens on UDP port 2000
Batches and sends traces to X-Ray API
Requires IAM permissions to write to X-Ray

Integrations:

Service	How to Enable
Lambda	Enable “Active Tracing” in config
API Gateway	Enable X-Ray in stage settings
ECS/EKS	Run X-Ray daemon as sidecar
Elastic Beanstalk	`.ebextensions` config
EC2	Install and run X-Ray daemon
ELB	Automatically adds trace header

X-Ray APIs:

API	Purpose
`PutTraceSegments`	Upload segment documents
`PutTelemetryRecords`	Upload telemetry
`GetSamplingRules`	Retrieve sampling rules
`GetSamplingTargets`	Get sampling decisions
`GetServiceGraph`	Get visual service map
`GetTraceSummaries`	Get trace IDs and annotations
`BatchGetTraces`	Get full traces by ID

⚠️ Exam trap: “Debug microservices” or “trace request across services” → X-Ray.

⚠️ Exam trap: “Filter traces by custom attribute” → Use Annotations (indexed), NOT Metadata.

⚠️ Exam trap: X-Ray daemon listens on UDP 2000 — ensure Security Group allows it.

CloudWatch Synthetics Canaries

Configurable scripts that monitor endpoints and APIs.

CloudWatch Synthetics
        │
        ▼
┌───────────────┐     ┌──────────────┐     ┌─────────────┐
│    Canary     │────►│  Endpoint/   │────►│  CloudWatch │
│  (scheduled)  │     │    API       │     │   Metrics   │
└───────────────┘     └──────────────┘     └─────────────┘
        │                                         │
        ▼                                         ▼
  S3 (screenshots,                          CW Alarms
   HAR files)                              (alert on failure)

Key Features:

Scripts written in Node.js or Python
Reproduce customer actions programmatically
Check availability, latency, UI screenshots
Integrates with CloudWatch Alarms
Stores artifacts in S3 (screenshots, HAR files)

Canary Blueprints:

Blueprint	Use Case
Heartbeat Monitor	Load URL, store screenshot, check availability
API Canary	Test REST APIs (GET, POST, etc.)
Broken Link Checker	Check all links on a page
Visual Monitoring	Compare screenshots against baseline
Canary Recorder	Record actions in Chrome, generate script
GUI Workflow Builder	Test multi-step workflows (login, checkout)

Schedule: Run once or on schedule (rate or cron expression)

⚠️ Exam trap: “Monitor website availability” or “test API endpoint regularly” → Synthetics Canaries.

⚠️ Exam trap: Canaries are NOT for load testing — they’re for monitoring.

AWS Health Dashboard

Two components:

Dashboard	Scope	Purpose
Service Health	All AWS	Global AWS service status
Your Account Health	Your account	Events affecting YOUR resources

Your Account Health Dashboard:

Shows events relevant to your account
Provides remediation guidance
Proactive notifications via EventBridge

EventBridge Integration:

AWS Health Event ──► EventBridge ──► Lambda/SNS/etc.
(your account)         (rule)        (automate response)

Use cases:

Auto-restart EC2 when AWS schedules maintenance
Alert on-call team when service degradation affects your region
Trigger failover when availability zone has issues

Health Event Types:

Type	Description
Scheduled Change	Planned maintenance
Account Notification	Account-specific issues
Issue	Ongoing service problem

⚠️ Exam trap: “React to AWS service issues affecting my resources” → Health Dashboard + EventBridge.

⚠️ Exam trap: Service Health = public status. Your Account Health = personalized to your resources.

CloudWatch Evidently

Feature flags and A/B testing for applications.

Application ──► Evidently ──► Feature Flag / Variation
                   │
                   ▼
            Metrics collected
                   │
                   ▼
            Analyze results

Key Features:

Feature	Description
Feature Flags	Safely launch features (enable/disable remotely)
A/B Testing	Compare variations to measure impact
Launches	Gradual rollout to percentage of users
Experiments	Compare metrics between variations

Use Cases:

Launch new feature to 10% of users, gradually increase
Test two button colors, measure which gets more clicks
Kill switch for problematic features

⚠️ Exam trap: “Gradual feature rollout” or “A/B testing” → CloudWatch Evidently.

⚠️ Exam trap: Evidently is for application features, NOT infrastructure testing.

EventBridge Deep Dive

Event buses:

Bus Type	Description
Default	Receives events from AWS services
Custom	Your application events
Partner	SaaS integrations (Datadog, Zendesk, etc.)

Event Flow:

Event Sources              EventBridge                    Targets
┌─────────────┐           ┌───────────┐                 ┌─────────┐
│ AWS Services│──────────►│           │                 │ Lambda  │
├─────────────┤           │   Event   │    Rules        ├─────────┤
│ Custom Apps │──────────►│    Bus    │────(filter)────►│ SQS/SNS │
├─────────────┤           │           │                 ├─────────┤
│ SaaS Partners│─────────►│           │                 │ Step Fn │
└─────────────┘           └───────────┘                 └─────────┘

Schema Registry:

Auto-discovers event schemas from your event bus
Generate code bindings for your IDE
Version schemas as they evolve

Resource-based Policies:

Allow/deny cross-account event bus access
Use case: Aggregate events from multiple accounts to central bus

Account A ──► EventBridge (Account A) ──► Event Bus (Account B - central)
Account B ──► EventBridge (Account B) ──► Event Bus (Account B - central)
Account C ──► EventBridge (Account C) ──► Event Bus (Account B - central)
                                              │
                                              ▼
                                         Central processing

⚠️ Exam trap: “Aggregate events from multiple accounts” → EventBridge Resource-based Policy for cross-account access.

⚠️ Exam trap: Schema Registry = auto-discover event structure, NOT define schemas manually.

AWS CloudTrail

Governance, compliance, and audit for your AWS Account.

Sources                      CloudTrail                    Destinations
┌─────────┐                                               ┌─────────────────┐
│   SDK   │──┐                                        ┌──►│ CloudWatch Logs │
├─────────┤  │                                        │   └─────────────────┘
│   CLI   │──┼──►  CloudTrail  ──► Inspect & Audit ──┤
├─────────┤  │                                        │   ┌─────────────────┐
│ Console │──┤                                        └──►│    S3 Bucket    │
├─────────┤  │                                            └─────────────────┘
│IAM Users│──┘
│& Roles  │
└─────────┘

Key Points:

Enabled by default!
Records history of events/API calls made in your account
Logs can go to CloudWatch Logs or S3
Trail scope: All Regions (default) or single Region
If resource deleted → investigate CloudTrail first!

CloudTrail Event Types

Event Type	Default	What it logs
Management Events	✅ Enabled	Operations on resources (IAM, EC2, CloudTrail config)
Data Events	❌ Disabled	High-volume: S3 object-level, Lambda Invoke
Insights Events	❌ Disabled	Unusual activity detection

Management Events:

Configuring security (IAM AttachRolePolicy)
Configuring routing (EC2 CreateSubnet)
Setting up logging (CloudTrail CreateTrail)
Can separate Read Events vs Write Events

Data Events:

Not logged by default (high volume)
S3 object-level: GetObject, DeleteObject, PutObject
Lambda function execution (Invoke API)
Can separate Read vs Write

CloudTrail Insights

Detect unusual activity in your account:

Inaccurate resource provisioning
Hitting service limits
Bursts of IAM actions
Gaps in periodic maintenance

Management Events ──► Continuous ──► CloudTrail ──► Insights ──┬──► CloudTrail Console
                      analysis        Insights       Events    │
                                                               ├──► S3 Bucket
                                                               │
                                                               └──► EventBridge (automation)

How it works:

Analyzes normal management events → creates baseline
Continuously analyzes write events → detects unusual patterns
Anomalies appear in console, sent to S3, generate EventBridge event

CloudTrail Events Retention

Storage	Retention	Use Case
CloudTrail	90 days	Quick lookup, recent activity
S3 Bucket	Long-term	Compliance, historical analysis

Long-term analysis: Log to S3 → query with Athena

Event Types:              CloudTrail           S3 Bucket           Athena
┌──────────────────┐     ┌─────────┐          ┌─────────┐        ┌─────────┐
│ Management Events│────►│         │          │         │        │         │
│ Data Events      │────►│ 90 days │───log───►│Long-term│──SQL──►│ Analyze │
│ Insights Events  │────►│retention│          │retention│        │         │
└──────────────────┘     └─────────┘          └─────────┘        └─────────┘

⚠️ Exam trap: “Who deleted the resource?” or “API call history” → CloudTrail.

⚠️ Exam trap: “Detect unusual IAM activity” or “burst of API calls” → CloudTrail Insights.

⚠️ Exam trap: “Keep CloudTrail logs beyond 90 days” → Log to S3, query with Athena.

⚠️ Exam trap: “Data Events disabled by default” — S3 object-level and Lambda Invoke need explicit enabling.

EventBridge Archive and Replay

Store events and replay them later — built-in feature, no custom code needed.

Feature	Description
Archive	Store events from any event bus (indefinitely or set retention)
Replay	Re-send archived events to same or different event bus
Filter	Archive only matching events (use event patterns)
Use case	Replay production events in dev/test environment

Production Event Bus                     Dev Event Bus
     │                                        ▲
     ▼                                        │
  Archive ──────► Stored Events ─────► Replay │
  (filter)        (S3, managed)        (6 months later)

Key use case: Store production events → replay in dev environment for testing (periodically or on-demand).

⚠️ Exam trap: “Store EventBridge events for later replay” → Archive and Replay (NOT Lambda + S3/DynamoDB — over-engineered).

⚠️ Exam trap: “Most efficient and cost-effective way to store and replay events” → built-in feature wins over custom Lambda solutions.

EventBridge + CloudTrail Integration

Pattern: React to any API call with alerts/automation.

User ──► API Call ──► AWS Service ──► CloudTrail ──► EventBridge ──► SNS/Lambda
                      (logs API)       (event)        (alert/automate)

Examples:

Trigger	Flow
User assumes IAM Role	IAM (AssumeRole) → CloudTrail → EventBridge → SNS
Security Group modified	EC2 (AuthorizeSecurityGroupIngress) → CloudTrail → EventBridge → SNS
DynamoDB table deleted	DynamoDB (DeleteTable) → CloudTrail → EventBridge → SNS

Example 1: IAM Role Assumption Alert
User ──► AssumeRole ──► IAM ──► CloudTrail ──► EventBridge ──► SNS
                              (API Call log)    (event)       (alert)

Example 2: Security Group Change Alert  
User ──► Edit SG Rules ──► EC2 ──► CloudTrail ──► EventBridge ──► SNS
         (AuthorizeSecurityGroupIngress)

Example 3: DynamoDB Table Deletion Alert
User ──► DeleteTable ──► DynamoDB ──► CloudTrail ──► EventBridge ──► SNS

Key insight: CloudTrail logs all API calls → EventBridge can react to any of them!

⚠️ Exam trap: “Alert when user assumes role” or “notify on Security Group changes” → CloudTrail + EventBridge + SNS.

AWS Config

Auditing and recording compliance of AWS resources over time.

Use cases:

Is there unrestricted SSH access to my security groups?
Do my buckets have public access?
How has my ALB configuration changed over time?

Key Points:

Per-region service (can aggregate across regions/accounts)
Records configuration changes over time
Receive SNS notifications for any changes
Store configuration data in S3 → analyze with Athena

Config Rules

Evaluate whether resources are compliant with desired configurations.

Rule Type	Description
AWS Managed Rules	75+ pre-built rules
Custom Rules	Defined in Lambda

Examples:

Evaluate if EBS disks are type gp2
Evaluate if EC2 instances are t2.micro

Rule Triggers:

On each config change
At regular time intervals
Or both

⚠️ Config Rules does NOT prevent actions (no deny) — only evaluates compliance!

Config Rules — Notifications

Two notification patterns:

Pattern 1: EventBridge (filtered, action-oriented)

AWS Resources ──► AWS Config ──► NON_COMPLIANT ──► EventBridge ──┬──► Lambda
                  (monitor)                        (trigger)     ├──► SNS
                                                                 └──► SQS

Pattern 2: SNS (all events)

AWS Resources ──► AWS Config ──► All events ──► SNS ──► Admin
                  (monitor)      (config changes,       (notification)
                                 compliance state)

Use SNS Filtering or client-side filtering for Pattern 2.

Config Rules — Remediations

Auto-fix non-compliant resources using SSM Automation Documents.

Non-Compliant Resource ──► AWS Config ──► SSM Automation ──► Auto-Remediation
(e.g., expired IAM key)    (detect)       Document            (deactivate key)
                                          (Retries: 5)

Remediation Options:

Option	Description
AWS-Managed Documents	Pre-built remediation actions
Custom Documents	Your own automation (can invoke Lambda)
Remediation Retries	Retry if still non-compliant after auto-fix

Example:

IAM Access Key expired → AWS Config detects → SSM Document AWSConfigRemediation-RevokeUnusedIAMUserCredentials → deactivate key

⚠️ Exam trap: “Auto-remediate non-compliant resources” → AWS Config + SSM Automation Documents.

⚠️ Exam trap: “Config Rules” = detect/evaluate only. “Auto-remediation” = SSM Automation.

⚠️ Exam trap: Config does NOT prevent actions (no deny) — it only detects non-compliance after the fact.

CloudWatch vs CloudTrail vs Config

Service	Purpose	Question it Answers
CloudWatch	Performance monitoring, dashboards, alerts, logs	How is my app performing?
CloudTrail	API call history, audit	WHO made changes?
Config	Configuration compliance, change timeline	Is my resource compliant? How did it change?

Quick Decision:

"Performance/metrics/dashboard"     → CloudWatch
"Who did it? / API calls / audit"   → CloudTrail  
"Is it compliant? / config history" → Config

Example: ELB Monitoring (CloudWatch vs CloudTrail vs Config)

Service	ELB Use Case
CloudWatch	Monitor incoming connections, visualize error codes %, dashboard for performance
Config	Track SG rules, track config changes, ensure SSL certificate always assigned (compliance)
CloudTrail	Track WHO made changes to the Load Balancer (API calls)

⚠️ Exam trap: This comparison is exam-favorite! Remember:

CloudWatch = PERFORMANCE (metrics, dashboards)
CloudTrail = WHO (API audit)
Config = COMPLIANCE (rules, changes over time)

🚫 Common Wrong Answers Explained

Scenario: “Alert on keyword in logs (Error, Exception)”

Wrong Answer	Why Wrong
❌ Lambda polling logs hourly	Expensive compute, not real-time, over-engineered
❌ AWS Config Rule	Config monitors resource configuration, not log content
❌ CloudTrail	CloudTrail logs API calls, not application logs
✅ Metric Filter + Alarm	Built-in, near real-time, cost-effective

Scenario: “Monitor EC2 memory/RAM”

Wrong Answer	Why Wrong
❌ CloudWatch default metrics	RAM is NOT a default metric (only CPU, Disk, Network)
❌ Enable detailed monitoring	Detailed = 1-minute instead of 5-minute, still no RAM
❌ CloudTrail	CloudTrail is for API audit, not metrics
✅ CloudWatch Unified Agent	Required for OS-level metrics (RAM, processes, disk IO)

Scenario: “Who deleted the resource / API audit”

Wrong Answer	Why Wrong
❌ CloudWatch Logs	Logs application output, not API calls
❌ AWS Config	Config tracks config state over time, not who made changes
❌ VPC Flow Logs	Network traffic, not API calls
✅ CloudTrail	Records ALL API calls with user identity

Scenario: “Is resource compliant? / Track config changes”

Wrong Answer	Why Wrong
❌ CloudTrail	Shows API calls, not current config state or compliance
❌ CloudWatch	Performance metrics, not configuration compliance
❌ IAM Access Analyzer	Analyzes IAM policies, not general resource config
✅ AWS Config	Records config changes + evaluates compliance rules

Scenario: “Prevent non-compliant resource creation”

Wrong Answer	Why Wrong
❌ AWS Config Rules	Config detects after the fact, doesn’t prevent
❌ CloudTrail	Audit only, no enforcement
✅ SCPs	Prevent at Organization level
✅ IAM Policies	Prevent at user/role level

Scenario: “Auto-remediate non-compliant resources”

Wrong Answer	Why Wrong
❌ Config Rules alone	Rules only detect, don’t fix
❌ CloudWatch Alarms	Alarms alert, don’t remediate config
❌ Lambda (without trigger)	No automatic invocation mechanism
✅ Config + SSM Automation	Config detects → SSM Document remediates

Scenario: “Real-time log processing / streaming”

Wrong Answer	Why Wrong
❌ S3 Export (CreateExportTask)	Batch only, up to 12 hours latency
❌ CloudWatch Logs Insights	Query engine, not real-time stream
✅ Subscription Filters	Real-time to Lambda/Kinesis

Scenario: “Debug microservices / distributed tracing”

Wrong Answer	Why Wrong
❌ CloudWatch Logs	Shows individual service logs, not request path
❌ CloudWatch Metrics	Shows aggregated metrics, not individual traces
❌ VPC Flow Logs	Network packets, not application-level tracing
✅ X-Ray	Traces requests across services, shows latency per hop

Scenario: “React to AWS service events / automate on events”

Wrong Answer	Why Wrong
❌ CloudWatch Alarms	Only for metric thresholds, not AWS events
❌ SNS alone	Needs something to trigger it
❌ Lambda scheduled	Polling, not event-driven
✅ EventBridge	Native integration with 100+ AWS services

Scenario: “Keep CloudTrail logs beyond 90 days”

Wrong Answer	Why Wrong
❌ Increase CloudTrail retention	Not configurable, always 90 days in console
❌ CloudWatch Logs	Different service, not where CloudTrail stores
✅ S3 + Athena	S3 for storage, Athena for SQL queries

Scenario: “Alert on API call (AssumeRole, Security Group change)”

Wrong Answer	Why Wrong
❌ CloudWatch Alarm	Alarms are for metrics, not API events
❌ Config Rule	Config checks compliance, not individual API calls
❌ GuardDuty	Threat detection, not general API alerting
✅ CloudTrail + EventBridge + SNS	CloudTrail logs → EventBridge triggers → SNS alerts

Scenario: “Monitor Security Group rules / port exposure”

Wrong Answer	Why Wrong
❌ CloudWatch Metrics	Metrics = performance (CPU, network) — not config state
❌ CloudTrail	Logs API calls (who changed SG) — not current config state
❌ Lambda on schedule	Works but over-engineered — Config has built-in rules
✅ Config Rules	Continuously monitors Security Group configurations

Example Config Rules for Security Groups:

restricted-ssh — checks if SSH (port 22) is restricted
restricted-common-ports — checks for unrestricted access on common ports
vpc-sg-open-only-to-authorized-ports — custom rule for specific ports

Scenario: “Config feature for email/alert on config change”

Wrong Answer	Why Wrong
❌ Config Rules	Evaluate compliance — doesn’t send notifications itself
❌ Config Remediations	Auto-fix resources — not for alerting
✅ Config Notifications	Send alerts via SNS on config changes

Config Features Quick Reference:

Feature	Purpose
Config Rules	EVALUATE — is it compliant?
Config Notifications	ALERT — send SNS/email
Config Remediations	FIX — auto-correct via SSM

Scenario: “Store EventBridge events for later replay/testing”

Wrong Answer	Why Wrong
❌ Lambda → S3	Over-engineered — built-in feature exists
❌ Lambda → DynamoDB	Not designed for event replay
❌ Kinesis Firehose → S3	Extra service, no native replay
✅ EventBridge Archive and Replay	Native feature, cost-effective, replay to any event bus

Quick Reference: Service Confusion Matrix

If Question Says	NOT This	Use This Instead
“Monitor RAM/memory”	Default CW metrics	Unified Agent
“Who did it / audit”	CloudWatch, Config	CloudTrail
“Is it compliant”	CloudTrail	Config
“Prevent action”	Config Rules	SCPs / IAM
“Real-time logs”	S3 Export	Subscriptions
“Log keyword alert”	Lambda polling, Config	Metric Filter
“Distributed tracing”	CW Logs, VPC Flow	X-Ray
“React to events”	CW Alarms	EventBridge
“Auto-fix non-compliant”	Config alone	Config + SSM
“Port/SG exposed”	CloudWatch, CloudTrail	Config Rules
“Store/replay events”	Lambda+S3, DynamoDB	EventBridge Archive
“Monitor website/API”	CloudWatch Metrics	Synthetics Canaries
“A/B testing / feature flags”	Lambda, custom code	Evidently
“AWS outage affects me”	Service Health	Your Account Health + EventBridge

🎯 MASTER SUMMARY: Monitoring & Observability Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: The Three Pillars — Performance vs Audit vs Compliance

WHY: AWS separates concerns into three distinct services because each answers a fundamentally different question:

Service	Question	Data Type
CloudWatch	“How is it performing?”	Metrics, logs, dashboards
CloudTrail	“Who did what?”	API call history
Config	“Is it compliant?”	Resource configuration state

Application: When you see keywords, map to the pillar:

“Dashboard”, “metrics”, “alarm”, “alert on threshold” → CloudWatch
“Audit”, “who deleted”, “API history” → CloudTrail
“Compliance”, “rules”, “configuration changes” → Config

Principle 2: Real-time vs Batch — Know the Latency

WHY: Exam tests whether you know which service provides real-time data vs batch processing.

Need	NOT This (Batch)	Use This (Real-time)
Log processing	S3 Export (12h)	Subscription Filters
Metrics streaming	Pull from API	Metric Streams
Event reaction	Scheduled Lambda	EventBridge

Principle 3: Default Metrics vs Agent Required

WHY: EC2 hypervisor can only see certain metrics. OS-level metrics require an agent.

Metric Type	Default (Hypervisor)	Agent Required
CPU	✅	-
Network	✅	-
Disk (high-level)	✅	-
RAM/Memory	❌	✅ Unified Agent
Processes	❌	✅ Unified Agent
Disk IO detailed	❌	✅ Unified Agent

Principle 4: Detect vs Prevent vs Remediate

WHY: Config ONLY detects. It cannot prevent or fix.

PREVENT          DETECT           REMEDIATE
   │                │                 │
   ▼                ▼                 ▼
SCPs/IAM ───► Config Rules ───► SSM Automation
(before)      (after the fact)    (auto-fix)

Application:

“Stop users from creating non-compliant” → SCPs/IAM (PREVENT)
“Alert when non-compliant” → Config Rules (DETECT)
“Auto-fix non-compliant” → Config + SSM (REMEDIATE)

Principle 5: CloudTrail Retention — 90 Days is the Limit

WHY: CloudTrail console only keeps 90 days. For long-term, you MUST export.

CloudTrail (90 days) ──► S3 (unlimited) ──► Athena (query)

Principle 6: X-Ray is for Distributed Tracing ONLY

WHY: X-Ray traces request flow across services. It’s NOT for:

General log viewing (→ CloudWatch Logs)
Metric monitoring (→ CloudWatch Metrics)
Network debugging (→ VPC Flow Logs)

Application: “Debug latency in microservices” or “find bottleneck between services” → X-Ray

Principle 7: EventBridge is the Event Router

WHY: EventBridge connects AWS services, custom apps, and SaaS. CloudWatch Alarms only handle metric thresholds.

Event Type	Service
Metric crosses threshold	CloudWatch Alarm
AWS service state change	EventBridge
API call made	CloudTrail → EventBridge
Scheduled task	EventBridge Scheduler

Principle 8: Synthetics vs X-Ray — External vs Internal

WHY: They solve different problems:

Service	Perspective	Use Case
Synthetics Canaries	External (customer view)	Is my site/API up?
X-Ray	Internal (developer view)	Where’s the bottleneck?

Part 2: Decision Tree (Follow Keywords → Find Answer)

START: What does the question ask about?
│
├─► "Who did it?" / "audit" / "API history"
│   └─► CloudTrail
│
├─► "Is it compliant?" / "configuration state" / "rules"
│   └─► Config
│
├─► "Performance" / "metrics" / "dashboard" / "alarm"
│   └─► CloudWatch
│       │
│       ├─► "RAM/memory on EC2" → Unified Agent
│       ├─► "Keyword in logs" → Metric Filter + Alarm
│       ├─► "Real-time log stream" → Subscription Filter
│       └─► "Stream metrics to S3" → Metric Streams
│
├─► "Trace requests" / "microservices debug" / "latency between services"
│   └─► X-Ray
│
├─► "React to event" / "automate on state change"
│   └─► EventBridge
│       │
│       ├─► "Store/replay events" → Archive and Replay
│       └─► "Cross-account events" → Resource-based Policy
│
├─► "Monitor website/API availability"
│   └─► Synthetics Canaries
│
├─► "Feature flags" / "A/B testing" / "gradual rollout"
│   └─► Evidently
│
└─► "AWS service issue affecting me"
    └─► Health Dashboard + EventBridge

Part 3: The “CANNOT” List

Service	CANNOT Do
CloudWatch default metrics	Monitor RAM/memory
Config Rules	Prevent resource creation (only detect after)
CloudTrail	Keep logs beyond 90 days (without S3)
S3 Export (logs)	Real-time processing (up to 12h delay)
CloudWatch Alarms	React to AWS service events (only metrics)
X-Ray	Show aggregated metrics (only traces)
Metric Filters	Filter BEFORE logs arrive (only after)

Part 4: Quick Reference Tables

CloudWatch Insights Comparison

Insight Type	What it Monitors	Key Feature
Container Insights	ECS/EKS/Fargate	Container metrics + logs
Lambda Insights	Lambda functions	Cold starts, memory
Contributor Insights	Log data	Top-N talkers
Application Insights	Java/.NET apps	SageMaker-powered dashboards

Log Destinations Comparison

Destination	Latency	Use Case
S3 Export	Up to 12 hours	Archive, compliance
Subscription → Lambda	Real-time	Transform, OpenSearch
Subscription → Firehose	Near real-time	S3, Redshift delivery
Subscription → Kinesis	Real-time	Custom analytics

CloudTrail Event Types

Event Type	Default	Examples
Management Events	✅ ON	IAM, EC2, CloudTrail config
Data Events	❌ OFF	S3 object-level, Lambda Invoke
Insights Events	❌ OFF	Unusual activity detection

Part 5: Instant-Answer Table

Question Contains	→ Instant Answer
“Monitor RAM/memory EC2”	Unified Agent
“Who deleted resource”	CloudTrail
“API call history”	CloudTrail
“Is resource compliant”	Config
“Track config changes over time”	Config
“Prevent non-compliant creation”	SCPs / IAM Policies
“Auto-remediate non-compliant”	Config + SSM Automation
“Alert on log keyword”	Metric Filter + Alarm
“Real-time log processing”	Subscription Filters
“Stream metrics to S3”	Metric Streams via Firehose
“Aggregate logs multi-account”	Subscriptions → Kinesis
“Debug microservices”	X-Ray
“Trace request across services”	X-Ray
“Filter traces by attribute”	X-Ray Annotations
“React to AWS service event”	EventBridge
“Schedule task (cron)”	EventBridge Scheduler
“Store/replay events”	EventBridge Archive
“Cross-account event bus”	EventBridge Resource Policy
“CloudTrail beyond 90 days”	S3 + Athena
“Unusual IAM activity”	CloudTrail Insights
“Monitor website/API up”	Synthetics Canaries
“A/B testing”	Evidently
“Feature flags”	Evidently
“Gradual feature rollout”	Evidently
“AWS outage affecting me”	Health Dashboard + EventBridge
“Top network users in logs”	Contributor Insights
“Lambda cold starts”	Lambda Insights
“Java/.NET app dashboard”	Application Insights
“Container metrics ECS/EKS”	Container Insights
“Network to on-premises”	Network Synthetic Monitor
“SG port exposure check”	Config Rules

Part 6: Elimination Checklist

□ Is it about WHO did something?
  → Yes = CloudTrail
  → No = Continue

□ Is it about COMPLIANCE or configuration state?
  → Yes = Config
  → No = Continue

□ Is it about PREVENTING creation?
  → Yes = SCPs/IAM (Config can't prevent)
  → No = Continue

□ Is it about REAL-TIME log processing?
  → Yes = Subscription Filters (S3 Export has 12h delay)
  → No = Continue

□ Is it about RAM/memory metrics?
  → Yes = Unified Agent (not default metrics)
  → No = Continue

□ Is it about distributed tracing?
  → Yes = X-Ray
  → No = Continue

□ Is it about reacting to AWS events?
  → Yes = EventBridge (not CloudWatch Alarms)
  → No = Continue

□ Is it about website/API monitoring?
  → Yes = Synthetics Canaries
  → No = Continue

□ Is it about feature rollout/A/B testing?
  → Yes = Evidently
  → No = Continue

🏆 The Golden Rules

RAM = Agent (EC2 memory requires Unified Agent)
Who = Trail (CloudTrail for API audit)
Compliant = Config (Config for rules and compliance)
Prevent ≠ Config (Config detects, doesn’t prevent)
Real-time logs = Subscriptions (S3 Export is batch)
90 days = S3 (CloudTrail needs S3 for long-term)
Trace = X-Ray (Distributed tracing across services)
Events = EventBridge (Not CloudWatch Alarms)
Keyword alert = Metric Filter (Not Lambda polling)
Remediate = SSM (Config + SSM Automation)
Top talkers = Contributor (Contributor Insights)
Lambda health = Lambda Insights (As a Lambda Layer)
Website up = Canaries (Synthetics Canaries)
Feature flags = Evidently (CloudWatch Evidently)
Store events = Archive (EventBridge Archive and Replay)

Security:

AWS Security Groups (SG):

Security Group (Firewall) controls how traffic is allowed into or out of EC2 Instances or other Security Groups. Can be attached to multiple instances. Locked down to a region/VPC combination. Does live “outside” EC2", if traffic blocked, EC2 won’t see it.
Security Group by default denies every inbound traffic and contain only allow rules. All outbound traffic is authorised.

Security Group Rules regulate:

Access to Ports;
Authorised IP ranges (IPv4&IPv6);
Control of inbound and outbound network from one instance to another.

Best practices:

Maintain separate Security Group for SSH access;

Troubleshooting:

If application is not accessible (time out), then it’s a security group issue;
If application give a “connection refused” error, then it’s an application error or it’s not launched.

Amazon S3 - Security:

User-Based:
- IAM Policies: which API calls should be allowed for a specific user from IAM;
Resource-Based:
- Bucket Policies;
- Object Access Control List (ACL);
- Bucket Access Control List (ACL). An IAM principal can access an S3 object if the user IAM permissions ALLOW it OR the resource policy ALLOWS it. There is no explicit DENY needed. Only Block all public access settings, to prevent data leaks.

IAM Access Analyzer for S3:

Ensures that only intended people have access to your S3 buckets;
Publicly accessible bucket, bucket shared with other AWS account;
Evaluates S3 Bucket Policies, S3 ACLs, S3 Access Point Policies.

Network Protection:

DDoS Protection on AWS:

AWS Shield Standard (free): protects against DDoS attack (SYN/UDP Floods, Reflection attacks and other layer 3/4 attacks) for your website and applications, for all customers at no additional costs.
AWS Shield Advanced (additional cost): 24/7 premium DDoS protection from more sophisticated attack on AWS Services. Protects against higher fees during usage spikes due to DDoS.

AWS WAF: filter specific requests based on rules and protects web application from common web exploits (Layer 7). Deploy on Application Load Balancer, API Gateway and CloudFront.

Define Web ACL (Web Access Controll List):

Rules can include IP address, HTTP headers, HTTP body or URI strings;
Protects from common attack: SQL injection and Cross-Site Scripting (XSS);
Size contrains, geo-match (block countries);
Rate-based rules (to count occurences of events) for DDoS protection.

AWS Network Firewall protect entire Amazon VPC (from layer 3 to layer 7). Inspect directions:

VPC to VPC traffic;
Outbound/inbound internet traffic;
To/from Direct Connect & Site-to-Site VPN.

AWS Firewall Manager manages security rules in all accounts of an AWS Organization. Rules are applied to new resources as they are created (good for compliance) across all and future accounts in your Organization.

Security policies:

VPC Security Groups for EC2, Application Load Balancer, etc;
AWS WAF Rules;
AWS Shield Advanced;
AWS Network Firewall.

Penetration Testing and Abusing activities on AWS Cloud:

AWS Acceptable use policy:

No illigal, Harmful, or Offensive Use or Content;
No Security Violations;
No Network Abuse;
No E-mail or Other Message Abuse.

Eight servics that are allowed without prior approval from AWS to carry out security assessments or penetration tests:

Amazon EC2 instances, NAT Gateways and Elastic Load Balancers;
Amazon RDS;
Amazon CloudFront;
Amazon Aurora;
Amazon API Gateways;
AWS Lambda and Lambda Edge functions;
Amazon Lightsail resources;
Amazon Elastic Beanstalk environment.

Prohibited Actiities:

DNS zone walking via Amazon Route 53 Hosted Zones;
Denial of Service (DoS), Distributed Denial of Service (DDoS), Simulated DoS/DDoS;
Port flooding;
Protocol flooding;
Request flooding (login request flooding, API request flooding).

AWS Abuse report suspected AWS resources used for abusive or illegal purposes.

Abusive & prohibited behaviors are:

Spam: recieving undesired emails from AWS-owned IP address, websites & forums spammed by AWS resources;
Port scanning: sending packets to ports to discover the unsecured ones;
DoS/DDoS attacks: AWS-owned IP addresses attempting to overwhelm or crash service;
Intrusion attempts: logging in on your resources;
Hosting objectionable or copyrighted content: distributing illigal or copyrighted content without consent;
Disctributing malware: AWS resources distributing software to harm computers.

Encryption Management:

AWS KMS (Key Management Service) service that helps manage the ecryption keys.

Encryption Opt-in:

EBS volumes: encrypt volumes;
S3 buckets: Server-side encryption of objects;
Redshift database: encryption of data;
RDS database: encryption of data;
EFS drives: encryption of data.

Encryption Automatically enabled:

CloudTrail Logs;
S3 Glacier;
Storage Gateway.

AWS CloudHSM (Cloud Hardware Security Module) provisioning encryption hardware. Customer manages all ecryption keys.

CloudHSM vs KMS:

Feature	AWS KMS	AWS CloudHSM
Key Management	AWS manages keys	Customer manages keys
Access Control	IAM policies + Key policies	You manage users in HSM
Hardware	Shared (multi-tenant)	Dedicated hardware (single-tenant)
FIPS 140-2	Level 2	Level 3 (tamper-evident)
High Availability	AWS managed	You must set up cluster across AZs
Integration	Native with 100+ AWS services	Custom integration needed
Cost	Pay per key + API calls	~$1.50/hour per HSM
Use Case	Most encryption needs	Strict compliance, BYOK, SSL/TLS offload

Key Insight:

KMS: IAM controls WHO can access the service, but AWS manages the actual key material
CloudHSM: You control BOTH access AND the key material (AWS has NO access to your keys)

KMS:                                    CloudHSM:
┌─────────────────────────┐            ┌─────────────────────────┐
│      AWS manages        │            │    Customer manages     │
│  ┌─────────────────┐    │            │  ┌─────────────────┐    │
│  │   Key Material  │    │            │  │   Key Material  │    │
│  │  (AWS controls) │    │            │  │ (YOU control)   │    │
│  └────────▲────────┘    │            │  └────────▲────────┘    │
│           │             │            │           │             │
│     IAM Policy          │            │    HSM Users/Certs      │
│   (access control)      │            │   (you manage)          │
└─────────────────────────┘            └─────────────────────────┘

⚠️ Exam trap: “Customer needs to manage their own encryption keys with FIPS 140-2 Level 3” → CloudHSM (KMS is Level 2)

⚠️ Exam trap: “AWS should NOT have access to encryption keys” → CloudHSM (with KMS, AWS manages key material)

⚠️ Exam trap: “Multi-region” + “Global database” + “client-side encryption” → KMS Multi-Region Keys (NOT CloudHSM)

CloudHSM = single-region only (no native cross-region key replication)
KMS Multi-Region Keys (mrk-) = same key ID works across all regions

Scenario	Answer	Why NOT other
Aurora Global + client-side encryption	KMS Multi-Region Keys	CloudHSM can’t replicate keys across regions
FIPS 140-2 Level 3 compliance	CloudHSM	KMS is Level 2 only
AWS must NOT access keys	CloudHSM	KMS = AWS manages key material

Type of KMS Keys (based on creating, managing, using rotaion policies):

Customer Managed Keys;
AWS Managed key;
AWS Owned Keys;
CloudHSM Keys.

AWS Certificate Manager (ACM) is a managed service to provision, manage, and deploy public and private SSL/TLS certificates with AWS services and internal connected resources. Intergrated with Elastic Load Balancer, CloudFront Distributions, APIs on API Gateway.

AWS Secrets Manager helps you manage, retrieve, and rotate database credentials, API keys, and other secrets throughout their lifecycles. Integrated with AWS Lambda, AWS RDS (MySQL, PostgreSQL, Aurora).

Rotation sercrets is the process of periodically updating a secret. When you rotate a secret, you update the credentials in both the secret and the database or service. In Secrets Manager, you can set up automatic rotation for your secrets.

Managed rotation: for most managed secrets, you use managed rotation, where the service configures and manages rotation for you. Managed rotation doesn’t use a Lambda function;
Rotation by Lambda function: for other types of secrets, Secrets Manager rotation uses a Lambda function to update the secret and the database or service.

AWS Systems Manager Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. You can store data such as passwords, database strings, and license codes as parameter values.
Parameter Store doesn’t provide automatic rotation services for stored secrets.

Other Security Tools:

Amazon GuardDuty intelligent (uses Machine Learning) threat discovery to protect your AWS account.

Input data includes:

CloudTrail Events Logs: unusual API calls, unauthorized deployments:
- CloudTrail Management Events: create VPC subnet, create trail, etc;
- CloudTrail S3 Data Events: get object, list objects, delete object, etc;
VPC Flow Logs: usual internal traffic, unusual IP address;
DNS Logs: compromised EC2 instances sending encoded data within DNS queries;
Optional Features: EKS Audit Logs, RDS & Aurora, EBS, Lambda, S3 Data Events, etc
Can protect against CryptoCurrency attacks.

Amazon Inspector automatically discovers workloads, such as Amazon EC2 instances, containers, and Lambda functions, and scans them for software vulnerabilities and unintended network exposure.

For EC2 instances:
- Leveraging the AWS System Manager (SSM) agent;
- Analyze againt unintended network accessibility;
- Analyze the running OS against known vulnerabilities;
For Container Images push to Amazon ECR:
- Assessment of Container Images as they are pushed;
For Lambda Functions:
- Identifies software vulnerabilities in fucntion code and packages dependencies;
- Assessment of functions as they are deplayed. Reporting & integraion with AWS Security Hub. Send findings to Amazon Event Bridge.

AWS Config is a config tool that helps you assess, audit, and evaluate the configurations and relationships of your resources. Possibility of storing the configuration data into S3 (analyzed by Athena) and recieving alerts (SNS notifications) for any changes. Per-region service, but can be aggregated across regions and accounts.

“Is there unresctricted SSH access to my security groups?”;
“Do my buckets have any public access?”;
“How has my ALB configuration changed over time?.

AWS Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and protect your sensitive data in AWS. Macie helps identify and alert you to sensitive data, such as personally identifiable information (PII).

AWS Detective analyzes, investigates and quickly identifies the root cause of security issues or suspicious activities (using ML and graphs). Automatically collects and processes events from VPC Flow Logs, CloudTrail, GuardDuty and create unified view.

AWS Security Hub central security tool to manage security across several AWS accounts and automate security checks. Integrated dashboards showing current security and compliance status to quickly take actions. Not enabled by default

Autmatically aggregates alerts:

AWS Config;
AWS GuardDuty;
AWS Inspector;
AWS Macie;
IAM Access Analyzer;
AWS Systems Manager;
AWS Firewall Manager;
AWS Health;
AWS Partner Network Solutions. All discovered data can be used by AWS EventBridge, AWS Security Hub Findings, Amazon Detective.

Security Compliances and Reports:

AWS Artifact (not really a service) portal that provides customers with on-demand access to AWS compliance documentation and AWS agreements. Can be used to support internal audit or compliance.

Artifact Reports: Allows you to download AWS security and compliance documents from third-party auditors (AWS ISO, Payment Card industry and System and Organization Control);
Artifact Agreements: Allows you to review, accept and track the status of AWS agreements (as Business Associate Addendum, Health Insurance Portability and Acountability Act).

IAM Access Analyzer: finds out which resources are shared externally of defined Zone of Trust (AWS account or AWS Organization):

S3 Buckets;
IAM Roles;
KMS Keys;
Lambda Functions and Layers;
SQS queues;
Secrets Manager Secrets.

Encryption Types:

Type	Where Encrypted	Who Has Keys	Use Case
Encryption in Flight (TLS/SSL)	During transmission	TLS certificates	Protect data in transit, prevent MITM attacks
Server-Side Encryption (SSE)	At rest, on server	Server (AWS manages)	S3, EBS, RDS - data protected at rest
Client-Side Encryption	Before sending	Client only	Server should NOT see plaintext (zero-trust)

Client-Side Encryption:
┌──────────┐                              ┌─────────────────┐
│  Client  │  encrypted data              │  Storage (S3)   │
│ ┌──────┐ │  ─────────────────────────►  │                 │
│ │ Key  │ │                              │ Encrypted blob  │
│ └──────┘ │  ◄─────────────────────────  │ (can't decrypt) │
└──────────┘  encrypted data              └─────────────────┘

Server-Side Encryption:
┌──────────┐   plaintext     ┌─────────────────────────────────┐
│  Client  │  ────────────►  │  AWS Service (S3)               │
│          │   HTTPS         │  ┌─────┐  encrypt   ┌────────┐  │
│          │                 │  │ Key │ ─────────► │ Stored │  │
│          │  ◄────────────  │  └─────┘  decrypt   └────────┘  │
└──────────┘   plaintext     └─────────────────────────────────┘

⚠️ Exam trap: “[Service] Client-side Encryption” terminology

“Aurora Client-side” = Your app (client OF Aurora) encrypts before storing
“S3 Client-side” = Your app (client OF S3) encrypts before uploading
❌ “Lambda Client-side” makes NO SENSE - Lambda is compute, not storage!

When question says “data must not be disclosed even by company admins”: → Client-side encryption (service stores only ciphertext, can’t decrypt)

AWS KMS (Key Management Service):

AWS KMS manages encryption keys for AWS services. Anytime you hear “encryption” for an AWS service, it’s most likely KMS.

Fully integrated with IAM for authorization
Audit KMS Key usage using CloudTrail
Seamlessly integrated into most AWS services (EBS, S3, RDS, SSM…)
Available through API calls (SDK, CLI) - encrypted secrets can be stored in code/env vars

KMS Key Types:

Key Type	Description	Access to Key Material
Symmetric (AES-256)	Single key for encrypt/decrypt	Never (must use KMS API)
Asymmetric (RSA/ECC)	Public + Private key pair	Public key downloadable, private never

Symmetric: Used by ALL AWS service integrations with KMS
Asymmetric use case: Encryption outside AWS by users who can’t call KMS API

Asymmetric Key Usage (IMPORTANT):

Key Type	Key Usage	Can Do	Cannot Do
RSA	`ENCRYPT_DECRYPT`	Encrypt, Decrypt	Sign, Verify
RSA	`SIGN_VERIFY`	Sign, Verify	Encrypt, Decrypt
ECC	`SIGN_VERIFY` only	Sign, Verify	Encrypt, Decrypt (never!)

Key usage is chosen at creation time and is permanent
One asymmetric key cannot do both encrypt/decrypt AND sign/verify
ECC keys can ONLY sign/verify - never encrypt

⚠️ Exam trap: “Asymmetric key for encryption AND signing” → IMPOSSIBLE with single key. Need TWO separate keys.

KMS Key Ownership & Pricing:

Key Type	Cost	Example	Rotation
AWS Owned	Free	SSE-S3, SSE-SQS, SSE-DDB	AWS manages
AWS Managed	Free	`aws/rds`, `aws/ebs`	Auto every 1 year
Customer Managed (created)	$1/month + API calls	Your keys	Must enable, auto every 1 year
Customer Managed (imported)	$1/month + API calls	BYOK	Manual only (use alias)

KMS Key Rotation Deep Dive:

Key Type	Auto Rotation	Period	Notes
AWS Managed	✅ Always ON	1 year	Cannot disable
Customer Managed	Optional (must enable)	1 year	On-demand also available
Imported	❌ Not available	N/A	Manual only via alias

How rotation works:

KMS creates new key material (backing key)
Key ID stays the same - no application changes needed
Old key material kept for decrypting old data
New encryptions use new key material

Before rotation:          After rotation:
┌─────────────────┐       ┌─────────────────┐
│ Key ID: abc-123 │       │ Key ID: abc-123 │  ◄── Same ID!
│ ┌─────────────┐ │       │ ┌─────────────┐ │
│ │ Key Material│ │       │ │OLD Material │ │  ◄── Kept for decrypt
│ │ (v1)        │ │       │ │(v1)         │ │
│ └─────────────┘ │       │ ├─────────────┤ │
└─────────────────┘       │ │NEW Material │ │  ◄── Used for encrypt
                          │ │(v2)         │ │
                          │ └─────────────┘ │
                          └─────────────────┘

⚠️ Exam trap: Rotation period = 1 year FIXED (cannot be changed to 90 days, 6 months, etc.)

⚠️ Exam trap: Imported keys can ONLY be rotated manually (no automatic rotation)

Manual Rotation (for custom rotation periods):

If policy requires rotation more frequently than 1 year (e.g., 6 months):

Create a new KMS CMK
Update the Key Alias to point to new key
Keep old key (needed to decrypt old data)
Applications using alias automatically use new key

6 months ago:                    Now (after manual rotation):
┌─────────────────┐              ┌─────────────────┐
│ Alias: my-key   │──────────┐   │ Alias: my-key   │──────────┐
└─────────────────┘          │   └─────────────────┘          │
                             ▼                                 ▼
                    ┌─────────────┐               ┌─────────────┐
                    │ CMK-OLD     │               │ CMK-NEW     │
                    │ (key-111)   │               │ (key-222)   │
                    └─────────────┘               └─────────────┘
                           │                             │
                           │ Still exists!               │
                           │ (decrypt old data)          │ (new encryptions)
                           ▼                             ▼

⚠️ Exam trap: “Rotate every 6 months” → Manual rotation with aliases (auto rotation is 1 year only, cannot configure)

Wrong answers explained:

❌ “Configure Retention Period with 180 days” → No such setting exists for rotation
❌ “AWS Managed Keys rotate every 3 months” → FALSE, they rotate every 1 year

KMS Access Control (Key Policies):

Key Policy = PRIMARY way to control access to KMS keys (resource-based policy).

Access Method	Description	Required?
Key Policy	Resource-based policy ON the key	✅ Always required
IAM Policy	Identity-based policy on user/role	Optional (works WITH key policy)
Grants	Temporary, delegated access	Optional

Critical difference from other AWS services:

S3: IAM policy alone CAN grant access
KMS: IAM policy alone CANNOT grant access - Key Policy must explicitly allow it

S3 Access:                          KMS Access:
IAM Policy ──► S3 Bucket            IAM Policy ──┐
     │                                           │
     └─► Access granted!                         ▼
                                    Key Policy ──► KMS Key
                                         │
                                         └─► BOTH needed!

Default Key Policy:

Created automatically if you don’t specify one
Gives root user (entire AWS account) full access
Allows IAM policies to grant access to the key

Custom Key Policy - use cases:

Define specific users/roles that can USE the key
Define who can ADMINISTER the key (manage, not use)
Cross-account access (required for sharing keys)

⚠️ Exam trap: “KMS IAM Policy” alone → NOT enough! Key Policy is required. IAM policies work only if Key Policy allows it.

⚠️ Exam trap: “KMS ACL” → Does NOT exist! (Unlike S3, KMS has no ACLs)

KMS Grants:

Temporary, programmatic access delegation
Use case: AWS service needs to use key on your behalf
Can be revoked without changing key policy

Copying Snapshots Across Accounts:

Create snapshot encrypted with Customer Managed Key
Attach KMS Key Policy for cross-account access
Share the encrypted snapshot
(Target account) Copy snapshot, re-encrypt with target account’s CMK
Create volume from snapshot

Copying Snapshots Across Regions:

KMS keys are regional → must re-encrypt with destination region’s key
Use KMS ReEncrypt to change encryption key during copy

Region A (eu-west-2)              Region B (ap-southeast-2)
┌─────────────┐                   ┌─────────────┐
│ EBS Volume  │                   │ EBS Volume  │
│ (KMS Key A) │                   │ (KMS Key B) │
└──────┬──────┘                   └──────▲──────┘
       │                                 │
       ▼                                 │
┌─────────────┐   ReEncrypt with   ┌─────────────┐
│ Snapshot    │   KMS Key B        │ Snapshot    │
│ (Key A)     │ ─────────────────► │ (Key B)     │
└─────────────┘                    └─────────────┘

KMS Multi-Region Keys:

Identical keys in different regions (same key ID, key material, rotation)
Encrypt in one region, decrypt in another - no re-encrypt needed
NOT global: Primary + Replicas, each managed independently
ARN format: arn:aws:kms:<region>:111122223333:key/mrk-... (note mrk- prefix)

⚠️ Exam trap: “The same KMS key cannot exist in two regions” → FALSE with Multi-Region keys. Regular KMS keys are regional, but Multi-Region keys CAN exist in multiple regions with same key ID.

                    ┌─────────────────┐
                    │   us-west-2     │
                    │ Replica Key     │
                    │ mrk-1234...     │
                    └────────▲────────┘
                             │ sync
┌─────────────────┐          │          ┌─────────────────┐
│   us-east-1     │──────────┴──────────│   eu-west-1     │
│ PRIMARY Key     │       sync          │ Replica Key     │
│ mrk-1234...     │─────────────────────│ mrk-1234...     │
└─────────────────┘                     └─────────────────┘

Multi-Region Key Use Cases:

Global client-side encryption
Encryption on Global DynamoDB tables
Encryption on Global Aurora databases

⚠️ Exam trap: Multi-Region keys are NOT “global keys” - each replica is managed independently in its region

AMI Sharing with KMS Encryption:

When sharing encrypted AMI across accounts, you must share BOTH the AMI AND the KMS key access.

Account A (Source)                        Account B (Target)
┌────────────────────────────────┐       ┌────────────────────────────────┐
│                                │       │                                │
│  ┌──────────────┐              │       │              ┌──────────────┐  │
│  │ AMI          │              │       │              │ EC2 Instance │  │
│  │ (encrypted)  │──────────────┼──────►│──────────────│ (launched)   │  │
│  └──────┬───────┘   Share AMI  │       │   Launch     └──────────────┘  │
│         │                      │       │                      ▲         │
│         │ encrypted with       │       │                      │         │
│         ▼                      │       │                      │         │
│  ┌──────────────┐              │       │              uses key to       │
│  │ KMS Key      │──────────────┼──────►│──────────────decrypt           │
│  │ (CMK)        │  Share Key   │       │                                │
│  └──────────────┘  (Key Policy)│       │                                │
│                                │       │                                │
└────────────────────────────────┘       └────────────────────────────────┘

Steps to share encrypted AMI:

Source Account: AMI encrypted with Customer Managed Key (CMK)
Modify KMS Key Policy: Add target account as authorized user
Share AMI: Grant LaunchPermission to target account
Target Account: Launch instance - KMS automatically decrypts

⚠️ Exam trap: Cannot share AMI encrypted with AWS Managed Key (aws/ebs) - must use Customer Managed Key

S3 Replication - Encryption Considerations:

Encryption Type	Replication Behavior
Unencrypted	Replicated by default
SSE-S3	Replicated by default
SSE-C (customer provided key)	Can be replicated
SSE-KMS	Must enable option explicitly

SSE-KMS Replication Requirements:

Specify which KMS Key to encrypt objects in target bucket
Adapt KMS Key Policy for target key
IAM Role needs: kms:Decrypt (source key) + kms:Encrypt (target key)
May get KMS throttling → request Service Quotas increase

⚠️ Exam trap: Multi-Region KMS keys are treated as independent keys by S3 - object is still decrypted then re-encrypted (no optimization)

AWS Secrets Manager:

AWS Secrets Manager stores and manages secrets with automatic rotation.

Force rotation every X days (uses Lambda for rotation)
Native integration with RDS (MySQL, PostgreSQL, Aurora)
Secrets encrypted using KMS
Mostly meant for RDS/database credential management

Multi-Region Secrets:

Replicate secrets across multiple AWS Regions
Read replicas stay in sync with primary
Can promote replica to standalone secret
Use cases: multi-region apps, disaster recovery, multi-region DB

us-east-1 (Primary)                    us-west-2 (Secondary)
┌─────────────────┐     replicate      ┌─────────────────┐
│ Secrets Manager │ ─────────────────► │ Secrets Manager │
│   MySecret-A    │                    │   MySecret-A    │
│   (primary)     │                    │   (replica)     │
└─────────────────┘                    └─────────────────┘

SSM Parameter Store vs Secrets Manager:

Feature	SSM Parameter Store	Secrets Manager
Cost	Free tier (Standard), charges for Advanced	$0.40/secret/month + API calls
Auto Rotation	❌ No	✅ Yes (built-in Lambda)
RDS Integration	Manual	✅ Native (MySQL, PostgreSQL, Aurora)
KMS Encryption	Optional (SecureString)	✅ Always encrypted
Hierarchy	✅ Path-based (`/app/dev/db-password`)	❌ Flat
Multi-Region	❌ No	✅ Yes (replicas)
Version Tracking	✅ Built-in	✅ Built-in
Pull from CF/CDK	✅ Direct reference	✅ Direct reference

SSM Parameter Store - Version Tracking:

Every edit creates a new version automatically
Previous versions are retained (history)
Can view/retrieve any version’s value
Get latest: aws ssm get-parameter --name /my/param
Get specific version: aws ssm get-parameter --name /my/param:3

Parameter: /app/db-password
┌─────────────────────────────────────────┐
│ Version 1: "oldpass123"    (2024-01-01) │
│ Version 2: "newpass456"    (2024-06-01) │
│ Version 3: "latestpass789" (2025-01-01) │ ◄── Current
└─────────────────────────────────────────┘

⚠️ Exam trap: “Track secret values over time” → SSM Parameter Store (built-in versioning)

⚠️ Exam trap: “KMS Versioning” → Does NOT exist! KMS has key rotation (new key material), not value versioning

Where to Store Configuration/Secrets - Decision Guide:

Requirement	Best Service	Why NOT others
Config values + version history	SSM Parameter Store	DynamoDB (overkill), S3 (not designed for this), EBS (storage volume)
DB credentials + auto rotation	Secrets Manager	SSM (no auto rotation), KMS (encryption only)
Hierarchical config (`/app/prod/db`)	SSM Parameter Store	Secrets Manager (flat structure)
Sensitive + multi-region	Secrets Manager	SSM (no multi-region)

⚠️ Exam trap: “RDS password + automatic rotation”

✅ Secrets Manager — ONLY service with native auto-rotation for RDS
❌ KMS — encrypts data, doesn’t STORE secrets
❌ SSM Parameter Store — stores secrets, but NO auto-rotation (manual Lambda required)

Why SSM Parameter Store for “externally maintain config”:

✅ Designed for configuration management
✅ Version history built-in
✅ Hierarchical paths (organize by app/environment)
✅ Free tier available
✅ Native integration with EC2, Lambda, ECS, CloudFormation

Wrong answers explained:

❌ DynamoDB: Database, overkill for simple config values
❌ S3: Object storage, not designed for config management (no native versioning of values)
❌ EBS: Block storage volume, attaches to EC2 (not a config store!)

When to use which:

Secrets Manager: DB credentials, need rotation, RDS integration, multi-region
Parameter Store: Configuration values, hierarchical data, cost-sensitive, version history needed

⚠️ Exam trap: “Automatic rotation for DB credentials” → Secrets Manager (Parameter Store has NO auto rotation)

Lambda + Secrets - Security Options (worst to best):

Option	Security Level	Why
❌ Embed in code	WORST	Visible in source control, logs, anyone with code access
❌ Plaintext env var	BAD	Visible in Lambda console, CloudWatch logs
✅ Encrypted env var + KMS	GOOD	Encrypted at rest, decrypted at runtime
✅✅ Secrets Manager/SSM	BEST	Centralized, audit trail, rotation, no env vars

Encrypted Environment Variable Flow:

1. Store secret as encrypted env var (using KMS)
   ┌─────────────────────────────────────────┐
   │ Lambda Config                           │
   │ DB_PASSWORD = AQICAHh...encrypted...    │
   └─────────────────────────────────────────┘
                      │
2. At runtime, Lambda decrypts using KMS
                      │
                      ▼
   ┌──────────┐    decrypt    ┌──────────┐
   │  Lambda  │ ────────────► │   KMS    │
   │  code    │ ◄──────────── │   CMK    │
   └──────────┘   plaintext   └──────────┘
                      │
3. Use decrypted value to connect to DB
                      │
                      ▼
               ┌──────────┐
               │    RDS   │
               └──────────┘

Why encrypted env var is “most secure” in the question:

Among the given options (embed, plaintext, encrypted), encrypted is best
Secret is encrypted at rest in Lambda configuration
Only decrypted when Lambda executes (in memory, briefly)
Lambda needs kms:Decrypt permission on the CMK

⚠️ Exam context: If Secrets Manager is an option, it’s usually the BEST answer (centralized + rotation + audit). But among the 3 options given, encrypted env var wins.

AWS Certificate Manager (ACM):

ACM provisions, manages, and deploys TLS/SSL certificates.

Free for public TLS certificates
Automatic renewal for ACM-generated certs (60 days before expiry)
Supports public and private certificates

ACM Integrations:

Service	Notes
ELB (CLB, ALB, NLB)	Provision certs directly
CloudFront	Must be in us-east-1
API Gateway	Edge-optimized or Regional

⚠️ Exam trap: Cannot use ACM with EC2 directly (private key can’t be extracted)

ACM + API Gateway - Certificate Region Rules:

API Gateway Type	Certificate Location
Edge-Optimized	ACM cert must be in us-east-1 (CloudFront region)
Regional	ACM cert must be in same region as API Gateway

Memory trick: “Where does TLS terminate?”

Edge-Optimized → CloudFront terminates TLS → CloudFront = us-east-1 → cert in us-east-1
Regional → API Gateway terminates TLS → cert in same region as API

API Gateway Endpoint Types Explained:

Type	Audience	How It Works	ACM Region
Edge-Optimized (default)	Global clients	Requests routed via CloudFront edge locations → reduces latency	us-east-1 only
Regional	Same-region clients	Direct access, can add your own CloudFront for more control	Same as API Gateway
Private	VPC only	Access via VPC Interface Endpoint (ENI)	Same as API Gateway

Edge-Optimized (default):

Uses AWS-managed CloudFront distribution behind the scenes
API Gateway still lives in one region (your chosen region)
Certificate MUST be in us-east-1 because CloudFront is global and reads certs from us-east-1
Best for: globally distributed clients

Regional:

No CloudFront in front (direct access)
Can manually add your own CloudFront for caching control + DDoS protection
Certificate must be in same region as the API Gateway stage
Best for: clients in same region, or when you want custom CloudFront config

Edge-Optimized:                         Regional:
┌─────────────┐                        ┌─────────────────┐
│  us-east-1  │                        │  ap-southeast-2 │
│ ┌─────────┐ │                        │ ┌─────────────┐ │
│ │   ACM   │─┼──► CloudFront          │ │ API Gateway │ │
│ └─────────┘ │   (AWS managed)        │ └──────┬──────┘ │
└─────────────┘        │               │        │        │
                       ▼               │ ┌──────▼──────┐ │
               ┌─────────────┐         │ │     ACM     │ │
               │ API Gateway │         │ │ (same rgn)  │ │
               │ (any region)│         │ └─────────────┘ │
               └─────────────┘         └─────────────────┘

Regional + Custom CloudFront (more DDoS control):

┌─────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  us-east-1  │     │  ap-southeast-2 │     │  ap-southeast-2 │
│ ┌─────────┐ │     │ ┌─────────────┐ │     │ ┌─────────────┐ │
│ │   ACM   │─┼────►│ │ CloudFront  │─┼────►│ │ API Gateway │ │
│ │(for CF) │ │     │ │(your own)   │ │     │ │  Regional   │ │
│ └─────────┘ │     │ └─────────────┘ │     │ └─────────────┘ │
└─────────────┘     └─────────────────┘     └─────────────────┘
                    + WAF attached here

⚠️ Exam trap: Edge-Optimized uses CloudFront but certificate must be in us-east-1, NOT in the API Gateway’s region

Route 53 Setup:

After custom domain setup, create CNAME or A-Alias record in Route 53
Alias record is preferred (free, supports apex domain)

ACM + ALB - HTTP to HTTPS Redirect:

User ──► HTTP ──► ALB ──► Redirect to HTTPS
     ◄── 301 ◄──────┘
User ──► HTTPS ──► ALB ──► EC2 (Auto Scaling)
                    │
                    ▼
                   ACM (provision/maintain certs)

Importing Public Certificates:

Import certs generated outside ACM
No automatic renewal - must import new cert before expiry
ACM sends daily expiration events starting 45 days before expiry
Events go to EventBridge → Lambda, SNS, SQS
AWS Config rule: acm-certificate-expiration-check

⚠️ Exam trap — ACM certificate expiry monitoring (created vs imported):

Feature	ACM-Created Certs	Imported (Third-Party) Certs
Auto-renewal	✅ Yes (60 days before)	❌ No — must manually re-import
CW `DaysToExpiry` metric	✅ Yes	❌ No
EventBridge events	✅ Yes (daily, 45 days before)	✅ Yes (daily, 45 days before)
AWS Config rule	✅ Works	✅ Works — best for imported

“Third-party/imported cert + notify before expiry + least effort” → AWS Config managed rule acm-certificate-expiration-check → SNS
“ACM-created cert + notify” → EventBridge daily events OR CW DaysToExpiry alarm
❌ CW DaysToExpiry for imported certs → metric only exists for ACM-created certs!
❌ ACM link to LetsEncrypt → No such feature exists

AWS WAF (Web Application Firewall):

AWS WAF protects web apps from Layer 7 (HTTP) exploits.

Deploy on:

Application Load Balancer
API Gateway
CloudFront
AppSync GraphQL API
Cognito User Pool

Web ACL Rules:

Rule Type	Description
IP Set	Up to 10,000 IPs (use multiple rules for more)
String match	HTTP headers, body, URI strings
SQL injection	Block SQLi attacks
XSS	Block Cross-Site Scripting
Size constraints	Limit request size
Geo-match	Block countries
Rate-based	DDoS protection (count events)

Web ACL are Regional except for CloudFront (global)
Rule groups: Reusable set of rules

WAF + Fixed IP (Load Balancer):

WAF does NOT support NLB (Layer 4)
Use Global Accelerator for fixed IP + WAF on ALB
WebACL must be in same region as ALB

⚠️ Exam trap: “Attach WAF to NLB” → IMPOSSIBLE! WAF = Layer 7 (HTTP), NLB = Layer 4 (TCP/UDP). Use ALB instead, or put Global Accelerator in front for fixed IPs.

WAF-Compatible Services:

✅ Supported	❌ NOT Supported
ALB	NLB
API Gateway	EC2 directly
CloudFront	Route 53
AppSync	CLB (Classic)
Cognito User Pool

Users ──► Global Accelerator ──► ALB ◄── WAF (WebACL)
          (Fixed IP: 1.2.3.4)     │       (same region)
                                  ▼
                             EC2 Instances

WAF vs Firewall Manager vs Shield:

Service	Use Case	Scope
WAF	Granular protection, Web ACL rules	Single resource
Firewall Manager	Manage WAF across accounts, auto-protect new resources	AWS Organization
Shield Advanced	DDoS protection, SRT support, cost protection	Enhanced DDoS

Decision Guide:

Granular protection of single resource → WAF alone
WAF across accounts + auto-protect new resources → Firewall Manager + WAF
Frequent DDoS attacks + need SRT support → Shield Advanced

AWS Shield (DDoS Protection):

Feature	Shield Standard	Shield Advanced
Cost	Free (all customers)	$3,000/month/org
Layer	Layer 3/4	Layer 3/4/7
Protection	SYN/UDP floods, reflection	+ sophisticated attacks
Resources	All	EC2, ELB, CloudFront, Global Accelerator, Route 53
DDoS Response Team	❌	✅ 24/7 access to DRP
Cost Protection	❌	✅ (no higher fees during attack)
Auto WAF rules	❌	✅ (creates rules for L7 attacks)

AWS Firewall Manager:

Firewall Manager manages security rules across all accounts in AWS Organization.

Manages:

WAF rules (ALB, API Gateway, CloudFront)
AWS Shield Advanced
Security Groups (EC2, ALB, ENI in VPC)
AWS Network Firewall (VPC level)
Route 53 Resolver DNS Firewall
Policies created at region level
Rules applied to new resources automatically (compliance)

⚠️ Exam keywords → Firewall Manager:

“centrally manage” + “all accounts” + “Organization”
“Security Groups” + “across accounts”
“Shield Advanced” + “Organization-wide”
“WAF rules” + “multiple accounts”

Why NOT others for “centrally manage across accounts”:

❌ Shield — protection service, not management
❌ GuardDuty — threat detection, not rule management
❌ Config — compliance audit, not rule enforcement

DDoS Resiliency Best Practices:

AWS DDoS Best Practices Reference Architecture:

                            AWS Edge Services
┌─────────────────────────────────────────────────────────────────────┐
│  BP1: Global Accelerator    BP3: Route 53    BP1/BP2: CloudFront   │
│  (fixed IPs, Shield)        (DNS at edge)    (cache + WAF)         │
└─────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                              Region                                  │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                        VPC (BP5)                              │   │
│  │   ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐  │   │
│  │   │ Public      │    │ BP6: ELB    │    │ Private Subnet  │  │   │
│  │   │ Subnet      │───►│ + WAF (BP2) │───►│ BP7: Auto       │  │   │
│  │   │ (NACLs)     │    │ + API GW    │    │ Scaling Group   │  │   │
│  │   └─────────────┘    └─────────────┘    └─────────────────┘  │   │
│  │                            │                                  │   │
│  │                            ▼                                  │   │
│  │                    Security Groups                            │   │
│  └──────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

BP Summary Table:

BP	Service	Layer	Purpose
BP1	CloudFront, Global Accelerator	Edge	Absorb DDoS at edge, reduce origin load
BP2	WAF	L7	Filter malicious requests, rate limiting
BP3	Route 53	DNS	DNS at edge, shuffle sharding, health checks
BP4	API Gateway	L7	Hide backend, burst limits, API keys
BP5	VPC (SG + NACL)	L3/L4	Filter IPs at subnet/ENI level
BP6	ELB	L4/L7	Distribute traffic, scales automatically
BP7	Auto Scaling	Infra	Scale EC2 during traffic surges

1. Edge Location Mitigation (BP1, BP3):

Internet ──► CloudFront (BP1) ──► Origin
             │
             ├─ Caches static content (reduces origin requests)
             ├─ Absorbs L3/L4 attacks (SYN floods, UDP reflection)
             └─ Geo-blocking available

Internet ──► Global Accelerator (BP1) ──► ALB/NLB/EC2
             │
             ├─ Fixed Anycast IPs (2 IPs)
             ├─ Routes via AWS backbone (not public internet)
             ├─ Shield integration
             └─ Use when CloudFront not compatible (non-HTTP)

Internet ──► Route 53 (BP3) ──► Your resources
             │
             ├─ DNS resolution at edge
             ├─ Built-in DDoS protection
             └─ Health checks + failover

When to use which:

CloudFront: HTTP/HTTPS workloads, need caching
Global Accelerator: Non-HTTP (TCP/UDP), need fixed IPs, gaming, IoT
Route 53: Always (DNS is entry point)

2. Infrastructure Layer Defense (BP1, BP3, BP6, BP7):

                    DDoS Attack
                         │
                         ▼
┌─────────────────────────────────────────┐
│            Edge Services                │
│  (absorb volumetric attacks)            │
└────────────────────┬────────────────────┘
                     │ reduced traffic
                     ▼
┌─────────────────────────────────────────┐
│         ELB (BP6) - scales auto         │
│  (distributes across instances)         │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│     Auto Scaling Group (BP7)            │
│  (adds instances during surge)          │
│     ┌────┐  ┌────┐  ┌────┐  ┌────┐     │
│     │EC2 │  │EC2 │  │EC2 │  │EC2 │     │
│     └────┘  └────┘  └────┘  └────┘     │
└─────────────────────────────────────────┘

Key point: ELB + Auto Scaling = absorb legitimate traffic surges AND DDoS

3. Application Layer Defense (BP1, BP2):

Malicious Request ──► CloudFront ──► WAF (BP2) ──► ALB ──► App
                          │              │
                          │              ├─ SQL injection? BLOCK
                          │              ├─ XSS? BLOCK  
                          │              ├─ Rate > 2000/5min? BLOCK IP
                          │              ├─ Bad IP reputation? BLOCK
                          │              └─ Geo = blocked country? BLOCK
                          │
                          └─ Cached? Return from edge (origin never hit)

WAF Rules for DDoS:

Rate-based rules: Auto-block IPs exceeding threshold
Managed rules: AWS IP reputation list, anonymous IPs
Geo-match: Block entire countries

Shield Advanced (BP1, BP2, BP6):

Auto-creates WAF rules during L7 attacks
24/7 DDoS Response Team (DRT)
Cost protection (no billing spike during attack)

4. Attack Surface Reduction (BP1, BP4, BP5, BP6):

                    Attacker
                        │
                        ▼
              ┌─────────────────┐
              │   CloudFront    │ ◄── Only this IP is public
              │   (or API GW)   │
              └────────┬────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        ▼              ▼              ▼
   ❌ Can't reach  ❌ Can't reach  ❌ Can't reach
      EC2 IPs        Lambda          RDS
      directly       directly        directly

Obfuscation = hide your backend:

CloudFront/API Gateway/ELB = public-facing
EC2, Lambda, RDS = private (no public IPs)
Attacker can’t directly target your instances

Security Groups + NACLs (BP5):

SG: Allow only from ELB
NACL: Block known bad IP ranges at subnet level
Elastic IPs protected by Shield Advanced

Amazon GuardDuty:

GuardDuty is intelligent threat discovery using ML.

One-click enable (30-day trial), no software to install

Core Data Sources (always analyzed):

✅ Source	What It Detects
CloudTrail Management Events	Unusual API calls, create VPC, create trail
CloudTrail S3 Data Events	Get/list/delete object anomalies
VPC Flow Logs	Unusual traffic, suspicious IPs
DNS Logs	Compromised EC2 sending encoded DNS queries

Optional Features: EKS Audit Logs, RDS & Aurora login, EBS, Lambda, S3 Data Events

NOT a GuardDuty data source:

❌ NOT Scanned	Why it’s a trap
CloudWatch Logs	Common confusion - GuardDuty uses its own log analysis
Application logs	GuardDuty = infrastructure threats, not app logs
Custom logs	Not supported

⚠️ Exam trap: “GuardDuty scans CloudWatch Logs” → FALSE! GuardDuty scans CloudTrail, VPC Flow Logs, DNS Logs (not CloudWatch Logs)

Memory hook - GuardDuty sources: “CVD”

CloudTrail
VPC Flow Logs
DNS Logs

┌─────────────────┐
│ VPC Flow Logs   │──┐
├─────────────────┤  │     ┌───────────┐     ┌─────────────┐
│ CloudTrail Logs │──┼────►│ GuardDuty │────►│ EventBridge │──► SNS/Lambda
├─────────────────┤  │     └───────────┘     └─────────────┘
│ DNS Logs        │──┘
└─────────────────┘
  + Optional: S3, EBS, Lambda, RDS, EKS
  ❌ NOT: CloudWatch Logs

Findings → EventBridge → Lambda, SNS
Dedicated finding for CryptoCurrency attacks

Amazon Inspector:

Inspector performs automated security assessments.

Scans:

Target	What’s Scanned	Requires
EC2 instances	OS vulnerabilities, network reachability	SSM Agent
ECR Container Images	Vulnerabilities on push	-
Lambda Functions	Code vulnerabilities, package dependencies	-

Continuous scanning, only when needed
Uses CVE database for vulnerabilities
Risk score for prioritization
Findings → Security Hub, EventBridge

Lambda ──────┐
             │
SSM Agent ───┼────► Inspector ────► Security Hub
(EC2)        │         │             EventBridge
             │         ▼
ECR Images ──┘    Findings + Risk Score

GuardDuty vs Inspector vs Macie vs Config:

Service	What It Does	Looks At	Use Case
GuardDuty	Threat detection	CloudTrail, VPC Flow, DNS	“Is someone attacking me?”
Inspector	Vulnerability scanning	EC2 OS, ECR images, Lambda	“Do I have unpatched CVEs?”
Macie	Sensitive data discovery	S3 buckets	“Do I have exposed PII?”
Config	Configuration compliance	Resource configs	“Are my resources compliant?”

⚠️ Exam trap keywords:

“OS vulnerabilities” / “CVE” / “patch” → Inspector
“Threats” / “unusual API” / “compromised” → GuardDuty
“PII” / “sensitive data” / “S3 data” → Macie
“Compliance” / “configuration” / “rules” → Config

Wrong answers for “OS vulnerabilities” question:

❌ Shield: DDoS protection (not vulnerability scanning)
❌ GuardDuty: Threat detection (not vulnerability scanning)
❌ Config: Configuration compliance (not vulnerability scanning)

Amazon Macie:

Macie discovers and protects sensitive data using ML and pattern matching.

Identifies PII (Personally Identifiable Information)
Scans S3 buckets only

S3 Buckets ────► Macie ────► EventBridge ────► integrations
              (discover PII)    (notify)      (Lambda, SNS, etc.)

🎯 MASTER SUMMARY: AWS Security Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Encryption Ownership Spectrum

AWS security is about WHO controls the keys and WHERE encryption happens.

Most AWS Control ◄──────────────────────────────────► Most Customer Control

SSE-S3          SSE-KMS           SSE-KMS (CMK)      SSE-C           Client-Side
(AWS owns)      (AWS managed)     (Customer managed)  (Customer key)  (Customer encrypts)

Key insight: The more control you want, the more responsibility you have.

“AWS must NOT see my keys” → CloudHSM or Client-Side
“I want AWS to manage but I control access” → KMS with CMK

Principle 2: Key Policy is King (KMS Access)

Unlike S3/Lambda, KMS requires Key Policy — IAM policy alone is NOT enough.

Why? KMS keys are highly sensitive. AWS designed it so you MUST explicitly allow access at the key level.

Derivation: If question mentions “IAM policy for KMS” → check if Key Policy allows it. No Key Policy = No Access.

Principle 3: Regional vs Global Services

Understanding where services “live” determines where certificates/keys must be.

Service	Scope	Certificate/Key Location
CloudFront	Global (us-east-1)	ACM in us-east-1
API Gateway Edge-Optimized	Uses CloudFront	ACM in us-east-1
API Gateway Regional	Regional	ACM in same region
KMS	Regional	Must re-encrypt when crossing regions
KMS Multi-Region	Multi-region	Same key ID across regions
CloudHSM	Regional	No cross-region replication

Derivation: “Where does TLS terminate?” → that’s where cert must be.

Principle 4: Detection vs Protection vs Management

Security services fall into three categories:

Category	Services	Action
Detection	GuardDuty, Inspector, Macie, Config	Find problems
Protection	WAF, Shield, Network Firewall	Block attacks
Management	Firewall Manager, Security Hub	Centralize/aggregate

Derivation: “Centrally manage across accounts” → Management category → Firewall Manager

Principle 5: Layer Determines Service

Network attacks happen at different layers:

Layer	Attacks	Protection
L3/L4	SYN floods, UDP reflection	Shield, NACLs, Security Groups
L7	SQL injection, XSS, DDoS	WAF, API Gateway throttling

Derivation: “NLB + WAF” → IMPOSSIBLE (NLB = L4, WAF = L7)

Principle 6: Rotation ≠ Versioning

Don’t confuse these:

Concept	What Changes	Service
Key Rotation	New key material, same key ID	KMS (1 year fixed)
Secret Rotation	New password/credential	Secrets Manager (configurable)
Version History	Track all previous values	SSM Parameter Store

Derivation: “Rotate every 6 months” → Manual rotation with aliases (KMS auto is 1 year only)

Principle 7: Storage Service Owns Client-Side Naming

“[Service] Client-Side Encryption” means YOUR APP encrypts before sending to [Service].

“Aurora Client-side” = App encrypts → Aurora stores ciphertext
“S3 Client-side” = App encrypts → S3 stores ciphertext
“Lambda Client-side” = NONSENSE (Lambda is compute, not storage!)

Derivation: Client-side encryption question → identify the STORAGE service

Principle 8: Auto-Rotation is Rare

Most services do NOT auto-rotate:

Service	Auto-Rotation?
Secrets Manager	✅ Yes (built-in)
KMS	✅ Yes (1 year only)
SSM Parameter Store	❌ No
IAM Access Keys	❌ No

Derivation: “DB credentials + auto rotation” → Secrets Manager (only option)

Part 2: Decision Trees (Follow Keywords → Find Answer)

Encryption Service Decision Tree

Need encryption?
│
├─► "DB credentials" + "auto rotation"
│   └─► Secrets Manager
│
├─► "Config values" + "version history"
│   └─► SSM Parameter Store
│
├─► "FIPS 140-2 Level 3" OR "AWS cannot access keys"
│   └─► CloudHSM
│
├─► "Multi-region" + "Global DB"
│   └─► KMS Multi-Region Keys (NOT CloudHSM)
│
├─► "Admins cannot see data"
│   └─► Client-Side Encryption
│
└─► Standard encryption
    └─► KMS (default choice)

Security Service Decision Tree

Security question?
│
├─► "Threat" / "attack" / "compromised" / "unusual API"
│   └─► GuardDuty
│
├─► "Vulnerability" / "CVE" / "patch" / "OS security"
│   └─► Inspector
│
├─► "PII" / "sensitive data" / "S3 data discovery"
│   └─► Macie
│
├─► "Compliance" / "configuration audit"
│   └─► Config
│
├─► "Centrally manage" / "across accounts" / "Organization"
│   └─► Firewall Manager
│
├─► "DDoS protection"
│   └─► Shield (Standard=free, Advanced=$3k/mo)
│
└─► "Layer 7" / "SQL injection" / "XSS" / "rate limiting"
    └─► WAF

ACM Certificate Location Decision Tree

Where to put ACM certificate?
│
├─► CloudFront distribution?
│   └─► us-east-1
│
├─► Edge-Optimized API Gateway?
│   └─► us-east-1 (uses CloudFront behind scenes)
│
├─► Regional API Gateway?
│   └─► Same region as API Gateway
│
└─► ALB?
    └─► Same region as ALB

The “CANNOT” List

❌ Impossible	Why
WAF + NLB	WAF = L7, NLB = L4
ACM + EC2 directly	Can’t extract private key
KMS auto-rotate < 1 year	Fixed at 1 year
Imported key auto-rotation	Manual only via alias
CloudHSM multi-region replication	Single-region only
GuardDuty scan CloudWatch Logs	Uses CloudTrail, VPC Flow, DNS only
Single asymmetric key for encrypt + sign	Choose one at creation
Share AMI with AWS Managed Key	Must use Customer Managed Key

Part 3: Scenario Pattern Recognition

Pattern: “RDS credentials + automatic rotation”

Keywords: RDS, password, credentials, automatic rotation Answer: Secrets Manager Why: Only service with native RDS rotation integration

Pattern: “FIPS 140-2 Level 3 compliance”

Keywords: FIPS, Level 3, compliance, tamper-evident Answer: CloudHSM Why: KMS is Level 2 only; CloudHSM is Level 3

Pattern: “AWS should not have access to encryption keys”

Keywords: AWS cannot access, customer-managed hardware Answer: CloudHSM Why: KMS = AWS manages key material; CloudHSM = you manage entirely

Pattern: “Aurora Global + client-side encryption”

Keywords: Global database, multi-region, client-side, encrypt Answer: KMS Multi-Region Keys Why: CloudHSM can’t replicate keys across regions

Pattern: “Centrally manage Security Groups across accounts”

Keywords: centrally, manage, multiple accounts, Organization Answer: Firewall Manager Why: Only service that manages security rules across Organization

Pattern: “Edge-Optimized API Gateway + ACM certificate”

Keywords: Edge-Optimized, API Gateway, certificate, SSL Answer: us-east-1 Why: Edge-Optimized uses CloudFront → CloudFront = us-east-1

Pattern: “Notify 30 days before certificate expires”

Keywords: certificate expiry, notification, X days before Answer: Depends on cert type:

Imported/third-party cert → AWS Config rule acm-certificate-expiration-check → SNS
ACM-created cert → EventBridge daily events (or CW DaysToExpiry alarm) Why: CW DaysToExpiry metric doesn’t exist for imported certs. Config rule works for both.

Pattern: “Fixed IP address + WAF protection”

Keywords: fixed IP, static IP, WAF, DDoS Answer: Global Accelerator + ALB + WAF Why: WAF can’t attach to NLB; Global Accelerator provides fixed IPs to ALB

Pattern: “OS vulnerabilities on EC2 instances”

Keywords: vulnerability, CVE, patch, EC2, OS Answer: Inspector (with SSM Agent) Why: Inspector scans for CVEs; GuardDuty detects threats, not vulnerabilities

Pattern: “Track configuration changes over time”

Keywords: configuration, history, version, changes Answer: SSM Parameter Store (for config values) or Config (for resources) Why: Built-in versioning for every change

Pattern: “Sensitive data discovery in S3”

Keywords: PII, sensitive data, S3, discover Answer: Macie Why: ML-based PII discovery specifically for S3

Pattern: “Unusual API calls detected”

Keywords: unusual API, suspicious activity, threat, compromise Answer: GuardDuty Why: Analyzes CloudTrail for anomalous API patterns

Pattern: “Protect against SQL injection”

Keywords: SQL injection, XSS, Layer 7, web exploits Answer: WAF Why: WAF has managed rules for common web attacks

Pattern: “DDoS protection with 24/7 support team”

Keywords: DDoS, response team, SRT, cost protection Answer: Shield Advanced Why: Shield Advanced includes DDoS Response Team access

Pattern: “Rotate KMS key every 6 months”

Keywords: rotate, 6 months, 90 days, custom period Answer: Manual rotation with Key Alias Why: Auto-rotation is fixed at 1 year; manual rotation for custom periods

Part 4: Quick Reference Tables

Security Services Comparison

Service	Detects	Scans	Output
GuardDuty	Threats, attacks	CloudTrail, VPC Flow, DNS	EventBridge
Inspector	Vulnerabilities	EC2 OS, ECR, Lambda	Security Hub
Macie	Sensitive data	S3 buckets	EventBridge
Config	Non-compliance	Resource configs	SNS, S3

Secrets/Config Storage Comparison

Requirement	Service
DB credentials + auto rotation	Secrets Manager
Config values + versioning	SSM Parameter Store
Hierarchical config paths	SSM Parameter Store
Multi-region secrets	Secrets Manager
Free tier needed	SSM Parameter Store

KMS Key Types

Key Type	Auto-Rotation	Period	Manual Rotation
AWS Managed	Always ON	1 year	N/A
Customer Managed	Optional	1 year	Via alias
Imported	❌ Never	N/A	Via alias only

WAF Compatibility

✅ Works	❌ Doesn’t Work
ALB	NLB
CloudFront	CLB
API Gateway	EC2 directly
AppSync	Route 53
Cognito User Pool

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“RDS” + “auto rotation”	Secrets Manager
“FIPS 140-2 Level 3”	CloudHSM
“AWS cannot access keys”	CloudHSM
“Multi-region” + “encryption” + “Global DB”	KMS Multi-Region Keys
“Edge-Optimized” + “certificate”	us-east-1
“Regional API Gateway” + “certificate”	Same region as API
“CloudFront” + “certificate”	us-east-1
“Centrally manage” + “accounts”	Firewall Manager
“Security Groups” + “Organization”	Firewall Manager
“WAF” + “multiple accounts”	Firewall Manager
“OS vulnerability” / “CVE”	Inspector
“Threat” / “unusual API” / “compromised”	GuardDuty
“PII” / “sensitive data” + “S3”	Macie
“Configuration compliance”	Config
“Fixed IP” + “WAF”	Global Accelerator + ALB + WAF
“NLB” + “WAF”	IMPOSSIBLE
“Rotate every 6 months”	Manual rotation with alias
“Asymmetric” + “encrypt AND sign”	Two separate keys needed
“Version history” + “config”	SSM Parameter Store
“Certificate expiry notification”	Imported = Config rule; ACM-created = EventBridge/CW
“DDoS” + “response team”	Shield Advanced
“DDoS” + “free”	Shield Standard
“SQL injection” / “XSS”	WAF
“Layer 7 protection”	WAF
“Layer 3/4 protection”	Shield, Security Groups, NACLs
“Cross-account encrypted AMI”	Customer Managed Key + Key Policy
“Admins cannot see data”	Client-Side Encryption
“Lambda Client-side Encryption”	WRONG ANSWER (Lambda isn’t storage)
“KMS IAM Policy” alone	NOT enough (Key Policy required)
“ACM” + “EC2 directly”	IMPOSSIBLE
“GuardDuty” + “CloudWatch Logs”	FALSE (not a data source)

Part 6: Elimination Checklist

For Encryption Questions:

□ Does it mention "auto rotation" for DB?
  → Yes = Secrets Manager
  → No = continue

□ Does it mention "FIPS Level 3" or "AWS cannot access"?
  → Yes = CloudHSM
  → No = continue

□ Does it mention "multi-region" + "Global database"?
  → Yes = KMS Multi-Region Keys
  → No = continue

□ Does it mention "config values" or "version history"?
  → Yes = SSM Parameter Store
  → No = probably KMS

For Security Service Questions:

□ Is it about DETECTION (finding problems)?
  → Threats/attacks = GuardDuty
  → Vulnerabilities/CVE = Inspector
  → Sensitive data = Macie
  → Configuration = Config

□ Is it about PROTECTION (blocking attacks)?
  → Layer 7 (HTTP) = WAF
  → Layer 3/4 (network) = Shield, SG, NACL

□ Is it about MANAGEMENT (centralize/aggregate)?
  → Across accounts = Firewall Manager
  → Aggregate findings = Security Hub

For Certificate Location Questions:

□ Is CloudFront involved (directly or Edge-Optimized)?
  → Yes = us-east-1
  → No = same region as the service

🏆 The Golden Rules

KMS Key Policy is mandatory (IAM alone never works for KMS)
KMS rotation = 1 year fixed (use manual rotation + alias for custom periods)
Edge-Optimized = us-east-1 (because CloudFront)
WAF = Layer 7 only (can’t attach to NLB)
CloudHSM = single-region (no multi-region replication)
Secrets Manager = auto-rotation (SSM Parameter Store doesn’t have it)
GuardDuty sources = CVD (CloudTrail, VPC Flow, DNS - NOT CloudWatch!)
“Centrally manage across accounts” = Firewall Manager (it’s in the name)
Inspector = vulnerabilities (GuardDuty = threats - different!)
Client-side encryption = storage service name (Lambda Client-side = nonsense)
Cross-account AMI = Customer Managed Key (AWS Managed Key can’t be shared)
Asymmetric key = encrypt OR sign (never both with same key)
ACM + EC2 directly = impossible (can’t extract private key)
Certificate expiry alerts — imported = Config rule + SNS; ACM-created = EventBridge/CW DaysToExpiry
Multi-region encryption = KMS Multi-Region Keys (CloudHSM can’t do it)

Deployment (IaaS) and software development (CI/CD):

AWS Cloudformation is a declarative way of outlining and creating your AWS Infrastructure, for any resources, in the right order and with exact configuration that you specify.

Infrastructure as Code (IaC): no resources are manually created, changes to the infrastructure are reviewed through code;
Cost: each resource in the stack is tagged with an identifier → easily see stack costs. Saving strategy: automate deletion of templates at 5 PM, recreate at 8 AM;
Productivity: destroy and re-create infrastructure on the fly. Declarative programming (no need to figure out ordering and orchestration);
Supports (almost) all AWS resources — use custom resources for unsupported ones;
Don’t re-invent the wheel: leverage existing templates on the web.

CloudFormation Service Role:

IAM role that allows CloudFormation to create/update/delete stack resources on your behalf
Users can manage stacks without having direct permissions to the underlying resources
User only needs cloudformation:* + iam:PassRole permissions
Use case: least privilege — let users deploy stacks without giving them full S3/EC2/RDS permissions

User (cloudformation:*, iam:PassRole) ──► CloudFormation ──► Service Role (s3:*, ec2:*) ──► Resources

AWS Infrastructure Composer (formerly Application Composer): visually design and build serverless applications quickly on AWS. Deploy AWS infrastructure code without needing to be an expert in AWS. Configure how your resources interact with each other. Ability to import existing CloudFormation / SAM templates to visualize them. Help to visualize, build, and deploy modern applications from all AWS services that are supported by AWS CloudFormation.

AWS Cloud Development Kit (CDK) accelerates cloud development using common programming languages to model your applications, to deplay infrastructure and applicationg runtime code together.

AWS Elastic Beanstalk is a managed service of Platform as a Service (PaaS), developer centric view of deploying an application on AWS (using EC2, ASG, ELB, RDS and etc). Instance configuration, OS handling, deployment strategy, capacity provisioning, load balancing and auto-scaling, application health-monitoring & responsiveness, everything except the actual application code is responsibility of AWS Elastic Beanstalk.
Elastic Beanstalk automatically handles capacity provisioning, load balancing, autoscaling and application health monitoring.

AWS CodeDeploy is a fully managed deployment service that automates software deployments to various compute services, such as Amazon Elastic Compute Cloud (EC2), Amazon Elastic Container Service (ECS), AWS Lambda, and your on-premises servers. Use CodeDeploy to automate software deployments, eliminating the need for error-prone manual operations.

AWS CodeCommit is a fully managed, scalable and highly available code repository, using Git technology. Collaborate with others on code. Code changes are automatically versioned.

AWS CodeBuild is a fully managed, serverless, scalable & highly availble code building service in the cloud. Compiles source code, run tests and produces packages that are ready to be deployed.

AWS CodePipeline is a fully managed continuous delivery service that helps you automate your release pipelines for fast and reliable application and infrastructure updates. Compatible with CodeCommit, CodeBuild, CodeDeploy, Elastic Beanstalk, CloudFormation, GitHub, etc.

AWS CodeArtifact is a secure, scalable, and cost-effective package management for software development.

AWS CodeStar is a unified UI to easily manage software develompent activities in one place.

AWS Cloud9 is a cloud IDE (Intergrated Development Environment) for writing, running and debugging code.

AWS Step Functions: serverless visual workflow to archestrate Labmda functions. Sequence, parallel, conditions, timeouts, error handling, human approval feature etc. Integrates with EC2, ECS, On-premises servers, API Gatewat, SQS queues, etc.
AWS Step Functions excel in complex workflow orchestration scenarios, offering advanced features such as state management, error handling, and parallel execution.

AWS Amplify: a set of tools and services that helps you develop and deploy scalable full stack web and mobile applications. Authentication, Storage, API (REST, GraphQL), CI/CD, PubSub, Analytics, AI/ML Predictions, Monitoring, Source Code from AWS, GitHub, etc.
Amplify has serverless architecture simplifies maintenance and scales automatically. There is no need to provision or manage EC2 instances. Lambda and API Gateway handle availability and response to traffic spikes automatically. Upload code and let Amplify handle deploying and running it.

AWS Device Farm: fully-managed service that tests your web and mobile apps against desktop browsers, real mobile devices and tablets. Run tests concurrently on multiple devices. Ability to configure device ettings (GPS, language, WiFi, Bluetooth, etc).

AWS Systems Manager (SSM) — hybrid AWS service to manage infrastructure at scale (EC2 + on-premises servers). Requires SSM Agent installed on managed instances.

SSM Session Manager:

Secure shell access to EC2 and on-prem servers — no SSH, no bastion host, no port 22 needed
Supports Linux, macOS, Windows
Session logs → S3 or CloudWatch Logs
Access controlled via IAM permissions (not SSH keys)

SSM Run Command:

Execute documents (scripts) or commands across multiple instances (using resource groups)
No SSH needed — uses SSM Agent
Output → AWS Console, S3, or CloudWatch Logs
Notifications → SNS (In Progress, Success, Failed)
Can be triggered by: EventBridge, manual (Console/CLI/SDK)
Integrated with IAM & CloudTrail

SSM Patch Manager:

Automates patching of managed instances — OS, application, and security updates
Supports EC2 + on-premises servers (Linux, macOS, Windows)
Patch on-demand or scheduled via Maintenance Windows
Scan instances and generate patch compliance reports (missing patches)
Uses AWS-RunPatchBaseline document

SSM Maintenance Windows:

Define a schedule for when to perform actions on instances (OS patching, driver updates, software install)
Contains: Schedule, Duration, Set of registered instances, Set of registered tasks

Maintenance Windows ── trigger (e.g., every 24h) ──► Run Command ──► EC2 (with SSM Agent)

SSM Automation:

Simplifies common maintenance and deployment tasks (restart instances, create AMI, EBS snapshot)
Automation Runbook = SSM Documents defining actions (pre-defined or custom)
Triggers: AWS Console/CLI/SDK, EventBridge, Maintenance Windows, AWS Config (for rule remediations)

⚠️ Exam trap: “Secure shell access without SSH/port 22” → SSM Session Manager. “Run script on 100s of instances” → SSM Run Command. “Automate patching schedule” → SSM Patch Manager + Maintenance Windows.

🎯 MASTER SUMMARY: Deployment, IaC & CI/CD Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: IaC Tools Differ by Abstraction Level

CloudFormation = declarative YAML/JSON templates → AWS resources (lowest level, most control)
CDK = programming languages (Python, TypeScript) → generates CloudFormation templates
Elastic Beanstalk = PaaS, you provide code → it handles infra (highest abstraction)
Amplify = full-stack web/mobile framework → serverless backend + CI/CD

Principle 2: CI/CD Pipeline = Commit → Build → Deploy

AWS CI/CD chain: CodeCommit (source) → CodeBuild (build/test) → CodeDeploy (deploy) → orchestrated by CodePipeline. Each service is independent and can integrate with third-party tools.

Principle 3: CloudFormation Service Role = Least Privilege

Users don’t need permissions to underlying resources. They only need cloudformation:* + iam:PassRole. The Service Role has the resource permissions. This enables delegation without over-granting.

Principle 4: SSM = Manage Instances at Scale Without SSH

SSM Agent enables secure management of EC2 + on-prem servers. Five key features map to five different exam patterns:

Session Manager → secure shell (no SSH/port 22)
Run Command → execute scripts on fleet
Patch Manager → automate patching
Maintenance Windows → schedule operations
Automation → runbooks for complex tasks (+ Config remediation)

Principle 5: CodeDeploy = Multi-Target Deployment

CodeDeploy works on EC2, ECS, Lambda, AND on-premises — it’s the deployment tool that bridges cloud and on-prem.

Part 2: Decision Trees

IaC Tool Decision Tree

How do you want to define infrastructure?
│
├─ "YAML/JSON templates, full control" → CloudFormation
├─ "Programming language (Python/TS)" → CDK
├─ "Just upload my code, handle the rest" → Elastic Beanstalk
├─ "Full-stack web/mobile app" → Amplify
└─ "Visual drag-and-drop designer" → Infrastructure Composer

CI/CD Service Decision Tree

What CI/CD step do you need?
│
├─ SOURCE (store code) → CodeCommit (or GitHub)
├─ BUILD (compile/test) → CodeBuild
├─ DEPLOY (push to infra) → CodeDeploy
├─ ORCHESTRATE (chain all) → CodePipeline
├─ PACKAGE MGMT → CodeArtifact
└─ UNIFIED UI → CodeStar

SSM Feature Decision Tree

What do you need to do on EC2/on-prem?
│
├─ "Shell access without SSH" → Session Manager
├─ "Run script on many instances" → Run Command
├─ "Automate OS patching" → Patch Manager
├─ "Schedule maintenance tasks" → Maintenance Windows
├─ "Complex task automation / Config fix" → Automation (Runbooks)

Part 3: Scenario Pattern Recognition

Pattern: “Deploy infrastructure as code with full AWS resource control”

Keywords: IaC, template, declarative, all resources Answer: CloudFormation Why: Native AWS IaC, supports almost all resources, custom resources for unsupported.

Pattern: “Let users deploy stacks without giving them resource permissions”

Keywords: least privilege, deploy stacks, iam:PassRole Answer: CloudFormation Service Role Why: Service Role has resource permissions; user only needs cloudformation:* + iam:PassRole.

Pattern: “Just upload code, AWS handles everything else”

Keywords: PaaS, developer-centric, auto-scaling, health monitoring Answer: Elastic Beanstalk Why: Handles capacity, load balancing, scaling, monitoring. You only write code.

Pattern: “Define infrastructure using Python/TypeScript”

Keywords: programming language, CDK, familiar syntax Answer: AWS CDK Why: CDK compiles to CloudFormation. Use familiar languages instead of YAML/JSON.

Pattern: “Automate release pipeline: source → build → deploy”

Keywords: CI/CD, pipeline, automate releases Answer: CodePipeline (orchestrates CodeCommit + CodeBuild + CodeDeploy)

Pattern: “Deploy to EC2, ECS, Lambda, AND on-premises”

Keywords: deploy, multi-target, on-premises Answer: CodeDeploy Why: Only AWS deployment service that supports both cloud and on-prem.

Pattern: “Secure shell to EC2 without SSH keys or port 22”

Keywords: no SSH, no bastion, no port 22, secure shell Answer: SSM Session Manager Why: Uses SSM Agent + IAM permissions. Logs to S3/CloudWatch.

Pattern: “Run a script across 500 EC2 instances”

Keywords: fleet, multiple instances, run command, no SSH Answer: SSM Run Command

Pattern: “Schedule OS patching every Sunday at 2 AM”

Keywords: patch, schedule, compliance, OS updates Answer: SSM Patch Manager + Maintenance Windows

Pattern: “Auto-remediate non-compliant AWS Config rules”

Keywords: Config, remediate, auto-fix, non-compliant Answer: AWS Config + SSM Automation (Runbooks)

Pattern: “Full-stack web/mobile app with serverless backend”

Keywords: web app, mobile app, full-stack, serverless, Amplify Answer: AWS Amplify

Pattern: “Test mobile app on real devices”

Keywords: mobile testing, real devices, browsers Answer: AWS Device Farm

Part 4: Quick Reference Tables

IaC & Deployment Services

Service	What It Does	Abstraction
CloudFormation	IaC templates → AWS resources	Low (full control)
CDK	Code → CloudFormation templates	Medium
Beanstalk	Upload code → full environment	High (PaaS)
Amplify	Full-stack web/mobile framework	High (serverless)
Infrastructure Composer	Visual CloudFormation designer	Visual

CI/CD Pipeline Services

Service	Role	Serverless?
CodeCommit	Source repository (Git)	✅
CodeBuild	Build + test	✅
CodeDeploy	Deploy to EC2/ECS/Lambda/on-prem	✅
CodePipeline	Orchestrate pipeline	✅
CodeArtifact	Package management	✅

SSM Features

Feature	Purpose	Key Differentiator
Session Manager	Secure shell	No SSH/port 22, IAM-based
Run Command	Execute scripts on fleet	No SSH, EventBridge trigger
Patch Manager	Automate patching	Compliance reports
Maintenance Windows	Schedule operations	Schedule + duration + tasks
Automation	Complex task runbooks	Config remediation trigger

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“IaC, declarative templates”	CloudFormation
“Custom resources for unsupported”	CloudFormation
“Service Role, iam:PassRole”	CloudFormation Service Role
“Visual designer for CloudFormation”	Infrastructure Composer
“Define infra in Python/TypeScript”	CDK
“PaaS, upload code, handles rest”	Elastic Beanstalk
“Full-stack web/mobile serverless”	Amplify
“Automate CI/CD pipeline”	CodePipeline
“Source code repository (Git)”	CodeCommit
“Build and test code”	CodeBuild
“Deploy to EC2/ECS/Lambda/on-prem”	CodeDeploy
“Package management”	CodeArtifact
“No SSH, no port 22, secure shell”	SSM Session Manager
“Run script on fleet of instances”	SSM Run Command
“Automate OS patching”	SSM Patch Manager
“Schedule maintenance tasks”	SSM Maintenance Windows
“Config auto-remediation”	SSM Automation + Config
“Workflow orchestration, state machine”	Step Functions
“Test on real mobile devices”	Device Farm

🏆 The Golden Rules

CloudFormation = IaC king (declarative, almost all resources)
CDK → CloudFormation (CDK generates CF templates, not a replacement)
Beanstalk = PaaS (you code, AWS does everything else)
CodePipeline = orchestrator (chains Commit → Build → Deploy)
CodeDeploy = only multi-target (EC2 + ECS + Lambda + on-prem)
Session Manager = no SSH (no port 22, no bastion, no keys)
Run Command = fleet scripts (no SSH, resource groups, EventBridge)
Patch Manager + Maintenance Windows = scheduled patching
SSM Automation = Config remediation (Runbooks fix non-compliant resources)
Service Role = least privilege delegation (user doesn’t need resource permissions)

Machine Learning (ML) Services:

Amazon Rekognition automates image recognition and video analysis for your applications without machine learning (ML) experience.

Labeling
Content Moderation
Text Detection
Face Detection and Analysis (gender, age range, emotions)
Face Search and Verification (user verification, people counting)
Celebrity Recognition
Pathing (for sport game analysis)
Create database of “familiar faces”

Rekognition Content Moderation:

Detect inappropriate, unwanted, or offensive content (images + videos)
Use cases: social media, broadcast media, advertising, e-commerce
Set Minimum Confidence Threshold for flagging
Flag sensitive content for manual review in Amazon Augmented AI (A2I)
Helps comply with regulations

⚠️ Exam trap: “Moderate user-uploaded images” or “detect inappropriate content” → Rekognition Content Moderation + optionally A2I for human review.

Amazon Transcribe automatically convert speech to text, using deep learning process called automatic speech recognition (ASR).

Automatically remove Personally Identifiable Information (PII) using Redaction
Supports Automatic Language Identification for multi-lingual audio
Use cases:
- Transcribe customer service calls
- Automate closed captioning and subtitling
- Generate metadata for media assets (searchable archive)

⚠️ Exam trap: “Remove PII from audio/transcripts” → Amazon Transcribe with PII Redaction enabled.

Amazon Polly turn text into lifelike speech using deep learning. Allows to create applications that talk.

Pronunciation Lexicons: Customize word pronunciation (e.g., “St3ph4ne” → “Stephane”, “AWS” → “Amazon Web Services”)
SSML (Speech Synthesis Markup Language): Advanced customization
- Emphasize words/phrases
- Phonetic pronunciation
- Breathing sounds, whispering
- Newscaster speaking style

Amazon Translate natural and accurate language translation.

Localize content (websites, applications) for international users
Translate large volumes of text efficiently

Amazon Lex (technology that powers Alexa) easily add AI that understands intent, maintains context, and automates simple tasks across many languages to build chatbots, call center bots.

ASR (Automatic Speech Recognition) = speech to text
NLU (Natural Language Understanding) = recognize intent

Amazon Connect is an omnichannel cloud contact center that helps companies provide superior customer service at a lower cost. Amazon Connect provides a seamless experience across voice and chat for customers and agents.

Receive calls, create contact flows, cloud-based virtual contact center
Integrates with CRM systems and AWS services
No upfront payments, 80% cheaper than traditional contact centers

Amazon Comprehend fully managed and serverless service for natural language processing (NLP), that uses machine learning to find insights and relationships in text: define language of the text, extract key phrases, understands emotions in the text, etc. Create and group articles by topics that Comprehend will uncover.

Amazon Comprehend Medical:

Detects useful info in unstructured clinical text (physician’s notes, discharge summaries, test results)
Uses NLP to detect Protected Health Information (PHI) — DetectPHI API
Integrates with: S3 (stored docs), Kinesis Firehose (real-time), Transcribe (patient narratives)

Amazon SageMaker is a fully managed service for developers / data scientists to build ML models.

All ML processes in one place (label → build → train → tune → deploy)
No need to provision servers
ML workflow: Historical Data → Label → Build Model → Train & Tune → Apply to New Data → Prediction

Amazon Forecast is a fully managed service that uses ML to deliver highly accurate forecasts (product demand planning, financial planning, resource planning, etc).

Amazon Kendra is a fully managed document search service powered by ML.

Extract answers from within documents (text, PDF, HTML, PowerPoint, MS Word, FAQs)
Natural language search capabilities
Incremental Learning — learns from user interactions/feedback
Fine-tune search results (importance, freshness, custom ranking)
Data sources: S3, RDS, Google Drive, MS SharePoint, MS OneDrive, Salesforce, ServiceNow, 3rd party APIs

Kendra Architecture:
Data Sources ──► indexing ──► Knowledge Index ──► "Where is IT support?" ──► "1st floor"
(S3, RDS, etc)               (powered by ML)          (natural language)        (answer)

Amazon Personalize is a fully managed ML-service to build apps with real-time personalized recommendations.

Same technology used by Amazon.com
Personalized product recommendations, re-ranking, customized direct marketing
Integrates into: websites, apps, SMS, email marketing systems
Implement in days, not months (no need to build/train ML solutions)
Data sources: S3 (batch) or Personalize API (real-time)
Use cases: retail stores, media and entertainment

Personalize Architecture:
S3 (historical data) ─────────┐
                              ├──► Amazon Personalize ──► Websites, Mobile Apps, SMS, Emails
Personalize API (real-time) ──┘    (customized API)

Amazon Textract automatically extracts text, handwriting and data from any scanned documents using AI and ML.

Extract data from forms and tables
Read and process any document type (PDFs, images)
Use cases:
- Financial Services (invoices, financial reports)
- Healthcare (medical records, insurance claims)
- Public Sector (tax forms, ID documents, passports)

Textract Flow:
Document (ID, form, etc) ──► analyze ──► Amazon Textract ──► Structured JSON
                                                              {"Document ID": "123",
                                                               "Name": "...",
                                                               "DOB": "23.05.1997"}

Lex + Connect Integration (Call Center Pattern):

Phone Call ──► Connect ──► stream ──► Lex ──► invoke ──► Lambda ──► CRM
(schedule      (contact    (audio)   (intent              (action)   (database)
appointment)   center)              recognized)

AWS Machine Learning - Quick Reference

Service	Purpose	Key Feature
Rekognition	Image/video analysis	Face detection, content moderation
Transcribe	Speech → Text	PII redaction, multi-language
Polly	Text → Speech	Lexicons, SSML
Translate	Language translation	Localization
Lex	Chatbots	ASR + NLU (powers Alexa)
Connect	Contact center	80% cheaper, cloud-based
Comprehend	NLP text analysis	Sentiment, topics, entities
Comprehend Medical	Clinical text NLP	PHI detection
SageMaker	Build custom ML models	Full ML workflow
Forecast	Time-series predictions	Demand/resource planning
Kendra	Document search	Natural language, incremental learning
Personalize	Recommendations	Same as Amazon.com
Textract	Document data extraction	Forms, tables, handwriting
Bedrock	Generative AI (Foundation Models)	Claude, Llama, Titan, Stable Diffusion

Amazon Bedrock is a fully managed service for building generative AI applications using foundation models (FMs).

Access to multiple Foundation Models from AI companies:
- Anthropic (Claude)
- Meta (Llama)
- Amazon (Titan)
- Stability AI (Stable Diffusion)
- Cohere, AI21 Labs, and more
Serverless — no infrastructure to manage
Private — your data stays in your AWS account, not used to train models
Fine-tune models with your own data
Build agents that can execute tasks
Use cases: chatbots, content generation, summarization, code generation, image generation

⚠️ Exam trap: “Generative AI” or “Foundation Models” or “LLM on AWS” → Amazon Bedrock. SageMaker = build your own ML models.

Amazon Augmented AI (A2I) provides human review workflows for ML predictions.

Integrate with: Rekognition (content moderation), Textract (document review), SageMaker
Use when ML confidence is below threshold → route to human reviewer
Built-in workflows or custom workflows

⚠️ Exam trap: “Human review of ML predictions” or “manual review when confidence low” → Amazon A2I

ML Service Decision Tree

What do you need to do?
         │
    ┌────┴────┬─────────┬──────────┬──────────┬────────────┐
    ▼         ▼         ▼          ▼          ▼            ▼
 VISION    SPEECH    TEXT/NLP   SEARCH    PREDICT     GEN AI
    │         │         │          │          │            │
    ▼         ▼         ▼          ▼          ▼            ▼
Rekognition  ┌─┴─┐   ┌──┴──┐    ┌──┴──┐   ┌──┴───┐     Bedrock
             │   │   │     │    │     │   │      │
          Speech→ Text→  Comprehend Kendra Forecast  (Foundation
          Text   Speech  (NLP)    (docs) (time)     Models)
             │      │       │                │
             ▼      ▼       ▼                ▼
         Transcribe Polly  Medical?     Personalize
                              │         (recommend)
                              ▼
                         Comprehend
                          Medical

When to use which search service?

Search Type?
     │
     ├── Document Q&A ("Where is IT support?") ──► Kendra
     │   (natural language answers)
     │
     └── Full-text search (logs, partial match) ──► OpenSearch
         (search engine, analytics)

When to use which text extraction?

Extract from documents?
     │
     ├── Forms, tables, structured data ──► Textract
     │   (invoices, IDs, medical records)
     │
     └── Text in images/videos ──► Rekognition
         (signs, banners, license plates)

Custom ML vs Managed Services?

Need ML capability?
     │
     ├── Pre-built solution exists? ──► Use managed service
     │   (Rekognition, Comprehend, Forecast, etc.)
     │
     └── Need custom model? ──► SageMaker
         (your own algorithms, data)

Additional Exam Traps

⚠️ Exam trap: Kendra vs OpenSearch:

Kendra = document Q&A, natural language (“Where is the IT desk?” → “1st floor”)
OpenSearch = full-text search, log analytics, partial match

⚠️ Exam trap: Textract vs Rekognition text:

Textract = extract structured data from documents (forms, tables, IDs)
Rekognition = detect text in images/videos (signs, banners, license plates)

⚠️ Exam trap: SageMaker vs Managed Services:

SageMaker = build/train/deploy YOUR OWN custom ML models
Managed services = pre-built ML (Rekognition, Comprehend, Forecast)

⚠️ Exam trap: Bedrock vs SageMaker:

Bedrock = use existing Foundation Models (Claude, Llama, Titan)
SageMaker = build and train custom models from scratch

⚠️ Exam trap: Comprehend vs Comprehend Medical:

Comprehend = general NLP (sentiment, entities, topics)
Comprehend Medical = clinical text, PHI detection (HIPAA)

🎯 MASTER SUMMARY: Machine Learning Exam Guide

Part 1: Core Principles

Principle 1: Managed ML vs Custom ML

AWS offers two paths:

Managed services = pre-built ML, no experience needed (Rekognition, Comprehend, etc.)
SageMaker = build your own models, requires ML knowledge

Rule: If a managed service exists for your use case → use it. Custom ML only when needed.

Principle 2: Each Service Has ONE Primary Purpose

Purpose	Service
Image/Video analysis	Rekognition
Speech → Text	Transcribe
Text → Speech	Polly
Translation	Translate
Chatbots	Lex
Contact Center	Connect
Text NLP	Comprehend
Document Q&A	Kendra
Recommendations	Personalize
Document extraction	Textract
Time-series forecast	Forecast
Custom ML	SageMaker
Generative AI	Bedrock

Principle 3: Integration Patterns

Common AWS ML patterns:

Call center: Connect → Lex → Lambda → CRM
Content moderation: Upload → Rekognition → A2I (human review)
Document processing: Upload → Textract → Comprehend → store
Customer analytics: Data → Comprehend (sentiment) → insights

Principle 4: Human-in-the-Loop = A2I

When ML confidence is low, route to human review:

Rekognition content moderation
Textract document review
SageMaker predictions

Part 2: Instant-Answer Table

Question Contains	→ Instant Answer
“image recognition”	Rekognition
“video analysis”	Rekognition
“face detection”	Rekognition
“content moderation” + images	Rekognition
“celebrity recognition”	Rekognition
“speech to text”	Transcribe
“transcribe calls”	Transcribe
“remove PII from audio”	Transcribe (Redaction)
“closed captioning”	Transcribe
“text to speech”	Polly
“applications that talk”	Polly
“translate languages”	Translate
“localize content”	Translate
“chatbot”	Lex
“conversational bot”	Lex
“powers Alexa”	Lex
“call center”	Connect
“contact center”	Connect
“80% cheaper contact”	Connect
“sentiment analysis”	Comprehend
“NLP” + “text insights”	Comprehend
“clinical text” + “PHI”	Comprehend Medical
“physician notes”	Comprehend Medical
“document search” + “Q&A”	Kendra
“natural language search”	Kendra
“incremental learning”	Kendra
“product recommendations”	Personalize
“same as Amazon.com”	Personalize
“personalized marketing”	Personalize
“extract from forms/tables”	Textract
“invoice processing”	Textract
“ID documents”	Textract
“handwriting extraction”	Textract
“demand forecasting”	Forecast
“time-series prediction”	Forecast
“build custom ML model”	SageMaker
“train ML model”	SageMaker
“generative AI”	Bedrock
“foundation models”	Bedrock
“LLM on AWS”	Bedrock
“Claude/Llama/Titan”	Bedrock
“human review ML”	A2I
“manual review when low confidence”	A2I

Part 3: The “CANNOT” / Common Confusions

Confusion	Clarification
Kendra vs OpenSearch	Kendra = document Q&A; OpenSearch = full-text search/logs
Textract vs Rekognition	Textract = forms/tables; Rekognition = text in images
SageMaker vs Bedrock	SageMaker = custom models; Bedrock = use foundation models
Comprehend vs Medical	Comprehend = general; Medical = clinical/PHI
Polly vs Transcribe	Polly = text→speech; Transcribe = speech→text
Lex vs Connect	Lex = chatbot logic; Connect = phone/contact center

🏆 The Golden Rules

Rekognition = images/videos (faces, objects, content moderation)
Transcribe = speech→text (PII redaction, subtitles)
Polly = text→speech (Lexicons, SSML)
Lex = chatbots (powers Alexa, ASR+NLU)
Connect = contact center (80% cheaper)
Comprehend = text NLP (sentiment, entities, topics)
Comprehend Medical = clinical text (PHI, HIPAA)
Kendra = document Q&A (natural language, incremental learning)
Personalize = recommendations (same as Amazon.com)
Textract = document extraction (forms, tables, IDs)
Forecast = time-series (demand planning)
SageMaker = custom ML (build/train/deploy your own)
Bedrock = generative AI (foundation models: Claude, Llama, Titan)
A2I = human review (low confidence → manual review)
Managed service first (only SageMaker when no pre-built option)

Other AWS Services:

Amazon WorkSpaces: managed Desktop as a Service (DaaS) solution to easily provision Windows or Linux desktops. Cloud alternative to managing of on-premise Virtual Desktop Infrastructure (VDI). Scalable to thousands. Integrates with KMS. Pay-as-you-go pricing.

Amazon AppStream 2.0: desktop application streaming service. The application is delivered from within a web browser. Can be configured instance type per application type (CPU, RAM, GPU).

AWS IoT Core: serverless, secure & scalable to billions messages, service that allows easily connect IoT devices to AWS Cloud.

AWS AppSync: store and sync data across mobile and web apps in real-time. Makes use of GraphQL (mobile technology from Facebook). Intergrations with DynamoDB / Lambda.

AWS Ground Station: is a fully managed service that lets you ontrol satellite communications, process data and scale your satellite operations (weather forecasting, surface imaging, videobroadcasting, etc). Provides global network of satellite ground stations nea AWS regions. Allows to download satellite data to AWS VPC within seconds and send it to S3 or EC2 Instances.

Amazon Pinpoint: scalable two-way (outbound/inbound) marketing communications service. Supports email, SMS, push, voice and in-app messaging. Ability to segment and personalize messages with right content to customers. Possibility to receive replies. Scales to billions of messages per day. Use cases: run campaigns by sending marketing, bulk, transactional SMS messages. Stream events (TEXT_SUCCESS, TEXT_DELIVERED) → SNS, Kinesis Data Firehose, CloudWatch Logs. Versus Amazon SNS or Amazon SES: In SNS & SES you manage each message’s audience, content, and delivery schedule. In Pinpoint, you create message templates, delivery schedules, highly-targeted segments, and full campaigns.

Amazon Simple Email Service (SES): fully managed service to send emails securely, globally, and at scale. Allows inbound/outbound emails. Reputation dashboard, performance insights, anti-spam feedback. Statistics: email deliveries, bounces, feedback loop results, email open rates. Supports DKIM and SPF. Flexible IP deployment: shared, dedicated, customer-owned. Send via AWS Console, APIs, or SMTP. Use cases: transactional, marketing, and bulk email communications.

Amazon AppFlow: fully managed integration service to securely transfer data between SaaS applications and AWS. Sources: Salesforce, SAP, Zendesk, Slack, ServiceNow. Destinations: S3, Redshift, or non-AWS (Snowflake, Salesforce). Frequency: schedule, event-driven, or on-demand. Data transformation: filtering and validation. Encrypted over public internet or privately over AWS PrivateLink.

Instance Scheduler on AWS: AWS solution (deployed via CloudFormation, not a service) to automatically start/stop AWS services to reduce costs (up to 70%).

Supports: EC2, EC2 Auto Scaling Groups, RDS instances
Schedules managed in a DynamoDB table
Uses resource tags and Lambda to stop/start instances
Supports cross-account and cross-region resources
Example: stop dev EC2 instances outside business hours

AWS Marketplace digital catalog with thousands of software listings from independent software vendors (third-party).

AWS Data Exchange: find, subscribe to, and use third-party data in the cloud. Data providers publish data products → subscribers consume via S3, API, or Lake Formation. Use cases: financial data, weather, healthcare. No need to build custom ETL pipelines for external data.

AWS Data Pipeline: managed ETL service to process and move data between AWS compute and storage services, and on-premises sources. Defines data-driven workflows (dependencies). Runs on EC2 or EMR. Retries on failure. Legacy — prefer AWS Glue or Step Functions for new workloads.

⚠️ Exam trap: “Data Pipeline” on exam is usually legacy — modern answer is Glue (serverless ETL) or Step Functions (orchestration). But know Data Pipeline exists.

AWS Proton: fully managed delivery service for container and serverless applications. Platform teams create templates → developers deploy using self-service. Manages infrastructure provisioning + CI/CD. Think “Service Catalog for containers/serverless.”

AWS Wavelength: deploy AWS compute/storage at 5G telecom edge locations. Ultra-low latency for mobile devices. Extends VPC to Wavelength Zones. Use cases: real-time gaming, ML inference at edge, AR/VR, connected vehicles.

Amazon ECS Anywhere / EKS Anywhere: run ECS or EKS on on-premises or customer-managed infrastructure.

ECS Anywhere: register external instances with ECS, manage tasks from AWS console
EKS Anywhere: run EKS on your own hardware (VMware vSphere), disconnected from AWS
EKS Distro: same Kubernetes distribution used by EKS — run it yourself, fully self-managed

⚠️ Exam trap: “Run containers on-premises but manage from AWS” → ECS/EKS Anywhere. “Fully self-managed Kubernetes, same as EKS” → EKS Distro.

Amazon Elastic Transcoder: transcode media files (video/audio) stored in S3 into formats needed by consumer devices (phones, tablets, PCs). Pay per transcoding minute. Being replaced by AWS Elemental MediaConvert (more features, same purpose).

AWS License Manager: manage software licenses from vendors (Microsoft, SAP, Oracle). Track license usage, set rules, enforce limits. Integrates with EC2, RDS. Prevent license violations. Shared via AWS RAM across accounts.

Amazon Managed Grafana: fully managed Grafana for operational dashboards and observability. Queries from CloudWatch, Prometheus, X-Ray, Elasticsearch, Timestream. Workspace-based, integrates with IAM Identity Center for access.

Amazon Managed Service for Prometheus: fully managed, serverless Prometheus-compatible monitoring for containers (EKS, ECS). Stores metrics at scale. Query with PromQL. Pairs with Managed Grafana for visualization.

⚠️ Exam trap: “Container monitoring with Prometheus” → Managed Prometheus (metrics) + Managed Grafana (dashboards). NOT CloudWatch Container Insights (different approach).

AWS Audit Manager: continuously audit AWS usage to assess risk and compliance. Maps to frameworks (GDPR, HIPAA, SOC 2, PCI DSS). Collects evidence automatically from CloudTrail, Config, Security Hub. Generates audit-ready reports.

⚠️ Exam trap: “Continuous compliance auditing with evidence collection” → Audit Manager. “Compliance documents/agreements” → AWS Artifact. Different purposes.

Amazon Fraud Detector: fully managed service to identify potentially fraudulent online activities (online payment fraud, fake account creation, etc). Uses ML models trained on your data + Amazon’s fraud detection expertise. No ML experience needed.

AWS Serverless Application Repository: managed repository to deploy and publish serverless applications. Find pre-built Lambda functions and SAM templates. Supports public and private sharing.

Amazon Kinesis Video Streams: securely stream video from devices to AWS for analytics, ML, playback. Use cases: smart home cameras, industrial monitoring, computer vision with Rekognition.

🎯 MASTER SUMMARY: Other AWS Services Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: “Managed Desktop/App” = WorkSpaces vs AppStream

WorkSpaces = full persistent desktop (DaaS) — user gets a complete OS (Windows/Linux)
AppStream 2.0 = single app streaming via browser — no full desktop, just the application
Memory trick: “Do they need a DESKTOP or just an APP?”

The key distinction is WHO manages the campaign logic:

Pinpoint = AWS manages templates, segments, schedules, campaigns → marketing-focused
SNS = YOU manage audience + content per message → notifications/alerts
SES = YOU manage audience + content → transactional/bulk email specifically

Principle 3: Integration Services Fill SaaS ↔ AWS Gaps

AppFlow = SaaS apps (Salesforce, SAP, Slack) ↔ AWS (S3, Redshift) — managed ETL for SaaS
AppSync = GraphQL for mobile/web apps ↔ DynamoDB/Lambda — real-time sync

Principle 4: Instance Scheduler = Solution, Not a Service

Deployed via CloudFormation. Uses DynamoDB + Lambda + tags. Supports cross-account/cross-region. Key for cost optimization questions.

Part 2: Instant-Answer Table

Question Contains	→ Instant Answer
“managed virtual desktop”	WorkSpaces
“DaaS, VDI replacement”	WorkSpaces
“stream desktop application via browser”	AppStream 2.0
“IoT devices to cloud”	IoT Core
“GraphQL, real-time sync”	AppSync
“satellite communications”	Ground Station
“marketing campaigns, segments”	Pinpoint
“two-way SMS/email campaigns”	Pinpoint
“transactional email, DKIM, SPF”	SES
“bulk email at scale”	SES
“transfer data from Salesforce/SAP to S3”	AppFlow
“SaaS integration”	AppFlow
“PrivateLink for SaaS data transfer”	AppFlow
“stop/start EC2 to save costs”	Instance Scheduler
“schedule EC2 on/off business hours”	Instance Scheduler
“third-party software catalog”	Marketplace
“subscribe to third-party data”	Data Exchange
“ETL pipeline, legacy orchestration”	Data Pipeline (prefer Glue/Step Functions)
“platform team templates for containers”	Proton
“5G edge, ultra-low latency mobile”	Wavelength
“run ECS/EKS on-premises”	ECS/EKS Anywhere
“self-managed Kubernetes like EKS”	EKS Distro
“transcode video in S3”	Elastic Transcoder / MediaConvert
“manage software licenses”	License Manager
“Grafana dashboards, observability”	Managed Grafana
“Prometheus container metrics”	Managed Prometheus
“continuous compliance audit”	Audit Manager
“detect online fraud with ML”	Fraud Detector
“pre-built Lambda/SAM templates”	Serverless App Repository
“stream video from devices”	Kinesis Video Streams

Part 3: Common Confusions

Confusion	Clarification
WorkSpaces vs AppStream	WorkSpaces = full desktop; AppStream = one app in browser
Pinpoint vs SNS	Pinpoint = campaigns/segments/templates; SNS = per-message notifications
Pinpoint vs SES	Pinpoint = marketing campaigns; SES = transactional/bulk email
AppFlow vs Glue	AppFlow = SaaS sources; Glue = AWS data sources (S3, RDS, etc.)
AppSync vs API Gateway	AppSync = GraphQL + real-time; API Gateway = REST/HTTP/WebSocket
Instance Scheduler vs ASG	Scheduler = stop/start on schedule; ASG = scale based on demand
Data Pipeline vs Glue	Data Pipeline = legacy ETL (EC2/EMR); Glue = modern serverless ETL
Audit Manager vs Artifact	Audit Manager = continuous audit with evidence; Artifact = download compliance docs
Proton vs Service Catalog	Proton = container/serverless templates; Service Catalog = any CloudFormation product
Managed Prometheus vs CloudWatch	Prometheus = PromQL, container-native; CloudWatch = AWS-native metrics
EKS Anywhere vs EKS Distro	Anywhere = AWS-managed on your infra; Distro = fully self-managed
Elastic Transcoder vs MediaConvert	Transcoder = legacy; MediaConvert = modern replacement (more features)

🏆 The Golden Rules

WorkSpaces = full desktop (DaaS, Windows/Linux, KMS)
AppStream = app streaming (browser-based, no desktop)
Pinpoint = marketing campaigns (segments, templates, schedules)
SES = email service (DKIM, SPF, transactional)
SNS = notifications (pub/sub, no campaign logic)
AppFlow = SaaS ↔ AWS (Salesforce, SAP → S3, Redshift)
AppSync = GraphQL (real-time mobile/web, DynamoDB)
Instance Scheduler = CloudFormation solution (DynamoDB + Lambda + tags)
IoT Core = billions of IoT messages (serverless, secure)
Ground Station = satellites (download to VPC, S3, EC2)
Wavelength = 5G edge (ultra-low latency for mobile devices)
Audit Manager = continuous compliance audit (evidence collection, frameworks)
Proton = container/serverless templates (platform teams → developers)
Data Pipeline = legacy (prefer Glue for ETL, Step Functions for orchestration)
Managed Grafana + Prometheus = container observability stack (PromQL metrics + dashboards)

Backup and Restore:

AWS Backup: fully-managed service to centrally manage and automate backups across AWS services. On-demand and scheduled backups. Supports PITR (Point-in-time Recovery). Retention Periods, Lifecycle Management, Backup Policies. Cross-Region Backup. Cross-Account backup (using AWS Organization).

Supported services: EC2, EBS, S3, RDS (all engines), Aurora, DynamoDB, DocumentDB, Neptune, EFS, FSx (Lustre & Windows), Storage Gateway (Volume Gateway)

Backup Plans:

Backup frequency (every 12 hours, daily, weekly, monthly, cron)
Backup window
Transition to Cold Storage (Never, Days, Weeks, Months, Years)
Retention Period (Always, Days, Weeks, Months, Years)
Tag-based backup policies

AWS Backup Vault Lock:

Enforce WORM (Write Once Read Many) state for all backups
Protects against: accidental/malicious deletes, retention period changes
Even root user cannot delete backups when enabled

⚠️ Exam trap: “Prevent anyone including root from deleting backups” → Backup Vault Lock (WORM). Similar to S3 Object Lock but for AWS Backup.

AWS DataSync: move large amount of data from on-premises to AWS (or between AWS storage services).

Agent-based — DataSync agent installed on-prem, connects via NFS/SMB
Destinations: S3 (any storage class), EFS, FSx (Windows & Lustre)
Replication tasks: scheduled hourly, daily, weekly. Incremental after first full load
Bandwidth throttling available — won’t saturate your network
File permissions and metadata preserved (NFS POSIX, SMB)
Can also do AWS → AWS transfers (e.g., EFS in one region → EFS in another)

⚠️ Exam trap: DataSync = data movement/sync (on-prem ↔ AWS, AWS ↔ AWS). AWS Backup = backup automation across AWS services. DataSync moves files; Backup creates snapshots/backups.

AWS Elastic Disaster Recovery (DRS): quickly and easily recover physical, virtual, and cloud-based servers into AWS.

Continuous block-level replication for your servers
Successor to CloudEndure Disaster Recovery
Protect critical databases, enterprise apps from ransomware attacks
Automatic conversion of servers to run natively on AWS

⚠️ Exam trap: “Continuous replication of servers for DR” → DRS (Elastic Disaster Recovery). “Lift-and-shift migration” → MGN. Both use agents + continuous replication, but DRS = DR (failover/failback), MGN = one-time migration.

AWS Fault Injection Simulator (FIS) — fully managed service for Chaos Engineering on AWS workloads.

Create disruptive events: CPU stress, memory stress, network latency, stop instances, inject API errors, throttle EBS I/O
Supports: EC2, ECS, EKS, RDS, Lambda, IAM (temporary policy errors)
Pre-built experiment templates — use or customize
Safety: stop conditions, rollback actions, IAM permissions boundary

⚠️ Exam trap: “Test resilience by randomly terminating instances” → FIS. “Netflix Simian Army” → inspiration for FIS but not an AWS service.

Disaster Recovery Overview:

Disaster = any event that negatively impacts business continuity or finances. Disaster Recovery (DR) = preparing for and recovering from a disaster.

DR scenarios:

On-premise → On-premise (traditional, very expensive)
On-premise → AWS Cloud (hybrid recovery)
AWS Region A → AWS Region B (cloud-native)

RPO and RTO:

RPO (Recovery Point Objective) — how much data loss you can tolerate (time between last backup and disaster). RTO (Recovery Time Objective) — how much downtime you can tolerate (time between disaster and recovery).

◄─── Data loss ───►◄─── Downtime ──►
                   │
    ●              ⚡              ●
   RPO          Disaster          RTO
(last backup)                  (back online)

Lower RPO = more frequent backups/replication = less data loss = more expensive
Lower RTO = faster recovery = resources pre-provisioned = more expensive

⚠️ Exam trap: RPO = data loss (backward-looking). RTO = downtime (forward-looking). Don’t confuse them — “minimize data loss” → optimize RPO. “Minimize downtime” → optimize RTO.

Disaster Recovery Strategies:

Four strategies, ordered from slowest/cheapest to fastest/most expensive:

Slower RTO ◄─────────────────────────────────────► Faster RTO
Cheaper                                            Expensive

 ┌──────────┬──────────┬──────────┬──────────┐
 │ Backup & │  Pilot   │  Warm    │ Multi    │
 │ Restore  │  Light   │ Standby  │ Site     │
 └──────────┴──────────┴──────────┴──────────┘
   Hours        10s min     Minutes    Seconds

1. Backup & Restore (High RPO/RTO, cheapest)

Data backed up to S3 (via Storage Gateway, Snowball) + EBS/RDS snapshots
On disaster: restore from backups, recreate infra
Slowest recovery but lowest cost

On-prem ──► Storage Gateway / Snowball ──► S3 ──► Glacier (lifecycle)
AWS:  EBS / RDS / Redshift ──► Scheduled Snapshots
Recovery: Snapshots ──► AMI ──► EC2 + RDS restore

2. Pilot Light (Faster than backup)

Critical core always running in cloud (e.g., RDS replication running, EC2 stopped)
On disaster: start EC2, scale up, Route 53 failover
Similar to Backup & Restore but DB is already replicated

On-prem (active)          AWS Cloud
┌────────────┐            ┌────────────────────┐
│ App Server │            │ EC2 (NOT running)  │
│ Primary DB │──repl──►   │ RDS (running)      │
└────────────┘            └────────────────────┘
                          Route 53 (failover)

3. Warm Standby (Minutes RTO)

Full system running at minimum size in AWS
On disaster: scale up to production load (ASG scales out)
More expensive than Pilot Light — everything is running, just small

On-prem (active)          AWS Cloud
┌────────────┐            ┌──────────────────────┐
│ App Server │            │ EC2 ASG (minimum)    │
│ Primary DB │──repl──►   │ RDS Secondary        │
└────────────┘            └──────────────────────┘
                          Route 53 → scale up on failover

4. Multi Site / Hot Site (Seconds RTO, most expensive)

Full production scale running in both locations simultaneously
Route 53 active-active routing
Instant failover — both sites serve traffic

On-prem (active)          AWS Cloud (active)
┌────────────┐            ┌──────────────────────┐
│ App Server │◄──R53──►   │ ELB → EC2 ASG (full) │
│ Primary DB │──repl──►   │ RDS Secondary         │
└────────────┘            └──────────────────────┘
   Route 53 active-active (or Aurora Global)

All AWS Multi Region = same as Multi Site but both sides are AWS:

Route 53 active-active → ELB → EC2 ASG → Aurora Global (primary ↔ secondary)

Comparison Table:

Strategy	RTO	RPO	Cost	What’s Running in AWS
Backup & Restore	Hours	High	💰	Nothing (just backups in S3)
Pilot Light	10s of min	Medium	💰💰	DB only (EC2 stopped)
Warm Standby	Minutes	Low	💰💰💰	Everything at minimum size
Multi Site / Hot Site	Seconds	Very low	💰💰💰💰	Everything at full production

⚠️ Exam trap: Pilot Light vs Warm Standby — both have DB replicating. The difference: Pilot Light has EC2 stopped (need to start), Warm Standby has EC2 running at minimum (need to scale up).

⚠️ Exam trap: “Cheapest DR” → Backup & Restore. “Lowest RTO/RPO” → Multi Site. “Balance cost and recovery” → Warm Standby.

⚠️ Exam trap: “Critical infrastructure up and running” = Pilot Light (only critical = DB). “Everything running at minimum” = Warm Standby. “Nothing running” = Backup & Restore. Key word is “critical” → Pilot Light.

Disaster Recovery Tips:

Backup: EBS Snapshots, RDS automated backups, S3/S3-IA/Glacier with lifecycle, CRR, Snowball/Storage Gateway from on-prem
High Availability: Route 53 DNS failover, RDS Multi-AZ, ElastiCache Multi-AZ, EFS, S3, VPN as backup for DX
Replication: RDS Cross-Region Replication, Aurora Global Databases, on-prem → RDS replication, Storage Gateway
Automation: CloudFormation/Elastic Beanstalk to recreate environments, CloudWatch Alarms → EC2 recover/reboot, Lambda for custom automation
Chaos: Netflix “Simian Army” (randomly terminating EC2 to test resilience)

DMS – Database Migration Service:

AWS DMS — quickly and securely migrate databases to AWS.

Resilient, self-healing
Source database remains available during migration
Runs on an EC2 instance (you must create it)
Supports Continuous Data Replication using CDC (Change Data Capture)
Multi-AZ Deployment: synchronous standby replica in different AZ → data redundancy, no I/O freezes, minimizes latency spikes

Migration types:

Homogeneous: same engine (e.g., Oracle → Oracle) — DMS only
Heterogeneous: different engine (e.g., SQL Server → Aurora) — requires AWS SCT first

Homogeneous:   Source DB ──► EC2 (DMS) ──► Target DB  (same engine)
Heterogeneous: Source DB ──► SCT (schema) + DMS (data) ──► Target DB  (different engine)

Continuous Replication (CDC):

Corporate DC                        AWS Cloud (VPC)
┌──────────────┐                    ┌─────────────────────────────┐
│ Oracle DB    │── data migration ─►│ DMS Replication Instance    │
│ (source)     │                    │ (Full load + CDC)           │
│              │                    │    Public Subnet            │
│ Server with  │                    │         │                   │
│ AWS SCT      │── schema convert ─►│         ▼                   │
│              │                    │ RDS MySQL (target)          │
└──────────────┘                    │    Private Subnet           │
                                    └─────────────────────────────┘

DMS Sources: On-prem DBs (Oracle, SQL Server, MySQL, MariaDB, PostgreSQL, MongoDB, SAP, DB2), Azure SQL, RDS (all incl. Aurora), S3, DocumentDB DMS Targets: On-prem DBs, RDS, Redshift, DynamoDB, S3, OpenSearch, Kinesis Data Streams, Apache Kafka, DocumentDB, Neptune, Redis, Babelfish

AWS SCT (Schema Conversion Tool):

Converts database schema from one engine to another
OLTP: SQL Server/Oracle → MySQL, PostgreSQL, Aurora
OLAP: Teradata/Oracle → Amazon Redshift
NOT needed for same engine migration (e.g., on-prem PostgreSQL → RDS PostgreSQL)
Prefer compute-intensive instances for SCT

⚠️ Exam trap: “Migrate database with minimal downtime, source stays available” → DMS. “Different DB engines” → DMS + SCT. “Same engine, different platform” (e.g., on-prem PostgreSQL → RDS PostgreSQL) → DMS only, no SCT needed.

⚠️ Exam trap: SCT converts schema, DMS migrates data — never reversed. Heterogeneous migration order: SCT first (convert schema) → DMS second (move data into converted schema).

RDS & Aurora Migrations:

RDS MySQL → Aurora MySQL:

Option 1: DB Snapshot from RDS MySQL → restore as Aurora MySQL DB
Option 2: Create Aurora Read Replica from RDS MySQL → when replication lag = 0, promote to own cluster (takes time, costs $)

External MySQL → Aurora MySQL:

Option 1: Percona XtraBackup → S3 → create Aurora MySQL from S3
Option 2: mysqldump utility → migrate into Aurora (slower than S3 method)

⚠️ Exam trap: RDS → Aurora = snapshot (native, cheapest, simplest). S3 dump path is for external/on-prem MySQL → Aurora. If both source and target are inside AWS, snapshot is always the best and most cost-effective option.

RDS PostgreSQL → Aurora PostgreSQL:

Option 1: DB Snapshot from RDS PostgreSQL → restore as Aurora PostgreSQL
Option 2: Create Aurora Read Replica → promote when lag = 0

External PostgreSQL → Aurora PostgreSQL:

Create backup → S3 → import using aws_s3 Aurora extension

Both databases running? → Use DMS for continuous replication

The 7 R’s of Cloud Migration:

Strategy	Description	AWS Service	Example
Retire	Turn off what you don’t need	—	Kill legacy apps (save up to 20%)
Retain	Keep on-prem for now	—	Compliance, unresolved dependencies
Relocate	Move to cloud version as-is	VMware Cloud on AWS	VMware SDDC → VMware Cloud on AWS
Rehosting (Lift & Shift)	Move as-is to AWS, no optimizations	MGN	VM → EC2 (save ~30%)
Replatforming (Lift & Reshape)	Minor cloud optimizations, no core changes	DMS, Beanstalk	MySQL → RDS MySQL
Repurchasing (Drop & Shop)	Switch to SaaS product	—	CRM → Salesforce, HR → Workday
Refactoring (Re-architect)	Rebuild cloud-native	Lambda, DynamoDB	Monolith → microservices

⚠️ Exam trap: “Lift-and-shift” = Rehosting (MGN). “Move to RDS without code changes” = Replatforming. “Rewrite as serverless” = Refactoring. “Move VMware SDDC to VMware Cloud on AWS” = Relocate. Don’t confuse Rehosting with Replatforming — rehosting changes nothing, replatforming makes small optimizations.

⚠️ Exam trap: The course says 7 R’s (includes Relocate). Some sources say 6 R’s (no Relocate). Know both — exam may reference either count.

On-Premises Strategy with AWS:

Download Amazon Linux 2 AMI as VM (.iso) — run on VMware, KVM, VirtualBox, Hyper-V
VM Import/Export — migrate existing VMs into EC2, create DR for on-prem VMs, export VMs back from EC2
AWS Application Discovery Service — gather info about on-prem servers for migration planning
- Agentless Discovery (Connector): VM inventory, config, performance (CPU, memory, disk)
- Agent-based Discovery (Agent): system config, performance, running processes, network connections
- Results viewable in AWS Migration Hub
AWS Migration Evaluator — build a data-driven business case for migration to AWS
- Install Agentless Collector → broad-based discovery of on-prem footprint
- Analyzes: current state, server dependencies → defines target state → develops migration plan
- Provides clear baseline of what’s running today
AWS Migration Hub — central location to track assessment, planning, and migration progress
- Migration Hub Orchestrator — pre-built templates to save time migrating enterprise apps (SAP, SQL Server)
- Receives status updates from MGN and DMS
AWS DMS — replicate on-prem → AWS, AWS → AWS, AWS → on-prem
AWS SMS (Server Migration Service) — incremental replication of on-prem live servers to AWS (legacy, replaced by MGN)

AWS Application Migration Service (MGN):

AWS MGN — the “AWS evolution” of CloudEndure Migration, replacing AWS Server Migration Service (SMS).

Lift-and-shift (rehost) solution for migrating applications to AWS
Converts physical, virtual, and cloud-based servers to run natively on AWS
Supports wide range of platforms, OS, and databases
Minimal downtime, reduced costs
Example: migrating on-prem Oracle DB to AWS EC2 instance (not RDS — that would be DMS/replatform)

Corporate DC / Any Cloud                    AWS Cloud
┌──────────────────────┐                    ┌───────────────────────────────┐
│ OS    ┐              │                    │  Staging          Production │
│ Apps  ├─► Replication │── continuous ──►   │  Low-cost EC2  → Target EC2  │
│ DB    │    Agent      │   replication      │  & EBS volumes   & EBS vols  │
│ Disks ┘              │                    │         (cutover) ──►        │
└──────────────────────┘                    └───────────────────────────────┘

⚠️ Exam trap: “Lift-and-shift to AWS, minimal downtime” → AWS MGN (Application Migration Service). NOT DMS (that’s for databases only). NOT SMS (deprecated, replaced by MGN).

VMware Cloud on AWS:

VMware Cloud on AWS — extend VMware-based on-prem data centers to AWS while keeping VMware Cloud software.

Use cases: migrate vSphere workloads, run across hybrid environments, DR strategy
Runs vSphere, vSAN, NSX on dedicated AWS hardware
Access AWS services: EC2, S3, FSx, RDS, Redshift, Direct Connect

Transferring Large Data into AWS:

Example: 200 TB, 100 Mbps internet connection:

Method	Setup Time	Transfer Time	Notes
Internet / VPN	Immediate	~185 days	200TB × 8 / 100 Mbps
Direct Connect 1 Gbps	>1 month	~18.5 days	Faster but long setup
Snowball	~1 week	~1 week	End-to-end, can combine with DMS

For ongoing replication: Site-to-Site VPN or DX with DMS or DataSync

⚠️ Exam trap: “Transfer 200 TB quickly” → Snowball (~1 week). NOT internet (185 days). NOT DX (setup alone >1 month). For ongoing sync after initial transfer → DMS or DataSync.

🎯 MASTER SUMMARY: Disaster Recovery & Migration Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: RPO and RTO Are Cost Tradeoffs

Lower RPO/RTO = more money. RPO = how much data you can lose (backward). RTO = how much downtime you can accept (forward). Every DR strategy is a position on the cost ↔ speed spectrum. If the question says “regardless of cost” → Multi-Site. If “cheapest” → Backup & Restore.

Principle 2: DR Strategies Are a Spectrum of “What’s Running”

The 4 strategies differ by how much infrastructure is pre-provisioned in the DR region:

Nothing running → Backup & Restore
Critical DB only → Pilot Light
Everything at min → Warm Standby
Full production → Multi-Site

Key insight: You don’t need to memorize RTO numbers. Just ask: “How much needs to start/scale on failover?” More startup = more time = higher RTO.

Principle 3: Migration Tool = What You’re Moving

Each tool moves a specific thing. The exam tests whether you can pick the right one:

Moving databases → DMS (+ SCT if different engine)
Moving servers/VMs → MGN (lift-and-shift)
Moving files/data → DataSync
Moving bulk physical data → Snowball/Snowcone
DR failover of servers → DRS (Elastic Disaster Recovery)

Principle 4: SCT + DMS Have Distinct, Non-Overlapping Roles

SCT converts schema (structure). DMS migrates data (content). They’re never reversed. If engines are the same → no SCT needed. If engines differ → SCT first, DMS second.

Derivation trick: Schema = blueprint of the house. Data = furniture. You must build the house (SCT) before moving furniture in (DMS).

Principle 5: “Same Engine” = Simpler Migration Path

Same engine (homogeneous) eliminates complexity everywhere:

DMS: no SCT needed
RDS → Aurora: simple snapshot restore (no S3, no DMS needed)
On-prem PostgreSQL → RDS PostgreSQL: DMS only

Different engine (heterogeneous) always adds an extra step (SCT, conversion).

Principle 6: RDS → Aurora Is a Special Case (Stays Inside AWS)

When both source and target are AWS services, use native AWS operations (snapshot, read replica promotion). S3 dump/import path is for external databases entering AWS. The exam tests this: “most cost-effective RDS → Aurora” = snapshot, NOT S3.

Principle 7: DMS Runs on EC2 (You Manage the Instance)

DMS is not serverless — it runs on a replication instance (EC2). You choose instance type. Multi-AZ deployment gives HA for the replication instance itself. The source DB stays available during migration (non-disruptive).

Principle 8: AWS Backup ≠ DataSync ≠ S3 Lifecycle

Three different things that sound similar:

AWS Backup = centralized backup automation across AWS services (snapshots, PITR)
DataSync = move/sync files between on-prem and AWS (or AWS-to-AWS)
S3 Lifecycle = transition S3 objects between storage classes

Principle 9: MGN Replaced Both CloudEndure AND SMS

Historical evolution: CloudEndure Migration + AWS SMS → AWS MGN. If the exam mentions CloudEndure or SMS, the modern answer is MGN. Similarly, CloudEndure Disaster Recovery → AWS DRS.

Principle 10: Snowball Beats Internet for Large One-Time Transfers

Physics: shipping a device is faster than transferring hundreds of TB over a wire. Rule of thumb: if transfer calculation shows weeks/months → Snowball wins. Direct Connect needs >1 month setup, so for urgent large transfers it’s too slow.

Part 2: Decision Trees (Follow Keywords → Find Answer)

DR Strategy Decision Tree

What does the question ask for?
│
├─ "Cheapest" / "lowest cost" / "budget"
│  └─► Backup & Restore
│
├─ "Critical infrastructure running" / "core running"
│  └─► Pilot Light
│
├─ "Everything running at minimum" / "scaled down"
│  └─► Warm Standby
│
├─ "Lowest RTO" / "fastest recovery" / "regardless of cost" / "active-active"
│  └─► Multi-Site
│
└─ "Balance cost and recovery"
   └─► Warm Standby

Migration Tool Decision Tree

What are you migrating?
│
├─ DATABASE
│  ├─ Same engine? → DMS only
│  ├─ Different engine? → SCT + DMS
│  ├─ RDS → Aurora (same family)? → Snapshot restore
│  └─ External DB → Aurora? → S3 import (Percona/mysqldump) or DMS
│
├─ SERVERS / VMs / APPLICATIONS
│  ├─ Migration (one-time move)? → MGN
│  └─ DR (ongoing failover)? → DRS
│
├─ FILES / DATA
│  ├─ On-prem ↔ AWS sync? → DataSync
│  └─ Bulk physical transfer? → Snowball
│
└─ BACKUP MANAGEMENT
   └─ Centralized backup across services? → AWS Backup

The CANNOT List

You CANNOT…	Why
Use SCT for data migration	SCT only converts schema
Use DMS for schema conversion	DMS only moves data
Skip SCT for heterogeneous migration	Different engines need schema conversion
Use S3 dump for RDS → Aurora (cost-effectively)	Snapshot is native and free
Use MGN for database migration	MGN migrates servers, not databases
Delete Backup Vault Lock backups (even root)	WORM protection
Use DataSync without an agent (on-prem)	Agent required for on-prem source
Set up Direct Connect in < 1 month	Physical provisioning required

Part 3: Scenario Pattern Recognition

Pattern: “Migrate on-premises Oracle to Aurora PostgreSQL”

Keywords: different engines, Oracle, Aurora Answer: SCT (convert schema) + DMS (migrate data) Why: Heterogeneous migration — Oracle ≠ PostgreSQL, so schema conversion required first.

Pattern: “Migrate RDS MySQL to Aurora MySQL, most cost-effective”

Keywords: RDS → Aurora, same engine family, cost-effective Answer: Create snapshot from RDS → restore as Aurora Why: Native AWS operation, no intermediate storage cost. S3 path is for external databases.

Pattern: “Lift-and-shift on-premises servers to AWS with minimal downtime”

Keywords: lift-and-shift, servers, minimal downtime Answer: AWS MGN (Application Migration Service) Why: MGN does continuous replication of servers → cutover with minimal downtime. NOT DMS (databases only).

Pattern: “DR with critical infrastructure always running”

Keywords: critical, running, DR Answer: Pilot Light Why: Only critical components (DB) always on. EC2 stopped until disaster. “Critical” is the keyword.

Pattern: “Fastest possible disaster recovery”

Keywords: lowest RTO, fastest, regardless of cost Answer: Multi-Site / Hot Site Why: Full production on both sides, active-active routing = seconds RTO.

Pattern: “Transfer 200 TB of data to AWS quickly”

Keywords: large data, TB, quickly Answer: AWS Snowball Why: Internet = months, DX = weeks + setup time. Snowball = ~1 week end-to-end.

Pattern: “Centrally manage backups across RDS, DynamoDB, EFS, EBS”

Keywords: centrally manage, backups, multiple services Answer: AWS Backup Why: Only service that orchestrates backups across all these services. S3 Lifecycle only manages S3 objects.

Pattern: “Prevent backup deletion even by root user”

Keywords: prevent deletion, root, immutable, compliance Answer: AWS Backup Vault Lock (WORM) Why: WORM = Write Once Read Many. Even root can’t delete. Similar to S3 Object Lock.

Pattern: “Migrate on-prem PostgreSQL to RDS PostgreSQL”

Keywords: same engine, different platform Answer: DMS only (no SCT needed) Why: Same engine = homogeneous migration. SCT only needed when engines differ.

Pattern: “Ongoing replication after initial database migration”

Keywords: continuous, ongoing, replication, CDC Answer: DMS with CDC (Change Data Capture) Why: DMS supports continuous replication, not just one-time migration.

Pattern: “Gather information about on-prem servers before migration”

Keywords: discovery, planning, inventory, on-premises Answer: AWS Application Discovery Service → Migration Hub Why: Agentless (VM inventory) or Agent-based (processes, network). Results viewed in Migration Hub.

Pattern: “DR for servers with continuous block-level replication”

Keywords: DR, servers, continuous replication, failover/failback Answer: AWS DRS (Elastic Disaster Recovery) Why: DRS = ongoing DR with failover. MGN = one-time migration. Both use continuous replication but different purpose.

Pattern: “Move large files from on-prem NFS to S3/EFS”

Keywords: files, on-prem, NFS, SMB, sync Answer: AWS DataSync Why: Agent-based, preserves file permissions, incremental sync. Not DMS (databases) or Snowball (physical).

Pattern: “Extend VMware environment to AWS”

Keywords: VMware, vSphere, hybrid, extend Answer: VMware Cloud on AWS Why: Runs vSphere/vSAN/NSX on dedicated AWS hardware. Keep VMware tools, access AWS services.

Pattern: “Test application resilience by injecting faults”

Keywords: chaos, fault injection, resilience, stress test Answer: AWS FIS (Fault Injection Simulator) Why: Managed chaos engineering — CPU stress, stop instances, API errors. Pre-built templates.

Pattern: “Build a business case for migration to AWS”

Keywords: business case, cost analysis, baseline, current state Answer: AWS Migration Evaluator Why: Agentless Collector discovers on-prem footprint → analyzes → builds data-driven migration plan. NOT Application Discovery Service (that discovers servers, not costs).

Pattern: “Track migration progress across multiple services”

Keywords: track, central dashboard, migration status, MGN + DMS Answer: AWS Migration Hub (+ Orchestrator for enterprise app templates) Why: Central location aggregating status from MGN and DMS. Orchestrator has pre-built templates for SAP, SQL Server.

Migration Services Comparison

Service	Migrates	Direction	Key Feature
DMS	Databases	Any direction	CDC, source stays available
SCT	DB Schema	N/A (conversion)	Heterogeneous engine conversion
MGN	Servers/VMs/Apps	To AWS	Lift-and-shift, replaces SMS
DRS	Servers (DR)	To AWS	Failover/failback, replaces CloudEndure DR
DataSync	Files/Data	On-prem ↔ AWS, AWS ↔ AWS	Agent-based, incremental
Snowball	Bulk data	Physical shipping	Large one-time transfers
AWS Backup	Backups	Within AWS	Centralized backup management
Migration Evaluator	Business case	Assessment	Data-driven cost analysis
Migration Hub	Tracking	Central dashboard	Tracks MGN + DMS progress
App Discovery	Server inventory	On-prem → AWS	Agentless or agent-based

DR Strategy Quick Compare

	Backup & Restore	Pilot Light	Warm Standby	Multi-Site
RTO	Hours	10s of min	Minutes	Seconds
Cost	💰	💰💰	💰💰💰	💰💰💰💰
DB	Snapshots only	Running	Running	Running
App servers	Nothing	Stopped	Min capacity	Full prod
Route 53	Manual update	Failover	Failover	Active-active
On failover	Restore everything	Start EC2, scale	Scale up	Already active

RDS/Aurora Migration Paths

From	To	Best Method
RDS MySQL	Aurora MySQL	Snapshot restore
RDS PostgreSQL	Aurora PostgreSQL	Snapshot restore
External MySQL	Aurora MySQL	Percona XtraBackup → S3
External PostgreSQL	Aurora PostgreSQL	Backup → S3 → aws_s3 extension
Any DB (ongoing)	Any target	DMS with CDC
Different engine	Different engine	SCT + DMS

Legacy → Modern Service Mapping

Legacy	Modern Replacement
CloudEndure Migration	AWS MGN
AWS SMS (Server Migration)	AWS MGN
CloudEndure Disaster Recovery	AWS DRS

Part 5: Ultimate Instant-Answer Table

Question Contains	→ Instant Answer
“Lift-and-shift” / “rehost”	MGN
“Database migration”	DMS
“Different DB engines” / “heterogeneous”	SCT + DMS
“Same engine, different platform”	DMS only (no SCT)
“Schema conversion”	SCT
“RDS → Aurora, cost-effective”	Snapshot restore
“External MySQL → Aurora”	S3 (Percona XtraBackup)
“Continuous DB replication” / “CDC”	DMS
“DR, continuous block replication”	DRS
“Cheapest DR”	Backup & Restore
“Critical infrastructure running”	Pilot Light
“Everything running at minimum”	Warm Standby
“Lowest RTO, regardless of cost”	Multi-Site
“Active-active DR”	Multi-Site
“Centralized backup automation”	AWS Backup
“Prevent backup deletion by root”	Backup Vault Lock (WORM)
“Move files on-prem ↔ AWS”	DataSync
“Transfer 200 TB quickly”	Snowball
“Discover on-prem servers for migration”	Application Discovery Service
“Build business case for migration”	Migration Evaluator
“Migration planning and tracking”	Migration Hub
“Pre-built migration templates (SAP, SQL Server)”	Migration Hub Orchestrator
“Chaos engineering” / “fault injection”	FIS
“Extend VMware to AWS” / “Relocate VMware”	VMware Cloud on AWS
“Source DB stays available during migration”	DMS
“Migrate VMs to EC2”	VM Import/Export or MGN
“Replace CloudEndure Migration”	MGN
“Replace SMS”	MGN
“Replace CloudEndure DR”	DRS
“Backup across RDS, DynamoDB, EFS, EBS”	AWS Backup
“PITR (Point-in-time Recovery)”	AWS Backup
“Minimize data loss”	Optimize RPO
“Minimize downtime”	Optimize RTO
“Ongoing sync after initial transfer”	DataSync or DMS
“Replatform” / “minor optimizations”	DMS (e.g., MySQL → RDS MySQL)
“Refactor” / “re-architect”	Serverless / cloud-native rebuild

Part 6: Elimination Checklist

Choosing a DR Strategy

□ Is cost the primary concern?
  → Yes = Backup & Restore
  → No = continue
□ Does it mention "critical" infrastructure running?
  → Yes = Pilot Light
  → No = continue
□ Does it say "everything running" at minimum/scaled down?
  → Yes = Warm Standby
  → No = continue
□ Does it say "fastest" / "lowest RTO" / "active-active"?
  → Yes = Multi-Site

Choosing a Migration Tool

□ Are you migrating a DATABASE?
  → Yes: Same engine? → DMS only
  → Yes: Different engine? → SCT + DMS
  → Yes: RDS → Aurora (same family)? → Snapshot
  → No = continue
□ Are you migrating SERVERS / VMs / APPS?
  → For migration (one-time)? → MGN
  → For DR (ongoing failover)? → DRS
□ Are you moving FILES / DATA?
  → On-prem ↔ AWS sync? → DataSync
  → Bulk physical? → Snowball
□ Are you managing BACKUPS?
  → Across AWS services? → AWS Backup

Is SCT Needed?

□ Are source and target DB engines DIFFERENT?
  → Yes = SCT + DMS
  → No (same engine) = DMS only, NO SCT

🏆 The Golden Rules

RPO = data loss, RTO = downtime (backward vs forward from disaster)
More money = faster recovery (the entire DR spectrum is a cost tradeoff)
“Critical running” = Pilot Light (not Warm Standby, not Backup & Restore)
SCT = schema, DMS = data (never reversed, SCT always first)
Same engine = no SCT (homogeneous migration skips schema conversion)
RDS → Aurora = snapshot (native, cheapest — S3 path is for external DBs)
Servers → MGN, Databases → DMS (never confuse what each tool migrates)
MGN replaced SMS AND CloudEndure Migration (always pick MGN for lift-and-shift)
DRS replaced CloudEndure DR (always pick DRS for disaster recovery of servers)
Backup Vault Lock = even root can’t delete (WORM, like S3 Object Lock)
DataSync = files, AWS Backup = snapshots (different mechanisms, different purpose)
Snowball wins for large one-time transfers (faster than internet or DX when > 100 TB)
DMS source stays available (non-disruptive migration — key selling point)
Application Discovery → Migration Evaluator → Migration Hub (discover → build business case → track)
7 R’s: Rehost (MGN) ≠ Replatform (DMS) ≠ Relocate (VMware) ≠ Refactor (rebuild) (know which R matches which tool)

🎯 CROSS-TOPIC DECISION TREES

These cut across multiple MASTER SUMMARY sections — use when the question doesn’t clearly fit one topic.

Decision Tree 1: “Data Needs to Move”

What kind of data is moving?
│
├─► DATABASE
│   ├─ Same engine? → DMS only
│   ├─ Different engine? → SCT + DMS
│   ├─ RDS → Aurora (same family)? → Snapshot restore
│   └─ External MySQL → Aurora? → Percona XtraBackup → S3
│
├─► FILES / OBJECTS
│   ├─ Network OK (< 1 week)?
│   │   ├─ One-time / scheduled sync → DataSync
│   │   ├─ Ongoing hybrid access → Storage Gateway
│   │   └─ FTP/SFTP for external users → Transfer Family
│   └─ Network bad (> 1 week)?
│       ├─ < 14 TB → Snowcone
│       └─ > 14 TB → Snowball Edge
│
├─► SERVERS / VMs
│   ├─ Migrate to AWS (one-time) → MGN (lift-and-shift)
│   └─ DR failover/failback → DRS
│
└─► CROSS-REGION / CROSS-ACCOUNT within AWS
    ├─ S3 → S3 → S3 Replication (CRR/SRR)
    ├─ S3 → EFS/FSx → DataSync (no agent needed)
    ├─ RDS/Aurora → Read Replica → promote
    ├─ DynamoDB → Global Tables
    └─ EBS → Snapshots → copy to target region

Decision Tree 2: “Real-Time Processing”

What needs to happen in real-time?
│
├─► STREAMING DATA (continuous, ordered)
│   ├─ Need ordering + replay? → Kinesis Data Streams
│   ├─ Need delivery to S3/Redshift/OpenSearch? → Kinesis Firehose
│   ├─ Need SQL on streams? → Kinesis Data Analytics
│   └─ Need Apache Kafka compatible? → Amazon MSK
│
├─► EVENT-DRIVEN (discrete events, react)
│   ├─ AWS service state change? → EventBridge
│   ├─ Metric threshold crossed? → CloudWatch Alarm
│   ├─ Message queue (decouple)? → SQS
│   ├─ Fan-out to many? → SNS (or SNS + SQS)
│   └─ Orchestrate steps? → Step Functions
│
├─► LOG PROCESSING
│   ├─ Real-time → CloudWatch Subscription Filters
│   ├─ Near real-time to S3 → Firehose
│   └─ Batch/archive → S3 Export (up to 12h delay)
│
└─► API / REQUEST PROCESSING
    ├─ Sync (immediate response) → Lambda + API Gateway
    ├─ Async (fire-and-forget) → Lambda + SQS/SNS
    └─ Long-running → Step Functions / ECS tasks

Decision Tree 3: “Search / Query Data”

What kind of search/query?
│
├─► FULL-TEXT SEARCH (partial match, any field)
│   └─► OpenSearch
│       Pattern: DynamoDB (storage) + OpenSearch (search)
│
├─► STRUCTURED QUERIES (SQL)
│   ├─ On data in S3? → Athena (serverless, pay-per-query)
│   ├─ On data warehouse? → Redshift
│   ├─ On CloudTrail logs in S3? → Athena
│   └─ On relational data? → RDS / Aurora
│
├─► KEY-VALUE LOOKUP (by primary key)
│   └─► DynamoDB (single-digit ms)
│
├─► LOG SEARCH
│   ├─ CloudWatch Logs → Logs Insights
│   ├─ Custom logs at scale → OpenSearch
│   └─ VPC traffic → VPC Flow Logs + Athena
│
└─► WHO DID WHAT (audit)
    └─► CloudTrail → S3 → Athena

Decision Tree 4: “Speed Up / Reduce Latency”

What needs to be faster?
│
├─► CONTENT DELIVERY (static/dynamic to users)
│   ├─ Global users, cacheable → CloudFront
│   ├─ Global users, TCP/UDP (gaming, IoT) → Global Accelerator
│   └─ Specific geo + legal needs → CloudFront + Geo Restriction
│
├─► DATABASE READS
│   ├─ Same queries repeated → ElastiCache (Redis/Memcached)
│   ├─ Read-heavy RDS → Read Replicas (up to 15 for Aurora)
│   ├─ DynamoDB reads → DAX (microsecond cache)
│   └─ Global reads → DynamoDB Global Tables / Aurora Global DB
│
├─► API RESPONSES
│   ├─ API Gateway → enable caching
│   ├─ Lambda cold starts → Provisioned Concurrency
│   └─ Lambda + RDS → RDS Proxy (connection pooling)
│
├─► EC2 LAUNCH / BOOT TIME
│   ├─ Static components → Golden AMI (pre-baked)
│   ├─ Dynamic config → User Data scripts
│   ├─ Both → Hybrid (Golden AMI + User Data)
│   └─ EBS volumes → enable EBS Fast Snapshot Restore
│
└─► NETWORK / DATA TRANSFER
    ├─ On-prem ↔ AWS → Direct Connect (dedicated)
    ├─ Backup DX path → Site-to-Site VPN
    ├─ EC2 ↔ EC2 same AZ → Placement Group (cluster)
    └─ HPC storage → FSx for Lustre

Decision Tree 5: “Secure This”

What needs securing?
│
├─► DATA AT REST
│   ├─ S3 → SSE-S3, SSE-KMS, or SSE-C
│   ├─ EBS → KMS encryption
│   ├─ RDS/Aurora → KMS (enable at creation)
│   ├─ DynamoDB → KMS (AWS owned or customer managed)
│   └─ Secrets → Secrets Manager (rotation) or SSM Parameter Store
│
├─► DATA IN TRANSIT
│   ├─ HTTPS everywhere → ACM certificates
│   ├─ S3 → bucket policy with aws:SecureTransport
│   └─ VPN / DX → encrypted by default
│
├─► NETWORK
│   ├─ Instance level → Security Groups (stateful)
│   ├─ Subnet level → NACLs (stateless)
│   ├─ VPC level → Network Firewall (L3-L7)
│   ├─ Web apps → WAF (L7, CloudFront/ALB/API GW)
│   └─ DDoS → Shield (Standard free, Advanced paid)
│
├─► ACCESS CONTROL
│   ├─ "Who can access AWS resources" → IAM Policies
│   ├─ "Org-wide guardrails" → SCPs
│   ├─ "Cross-account" → Resource Policy or IAM Role
│   ├─ "Temporary credentials" → STS AssumeRole
│   └─ "External identity" → Cognito / SSO (IAM Identity Center)
│
└─► AUDIT / COMPLIANCE
    ├─ "Who did what" → CloudTrail
    ├─ "Is it compliant" → Config
    ├─ "Automated compliance audit" → Audit Manager
    └─ "Security findings dashboard" → Security Hub

Cross-Topic Instant-Answer Table

Scenario Keywords	→ Answer	Topic Area
“Reduce boot time” + “static + dynamic”	Golden AMI + User Data	EC2/Deployment
“Search any field” / “partial text”	OpenSearch (not DynamoDB)	Database
“Query S3 data with SQL”	Athena	Database/Analytics
“React to S3 upload”	S3 Event → Lambda or EventBridge	Serverless
“Decouple microservices”	SQS (or SNS for fan-out)	Messaging
“Global low-latency DB”	DynamoDB Global Tables	Database
“Global low-latency SQL”	Aurora Global Database	Database
“Cache DB queries” (relational)	ElastiCache	Database
“Cache DB queries” (DynamoDB)	DAX	Database
“Cache API responses”	API Gateway Caching	Serverless
“Migrate DB, no downtime”	DMS with CDC	DR/Migration
“Move servers to AWS”	MGN	DR/Migration
“Multi-account security baseline”	Control Tower + SCPs	Security
“Central log analysis”	CloudWatch + Subscription Filters	Monitoring
“Cost per project/team”	Cost Allocation Tags	Billing
“Prevent action org-wide”	SCP (not Config — Config only detects)	Security
“Auto-fix non-compliant”	Config + SSM Automation	Monitoring
“Encrypt at rest, auto-rotate key”	KMS with automatic rotation	Security
“Share resources cross-account”	AWS RAM	IAM/Networking
“DNS failover”	Route 53 Failover routing + Health Check	Route 53

AWS Cloud Practitioner certificate:

https://www.w3schools.com/aws/aws_quiz.php

https://pages.awscloud.com/NAMER-partner-GC-Partner-Cert-Readiness-Cloud-Practitioner-2024-conf.html

https://www.udemy.com/course/aws-certified-cloud-practitioner-new/

https://media.datacumulus.com/aws-ccp/AWS%20Certified%20Cloud%20Practitioner%20Slides%20v28.pdf

in progress..

https://www.examtopics.com/discussions/amazon/view/68991-exam-aws-certified-solutions-architect-associate-saa-c02/

02a-AWS

AWS (Amazon Web Services):

AWS Cloud Computing:

AWS Global Infrastructure:

AWS Shared Responsibility Model:

AWS Identity and Access Management:

⚠️ IAM Exam Traps Summary

🎯 IAM Quick Decision Table

🎯 MASTER SUMMARY: IAM & Organizations Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Implicit Deny by Default

Principle 2: Explicit Deny ALWAYS Wins

Principle 3: Permissions = Intersection, Not Union

Principle 4: Management Account is Untouchable

Principle 5: Scope Determines Tool

Principle 6: Cross-Account = Two Choices

Principle 7: Temporary Credentials > Long-term

Principle 8: Authentication ≠ Authorization

Principle 9: Service-Linked Roles are Special

Principle 10: IAM is Global

Part 2: Decision Trees

Cross-Account Access Decision

Restriction Scope Decision

Identity Provider Decision

The “CANNOT” List

Part 3: Scenario Pattern Recognition

Pattern: “Restrict ALL member accounts from using a service”

Pattern: “Allow developers to create IAM users but prevent privilege escalation”

Pattern: “User needs to access resources in two accounts simultaneously”

Pattern: “Millions of mobile app users need S3 access”

Pattern: “Corporate employees need SSO to multiple AWS accounts”

Pattern: “Detect untagged resources across organization”

Pattern: “Prevent creating resources in unapproved regions”

Pattern: “Share VPC subnets across accounts”

Pattern: “Find resources shared with external accounts”

Pattern: “Temporary credentials for cross-account access”

Pattern: “Require MFA for sensitive operations”

Pattern: “Standardize tag format across organization”

Pattern: “Connect IAM Identity Center to on-premises AD”

Part 4: Quick Reference Tables

SCP vs Permission Boundary vs IAM Policy

Directory Services Comparison

STS API Quick Reference

Part 5: Ultimate Instant-Answer Table

Part 6: Elimination Checklist

🏆 The Golden Rules

Amazon VPC:

CIDR – IPv4:

IP Addresses:

Subnets:

Internet Gateway (IGW):

NAT (Network Address Translation):

Bastion Host:

Security Groups vs NACLs:

VPC Flow Logs:

VPC Peering:

VPC Endpoints:

AWS PrivateLink (VPC Endpoint Services):

Site-to-Site VPN:

Direct Connect (DX):

AWS Client VPN:

Transit Gateway:

VPC Traffic Mirroring:

Egress-only Internet Gateway:

Networking Costs:

AWS Network Firewall:

🎯 MASTER SUMMARY: VPC & Networking Exam Guide

Part 1: Core Principles (Understand WHY → Derive WHAT)

Principle 1: Everything Starts with Routing

Principle 2: Public vs Private = Route to IGW

Principle 3: Stateful vs Stateless = The Fundamental Security Split

Principle 4: Private Subnet Internet Access = NAT (IPv4) or Egress-only IGW (IPv6)

Principle 5: AWS Services from VPC = Endpoints (Stay Private)

Principle 6: On-Premises Connectivity = Speed vs Cost vs Time

Principle 7: Transitivity Doesn’t Exist in VPC Peering

Principle 8: Transit Gateway = The Universal Hub

Principle 9: Network Protection is Layered

Principle 10: Egress Costs Money, Ingress is Free

Part 2: Decision Tree (Follow Keywords → Find Answer)

Connectivity Decision Tree