AWS Architecture Review Checklist



So you have analysed the application requirements and have an initial idea of the AWS architecture. Here's a high-level checklist you can use to make sure you haven't missed any significant considerations.

Availability

✅ Are the components distributed across at least two availability zones?

AWS services come in three availability flavours:

    1. Multi-AZ by default e.g. Route 53, CloudFront, S3, SQS
    2. Multi-AZ optional e.g. ALB, RDS, EFS, Auto Scaling groups
    3. Manual configuration required e.g. EC2 instances
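
For the third flavour, one way to check the spread is to count the distinct availability zones your instances occupy. The sketch below assumes instance records shaped like entries in the EC2 `describe_instances` response (a `Placement.AvailabilityZone` field). The function names and sample data are hypothetical.

```python
from collections import Counter


def az_distribution(instances):
    """Count instances per availability zone.

    Each record is assumed to carry Placement.AvailabilityZone, as in the
    EC2 describe_instances response.
    """
    return Counter(i["Placement"]["AvailabilityZone"] for i in instances)


def spans_multiple_azs(instances, minimum=2):
    """Return True when the instances occupy at least `minimum` zones."""
    return len(az_distribution(instances)) >= minimum


# Hypothetical sample data, for illustration only.
sample_instances = [
    {"InstanceId": "i-aaa", "Placement": {"AvailabilityZone": "eu-west-1a"}},
    {"InstanceId": "i-bbb", "Placement": {"AvailabilityZone": "eu-west-1a"}},
    {"InstanceId": "i-ccc", "Placement": {"AvailabilityZone": "eu-west-1b"}},
]
```

The per-zone counts are also worth a look: two zones with a 9:1 split will not absorb the loss of the busier zone gracefully.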

✅ Are the components distributed across multiple regions?  Do they need to be?

Availability across regions can be costly in terms of data transfer and infrastructure. Make sure there is a strong business rationale for this.

Security

✅ Are the components running under IAM roles defined using a least-privilege perspective?
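
A quick first pass on least privilege is to scan policy documents for wildcard grants. The sketch below works on a policy already parsed into a dict in the standard IAM JSON shape (`Statement`, `Effect`, `Action`, `Resource`). The helper names are hypothetical, and the check is deliberately crude: some actions legitimately require `"Resource": "*"`, so treat hits as prompts for review rather than failures.

```python
def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy):
    """Return Allow statements that grant wildcard actions or resources."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement object is also valid
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources
        if broad_action or broad_resource:
            flagged.append(stmt)
    return flagged


# Hypothetical policy, for illustration only.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": ["s3:*"], "Resource": "*"},
    ],
}
```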

✅ Has a WAF been enabled, with appropriate rules?

✅ Are components encapsulated within VPCs, security groups & subnets (public & private), using route configuration & NAT Gateways to restrict communication to only the expected routes?

✅ Are all S3 buckets private?
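
S3 Block Public Access exposes four settings per bucket (and per account). As a minimal sketch, assuming you already have the bucket's public access block configuration as a dict with those four keys, the hypothetical helper below reports which ones are not switched on.

```python
# The four S3 Block Public Access settings.
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def missing_public_access_blocks(config):
    """Return the Block Public Access settings that are absent or False."""
    return [flag for flag in REQUIRED_FLAGS if not config.get(flag, False)]
```

An empty result means all four settings are enabled. Anything else is worth a conversation about why that bucket needs to be reachable publicly.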

✅ Have you performed a threat modelling exercise and employed mitigations for your findings?

Data Security

“Dance like nobody’s watching. Encrypt like everyone is.”
Werner Vogels

✅ Is the data encrypted in transit? e.g. HTTPS, SQS

✅ Is the data encrypted at rest? e.g. in S3 buckets, databases
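
For S3, encryption in transit can be enforced rather than merely hoped for. A common pattern is a bucket policy that denies any request not made over TLS, using the `aws:SecureTransport` condition key. The sketch below only builds the policy document. The function name and bucket name are hypothetical, and attaching the policy is left out.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Build a bucket policy that rejects requests made without TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and the objects inside it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder name.
    print(json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2))
```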

State

✅ Are the components running on EC2 instances stateless?

  • Prefer storing state in a database, object store or file system over the local instance.
  • Prefer caching data in a distributed cache (e.g. ElastiCache) over an in-memory cache.

This will facilitate both scale out and recoverability.
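
As a rough illustration of what "stateless" means in practice, the sketch below keeps all session data behind a small store interface, so any instance can serve any request. `SessionStore` and `CartHandler` are hypothetical names, and the in-process dict merely stands in for a shared database or distributed cache.

```python
class SessionStore:
    """Minimal key-value interface a shared store would need to offer."""

    def __init__(self):
        self._data = {}  # stand-in for a shared cache or database

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class CartHandler:
    """Request handler that keeps no state of its own between calls."""

    def __init__(self, store):
        self._store = store

    def add_item(self, session_id, item):
        # Read, modify and write back through the store on every request.
        cart = list(self._store.get(session_id, []))
        cart.append(item)
        self._store.put(session_id, cart)
        return cart
```

Two handler objects sharing one store behave like two instances behind a load balancer: whichever one receives the next request sees the same cart.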

Scalability

✅ Are your components stateless?  How long do they take to initialise and be available to respond to requests?

✅ Can you leverage serverless options? e.g. Lambda, Aurora Serverless

    ✅ Are the min & max capacity values correct?

✅ Are you making use of Auto Scaling Groups?

    ✅ Are you using a schedule or a metric?

    ✅ Are the min & max values correct?
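
To see why the min & max values matter, it helps to model how a metric-driven policy interacts with them. The sketch below uses a simplified proportional rule (scale capacity by the ratio of observed metric to target) and then clamps the result. This is an illustrative assumption, not AWS's exact scaling algorithm, and the function name is hypothetical.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size, max_size):
    """Estimate desired capacity, clamped to the group's min and max."""
    if not 0 <= min_size <= max_size:
        raise ValueError("expected 0 <= min_size <= max_size")
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Simplified proportional rule: an assumption made for illustration.
    proportional = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, proportional))
```

If the clamp is hit often at the top end, the max is too low for real demand. If the group always sits on its min, you may be paying for idle capacity.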

Performance

✅ Are the EC2 instance types appropriate for the workload i.e. CPU intensive vs memory intensive vs general purpose?

✅ Are static resources distributed using a CDN e.g. CloudFront?

✅ What are the performance-sensitive components? Do they leverage data storage with minimal latency? e.g. ElastiCache vs RDS, DynamoDB vs DAX.

✅ How is cache invalidation handled (if caching is used)?
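
Two common answers are time-based expiry and explicit invalidation on write. The sketch below combines both in a small in-process cache. `TtlCache` is a hypothetical class for illustration, and a real deployment would more likely apply the same idea to a shared cache.

```python
import time


class TtlCache:
    """Cache with time-based expiry plus explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or expiry."""
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]
        value = loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop a key immediately, e.g. right after the source is updated."""
        self._entries.pop(key, None)
```

The TTL bounds how stale a value can get if an invalidation is missed. The explicit call keeps the common case fresh.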

✅ Are you introducing unnecessary latency anywhere e.g. queue batch sizes, pass-through Lambdas rather than API Gateway configured to integrate directly with the resource?

Recoverability

✅ Do the components recover from failure automatically?  

    This is an advantage of some of the AWS services:

  • S3
  • RDS 
  • ALB
  • EC2 Auto Scaling Groups
  • EC2 Auto Recovery (single instances)

✅ Do you take backups of the data?  How frequently?

✅ What is your Recovery Point Objective?  What is your restore time?
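
The arithmetic here is simple but worth writing down: with periodic backups, the worst-case data loss is one full backup interval, so the interval must not exceed the Recovery Point Objective. The sketch below also compares restore time against a Recovery Time Objective, which is an assumption on my part as the companion target. The function name is hypothetical.

```python
from datetime import timedelta


def recovery_report(backup_interval, restore_time, rpo, rto):
    """Compare backup cadence and restore time against recovery objectives."""
    return {
        # A failure just before the next backup loses a full interval of data.
        "worst_case_data_loss": backup_interval,
        "meets_rpo": backup_interval <= rpo,
        "meets_rto": restore_time <= rto,
    }
```

For example, nightly backups with a four-hour RPO fail the check no matter how quickly the restore itself runs.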

Reliability

✅ Are you using ACID or BASE (for distributed systems) transactions?

✅ When using queues or message buses, are messages durable (i.e. persisted)?

✅ When using queues or message buses, do messages remain on the queue/bus until successfully processed?
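
The consumer pattern behind this check is "receive, process, then delete": the message is removed only after the handler succeeds. The in-memory sketch below illustrates that flow. `AckQueue` and `process_one` are hypothetical, and the explicit `release` is a simplification (SQS, for instance, makes a message visible again when its visibility timeout expires).

```python
from collections import deque


class AckQueue:
    """In-memory queue where messages persist until explicitly deleted."""

    def __init__(self):
        self._ready = deque()
        self._in_flight = {}  # receipt -> message body
        self._next_receipt = 0

    def send(self, body):
        self._ready.append(body)

    def receive(self):
        """Hand out one message, keeping it in flight until deleted."""
        if not self._ready:
            return None
        body = self._ready.popleft()
        self._next_receipt += 1
        self._in_flight[self._next_receipt] = body
        return self._next_receipt, body

    def delete(self, receipt):
        self._in_flight.pop(receipt, None)

    def release(self, receipt):
        """Return an unprocessed message to the queue for another attempt."""
        if receipt in self._in_flight:
            self._ready.append(self._in_flight.pop(receipt))

    def pending(self):
        return len(self._ready) + len(self._in_flight)


def process_one(queue, handler):
    """Process a single message, deleting it only if the handler succeeds."""
    received = queue.receive()
    if received is None:
        return False
    receipt, body = received
    try:
        handler(body)
    except Exception:
        queue.release(receipt)  # leave the message available for retry
        return False
    queue.delete(receipt)
    return True
```

Because a message can be delivered more than once under this pattern, handlers should be idempotent.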

Communication Decoupling

✅ Are you using an Application Load Balancer for HTTP/HTTPS inter-component communication?

✅ Are you using a Network Load Balancer for non-HTTP inter-component communication?

✅ Is asynchronous decoupling in use wherever possible?

  • SQS
  • Kinesis
  • Amazon MQ
  • Amazon MSK

Monitoring / Supportability / Observability

✅ Is the CloudWatch agent part of your default deployment for an EC2 instance?

✅ Is a log aggregator being used?  e.g. CloudWatch, Splunk

✅ What are the key metrics of the health of the system? How are they being monitored?

Deployability

✅ Have you automated the build and deployment process?

✅ How large/granular are the deployable components?  Are all "transactions" able to be isolated to within the bounds of a single deployable component?

✅ How is versioning & upgrading handled?  Especially for disconnected clients like web and mobile apps.

✅ Have you implemented infrastructure-as-code?  How many steps to configure a new environment?

✅ Is your deployment infrastructure separated from your application infrastructure?

A failure in the application infrastructure or connection to it should not impact the deployment infrastructure.  Consider using another cloud provider or AWS region.

Data Storage

✅ Does your data storage technology align with use cases, non-functional and quality criteria?

  • Relational databases, incl. serverless
  • Graph databases
  • Document (& other NoSQL) databases
  • OpenSearch
  • S3
  • Remember, you are not constrained to the "out of the box" AWS offerings; you can host your own on EC2 instances
  • Also, some database providers have fully cloud-hosted options that can be leveraged from your AWS infrastructure e.g. Neo4j Aura

Legality

✅ Is the system subject to any regulatory requirements e.g. GDPR?  How are they being met?

✅ Are there any third-party libraries / components being used?  What are their license restrictions?  Does the system comply with them?  Are the components supported &/or vendor stable?

✅ What Service Level Agreements are you contractually obligated to meet?  What metrics and measurements do you have in place to validate these are being met?
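
It can help to translate an availability percentage into a concrete downtime budget and to compare it with what you actually measure. The sketch below assumes a 30-day measurement period and a request-based availability measure. Both are illustrative choices, and your contract may define them differently. The function names are hypothetical.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Downtime budget implied by an availability SLA over a period."""
    if not 0 < sla_percent <= 100:
        raise ValueError("sla_percent must be in (0, 100]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


def measured_availability_percent(total_requests, failed_requests):
    """Request-based availability: the share of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * (total_requests - failed_requests) / total_requests
```

A 99.9% SLA over 30 days leaves roughly 43 minutes of downtime, which is a useful number to hold up against your deployment and recovery times.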

Testing

✅ Are you supporting "artificial" data/transactions to track, monitor and validate data flows?

✅ What strategies are you adopting to validate performance, load, and correctness?

✅ Do you have automated post-deployment "smoke" tests?

✅ Do you have APIs that allow set up and tear down of data?

✅ What fitness functions do you have in place to verify the metrics for key Non-Functional Requirements?  Are these automated within the deployment pipeline?
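
A fitness function can be as small as a check that fails the pipeline when a metric drifts past its budget. The sketch below evaluates a 95th-percentile latency budget using the nearest-rank percentile method. The function names, the choice of p95 and the budget value are all assumptions for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_fitness(latencies_ms, p95_budget_ms):
    """Report whether observed p95 latency stays within its budget."""
    observed = percentile(latencies_ms, 95)
    return {
        "p95_ms": observed,
        "budget_ms": p95_budget_ms,
        "passed": observed <= p95_budget_ms,
    }
```

In a pipeline, a step would feed this with latencies gathered from a load test and fail the build when `passed` is false.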
