AWS Architecture Review Checklist

So you have analysed the application requirements and have an initial idea of the AWS architecture.  Here's a high-level checklist you can use to make sure you haven't missed any significant considerations.


✅ Are the components distributed across at least two availability zones?

AWS services come in three availability flavours:

    1. Multi-AZ by default e.g. Route53, CloudFront, S3, SQS
    2. Multi-AZ optional e.g. ALB, RDS, EFS, Auto Scale groups
    3. Manual configuration required e.g. EC2 instances
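
As a quick illustration, a pre-deployment check along these lines can catch single-AZ subnet layouts for the "manual configuration" category.  The `spans_multiple_azs` helper is hypothetical, not an AWS API:

```python
# Hypothetical helper: verify that the subnets chosen for a component
# span at least two availability zones before deploying.
def spans_multiple_azs(subnet_azs):
    """subnet_azs: list of AZ names, e.g. ["eu-west-1a", "eu-west-1b"]."""
    return len(set(subnet_azs)) >= 2

print(spans_multiple_azs(["eu-west-1a", "eu-west-1b", "eu-west-1a"]))  # True
print(spans_multiple_azs(["eu-west-1a"]))                              # False
```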

✅ Are the components distributed across multiple regions?  Do they need to be?

Availability across regions can be costly in terms of data transfer and infrastructure.  Make sure there is a strong business rationale for this.


✅ Are the components running under IAM roles defined using a least-privilege perspective?
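
A sketch of what a least-privilege review might automate, assuming the standard IAM JSON policy shape.  The example policy and the `wildcard_statements` helper are illustrative only:

```python
# Illustrative least-privilege check: flag IAM policy statements that use
# wildcard actions or resources. The policy below is a made-up example
# following the standard IAM JSON document shape.
def wildcard_statements(policy):
    flagged = []
    for stmt in policy.get("Statement", []):
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if "*" in actions or "*" in resources:
            flagged.append(stmt.get("Sid", "<no Sid>"))
    return flagged

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadOnlyBucket", "Effect": "Allow",
         "Action": ["s3:GetObject"], "Resource": "arn:aws:s3:::app-bucket/*"},
        {"Sid": "TooBroad", "Effect": "Allow",
         "Action": "*", "Resource": "*"},
    ],
}
print(wildcard_statements(policy))  # ['TooBroad']
```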

✅ Has a WAF been enabled, with appropriate rules?

✅ Are components encapsulated within VPCs, security groups & subnets (public & private), leveraging routing configuration & NAT Gateways to restrict communication to expected routes only?
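
A minimal sanity check of a subnet layout can be done with Python's standard `ipaddress` module; the CIDR ranges here are assumptions for illustration:

```python
import ipaddress

# Sketch: check that every subnet CIDR falls inside the VPC CIDR, and that
# the public and private ranges do not overlap. CIDRs are examples only.
vpc = ipaddress.ip_network("10.0.0.0/16")
public_subnet = ipaddress.ip_network("10.0.1.0/24")
private_subnet = ipaddress.ip_network("10.0.2.0/24")

assert public_subnet.subnet_of(vpc)
assert private_subnet.subnet_of(vpc)
assert not public_subnet.overlaps(private_subnet)
print("subnet layout is consistent")
```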

✅ Are all S3 buckets private?
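
The four S3 "public access block" settings can be verified mechanically.  `is_fully_private` below is a hypothetical guard over the `PublicAccessBlockConfiguration` shape that boto3 returns:

```python
# Hypothetical guard: given the four S3 public access block settings
# (the PublicAccessBlockConfiguration shape), confirm a bucket is
# fully locked down.
def is_fully_private(config):
    required = ("BlockPublicAcls", "IgnorePublicAcls",
                "BlockPublicPolicy", "RestrictPublicBuckets")
    return all(config.get(key) is True for key in required)

print(is_fully_private({
    "BlockPublicAcls": True, "IgnorePublicAcls": True,
    "BlockPublicPolicy": True, "RestrictPublicBuckets": True,
}))  # True
```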

✅ Have you performed a threat modelling exercise and employed mitigations for your findings?

Data Security

"Dance like nobody's watching. Encrypt like everyone is."
Werner Vogels

✅ Is the data encrypted in transit? e.g. HTTPS, SQS over TLS

✅ Is the data encrypted at rest? e.g. in S3 buckets, databases
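
For encryption at rest, a review script could assert that every bucket has default encryption configured.  This sketch assumes the `Rules` shape returned by S3's GetBucketEncryption call:

```python
# Sketch: confirm a bucket's default-encryption configuration (the shape
# inside GetBucketEncryption's ServerSideEncryptionConfiguration) requests
# server-side encryption. The example config is illustrative.
def default_encryption_algorithms(config):
    return [rule["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"]
            for rule in config.get("Rules", [])]

config = {"Rules": [{"ApplyServerSideEncryptionByDefault":
                     {"SSEAlgorithm": "aws:kms"}}]}
print(default_encryption_algorithms(config))  # ['aws:kms']
```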


✅ Are the components running on EC2 instances stateless?

  • Prefer storing state in a database, object store or file system over the local instance.
  • Prefer caching data in a distributed cache (e.g. Elasticache) over an in-memory cache.

This will facilitate both scale out and recoverability.
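
A minimal sketch of the stateless pattern, with a plain dict standing in for a distributed store such as ElastiCache or DynamoDB:

```python
# Stateless request handler sketch: all session state lives in an external
# store (a dict stands in here), so any instance can serve any request and
# instances can be replaced or scaled out freely.
session_store = {}  # stand-in for a distributed store

def handle_request(session_id, item):
    cart = session_store.setdefault(session_id, [])
    cart.append(item)          # state mutated in the shared store,
    return len(cart)           # never on the local instance

print(handle_request("s1", "book"))   # 1
print(handle_request("s1", "pen"))    # 2  (works from any instance)
```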


✅ Are your components stateless?  How long do they take to initialise and be available to respond to requests?

✅ Can you leverage serverless options? e.g. Lambda, Aurora serverless

    ✅ Are the min & max values correct?

✅ Are you making use of Auto Scaling Groups?

    ✅ Are you using a schedule or a metric?

    ✅ Are the min & max values correct?
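
The min & max question can be reasoned about with target-tracking arithmetic.  This is a back-of-envelope sketch of the idea, not the exact AWS scaling algorithm:

```python
import math

# Back-of-envelope target tracking: scale the current instance count by
# actual/target metric, then clamp to the group's [min, max] bounds.
def desired_capacity(current, actual_metric, target_metric, min_size, max_size):
    desired = math.ceil(current * actual_metric / target_metric)
    return max(min_size, min(max_size, desired))

# 4 instances at 90% CPU against a 60% target -> 6 instances
print(desired_capacity(4, 90, 60, min_size=2, max_size=10))  # 6
```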


✅ Are the EC2 instance types appropriate for the workload i.e. CPU intensive vs memory intensive vs general purpose?

✅ Are static resources distributed using a CDN? e.g. CloudFront

✅ What are the performance-sensitive components?  Do they leverage data storage with minimal latency? e.g. ElastiCache vs RDS, DAX vs DynamoDB.

✅ How is cache invalidation handled (if caching is used)?
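
One common answer is write-through invalidation over a TTL cache.  A minimal sketch follows; the `TTLCache` class is illustrative, not an ElastiCache API:

```python
import time

# Write-through invalidation over a simple TTL cache: writes update the
# backing store and evict the cached entry, so the next read is fresh.
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}          # key -> (value, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        return None

    def put(self, key, value):
        self.entries[key] = (value, time.monotonic() + self.ttl)

    def invalidate(self, key):
        self.entries.pop(key, None)

store, cache = {}, TTLCache(ttl_seconds=60)

def write(key, value):
    store[key] = value
    cache.invalidate(key)      # write-through invalidation

def read(key):
    value = cache.get(key)
    if value is None:
        value = store.get(key)
        cache.put(key, value)
    return value

write("price", 10)
print(read("price"))   # 10 (cache miss, then cached)
write("price", 12)
print(read("price"))   # 12 (stale entry was invalidated)
```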

✅ Are you introducing unnecessary latency anywhere? e.g. queue batch sizes, pass-through Lambdas rather than API Gateway integrations configured direct to the resource.


✅ Do the components recover from failure automatically?  

    This is an advantage of some of the AWS services:

  • S3
  • RDS 
  • ALB
  • EC2 Auto Scaling Groups
  • EC2 Auto Recovery (single instances)
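
Client-side recovery matters too.  Here is a sketch of retrying a transient failure with exponential backoff; real code would add jitter and ensure the operation is idempotent:

```python
import time

# Retry a flaky call with exponential backoff: a simple form of automatic
# recovery at the client level. Delays are tiny here for illustration.
def call_with_retries(operation, max_attempts=4, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retries(flaky))  # ok (after two failed attempts)
```
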

✅ Do you take backups of the data?  How frequently?

✅ What is your Recovery Point Objective?  What is your Recovery Time Objective, i.e. how long does a restore take?


✅ Are you using ACID or BASE (for distributed systems) transactions?

✅ When using queues or message buses, are messages durable (i.e. persisted)?

✅ When using queues or message buses, do messages remain on the queue/bus until successfully processed?
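
This is the SQS model: a received message is only removed once the consumer explicitly deletes it.  A toy sketch of those semantics (not the boto3 API):

```python
# "Delete only after success" queue semantics: a message stays available
# for redelivery until the consumer finishes processing and deletes it.
class Queue:
    def __init__(self):
        self.messages = {}
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = body
        self.next_id += 1

    def receive(self):
        # Returns (id, body) without removing the message.
        return next(iter(self.messages.items()), None)

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

queue = Queue()
queue.send("order-42")

msg_id, body = queue.receive()
try:
    raise RuntimeError("consumer crashed")   # processing fails...
except RuntimeError:
    pass                                     # ...message was NOT deleted

print(queue.receive())       # (0, 'order-42') still available for redelivery
msg_id, body = queue.receive()
queue.delete(msg_id)         # delete only after successful processing
print(queue.receive())       # None
```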

Communication Decoupling

✅ Are you using an Application Load Balancer for HTTP/HTTPS inter-component communication?

✅ Are you using a Network Load Balancer for non-HTTP inter-component communication?

✅ Is asynchronous decoupling in use wherever possible?

  • SQS
  • Kinesis
  • Amazon MQ
  • Amazon MSK

Monitoring / Supportability / Observability

✅ Is the CloudWatch agent part of your default deployment for an EC2 instance?

✅ Is a log aggregator being used?  e.g. CloudWatch, Splunk

✅ What are the key metrics of the health of the system? How are they being monitored?
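
Key-metric monitoring usually boils down to an "N out of M datapoints" threshold, which is how CloudWatch alarms evaluate.  A sketch with hypothetical latency numbers:

```python
from collections import deque

# "N out of M datapoints" alarm evaluation: look at the last `window`
# datapoints and alarm when at least `required_breaches` exceed threshold.
def alarm_state(datapoints, threshold, required_breaches, window):
    recent = deque(datapoints, maxlen=window)
    breaches = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaches >= required_breaches else "OK"

latencies_ms = [120, 480, 510, 530]   # example p99 latency datapoints
print(alarm_state(latencies_ms, threshold=500,
                  required_breaches=2, window=3))  # ALARM (510 and 530 breach)
```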


✅ Have you automated the build and deployment process?

✅ How large/granular are the deployable components?  Are all "transactions" able to be isolated to within the bounds of a single deployable component?

✅ How is versioning & upgrading handled?  Especially for disconnected clients like web and mobile apps.

✅ Have you implemented infrastructure-as-code?  How many steps to configure a new environment?

✅ Is your deployment infrastructure separated from your application infrastructure?

A failure in the application infrastructure or connection to it should not impact the deployment infrastructure.  Consider using another cloud provider or AWS region.

Data Storage

✅ Does your data storage technology align with use cases, non-functional and quality criteria?

  • Relational databases, incl. serverless
  • Graph databases
  • Document (& other NoSQL) databases
  • OpenSearch
  • S3
  • Remember, you are not constrained to the "out of the box" AWS offerings; you can host your own on EC2 instances
  • Also, some database providers have fully cloud hosted options that can be leveraged from your AWS infrastructure e.g. Neo4J Aura


✅ Is the system subject to any regulatory requirements e.g. GDPR?  How are they being met?

✅ Are there any third party libraries / components being used?  What are their license restrictions?  Does the system comply with them?  Are the components supported and/or vendor stable?

✅ What Service Level Agreements are you contractually obligated to meet?  What metrics and measurements do you have in place to validate these are being met?


✅ Are you supporting "artificial" data/transactions to track, monitor and validate data flows?

✅ What strategies are you adopting to validate performance, load, and correctness?

✅ Do you have automated post-deployment "smoke" tests?

✅ Do you have APIs that allow set up and tear down of data?

✅ What fitness functions do you have in place to verify the metrics for key Non-Functional Requirements?  Are these automated within the deployment pipeline?
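
A fitness function can be as small as a pipeline step that fails the build when a latency budget is breached.  A sketch with made-up sample data and a hypothetical 500 ms budget:

```python
# Latency fitness function sketch: compute a p95 over recorded samples and
# compare it to the NFR budget. Sample data and budget are illustrative.
def p95(samples):
    ordered = sorted(samples)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

def latency_fitness(samples_ms, budget_ms):
    return p95(samples_ms) <= budget_ms

samples = [80, 90, 95, 100, 110, 120, 130, 140, 150, 400]
print(p95(samples))                             # 400
print(latency_fitness(samples, budget_ms=500))  # True
```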