Availability
✅ Are the components distributed across at least two availability zones?
- Multi-AZ by default e.g. Route53, CloudFront, S3, SQS
- Multi-AZ optional e.g. ALB, RDS, EFS, Auto Scaling groups
- Manual configuration required e.g. EC2 instances
✅ Are the components distributed across multiple regions? Do they need to be?
Availability across regions can be costly in terms of data transfer and infrastructure. Make sure there is a strong business rationale before adopting it.
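As a minimal sketch of verifying the first check, the function below tests whether a set of instances spans at least two availability zones. The instance ids and zones are hypothetical; in practice the mapping would come from describing the instances in an Auto Scaling group.

```python
def spans_multiple_azs(placements, minimum=2):
    """Return True if the instances cover at least `minimum` distinct AZs.

    `placements` maps instance id -> availability zone, e.g. as gathered
    from an EC2 describe-instances call (stand-in data here).
    """
    return len(set(placements.values())) >= minimum

# Hypothetical placement of three instances across two AZs:
placements = {
    "i-0aaa": "eu-west-1a",
    "i-0bbb": "eu-west-1a",
    "i-0ccc": "eu-west-1b",
}
```

A single-AZ deployment would fail this check and should be flagged during review.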
Security
✅ Are the components running under IAM roles defined using a least-privilege perspective?
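To make "least privilege" concrete, here is a minimal sketch of a policy document that grants only the one action on the one resource the component actually needs. The bucket name and prefix are hypothetical.

```python
import json

# Least-privilege sketch: the role may only read objects under one specific
# bucket prefix -- no writes, no deletes, no wildcard actions.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-assets/config/*",
        }
    ],
}

print(json.dumps(read_only_policy, indent=2))
```

During review, look for `*` in `Action` or `Resource` fields as a signal that a role is broader than its workload requires.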
✅ Has a WAF been enabled, with appropriate rules?
✅ Are components encapsulated within VPCs, security groups and subnets (public & private), using routing configuration and NAT gateways to restrict communication to only the expected routes?
✅ Are all S3 buckets private?
✅ Have you performed a threat modelling exercise and employed mitigations for your findings?
Data Security
✅ Is the data encrypted in transit? e.g. HTTPS, SQS
✅ Is the data encrypted at rest? e.g. in S3 buckets, databases
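For encryption at rest, a sketch of the extra arguments that request server-side encryption with a customer-managed KMS key on an S3 upload (these are the parameter names boto3's `put_object` accepts); the bucket, key and KMS alias are hypothetical.

```python
# Request SSE-KMS on upload rather than relying on defaults; the object is
# encrypted before it is written to disk in S3.
put_object_args = {
    "Bucket": "example-app-data",
    "Key": "reports/2024/summary.json",
    "Body": b"{}",
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "alias/example-app-key",  # hypothetical key alias
}
```

Bucket-level default encryption and a bucket policy denying unencrypted uploads are worth checking alongside this.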
State
✅ Are the components running on EC2 instances stateless?
- Prefer storing state in a database, object store or file system over the local instance.
- Prefer caching data in a distributed cache (e.g. Elasticache) over an in-memory cache.
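The statelessness point can be sketched as follows: all session state lives behind a store interface, so any instance can serve any request. The in-memory dict here is a local stand-in for a distributed store such as ElastiCache or DynamoDB.

```python
class SessionStore:
    """Stand-in for a distributed store (e.g. ElastiCache or DynamoDB).

    Nothing is kept on the instance itself; every read and write goes
    through this store, so instances can be replaced at any time.
    """

    def __init__(self):
        self._data = {}  # in-memory stand-in; a real store would be remote

    def put(self, session_id, state):
        self._data[session_id] = state

    def get(self, session_id):
        return self._data.get(session_id)


def handle_request(store, session_id):
    """A stateless handler: all state comes from, and returns to, the store."""
    state = store.get(session_id) or {"visits": 0}
    state["visits"] += 1
    store.put(session_id, state)
    return state["visits"]
```

Because the handler holds nothing between calls, scaling in or recycling an instance loses no data.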
Scalability
✅ Are your components stateless? How long do they take to initialise and be available to respond to requests?
✅ Can you leverage serverless options? e.g. Lambda, Aurora serverless
✅ Are you making use of Auto Scaling Groups?
- Are you scaling on a schedule or a metric?
- Are the min & max values correct?
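As a sketch, the dict below is the shape of a target-tracking scaling policy request that keeps average CPU at roughly 50% (the field names follow the EC2 Auto Scaling `put_scaling_policy` API; the group and policy names are hypothetical).

```python
# Target-tracking policy sketch: the group scales out and in automatically
# to hold ASGAverageCPUUtilization near the TargetValue.
scaling_policy = {
    "AutoScalingGroupName": "example-web-asg",   # hypothetical group name
    "PolicyName": "cpu-target-50",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
}
```

Reviewing the min/max bounds of the group alongside the target value avoids a policy that can never reach its target.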
Performance
✅ Is the EC2 instance type appropriate for the workload? e.g. CPU intensive vs memory intensive vs general purpose
✅ Are static resources distributed using a CDN? e.g. CloudFront
✅ What are the performance-sensitive components? Do they leverage data storage with minimal latency? e.g. ElastiCache vs RDS, DAX vs DynamoDB
✅ How is cache invalidation handled (if caching is used)?
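One common answer to the invalidation question is time-based expiry. A minimal sketch: entries expire after a fixed TTL, so staleness is bounded without per-item invalidation logic (the `now` parameter exists only to make the example deterministic).

```python
import time

class TTLCache:
    """Time-based invalidation sketch: entries expire `ttl` seconds after
    they are stored, forcing a refresh from the source of truth."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self._entries = {}  # key -> (value, stored_at)

    def put(self, key, value, now=None):
        self._entries[key] = (value, time.monotonic() if now is None else now)

    def get(self, key, now=None):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        current = time.monotonic() if now is None else now
        if current - stored_at >= self.ttl:
            del self._entries[key]  # expired: caller must re-fetch
            return None
        return value
```

The TTL trades freshness against load on the backing store; event-driven invalidation is the alternative when stale reads are unacceptable.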
✅ Are you introducing unnecessary latency anywhere? e.g. queue batch sizes, or pass-through Lambdas where an API Gateway integration could call the resource directly
Recoverability
✅ Do the components recover from failure automatically?
This is an advantage of several AWS services:
- S3
- RDS
- ALB
- EC2 Auto Scaling Groups
- EC2 Auto Recovery (single instances)
Reliability
Communication Decoupling
✅ Are you using an Application Load Balancer for HTTP/HTTPS inter-component communication?
✅ Are you using a Network Load Balancer for non-HTTP inter-component communication?
✅ Is asynchronous decoupling in use wherever possible?
- SQS
- Kinesis
- Amazon MQ
- Amazon MSK
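The asynchronous decoupling pattern can be sketched with a local queue standing in for SQS: the producer enqueues and returns immediately, while a worker drains the queue at its own pace, so a slow consumer never blocks the producer.

```python
import queue
import threading

work = queue.Queue()   # local stand-in for an SQS queue
results = []

def worker():
    """Consumer: processes messages at its own pace."""
    while True:
        item = work.get()
        if item is None:       # sentinel: shut down cleanly
            break
        results.append(item * 2)  # placeholder for real processing
        work.task_done()

t = threading.Thread(target=worker)
t.start()

# Producer: enqueue and move on, like sqs:SendMessage.
for n in (1, 2, 3):
    work.put(n)

work.put(None)  # signal shutdown
t.join()
```

With a real broker the two sides would also be deployed and scaled independently, which is the point of the decoupling.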
Monitoring / Supportability / Observability
✅ Is the CloudWatch agent part of your default deployment for an EC2 instance?
✅ Is a log aggregator being used? e.g. CloudWatch, Splunk
✅ What are the key metrics of the health of the system? How are they being monitored?
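Two metrics that almost always belong on that list are error rate and tail latency. A sketch of deriving both from a batch of request records; in practice the records would come from CloudWatch or the log aggregator, and the field names here are assumptions.

```python
def error_rate(records):
    """Fraction of requests that returned a 5xx status."""
    errors = sum(1 for r in records if r["status"] >= 500)
    return errors / len(records)

def p99_latency(records):
    """Latency below which 99% of requests completed (nearest-rank)."""
    latencies = sorted(r["latency_ms"] for r in records)
    index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return latencies[index]
```

Averages hide tail behaviour, which is why the percentile, not the mean, should drive alerting thresholds.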
Deployability
✅ Have you automated the build and deployment process?
✅ How large/granular are the deployable components? Are all "transactions" able to be isolated to within the bounds of a single deployable component?
✅ How is versioning & upgrading handled? Especially for disconnected clients like web and mobile apps.
✅ Have you implemented infrastructure-as-code? How many steps to configure a new environment?
✅ Is your deployment infrastructure separated from your application infrastructure?
A failure in the application infrastructure or connection to it should not impact the deployment infrastructure. Consider using another cloud provider or AWS region.
Data Storage
✅ Does your data storage technology align with use cases, non-functional and quality criteria?
- Relational databases, incl. serverless
- Graph databases
- Document (& other NoSQL) databases
- OpenSearch
- S3
- Remember, you are not constrained to the "out of the box" AWS offerings; you can host your own on EC2 instances
- Also, some database providers have fully cloud hosted options that can be leveraged from your AWS infrastructure e.g. Neo4J Aura
Legality
Testing
✅ Are you supporting "artificial" data/transactions to track, monitor and validate data flows?
✅ What strategies are you adopting to validate performance, load, and correctness?
✅ Do you have automated post-deployment "smoke" tests?
✅ Do you have APIs that allow set up and tear down of data?
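A hypothetical sketch of such a set-up/tear-down API: tests create clearly tagged synthetic records and remove only those afterwards, so they can run against shared environments without polluting real data. The store here is a plain list standing in for the system's data API.

```python
class SyntheticDataClient:
    """Creates and cleans up tagged synthetic test records."""

    SYNTHETIC_TAG = "synthetic"

    def __init__(self, store):
        self.store = store  # stand-in for the system's real data API

    def create(self, record):
        """Insert a record, marking it as synthetic so it is identifiable."""
        record = dict(record, tag=self.SYNTHETIC_TAG)
        self.store.append(record)
        return record

    def teardown(self):
        """Remove every synthetic record, leaving real data untouched."""
        self.store[:] = [
            r for r in self.store if r.get("tag") != self.SYNTHETIC_TAG
        ]
```

The same tag also lets monitoring exclude (or specifically track) the artificial transactions mentioned above.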
✅ What fitness functions do you have in place to verify the metrics for key non-functional requirements? Are they automated within the deployment pipeline?