Mark Twain is often credited with saying, “I didn’t have time to write a short letter, so I wrote a long one instead”. The AWS Well-Architected Framework white-paper is a study on how to properly create and support an IT solution. It was written by thirteen AWS employees with intimidating titles, none of whom are Mark Twain! By that I mean it’s nearly eighty pages long. Although its current format is not without merit, I would like to provide us all with an extremely abridged version, for the purposes of rapid revision and last-second cramming. Let’s see how much I can fit into the next thousand or so words!
The Executive Summary
What comes after is a cut-down version of the document in question, with brief asides to lighten the mood. It’s not an opinion piece, or at least it’s not my opinion. Although I agree with the majority of it, there are parts where I would add caveats if the point wasn’t brevity. I’ve underlined places where I have directly referenced the original document, if you want to cut this down to a couple dozen catch phrases. These are by no means a one-to-one match with anything in the original. I have put effort into removing repetition and specialist language, but there was only so far I could go without affecting the integrity of what is being said. So, I present to you the brief Well-Architected Framework according to Phil!
It is important to have a provable level of operational excellence, but that can be a very nebulous target to hit. The solution starts with determining priorities based on customer needs, governance and compliance requirements, and any risk-benefit trade-offs that may apply. Both your organisation and your culture need to be structured to engage the problem space and support your desired outcomes. Our CTO here at Naimuri likes to quote Conway’s Law: “Any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure”. The Well-Architected Framework suggests that the organisational structure should be moulded first! Once you start designing the structure of the solution, you must pack in best practices such as logging, monitoring and traceability. This design should support rapid development and deployment, and almost as a side effect that should give you the ability to rapidly remediate issues.
If you have ticked all the boxes above then hopefully it’s all downhill to get your new solution into production. You need to understand all of the risks associated with going to production. Prepare for production events such as deployments, rollbacks and security breaches. Responses to these events need to be regularly practised, even if the events themselves do not occur regularly. The team’s response should be comfortable and clear, with defined lines of escalation.
Finally, my favourite point, both the system and the staff should be continuously improving. This needs to be supported and encouraged by the business.
The Well-Architected Framework covers prevention as well as cure. Your well-educated workforce should be focused on applying security best practices in all areas. This covers the authentication and authorisation of both people (users, developers, auditors, etc.) and machines (internal services, logging and monitoring solutions, etc.). Here a strong focus should be put on using temporary credentials over long-lived ones, and on centralising identity management.
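To make the temporary-credentials point a little more concrete, here is a minimal sketch of a credential that simply stops working after a fixed lifetime. The names `TemporaryCredential`, `issue_credential` and `is_valid`, and the 15 minute default lifetime, are my own assumptions for illustration. They are not part of any AWS API, and a real system would lean on a managed identity service rather than hand-rolled tokens.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporaryCredential:
    """Hypothetical short-lived credential: a random token plus an expiry time."""
    token: str
    expires_at: float  # seconds since some epoch, e.g. time.time()


def issue_credential(now: float, ttl_seconds: float = 900.0) -> TemporaryCredential:
    """Issue a credential that expires ttl_seconds after `now` (assumed default: 15 minutes)."""
    if ttl_seconds <= 0:
        raise ValueError("ttl_seconds must be positive")
    return TemporaryCredential(token=secrets.token_urlsafe(32), expires_at=now + ttl_seconds)


def is_valid(credential: TemporaryCredential, now: float) -> bool:
    """A credential is only usable strictly before its expiry time."""
    return now < credential.expires_at
```

The appeal is that a leaked credential has a built-in shelf life: nobody has to remember to revoke it.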
When a security event does occur you need to be alerted immediately, which means you should already be monitoring logs and metrics. Alerts have to be set up before the first event, not years later when you discover there was a breach. Your staff should also be well drilled in how to respond to those alerts. Have a plan and practise it!
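As a rough illustration of alerting on a metric, the sketch below counts failed logins inside a sliding time window and reports when a threshold is reached. The class name `FailedLoginAlarm`, the choice of failed logins as the metric, and the threshold and window values in the usage note are all assumptions of mine. In practice you would more likely configure this in your monitoring service than write it by hand.

```python
from collections import deque


class FailedLoginAlarm:
    """Hypothetical alarm: fires when too many failed logins land inside a sliding window."""

    def __init__(self, threshold: int, window_seconds: float):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, timestamp: float) -> bool:
        """Record one failed login; return True if the alarm should fire."""
        self._events.append(timestamp)
        # Discard failures that have aged out of the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

With `FailedLoginAlarm(threshold=3, window_seconds=60)`, three failures within a minute trip the alarm, while three failures spread over an hour do not.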
Your defensive capability needs to be layered from the application (e.g. firewalls), down through the infrastructure (e.g. vulnerability scanning) and into the network (e.g. NACLs). I could say a lot more on the different means of securing your solution, but let’s keep moving… Data needs to be classified by its sensitivity and secured appropriately, both at rest and in transit.
Reliability is all about planning for the unusual and understanding your own architecture and tools. Be aware of limitations such as service quotas and network constraints. With cloud-based architecture you can autoscale and self-heal. Centralised logging, monitoring and analysis should help you learn from your mistakes and not make them habits.
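Autoscaling and quotas can be sketched together in a few lines. The function below is a simplified take on target tracking: it scales the instance count in proportion to how far a metric sits from its target, then clamps the result to a floor and a ceiling. It is an assumption-laden illustration and not the algorithm any particular cloud service uses. The name `desired_capacity` and its parameters are hypothetical, and `max_instances` stands in for whatever service quota applies.

```python
import math


def desired_capacity(
    current_instances: int,
    current_metric: float,
    target_metric: float,
    min_instances: int,
    max_instances: int,
) -> int:
    """Proportional scaling toward a target metric, clamped to [min_instances, max_instances]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # If the metric is 50% above target, we want roughly 50% more instances.
    raw = math.ceil(current_instances * current_metric / target_metric)
    # The ceiling models a quota: demand beyond it cannot be met by scaling alone.
    return max(min_instances, min(max_instances, raw))
```

For example, four instances averaging 90% CPU against a 60% target would scale to six, provided the ceiling allows it.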
When major changes to your solution are required, time needs to be set aside to plan them carefully. This involves making a documented list of instructions for the rollout (a.k.a. playbook). This should include resilience and functional testing. Most importantly, don’t let yourself be forced into making a series of disparate sweeping changes in quick succession!
I’m not going to get into the definitions of RTO, RPO and DR; suffice it to say that cloud services give you plenty of tools to quickly recover when things go wrong. So long as you have done your research, identified what needs to be backed up and come up with recovery solutions, preferably automated ones, your clients may wonder if anything actually went wrong at all. While they are asking each other “did you see that?” you should have a carefully scripted plan (also a.k.a. playbook) to investigate the incident.
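Without getting into full definitions, one check that is easy to automate is whether your backup schedule could satisfy a given recovery point objective. The sketch below assumes the data at risk is everything written since the most recent backup before the incident. The function names `data_loss_window` and `meets_rpo` are hypothetical, and the model ignores how long a backup takes to complete.

```python
from datetime import datetime, timedelta


def data_loss_window(backups: list[datetime], incident: datetime) -> timedelta:
    """Time between the last backup taken at or before the incident and the incident itself."""
    earlier = [b for b in backups if b <= incident]
    if not earlier:
        raise ValueError("no backup exists before the incident")
    return incident - max(earlier)


def meets_rpo(backups: list[datetime], incident: datetime, rpo: timedelta) -> bool:
    """True if the data lost in this incident stays within the recovery point objective."""
    return data_loss_window(backups, incident) <= rpo
```

With backups at midnight and 06:00 and an incident at 09:00, three hours of data are at risk, which satisfies a four hour objective but not a one hour objective.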
“Well-architected systems use multiple solutions and features to improve performance” – I like that, it’s a nice and succinct way of saying choose the right tooling and use benchmarks to come to the best solution. Pay attention to how you plan to use your compute resources before you choose them. The same can be said for your storage resources and again for your networking solution.
In all cases regular review and evolution are key if you want to maintain a high standard. Newly available services may deliver faster, more reliable and cheaper results. Constant monitoring and analysis is the best way to tell if your existing architecture has shortcomings. Often there are trade-offs that must be considered, and here you must determine how they affect the most critical areas of your solution, your efficiency and your customers’ experience.
As your solution grows, cost optimisation becomes increasingly important, to the point where you should consider employing a team whose sole responsibility is to minimise costs. This kicks in at the development stage, as you can monitor and limit the resources available to your teams in line with their requirements. Change control policies can be set up to establish when resources should be decommissioned. Likewise, an eye should be kept on supply and demand when it comes to the resources being used for your solutions. The simplest example is an internal business offering that can be turned off at night, which can more than halve its cost. The services available, as well as their costs, change over time. If you want to get the best solution at the cheapest price, you have to be willing to put in the work for it.
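To put a rough number on the turn-it-off-at-night point, the sketch below works out what fraction of the week a resource is actually running. It assumes cost is purely proportional to running hours, which ignores storage and any other always-on charges. The function name `running_fraction` is hypothetical.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def running_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of the week a resource is switched on, given a fixed daily schedule."""
    if not 0 <= hours_per_day <= 24:
        raise ValueError("hours_per_day must be between 0 and 24")
    if not 0 <= days_per_week <= 7:
        raise ValueError("days_per_week must be between 0 and 7")
    return hours_per_day * days_per_week / HOURS_PER_WEEK
```

Running 12 hours a day, every day, is exactly half the week. Running 12 hours a day on weekdays only is 60 of 168 hours, or about 36%, which is where the "more than halving" comes from.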
There are sizeable savings to be had by choosing the correct service for the job. Think about using a Lambda or a short-running ECS job to run nightly tests versus a dedicated EC2 instance. It is all about choosing the right resource type and pricing model for your cost targets. Data transfer charges are one of the big gotchas in the cloud. There is more than one way to skin that cat, so choose the cheapest cat-skinning method for your problem space.
What Have I Skipped?
As Tim Burton’s/Jack Nicholson’s Joker once said, “You can’t make an omelette without breaking eggs”… Also François de Charette said it… I seem to like quotes today… I digress. The fact is that I can’t cut the Well-Architected Framework down to a twentieth of its original size without losing something. I have indeed lost some really great points. I would rather you saw this article as a supplement to the original work than as a replacement for it. The white-paper is roughly cut into two pieces, which can be described as answering the questions of ‘why’ and ‘how’ respectively. Here, I have presented you with the answer to ‘what’. What do you need to think about when architecting a solution?
That’s all I have for today. As former president Barack Obama once said *mic drop*