I was lucky enough to attend the AWS Builders’ Day recently and what follows is my attempt to digest and relay the excellent event hosted by Amazon. This blog is a run through of everything I can remember from the day, with links and pointers to things which might be of interest.
The 3 featured topics of the event were:
- Artificial Intelligence
The challenge of my day was to balance the subjects on offer and choose those presentations that I thought would bring the most value to myself, Naimuri and our customers.
First up was Lecture 1
Containers: State of the union
This was an initial introduction to the day and a whirlwind tour of containers and how they are a core concept for AWS. Abby covered a brief history of how things have progressed in her, and probably most other developers’, world over the last 6 years.
She began with the loved and loathed world of monolithic applications and the pain of developing and managing those systems. These systems were replaced with microservices, the saviour to all things monolithic, but among other things, they introduced the complexity of managing and deploying lots of small, quickly changing services. Enter containers, schedulers and the ever evolving world of ECS, EKS and Fargate within AWS.
Abby then led into what was to follow for the containers track: ECS and the new world of managed services for Kubernetes in the form of EKS and Fargate.
She gave a nice example of Monzo using Kubernetes, highlighting its power to help them scale massively and quickly. This example was, however, followed by a warning quote:
“Deploying Kubernetes in a highly available configuration on AWS is not for the faint of heart and requires you to get familiar with its internals, but we are very pleased with the results.”
She also recommended the Medium blogs written by Nathan Peck, an AWS container advocate. https://medium.com/@nathankpeck
Finally, she directed people to two useful slack channels:
- awsdevelopers.slack.com – A private AWS channel which needs an invite from Amazon to access. DM Abby on Twitter or contact AWS support with your email if you wish to gain access.
- amazon-ecs.slack.com – This is not official and is open to all, but is apparently a great place to ask for help.
Next up, Lecture 2
An Introduction to Deep Learning: Theory, Use Cases & Tools
I was hoping this would be a high level overview of the state of machine learning and not a deep dive into the guts of AI, as despite having a Maths degree, I still feel queasy when I look at the equations used within AI.
The lecture started strongly with the quote:
“The common feeling among developers: AI is hard; as developers we’re not smart enough to do it!”
This had me interested and summed up my feelings about AI at the moment… It’s not for the uninitiated, and it’s not a magic wand where you just use AI and everything is figured out for you.
Unfortunately for me the lecture then dived into a fair bit of complex maths and I have to admit my mind did wander for some time!
Luckily, I came to in time to hear an excellent example of the “Local Minimum Problem” and why, if you don’t have LOTS of data, AI probably isn’t the solution you are looking for.
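For anyone who has not met the local minimum problem before, here is a minimal sketch of the idea. It is my own illustration rather than the example from the talk: the function, learning rate and starting points are all made up for demonstration. Plain gradient descent simply walks downhill, so where it ends up depends on where it starts.

```python
def gradient(x: float) -> float:
    """Derivative of the made-up curve f(x) = x**4 - 3*x**2 + x."""
    return 4 * x**3 - 6 * x + 1


def descend(start: float, learning_rate: float = 0.01, steps: int = 500) -> float:
    """Run plain gradient descent from `start` and return where it settles."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)  # always step downhill
    return x


# This curve has two dips: a shallow one on the right and a deeper one on the left.
stuck = descend(2.0)    # starts on the right, settles in the shallow dip
better = descend(-2.0)  # starts on the left, settles in the deeper dip
print(f"from x=2.0 we settle at x={stuck:.2f}")
print(f"from x=-2.0 we settle at x={better:.2f}")
```

Both runs report success as far as the algorithm is concerned, yet only one of them found the lowest point on the curve.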
Julien gave an example of how machine learning could be used for making shopping recommendations using unsupervised learning. In this instance the program would try to make recommendations; if a user clicked on a link, that could then be used as an input to classify whether the suggestion was helpful.
The talk also covered GANs (generative adversarial networks), which are the kind of AI wizardry we have probably all seen in the media. In this instance we saw a demo of a very convincing computer-generated person created from pictures of celebrities. Take a look yourself: there’s a technical blog available here: http://torch.ch/blog/2015/11/13/gan.html or a fun video of the results here: https://www.youtube.com/watch?v=XOxxPcy5Gr4
Julien steered people to the demo published on this GitHub page: https://tcwang0509.github.io/pix2pixHD/
This was a very engaging example of the use of a GAN to construct an HD image from a basic model which was labelled with features. The demo shows a street scene reduced down to basic flat areas. The GAN program then generates what it thinks should be displayed over this basic scene.
This medium post contains a quick overview of the fantastic work happening within image processing using AI: https://medium.com/mlreview/10-deep-learning-projects-based-on-apache-mxnet-8231109f3f64
Sockeye was mentioned as an interesting piece of machine learning work by AWS: a sequence-to-sequence framework for neural machine translation built on Apache MXNet.
Amazon Elastic Container Service for Kubernetes (Amazon EKS)
Ric is someone whose face is familiar to me as he has been to a couple of Manchester AWS Meetups; he gives exceptional, informative talks on all things AWS. Today, Ric was giving a talk on Amazon’s brand new EKS service. This is so new, Ric didn’t even have access to it – he was only able to access it via a colleague’s account! This did mean that the talk was a little light on details and relied upon the slides quite a bit.
Ric explained why Amazon had introduced EKS… although Kubernetes was created for managing containers and the estate they run on, setting up and managing Kubernetes is hard work! Amazon have done the legwork to make setting up Kubernetes in the AWS cloud much simpler.
Instead of you using kops to set up and manage the cluster, Amazon will provide it for you.
They will run 3 masters across different Availability Zones (AZs) to provide a highly available solution. The master nodes are monitored and replaced automatically if they become unhealthy. You won’t be able to connect via ssh to these master nodes or install anything onto the machines, and logs will be available via CloudWatch. The user of the service is responsible for creating worker nodes which can connect to these masters. Packer was recommended as a way to create AMIs for the worker instances.
EKS will support native VPC networking with a CNI plugin. Apparently, normal Kubernetes can be very chatty over the network. EKS should also make testing easier without opening traffic to the public: everything can stay within the VPC without being exposed to the outside world.
A final bit of useful advice from Ric was a pointer to https://github.com/aws-samples which contains loads of handy examples/workshops of AWS services. This looks like a great starting place if you want to learn about a specific service.
Building Global Serverless Backends powered by Amazon DynamoDB Global Tables
The bits that I found most intriguing in Adrian’s talk were actually the bits not specifically about AWS. He talked about how hard it is to achieve 4 or 6 nines of availability in a system:
99.99% availability actually means 52 mins/year of down time.
99.9999% availability means the system can only be down for 31 seconds/year!
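The arithmetic behind those two figures is easy to check. The snippet below is just my own back-of-the-envelope sketch, assuming a 365-day year; the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in seconds per year, for an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.99, 99.9999):
    seconds = downtime_seconds_per_year(target)
    print(f"{target}% allows {seconds:.1f} seconds ({seconds / 60:.1f} minutes) per year")
```

Each extra nine cuts the budget by a factor of ten, which is why going from 4 nines to 6 nines takes you from roughly 52 minutes to roughly 31 seconds.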
To do this a lot of fault tolerance has to be engineered into a system, which eventually means a system has to be running in multiple AZs, everything needs to be able to run in parallel and everything needs to be automated to have any chance of getting 6 nines availability.
Going to multiple AZs introduces an interesting problem: latency and network inconsistencies. To adopt this approach successfully, all applications need to run asynchronously. There is no telling when a system will respond to a request, so applications should work without expecting a response straight away. Also, they should have an automatic retry with an exponential backoff policy.
Adrian mentioned an example where a retry policy was used, but without a backoff policy. The system in question had lots of applications which all suffered a small network failure and then went on to cause a huge network failure when they all retried multiple times without waiting!
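To make the retry advice concrete, here is a minimal sketch of retrying with exponential backoff. It is my own illustration, not code from the talk: the function name, the default delays and the choice of `ConnectionError` as the retryable failure are all assumptions. I have also added random jitter, which is a common addition so that many clients do not all retry at the same instant.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller deal with the failure
            # The ceiling doubles on every attempt: 0.1s, 0.2s, 0.4s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Wait a random time up to the ceiling so clients spread out their retries.
            sleep(random.uniform(0, ceiling))
```

Passing `sleep` in as a parameter is only there to make the function easy to test without actually waiting; in normal use you would call it as `call_with_backoff(my_request)`, where `my_request` is whatever function makes your network call.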
Another great example was Netflix’s UI implementation. Their UI will initially render part of the application and defaults are displayed to the user. Then, in the background, many asynchronous calls are made and the content is loaded as it becomes available. The user is not made to wait to use the application; it is available to them as early as possible and the content follows afterwards.
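The pattern described above can be sketched in a few lines. This is purely my own illustration of the idea and says nothing about how Netflix actually build their UI: the section names, delays and `render` callback are all made up, and `asyncio.sleep` stands in for a real network call.

```python
import asyncio

# Placeholder content shown to the user straight away.
DEFAULTS = {"recommendations": "Loading...", "continue_watching": "Loading..."}


async def fetch_section(name: str, delay: float):
    """Pretend to fetch one section of the page from a backend service."""
    await asyncio.sleep(delay)  # stands in for network latency
    return name, f"{name} content"


async def render_page(render) -> dict:
    """Render defaults first, then fill in each section as its data arrives."""
    page = dict(DEFAULTS)
    render(dict(page))  # the user sees something immediately
    tasks = [
        asyncio.create_task(fetch_section("recommendations", 0.05)),
        asyncio.create_task(fetch_section("continue_watching", 0.01)),
    ]
    for finished in asyncio.as_completed(tasks):
        name, content = await finished
        page[name] = content
        render(dict(page))  # re-render whenever any section becomes available
    return page


if __name__ == "__main__":
    asyncio.run(render_page(print))
```

The page is rendered three times here: once with the defaults and once more as each background call completes, in whatever order they finish.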
Also, Adrian spoke about using Simian Army (developed by Netflix) to test the fault tolerance of a system, i.e. how well will your system cope if random parts of it start going down or don’t respond as quickly as they should?
Following on from these examples, Adrian spoke about using Serverless to help with availability, i.e. if you don’t have to worry about keeping the servers running to run your code, then at least some of the complication is taken away.
He also spoke about making data available across multiple global regions and how DynamoDB Global Tables could be used to avoid the pain of having to write custom code to move data between global databases. He gave a brief but compelling demo of how easy and quick it was for data to be replicated across two DB tables which were in two different regions.
And finally, Lecture 5
Advanced Container Management and Scheduling
Abby Fuller (as per Lecture 1)
Abby Fuller returned to give the final lecture of the day. This was a deep dive into containers, scheduling and the things she had learnt whilst in the field with AWS. It should be mentioned that her talk focused on using the ECS service to orchestrate containers.
A full version of the lecture can be watched here: https://www.twitch.tv/videos/226951698
The talk contains lots of little, thought-provoking nuggets of valuable knowledge which I’ll try and summarise for you.
First up was Task Placement. This is the ECS policy defining how containers are placed across multiple hosts. The options are:
- Bin pack – this policy will attempt to pack containers as densely as possible, filling up each host as much as it can before moving onto the next one.
- Spread – attempt to spread the containers as evenly as possible.
- Affinity – group specified containers together on a host.
- Distinct – allow only one container of each type on a host.
Much more about this subject can be found here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-placement-strategies.html
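To show the difference between the first two options, here is a toy simulation of bin pack versus spread. It is only a sketch of the idea and not how ECS is implemented: the host names, memory sizes and the `place` function are invented for illustration, and I have simplified spread to mean "the host currently running the fewest tasks".

```python
def place(tasks, hosts, strategy):
    """Assign each task (a memory size in MiB) to a host and return the placement.

    `hosts` maps a host name to its free memory in MiB.
    """
    free = dict(hosts)
    placement = {name: [] for name in hosts}
    for task in tasks:
        candidates = [name for name in free if free[name] >= task]
        if not candidates:
            raise RuntimeError(f"no host has {task} MiB free")
        if strategy == "binpack":
            # Fill up the fullest host that still has room before touching the next.
            chosen = min(candidates, key=lambda name: free[name])
        elif strategy == "spread":
            # Pick the host currently running the fewest tasks.
            chosen = min(candidates, key=lambda name: len(placement[name]))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        placement[chosen].append(task)
        free[chosen] -= task
    return placement


hosts = {"host-a": 1024, "host-b": 1024, "host-c": 1024}
tasks = [256, 256, 256, 256]
print("binpack:", place(tasks, hosts, "binpack"))
print("spread: ", place(tasks, hosts, "spread"))
```

With bin pack, all four tasks land on one host, leaving the other two empty and free to be scaled away. With spread, every host gets at least one task, which costs more but means losing a single host takes out fewer containers.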
AWS offers 3 types of load balancers:
- Application Load Balancer – best for HTTP/HTTPS traffic using request routing, and best for microservice and container architectures.
- Network Load Balancer – best for extreme performance. This type of LB does not need warming up and is great for spiky traffic, such as event booking sites which see heavy increases in traffic and are quiet the rest of the time.
- Classic Load Balancer – largely redundant now and only provided as legacy to support the EC2-Classic network.
Lots more about this can be found here: https://aws.amazon.com/elasticloadbalancing/
Docker Image Sizes
When creating a system which will have lots of Docker containers running on a server, careful image creation can be important to help reduce the size of the images on disk and the amount of network traffic required to pull a new image onto the host.
Where possible use a shared base image, reducing the footprint of images on a box. Think about what is being put into the image; limit the data written to the containers as much as is possible. Each line in a Dockerfile is a layer which can take up space. Combining statements can reduce the number of layers created. Chaining RUN statements can be a good way of reducing the number of lines used. Be aware that RUN, ADD and COPY all add layers to an image.
Docker caching can be complicated but should help to reduce the size on disk and the time taken to retrieve layers. One point Abby added is that if you are using a git fetch command to pull the latest from a repo, Docker will cache this line even if the content in the repo changes, and will not try to pull again after the first attempt.
Clean up Docker images which are not being used, two useful commands are:
$ docker image prune
$ docker system prune
Abby recommended taking a look at Spotify’s garbage collection implementation https://github.com/spotify/docker-gc
Put in place monitoring and alerting to track when a cluster is running out of resources, and put in place a policy to manage how to scale (both up and down). CloudWatch was one recommendation for doing the monitoring.
Another was to use the cluster query language in AWS to get information back from the running cluster. See here for further details: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cluster-query-language.html
So that is it, everything that I can remember from my AWS Builders’ Day! I hope some of the material in here has been beneficial. I know I found lots of golden nuggets of information that I will be using in my day job.
Check out https://www.twitch.tv/aws/events?filter=past if you want to watch any of the lectures from the day.
The full agenda for the day is available here: https://aws.amazon.com/events/aws-builders-day-uki-2018/
Thanks for reading.