The team here at Naimuri and I have been working on a product that processes large amounts of data in a short period of time; this insight focuses on our efforts to scale the product and improve its throughput.
Let me paint a picture of where we began:
We created 10 applications which, given a JSON input, would process the data, return a result and then exit; I’ll be referring to these as processors. The output from each processor can be fed back in as input to the other processors, resulting in a feedback loop which can run for many iterations.
We knew we needed to write more processors for the product and that a complete run should take no longer than 4 hours.
As previously mentioned, the processors work in a feedback loop, each fed by the others’ output. This results in exponential growth in output (each run can spawn several new inputs for the next iteration), so as you can imagine the data we were processing got big quickly!
How Do We Grow?
We needed a solution which allowed us to scale simply and quickly. We knew we would be using AWS as our infrastructure of choice, and a few ideas were bounced around:
- Put each processor into a RESTful microservice which could be called by a controller – increasing the number of processors would increase the throughput.
- Use a queue for each processor and make each processor listen to its queue.
- Dockerise the processors and start them using a controller and Docker Machine.
All of the ideas led to similar questions: how do we make sure that the processors are healthy, how do we quickly increase their number when we need higher throughput, and how do we decrease their number when demand is low?
The solution we settled upon was HashiCorp’s Nomad along with Docker containers. Nomad is a scheduler of virtualised, containerised, and standalone applications. It is designed to work at scale, orchestrating applications across a distributed infrastructure.
Why It Works
We settled upon Nomad for the following reasons:
- AWS instances are treated as resources; concerning ourselves with which application was running where became a thing of the past.
- Adding more machines to the cluster increases the compute resources; scaling infrastructure becomes simple.
- Nomad monitors each job (processors in our case). If a job starts failing, Nomad will attempt to restart it. If the EC2 instance running the job fails, the jobs are moved to a healthy compute resource.
- Integration with Consul; adding EC2 instances to the estate is simple. By bootstrapping Consul and Nomad agents into each of our EC2 instances, a new instance can use Consul’s auto-join tag to discover and register itself with the Nomad/Consul cluster, and Nomad then adds it to the compute resource estate.
- Nomad can run a short-lived Docker container (and many other kinds of application); this removed any need to turn our processors into long-running services or write a controller to start a processor. Nomad handles this for us and cleans up afterwards; a sketch of such a batch job follows this list.
- Nomad also answered the wider problem of how to start and keep alive long-running processes such as databases, Kibana, and general tooling applications. Nomad is given the job of starting these and keeping them running.
- Effective use of compute resources; each Nomad agent knows how much resource is available on its instance, and the Nomad servers are responsible for making sure that jobs are run somewhere with enough resources. With some clever bin packing we can make effective use of our resources.
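To make the batch-job idea a little more concrete, here is a minimal sketch of how a short-lived processor could be submitted as a Docker-driven batch job using the official Nomad Go API client (github.com/hashicorp/nomad/api). The job name, image, and resource figures are illustrative only, not what our product actually runs:

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func intToPtr(i int) *int { return &i }

func main() {
	// Talk to the local Nomad agent (honours NOMAD_ADDR and friends).
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Nomad client: %v", err)
	}

	// A batch job runs to completion and is then cleaned up by Nomad,
	// which is what makes it a good fit for a short-lived processor.
	job := nomad.NewBatchJob("processor-example", "processor-example", "global", 50)

	// One Docker task; the image name here is purely illustrative.
	task := nomad.NewTask("processor", "docker")
	task.SetConfig("image", "example.registry/processor:latest")
	task.Require(&nomad.Resources{
		CPU:      intToPtr(500), // MHz
		MemoryMB: intToPtr(256),
	})

	group := nomad.NewTaskGroup("processors", 1)
	group.AddTask(task)
	job.AddTaskGroup(group)

	// Register the job; the Nomad servers decide which client node runs it.
	resp, _, err := client.Jobs().Register(job, nil)
	if err != nil {
		log.Fatalf("registering job: %v", err)
	}
	fmt.Println("evaluation ID:", resp.EvalID)
}
```

Long-running applications such as the databases and Kibana mentioned above are handled in much the same way; they are just registered as service jobs rather than batch jobs, so Nomad keeps them running instead of letting them exit.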
The following diagram outlines how we integrated Nomad. To simplify it I have excluded Consul, which lives outside of the data flow and is responsible for maintaining a registry of all the services. Other services, e.g. persistence of the data, have also been excluded. The system does look like a continuous loop, but it completes when there is no more processing to do.
- Data is added to the queue; each message contains everything a processor needs to run.
- A job runner was created which takes the queued messages and converts them into job schedules for Nomad; see the sketch after this list.
- Nomad on the diagram refers to a cluster of Nomad agents run in server mode. All of the EC2 instances are also running agents in client mode.
- Nomad is supplied with jobs and allocates them across the EC2 compute resource pool. It manages them through their whole lifecycle, including where in the pool they are run.
- The output from the processors is placed back onto the queue for further processing.
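To illustrate the job runner’s role, here is a rough sketch of how a dequeued message could be turned into a Nomad batch job, again using the github.com/hashicorp/nomad/api Go client. The message shape (ProcessorMessage), the image naming convention, and the INPUT_JSON environment variable are assumptions made for the example; the real product’s message schema and the queue integration itself are not shown:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"time"

	nomad "github.com/hashicorp/nomad/api"
)

// ProcessorMessage is a hypothetical shape for the queued messages.
type ProcessorMessage struct {
	Processor string          `json:"processor"` // which processor should run
	Payload   json.RawMessage `json:"payload"`   // the JSON input it needs
}

// jobFromMessage converts one dequeued message into a Nomad batch job.
func jobFromMessage(msg ProcessorMessage) *nomad.Job {
	// Unique job ID so repeated runs of the same processor don't collide.
	id := fmt.Sprintf("%s-%d", msg.Processor, time.Now().UnixNano())
	job := nomad.NewBatchJob(id, id, "global", 50)

	task := nomad.NewTask(msg.Processor, "docker")
	// Illustrative convention: one Docker image per processor.
	task.SetConfig("image", fmt.Sprintf("example.registry/%s:latest", msg.Processor))
	// Hand the JSON payload to the container via an environment variable.
	task.Env = map[string]string{"INPUT_JSON": string(msg.Payload)}

	group := nomad.NewTaskGroup(msg.Processor, 1)
	group.AddTask(task)
	job.AddTaskGroup(group)
	return job
}

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Nomad client: %v", err)
	}

	// In the real job runner this message would come off the queue.
	msg := ProcessorMessage{Processor: "example-processor", Payload: json.RawMessage(`{"key":"value"}`)}

	if _, _, err := client.Jobs().Register(jobFromMessage(msg), nil); err != nil {
		log.Fatalf("registering job: %v", err)
	}
}
```

The real job runner also has to deal with polling the queue, batching, and error handling, none of which is shown here.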
Nomad has been pivotal for our product. We’ve had the odd teething problem, but it has helped us scale in such a short period of time that it was well worth adopting.
By monitoring the size of the resource pool we are able to increase the number of EC2 instances available in the pool according to the compute resource requirements. Currently this scaling of EC2 resources is a manual process, but it is something we are looking to automate in the future.
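As a first step towards that automation, the Nomad API already exposes the information we currently check by hand. The sketch below, again assuming the github.com/hashicorp/nomad/api Go client, simply counts how many client nodes are ready; deciding when to launch or terminate EC2 instances off the back of those numbers is the part we still do manually:

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatalf("creating Nomad client: %v", err)
	}

	// List every client node registered with the Nomad servers.
	nodes, _, err := client.Nodes().List(nil)
	if err != nil {
		log.Fatalf("listing nodes: %v", err)
	}

	ready := 0
	for _, n := range nodes {
		if n.Status == "ready" {
			ready++
		}
	}
	fmt.Printf("%d of %d client nodes ready\n", ready, len(nodes))
}
```

From there, feeding those numbers into something that adjusts the pool for us (for example an alert or an auto-scaling hook) is the piece we are looking to build.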