Logging and monitoring may not be the most glamorous aspects of software development, but they are essential to understanding how your software behaves. Whether you’re a developer, systems administrator or project manager, and whether you’re working on a lightweight microservice or a colossal, monolithic system – logging and monitoring are your faithful friends. If you need to figure out what’s gone wrong, predict when problems might arise, or ensure correctness in your system, then logging and monitoring will be there to help.
Logging
Logging is the art of writing out useful runtime information, as it happens, to a file system or datastore; externalising a history of what your software has done “just in case” something goes wrong. Having this data available at some point in the future lets you check that everything has been behaving correctly.
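As a minimal sketch of what this looks like in practice – here using Python’s standard logging module (the file name, logger name, and messages are illustrative choices, not a recommendation):

```python
import logging

# Write runtime events to a file so they can be inspected later;
# the file name and format here are illustrative.
logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("checkout")

logger.info("User %s added item %s to basket", "user-42", "sku-1001")
logger.warning("Payment provider responded slowly (%.1fs)", 4.2)
```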
Why Log?
For Development
Developers need the ability to diagnose and discover problems in their applications, both during development and, more crucially, after deployment. In production it’s rare for a developer to have access to the debugging tools available while the application is being worked on, so logging is often the last – and sometimes the only – way to acquire the information required to resolve problems correctly. By looking at application logs, developers can gain better insight into how their application is being used – which isn’t just crucial for debugging, but is also important for directing and driving future development.
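One common pattern that supports this is logging full exception tracebacks wherever errors are handled, so the diagnostic detail survives into production where a debugger isn’t available. A sketch – the function and discount codes are hypothetical:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical business data, purely for illustration.
DISCOUNTS = {"SAVE10": 0.10}

def apply_discount(price: float, code: str) -> float:
    try:
        return price * (1 - DISCOUNTS[code])
    except KeyError:
        # logger.exception records the message *and* the full traceback,
        # which may be the only diagnostic information available in production.
        logger.exception("Unknown discount code %r", code)
        raise
```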
For Operations
Operations teams and sysadmins require logs to ensure the health of the overall system – a lack of logs from a machine is a bad sign, usually meaning that the system is down, has lost its network connection, or has some other issue. By assessing the logs leading up to that point, the problem can hopefully be diagnosed. This use case borders on monitoring, and there is some crossover; however, logs can be much more granular than monitoring data, and when looked at in conjunction with a monitoring solution they allow for a much more in-depth assessment of the landscape.
For Security
By assessing logs (looking for patterns, diagnosing anomalies, tracking mean and median usage, and other analysis methods), security teams can detect outliers where usage patterns differ from the norm. Perhaps there’s a sudden and unexpected spike of activity from an unusual geolocation – this could be a security issue, so having logs and alerting available will help security teams quickly assess and remediate any danger.
A high level of confidence can be gained in the security setup of the system by tracking logs this way. For example, logs could show that unauthorised users were denied access to areas of the site they shouldn’t be able to access.
Logs are crucial for audit purposes – e.g. in the case of an external investigation, having a great logging setup could make the difference between proving innocence and a hefty fine.
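For example, denied-access events could be written as structured entries, which are easier for security tooling to search and analyse later. A sketch – the field names, logger name, and sample values are illustrative choices:

```python
import json
import logging

audit_logger = logging.getLogger("audit")

def log_access_denied(user_id: str, resource: str, source_ip: str) -> None:
    # Structured (JSON) entries make it easier for security tooling
    # to search for patterns and anomalies later.
    audit_logger.warning(json.dumps({
        "event": "access_denied",
        "user_id": user_id,
        "resource": resource,
        "source_ip": source_ip,
    }))

log_access_denied("user-42", "/admin/settings", "203.0.113.7")
```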
What to Log?
Everything, right? Well, logging too much can be almost as problematic as too little. There’s a happy medium to be found between not having enough logging to diagnose a problem, and information overload. Generally this is handled with different logging levels for different deployment environments – the development environment might output everything, and the live production system might just output error traces.
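One simple way to achieve this is to derive the log level from an environment variable. A sketch in Python – the APP_ENV variable name and the level mapping are assumptions, not a standard:

```python
import logging
import os

# Map deployment environments to log levels: verbose in development,
# errors only in production. APP_ENV is an assumed variable name.
LEVELS = {"development": logging.DEBUG, "production": logging.ERROR}

env = os.environ.get("APP_ENV", "development")
logging.basicConfig(level=LEVELS.get(env, logging.INFO))

logging.debug("Only visible in development")
logging.error("Visible in every environment")
```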
What Do I Do With All This Data?
By choosing a logging solution that allows anomaly detection, usage pattern analysis, and assessment of granular log data, teams and companies can adjust what they develop and how their product works accordingly. For example, log analysis could show that 90% of visitors are abandoning items in their basket, and on further analysis the checkout page has a high page load time – perhaps by improving the page load time those 90% could be turned into paying customers.
Some really powerful analysis could make all the difference, and many logging providers offer Machine Learning algorithms pre-built into their systems. These algorithms provide a simple interface to trend analysis and powerful anomaly detection.
Additionally, some providers have other interesting abilities for analysis. Elasticsearch, for example, allows normalising geo-ip data by population sizes, which gives a way to visualise market penetration per capita – a potentially powerful tool to e.g. assess the success of a global marketing strategy.
Monitoring
Monitoring is a crucial part of the maintenance phase of an application and, when used in conjunction with logging, can provide both a high-level overview and an in-depth analysis of the application and the infrastructure on which it’s running.
Where logging is usually event driven – a log line is output when some interaction happens (e.g. a user puts something in their basket, or a new HTTP connection is made) – monitoring is usually done through polling: a monitoring agent requests some information from the host it’s running on, and repeats that request periodically. By regularly making these requests over a large enough time frame, and by centralising the data onto a monitoring server, a picture of the health of the system can be built up.
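A minimal sketch of this polling model, using the third-party psutil library to take point-in-time snapshots of host health – the interval and the send_to_monitoring_server placeholder are illustrative:

```python
import time
import psutil  # third-party: pip install psutil

POLL_INTERVAL_SECONDS = 30  # illustrative; real agents make this configurable

def collect_metrics() -> dict:
    """Take a point-in-time snapshot of host health."""
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

def send_to_monitoring_server(metrics: dict) -> None:
    # Placeholder: a real agent would ship this to a central server.
    print(metrics)

while True:
    send_to_monitoring_server(collect_metrics())
    time.sleep(POLL_INTERVAL_SECONDS)
```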
Types of Monitoring
Traditionally, monitoring has been infrastructure focussed, with many services reporting hardware performance metrics such as CPU utilisation, disk IO, and network load; but newer monitoring methods offer more of a full-stack solution. This provides the ability to collect middleware metrics, such as database queries per second or database error rates, and – with newer Application Performance Monitoring (APM) additions – key application metrics too.
Infrastructure
The hardware, or virtual hardware, that the machines and services run on can be monitored – usually by installing an agent that collects CPU usage, disk IO, network IO, memory capacity, and numerous other pieces of information. This can be used to assess whether a machine is underutilised or overloaded.
Middleware
The software services that run on the infrastructure but aren’t the application itself. Examples include NoSQL servers, the Docker daemon, Apache, NGINX etc. Data can be collected about error rates, query rates, general health, throughput and more.
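For example, a collector might periodically scrape a service’s status endpoint. The sketch below parses the output of NGINX’s stub_status module – this assumes the module has been enabled and exposed at the given URL, both of which are configuration-dependent:

```python
import re
import urllib.request

# Assumes NGINX's stub_status module is enabled and exposed at this path;
# both the module and the URL depend on the server's configuration.
STATUS_URL = "http://localhost/nginx_status"

def nginx_active_connections() -> int:
    with urllib.request.urlopen(STATUS_URL, timeout=5) as response:
        body = response.read().decode()
    match = re.search(r"Active connections:\s+(\d+)", body)
    if match is None:
        raise ValueError("Unexpected stub_status response")
    return int(match.group(1))

print(nginx_active_connections())
```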
Application
Also known as APM, application monitoring covers APIs, web applications, mobile applications etc. (i.e. the front end that the user sees). Some providers offer code snippets or libraries that have to be built into your application and provide monitoring automatically. Other providers have an API that can be used to make specific monitoring calls – this requires manual development work. Pretty much any area of the application can be monitored using this method, but some examples include transaction traces; application health, throughput, and latency; and per-function application performance.
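As a simplified illustration of the kind of instrumentation an APM library adds automatically, here is a decorator that records per-function latency – the logger name and example function are made up:

```python
import functools
import logging
import time

logger = logging.getLogger("apm")

def traced(func):
    """Record per-function latency, a simplified stand-in for what
    APM libraries instrument automatically."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.2f ms", func.__name__, elapsed_ms)
    return wrapper

@traced
def checkout(basket_id: str) -> None:
    time.sleep(0.05)  # stand-in for real work

checkout("basket-1")
```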
Synthetic
Some monitoring systems provide ‘Synthetic’ monitoring, which is a kind of live regression testing – the output of a web application or web API is tested to ensure it’s responding correctly. These checks range from simply pinging a URL, to making a GET request and checking the response for a validation string, to fully scripted API calls with scripted validation. Data can be collected on page load times, page availability, and component load times.
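A minimal synthetic check might look like this – fetching a page, validating its content, and timing the request (the URL and validation string are placeholder values):

```python
import time
import urllib.request

# Placeholder target: any page with a known string would do.
URL = "https://example.com/"
VALIDATION_STRING = "Example Domain"

def synthetic_check(url: str, expected: str) -> dict:
    """GET a page, verify it contains an expected string, and time the request."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as response:
        body = response.read().decode("utf-8", errors="replace")
    return {
        "url": url,
        "available": expected in body,
        "load_time_seconds": round(time.perf_counter() - start, 3),
    }

print(synthetic_check(URL, VALIDATION_STRING))
```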
Analytics
Finally, there are custom business metrics, such as tracking items in shopping carts, time spent on each page, a customer’s journey through the site, advertising revenue, etc. This is usually done via JavaScript in the client’s browser.
What to Monitor?
Ideally all of the above should be monitored: collecting and storing data is cheap, and it can be much more expensive not having that information when it’s needed. Data collected should be granular enough to diagnose issues and collected regularly enough that trends can be spotted and problems resolved before they turn into critical issues.
What Do I Do With All This Data?!
Once the data is collected it can be analysed and assessed. Much like with logging, Machine Learning algorithms can be applied, or the data can simply be visualised and used for diagnostics or to shape development.
The data should also be used to alert relevant parties* when certain problematic conditions occur – such as CPU utilisation staying above 95% for more than 10 minutes, or new database queries taking longer than the rolling 5-minute average.
*This is usually the unfortunate Ops support team member who’s on call – but preferably an automated remediation system.
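As a sketch of how such an alert rule might be evaluated – here, CPU above 95% for a full 10-minute window of 30-second samples (the thresholds, window size, and alert placeholder are all illustrative):

```python
from collections import deque

WINDOW_SAMPLES = 20            # 10 minutes of 30-second samples
CPU_THRESHOLD_PERCENT = 95.0

recent = deque(maxlen=WINDOW_SAMPLES)

def record_sample_and_check(cpu_percent: float) -> bool:
    """Return True once every sample in the window exceeds the threshold,
    i.e. CPU has been above 95% for the full 10 minutes."""
    recent.append(cpu_percent)
    return (len(recent) == WINDOW_SAMPLES
            and all(sample > CPU_THRESHOLD_PERCENT for sample in recent))

def alert(message: str) -> None:
    # Placeholder: a real system would page the on-call engineer
    # or trigger automated remediation.
    print("ALERT:", message)

if record_sample_and_check(97.3):
    alert("CPU above 95% for 10 minutes")
```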
Conclusion
Hopefully you can see why logging and monitoring should be viewed as core parts of your application, website, or service.
They can help you take big steps towards avoiding major business risks like software bugs, hardware issues, and malicious threats.
Together they help provide a deeper understanding of your software, your customers, and ultimately the service or products that your business provides.