Sep 15, 2014
Operational Tips for SaaS Continuous Availability
One of the challenges of running any SaaS operation is ensuring that the platform is working as expected. The goal is to have enough monitoring coverage to give a high degree of confidence that the platform is healthy, and to notify Operations personnel when it is not. Additionally, retaining and reviewing trends in time-stamped platform data helps operators understand how the platform behaves under known circumstances and better spot outliers that may indicate an issue or a change in platform usage. In this blog post I will talk about how we at Apperian approach monitoring and point out some tools that we use to help accomplish our goal of a continuously available, well-performing platform.
Stateful and Time-Series Monitoring
I like to think of monitoring in two broad categories: stateful monitoring and time-series analysis. Stateful monitors are configured to track a known attribute (e.g. disk usage) and change state based on a threshold (e.g. disk is 85% full). They are great for watching deterministic attributes that have defined thresholds, such as disk space usage, memory usage, or whether a specific process is running. At Apperian, we use stateful monitoring to capture the current health of all of the individual components of our platform and send alerts to our Operations team if any platform attribute is not within the range we define.
However, there are two ways in which stateful monitoring is inadequate. First, it only provides a point-in-time view of the platform, meaning that important trends in the attribute values over time are lost. Second, it does not allow operators to easily correlate the behavior of two or more attributes.
This brings us to time-series analysis. Once time-series data on the attributes mentioned above, or runtime information about your application, is continuously collected, operators can analyze it to extend their monitoring coverage beyond what stateful monitoring alone provides.
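To make the stateful half of this concrete, here is a minimal sketch of a disk-usage monitor that changes state at a threshold. It is not our production check. The 85% warning level is the example used above, while the 95% critical level, the function names, and the message format are assumptions made for illustration. The state names and exit codes follow the usual Nagios plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
import shutil

# Nagios-style states and their conventional plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def classify(percent_used, warn=85.0, crit=95.0):
    """Map a disk-usage percentage to a monitor state.

    The 85% warning level comes from the example in the text; the 95%
    critical level is an assumption made for this sketch.
    """
    if percent_used >= crit:
        return "CRITICAL"
    if percent_used >= warn:
        return "WARNING"
    return "OK"


def disk_percent_used(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_disk(path="/"):
    """Run one check and return (state, message) for an alerting system."""
    percent = disk_percent_used(path)
    state = classify(percent)
    return state, f"DISK {state} - {path} is {percent:.1f}% full"
```

A wrapper script could call `check_disk("/")`, print the message, and exit with `EXIT_CODES[state]` so that a scheduler or monitoring server can act on the result. Note that the check only knows the state at the moment it runs, which is exactly the point-in-time limitation described above.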
How To Use Time-Series Data
Once operators have time-series data, they can track the behavior of the platform through different usage patterns and spot outliers indicating a problem or change in usage that might not otherwise trigger a stateful alert. For instance, at Apperian we track the number of requests hitting our servers every minute and, over time, are able to view usage trends that have a surprisingly stable pattern. We see a request peak around 3 PM EST each weekday, and our lowest daily activity is around 12 AM EST. Requests on weekends tend to be constant and near weekday low levels. Knowing this pattern, we can watch for deviations, which indicate that something out of the ordinary is happening and help us track down either an issue or a change in platform usage.
In addition to establishing platform trends, operators can also correlate two or more data trends to better understand interdependencies. To continue with the web request example above, at Apperian we look at the correlation between requests, server CPU usage, and server memory usage to help gauge our infrastructure capacity. Exponential increases in CPU or memory usage with a linear increase in requests may indicate that we have reached an infrastructure capacity limit and that additional resources need to be added to the platform.
Monitoring Tools
Traditionally, tools to help with both stateful monitoring and time-series analysis were either open source or commercially available software packages that were installed inside the data center and custom-configured to monitor the operator’s infrastructure and application. For very complex SaaS operations, there may even be custom-written software that monitors or collects data on the main application.
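As one example of what such custom-written software might look like, the request-pattern watch described earlier could be sketched roughly as follows. This is an illustration, not the tooling we run. It assumes per-minute request counts are available as `(timestamp, count)` pairs in local time, groups them by hour of the week, and flags a new reading that sits more than three standard deviations from the historical mean for that hour. The function names and the three-sigma cutoff are assumptions chosen for the sketch.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def hour_of_week(ts):
    """Bucket a timestamp by weekday and hour, e.g. (0, 15) is Monday 3 PM."""
    return (ts.weekday(), ts.hour)


def build_baseline(samples):
    """Group historical (timestamp, requests_per_minute) samples by bucket.

    Returns {bucket: (mean, standard deviation)} for every bucket that has
    at least two samples, since a standard deviation needs two or more points.
    """
    buckets = defaultdict(list)
    for ts, count in samples:
        buckets[hour_of_week(ts)].append(count)
    return {
        bucket: (mean(counts), stdev(counts))
        for bucket, counts in buckets.items()
        if len(counts) >= 2
    }


def is_outlier(baseline, ts, count, max_sigma=3.0):
    """Return True when `count` is more than `max_sigma` standard deviations
    away from the historical mean for the same hour of the week."""
    bucket = hour_of_week(ts)
    if bucket not in baseline:
        return False  # no history for this bucket, so nothing to compare
    avg, spread = baseline[bucket]
    if spread == 0:
        return count != avg  # perfectly flat history: any change stands out
    return abs(count - avg) / spread > max_sigma
```

Bucketing by hour of the week is one simple way to respect the weekday peak and the quieter weekends, so that a normal Saturday is not compared against a Monday afternoon. A real deployment would likely need more history per bucket and some smoothing before alerting on a single reading.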
Here at Apperian, we do still use some tools installed inside the data center to help us monitor the platform (for instance, we rely heavily on Nagios for stateful monitoring); however, we have also started using cloud-based monitoring services to simplify and enhance our monitoring capability.
NewRelic
One cloud-based service that has been a great help is NewRelic. NewRelic at its core is an Application Performance Monitoring tool, meaning that it is able to trace what is happening inside your application in real time and provide metrics on how well (or poorly) your application is performing. The metrics are then time-stamped and saved, enabling the operator to conduct time-series analysis. Tools like this have been around for a long time inside the data center; however, a SaaS platform like NewRelic makes it trivial to get up and running.
Our main use of NewRelic is their PHP and Python modules, both of which are installed on the application servers and hook into the PHP and Python code. These modules are able to track execution times, execution paths, and external calls to databases or other web services, and this information can be correlated with other things like browser types and server statistics. All of the data is anonymized and sent to NewRelic’s service, where it is immediately available for time-series analysis.
We use NewRelic almost every day, not only to monitor the health of the application but also to diagnose vague issues such as application slowness, saving us days of development time. NewRelic is a great example of how a SaaS monitoring platform can quickly and easily improve your monitoring coverage and complement more traditional monitoring that you may already have in use.
Conclusion
Monitoring a SaaS platform is a complex task that we take very seriously at Apperian. It involves both knowing the current state of the platform and understanding and reviewing trends in the behavior of the platform over time.
But getting it right can make a huge difference in how customers view the stability and value of the product. Traditional monitoring tools certainly still have their place, but a new wave of cloud-based monitoring platforms, such as NewRelic, is making it very easy for application developers and SaaS operators to extend their monitoring coverage and improve the performance and availability of their platforms.