Threshold alerts

I’ve previously discussed alerts in the context of fault management. On the 7000, alerts are just events of interest to administrators, such as “disk failure” or “backup completion.” Administrators can configure actions that the system takes in response to particular classes of alert. Common actions include sending email (for notification) and sending an SNMP trap (for integration into environments that use SNMP to monitor systems across a datacenter). A simple alert configuration might be “send mail to admin@mydomain when a disk fails.”
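
To make the idea of mapping alert classes to actions concrete, here is a minimal sketch in Python. It is not the appliance's code: `AlertDispatcher`, `send_mail`, and the alert details are hypothetical names made up for illustration, and the mail action only returns a string rather than sending anything.

```python
class AlertDispatcher:
    """Hypothetical registry mapping alert classes to configured actions."""

    def __init__(self):
        self._actions = {}

    def on(self, alert_class, action):
        # Register an action to run whenever this class of alert is posted.
        self._actions.setdefault(alert_class, []).append(action)

    def post(self, alert_class, detail):
        # Run every action configured for the class; return what each produced.
        return [action(alert_class, detail)
                for action in self._actions.get(alert_class, [])]


def send_mail(to):
    # Stand-in for a real mail action: builds the message instead of sending it.
    def action(alert_class, detail):
        return f"mail to {to}: [{alert_class}] {detail}"
    return action


dispatcher = AlertDispatcher()
dispatcher.on("disk failure", send_mail("admin@mydomain"))
messages = dispatcher.post("disk failure", "disk 3 failed")
```

An SNMP trap action would be registered the same way, as another callable attached to the same alert class.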

While most alerts come predefined on the system, there’s one interesting class of parametrized alerts that we call threshold-based alerts, or just thresholds. Thresholds use DTrace-based Analytics to post (trigger) alerts based on anything you can measure with Analytics. It’s best summed up with this screenshot of the configuration screen:

[Screenshot: threshold alert configuration screen]

The possibilities here are pretty wild, particularly because you can enable or disable Analytics datasets as an alert action. For datasets with some non-negligible cost of instrumentation (like NFS file operations broken down by client), you can automatically enable them only when you need them. For example, you could have one threshold that enables data collection when the number of operations exceeds some limit, and another that turns collection off again when the number falls below a slightly lower limit.
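
The logic of that pair of thresholds can be sketched in a few lines of Python. This is only an illustration of the idea, not how the appliance implements it: `DatasetToggle`, `apply_thresholds`, and the limits of 1000 and 800 operations per second are all assumptions chosen for the example.

```python
class DatasetToggle:
    """Hypothetical stand-in for an expensive Analytics dataset."""

    def __init__(self):
        self.enabled = False


def apply_thresholds(dataset, ops_per_sec, enable_above, disable_below):
    # First threshold: start collecting once the load exceeds the upper limit.
    if ops_per_sec > enable_above:
        dataset.enabled = True
    # Second threshold: stop collecting once the load drops below the lower limit.
    elif ops_per_sec < disable_below:
        dataset.enabled = False
    # Between the two limits, the dataset keeps whatever state it already had.
    return dataset.enabled


dataset = DatasetToggle()
samples = [500, 1200, 950, 850, 700, 1100]  # made-up operations per second
history = [apply_thresholds(dataset, s, enable_above=1000, disable_below=800)
           for s in samples]
```

Keeping the "off" limit slightly below the "on" limit means a load hovering around a single value doesn't flip the dataset on and off with every sample.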

Implementation

One of the nice things about developing a software framework like the appliance kit is that once it’s accumulated a critical mass of basic building blocks, you can leverage existing infrastructure to quickly build powerful features. The threshold alerts system sits atop several pieces of the appliance kit, including the Analytics, alerts, and data persistence subsystems. The bulk of the implementation is just a few hundred lines of code that monitors stats and posts alerts as needed.