New metrics on

You may have noticed that the service got a big facelift last week. The new version of the software has a lot of new features, among them some pretty substantial improvements to Analytics. Recall that we already had metrics for Node.js HTTP server and client operations, garbage collection, socket ops, filesystem ops, and CPU executions. We’ve now got a slew more.

First we have system calls. Whether a program is writing to disk, using the network, or talking to other processes on the same system, it’s making system calls. This metric shows you who’s making which syscalls and how long they’re taking (as a heatmap).
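To make the heatmap idea concrete, here is a minimal sketch of how syscall latencies could be bucketed into heatmap cells (one column per second, one row per latency range). The `(timestamp, latency)` record shape, the `latency_bucket` and `build_heatmap` helpers, and the power-of-two bucket boundaries are all assumptions made for illustration; this is not how the service computes the metric.

```python
from collections import Counter

def latency_bucket(latency_ns):
    """Return a power-of-two bucket index for a latency in nanoseconds."""
    # int.bit_length() is floor(log2(n)) + 1 for n > 0, and 0 for n == 0,
    # so each bucket covers a range twice as wide as the previous one.
    return max(latency_ns, 0).bit_length()

def build_heatmap(samples):
    """Count samples per (second, latency bucket) cell.

    `samples` is an iterable of (timestamp_seconds, latency_ns) pairs,
    a hypothetical record shape chosen for this example.
    """
    cells = Counter()
    for timestamp, latency_ns in samples:
        cells[(int(timestamp), latency_bucket(latency_ns))] += 1
    return cells

# Three syscalls in second 0 (two fast, one slow) and one in second 1.
samples = [(0.1, 900), (0.4, 1000), (0.9, 70000), (1.2, 1100)]
heatmap = build_heatmap(samples)
```

Each cell's count would map to a color intensity, so a cluster of slow calls shows up as a dark patch high on the latency axis.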

Then we have basic resource usage: CPU, memory, and network:

CPU: aggregated CPU usage shows how much CPU your instance is using, as a percentage where 100% corresponds to one fully busy CPU. This number’s a bit tricky to interpret: our compute nodes have more than one CPU, so your app can use more than 100% within a given second. But your app can be compute-bound even at only 100% if it has a single main thread (as Node.js does) and that thread is saturating one CPU.

CPU: aggregated wait time measures how much time threads in your instance spend ready to run but waiting for a CPU. Some amount of friction is expected, but if this number gets high it likely means the system is under unusually high load.

Memory: resident set size shows how much DRAM your instance is using. There’s a separate metric for maximum resident set size which is generally constant and shows how much memory your instance is allowed to use. When your instance exceeds its max RSS, you will see non-zero excess memory reclaimed, indicating that the system is paging out some of your instance’s memory. Your app won’t experience performance problems from this unless you also see non-zero pages paged in. When you see these, your app will be slow because it has to wait for memory to be brought in from disk.

Network: bytes and packets sent/received are pretty self-explanatory. These measure network throughput.
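The rules of thumb above for reading the CPU and memory metrics can be sketched as a few small functions. This is only an illustration: the function names, the parameters, and the 90% threshold are assumptions, not part of the Analytics API.

```python
def cpu_percent(cpu_seconds_used, interval_seconds):
    """Aggregated CPU usage, where 100.0 means one fully busy CPU."""
    return 100.0 * cpu_seconds_used / interval_seconds

def describe_cpu(percent, busy_threads):
    """Rough reading of aggregated CPU usage.

    `busy_threads` is how many threads do the real work: 1 for the
    main thread of a typical Node.js program.
    """
    # A thread can use at most one CPU, so the ceiling is 100% per thread.
    ceiling = 100.0 * busy_threads
    if percent >= 0.9 * ceiling:  # the 90% cutoff is an arbitrary assumption
        return "likely compute-bound"
    return "has CPU headroom"

def describe_memory(rss, max_rss, excess_reclaimed, pages_paged_in):
    """Rough reading of the memory metrics, following the text above."""
    if pages_paged_in > 0:
        # Memory is being brought back from disk: the app waits on it.
        return "paging in: expect slowdowns"
    if excess_reclaimed > 0 or rss > max_rss:
        # Over the limit, but not yet paying for it.
        return "over max RSS: memory is being paged out"
    return "within max RSS"
```

For example, `describe_cpu(100.0, 1)` reports a single-threaded app at 100% as likely compute-bound, while `describe_cpu(100.0, 4)` reports the same number as leaving headroom for an app with four busy threads.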

There are also a number of more arcane (but sometimes very useful) new metrics.

Most importantly, we now support predicates and changing decompositions. While looking at system calls decomposed by application name, you can right-click on a particular application you’re interested in, select only the system calls from that application, and then decompose by something else, like system call name. This is an incredibly powerful feature for iterating on a performance investigation, but the details will have to wait for another post.
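As a rough illustration of what a predicate and a decomposition do to the underlying data, here is a minimal sketch that filters events and then groups them by a field. The event layout, the field names `execname` and `syscall`, and the `decompose` helper are hypothetical and chosen only for this example.

```python
from collections import Counter

def decompose(events, field, predicate=None):
    """Count events grouped by `field`, keeping only those matching `predicate`."""
    counts = Counter()
    for event in events:
        if predicate is None or predicate(event):
            counts[event[field]] += 1
    return counts

# Hypothetical syscall events, one dict per call.
events = [
    {"execname": "node", "syscall": "read"},
    {"execname": "node", "syscall": "write"},
    {"execname": "node", "syscall": "read"},
    {"execname": "nginx", "syscall": "accept"},
]

# Step 1: decompose all system calls by application name.
by_app = decompose(events, "execname")

# Step 2: predicate on one application, then decompose by syscall name.
node_by_syscall = decompose(
    events, "syscall", predicate=lambda e: e["execname"] == "node"
)
```

The second call mirrors the right-click workflow: narrow to one application first, then look at the same data along a different dimension.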

Thanks to everyone at Joyent for the hard work on the new standup. We hope you all find the new metrics useful and we look forward to getting your feedback!