Challenges deploying PostgreSQL (9.2) for high availability
A detailed post about experiences trying to keep a fleet of large, high-volume PostgreSQL databases online.
All about swap on illumos systems
All about swap on illumos systems. Part 2 of 2 in a series on how illumos manages physical memory and swap.
Physical memory on illumos systems
An overview of how illumos systems manage physical memory. Part 1 of 2 in a series on how illumos manages physical memory and swap.
Shell redirection example
Explaining some surprising bash behavior.
Modifying USDT providers with translated arguments
Diving deep on a bug in the Node.js USDT translators.
Performance Puzzler: The Slow Server
What happens when one server in a distributed system gets slow?
Visualizing PostgreSQL Vacuum Progress
As heavy users of PostgreSQL since 2012, we’ve learned quite a bit about operating PostgreSQL at scale. Our Manta object storage system uses a large fleet of sharded, highly-available, replicated PostgreSQL clusters at the heart of the metadata tier. When an end user requests their object, say http://us-east.manta.joyent.com/dap/public/kartlytics/videos/2012-09-06_0000-00.mov, Manta winds up looking in one of these PostgreSQL clusters for that object’s metadata in order to find the storage servers hosting copies of the object (along with the size, checksum, and other useful information).
TCP puzzlers
It’s been said that we don’t really understand a system until we understand how it fails. Despite having written a (toy) TCP implementation in college and then working for several years in industry, I’m continuing to learn more deeply how TCP works – and how it fails. What’s been most surprising is how basic some of these failures are. They’re not at all obscure. I’m presenting them here as puzzlers, in the fashion of Car Talk and the old Java puzzlers. Like the best of those puzzlers, these are questions that are very simple to articulate, but the solutions are often surprising. And rather than focusing on arcane details, they hopefully elucidate some deep principles about how TCP works.
Programming language debuggability
A few weeks ago I had the fortune to attend a talk by Douglas Crockford on various ideas related to JavaScript past and future and programming language design. The talk was thoughtful and well-received (and thanks to Crockford and Bigcommerce for making it happen!). The main theme, not new to those familiar with Crockford’s work, was that as programmers, we have a responsibility to create bug-free programs; and that therefore when faced with choices about which language features to use, we should always choose the safest ones – that is, we should always avoid the features that can sometimes be used unsafely when a safer feature is available.
Debugging enhancements in Node 0.12
(also on the Joyent blog) Background: Node.js provides the richest postmortem debugging facilities of just about any dynamic language. Postmortem debugging refers to debugging programs from a core file, heap dump, or some similar dump of memory. With a core file of a Node program (which is essentially a snapshot of your program’s state), you can: Find all JavaScript objects and classify them by their properties. This is very useful for identifying large memory leaks.
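To give a flavor of what that looks like in practice, here is a minimal sketch of an mdb_v8 session on illumos/SmartOS (the core file name is hypothetical):

$ mdb core.24719
> ::load v8
> ::jsstack
> ::findjsobjects

::jsstack prints the JavaScript-level stack at the moment the core was taken, and ::findjsobjects walks the heap and groups JavaScript objects by their property names, which is the classification step described above.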
Understanding DTrace ustack helpers
or, everything you ever wanted to know about stack traces. I promised this post over a year ago, and now that someone’s actually working on a new ustack helper, I thought it was finally time to write about what ustack helpers are, how they work, and how I went about building one for Node.js. Only a handful of ustack helpers have ever been written: Node, Java, Python, and PHP (the last of which is believed lost to the sands of time), so this post is mainly for a narrow audience of developers, plus anyone who’s interested in how this all works.
Tracing Node.js add-on latency
Node.js has some built-in DTrace probes for looking at HTTP request start/done and GC start/done, and I wrote nhttpsnoop to report latency based on these probes. There’s also a future project that will make it easy to trace any asynchronous operations. But in the meantime, there’s one category of latency that’s missing, which is latency of asynchronous operations kicked off by add-ons. As an example, I’ll use node-sqlite3. Below, I have sqlite-demo.
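For reference, those built-in probes can be watched with an ordinary DTrace one-liner. The sketch below counts HTTP requests and GC cycles once per second (probe names as shipped with Node’s DTrace support; argument layouts vary by Node version):

$ dtrace -n 'node*:::http-server-request { @reqs = count(); }
    node*:::gc-start { @gcs = count(); }
    tick-1s { printa(@reqs); printa(@gcs); clear(@reqs); clear(@gcs); }'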
Stopping a broken program in its tracks
Last week I debugged a Node issue where under some conditions, fork() failure could result in some file descriptors being closed, leading shortly to bedlam. In the best cases, the program would immediately crash, but often it would just do the wrong thing. This happens a lot, of course, and in many ways it’s worse than a crash because you have no core dump nor even a stack trace to go on.
Node.js in production: runtime log snooping
This post also appeared on the Joyeur blog. This post is one of several about how we operate Node.js in production at Joyent. Most of my experience comes from working on our Manta Storage Service, which currently comprises a few thousand Node processes across three datacenters. Logging is one of the most primitive but most valuable forms of postmortem debugging. Logs let you figure out why a program crashed (or did the wrong thing) after it’s already done so.
Kartlytics: Applying Big Data Analytics to Mario Kart 64
This post also appears on the Joyeur blog. If you missed it, Joyent recently launched Manta, a web-facing object store with compute as a first-class operation. Manta makes it easy to crunch on Big Data in the cloud, and we’ve seen it used by both ourselves and others to solve real business problems involving Big Data. But it’s not just for user behavior and crash dump analysis: Manta has profoundly changed the way we operate here at Joyent’s SF office.
Fault tolerance in Manta
Since launching Manta last week, we’ve seen a lot of excitement. Thoughtful readers quickly got to burning questions about the system’s fault tolerance: what happens when backend Manta components go down? In this post, I’ll discuss fault tolerance in Manta in more detail. If anything seems left out, please ask: it’s either an oversight or just seemed too arcane to be interesting. This is an engineering internals post for people interested in learning how the system is built.
Inside Manta: Distributing the Unix shell
Photo by Jackie Reid, NOAA. Today, Joyent has launched Manta: our internet-facing object store with compute as a first-class operation. This is the culmination of over a year’s effort on the part of the whole engineering team, and I’m personally really excited to be able to share this with the world. There’s plenty of documentation on how to use Manta, so in this post I want to talk about the guts of my favorite part: the compute engine.
Debugging dynamic library dependencies on illumos
In this short follow-up to my post on illumos process tools, I’ll expand a bit on ldd and pldd, which print the dynamic linking dependencies of binaries and processes, respectively, and crle, which prints out the runtime linker configuration. These tools are available in most illumos distributions including SmartOS. Understanding builds (and broken builds in particular) can be especially difficult. I hate running into issues like this one: $ ffmpeg ld.
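For quick reference, basic usage of these tools looks like this (the binary path and pid are just examples):

$ ldd /usr/bin/ffmpeg    # libraries a binary on disk says it needs
$ pldd 1234              # libraries a running process has actually loaded
$ crle                   # print the current runtime linker configuration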
illumos tools for observing processes
illumos, with Solaris before it, has a history of delivering rich tools for understanding the system, but discovering these tools can be difficult for new users. Sometimes, tools just have different names than people are used to. In many cases, users don’t even know such tools might exist. In this post I’ll describe some tools I find most useful, both as a developer and an administrator. This is not intended to be a comprehensive reference, but more like part of an orientation for users new to illumos (and SmartOS in particular) but already familiar with other Unix systems.
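As a small taste, here are a few of the classic proc(1) tools of the kind the post walks through, run against a hypothetical process:

$ pgrep -l node    # find processes by name
$ pstack 1234      # print a stack trace for every thread
$ pfiles 1234      # list open file descriptors
$ pargs 1234       # show the command-line arguments (-e shows the environment)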
OSCON Slides
Thanks to all who attended my talk at OSCON on Node.js in production: postmortem debugging and performance analysis. Just a reminder: all of the technology I described is open source, most of it part of illumos. For more info, check out the links in slide 22. For the curious, there are also some slides on implementation details I didn’t have time to cover.
NodeConf slides
NodeConf was a great success this year. Thanks to @mikeal for organizing, and to everyone who spoke and attended. The slides from my talk on DTrace, Node.js, and Flame Graphs are here.
ACM Turing Centenary Celebration
This past weekend, I was very fortunate to have a chance to attend the ACM’s Turing Centenary Celebration in honor of the 100th anniversary of the birth of Alan Turing. The event brought together nearly every living Turing Award winner for a series of talks and panel discussions on subjects like AI, theory of computation, computer architecture, and the role of computation in other fields. A webcast of the entire event is already available online.
Debugging Node.js in Production (Fluent slides)
Thanks to all who attended my talk at Fluent today on “Debugging Node.js in Production”. The slides are available here. There’s some extra content there that I didn’t have time to cover in just 40 minutes, most notably some implementation notes about mdb_v8 and the DTrace ustack helper. Please leave questions, comments, or feedback below (or @dapsays)!
Debugging RangeError from a core dump
Last week, I tweeted: I had just run into this nasty Node.js error:

$ node foo.js
timers.js:96
        if (!process.listeners('uncaughtException').length) throw e;
                                                             ^
RangeError: Maximum call stack size exceeded

What went wrong? It was reasonably obvious from the error message that the program blew its stack, which I assumed was likely the result of some errant recursive function. That was surprising, because I didn’t know I was using any recursive functions. But given that the problem is too many function invocations on the stack, the obvious question is: what’s on the stack?
Profiling Node.js
(For returning readers, this is basically a “tl;dr” version of my previous post on Node.js performance. The post below also appears on the Node.js blog.) It’s incredibly easy to visualize where your Node program spends its time using DTrace and node-stackvis (a Node port of Brendan Gregg’s FlameGraph tool): Run your Node.js program as usual. In another terminal, run: $ dtrace -n 'profile-97/execname == "node" && arg1/{ @[jstack(100, 8000)] = count(); } tick-60s { exit(0); }' > stacks.
Managing Node.js dependencies with shrinkwrap
This post also appears on the Node.js blog. Photo by Luc Viatour (flickr). Managing dependencies is a fundamental problem in building complex software. The terrific success of GitHub and npm has made code reuse especially easy in the Node world, where packages don’t exist in isolation but rather as nodes in a large graph. The software is constantly changing (releasing new versions), and each package has its own constraints about what other packages it requires to run (dependencies).
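For readers who just want the punch line, the basic shrinkwrap workflow looks roughly like this:

$ npm install        # resolve and install dependencies as usual
$ npm shrinkwrap     # record the exact installed versions in npm-shrinkwrap.json
$ npm install        # later installs (yours and your users') honor that file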
Playing with Node/V8 postmortem debugging
2016 Update: The commands and output in this post are pretty dated now, but you can find up-to-date docs (including a tutorial and instructions for updated binaries) in the mdb_v8 user guide. “Post Mortem” by C. MacLaurin. Several weeks ago I posted about postmortem debugging for Node.js, a critical technique for understanding fatal software failure (and thereby keeping up software quality). Now that the underlying pieces are freely available, you can use the documentation below to start debugging your own Node programs.
Where does your Node program spend its time?
Photo by Julian Lim (flickr). Performance analysis is one of the most difficult challenges in building production software. If a slow application isn’t spending much time on CPU, it could be waiting on filesystem (disk) I/O, network traffic, garbage collection, or many other things. We built the Cloud Analytics tool to help administrators and developers quickly identify these sources of latency in production software. When the app is actually spending its time on CPU, the next step is figuring out what it’s doing: what functions it’s executing and who’s calling them.
USDT Providers Redux
In this post I’m going to review DTrace USDT providers and show a complete working example that I hope will be a useful reference for people interested in building providers for their own applications. First, the prerequisites: DTrace is the comprehensive dynamic tracing framework available on illumos-based, BSD, and macOS systems. If you’ve never used DTrace, check out dtrace.org; this post assumes you’ve at least played around with it. USDT (Userland Statically Defined Tracing) is the mechanism by which application developers embed DTrace probes directly into an application.
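Once an application embeds a provider, its probes show up like any other DTrace probes. For a hypothetical provider called myapp in process 1234:

$ dtrace -l -n 'myapp1234:::'                               # list the probes that process publishes
$ dtrace -n 'myapp*::: { printf("%s fired", probename); }'  # trace them as they fire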
Node.js/V8 postmortem debugging
2016 Update: The commands and output in this post are pretty dated now, but you can find up-to-date docs (including a tutorial) in the mdb_v8 user guide. Photo by Auntie P (flickr). I recently wrote an article for ACM Queue discussing postmortem debugging for dynamic environments. I argued that while native environments for years have provided rich tools for understanding software failures postmortem, most popular dynamic environments have not developed similar facilities, and this gap is becoming more important as these environments grow popular for building complex core components of large distributed systems.
Surge 2011
We had a great time last week attending Surge in Baltimore. Highlights for me included Baron Schwartz’s visualizations of MySQL execution time (not entirely unlike Brendan’s, but with the addition of modeling), Geir Magnusson’s discussion of scalability at Gilt, and Raymond Blum’s discussion of backup/restore at Google (including recovery after the GMail outage from several months ago). In his keynote, Ben Fried introduced what turned out to be an important theme in several talks: the importance of hiring generalists.
New metrics on no.de
You may have noticed that the no.de service got a big facelift last week. The new version of the software has a lot of new features, among them some pretty substantial improvements to Analytics. Recall that we already had metrics for Node.js HTTP server and client operations, garbage collection, socket ops, filesystem ops, and CPU executions. We’ve now got a slew more. First we have system calls. Whether a program is writing to disk, using the network, or talking to other processes on the same system, it’s making system calls.
JavaScript Lint on SmartOS
Photo by Sam Fraser-Smith. Back at Fishworks, we used a tool called JavaScript Lint (JSL) for static code analysis. You may know that lint was originally written to identify potential semantic problems in C code, like use of uninitialized variables or blocks of unreachable code. Lint warnings are usually static checks that could reasonably have been compiler warnings, but for whatever reasons those checks didn’t make it into the compiler.
Distributed Web Architectures @ SF Node.js Meetup
At the Node Meetup here at Joyent’s offices a few weeks ago I gave a brief talk about Cloud Analytics as a distributed web architecture. Matt Ranney of Voxer and Curtis Chambers of Uber also spoke about their companies’ web architectures. Thanks to Jackson for putting the videos together. All around it was a great event!
OSCON Slides
I’ve posted the slides from Brendan’s and my talk yesterday at OSCON Data called “Design and Implementation of a Real-Time Cloud Analytics Platform.” Thanks to everyone who attended. We’d love to get your feedback! And thanks to @adrianco for this photo of our demo.
Heatmap coloring
Brendan has written a great 5-part series on filesystem latency. Towards the end of part 5, he alluded to a lesser-known feature of Cloud Analytics heatmaps, which is the ability to tweak the way heatmaps are colored. In this post I’ll explain heatmap coloring more fully. Keep in mind that this is a pretty advanced topic. That’s not to say that it’s hard to understand, but rather it’s somewhat complicated and you rarely actually need to tweak this.
Heatmaps and more heatmaps
I’ve talked with a few Cloud Analytics users who weren’t sure how to interpret the latency heatmaps. While we certainly aren’t the first to use heatmaps in performance analysis, they’re not yet common and they can use some explanation the first time you see them. I posted a simple example a few weeks ago, but video’s worth a thousand words so I’ve made a screencast covering a different example and explaining the heatmap basics:
Presenting at OSCON Data 2011
Brendan’s and my proposal for OSCON Data 2011 has been accepted! The talk is called “Design and Implementation of a Real-Time Cloud Analytics Platform”. We’ll dive into the implementation of Cloud Analytics with particular focus on the design considerations that allow it to work in real-time on a large production cloud. We’ll also cover some real-world case studies of people using CA to understand application behavior. If you’ve used no.de’s Analytics to look at your application, we’d love to hear from you!
Example: HTTP request latency and garbage collection
A few weeks ago I posted about Cloud Analytics on no.de. I described the various metrics we included in the initial launch and how they’re relevant for Node applications. Today I’ve got a small example using HTTP request latency and garbage collection. I wrote a small HTTP server in Node.js that simply echoes all requests, but in doing so deliberately creates a bunch of garbage. That is, it allocates a bunch of extra memory while processing each request but doesn’t save it anywhere.
Welcome to Cloud Analytics
We’ve been talking for several weeks now about our work on Cloud Analytics. Today, we’re showing the world what we’ve put together. Now available on Joyent’s Node.js Service: a first look at Cloud Analytics in action – on your very own Node.js application. Cloud Analytics (CA for short) is a tool for real-time performance analysis of production systems and applications deployed in the cloud. If you’ve got a Node SmartMachine already, you can log into your account and start using CA right now.
Tonight at 6: Solving Big Problems (with Cloud Analytics)
Here’s just a quick note to say that Joyent is hosting a talk tonight called “Solving Big Problems with Real-time Visibility into Cloud Performance.” Bryan and Brendan will be showing off our Cloud Analytics tools and discussing how we’ve used them already to analyze actual customer performance problems in real time. If you can’t make the live talk, check out the live streamcast.
Joining Joyent
It’s been almost two months since I announced my departure from Oracle, but I didn’t say what I was doing next. After taking several weeks off to decompress, I’ve joined Joyent to work with Bryan and Brendan on tackling the problem of observability in the cloud. We’re working on a project we call Cloud Analytics, which will provide an interface for dynamically instrumenting systems in the cloud and visualizing the results immediately for both real-time performance analysis and long term capacity planning.
Leaving Oracle
I first came to Sun over 4 years ago for an internship in the Solaris kernel group. I was excited to work with such a highly regarded group of talented engineers, and my experience that summer was no disappointment: I learned a lot and had a blast. After college, I joined Sun’s Fishworks team. Despite my previous experience at Sun, I didn’t really know what to expect here, and I certainly didn’t imagine then where I’d be now and how much I’d value the experiences of the last three years.
SS7000 Software Updates
In this entry I’ll explain some of the underlying principles around software upgrade for the 7000 series. Keep in mind that nearly all of this is implementation detail and thus subject to change. Entire system image: One of the fundamental design principles of SS7000 software updates is that every release updates the entire system, no matter how small the underlying software change. Releases never update individual components of the system separately.
Replication for disaster recovery
When designing a disaster recovery solution using remote replication, two important parameters are the recovery time objective (RTO) and the recovery point objective (RPO). For these purposes, a disaster is any event resulting in permanent data loss at the primary site which requires restoring service using data recovered from the disaster recovery (DR) site. The RTO is how soon service must be restored after a disaster. Designing a DR solution to meet a specified RTO is complex but essentially boils down to ensuring that the recovery plan can be executed within the allotted time.
Another detour: short-circuiting cat(1)
What do you think happens when you do this:

# cat vmcore.4 > /dev/null

If you’ve used Unix systems before, you might expect this to read vmcore.4 into memory and do nothing with it, since cat(1) reads a file, and “> /dev/null” sends it to the null driver, which accepts data and does nothing. This appears pointless, but can actually be useful to bring a file into memory, for example, or to evict other files from memory (if this file is larger than total cache size).
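If you want to watch what cat(1) actually does here, truss is one way to peek; its trace goes to stderr, so the redirection above doesn’t hide it:

$ truss cat vmcore.4 > /dev/null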
A ZFS Home Server
This entry will be a departure from my recent focus on the 7000 series to explain how I replaced my recently-dead Sun Ultra 20 with a home-built ZFS workstation/NAS server. I hope others considering building a home server can benefit from this experience. Of course, if you’re considering building a home server, it pays to think through your goals, constraints, and priorities, and then develop a plan. Goals: I use my desktop as a general purpose home computer, as a development workstation for working from home and on other hobby projects, and as a NAS file server.
Replication in 2010.Q1
This post is long overdue since 2010.Q1 came out over a month ago now, but it’s better late than never. The bullet-point feature list for 2010.Q1 typically includes something like “improved remote replication”, but what do we mean by that? The summary is vague because, well, it’s hard to summarize what we did concisely. Let’s break it down. Improved stability: we’ve rewritten the replication management subsystem. Informed by the shortcomings of its predecessor, the new design avoids large classes of problems that were customer pain points in older releases.
Remote Replication Introduction
When we first announced the SS7000 series, we made available a simulator (a virtual machine image) so people could easily try out the new software. At a keynote session that evening, Bryan and Mike challenged audience members to be the first to set up remote replication between two simulators. They didn’t realize how quickly someone would take them up on that. Having worked on this feature, it was very satisfying to see it all come together as a new user easily set up replication for the first time.
Threshold alerts
I’ve previously discussed alerts in the context of fault management. On the 7000, alerts are just events of interest to administrators, like “disk failure” or “backup completion” for example. Administrators can configure actions that the system takes in response to particular classes of alert. Common actions include sending email (for notification) and sending an SNMP trap (for integration into environments which use SNMP for monitoring systems across a datacenter). A simple alert configuration might be “send mail to admin@mydomain when a disk fails.
Anatomy of a DTrace USDT provider
Note: the information in this post has been updated with a complete source example. This post remains for historical reference. I’ve previously mentioned that the 7000 series HTTP/WebDAV Analytics feature relies on USDT, the mechanism by which developers can define application-specific DTrace probes to provide stable points of observation for debugging or analysis. Many projects already use USDT, including the Firefox JavaScript engine, MySQL, Python, Perl, and Ruby. But writing a USDT provider is (necessarily) somewhat complicated, and documentation is sparse.
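To make the shape of the thing concrete, here’s a minimal sketch of the pieces involved (provider and probe names are made up): the provider definition lives in a D file, and dtrace -h generates a header with the probe macros your application code calls.

$ cat myapp_provider.d
provider myapp {
    probe request__start(char *);
    probe request__done(char *, int);
};
$ dtrace -h -s myapp_provider.d    # emits myapp_provider.h

At build time you also run dtrace -G against your compiled object files to produce the provider object that gets linked into the application.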
2009.Q2 Released
Today we’ve released the first major software update for the 7000 series, called 2009.Q2. This update contains a boatload of bugfixes and new features, including support for HTTPS (HTTP with SSL using self-signed certificates). This makes HTTP user support more tenable in less secure environments because credentials don’t have to be transmitted in the clear. Another updated feature that’s important for RAS is enhanced support bundles. Support bundles are tarballs containing core files, log files, and other debugging output generated by the appliance that can be sent directly to support personnel.
Compression followup
My previous post discussed compression in the 7000 series. I presented some Analytics data showing the effects of compression on a simple workload, but I observed something unexpected: the system never used more than 50% CPU doing the workloads, even when the workload was CPU-bound. This caused the CPU-intensive runs to take a fair bit longer than expected. This happened because ZFS uses at most 8 threads for processing writes through the ZIO pipeline.
Compression on the Sun Storage 7000
Built-in filesystem compression has been part of ZFS since day one, but is only now gaining some enterprise storage spotlight. Compression reduces the disk space needed to store data, not only increasing effective capacity but often improving performance as well (since fewer bytes means less I/O). Beyond that, having compression built into the filesystem (as opposed to using an external appliance between your storage and your clients to do compression, for example) simplifies the management of an already complicated storage architecture.
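For anyone who hasn’t tried it on a plain ZFS system, turning compression on is a one-liner (the dataset name is just an example; the appliance exposes the same choice through its share settings):

$ zfs set compression=gzip tank/data
$ zfs get compression,compressratio tank/data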
Fault management
The Fishworks storage appliance stands on the shoulders of giants. Many of the most exciting features – Analytics, the hybrid storage pool, and integrated fault management, for example – are built upon existing technologies in OpenSolaris (DTrace, ZFS, and FMA, respectively). The first two of these have been covered extensively elsewhere, but I’d like to discuss our integrated fault management, a central piece of our RAS (reliability/availability/serviceability) architecture. Let’s start with a concrete example: suppose hard disk #4 is acting up in your new 7000 series server.
HTTP/WebDAV Analytics
Mike calls Analytics the killer app of the 7000 series NAS appliances. Indeed, this feature enables administrators to quickly understand what’s happening on their systems in unprecedented depth. Most of the interesting Analytics data comes from DTrace providers built into Solaris. For example, the iSCSI data are gathered by the existing iSCSI provider, which allows users to drill down on iSCSI operations by client. We’ve got analogous providers for NFS and CIFS, too, which incorporate the richer information we have for those file-level protocols (including file name, user name, etc.
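Under the hood these are ordinary DTrace providers. As a flavor of the kind of data involved, something along these lines (using the stock nfsv3 provider) counts NFSv3 reads by client address:

$ dtrace -n 'nfsv3:::op-read-start { @[args[0]->ci_remote] = count(); }'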
User support for HTTP
In building the 7000 series of NAS appliances, we strove to create a solid storage product that’s revolutionary both in features and price/performance. This process frequently entailed rethinking old problems and finding new solutions that challenge the limitations of previous ones. Bill has a great example in making CIFS (SMB) a first-class data protocol on the appliance, from our management interface down to the kernel itself. I’ll discuss here the ways in which we’ve enhanced support for HTTP/WebDAV sharing, particularly as it coexists with other protocols like NFS, CIFS, and FTP.
Back in the Sun
I loved my internship so much, I’m now back at Sun full-time. I joined last July, and I’m hoping to start blogging again about some of the interesting work I’ve been doing since then.
Don't forget about /dev/poll
When I posted my first benchmark results, I did not realize that the limit on /dev/poll’s file descriptors is a soft one. Using setrlimit(2), you can change the limit (in my case from about 250 to over 64,000). With that, I’ve run some new benchmarks. These are the same as before, but now I’m running them with /dev/poll as well. The bottom line is that, as expected, it scales very well (similarly to event ports).
Event ports and performance
So lots of people have been talking about event ports. They were designed to solve the problem with poll(2) and lots of file descriptors. The goal is to scale with the number of actual events of interest rather than the number of file descriptors one is listening on, since the former is often much less than the latter. To make this a little more concrete, consider your giant new server which maintains a thousand connections, only about a hundred of which are active at any given time.
libevent and Solaris event ports
For those who dwell in subterranean shelters, event ports (or the “event completion framework,” if you want) are the slick new way to deal with events from various sources, including file descriptors, asynchronous i/o, other user processes, etc. Adam Leventhal even thinks they’re the 20th best thing about Solaris 10. Okay, that’s not huge, but competing with zfs, dtrace, and openness, that ain’t half bad. Though the interface is, of course, well-defined, the framework is still evolving.
Dazed and confused
Today I started my summer internship in the Solaris kernel group at Sun Microsystems. I’ll figure out something interesting to put here.