All about swap on illumos systems

In the previous post, I gave an overview of the Big Four Consumers of physical memory on illumos systems. I also explained why it’s nebulous to talk about utilization of physical memory, why we talk about "memory pressure" instead, and how the Big Four Consumers each respond when the system runs low on memory. In this post, I’ll talk in depth about "swap". As we’ll see, swap is closely related to physical memory.

The references in this post ("[IL]", "[SPT]", "[SP]", etc.) are the same as the previous post and described again under "References" below.

What is swap?

(Ref: SPT p. 164-169)

Swap can be confusing because the term "swap" is used to refer to a bunch of distinct (but related) things. I’ve chosen more specific terminology for this post. I’ll say that swap space is a synthetic resource that’s used to manage general-purpose memory on the system. I’ll call the data that’s stored there swap data. This includes user thread stacks, user heaps, private filesystem mappings, tmpfs, and lots more. Outside this post, "swap" might refer to either of these things, both of them, or the kernel subsystem that manages swap space, or the memory scheduler (that swaps out entire threads and processes), or "hard swapping" (the most extreme response to memory pressure), or even the Linux analog of what illumos calls paging! "Anonymous memory" is also sometimes used to mean what I’m calling "swap space" or "swap data".

By "synthetic", I mean that swap space is a made-up resource: the system defines a number that I’m calling the total swap space and it enforces a rule that allocations of memory pages (outside most of the Big Consumers I discussed in the last post) allocate from this pool of swap space. But it’s not a specific area of memory, it may exceed physical memory, and it can change over time!

How is the total swap space determined? It’s (1) the space on all swap devices, plus (2) some amount of physical memory. How much physical memory? It’s basically whatever’s left over after all the other Big Consumers (like ZFS and kmem) get their allocations. As a result, this varies over time: as these other subsystems use more memory, total swap space actually shrinks! As I’ll explain later, this definition allows the system to ensure that there are always pages of physical memory available when they’re needed for swap data — without requiring that each allocation from swap space allocate physical pages, too.

A simple, somewhat wrong model of swap space

I’d like to say that swap is a simple resource: there’s a total size (even if it varies over time), an amount utilized, and you can allocate right up until 100% utilization, after which you can’t allocate any more. That’s … kind of true.

Let’s work through an example. Suppose the system has 1024 MiB of total swap, of which 768 MiB is currently utilized. A program uses mmap to create a 128 MiB copy-on-write mapping (e.g., a writable, private mapping of /dev/zero). This kind of mapping requires allocating a corresponding amount of swap space. There’s enough swap space for this, so the allocation succeeds, and so the mmap succeeds. Now 896 MiB of swap are used and 128 MiB are available. Another program uses mmap to create a 256 MiB copy-on-write mapping. This mapping needs 256 MiB of swap space, but only 128 MiB are available. The allocation fails. The mmap fails. Simple enough!
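Here’s a minimal sketch of the first mapping in C, assuming an anonymous mapping (MAP_ANON) in place of mapping /dev/zero; the two are equivalent for this purpose. Creating the mapping reserves swap space up front, so a failed reservation shows up as an explicit ENOMEM from mmap:

    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        size_t sz = 128 * 1024 * 1024;  /* 128 MiB, as in the example */

        /*
         * A writable, private, anonymous mapping is copy-on-write:
         * it reserves 128 MiB of swap space now, but no physical
         * pages are allocated until the pages are first touched.
         */
        void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED) {
            /* errno == ENOMEM: the swap reservation failed. */
            perror("mmap");
            exit(1);
        }

        printf("reserved %zu bytes of swap space\n", sz);
        return (0);
    }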

Some inconvenient truths

So far, this sounds much simpler than physical memory, where allocations rarely fail outright: instead, they typically get slower and slower as the system frantically tries to find memory. It still matters when stuff gets paged out because things get slow, but that’s really a problem of physical memory saturation, not swap space.

The main problem with this model is that in practice, unless you’ve set a cap on swap space, running out of swap by definition means you’re also running out of physical memory (since swap is allowed to consume all unused physical memory). So you do still find that swap allocations get slower. Non-global zones can have their own caps on swap space, and in those cases I expect you don’t have to worry about swap space allocations getting slower as you run low on swap space.

Another complexity relates to the fact that the system allows swap space to be allocated "noreserve" — that is, without reducing available swap space. User thread stacks and MAP_NORESERVE mappings fall into this bucket. In the terminology of this post: for noreserve mappings, the system allocates a resource for some swap data, but neither total swap nor available swap changes. Just as with normal reserved mappings, the physical pages are not allocated right away. If and when the physical pages are allocated later, total swap and available swap remain unaffected. I’ll discuss some consequences of this later.
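For comparison with the reserved mapping above, here’s a sketch of a noreserve mapping. The only difference is the MAP_NORESERVE flag, but the failure mode moves from mmap itself to the first touch of each page:

    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        size_t sz = 256 * 1024 * 1024;  /* arbitrary size */

        /*
         * MAP_NORESERVE: creating this mapping does not reduce
         * available swap space.  The trade-off is that if the system
         * can't find memory when a page is first written, the
         * failure arrives as a signal (e.g., SIGBUS or SIGSEGV),
         * not as an error return that the program can handle.
         */
        void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        printf("created noreserve mapping at %p\n", p);
        return (0);
    }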

Why do we have swap space, anyway?

Muddling through these details, it’s easy to forget why swap space exists in the first place and why it works the way it does. If we didn’t have swap space, but we allocated physical memory pages the way we do today, we’d have a different problem: what if we’re out of memory when it’s time to allocate pages for some piece of swap data? That can happen at pretty inconvenient points like during a push onto the stack (e.g., calling any function). There’s not a great way to communicate this failure to a program at that point. And there’s likely nothing the program could do about the problem in that context. Instead, the system forces programs to reserve swap space ahead of time, during an operation that can fail explicitly (e.g., malloc(), which can return an error). This way, the system can clearly communicate when it can’t provide the memory. And we can know that when it’s time to allocate pages for swap data, we’ll have the space.
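In code, the contrast looks something like this sketch (grow_buffer is a hypothetical helper, not anything from illumos):

    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Because swap space is reserved up front, the failure surfaces
     * here, where the program can report it or shed load -- not later,
     * on some arbitrary store instruction (or implicit stack growth),
     * where the system's only recourse is a fatal signal.
     */
    void *
    grow_buffer(size_t nbytes)
    {
        void *p = malloc(nbytes);   /* reserves swap space */
        if (p == NULL) {
            fprintf(stderr, "grow_buffer: out of swap space\n");
            return (NULL);
        }
        return (p);
    }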

Additionally, defining swap the way we have essentially allows administrators to cap the amount of swap data that can be created, which effectively caps the system’s memory oversubscription ratio.

An alternative would be to allocate physical memory pages as soon as swap space is allocated. This approach would mean a lot of operations today would have to do a lot more work (e.g., fork() would have to copy all the parent process’s pages into the child process) and use a lot more memory (because you’d have multiple copies of the same pages).

Swap space gives us the best of all worlds.

There remains the question about those NORESERVE mappings I mentioned earlier. When such a mapping is created, neither total nor available swap space changes. Is this a problem? In an extreme example, imagine program A has a regular, reserved mapping (e.g., a writable, private filesystem mapping) that it hasn’t touched yet. Then program B creates and uses NORESERVE mappings that allocate all available physical memory. Now A tries to write to its mapping. Might there be no memory available to fill it? At one point I thought this would be a problem. Keith Wesolowski pointed out that for B to have pages allocated for its NORESERVE mappings, they must first have been reserved (more on this below). That means there must have been swap space available even accounting for A’s reservation. So there should still be memory that can be used for A.

Given that, what’s the purpose of MAP_NORESERVE? I’m honestly not sure. It seems like a sort of second-class consumer of physical memory: it gets allocated from swap when possible, but for whatever reason the consumer is able to deal with the case where that’s not possible. (That’s not to say that it’s dealt with all that gracefully: it could be that this was a user thread stack, and now the process gets a fatal signal when it tries to use it.)

To summarize: swap space is like memory, but it’s not memory. It’s a synthetic resource that exists to help manage memory in the system. Whether any particular hunk of memory is considered swap data, and if so, whether it reserves swap space up front is essentially a policy choice that’s up to the programmer.

Swap devices

When people hear about swap, they often think about memory being paged out to disk, and something about being able to use disk space to pretend like you have more physical memory than you really do. But in our discussion of swap space, I’ve barely mentioned disks or swap devices! These aren’t relevant for thinking about swap space as a resource or understanding its utilization or saturation. Everything I said so far holds even without a swap device. Swap devices are relevant for understanding how the system decides how much total swap space there is and how the system responds to memory pressure.

Diagram: total swap space

I said earlier that total space available for swap includes some amount of physical memory plus any swap devices (disks) configured. I also said the amount of physical memory is essentially what’s left over after other Big Consumers allocate their memory, and that this changes over time. Suppose you have a system with 16 GiB of physical memory and no swap device. At boot, nearly all of memory is free, and most of it might comprise total swap space. As the system does work, the kernel heap and the ARC grow, and total swap space goes down (not just available swap space!). As programs reserve pages of swap space, the amount of available swap goes down, too. It doesn’t matter when those pages are actually allocated, because the whole point is to avoid a case where we don’t have a single page of physical memory to back swap data when we need it.

That said, if programs don’t wind up allocating pages for all that memory (e.g., if they create a private file mapping but don’t actually write to every single page of the mapping), then the corresponding number of physical pages will remain unused. That’s inefficient, though maybe not the end of the world if you prefer predictable performance. But if every page of swap space does get allocated, you have a different problem: the system won’t be able to shrink the amount of memory used for swap if it needs that memory for something else, like the kernel heap.

Now we finally get to swap devices: when free memory is low on the system, the pageout process finds allocated pages of swap data, writes their contents to the swap device, and then frees the page. In the example where the kernel needs more heap space on a system that’s running out of memory, free memory drops low enough that pageout starts writing allocated pages of swap data to the swap device, freeing that memory, and those pages might be used the next time the kernel heap needs memory. In this way, total swap space can shrink just like other memory consumers in response to memory pressure. But it can only do this if you have a place to put that memory — a swap device — and it comes at some performance cost.

You might be tempted to run without a swap device, figuring that if the system is paging out swap data, things are already bad, and you want to avoid getting into that situation in the first place. You can do this, but the cost is the inefficient use of physical memory I mentioned above, plus a reduced ability to shrink total swap in response to other demands on memory. In practice, these are kind of big problems, particularly since /tmp, /var/run, and some other paths are usually backed by swap, which makes it easy to accidentally use a lot of it.

Note that having a swap device doesn’t guarantee that all swap data will have a place to go if it needs to be paged out — in fact, since space on swap devices is added to total swap space, and some physical memory is also part of total swap space, total swap space is always bigger than the total space available on swap devices. During pageout, if we run out of space on swap devices, the remaining swap data has to remain resident in physical memory. At this point, that memory is effectively pinned down — and that may well not be the memory you want to be resident.

Swap space accounting

Another confusing thing about swap is that the system (documentation and code) uses similar terminology with different meanings.[1] Swap space starts off available, becomes reserved when a resource is created that might need physical memory later (e.g., a copy-on-write mapping), becomes allocated when the physical pages are actually assigned to it, and becomes used when those pages are written to the swap device and reclaimed for something else. (Ref: SPT p. 165-169)

Diagram: Swap states
Diagram: Swap state example

We largely ignore this decomposition because from a swap consumer’s perspective, it doesn’t matter when pages are allocated or when they’re written to disk: they’re unavailable to other allocations as soon as the swap space is reserved. But these states are useful when you’re looking at swap utilization.

Observing swap utilization and saturation

Let’s define swap utilization as the amount of swap space that’s been reserved. As a user of the system, you want to monitor available swap space (i.e., total swap space minus the utilization) because when the system runs out, programs won’t be able to allocate more memory. You can observe the utilization of swap with vmstat or swap -s (though you’ll need more information, described below, to interpret swap -s output). You can view saturation via various kstats, but saturation generally means a swap reservation failed, and some program will usually report an explicit error when this happens.

Once again, vmstat 1 shows the amount of available swap, which appears to be the amount of unreserved swap. swap -s shows how much swap is allocated, how much is reserved but not yet allocated, and how much is available.
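If you want these numbers programmatically, you can get them from swapctl(2). Here’s a sketch; I believe this is the same interface swap(1M) uses for swap -s, with the anoninfo fields counted in pages, but treat the exact arithmetic as an assumption to verify against usr/src/cmd/swap in [IL]:

    #include <sys/stat.h>
    #include <sys/swap.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct anoninfo ai;
        long pagesize = sysconf(_SC_PAGESIZE);

        /* SC_AINFO reports swap space accounting, in pages. */
        if (swapctl(SC_AINFO, &ai) == -1) {
            perror("swapctl");
            return (1);
        }

        size_t allocated = ai.ani_max - ai.ani_free;
        size_t reserved = ai.ani_resv - allocated;
        size_t available = ai.ani_max - ai.ani_resv;

        printf("allocated: %zu MiB\n",
            allocated * pagesize / (1024 * 1024));
        printf("reserved (not yet allocated): %zu MiB\n",
            reserved * pagesize / (1024 * 1024));
        printf("available: %zu MiB\n",
            available * pagesize / (1024 * 1024));
        return (0);
    }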

When a swap reservation fails, the anon_alloc_fail kstat is bumped. (You might think from the name of this kstat that it reflects a failure to allocate memory to back swap data, but as we said, that usually doesn’t happen, because the memory has already been reserved. "Alloc" in this kstat means an allocation from the perspective of the "total swap space" resource, which is what the swap subsystem calls reservation.) You can see this with kstat -m memory_cap. If this is ever non-zero, you know a swap reservation failed at some point in the past.
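You can read the same counter programmatically with libkstat (compile with -lkstat). A sketch, assuming the memory_cap kstats are named kstats with one instance per zone, as kstat -m memory_cap suggests:

    #include <kstat.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        kstat_ctl_t *kc = kstat_open();
        if (kc == NULL) {
            perror("kstat_open");
            return (1);
        }

        /* Walk the kstat chain, printing each zone's counter. */
        for (kstat_t *ksp = kc->kc_chain; ksp != NULL;
            ksp = ksp->ks_next) {
            if (strcmp(ksp->ks_module, "memory_cap") != 0)
                continue;
            if (kstat_read(kc, ksp, NULL) == -1)
                continue;
            kstat_named_t *kn = kstat_data_lookup(ksp,
                "anon_alloc_fail");
            if (kn != NULL) {
                /* Non-zero: a swap reservation failed. */
                printf("%s: anon_alloc_fail = %llu\n",
                    ksp->ks_name,
                    (u_longlong_t)kn->value.ui64);
            }
        }

        (void) kstat_close(kc);
        return (0);
    }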

I/O to the swap device reflects saturation of physical memory, not swap space, since it means that pageout is working to free up memory.

Observing swap space utilization and saturation

To observe utilization of swap space, use swap -sh. You’ll want to refer to the diagram above under Swap space accounting to understand the breakdown.

To observe saturation of swap space, use kstat -p :::anon_alloc_fail to look for failed swap reservations. (As noted above, I/O to the swap device reflects saturation of physical memory, not swap space.)

Play with it yourself

I wrote a little REPL called swappy that lets you create regions of swap memory, use them, view the swap and physical memory accounting, etc. The README works through some examples.

References

As with the previous post, the information here is synthesized from several different sources, mostly:

  1. [SI] Solaris Internals, Second Edition. Richard McDougall and Jim Mauro. Prentice Hall, 2007.

  2. [SPT] Solaris Performance and Tools. Richard McDougall, Jim Mauro, and Brendan Gregg. Prentice Hall, 2007.

  3. [SP] Systems Performance. Brendan Gregg. Prentice Hall, 2014.

  4. [IL] illumos-gate source code. https://github.com/illumos/illumos-gate/tree/dca2b4edbfdf8b595d031a5a2e7eaf98738f5055. This commit (from the "master" branch) is pretty old as of the time of posting but I expect the pieces I’ve linked to haven’t changed substantively.

This post describes relatively modern versions of the illumos operating system, though I compiled this information around 2019. The big picture hasn’t changed significantly in a long time.

Where feasible I’ve put specific references in the text, but in most cases none of these sources concisely explains the whole picture.

Appendix: why is learning this so hard?

As with physical memory, I have not found a manual page or book that explains most of this for a user of the system.

The terminology around "swap" is incredibly confusing. The term "swap" can mean:

  • "Swap [data]": a type of resource that includes anonymous memory (which includes the stack, heap, and MAP_PRIVATE filesystem mappings), all use of tmpfs, space used to swap out threads and processes, and in the past also included crash dumps. All of these uses require allocating from the pool of available swap. That pool is made up of any physical swap devices plus some amount of physical memory made available for this purpose.

  • "Swap device": a disk device used to store swap data. The system only writes data to the swap device when low on memory. This process is called paging…​except on Linux:

  • Linux uses "swapping" to refer to the process of writing pages of memory (swap data) to the swap device in order to free up that memory. Unix historically called this paging, not swapping. (What Unix calls swapping, Linux doesn’t do at all.)

  • "Swap [a process or thread]": to write process or thread data (not the anonymous memory, but critical information about the process and thread itself) to the swap device in order to free up memory for other uses. The system does this only under pretty desperate low-memory conditions.

More terminology confusion: in the context of swap, "allocation" usually means allocating pages from the pool that was previously reserved. This usually can’t fail. However, the kstat that gets bumped when you fail to reserve swap (not allocate it) is called anon_alloc_fail.


1. This isn’t the system’s fault. It’s decomposing swap space into disjoint states. However, the terminology we’re using is standard for resources, virtual or otherwise. It’s just an unfortunate place where worlds collide.