Physical memory on illumos systems
I’ve long been confused by physical memory management and memory pressure in the operating system. After running up against it several times over the years without diving in, I took a recent opportunity to better understand it. I wrote this post for my own future reference with hopes that it might also be useful for somebody else. I’m sure I’ve made some errors here and I welcome any corrections. I’ve done my best to cite references here using [SI], [IL], [SPT], etc. See the end for what these refer to.
Big Four Consumers of physical memory
(Ref: SI Ch 10, p.522, p.524, SPT p.144-146, SP p.271)
On illumos systems, you can attribute most physical memory usage to one of these Big Four Consumers:
- Kernel memory, including the kernel heap (kmem), thread stacks, and other chunks of memory. I use "kmem" for short, but this bucket includes other stuff.
- ZFS ARC (ZFS’s read cache)
- Page cache, including non-ZFS file data and ZFS file data for mmap’ed files
- Swap. I’ll cover this in detail in a follow-up post. Swap includes user thread stacks, user heaps, all tmpfs data, and lots of other bits of data. This is kind of a general-purpose pool of memory.
Memory that’s unused is called (of course) free memory.[1]

Now, these are just consumers of physical memory — they’re not contiguous blocks and they don’t have fixed bounds. They grow and shrink over time.[2]
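A minimal way to see a point-in-time breakdown along roughly these lines is the ::memstat dcmd (which comes up again at the end of this post), which walks every physical page and attributes it to a subsystem:

```sh
# Summarize physical memory by consumer (requires privileges to read kernel
# memory).  The categories ::memstat reports (kernel, ZFS, anonymous memory,
# page cache, free, and so on) map roughly, though not one-to-one, onto the
# Big Four above.
mdb -ke "::memstat"
```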
Utilization, saturation, memory pressure
Utilization of physical memory is tricky. You can say what percentage of memory is free or not-free. But it’s possible for memory to be 95% utilized (in that only 5% is free) and yet for a program to successfully allocate another 20% of physical memory. This works because when the system gets low on free memory, it can often free some up from elsewhere. How does that work?
That’s where the idea of memory pressure comes in. We say that a system is experiencing memory pressure when there’s a demand for physical memory that exceeds what’s immediately available and the system tries to reclaim memory from existing consumers. Memory pressure thus reflects saturation of physical memory.
How does the system know when to try to reclaim memory? The kernel maintains a count of available physical memory pages called freemem. It also defines a number of thresholds that are used to determine how low on memory the system is (minfree, lotsfree, throttlefree, desfree, and others). The system periodically checks freemem against these thresholds and kicks off the mechanisms described below to free up memory. (See this block comment (which postdates most of my digging for this post) and SI p.522.)
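These values are observable: as noted below, they’re exported through the unix:system_pages kstat. Here’s a quick sketch (field names here are the common ones; not every threshold necessarily appears in every illumos version):

```sh
# Print freemem and the main paging thresholds, in units of pages.
kstat -p unix:0:system_pages | egrep 'freemem|lotsfree|desfree|minfree'
```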
Each of the Big Consumers has a different way of responding to memory pressure:[3]
- The kernel heap (kmem) supports kmem_reap, which attempts to free memory that’s not currently needed so that it can be used for future allocations. kmem_reap() is invoked from various places in the kernel where a piece of code believes memory is low. This is usually based on freemem, but can also happen because some allocation has just failed. See callers of kmem_reap() (IL).
- Like kmem, the ZFS ARC provides a reap interface. This causes the ARC to reduce its target size, evict some of the data from the cache, and free up memory. Every few seconds, the ARC checks whether free memory is low enough to kick off an ARC reap. See arc_reap_cb_check() and its users (IL).
- The "pageout" process frees memory by taking pages from the page cache and from swap space. Data from the page cache may be dirty, in which case it needs to be written to disk before the page can be freed. Swap space pages must always be written to the swap device. pageout kicks off when free memory is too low. See pageout() (IL).


Shrinking the ARC, paging, and swapping happen in their own threads whenever the system decides to do them. They don’t usually hold up a particular allocation. kmem_reap() also usually runs in its own thread, but some users of kmem that really want memory will wait for kmem_reap() to complete so they can try their allocation again.
Remember: with the debatable exception of swap, the Big Consumers are not themselves independent resources. They all allocate from the same pool of physical pages. When you run low on physical memory, all of these consumers are affected, and typically all of them are tapped to free up memory (but in different, ad hoc ways).
Desperate measures
There’s another mechanism for freeing memory that doesn’t correspond to one of these consumers: swapping is a process where the kernel takes entire processes and threads and saves their state to the swap device. The system only does this when it’s desperate: free memory must be quite low for some time. See sched.c and sched() (IL). Do not confuse this "swapping" with "paging" (which is what pageout does, above). There’s also "hard swapping", which refers to even more desperate measures like unloading unused kernel modules.[4]
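One way to catch this kind of swapping in the act is the DTrace vminfo provider, whose probes correspond to the per-CPU "vm" kstats mentioned below. This is a sketch: the swapout probe name comes from the vminfo provider’s documentation, not from anything in this post.

```sh
# Fire once per process swap-out, aggregating by process name until Ctrl-C.
dtrace -n 'vminfo:::swapout { @[execname] = count(); }'
```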
There’s one more mechanism worth mentioning: when the system is quite low on memory, it throttles sleeping allocations, forcing them to wait until more memory is available. See page_create_throttle() (IL). (See also: SPT 144-146.) This is one way the system can become unresponsive when low on memory.
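Given that page_create_throttle() is where this throttling happens, a hedged way to check whether allocations are being throttled is an fbt probe on it (fbt probes track kernel function names, so this depends on the kernel version and on the function not being inlined):

```sh
# Count entries into the allocation throttle until Ctrl-C.  Any count at all
# means the system was low enough on memory to make would-be allocators wait.
dtrace -n 'fbt::page_create_throttle:entry { @["throttled allocations"] = count(); }'
```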
Observing memory pressure
It’s important to monitor the system’s response to memory pressure for two reasons: (1) as I said above, it’s hard to monitor memory utilization, so these responses are your primary indication that the system is suffering for lack of memory; and (2) these responses can cause the system to slow to a crawl. Paging and swapping both involve I/O and require more I/O later when the paged-out data is needed again. The reap operations are supposed to be background activities, but they have been known to cause latency bubbles in pathological cases.
vmstat 1 is the easiest way to observe evidence of memory pressure:
- The "sr" column indicates pages scanned by the pageout process. A non-zero number here means pageout is running, so there’s been some memory pressure recently.
- The third column ("kthr w") reflects the number of threads swapped out. If this is non-zero, things have gotten very bad at some point in the past.
- The "free" column indicates the amount of free memory. This is the trigger for most of the mechanisms described above, but to interpret it, you need to know the values of the various thresholds, and you need to know more about the policies than I’m including here. You can find this information in the block comment I linked above or the table I referenced in SPT. It’s probably easier to use the other indicators.
There are lots of other indicators:
- I/O to the swap device reflects physical memory pressure, since it means that pageout is working to free up memory. You can see this with iostat or zpool iostat.
- Kmem reaping bumps a cache-specific "reap" counter. You can list all of these with kstat -c kmem_cache -p :::reap. You can trace kmem_cache_reap() to see reaps live.
- The ZFS ARC provides kstats that include the current size (size), the target size (c), and information about evictions. You can see these with kstat -m zfs -n arcstats. You can trace this activity by tracing arc_reap_cb().
- Pageout activity is reported in the per-CPU "vm" kstats, which you can see with kstat -m cpu -n vm. See above about viewing this with vmstat. You can see more specific paging information using vmstat -p 1, which will show you what kinds of pages are being paged out and in.
- The specific values of freemem and the thresholds that govern when the system activates these various mechanisms are available in the kstat -m unix -n system_pages kstat. See the block comment linked above or the referenced table in SPT to understand which mechanisms are activated at which thresholds.
- As mentioned above, when the system actually swaps out processes and threads, the number of threads swapped out is reported as the third column in the vmstat 1 output. You can see the kstat behind this with kstat ::sysinfo:swpque.
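Several of these bullets suggest tracing the reap functions live. Here’s one way to do that with DTrace, combining the function names mentioned above (fbt probes depend on the kernel version and on the functions not being inlined, so treat this as a sketch):

```sh
# Report kmem cache and ARC reap activity every ten seconds.
dtrace -n '
    fbt::kmem_cache_reap:entry { @["kmem cache reaps"] = count(); }
    fbt::arc_reap_cb:entry { @["ARC reaps"] = count(); }
    tick-10s { printa(@); trunc(@); }'
```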
To summarize:

| Consumer | Indicator of utilization | Indicator of memory pressure (real-time) | Indicator of memory pressure (post hoc) |
|---|---|---|---|
| kmem | mdb -ke "::memstat" | trace kmem_cache_reap() | kstat -c kmem_cache -p :::reap |
| ZFS ARC | kstat -m zfs -n arcstats | trace arc_reap_cb() | See ARC kmem cache kstats |
| Page cache, swap (anonymous memory) | mdb -ke "::memstat" | vmstat 1 ("sr" column); swap device I/O | kstat -m cpu -n vm; vmstat -p |
| Swapper ("memory scheduler") | N/A | — | vmstat 1 ("kthr w" column); kstat ::sysinfo:swpque |
What can you do when you observe the system responding to memory pressure? The main thing is to try reducing the demand for memory. If you want to see who the big users of memory are (which might reflect what’s responsible for the memory pressure), the only way I know to summarize usage is with mdb -ke "::memstat".
References
The information here is synthesized from several different sources, mostly:
- [SI] Solaris Internals, Second Edition. Richard McDougall and Jim Mauro. Prentice Hall, 2007.
- [SPT] Solaris Performance and Tools. Richard McDougall, Jim Mauro, and Brendan Gregg. Prentice Hall, 2007.
- [SP] Systems Performance. Brendan Gregg. Prentice Hall, 2014.
- [IL] illumos-gate source code. https://github.com/illumos/illumos-gate/tree/dca2b4edbfdf8b595d031a5a2e7eaf98738f5055. This commit (from the "master" branch) is pretty old as of the time of posting, but I expect the pieces I’ve linked to haven’t changed substantively.
This post describes relatively modern versions of the illumos operating system, though I compiled this information around 2019. The big picture hasn’t changed significantly in a long time.
Where feasible I’ve put specific references in the text, but in most cases none of these sources concisely explains the whole picture.
Next stop: swap
In part two, I discuss swap and how it relates to physical memory.
Appendix: why is learning this so hard?
This section is probably not useful, but I found it helpful to write down why learning this was so hard.
I have not found a manual page or book that explains any of this stuff for a user of the system.
The terminology around physical memory is confusing. Although we didn’t talk about segments, they’re an essential part of the implementation. But "segment" can refer to:
- A region of virtual address space that has been mapped to a particular resource. In user processes, that resource is usually a file or anonymous memory. Segments are created by a particular segment driver.
- A segment driver, a kernel facility (sort of like a class, in the object-oriented sense) that implements a particular type of mapping. For example, seg_vn is a segment driver that implements mappings for files and anonymous memory. The segment drivers all have names that start with seg, which might make you think they’re the names of segments, but they’re not: "segment" (see above) refers to a chunk of the address space that’s mapped using one of these segment drivers.
- x86 segmentation, which as far as I can tell has nothing to do with any of this.
"Page" and "paging" can refer to:
-
The x86 architectural feature (e.g., physical pages, page tables, etc.)
-
Writing filesystem or swap data to disk (paging out) or reading filesystem or swap data from disk (paging in)
It’s confusing that these subsystems all draw from the same pool of physical memory and the kernel doesn’t even keep track of how big each one is. Physical memory is the resource that you can run out of. When you do, you’re running out of all of these things. But they’re neither implemented nor documented as a coherent unit: you kind of need to learn separately about pageout, the "memory scheduler", kmem reaping, the ARC, and how availrmem is managed to piece the big picture together.
Even within the source, there are at least a dozen different critical global variables, few of which actually have a clear description of what they’re intended to represent, what sorts of operations might change them, etc. There are sometimes dozens or even hundreds of places that modify them directly. Even if you just want to know what their default values are, that’s non-trivial — they’re initialized and then adjusted in several scattered places.
(The ::memstat dcmd mentioned earlier works by iterating all pages and looking at which subsystem each one is assigned to; the kernel doesn’t maintain those counts directly.)