Memory
Table of Contents
Kernel memory handling1
Kernel memory allocation
- Buddy system
The kernel uses a buddy system with power-of-two sizes. For order 0, 1, 2, …, 9 it has lists of areas containing 2order pages. If a small area is needed and only a larger area is available, the larger area is split into two halves (buddies), possibly repeatedly.
- getfreepage
The routine
__get_free_page()
will give us a page. The routine__get_free_pages()
will give a number of consecutive pages. (A power of two, from 1 to 512 or so. The above buddy system is used.) - kmalloc
The routine kmalloc() is good for an area of unknown, arbitrary, smallish length, in the range 32-131072 (more precisely: 1/128 of a page up to 32 pages), preferably below 4096. For the sizes, see
<linux/kmalloc_sizes.h>
. Because of fragmentation, it will be difficult to get large consecutive areas from kmalloc(). These days kmalloc() returns memory from one of a series of slab caches (see below) with names like "size-32", …, "size-131072". - Priority
there is a bit specifying whether we would like a hot or a cold page (that is, a page likely to be in the CPU cache, or a page not likely to be there). If the page will be used by the CPU, a hot page will be faster. If the page will be used for device DMA the CPU cache would be invalidated anyway, and a cold page does not waste precious cache contents.
//linux/gfp.h /* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low four bits) */ #define __GFP_DMA 0x01 /* Action modifiers - doesn't change the zoning */ #define __GFP_WAIT 0x10 /* Can wait and reschedule? */ #define __GFP_HIGH 0x20 /* Should access emergency pools? */ #define __GFP_IO 0x40 /* Can start low memory physical IO? */ #define __GFP_FS 0x100 /* Can call down to low-level FS? */ #define __GFP_COLD 0x200 /* Cache-cold page required */ #define GFP_NOIO (__GFP_WAIT) #define GFP_NOFS (__GFP_WAIT | __GFP_IO ) #define GFP_ATOMIC (__GFP_HIGH) #define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS) #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
Uses:
GFP_KERNEL
is the default flag. Sleeping is allowed.GFP_ATOMIC
is used in interrupt handlers. Never sleeps.GFP_USER
for user mode allocations. Low priority, and may sleep. (Today equal to GFPKERNEL.)GFP_NOIO
must not call down to drivers (since it is used from drivers).GFP_NOFS
must not call down to filesystems (since it is used from filesystems – see, e.g., dcache.c:shrink_dcache_memory
and inode.c:shrink_icache_memory
).
- vmalloc
The routine vmalloc() has a similar purpose, but has a better chance of being able to return larger consecutive areas, and is more expensive. It uses page table manipulation to create an area of memory that is consecutive in virtual memory, but not necessarily in physical memory. Device I/O to such an area is a bad idea. It uses the above calls with
GFP_KERNEL
to get its memory, so cannot be used in interrupt context.
Turning off overcommit
Since 2.5.30 the values are: 0 (default): as before: guess about how much overcommitment is reasonable, 1: never refuse any malloc(), 2: be precise about the overcommit - never commit a virtual address space larger than swap space plus a fraction overcommitratio of the physical memory. Here /proc/sys/vm/overcommitratio (by default 50) is another user-settable parameter. It is possible to set overcommitratio to values larger than 100.
Stack overflow
A simple demo that catches SIGSEGV:
#include <stdio.h> #include <stdlib.h> #include <signal.h> void segfault(int dummy) { printf("Help!\n"); exit(1); } int main() { int *p = 0; signal(SIGSEGV, segfault); *p = 17; return 0; }
Without the exit() here, this demo will loop because the illegal assignment is restarted. This simple demo fails to catch stack overflow, because there is no stack space for a call frame for the segfault() interrupt handler. If it is desired to catch stack overflow one first must set up an alternative stack. As follows:
... int main() { char myaltstack[SIGSTKSZ]; struct sigaction act; stack_t ss; ss.ss_sp = myaltstack; ss.ss_size = sizeof(myaltstack); ss.ss_flags = 0; if (sigaltstack(&ss, NULL)) errexit("sigaltstack failed"); act.sa_handler = segfault; act.sa_flags = SA_ONSTACK; if (sigaction(SIGSEGV, &act, NULL)) errexit("sigaction failed"); ...
The Linux Page Cache and pdflush:2
writes data out
Linux usually writes data out of the page cache using a process called
pdflush. At any moment, between 2 and 8 pdflush threads are running on
the system. You can monitor how many are active by looking at
/proc/sys/vm/nr_pdflush_threads
. Whenever all existing pdflush threads
are busy for at least one second, an additional pdflush daemon is
spawned. The new ones try to write back data to device queues that are
not congested, aiming to have each device that's active get its own
thread flushing data to that device. Each time a second has passed
without any pdflush activity, one of the threads is removed. There are
tunables for adjusting the minimum and maximum number of pdflush
processes, but it's very rare they need to be adjusted.
pdflush tunables
Exactly what each pdflush thread does is controlled by a series of parameters in /proc/sys/vm:
/proc/sys/vm/dirty_writeback_centisecs
(default 500): In hundredths of a second, this is how often pdflush wakes up to write data to disk. The default wakes up the two (or more) active threads every five seconds.
Because of all this, it's unlikely you'll gain much benefit from lowering the writeback time; the thread spawning code assures that they will automatically run themselves as often as is practical to try and meet the other requirements.
/proc/sys/vm/dirty_expire_centiseconds
The first thing pdflush works on is writing pages that have been dirty for longer than it deems acceptable.
(default 3000): In hundredths of a second, how long data can be in the page cache before it's considered expired and must be written at the next opportunity. Note that this default is very long: a full 30 seconds. That means that under normal circumstances, unless you write enough to trigger the other pdflush method, Linux won't actually commit anything you write until 30 seconds later.
/proc/sys/vm/dirty_background_ratio
(default 10): Maximum percentage of active that can be filled with dirty pages before pdflush begins to write them
Note that some kernel versions may internally put a lower bound on this value at 5%.
Most of the documentation you'll find about this parameter suggests it's in terms of total memory, but a look at the source code shows this isn't true. In terms of the meminfo output, the code actually looks at
MemFree + Cached - Mapped
-- Summary: when does pdflush write?
In the default configuration, then, data written to disk will sit in memory until either a) they're more than 30 seconds old, or b) the dirty pages have consumed more than 10% of the active, working memory. If you are writing heavily, once you reach the dirtybackgroundratio driven figure worth of dirty memory, you may find that all your writes are driven by that limit.
/proc/sys/vm/dirty_ratio
/proc/sys/vm/dirtyratio (default 40): Maximum percentage of total memory that can be filled with dirty pages before processes are forced to write dirty buffers themselves during their time slice instead of being allowed to do more writes.
Reference
- Understanding the Linux Virtual Memory Manager By Mel Gorman