next up previous contents index
Next: 8.2 Slabs Up: 8. Slab Allocator Previous: 8. Slab Allocator   Contents   Index

Subsections


8.1 Caches

One cache exists for each type of object that is to be cached. For a full list of caches available on a running system, run cat /proc/slabinfo . This file gives some basic information on the caches. A excerpt from the output of this file looks like

slabinfo - version: 1.1 (SMP)
kmem_cache            80     80    248    5    5    1 :  252  126
urb_priv               0      0     64    0    0    1 :  252  126
tcp_bind_bucket       15    226     32    2    2    1 :  252  126
inode_cache         5714   5992    512  856  856    1 :  124   62
dentry_cache        5160   5160    128  172  172    1 :  252  126
mm_struct            240    240    160   10   10    1 :  252  126
vm_area_struct      3911   4480     96  112  112    1 :  252  126
size-64(DMA)           0      0     64    0    0    1 :  252  126
size-64              432   1357     64   23   23    1 :  252  126
size-32(DMA)          17    113     32    1    1    1 :  252  126
size-32              850   2712     32   24   24    1 :  252  126

As is obvious, the fields do not have a header to indicate what each column means. Each of them correspond to a field in the struct kmem_cache_s structure. The fields listed here are

cache-name A human readable name such as ``vm_area_struct''
num-active-objs Number of objects that are in use
total-objs How many are available in total including unused
obj-size The size of each object, typically quite small
num-active-slabs Number of slabs containing objects that are active
total-slabs How many slabs in total exist
num-pages-per-slab The pages required to create one slab, typically 1

If SMP is enabled like in the example excerpt, two more fields will be displayed after a colon. This refer to the per CPU cache described in the last section. The fields are

limit is he number of free objects the pool can have before half of it is given to the global free pool

batchcount The number of objects alloc-ed for the processor when no objects are free

To speed allocation and freeing of objects and slabs they are arranged into three lists; slabs_full, slabs_partial and slabs_free. slabs_full has all its objects in use. slabs_partial has free objects in it and so is a prime candidate for allocation of objects. slabs_free has no allocated objects and so is a prime candidate for slab destruction.

8.1.1 Cache Descriptor

All information describing a cache is stored in a struct kmem_cache_s declared in mm/slab.c. This is an extremely large struct and so will be described in parts.

190 struct kmem_cache_s {
193         struct list_head        slabs_full;
194         struct list_head        slabs_partial;
195         struct list_head        slabs_free;
196         unsigned int            objsize;
197         unsigned int            flags;
198         unsigned int            num;
199         spinlock_t              spinlock;
200 #ifdef CONFIG_SMP
201         unsigned int            batchcount;
202 #endif
203

Most of these fields are of interest when allocating or freeing objects.

slabs_* are the three lists the slabs are kept on described above

objsize is the size of each object

flags determine how parts of the allocator will behave when dealing with the cache. See Section 8.1.2

num The number of objects contained in each slab

206         unsigned int            gfporder;
209         unsigned int            gfpflags;
210 
211         size_t                  colour;
212         unsigned int            colour_off;
213         unsigned int            colour_next;
214         kmem_cache_t            *slabp_cache;
215         unsigned int            growing;
216         unsigned int            dflags;
217 
219         void (*ctor)(void *, kmem_cache_t *, unsigned long);
222         void (*dtor)(void *, kmem_cache_t *, unsigned long);
223 
224         unsigned long           failures;
225

This block deals with fields of interest when allocating or freeing slabs from the cache.

gfporder is the size of the slab in pages. The slab is 2$^{gfporder}$ pages big as required by the buddy allocator

gfpflags is the GFP flags to use when allocating memory. See Section 6.5

colour is the number of different cache lines that can be used. This will be further discussed in Section [*]

colour_off is the byte alignment to keep slabs at. For example, slabs for the size-X caches are aligned on the L1 cache

colour_next is the next colour line to use. This value wraps back to 0 when it reaches colour

growing is set to indicate if the cache is growing or not. If it is, it is much less likely this cache will be selected to reap free slabs under memory pressure

dflags are dynamic flags. See Section 8.1.3

ctor is for a complex object which provides a constructor function to be called on each new object. This is a pointer to that function. May be NULL

dtor is the object destructor. May be NULL

failures does not appear to be used anywhere

227         char                    name[CACHE_NAMELEN];
228         struct list_head        next;

These are set during cache creation

name is the human readable name of the cache
next is the next cache on the cache chain

229 #ifdef CONFIG_SMP
231         cpucache_t              *cpudata[NR_CPUS];
232 #endif

cpudata is the per-cpu data. Discussed further in Section 8.5

233 #if STATS
234         unsigned long           num_active;
235         unsigned long           num_allocations;
236         unsigned long           high_mark;
237         unsigned long           grown;
238         unsigned long           reaped;
239         unsigned long           errors;
240 #ifdef CONFIG_SMP
241         atomic_t                allochit;
242         atomic_t                allocmiss;
243         atomic_t                freehit;
244         atomic_t                freemiss;
245 #endif
246 #endif
247 };

These figures are only available if the CONFIG_SLAB_DEBUG option is set during compile time. They are all beancounters and not of general interest.

num_active is the number of active objects in the cache

num_allocations is the total number of objects that have been allocated on this cache

high_mark is the highest value num_active has been to date

grown is the number of times kmem_cache_grow() has been called

reaped refers to the number of times this cache has been reaped

errors is never used

allochit is the number of times an allocation has used the per-cpu cache

allocmiss is the number of times an allocation has missed the per-cpu cache

freehit is the number of times a free was placed on a per-cpu cache

freemiss is the number of times an object was freed and placed on the global pool


8.1.2 Cache Static Flags

A number of flags are set at cache creation time that remain the same for the lifetime of the cache. They affect how the slab is structured and how objects are stored within it. All the flags are stored in a bitmask in the flags field of the cache descriptor. The full list of possible flags that may be used are declared in include/linux/slab.h.

There is three principle sets. The first set are internal flags which are set only by the slab allocator and are listed in Table 8.2. At time of writing, the only relevant flag is the CFGS_OFF_SLAB flag which determines where the slab descriptor is stored.


Table 8.2: Internal cache static flags
\begin{table}
% latex2html id marker 3712
\begin{center}
\begin{tabularx}{13.5...
...r set and never used \\
\par\hline
\end{tabularx}
\end{center} \end{table}


The second set are set by the cache creator and they determine how the allocator treats the slab and how objects are stored. They are listed in Table 8.3.


Table 8.3: Cache static flags set by caller
\begin{table}\begin{center}
\begin{tabularx}{13.5cm}{\vert l\vert X\vert}
\hl...
...se memory from ZONE\_DMA \\
\hline
\end{tabularx}
\end{center} \end{table}


The last flags are only available if the compile option CONFIG_SLAB_DEBUG is set. They determine what additional checks will be made to slabs and objects and are primarily of interest only when new caches are being developed.


Table 8.4: Cache static debug flags
\begin{table}\begin{center}
\begin{tabularx}{13.5cm}{\vert l\vert X\vert}
\hl...
...ocated
or initialised \\
\hline
\end{tabularx}
\end{center} \end{table}


To prevent callers using the wrong flags a CREATE_MASK is defined in mm/slab.c consisting of all the allowable flags. When a cache is being created, the requested flags are compared against the CREATE_MASK and reported as a bug if invalid flags are used.


8.1.3 Cache Dynamic Flags

The dflags field has only one flag DFLGS_GROWN but it is important. The flag is set during kmem_cache_grow() so that kmem_cache_reap() will be unlikely to choose the cache for reaping. When the function does find a cache with this flag set, it skips the cache and removes the flag.


8.1.4 Cache Allocation Flags

The flags correspond to the GFP page flag options for allocating pages for slabs. Callers sometimes call with either SLAB_ or GFP_ flags, but they really should use only SLAB_* flags. They correspond directly to the flags described in section 6.5 so will not be discussed in detail here. It is presumed the existence of these flags are for clarity and in case the slab allocator needed to behave differently in response to a particular flag but in reality, it doesn't.


Table 8.5: Cache Allocation Flags
\begin{table}\begin{center}
\begin{tabularx}{13.5cm}{\vert l\vert l\vert}
\hl...
...uivalent to GFP\_DMA \\
\par\hline
\end{tabularx}
\end{center} \end{table}



8.1.5 Cache Colouring

To utilize hardware cache better, the slab allocator will offset objects in different slabs by different amounts depending on the amount of space left over in the slab. The offset is in units of BYTES_PER_WORD unless SLAB_HWCACHE_ALIGN is set in which case it is aligned to blocks of L1_CACHE_BYTES for alignment to the L1 hardware cache.

During cache creation, it is calculated how many objects can fit on a slab (See Section 8.2.7) and what the bytes wasted is. Based on that, two figures are calculated for the cache descriptor

colour is the number of different offsets that can be used
colour_off is the amount to offset each objects by

With the objects offset, they will use different lines on the associative hardware cache. Therefore, objects from slabs are less likely to overwrite each other in memory.

The result of this is easiest explained with example. Let us say that s_mem (the address of the first object) on the slab is 0 for convenience, that 100 bytes are wasted on the slab and alignment is to be at 32 bytes to the L1 Hardware Cache on a Pentium 2.

In this scenario, the first slab created will have its objects start at 0. The second will start at 32, the third at 64, the fourth at 96 and the fifth will start back at 0. With this, objects from each of the slabs will not hit the same hardware cache line on the CPU. The value of colour is 3 and colour_off is 32.


8.1.6 Cache Creation

The function kmem_cache_create() is responsible for creating new caches and adding them to the cache chain. The tasks that are taken to create a cache are

Figure 8.2 shows the call graph relevant to the creation of a cache and is more fully described in the code commentary.



\includegraphics[width=17cm]{graphs/kmem_cache_create.ps}
[1]
Figure: Call Graph: kmem_cache_create
[2]



8.1.7 Cache Reaping

When a slab becomes free, it is placed on the slabs_free list for future use. Caches do not automatically shrink themselves so when kswapd notices that memory is tight, it calls kmem_cache_reap() to free some memory. This function is responsible for selecting a cache that will be required to shrink its memory usage. It is worth noting is that cache reaping does account what node the memory is under pressure. This means that with a NUMA or high memory machine, it is possible the kernel will spend a lot of time freeing memory from regions that are under no memory pressure but this is not a problem for architectures like the x86 which has only one bank of memory. As the vast majority of the cache memory will be using ZONE_NORMAL, the zone consideration is a serious problem.



\includegraphics[]{graphs/kmem_cache_reap.ps}
[1]
Figure: Call Graph: kmem_cache_reap
[2]


The call graph in Figure 8.3 is deceptively simple as the task of selecting the proper cache to reap is quite long. In case there is many caches in the system, only REAP_SCANLEN8.1 caches are examined in each call. The last cache to be scanned is stored in the variable clock_searchp so as not to examine the same caches over and over again. For each scanned cache, the reaper does the following


8.1.8 Cache Shrinking

When a cache is selected to shrink itself, the steps it takes are simple and brutal

Linux is nothing, if not subtle.



\includegraphics[width=15cm]{graphs/kmem_cache_shrink.ps}
Figure: Call Graph: kmem_cache_shrink


Two varieties of shrink functions exist with confusingly similar names. kmem_cache_shrink() removes all slabs from slabs_free and returns the number of pages freed as a result. This is the principle function exported for use by the slab allocator users.

The second function __kmem_cache_shrink() frees all slabs from slabs_free and then verifies that slabs_partial and slabs_full are empty. This is for internal use only and is important during cache destruction when it doesn't matter how many pages are freed, just that the cache is empty.


8.1.9 Cache Destroying

When a module is unloaded, it is responsible for destroying any cache with the function kmem_cache_destroy(). It is important the cache is properly destroyed as two caches of the same name are not allowed to exist. Core kernel code often does not bother to destroy its caches as their existence persists for the life of the system. The steps taken to destroy a cache are



\includegraphics[]{graphs/kmem_cache_destroy.ps}
[1]
Figure: Call Graph: kmem_cache_destroy
[2]




Footnotes

...REAP_SCANLEN8.1
Defined statically as 10

next up previous contents index
Next: 8.2 Slabs Up: 8. Slab Allocator Previous: 8. Slab Allocator   Contents   Index
Mel 2003-01-14