next up previous contents index
Next: 9.4 Locking a Memory Up: 9. Process Address Space Previous: 9.2 Process Address Space   Contents   Index

9.3 Memory Regions

The full address space of a process is rarely used; only sparse regions are. Each region is represented by a vm_area_struct. Regions never overlap, and each represents a set of addresses with the same protection and purpose. Examples of a region include a read-only shared library loaded into the address space or the process heap. A full list of the regions mapped by a process may be viewed via the proc interface at /proc/pid_number/maps.
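As a rough illustration of the proc interface, the sketch below reads the start address, end address and permission string from each line of /proc/self/maps. The names map_region, parse_maps_line() and dump_self_maps() are invented for this example, and only the first three fields of each line are parsed.

```c
#include <stdio.h>

/* Hypothetical helper type: one parsed line of /proc/<pid>/maps. */
struct map_region {
    unsigned long start;   /* first address in the region */
    unsigned long end;     /* first address past the region */
    char perms[5];         /* permission string, e.g. "r-xp" */
};

/*
 * Parse the leading "start-end perms" fields of a maps line.
 * Returns 0 on success and -1 if the line does not match.
 */
int parse_maps_line(const char *line, struct map_region *out)
{
    if (sscanf(line, "%lx-%lx %4s", &out->start, &out->end, out->perms) != 3)
        return -1;
    return 0;
}

/* Print every region of the calling process; returns the region count. */
int dump_self_maps(void)
{
    char line[512];
    struct map_region r;
    int count = 0;
    FILE *fp = fopen("/proc/self/maps", "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (parse_maps_line(line, &r) == 0) {
            printf("%08lx-%08lx %s\n", r.start, r.end, r.perms);
            count++;
        }
    }
    fclose(fp);
    return count;
}
```

Calling dump_self_maps() from a small test program should list the program text, heap, shared libraries and stack as separate regions, none of which overlap.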

The region is represented by a number of different structures illustrated in Figure 9.1. At the top, there is the vm_area_struct which on its own is enough to represent anonymous memory.


Table 9.3: Memory Region VMA API


If a file is memory mapped, the struct file is available through the vm_file field, which has a pointer to the struct inode. The inode is used to get the struct address_space, which has all the private information about the file, including a set of pointers to filesystem functions which perform filesystem-specific operations such as reading and writing pages to disk.

struct vm_area_struct {
        struct mm_struct * vm_mm;
        unsigned long vm_start;
        unsigned long vm_end;

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next;

        pgprot_t vm_page_prot;
        unsigned long vm_flags;

        rb_node_t vm_rb;

        struct vm_area_struct *vm_next_share;
        struct vm_area_struct **vm_pprev_share;

        /* Function pointers to deal with this struct. */
        struct vm_operations_struct * vm_ops;

        /* Information about our backing store: */
        unsigned long vm_pgoff;
        struct file * vm_file;
        unsigned long vm_raend;
        void * vm_private_data;
};

vm_mm The mm_struct this VMA belongs to

vm_start The starting address

vm_end The end address

vm_next All the VMAs in an address space are linked together in an address ordered singly linked list with this field

vm_page_prot The protection flags for all pages in this VMA which are all defined in include/linux/mm.h. See Table 9.2 for a full description

vm_rb As well as being in a linked list, all the VMAs are stored on a red-black tree for fast lookups. This matters for page fault handling, where the correct region must be found quickly, especially when there is a large number of mapped regions

vm_next_share Shared VMA regions based on file mappings (such as shared libraries) are linked together with this field

vm_pprev_share The complement to vm_next_share

vm_ops The vm_ops field contains function pointers for open(), close() and nopage(). These are needed for syncing with information from the disk

vm_pgoff This is the page aligned offset within a file that is mmap'ed

vm_file The struct file pointer to the file being mapped

vm_raend This is the end address of a readahead window. When a fault occurs, a readahead window will page in a number of pages after the fault address. This field records how far to read ahead

vm_private_data Used by some device drivers to store private information. Not of concern to the memory manager

Figure 9.2: Memory Region Flags

All the regions are linked together on a linked list ordered by address via the vm_next field. When searching for a free area, it is a simple matter of traversing the list. A more frequent operation is searching for the VMA containing a particular address, such as during page faulting. In this case, the red-black tree is traversed as it has O(log N) search time. The tree is ordered so that addresses lower than the current node are on the left and higher addresses are on the right.

9.3.1 File/Device backed memory regions

In the event the region is backed by a file, the vm_file leads to an associated address_space. The struct contains information of relevance to the filesystem such as the number of dirty pages which must be flushed to disk. It is defined as follows in include/linux/fs.h

struct address_space {
        struct list_head        clean_pages;
        struct list_head        dirty_pages;
        struct list_head        locked_pages;
        unsigned long           nrpages;
        struct address_space_operations *a_ops;
        struct inode            *host;
        struct vm_area_struct   *i_mmap;
        struct vm_area_struct   *i_mmap_shared;
        spinlock_t              i_shared_lock;
        int                     gfp_mask;
};

clean_pages A list of clean pages which do not have to be synchronized with the disk
dirty_pages A list of pages that the process has touched and which need to be synced with the disk
locked_pages A list of pages locked in memory
nrpages Number of resident pages in use by the address space
a_ops A struct of function pointers within the filesystem
host The host inode the file belongs to
i_mmap A list of VMAs with private mappings using this address space
i_mmap_shared A list of VMAs which share mappings in this address space
i_shared_lock A spinlock to protect this structure
gfp_mask The mask to use when calling __alloc_pages() for new pages

Periodically the memory manager will need to flush information to disk. The memory manager does not know and does not care how information is written to disk, so the a_ops struct is used to call the relevant functions. It is defined as follows in include/linux/fs.h:

struct address_space_operations {
        int (*writepage)(struct page *);
        int (*readpage)(struct file *, struct page *);
        int (*sync_page)(struct page *);
        /*
         * ext3 requires that a successful prepare_write() call be followed
         * by a commit_write() call - they must be balanced
         */
        int (*prepare_write)(struct file *, struct page *,
                             unsigned, unsigned);
        int (*commit_write)(struct file *, struct page *,
                            unsigned, unsigned);
        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        int (*bmap)(struct address_space *, long);
        int (*flushpage) (struct page *, unsigned long);
        int (*releasepage) (struct page *, int);
#define KERNEL_HAS_O_DIRECT
        int (*direct_IO)(int, struct inode *, struct kiobuf *,
                         unsigned long, int);
};

writepage Write a page to disk. The offset within the file to write to is stored within the page struct. It is up to the filesystem specific code to find the block. See buffer.c:block_write_full_page()

readpage Read a page from disk. See buffer.c:block_read_full_page()

sync_page Sync a dirty page with disk. See buffer.c:block_sync_page()

prepare_write This is called before data is copied from userspace into a page that will be written to disk. With a journaled filesystem, this ensures the filesystem log is up to date. With normal filesystems, it makes sure the needed buffer pages are allocated. See buffer.c:block_prepare_write()

commit_write After the data has been copied from userspace, this function is called to commit the information to disk. See buffer.c:block_commit_write()

bmap Maps a block so raw IO can be performed. Only of concern to the filesystem specific code.

flushpage This makes sure there is no IO pending on a page before releasing it. See buffer.c:discard_bh_page()

releasepage This tries to flush all the buffers associated with a page before freeing the page itself. See try_to_free_buffers()
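The indirection through a_ops can be pictured with a small user-space sketch. The types page_sketch, aops_sketch and mapping_sketch, and the functions demo_writepage() and flush_page(), are invented stand-ins rather than kernel definitions. They only show how generic code can write a page without knowing which filesystem is behind the table.

```c
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel types. */
struct page_sketch {
    unsigned long index;   /* page offset within the file */
    int dirty;             /* non-zero if the page must be written */
};

struct aops_sketch {
    int (*writepage)(struct page_sketch *);
};

struct mapping_sketch {
    const struct aops_sketch *a_ops;
};

/* One "filesystem" implementation of writepage. */
int demo_writepage(struct page_sketch *page)
{
    page->dirty = 0;   /* pretend the page reached disk */
    return 0;
}

const struct aops_sketch demo_aops = { .writepage = demo_writepage };

/* Generic code: it only knows the table, not the filesystem behind it. */
int flush_page(struct mapping_sketch *mapping, struct page_sketch *page)
{
    if (!page->dirty)
        return 0;                      /* nothing to write */
    if (mapping->a_ops == NULL || mapping->a_ops->writepage == NULL)
        return -1;                     /* no way to write this page */
    return mapping->a_ops->writepage(page);
}
```

A different filesystem would supply its own table with the same layout, and flush_page() would not need to change.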

9.3.2 Creating A Memory Region

The system call mmap is provided for creating new memory regions within a process. For the x86, the function is called sys_mmap2() and is responsible for performing basic checks before calling do_mmap_pgoff() which is the prime function for creating new areas for all architectures.

Figure 9.3: Call Graph: sys_mmap2

The two higher-level functions above do_mmap_pgoff() are essentially sanity checkers. They ensure the mapping size is page aligned if necessary, clear invalid flags, look up the struct file for the given file descriptor and acquire the mmap_sem semaphore.
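The page alignment check is simple arithmetic. The sketch below assumes 4 KiB pages, as on the x86. The SKETCH_ macros and both function names are made up for illustration and are not the kernel's own definitions.

```c
/* Assumed 4 KiB pages, as on x86; the kernel takes these from its headers. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (~(SKETCH_PAGE_SIZE - 1))

/* Round a length up to the next page boundary. */
unsigned long page_align(unsigned long len)
{
    return (len + SKETCH_PAGE_SIZE - 1) & SKETCH_PAGE_MASK;
}

/* A byte offset is only usable as a page offset if it is page aligned. */
int offset_is_aligned(unsigned long offset)
{
    return (offset & ~SKETCH_PAGE_MASK) == 0;
}
```

With these definitions, a request for a single byte is rounded up to one full page, while a request that is already a multiple of the page size is left unchanged.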

This do_mmap_pgoff() function is very large and broadly speaking it takes the following steps;

9.3.3 Finding a Mapped Memory Region

A common operation is to find the VMA a particular address belongs to during operations such as a page fault and the function responsible is find_vma().

It first checks the mmap_cache field, which caches the result of the last call to find_vma(), as it is quite likely the same region will be needed a few times in succession. If it is not the desired region, the red-black tree stored in the mm_rb field is traversed. The function returns the first VMA that ends above the requested address, which is the closest one but does not necessarily contain it, so it is important that callers check the returned VMA contains the desired address.
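The lookup can be sketched in user space as follows. This is a simplified model, not the kernel code: a plain binary search tree stands in for the red-black tree, and the names region, addr_space, find_region() and region_contains() are invented for the example.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins: a plain binary search tree is used
 * where the kernel uses a red-black tree. */
struct region {
    unsigned long start;            /* first address in the region */
    unsigned long end;              /* first address past the region */
    struct region *left, *right;    /* lower / higher addresses */
};

struct addr_space {
    struct region *root;            /* plays the role of mm_rb */
    struct region *cache;           /* plays the role of mmap_cache */
};

/* Return the first region with addr < end, or NULL if there is none. */
struct region *find_region(struct addr_space *as, unsigned long addr)
{
    struct region *found = as->cache;
    struct region *node;

    /* Fast path: the last result still covers this address. */
    if (found != NULL && found->start <= addr && addr < found->end)
        return found;

    found = NULL;
    node = as->root;
    while (node != NULL) {
        if (addr < node->end) {
            found = node;               /* candidate; look for a lower one */
            if (node->start <= addr)
                break;                  /* addr lies inside this region */
            node = node->left;
        } else {
            node = node->right;
        }
    }
    if (found != NULL)
        as->cache = found;
    return found;
}

/* Callers must still check that the result contains the address. */
int region_contains(const struct region *r, unsigned long addr)
{
    return r != NULL && r->start <= addr && addr < r->end;
}
```

Looking up an address that falls in a gap returns the next region above the gap, which is why the separate containment check is needed.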

A second function, find_vma_prev(), is provided which is functionally similar. The only difference is that it also returns a pointer to the VMA preceding the searched-for VMA [9.2], which is required as the list is a singly linked list. It is used rarely but, most notably, it is used when deciding if two VMAs can be merged so that the two VMAs may be easily compared. It is also used while removing a memory region so that the linked lists may be fixed up.

The last function of note for searching VMAs is find_vma_intersection() which is used to find a VMA which overlaps a given address range. The most notable use of this is during a call to do_brk() when a region is growing up. It is important to ensure that the growing region will not overlap an old region.
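An overlap test of this kind reduces to one lookup plus one comparison. The sketch below is a user-space model that walks a sorted list for brevity. The names region, first_region_above() and find_intersection() are invented for the example.

```c
#include <stddef.h>

struct region {
    unsigned long start, end;   /* [start, end) */
    struct region *next;        /* next region by ascending address */
};

/* First region whose end lies above addr (list walk for brevity). */
struct region *first_region_above(struct region *head, unsigned long addr)
{
    struct region *r;

    for (r = head; r != NULL; r = r->next)
        if (addr < r->end)
            return r;
    return NULL;
}

/* Region overlapping [start, end), or NULL if the range is free. */
struct region *find_intersection(struct region *head,
                                 unsigned long start, unsigned long end)
{
    struct region *r = first_region_above(head, start);

    if (r != NULL && end <= r->start)
        r = NULL;       /* candidate begins at or after the range ends */
    return r;
}
```

A growing heap would call find_intersection() with its proposed new end address and refuse to grow if anything other than NULL comes back.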

9.3.4 Finding a Free Memory Region

When a new area is to be mmap'd, a free region has to be found that is large enough to contain the new mapping. The function responsible for finding a free area is get_unmapped_area().



Figure 9.4: Call Graph: get_unmapped_area


As the call graph in Figure 9.4 shows, there is not much work involved with finding an unmapped area. The function is passed a number of parameters. A struct file is passed representing the file or device to be mapped, as well as pgoff, the offset within the file that is being mapped. The requested address for the mapping is passed as well as its length. The last parameter is the protection flags for the area.

If a device is being mapped, such as a video card, the associated f_op->get_unmapped_area() is used. This is because devices or files may have additional requirements for mapping that generic code can not be aware of, such as the address having to be aligned to a particular virtual address.

If there are no special requirements, the architecture specific function arch_get_unmapped_area() is called. Not all architectures provide their own function. For those that don't, there is a generic function provided in mm/mmap.c.
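A generic search of this kind can be pictured as a first-fit walk over the address-ordered list. The sketch below is a user-space model under stated assumptions: the two SKETCH_ limits are illustrative values, failure is reported as 0 rather than an error code, and the names region and find_free_area() are invented for the example.

```c
#include <stddef.h>

/* Illustrative limits, not the kernel's values for any architecture. */
#define SKETCH_TASK_SIZE     0xC0000000UL
#define SKETCH_UNMAPPED_BASE 0x40000000UL

struct region {
    unsigned long start, end;   /* [start, end) */
    struct region *next;        /* ascending address order */
};

/* Return the lowest free address >= base fitting len bytes, or 0. */
unsigned long find_free_area(struct region *head, unsigned long len)
{
    unsigned long addr = SKETCH_UNMAPPED_BASE;
    struct region *r = head;

    if (len == 0 || len > SKETCH_TASK_SIZE)
        return 0;

    /* Skip regions that end at or below the search base. */
    while (r != NULL && r->end <= addr)
        r = r->next;

    for (;; r = r->next) {
        if (SKETCH_TASK_SIZE - len < addr)
            return 0;                   /* ran out of address space */
        if (r == NULL || addr + len <= r->start)
            return addr;                /* the gap before r is big enough */
        addr = r->end;                  /* try again after this region */
    }
}
```

First fit keeps the search simple at the cost of a linear walk, which matches the observation that little work is involved in finding an unmapped area.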

9.3.5 Inserting a memory region

The principal function available for inserting a new memory region is insert_vm_struct(), whose call graph can be seen in Figure 9.5. It is a very simple function which first calls find_vma_prepare() to find the appropriate VMAs the new region is to be inserted between and the correct nodes within the red-black tree. It then calls __vma_link() to do the work of linking in the new VMA.



Figure 9.5: Call Graph: insert_vm_struct


The function insert_vm_struct() is rarely used as it does not increase the map_count field. Instead, the function more commonly used is __insert_vm_struct() which performs the same tasks except it increases map_count.

Two varieties of linking functions are provided, vma_link() and __vma_link(). vma_link() is intended for use when no locks are held. It will acquire all the necessary locks, including locking the file if the VMA is a file mapping, before calling __vma_link(), which places the VMA in the relevant lists.

It is important to note that many users do not use the insert_vm_struct() functions but instead prefer to call find_vma_prepare() themselves followed by a later vma_link() to avoid having to traverse the tree multiple times.

The linking in __vma_link() consists of three stages, each of which has a single function. __vma_link_list() inserts the VMA into the linear singly linked list. If it is the first mapping in the address space (i.e. prev is NULL), then it will be made the red-black root as well. The second stage is linking the node into the red-black tree with __vma_link_rb(). The final stage is fixing up the file share mapping with __vma_link_file(), which basically inserts the VMA into the linked list of VMAs via the vm_pprev_share and vm_next_share fields.
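The list stage of the linking can be sketched in user space. Only the singly linked list is modelled here. The tree and file-share stages are left out, and the names region, find_prev() and link_list() are invented for the example.

```c
#include <stddef.h>

struct region {
    unsigned long start, end;   /* [start, end) */
    struct region *next;        /* ascending address order */
};

/* Find the region after which [start, ...) belongs; NULL means the head.
 * This covers only the list part of preparing for an insert. */
struct region *find_prev(struct region *head, unsigned long start)
{
    struct region *prev = NULL;
    struct region *r;

    for (r = head; r != NULL && r->start < start; r = r->next)
        prev = r;
    return prev;
}

/* Link new_r after prev, or at the head of the list when prev is NULL. */
void link_list(struct region **head, struct region *new_r, struct region *prev)
{
    if (prev != NULL) {
        new_r->next = prev->next;
        prev->next = new_r;
    } else {
        new_r->next = *head;
        *head = new_r;
    }
}
```

Keeping the search for prev separate from the linking step mirrors the reason callers prefer to do the lookup once and link later.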

9.3.6 Merging contiguous regions

Linux used to have a function called merge_segments() [kernel-2-2] which was responsible for merging adjacent regions of memory together if the file and permissions matched. The objective was to reduce the number of VMAs required, especially as many operations, such as calls to sys_mprotect(), resulted in a number of mappings being created. This was an expensive operation as it could result in large portions of the mappings being traversed, and it was later removed as applications, especially those with many mappings, spent a long time in merge_segments().

Only one roughly equivalent function exists now, vma_merge(), and its use is quite rare. It is only called during sys_mmap() if an anonymous region is being mapped, and during do_brk(). The principal difference is that instead of merging two regions together, it checks whether another region can be expanded to cover the new allocation, removing the need to create a new region. A region can be expanded if there are no file or device mappings and the permissions of the two areas are the same.
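The expansion test can be sketched as a single predicate. This is a simplified user-space model: the names region, can_expand() and try_expand() are invented, and the flags field stands in for whatever permission comparison the real code performs.

```c
#include <stddef.h>

struct region {
    unsigned long start, end;   /* [start, end) */
    unsigned long flags;        /* protection and behaviour flags */
    void *file;                 /* non-NULL for file or device mappings */
};

/* Can prev simply grow to cover a new anonymous mapping at addr? */
int can_expand(const struct region *prev, unsigned long addr,
               unsigned long flags)
{
    return prev != NULL
        && prev->file == NULL       /* only anonymous regions qualify */
        && prev->end == addr        /* new area starts where prev ends */
        && prev->flags == flags;    /* identical permissions */
}

/* Returns 1 if prev was expanded, 0 if a new region is still needed. */
int try_expand(struct region *prev, unsigned long addr,
               unsigned long len, unsigned long flags)
{
    if (!can_expand(prev, addr, flags))
        return 0;
    prev->end = addr + len;
    return 1;
}
```

When try_expand() succeeds, no new region structure has to be allocated or linked, which is the saving the expansion approach aims for.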

Regions are also merged elsewhere, although no function is explicitly called to perform the merging. The first case is during a call to sys_mprotect(). During the fixup of areas, the two regions will be merged if the permissions are now the same. The second is during a call to move_vma() when it is likely similar regions will be located beside each other.

9.3.7 Remapping and moving a memory region



Figure 9.6: Call Graph: sys_mremap


Memory regions may be moved during a call to sys_mremap() if the region is growing, would overlap another region and MREMAP_FIXED is not specified in the flags. The call graph is illustrated in Figure 9.6.

To move a region, it first calls get_unmapped_area() to find a region large enough to contain the new resized mapping and then calls move_vma() to move the old VMA to the new location. See Figure 9.7 for the call graph.



Figure 9.7: Call Graph: move_vma


First the function checks if the new location may be merged with the VMAs adjacent to the new location. If they can not be merged, a new VMA is allocated literally one PTE at a time.

Next move_page_tables() is called, see Figure 9.8 for its call graph. This function copies all the page table entries from the old mapping to the new one. While there may be better ways to move the page tables, this method makes error recovery much easier as it is easy to backtrack if an error occurs during the page table move.



Figure 9.8: Call Graph: move_page_tables


The contents of the pages are not copied. Instead, zap_page_range() is called to swap out or remove all the pages from the old mapping. The normal page fault handling code will either swap the pages back in from swap, files or call the device specific do_nopage() function.

9.3.8 Deleting a memory region

The function responsible for deleting memory regions or parts thereof is do_munmap(). It is a relatively simple operation in comparison to the other memory region related operations and is basically divided up into three parts. The first is to fix up the red-black tree for the region that is about to be unmapped. The second is to release the pages and PTEs related to the region to be unmapped and the third is to fix up the regions if a hole has been generated.



Figure: Call Graph: do_munmap


To ensure the red-black tree is ordered correctly, all VMAs to be affected by the unmap are placed on a linked list called free and then deleted from the red-black tree with rb_erase(). The regions, if they still exist, will be added with their new addresses later during the fixup.

Next the linked list of free is walked through and checks are made to ensure it is not a partial unmapping. Even if a region is just to be partially unmapped, remove_shared_vm_struct() is still called to remove the shared file mapping. Again, if this is a partial unmapping, it will be recreated during fixup. zap_page_range() is called to remove all the pages associated with the region about to be unmapped before unmap_fixup() is called to handle partial unmappings.
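The cases a partial unmapping fixup has to tell apart can be sketched as a classification of the unmapped range against one region. The enum unmap_kind and the function classify_unmap() are invented for this example and only model the address arithmetic, not the list and file fixups.

```c
enum unmap_kind {
    UNMAP_NONE,     /* range does not touch the region */
    UNMAP_WHOLE,    /* region disappears entirely */
    UNMAP_HEAD,     /* start of the region is cut off */
    UNMAP_TAIL,     /* end of the region is cut off */
    UNMAP_HOLE      /* middle removed; region splits in two */
};

struct region {
    unsigned long start, end;   /* [start, end) */
};

/* Classify unmapping [addr, addr + len) against one region. */
enum unmap_kind classify_unmap(const struct region *r,
                               unsigned long addr, unsigned long len)
{
    unsigned long end = addr + len;

    if (len == 0 || end <= r->start || addr >= r->end)
        return UNMAP_NONE;
    if (addr <= r->start && end >= r->end)
        return UNMAP_WHOLE;
    if (addr <= r->start)
        return UNMAP_HEAD;
    if (end >= r->end)
        return UNMAP_TAIL;
    return UNMAP_HOLE;
}
```

Only the hole case needs an extra region structure, since one region becomes two. The head and tail cases just move one boundary.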

Lastly free_pgtables() is called to try and free up all the page table entries associated with the unmapped region. It is important to note that the page table entry freeing is not exhaustive. It will only unmap full PGD directories and their entries so for example, if only half a PGD was used for the mapping, no page table entries will be freed. This is because a finer grained freeing of page table entries would be too expensive to free up data structures that are both small and likely to be used again.
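The effect of freeing only whole PGD entries can be shown with some arithmetic. The sketch assumes an x86 without PAE, where one PGD entry covers 4 MiB, and it ignores neighbouring regions, which real code would also have to consider. The SKETCH_ macros and full_pgd_entries() are invented for the example.

```c
/* Assumed x86 without PAE: one PGD entry maps 4 MiB. */
#define SKETCH_PGDIR_SHIFT 22
#define SKETCH_PGDIR_SIZE  (1UL << SKETCH_PGDIR_SHIFT)
#define SKETCH_PGDIR_MASK  (~(SKETCH_PGDIR_SIZE - 1))

/* Number of PGD entries lying entirely inside [start, end). */
unsigned long full_pgd_entries(unsigned long start, unsigned long end)
{
    /* Round the start up and the end down to PGD boundaries. */
    unsigned long first = (start + SKETCH_PGDIR_SIZE - 1) & SKETCH_PGDIR_MASK;
    unsigned long last = end & SKETCH_PGDIR_MASK;

    if (first >= last)
        return 0;   /* no entry is completely covered */
    return (last - first) >> SKETCH_PGDIR_SHIFT;
}
```

Under these assumptions, a 2 MiB unmapping that covers only half of a PGD entry yields zero, so nothing would be freed for it.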

9.3.9 Deleting all memory regions

During process exit, it is necessary to unmap all VMAs associated with an mm. The function responsible is exit_mmap(). It is a very simple function which flushes the CPU cache before walking through the linked list of VMAs, unmapping each of them in turn and freeing up the associated pages before flushing the TLB and deleting the page table entries. It is covered in detail in the companion document.



Footnotes

9.2
This is one of the very rare cases where a singly linked list is used in the kernel

Mel 2003-01-14