Pages in the process linear address space are not necessarily resident in memory. For example, allocations made on behalf of a process are not satisfied immediately as the space is just reserved with the vm_area_struct. Other examples include the page having been swapped out to backing storage, writing a read-only page or simple programming error.
Linux, like most operating systems, has a Demand Fetch policy as its fetch
policy for dealing with pages that are not resident. This states that the page is only
fetched from backing storage when the hardware raises a page fault, which the
operating system traps and handles by allocating a page. The characteristics of backing
storage imply that some sort of page prefetching policy would result in fewer
page faults [maekawa87] but Linux is fairly primitive in this respect.
When a page is paged in from swap space, a number of pages after it, up to
2^page_cluster, are read in by swapin_readahead() and placed
in the swap cache. Unfortunately there is not much guarantee that the pages
placed in swap are related to each other or likely to be used soon.
There are two types of page fault, major and minor. A major fault occurs
when the data has to be fetched from disk; otherwise it is referred to as
a minor or soft page fault. Linux maintains statistics on these two
types of fault with the task_struct->maj_flt and
task_struct->min_flt fields respectively.
The page fault handler in Linux is expected to recognise and act on a number of different types of page faults listed in Table 9.4 which will be discussed in detail later in this section.
Each architecture registers an architecture specific function for the handling of page faults. While the name of this function is arbitrary, a common choice is do_page_fault() whose call graph for the x86 is shown in Figure 9.10.
This function is provided with a wealth of information such as the address of the fault, whether the page was simply not found or was a protection error, whether it was a read or write fault and whether it is a fault from user or kernel space. It is responsible for determining which type of fault it has and how it should be handled by the architecture independent code. The flow chart which shows, broadly speaking, what this function does is shown in Figure 9.11. In the figure, points with a colon after them are labels as shown in the code.
handle_mm_fault() is the architecture independent top level function for faulting in a page from backing storage, performing COW and so on. If it returns 1, it was a minor fault; 2 means it was a major fault; 0 sends a SIGBUS error; and any other value invokes the out of memory handler.
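The return value convention described above can be sketched in a few lines of C. This is only an illustration: struct fault_stats, enum fault_outcome and account_fault() are hypothetical names that stand in for the task_struct counters and the architecture specific caller, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fault counters kept in task_struct. */
struct fault_stats {
    unsigned long min_flt;  /* minor (soft) faults */
    unsigned long maj_flt;  /* major faults: data had to come from disk */
};

/* Illustrative outcome names; these are not kernel identifiers. */
enum fault_outcome { FAULT_SIGBUS, FAULT_MINOR, FAULT_MAJOR, FAULT_OOM };

/* Interpret a handle_mm_fault() style return value as the text describes:
 * 1 = minor fault, 2 = major fault, 0 = SIGBUS, anything else = OOM. */
enum fault_outcome account_fault(int ret, struct fault_stats *stats)
{
    assert(stats != NULL);
    switch (ret) {
    case 1:
        stats->min_flt++;
        return FAULT_MINOR;
    case 2:
        stats->maj_flt++;
        return FAULT_MAJOR;
    case 0:
        return FAULT_SIGBUS;
    default:
        return FAULT_OOM;
    }
}
```

A caller would pass the handler's return value together with the per-process counters and then act on the outcome, for example by delivering a signal for FAULT_SIGBUS.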
Once the exception handler has decided it is a normal page fault, handle_mm_fault(), whose call graph is shown in Figure 9.12, takes over. It allocates the required page table entries if they do not already exist and calls handle_pte_fault().
Based on the properties of the PTE, one of the handler functions shown in the call graph will be used. The first check is whether the PTE is marked not present, as reported by pte_present(), in which case pte_none() is called. If it reports that there is no PTE, do_no_page() is called, which handles Demand Allocation. Otherwise it is a page that has been swapped out to disk and do_swap_page() is called to perform Demand Paging.
The second option is if the page is being written to. If the PTE is write protected, then do_wp_page() is called, as the page is a Copy-On-Write (COW) page: the VMA for the region is marked writable even though the individual PTE is not. Otherwise the page is simply marked dirty as it has been written to.
The last option is if the page has been read and is present but a fault still occurred. This can occur with some architectures that do not have a three level page table. In this case, the PTE is simply established and marked young.
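The three options just described amount to a small decision tree, which the following sketch models in userspace C. The struct sim_pte fields and the enum pte_action names are assumptions made for illustration; the real pte_t is architecture specific and the real handle_pte_fault() calls the handlers directly rather than returning a label.

```c
#include <stdbool.h>

/* Simplified, hypothetical view of a PTE for illustration only. */
struct sim_pte {
    bool present;   /* what pte_present() would report */
    bool none;      /* what pte_none() would report: no entry at all */
    bool writable;  /* false means the PTE is write protected */
};

/* Which action the text says is taken for a given fault. */
enum pte_action {
    ACT_DO_NO_PAGE,    /* Demand Allocation */
    ACT_DO_SWAP_PAGE,  /* Demand Paging from swap */
    ACT_DO_WP_PAGE,    /* Copy-On-Write break */
    ACT_MARK_DIRTY,    /* write to a present, writable page */
    ACT_MARK_YOUNG     /* read of a present page */
};

/* Mirror the order of checks described for handle_pte_fault(). */
enum pte_action classify_pte_fault(struct sim_pte pte, bool write_access)
{
    if (!pte.present) {
        if (pte.none)
            return ACT_DO_NO_PAGE;
        return ACT_DO_SWAP_PAGE;
    }
    if (write_access) {
        if (!pte.writable)
            return ACT_DO_WP_PAGE;
        return ACT_MARK_DIRTY;
    }
    return ACT_MARK_YOUNG;
}
```

Returning a label instead of calling a handler keeps the sketch testable; it says nothing about what each handler does internally.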
When a process accesses a page for the very first time, the page has to be allocated and possibly filled with data by the do_no_page() function. If the managing VMA has filled in the vm_ops struct and has supplied a nopage(), it is called. This is of importance to a memory mapped device such as a video card which needs to allocate the page and supply data on access or to a mapped file which must retrieve its data from backing storage.
If the struct is not filled in or a nopage() function is not supplied, the function do_anonymous_page() is called to handle an anonymous access, which we will discuss first as it is simpler. There are only two cases to handle: first time read and first time write. As it is an anonymous page, the first read is an easy case as no data exists, so the system wide empty_zero_page, which is just a page of zeros, is mapped for the PTE and the PTE is write protected. The PTE is write protected so that another page fault will occur if the process writes to the page.
If this is the first write to the page, alloc_page() is called to allocate a free page (see Chapter 6), which is zero filled by clear_user_highpage(). Assuming the page was successfully allocated, the Resident Set Size (rss) field in the mm_struct will be incremented. flush_page_to_ram() is then called, as some architectures require it when a page is being inserted into a userspace process to ensure cache coherency. The page is inserted on the LRU lists so it may be reclaimed later by the swapping code and, finally, the page table entries for the process are updated for the new mapping.
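The two anonymous cases can be modelled with a small userspace sketch. Everything here is a stand-in chosen for illustration: sim_zero_page plays the role of empty_zero_page, calloc() replaces alloc_page() plus clear_user_highpage(), and the cache flush and LRU insertion steps are omitted entirely.

```c
#include <stdbool.h>
#include <stdlib.h>

#define SIM_PAGE_SIZE 4096

/* Stand-in for the system wide page of zeros. */
static const unsigned char sim_zero_page[SIM_PAGE_SIZE] = {0};

struct sim_mm  { unsigned long rss; };  /* hypothetical slice of mm_struct */
struct sim_pte { const unsigned char *page; bool writable; };

/* Model of the first-touch cases for an anonymous page.
 * Returns 0 on success and -1 if no page could be allocated. */
int sim_anonymous_fault(struct sim_mm *mm, struct sim_pte *pte,
                        bool write_access)
{
    if (!write_access) {
        /* First read: map the shared zero page write protected so a
         * later write faults again. */
        pte->page = sim_zero_page;
        pte->writable = false;
        return 0;
    }

    /* First write: allocate a private zero-filled page. */
    unsigned char *page = calloc(1, SIM_PAGE_SIZE);
    if (page == NULL)
        return -1;

    mm->rss++;              /* only counted once a private page exists */
    pte->page = page;
    pte->writable = true;
    return 0;
}
```

Note that in this sketch a read fault leaves rss untouched, because no private page has been allocated yet.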
If backed by a file or device, a nopage() function will be provided. In the file backed case, the function filemap_nopage() is the nopage() function for allocating a page and reading a page's worth of data from disk. Each device driver provides a different nopage() whose internals are unimportant to us here as long as it returns a valid struct page to use.
On return of the page, a check is made to ensure a page was successfully allocated and appropriate errors are returned if not. A check is then made to see whether an early COW break should take place. An early COW break will take place if the fault is a write to the page and the VM_SHARED flag is not included in the managing VMA. An early break is a case of allocating a new page and copying the data across before reducing the reference count to the page returned by the nopage() function.
In either case, a check is then made with pte_none() to ensure there isn't a PTE already in the page table that is about to be used. It is possible with SMP that two faults would occur for the same page at close to the same time and as the spinlocks are not held for the full duration of the fault, this check has to be made at the last instant. If there has been no race, the PTE is assigned, statistics updated and the architecture hooks for cache coherency called.
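The early COW break test and the last-instant pte_none() check can be sketched as two small helpers. This is a minimal model under stated assumptions: SIM_VM_SHARED is an illustrative flag bit rather than the kernel's definition, a NULL page pointer stands in for pte_none(), and install_if_none() is assumed to run with the relevant page table lock held.

```c
#include <stdbool.h>
#include <stddef.h>

#define SIM_VM_SHARED 0x1UL  /* hypothetical flag bit for illustration */

/* A NULL page pointer models a PTE for which pte_none() is true. */
struct sim_pte { void *page; };

/* An early COW break is needed for a write fault on a mapping that is
 * not shared, as described in the text. */
bool needs_early_cow_break(bool write_access, unsigned long vm_flags)
{
    return write_access && (vm_flags & SIM_VM_SHARED) == 0;
}

/* Install new_page only if the PTE is still empty. Returns false when
 * another fault has already filled it, in which case the caller would
 * release new_page. Assumed to be called with the lock held. */
bool install_if_none(struct sim_pte *pte, void *new_page)
{
    if (pte->page != NULL)
        return false;
    pte->page = new_page;
    return true;
}
```

The second helper shows why the check is repeated at the end: the page was prepared without the lock, so the slot may have been filled in the meantime.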
When a page is swapped out to backing storage, the function do_swap_page() is responsible for reading the page back in. The information needed to find it in swap is stored within the PTE itself. As pages may be shared between multiple processes, they cannot always be swapped out immediately. Instead, when a page is swapped out, it is placed within the swap cache.
A shared page cannot be swapped out immediately because there is no way of mapping a struct page to the PTEs of each process it is shared between. Searching the page tables of all processes is simply far too expensive. It is worth noting that the late 2.5.x kernels and 2.4.x with a custom patch have what is called Reverse Mapping (rmap). With rmap, the PTEs a page is mapped by are linked together in a chain so they can be reverse looked up.
Because of the swap cache, it is possible that when a fault occurs the page still exists in the swap cache. If it does, the reference count to the page is simply increased, it is placed within the process page tables again and the fault registers as a minor page fault.
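A rough model of the swap cache hit path follows. The fixed-size array, struct sim_page and sim_swap_fault() are hypothetical and exist only to show the reference count being raised on a hit; the return values reuse the convention that 1 means a minor fault and 2 a major fault. The real swap cache is not a simple array indexed by swap entry.

```c
#include <stddef.h>

#define SIM_SWAP_SLOTS 8

struct sim_page { int count; };  /* reference count only */

/* Hypothetical swap cache: one optional page per swap entry. */
struct sim_swap_cache { struct sim_page *slot[SIM_SWAP_SLOTS]; };

/* Look the swap entry up in the cache. On a hit the page gains a
 * reference and the fault is minor (1). On a miss the page would have
 * to be read from disk, which is a major fault (2). */
int sim_swap_fault(struct sim_swap_cache *cache, unsigned int entry,
                   struct sim_page **out)
{
    struct sim_page *page = NULL;

    if (entry < SIM_SWAP_SLOTS)
        page = cache->slot[entry];

    if (page != NULL) {
        page->count++;
        *out = page;
        return 1;
    }

    *out = NULL;
    return 2;
}
```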
If the page exists only on disk, swapin_readahead() is
called, which reads in the requested page and a number of pages after it. The
number of pages read in is determined by the variable page_cluster
defined in mm/swap.c. On low memory machines with less than 16MiB
of RAM, it is initialised as 2, or 3 otherwise. The number of pages read in
is 2^page_cluster
unless a bad or empty swap entry is encountered. This
works on the premise that a seek is the most expensive operation in time so
once the seek has completed, the succeeding pages should also be read in.
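As a worked example of the readahead arithmetic, the sketch below assumes the window is 2 to the power of page_cluster pages, which is how the 2.4 source sizes it (1 << page_cluster). The function names are hypothetical, and the early stop on a bad or empty swap entry is not modelled.

```c
/* Initial page_cluster value as described in the text: 2 on machines
 * with less than 16MiB of RAM, 3 otherwise. */
int sim_page_cluster(unsigned long ram_mib)
{
    return ram_mib < 16 ? 2 : 3;
}

/* Maximum number of pages read in by one readahead, assuming the
 * window is 2^page_cluster pages. */
unsigned long sim_readahead_pages(int page_cluster)
{
    return 1UL << page_cluster;
}
```

Under these assumptions a machine with 8MiB of RAM reads at most 4 pages per swap-in, while a machine with 64MiB reads at most 8.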
Traditionally when a process forked, the parent address space was copied to duplicate it for the child. This was an extremely expensive operation as it is possible a significant percentage of the process would have to be swapped in from backing storage. To avoid this considerable overhead, a technique called copy-on-write (COW) is employed.
During fork, the PTEs of the two processes are made read-only so that when a write occurs there will be a page fault. Linux recognizes a COW page because even though the PTE is write protected, the controlling VMA shows the region is writable. It uses the function do_wp_page() to handle it by making a copy of the page and assigning it to the writing process. If necessary, a new swap slot will be reserved for the page. With this method, only the page table entries have to be copied during a fork.
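The COW mechanism can be illustrated with a toy userspace model. The types and functions are hypothetical: sim_fork_pte() shares one page between two PTEs and write protects both, and sim_wp_fault() copies the page for the writer. The shortcut that simply re-enables writing when the faulting PTE holds the only reference is a simplification added here and is not described in the text; swap slot handling is left out.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SIM_PAGE_SIZE 64  /* small page so the example stays cheap */

struct sim_page { int count; unsigned char data[SIM_PAGE_SIZE]; };
struct sim_pte  { struct sim_page *page; bool writable; };

/* Fork: copy only the PTE, share the page and write protect both sides. */
void sim_fork_pte(struct sim_pte *parent, struct sim_pte *child)
{
    parent->writable = false;
    parent->page->count++;
    *child = *parent;
}

/* Write fault on a write protected PTE in a writable region.
 * Returns 0 on success and -1 if the copy could not be allocated. */
int sim_wp_fault(struct sim_pte *pte)
{
    struct sim_page *old = pte->page;

    if (old->count == 1) {
        /* Sole user (assumed shortcut): no copy is needed. */
        pte->writable = true;
        return 0;
    }

    struct sim_page *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return -1;

    memcpy(copy->data, old->data, SIM_PAGE_SIZE);
    copy->count = 1;
    old->count--;          /* the writer drops its share of the old page */
    pte->page = copy;
    pte->writable = true;
    return 0;
}
```

After a fork in this model, a write by the child leaves the parent's data untouched, and a later write by the parent finds itself the only user and needs no copy.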