Library of the rus-linux.net site
Linux Device Drivers, 2nd Edition
By Alessandro Rubini & Jonathan Corbet
2nd Edition June 2001
0-59600-008-1, Order Number: 0081
586 pages, $39.95
Chapter 12
Loading Block Drivers
Contents:
Registering the Driver
The Header File blk.h
Handling Requests: A Simple Introduction
Handling Requests: The Detailed View
How Mounting and Unmounting Works
The ioctl Method
Removable Devices
Partitionable Devices
Interrupt-Driven Block Drivers
Backward Compatibility
Quick Reference

Our discussion thus far has been limited to char drivers. As we have already mentioned, however, char drivers are not the only type of driver used in Linux systems. Here we turn our attention to block drivers. Block drivers provide access to block-oriented devices -- those that transfer data in randomly accessible, fixed-size blocks. The classic block device is a disk drive, though others exist as well.
Registering the Driver
Like char drivers, block drivers in the kernel are identified by major numbers. Block major numbers are entirely distinct from char major numbers, however. A block device with major number 32 can coexist with a char device using the same major number since the two ranges are separate.
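The separation of the two number ranges can be pictured with a small user-space model. This is only a sketch: the two tables and the claim_major helper are hypothetical stand-ins for the kernel's internal bookkeeping, not real kernel interfaces.

```c
#include <stddef.h>

#define MAX_MAJOR 256

/* Hypothetical stand-ins for the kernel's two registration tables:
 * one for char majors, one for block majors. */
static const char *chr_table[MAX_MAJOR];
static const char *blk_table[MAX_MAJOR];

/* Claim `major` in `table`. Returns 0 on success, or -1 if the number
 * is out of range or already taken in that particular table. */
static int claim_major(const char **table, unsigned int major,
                       const char *name)
{
    if (major == 0 || major >= MAX_MAJOR || table[major] != NULL)
        return -1;
    table[major] = name;
    return 0;
}
```

Claiming major 32 in both tables succeeds, because each table is consulted independently; only a second claim on the same table fails.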
The functions for registering and unregistering block devices look similar to those for char devices:
    #include <linux/fs.h>
    int register_blkdev(unsigned int major, const char *name,
                        struct block_device_operations *bdops);
    int unregister_blkdev(unsigned int major, const char *name);

    result = register_blkdev(sbull_major, "sbull", &sbull_bdops);
    if (result < 0) {
        printk(KERN_WARNING "sbull: can't get major %d\n", sbull_major);
        return result;
    }
    if (sbull_major == 0) sbull_major = result; /* dynamic */
    major = sbull_major; /* Use `major' later on to save typing */

The similarity stops here, however. One difference is already evident: register_chrdev took a pointer to a file_operations structure, but register_blkdev uses a structure of type block_device_operations instead -- as it has since kernel version 2.3.38. The structure is still sometimes referred to by the name fops in block drivers; we'll call it bdops to be more faithful to what the structure is and to follow the suggested naming. The definition of this structure is as follows:

    struct block_device_operations {
        int (*open) (struct inode *inode, struct file *filp);
        int (*release) (struct inode *inode, struct file *filp);
        int (*ioctl) (struct inode *inode, struct file *filp,
                      unsigned command, unsigned long argument);
        int (*check_media_change) (kdev_t dev);
        int (*revalidate) (kdev_t dev);
    };

The bdops structure used in sbull is as follows:

    struct block_device_operations sbull_bdops = {
        open:               sbull_open,
        release:            sbull_release,
        ioctl:              sbull_ioctl,
        check_media_change: sbull_check_change,
        revalidate:         sbull_revalidate,
    };

For the purposes of block device registration, however, we must tell the kernel where our request method is. This method is not kept in the
block_device_operations structure, for both historical and performance reasons; instead, it is associated with the queue of pending I/O operations for the device. By default, there is one such queue for each major number. A block driver must initialize that queue with blk_init_queue. Queue initialization and cleanup are defined as follows:

    #include <linux/blkdev.h>
    blk_init_queue(request_queue_t *queue, request_fn_proc *request);
    blk_cleanup_queue(request_queue_t *queue);

    blk_init_queue(BLK_DEFAULT_QUEUE(major), sbull_request);

Each device has a request queue that it uses by default; the macro BLK_DEFAULT_QUEUE(major) is used to indicate that queue when needed. This macro looks into a global array of blk_dev_struct structures called blk_dev, which is maintained by the kernel and indexed by major number. The structure looks like this:

    struct blk_dev_struct {
        request_queue_t request_queue;
        queue_proc *queue;
        void *data;
    };

Figure 12-1. Registering a Block Device Driver
int blk_size[][];
This array is indexed by the major and minor numbers. It describes the size of each device, in kilobytes. If blk_size[major] is NULL, no checking is performed on the size of the device (i.e., the kernel might request data transfers past end-of-device).
int blksize_size[][];
int hardsect_size[][];
int read_ahead[];
int max_readahead[][];
These arrays define the number of sectors to be read in advance by the kernel when a file is being read sequentially. read_ahead applies to all devices of a given type and is indexed by major number; max_readahead applies to individual devices and is indexed by both the major and minor numbers.
int max_sectors[][];
int max_segments[];
size=2048 (kilobytes)
blksize=1024 (bytes)
The software "block" used by the module is one kilobyte, like the system default.
hardsect=512 (bytes)
rahead=2 (sectors)
Because the RAM disk is a fast device, the default read-ahead value is small.
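As a quick check of how these defaults relate to one another, the following user-space sketch derives the byte, sector, and block counts from them. The enum and helper names are made up for illustration; only the numeric defaults come from the list above.

```c
/* sbull's default load-time parameters, as listed above.
 * The identifiers are illustrative, not taken from the driver. */
enum {
    SBULL_SIZE_KB  = 2048,  /* size,     in kilobytes */
    SBULL_BLKSIZE  = 1024,  /* blksize,  in bytes     */
    SBULL_HARDSECT = 512    /* hardsect, in bytes     */
};

/* Device capacity in bytes. */
static unsigned long sbull_bytes(void)
{
    return SBULL_SIZE_KB * 1024UL;
}

/* Number of hardware sectors on the device. */
static unsigned long sbull_sectors(void)
{
    return sbull_bytes() / SBULL_HARDSECT;
}

/* Number of software blocks on the device. */
static unsigned long sbull_blocks(void)
{
    return sbull_bytes() / SBULL_BLKSIZE;
}
```

With 512-byte sectors, the sector count is simply twice the size in kilobytes, which is the same conversion as the `sbull_size << 1` passed to register_disk in the initialization code.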
The initialization of these arrays in sbull is done as follows:

    read_ahead[major] = sbull_rahead;
    result = -ENOMEM; /* for the possible errors */

    sbull_sizes = kmalloc(sbull_devs * sizeof(int), GFP_KERNEL);
    if (!sbull_sizes)
        goto fail_malloc;
    for (i=0; i < sbull_devs; i++) /* all the same size */
        sbull_sizes[i] = sbull_size;
    blk_size[major]=sbull_sizes;

    sbull_blksizes = kmalloc(sbull_devs * sizeof(int), GFP_KERNEL);
    if (!sbull_blksizes)
        goto fail_malloc;
    for (i=0; i < sbull_devs; i++) /* all the same blocksize */
        sbull_blksizes[i] = sbull_blksize;
    blksize_size[major]=sbull_blksizes;

    sbull_hardsects = kmalloc(sbull_devs * sizeof(int), GFP_KERNEL);
    if (!sbull_hardsects)
        goto fail_malloc;
    for (i=0; i < sbull_devs; i++) /* all the same hardsect */
        sbull_hardsects[i] = sbull_hardsect;
    hardsect_size[major]=sbull_hardsects;

    for (i = 0; i < sbull_devs; i++)
        register_disk(NULL, MKDEV(major, i), 1, &sbull_bdops,
                      sbull_size << 1);

The cleanup function used by sbull looks like this:

    for (i=0; i<sbull_devs; i++)
        fsync_dev(MKDEV(sbull_major, i)); /* flush the devices */
    unregister_blkdev(major, "sbull");
    /*
     * Fix up the request queue(s)
     */
    blk_cleanup_queue(BLK_DEFAULT_QUEUE(major));

    /* Clean up the global arrays */
    read_ahead[major] = 0;
    kfree(blk_size[major]);
    blk_size[major] = NULL;
    kfree(blksize_size[major]);
    blksize_size[major] = NULL;
    kfree(hardsect_size[major]);
    hardsect_size[major] = NULL;

Here, the call to fsync_dev is needed to free all references to the device that the kernel keeps in various caches. fsync_dev is the implementation of block_fsync, which is the fsync "method" for block devices.
The Header File blk.h
All block drivers should include the header file <linux/blk.h>. This file defines much of the common code that is used in block drivers, and it provides functions for dealing with the I/O request queue.

Actually, the blk.h header is quite unusual, because it defines several symbols based on the symbol MAJOR_NR, which must be declared by the driver before it includes the header. This convention was developed in the early days of Linux, when all block devices had preassigned major numbers and modular block drivers were not supported.

blk.h makes use of some other predefined, driver-specific symbols as well. The following list describes the symbols in <linux/blk.h> that must be defined in advance; at the end of the list, the code used in sbull is shown.
MAJOR_NR
DEVICE_NAME
The name of the device being created. This string is used in printing error messages.
DEVICE_NR(kdev_t device)
This symbol is used to extract the ordinal number of the physical device from the kdev_t device number. This symbol is used in turn to declare CURRENT_DEV, which can be used within the request function to determine which hardware device owns the minor number involved in a transfer request.
DEVICE_INTR
This symbol is used to declare a pointer variable that refers to the current bottom-half handler. The macros SET_INTR(intr) and CLEAR_INTR are used to assign the variable. Using multiple handlers is convenient when the device can issue interrupts with different meanings.
DEVICE_ON(kdev_t device)
DEVICE_OFF(kdev_t device)
These macros are intended to help devices that need to perform processing before or after a set of transfers is performed; for example, they could be used by a floppy driver to start the drive motor before I/O and to stop it afterward. Modern drivers no longer use these macros, and
DEVICE_ON
does not even get called anymore. Portable drivers, though, should define them (as empty symbols), or compilation errors will result on 2.0 and 2.2 kernels.
DEVICE_NO_RANDOM
By default, the function end_request contributes to system entropy (the amount of collected "randomness"), which is used by /dev/random. If the device isn't able to contribute significant entropy to the random device, DEVICE_NO_RANDOM should be defined. /dev/random was introduced in Section 9.3, "Installing an Interrupt Handler", in Chapter 9, "Interrupt Handling", where SA_SAMPLE_RANDOM was explained.
DEVICE_REQUEST
The sbull driver declares the symbols in the following way:
    #define MAJOR_NR sbull_major /* force definitions on in blk.h */
    static int sbull_major; /* must be declared before including blk.h */

    #define DEVICE_NR(device) MINOR(device) /* has no partition bits */
    #define DEVICE_NAME "sbull" /* name for messaging */
    #define DEVICE_INTR sbull_intrptr /* pointer to bottom half */
    #define DEVICE_NO_RANDOM /* no entropy to contribute */
    #define DEVICE_REQUEST sbull_request
    #define DEVICE_OFF(d) /* do-nothing */

    #include <linux/blk.h>

    #include "sbull.h" /* local definitions */

The blk.h header uses the macros just listed to define some additional macros usable by the driver. We'll describe those macros in the following sections.
Handling Requests: A Simple Introduction
The most important function in a block driver is the request function, which performs the low-level operations related to reading and writing data. This section discusses the basic design of the request procedure.
The Request Queue
    void request_fn(request_queue_t *queue);

The request function should perform the following tasks for each request in the queue:
Check the validity of the request. This test is performed by the macro
INIT_REQUEST
, defined in blk.h; the test consists of looking for problems that could indicate a bug in the system's request queue handling.
    void sbull_request(request_queue_t *q)
    {
        while(1) {
            INIT_REQUEST;
            printk("<1>request %p: cmd %i sec %li (nr. %li)\n", CURRENT,
                   CURRENT->cmd,
                   CURRENT->sector,
                   CURRENT->current_nr_sectors);
            end_request(1); /* success */
        }
    }

The request function has one very important constraint: it must be atomic. request is not usually called in direct response to user requests, and it is not running in the context of any particular process. It can be called at interrupt time, from tasklets, or from any number of other places. Thus, it must not sleep while carrying out its tasks.
Performing the Actual Data Transfer
To understand how to build a working request function for sbull, let's look at how the kernel describes a request within a struct request. The structure is defined in <linux/blkdev.h>. By accessing the fields in the request structure, usually by way of CURRENT, the driver can retrieve all the information needed to transfer data between the buffer cache and the physical block device.[48] CURRENT is just a pointer into blk_dev[MAJOR_NR].request_queue. The following fields of a request hold information that is useful to the request function:
kdev_t rq_dev;
int cmd;
unsigned long sector;
The number of the first sector to be transferred in this request.
unsigned long current_nr_sectors;
unsigned long nr_sectors;
char *buffer;
struct buffer_head *bh;
    void sbull_request(request_queue_t *q)
    {
        Sbull_Dev *device;
        int status;

        while(1) {
            INIT_REQUEST; /* returns when queue is empty */

            /* Which "device" are we using? */
            device = sbull_locate_device (CURRENT);
            if (device == NULL) {
                end_request(0);
                continue;
            }

            /* Perform the transfer and clean up. */
            spin_lock(&device->lock);
            status = sbull_transfer(device, CURRENT);
            spin_unlock(&device->lock);
            end_request(status);
        }
    }

    static Sbull_Dev *sbull_locate_device(const struct request *req)
    {
        int devno;
        Sbull_Dev *device;

        /* Check if the minor number is in range */
        devno = DEVICE_NR(req->rq_dev);
        if (devno >= sbull_devs) {
            static int count = 0;
            if (count++ < 5) /* print the message at most five times */
                printk(KERN_WARNING "sbull: request for unknown device\n");
            return NULL;
        }
        device = sbull_devices + devno; /* Pick it out of device array */
        return device;
    }

The actual I/O of the request is handled by sbull_transfer:

    static int sbull_transfer(Sbull_Dev *device, const struct request *req)
    {
        int size;
        u8 *ptr;

        ptr = device->data + req->sector * sbull_hardsect;
        size = req->current_nr_sectors * sbull_hardsect;

        /* Make sure that the transfer fits within the device. */
        if (ptr + size > device->data + sbull_blksize*sbull_size) {
            static int count = 0;
            if (count++ < 5)
                printk(KERN_WARNING "sbull: request past end of device\n");
            return 0;
        }

        /* Looks good, do the transfer. */
        switch(req->cmd) {
            case READ:
                memcpy(req->buffer, ptr, size); /* from sbull to buffer */
                return 1;
            case WRITE:
                memcpy(ptr, req->buffer, size); /* from buffer to sbull */
                return 1;
            default:
                /* can't happen */
                return 0;
        }
    }

Since sbull is just a RAM disk, its "data transfer" reduces to a memcpy call.
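The same idea can be tried outside the kernel. The following is a minimal user-space model of a RAM-disk transfer, not the driver's code: the toy geometry, the CMD_READ and CMD_WRITE constants, and the ram_transfer name are all assumptions made for illustration. Unlike the pointer comparison in sbull_transfer, this version does its bounds check in sector units.

```c
#include <string.h>

enum { HARDSECT = 512, NSECTORS = 8 };   /* assumed toy geometry */
enum { CMD_READ = 0, CMD_WRITE = 1 };    /* stand-ins for READ and WRITE */

static unsigned char disk[HARDSECT * NSECTORS];  /* the "device" memory */

/* Transfer `nsect` sectors starting at `sector`.
 * Returns 1 on success and 0 on failure, the same convention that
 * end_request expects for its status argument. */
static int ram_transfer(int cmd, unsigned long sector,
                        unsigned long nsect, unsigned char *buffer)
{
    unsigned char *ptr;
    size_t size;

    /* Make sure that the transfer fits within the device. Comparing
     * in sectors avoids forming an out-of-range pointer. */
    if (sector > NSECTORS || nsect > NSECTORS - sector)
        return 0;

    ptr = disk + sector * HARDSECT;
    size = nsect * HARDSECT;

    switch (cmd) {
    case CMD_READ:
        memcpy(buffer, ptr, size);   /* from the RAM disk to the buffer */
        return 1;
    case CMD_WRITE:
        memcpy(ptr, buffer, size);   /* from the buffer to the RAM disk */
        return 1;
    default:
        return 0;                    /* unknown command */
    }
}
```

Writing a sector and reading it back returns the same bytes, while a request that runs past the last sector is refused before any memory is touched.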
Handling Requests: The Detailed View
The sbull driver as described earlier works very well. In simple situations (as with sbull), the macros from <linux/blk.h> can be used to easily set up a request function and get a working driver. As has already been mentioned, however, block drivers are often a performance-critical part of the kernel. Drivers based on the simple code shown earlier will likely not perform very well in many situations, and can also be a drag on the system as a whole. In this section we get into the details of how the I/O request queue works with an eye toward writing a faster, more efficient driver.

The I/O Request Queue
The request structure and the buffer cache
The design of the
request
structure is driven by the Linux memory management scheme. Like most Unix-like systems, Linux maintains a buffer cache, a region of memory that is used to hold copies of blocks stored on disk. A great many "disk" operations performed at higher levels of the kernel -- such as in the filesystem code -- act only on the buffer cache and do not generate any actual I/O operations. Through aggressive caching the kernel can avoid many read operations altogether, and multiple writes can often be merged into a single physical write to disk.
char *b_data;
unsigned long b_size;
kdev_t b_rdev;
The device holding the block represented by this buffer head.
unsigned long b_rsector;
struct buffer_head *b_reqnext;
A pointer to a linked list of buffer head structures in the request queue.
void (*b_end_io)(struct buffer_head *bh, int uptodate);
Every block passed to a driver's request function either lives in the buffer cache, or, on rare occasion, lives elsewhere but has been made to look as if it lived in the buffer cache.[49] As a result, every request passed to the driver deals with one or more buffer_head structures. The request structure contains a member (called simply bh) that points to a linked list of these structures; satisfying the request requires performing the indicated I/O operation on each buffer in the list. Figure 12-2 shows how the request queue and buffer_head structures fit together.

Figure 12-2. Buffers in the I/O Request Queue
Request queue manipulation
struct request *blkdev_entry_next_request(struct list_head *head);
struct request *blkdev_next_request(struct request *req);
struct request *blkdev_prev_request(struct request *req);
Given a request structure, return the next or previous structure in the request queue.
blkdev_dequeue_request(struct request *req);
blkdev_release_request(struct request *req);
Releases a request structure back to the kernel when it has been completely executed. Each request queue maintains its own free list of request structures (two, actually: one for reads and one for writes); this function places a structure back on the proper free list. blkdev_release_request will also wake up any processes that are waiting on a free request structure.
All of these functions require that the io_request_lock be held, which we will discuss next.

The I/O request lock
The I/O request queue is a complex data structure that is accessed in many places in the kernel. It is entirely possible that the kernel needs to add more requests to the queue at the same time that your driver is taking requests off. The queue is thus subject to the usual sort of race conditions, and must be protected accordingly.
How the blk.h macros and functions work
The fields of the request structure that we looked at earlier -- sector, current_nr_sectors, and buffer -- are really just copies of the analogous information stored in the first buffer_head structure on the list. Thus, a request function that uses this information from the CURRENT pointer is just processing the first of what might be many buffers within the request. The task of splitting up a multibuffer request into (seemingly) independent, single-buffer requests is handled by two important definitions in <linux/blk.h>: the INIT_REQUEST macro and the end_request function.
Complete the I/O processing on the current buffer; this involves calling the b_end_io function with the status of the operation, thus waking any process that may be sleeping on the buffer.
Release the finished request back to the system;
io_request_lock
is required here too.
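To make the buffer-by-buffer completion concrete, here is a small user-space model of the behavior described above. It is a sketch only: the toy_bh and toy_request structures are cut-down, hypothetical versions of buffer_head and request, and toy_end_first imitates the documented effect of completing the first buffer and advancing to the next, without any locking or real I/O.

```c
#include <stddef.h>

struct toy_bh {                   /* simplified buffer_head */
    unsigned long rsector;        /* like b_rsector: first sector */
    unsigned long size;           /* like b_size: length in bytes */
    int done;                     /* set when "b_end_io" has run */
    int uptodate;                 /* status passed to "b_end_io" */
    struct toy_bh *reqnext;       /* like b_reqnext: next buffer */
};

struct toy_request {              /* simplified struct request */
    unsigned long sector;
    unsigned long current_nr_sectors;
    struct toy_bh *bh;            /* first buffer still to be handled */
};

/* Copy the first buffer's details into the request, so that code
 * looking only at the request sees a single-buffer transfer. */
static void toy_load_first(struct toy_request *req)
{
    req->sector = req->bh->rsector;
    req->current_nr_sectors = req->bh->size / 512;
}

/* Complete the first buffer and step to the next one.
 * Returns 1 if buffers remain, 0 when the request is finished. */
static int toy_end_first(struct toy_request *req, int status)
{
    struct toy_bh *bh = req->bh;

    bh->done = 1;                 /* the model's "b_end_io" */
    bh->uptodate = status;
    req->bh = bh->reqnext;
    bh->reqnext = NULL;

    if (req->bh == NULL)
        return 0;
    toy_load_first(req);
    return 1;
}
```

A request function built on this model would loop with `do { transfer(req); } while (toy_end_first(req, status));`, handling one buffer per pass until the list is exhausted.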
    int end_that_request_first(struct request *req, int status, char *name);
    void end_that_request_last(struct request *req);

In end_request this step is handled with this code:

    struct request *req = CURRENT;
    blkdev_dequeue_request(req);
    end_that_request_last(req);

Clustered Requests
The time has come to look at how to apply all of that background material to the task of writing better block drivers. We'll start with a look at the handling of clustered requests. Clustering, as mentioned earlier, is simply the practice of joining together requests that operate on adjacent blocks on the disk. There are two advantages to doing things this way. First, clustering speeds up the transfer; clustering can also save some memory in the kernel by avoiding allocation of redundant
request
structures.

When the I/O on each buffer completes, your driver should notify the kernel by calling the buffer's I/O completion routine:

    bh->b_end_io(bh, status);

The active queue head
One other detail regarding the behavior of the I/O request queue is relevant for block drivers that are dealing with clustering. It has to do with the queue head -- the first request on the queue. For historical compatibility reasons, the kernel (almost) always assumes that a block driver is processing the first entry in the request queue. To avoid corruption resulting from conflicting activity, the kernel will never modify a request once it gets to the head of the queue. No further clustering will happen on that request, and the elevator code will not put other requests in front of it.
    blk_queue_headactive(request_queue_t *queue, int active);

If active is 0, the kernel will be able to make changes to the head of the request queue.

Multiqueue Block Drivers
As we have seen, the kernel, by default, maintains a single I/O request queue for each major number. The single queue works well for devices like sbull, but it is not always optimal for real-world situations.
    request_queue_t queue;
    int busy;

The busy flag is used to protect against request function reentrancy, as we will see.

    for (i = 0; i < sbull_devs; i++) {
        blk_init_queue(&sbull_devices[i].queue, sbull_request);
        blk_queue_headactive(&sbull_devices[i].queue, 0);
    }
    blk_dev[major].queue = sbull_find_queue;

The call to blk_init_queue is as we have seen before, only now we pass in the device-specific queues instead of the default queue for our major device number. This code also marks the queues as not having active heads.
    request_queue_t *sbull_find_queue(kdev_t device)
    {
        int devno = DEVICE_NR(device);

        if (devno >= sbull_devs) {
            static int count = 0;
            if (count++ < 5) /* print the message at most five times */
                printk(KERN_WARNING "sbull: request for unknown device\n");
            return NULL;
        }
        return &sbull_devices[devno].queue;
    }

Like the request function, sbull_find_queue must be atomic (no sleeping allowed).
Each queue has its own request function, though usually a driver will use the same function for all of its queues. The kernel passes the actual request queue into the request function as a parameter, so the function can always figure out which device is being operated on. The multiqueue request function used in sbull looks a little different from the ones we have seen so far because it manipulates the request queue directly. It also drops the io_request_lock while performing transfers to allow the kernel to execute other block operations. Finally, the code must take care to avoid two separate perils: multiple calls of the request function and conflicting access to the device itself.

    void sbull_request(request_queue_t *q)
    {
        Sbull_Dev *device;
        struct request *req;
        int status;

        /* Find our device */
        device = sbull_locate_device (blkdev_entry_next_request(&q->queue_head));
        if (device->busy) /* no race here - io_request_lock held */
            return;
        device->busy = 1;

        /* Process requests in the queue */
        while(! list_empty(&q->queue_head)) {

            /* Pull the next request off the list. */
            req = blkdev_entry_next_request(&q->queue_head);
            blkdev_dequeue_request(req);
            spin_unlock_irq (&io_request_lock);
            spin_lock(&device->lock);

            /* Process all of the buffers in this (possibly clustered) request. */
            do {
                status = sbull_transfer(device, req);
            } while (end_that_request_first(req, status, DEVICE_NAME));
            spin_unlock(&device->lock);
            spin_lock_irq (&io_request_lock);
            end_that_request_last(req);
        }
        device->busy = 0;
    }

Multiqueue drivers must, of course, clean up all of their queues at module removal time:
    for (i = 0; i < sbull_devs; i++)
        blk_cleanup_queue(&sbull_devices[i].queue);
    blk_dev[major].queue = NULL;

That covers the mechanics of multiqueue drivers. Drivers handling real hardware may have other issues to deal with, of course, such as serializing access to a controller. But the basic structure of multiqueue drivers is as we have seen here.
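The role of the busy flag in the multiqueue request function can be isolated in a small, single-threaded model. Everything here is hypothetical scaffolding: toy_dev stands in for Sbull_Dev, and the nested call inside toy_transfer imitates the kernel invoking the request function again while the io_request_lock is dropped. In the real driver the flag is tested and set under that lock, which is what makes the check race free; this sketch does not model the locking.

```c
struct toy_dev {
    int busy;       /* nonzero while the request function is running */
    int pending;    /* requests still queued */
    int handled;    /* requests completed */
    int rejected;   /* nested calls turned away by the guard */
};

static void toy_request(struct toy_dev *dev);

/* Stand-in for a transfer done with the queue lock dropped: while it
 * is in flight, the request function gets called a second time. */
static void toy_transfer(struct toy_dev *dev)
{
    toy_request(dev);   /* re-entrant call during the transfer */
    dev->handled++;
}

static void toy_request(struct toy_dev *dev)
{
    if (dev->busy) {    /* someone is already draining the queue */
        dev->rejected++;
        return;
    }
    dev->busy = 1;

    while (dev->pending > 0) {
        dev->pending--;
        toy_transfer(dev);
    }
    dev->busy = 0;
}
```

Each nested call returns at once, and the outer invocation still drains every queued request, so nothing is lost by turning the re-entrant calls away.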
Doing Without the Request Queue
Not all block devices benefit from the request queue, however. sbull, for example, processes requests synchronously and has no problems with seek times. For sbull, the request queue actually ends up slowing things down. Other types of block devices also can be better off without a request queue. For example, RAID devices, which are made up of multiple disks, often spread "contiguous" blocks across multiple physical devices. Block devices implemented by the logical volume manager (LVM) capability (which first appeared in 2.4) also have an implementation that is more complex than the block interface that is presented to the rest of the kernel.
In the 2.4 kernel, block I/O requests are placed on the queue by the function __make_request, which is also responsible for invoking the driver's request function. Block drivers that need more control over request queueing, however, can replace that function with their own "make request" function. The RAID and LVM drivers do so, providing their own variant that, eventually, requeues each I/O request (with different block numbers) to the appropriate low-level device (or devices) that make up the higher-level device. A RAM-disk driver, instead, can execute the I/O operation directly.
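What "requeueing with different block numbers" means can be illustrated with plain striping arithmetic. The sketch below is not the RAID or LVM code; it assumes the simplest possible layout, in which fixed-size chunks are dealt out to the member disks in round-robin order, and the stripe_target and stripe_map names are invented for the example.

```c
struct stripe_target {
    unsigned int  disk;     /* index of the member device */
    unsigned long sector;   /* sector within that member */
};

/* Map a sector of the striped device onto a member disk.
 * `chunk` is the number of sectors per chunk and `ndisks` the number
 * of members; both must be nonzero. */
static struct stripe_target stripe_map(unsigned long sector,
                                       unsigned long chunk,
                                       unsigned int ndisks)
{
    unsigned long chunk_no = sector / chunk;   /* which chunk overall */
    unsigned long offset   = sector % chunk;   /* position inside it */
    struct stripe_target t;

    t.disk   = (unsigned int)(chunk_no % ndisks);
    t.sector = (chunk_no / ndisks) * chunk + offset;
    return t;
}
```

With eight-sector chunks on two disks, logical sectors 7 and 8 are adjacent on the striped device but land on different members, which is why a single sorted request queue for the higher-level device is of little use.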
    void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

The make_request_fn type, in turn, is defined as follows:

    typedef int (make_request_fn) (request_queue_t *q, int rw,
                                   struct buffer_head *bh);

The "make request" function must arrange to transfer the given block, and see to it that the b_end_io function is called when the transfer is done. The kernel does not hold the io_request_lock when calling the make_request_fn function, so the function must acquire the lock itself if it will be manipulating the request queue. If the transfer has been set up (not necessarily completed), the function should return 0.

sbull, at initialization time, sets up its make request function as follows:
    if (noqueue)
        blk_queue_make_request(BLK_DEFAULT_QUEUE(major), sbull_make_request);

    int sbull_make_request(request_queue_t *queue, int rw,
                           struct buffer_head *bh)
    {
        u8 *ptr;

        /* Figure out what we are doing */
        Sbull_Dev *device = sbull_devices + MINOR(bh->b_rdev);
        ptr = device->data + bh->b_rsector * sbull_hardsect;

        /* Paranoid check; this apparently can really happen */
        if (ptr + bh->b_size > device->data + sbull_blksize*sbull_size) {
            static int count = 0;
            if (count++ < 5)
                printk(KERN_WARNING "sbull: request past end of device\n");
            bh->b_end_io(bh, 0);
            return 0;
        }

        /* This could be a high-memory buffer; shift it down */
    #if CONFIG_HIGHMEM
        bh = create_bounce(rw, bh);
    #endif

        /* Do the transfer */
        switch(rw) {
            case READ:
            case READA: /* Read ahead */
                memcpy(bh->b_data, ptr, bh->b_size); /* from sbull to buffer */
                bh->b_end_io(bh, 1);
                break;
            case WRITE:
                refile_buffer(bh);
                memcpy(ptr, bh->b_data, bh->b_size); /* from buffer to sbull */
                mark_buffer_uptodate(bh, 1);
                bh->b_end_io(bh, 1);
                break;
            default:
                /* can't happen */
                bh->b_end_io(bh, 0);
                break;
        }

        /* A zero return means we're done */
        return 0;
    }

There is, however, one detail that the "make request" function must take care of. The buffer to be transferred could be resident in high memory, which is not directly accessible by the kernel. High memory is covered in detail in Chapter 13, "mmap and DMA". We won't repeat the discussion here; suffice it to say that one way to deal with the problem is to replace a high-memory buffer with one that is in accessible memory. The function create_bounce will do so, in a way that is transparent to the driver. The kernel normally uses create_bounce before placing buffers in the driver's request queue; if the driver implements its own make_request_fn, however, it must take care of this task itself.
How Mounting and Unmounting Works
Block devices differ from char devices and normal files in that they can be mounted on the computer's filesystem. Mounting provides a level of indirection not seen with char devices, which are accessed through a struct file pointer that is held by a specific process. When a filesystem is mounted, there is no process holding that file structure.

When the kernel mounts a device in the filesystem, it invokes the normal open method to access the driver. However, in this case both the filp and inode arguments to open are dummy variables. In the file structure, only the f_mode and f_flags fields hold anything meaningful; in the inode structure only i_rdev may be used. The remaining fields hold random values and should not be used. The value of f_mode tells the driver whether the device is to be mounted read-only (f_mode == FMODE_READ) or read/write (f_mode == (FMODE_READ|FMODE_WRITE)).

    int sbull_release (struct inode *inode, struct file *filp)
    {
        Sbull_Dev *dev = sbull_devices + MINOR(inode->i_rdev);

        spin_lock(&dev->lock);
        dev->usage--;
        MOD_DEC_USE_COUNT;
        spin_unlock(&dev->lock);
        return 0;
    }

The ioctl Method
Like char devices, block devices can be acted on by using the ioctl system call. The only relevant difference between block and char ioctl implementations is that block drivers share a number of common ioctl commands that most drivers are expected to support.
The commands that block drivers usually handle are the following, declared in
<linux/fs.h>
.
BLKGETSIZE
BLKFLSBUF
BLKRRPART
BLKRAGET
BLKRASET
Used to get and change the current block-level read-ahead value (the one stored in the read_ahead array) for the device. For GET, the current value should be written to user space as a long item using the pointer passed to ioctl in arg; for SET, the new value is passed as an argument.
BLKFRAGET
BLKFRASET
Get and set the filesystem-level read-ahead value (the one stored in
max_readahead
) for this device.
BLKROSET
BLKROGET
These commands are used to change and check the read-only flag for the device.
BLKSECTGET
BLKSECTSET
These commands retrieve and set the maximum number of sectors per request (as stored in
max_sectors
).
BLKSSZGET
BLKPG
BLKELVGET
BLKELVSET
These commands allow some control over how the elevator request sorting algorithm works. As with
BLKPG
, no driver implements them directly.
HDIO_GETGEO
Defined in <linux/hdreg.h> and used to retrieve the disk geometry. The geometry should be written to user space in a struct hd_geometry, which is declared in hdreg.h as well. sbull shows the general implementation for this command.
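For a virtual device, the geometry reported through HDIO_GETGEO has to be invented. The arithmetic can be sketched in user space as follows, assuming the made-up layout of 4 heads and 16 sectors per track that sbull's implementation claims. The toy_geometry structure and make_geometry function are illustrative names only; they are not the kernel's struct hd_geometry.

```c
struct toy_geometry {
    unsigned int  heads;
    unsigned int  sectors;     /* sectors per track */
    unsigned int  cylinders;
    unsigned long start;       /* first data sector */
};

/* Build a plausible geometry for a device of `size_kb` kilobytes
 * with `hardsect`-byte hardware sectors. */
static struct toy_geometry make_geometry(unsigned long size_kb,
                                         unsigned long hardsect)
{
    unsigned long total = size_kb * 1024UL / hardsect;  /* all sectors */
    struct toy_geometry g;

    g.heads     = 4;
    g.sectors   = 16;
    /* 4 heads * 16 sectors = 64 sectors per cylinder; any remainder
     * is dropped, as with the driver's (size & ~0x3f) >> 6. */
    g.cylinders = (unsigned int)(total / (g.heads * g.sectors));
    g.start     = 4;
    return g;
}
```

For the default 2048-KB device with 512-byte sectors this yields 64 cylinders, and heads times sectors times cylinders gives back all 4096 sectors.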
Almost all of these ioctl commands are implemented in the same way for all block devices. The 2.4 kernel has provided a function, blk_ioctl, that may be called to implement the common commands; it is declared in <linux/blkpg.h>. Often the only ones that must be implemented in the driver itself are BLKGETSIZE and HDIO_GETGEO. The driver can then safely pass any other commands to blk_ioctl for handling.

    int sbull_ioctl (struct inode *inode, struct file *filp,
                     unsigned int cmd, unsigned long arg)
    {
        int err;
        long size;
        struct hd_geometry geo;

        PDEBUG("ioctl 0x%x 0x%lx\n", cmd, arg);
        switch(cmd) {

          case BLKGETSIZE:
            /* Return the device size, expressed in sectors */
            if (!arg) return -EINVAL; /* NULL pointer: not valid */
            err = ! access_ok (VERIFY_WRITE, arg, sizeof(long));
            if (err) return -EFAULT;
            size = blksize*sbull_sizes[MINOR(inode->i_rdev)]
                    / sbull_hardsects[MINOR(inode->i_rdev)];
            if (copy_to_user((long *) arg, &size, sizeof (long)))
                return -EFAULT;
            return 0;

          case BLKRRPART: /* reread partition table: can't do it */
            return -ENOTTY;

          case HDIO_GETGEO:
            /*
             * Get geometry: since we are a virtual device, we have to make
             * up something plausible. So we claim 16 sectors, four heads,
             * and calculate the corresponding number of cylinders. We set
             * the start of data at sector four.
             */
            err = ! access_ok(VERIFY_WRITE, arg, sizeof(geo));
            if (err) return -EFAULT;
            size = sbull_size * blksize / sbull_hardsect;
            geo.cylinders = (size & ~0x3f) >> 6;
            geo.heads = 4;
            geo.sectors = 16;
            geo.start = 4;
            if (copy_to_user((void *) arg, &geo, sizeof(geo)))
                return -EFAULT;
            return 0;

          default:
            /*
             * For ioctls we don't understand, let the block layer
             * handle them.
             */
            return blk_ioctl(inode->i_rdev, cmd, arg);
        }

        return -ENOTTY; /* unknown command */
    }

The PDEBUG statement at the beginning of the function has been left in so that when you compile the module, you can turn on debugging to see which ioctl commands are invoked on the device.

Removable Devices
Thus far, we have ignored the final two file operations in the block_device_operations structure, which deal with devices that support removable media. It's now time to look at them; sbull isn't actually removable but it pretends to be, and therefore it implements these methods.

This kind of "timely expiration" is implemented using a kernel timer.
check_media_change
    int sbull_check_change(kdev_t i_rdev)
    {
        int minor = MINOR(i_rdev);
        Sbull_Dev *dev = sbull_devices + minor;

        PDEBUG("check_change for dev %i\n",minor);
        if (dev->data)
            return 0; /* still valid */
        return 1; /* expired */
    }

Revalidation
The validation function is called when a disk change is detected. It is also called by the various stat system calls implemented in version 2.1 of the kernel. The return value is currently unused; to be safe, return 0 to indicate success and a negative error code in case of error.
    int sbull_revalidate(kdev_t i_rdev)
    {
        Sbull_Dev *dev = sbull_devices + MINOR(i_rdev);

        PDEBUG("revalidate for dev %i\n",MINOR(i_rdev));
        if (dev->data)
            return 0;
        dev->data = vmalloc(dev->size);
        if (!dev->data)
            return -ENOMEM;
        return 0;
    }

Extra Care
    int check_disk_change(kdev_t dev);

    int sbull_open (struct inode *inode, struct file *filp)
    {
        Sbull_Dev *dev; /* device information */
        int num = MINOR(inode->i_rdev);

        if (num >= sbull_devs) return -ENODEV;
        dev = sbull_devices + num;

        spin_lock(&dev->lock);
        /* revalidate on first open and fail if no data is there */
        if (!dev->usage) {
            check_disk_change(inode->i_rdev);
            if (!dev->data)
            {
                spin_unlock (&dev->lock);
                return -ENOMEM;
            }
        }
        dev->usage++;
        spin_unlock(&dev->lock);
        MOD_INC_USE_COUNT;
        return 0; /* success */
    }

Nothing else needs to be done in the driver for a disk change. Data is corrupted anyway if a disk is changed while its open count is greater than zero. The only way the driver can prevent this problem from happening is for the usage count to control the door lock in those cases where the physical device supports it. Then open and close can disable and enable the lock appropriately.
Partitionable Devices
Most block devices are not used in one large chunk. Instead, the system administrator expects to be able to partition the device -- to split it into several independent pseudodevices. If you try to create partitions on an sbull device with fdisk, you'll run into problems. The fdisk program calls the partitions /dev/sbull01, /dev/sbull02, and so on, but those names don't exist on the filesystem. More to the point, there is no mechanism in place for binding those names to partitions in the sbull device. Something more must be done before a block device can be partitioned.
The device nodes implemented by spull are called pd, for "partitionable disk." The four whole devices (also called units) are thus named /dev/pda through /dev/pdd; each device supports at most 15 partitions. Minor numbers have the following meaning: the least significant four bits represent the partition number (where 0 is the whole device), and the most significant four bits represent the unit number. This convention is expressed in the source file by the following macros:

    #define MAJOR_NR spull_major /* force definitions on in blk.h */
    int spull_major; /* must be declared before including blk.h */

    #define SPULL_SHIFT 4 /* max 16 partitions */
    #define SPULL_MAXNRDEV 4 /* max 4 device units */
    #define DEVICE_NR(device) (MINOR(device)>>SPULL_SHIFT)
    #define DEVICE_NAME "pd" /* name for messaging */

The spull driver also hardwires the value of the hard-sector size in order to simplify the code:

    #define SPULL_HARDSECT 512 /* 512-byte hardware sectors */

The Generic Hard Disk
Every partitionable device needs to know how it is partitioned. The information is available in the partition table, and part of the initialization process consists of decoding the partition table and updating the internal data structures to reflect the partition information.
A block driver that supports partitions must include <linux/genhd.h> and should declare a struct gendisk structure. This structure describes the layout of the disk(s) provided by the driver; the kernel maintains a global list of such structures, which may be queried to see what disks and partitions are available on the system.
int major
The major number for the device that the structure refers to.
const char *major_name
The base name for devices belonging to this major number. Each device name is derived from this name by adding a letter for each unit and a number for each partition. For example, "hd" is the base name that is used to build /dev/hda1 and /dev/hdb3. In modern kernels, the full length of the disk name can be up to 32 characters; the 2.0 kernel, however, was more restricted. Drivers wishing to be backward portable to 2.0 should limit the major_name field to five characters. The name for spull is pd ("partitionable disk").
int minor_shift
The number of bit shifts needed to extract the drive number from the device minor number. In spull the number is 4. The value in this field should be consistent with the definition of the macro DEVICE_NR(device) (see "The Header File blk.h"). The macro in spull expands to device>>4.
int max_p
The maximum number of partitions. In our example, max_p is 16, or more generally, 1 << minor_shift.
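struct hd_struct *part
The decoded partition information for the device, one entry per minor number. spull stores its spull_partitions array here.
int *sizes
An array holding the size of each minor device. spull stores the same spull_sizes array here and in blk_size[MAJOR_NR].
int nr_real
The number of real devices (units) that exist. spull sets it to spull_devs.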
void *real_devices
A private area that may be used by the driver to keep any additional required information.
struct gendisk *next
A pointer used to implement the linked list of generic hard-disk structures.
struct block_device_operations *fops
A pointer to the block operations structure for this device.
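To illustrate how minor_shift and max_p relate to minor numbers and device names, here is a small user-space sketch. It is not kernel code: the helpers minor_to_unit, minor_to_partition, make_minor, and format_disk_name are hypothetical names introduced for this example, and the shift of 4 and the pd base name are the spull values described in this section.

```c
#include <stdio.h>

#define SPULL_SHIFT    4                    /* minor_shift used by spull */
#define PARTS_PER_UNIT (1 << SPULL_SHIFT)   /* max_p: 16 minors per unit */

/* Unit (drive) number: the high bits of the minor number. */
unsigned int minor_to_unit(unsigned int minor)
{
    return minor >> SPULL_SHIFT;
}

/* Partition number: the low bits; 0 denotes the whole device. */
unsigned int minor_to_partition(unsigned int minor)
{
    return minor & (PARTS_PER_UNIT - 1);
}

/* Inverse mapping: build a minor number from unit and partition. */
unsigned int make_minor(unsigned int unit, unsigned int partition)
{
    return (unit << SPULL_SHIFT) | partition;
}

/* Build a name such as "pda" (whole unit 0) or "pdb3" (unit 1,
 * partition 3): base name, a letter per unit, a number per partition.
 * Returns the value of snprintf. */
int format_disk_name(unsigned int minor, char *buf, size_t len)
{
    unsigned int unit = minor_to_unit(minor);
    unsigned int part = minor_to_partition(minor);

    if (part == 0)
        return snprintf(buf, len, "pd%c", 'a' + (int)unit);
    return snprintf(buf, len, "pd%c%u", 'a' + (int)unit, part);
}
```

With these definitions, minor number 19 (binary 0001 0011) decodes to unit 1, partition 3, and is named pdb3, while minor number 48 is the whole fourth unit, pdd.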

    struct gendisk spull_gendisk = {
        major:       0,                /* Major number assigned later */
        major_name:  "pd",             /* Name of the major device */
        minor_shift: SPULL_SHIFT,      /* Shift to get device number */
        max_p:       1 << SPULL_SHIFT, /* Number of partitions */
        fops:        &spull_bdops,     /* Block dev operations */
        /* everything else is dynamic */
    };

Partition Detection
When a module initializes itself, it must set things up properly for partition detection. Thus, spull starts by setting up the spull_sizes array for the gendisk structure (which also gets stored in blk_size[MAJOR_NR] and in the sizes field of the gendisk structure) and the spull_partitions array, which holds the actual partition information (and gets stored in the part member of the gendisk structure). Both of these arrays are initialized to zeros at this time. The code looks like this:

    spull_sizes = kmalloc((spull_devs << SPULL_SHIFT) * sizeof(int),
                          GFP_KERNEL);
    if (!spull_sizes)
        goto fail_malloc;

    /* Start with zero-sized partitions, and correctly sized units */
    memset(spull_sizes, 0, (spull_devs << SPULL_SHIFT) * sizeof(int));
    for (i = 0; i < spull_devs; i++)
        spull_sizes[i << SPULL_SHIFT] = spull_size;
    blk_size[MAJOR_NR] = spull_gendisk.sizes = spull_sizes;

    /* Allocate the partitions array. */
    spull_partitions = kmalloc((spull_devs << SPULL_SHIFT) *
                               sizeof(struct hd_struct), GFP_KERNEL);
    if (!spull_partitions)
        goto fail_malloc;

    memset(spull_partitions, 0, (spull_devs << SPULL_SHIFT) *
           sizeof(struct hd_struct));
    /* fill in whole-disk entries */
    for (i = 0; i < spull_devs; i++)
        spull_partitions[i << SPULL_SHIFT].nr_sects =
            spull_size * (blksize / SPULL_HARDSECT);
    spull_gendisk.part = spull_partitions;
    spull_gendisk.nr_real = spull_devs;

    spull_gendisk.next = gendisk_head;
    gendisk_head = &spull_gendisk;

In practice, the only thing the system does with this list is to implement /proc/partitions.
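The layout of the per-minor size table built by the initialization code in this subsection can be reproduced in user space. This is a rough sketch rather than driver code: build_sizes and kb_to_sectors are hypothetical helpers, calloc stands in for the kmalloc and memset pair, and the conversion assumes that device sizes are kept in 1 KB units while nr_sects counts 512-byte sectors, which matches the spull_size << 1 expression used by spull_revalidate.

```c
#include <stdlib.h>

#define SPULL_SHIFT    4     /* 16 minors per unit, as in the text */
#define SPULL_HARDSECT 512   /* bytes per hardware sector */

/* Hypothetical helper: build the per-minor size table (in KB) for
 * `devs` units of `size_kb` kilobytes each. Every partition entry
 * starts at zero; only the whole-disk entry of each unit is filled.
 * Returns NULL on allocation failure; the caller frees the table. */
int *build_sizes(int devs, int size_kb)
{
    int i;
    int n = devs << SPULL_SHIFT;
    int *sizes = calloc((size_t)n, sizeof(int));

    if (!sizes)
        return NULL;
    for (i = 0; i < devs; i++)
        sizes[i << SPULL_SHIFT] = size_kb;  /* minor 0, 16, 32, ... */
    return sizes;
}

/* Convert a size in kilobytes to a count of 512-byte sectors. */
long kb_to_sectors(long size_kb)
{
    return size_kb * (1024 / SPULL_HARDSECT);
}
```

For four units of 2048 KB each, entries 0, 16, 32, and 48 hold 2048 and all other entries hold 0 until partition detection fills them in; each whole-disk entry corresponds to 4096 sectors.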

    register_disk(struct gendisk *gd, int drive, unsigned minors,
                  struct block_device_operations *ops, long size);

Fixed disks might read the partition table only at module initialization time and when BLKRRPART is invoked. Drivers for removable drives will also need to make this call in the revalidate method. Either way, it is important to remember that register_disk will call your driver's request function to read the partition table, so the driver must be sufficiently initialized at that point to handle requests. You should also not have any locks held that will conflict with locks acquired in the request function. register_disk must be called for each disk actually present on the system.
spull sets up partitions in the revalidate method:

    int spull_revalidate(kdev_t i_rdev)
    {
        /* first partition, # of partitions */
        int part1 = (DEVICE_NR(i_rdev) << SPULL_SHIFT) + 1;
        int npart = (1 << SPULL_SHIFT) - 1;

        /* first clear old partition information */
        memset(spull_gendisk.sizes + part1, 0, npart * sizeof(int));
        memset(spull_gendisk.part + part1, 0, npart * sizeof(struct hd_struct));
        spull_gendisk.part[DEVICE_NR(i_rdev) << SPULL_SHIFT].nr_sects =
            spull_size << 1;

        /* then fill new info */
        printk(KERN_INFO "Spull partition check: (%d) ", DEVICE_NR(i_rdev));
        register_disk(&spull_gendisk, i_rdev, SPULL_MAXNRDEV, &spull_bdops,
                      spull_size << 1);
        return 0;
    }

It's interesting to note that register_disk prints partition information by repeatedly calling

    printk(" %s", disk_name(hd, minor, buf));

    for (i = 0; i < (spull_devs << SPULL_SHIFT); i++)
        fsync_dev(MKDEV(spull_major, i)); /* flush the devices */
    blk_cleanup_queue(BLK_DEFAULT_QUEUE(major));
    read_ahead[major] = 0;
    kfree(blk_size[major]); /* which is gendisk->sizes as well */
    blk_size[major] = NULL;
    kfree(spull_gendisk.part);
    kfree(blksize_size[major]);
    blksize_size[major] = NULL;

    for (gdp = &gendisk_head; *gdp; gdp = &((*gdp)->next))
        if (*gdp == &spull_gendisk) {
            *gdp = (*gdp)->next;
            break;
        }

Note that there is no unregister_disk to complement the register_disk function. Everything done by register_disk is stored in the driver's own arrays, so there is no additional cleanup required at unload time.
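The insertion at gendisk_head and the removal loop used in this subsection follow a common singly linked list idiom: new entries are pushed at the head, and removal walks a pointer to each next field so that the head and interior entries are unlinked by the same assignment. The sketch below shows the idiom in user space; struct disk_node, disk_list_add, and disk_list_del are hypothetical names, and only the list link of struct gendisk is modeled.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct gendisk: only the list link matters. */
struct disk_node {
    const char       *name;
    struct disk_node *next;
};

/* Push a node onto the front of the list, as spull does with
 * gendisk_head at initialization time. */
void disk_list_add(struct disk_node **head, struct disk_node *node)
{
    node->next = *head;
    *head = node;
}

/* Unlink a node by walking a pointer to each `next` field (starting
 * with the head pointer itself), mirroring the gdp loop at cleanup.
 * Returns 1 if the node was found and removed, 0 otherwise. */
int disk_list_del(struct disk_node **head, struct disk_node *node)
{
    struct disk_node **pp;

    for (pp = head; *pp; pp = &(*pp)->next) {
        if (*pp == node) {
            *pp = node->next;
            node->next = NULL;
            return 1;
        }
    }
    return 0;
}
```

Because the loop variable points at the link rather than at the node, no special case is needed when the entry being removed is the first one in the list.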
Partition Detection Using initrd
If you want to mount your root filesystem from a device whose driver is available only in modularized form, you must use the initrd facility offered by modern Linux kernels. We won't introduce initrd here; this subsection is aimed at readers who know about initrd and wonder how it affects block drivers. More information on initrd can be found in Documentation/initrd.txt in the kernel source.
The Device Methods for spull
We have seen how to initialize partitionable devices, but not yet how to access data within the partitions. To do that, we need to make use of the partition information stored in the gendisk->part array by register_disk. This array is made up of hd_struct structures, and is indexed by the minor number. The hd_struct has two fields of interest: start_sect tells where a given partition starts on the disk, and nr_sects gives the size of that partition.
First of all, open and close must keep track of the usage count for each device. Because the usage count refers to the physical device (unit), the following declaration and assignment are used for the dev variable:

    Spull_Dev *dev = spull_devices + DEVICE_NR(inode->i_rdev);

Although almost every device method works with the physical device as a whole, ioctl should access specific information for each partition. For example, when mkfs calls ioctl to retrieve the size of the device on which it will build a filesystem, it should be told the size of the partition of interest, not the size of the whole device. Here is how the BLKGETSIZE ioctl command is affected by the change from one minor number per device to multiple minor numbers per device. As you might expect, spull_gendisk.part is used as the source of the partition size.

    case BLKGETSIZE:
        /* Return the device size, expressed in sectors */
        err = !access_ok(VERIFY_WRITE, arg, sizeof(long));
        if (err) return -EFAULT;
        size = spull_gendisk.part[MINOR(inode->i_rdev)].nr_sects;
        if (copy_to_user((long *) arg, &size, sizeof(long)))
            return -EFAULT;
        return 0;

    case BLKRRPART: /* re-read partition table */
        return spull_revalidate(inode->i_rdev);

Here are the relevant lines in spull_request:

    ptr = device->data +
          (spull_partitions[minor].start_sect + req->sector) * SPULL_HARDSECT;
    size = req->current_nr_sectors * SPULL_HARDSECT;
    /*
     * Make sure that the transfer fits within the device.
     */
    if (req->sector + req->current_nr_sectors >
            spull_partitions[minor].nr_sects) {
        static int count = 0;
        if (count++ < 5)
            printk(KERN_WARNING "spull: request past end of partition\n");
        return 0;
    }

The number of sectors is multiplied by the hardware sector size (which, remember, is hardwired in spull) to get the size of the partition in bytes.
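The arithmetic in those lines can be isolated in a small user-space function. This is a sketch under stated assumptions: struct part_info and map_request are hypothetical names standing in for struct hd_struct and the relevant part of spull_request, the bounds check is done before the offset is computed rather than after, and overflow of the sector sum is not considered.

```c
#define SPULL_HARDSECT 512   /* bytes per sector, hardwired as in spull */

/* Hypothetical stand-in for struct hd_struct: the two fields of interest. */
struct part_info {
    unsigned long start_sect;  /* first sector of the partition on the disk */
    unsigned long nr_sects;    /* partition length in sectors */
};

/* Translate a partition-relative request into a byte offset from the
 * start of the whole device and a length in bytes.
 * Returns 0 on success, or -1 if the request would run past the end
 * of the partition (in which case the outputs are left untouched). */
int map_request(const struct part_info *p,
                unsigned long sector, unsigned long nr_sectors,
                unsigned long *offset, unsigned long *bytes)
{
    if (sector + nr_sectors > p->nr_sects)
        return -1;
    *offset = (p->start_sect + sector) * SPULL_HARDSECT;
    *bytes = nr_sectors * SPULL_HARDSECT;
    return 0;
}
```

For a partition that starts at sector 100 and is 50 sectors long, a request for 4 sectors at partition-relative sector 10 maps to byte offset 110 * 512 = 56320 with a length of 2048 bytes, while a 4-sector request at sector 48 is rejected because it would end at sector 52.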
Interrupt-Driven Block Drivers
When a driver controls a real hardware device, operation is usually interrupt driven. Using interrupts helps system performance by releasing the processor during I/O operations. In order for interrupt-driven I/O to work, the device being controlled must be able to transfer data asynchronously and to generate interrupts.
As always, block transfers begin when the kernel calls the driver's request function. The request function for an interrupt-driven device instructs the hardware to perform the transfer and then returns; it does not wait for the transfer to complete. The spull request function performs the usual error checks and then calls spull_transfer to transfer the data (this is the task that a driver for real hardware performs asynchronously). It then delays acknowledgment until interrupt time:

    void spull_irqdriven_request(request_queue_t *q)
    {
        Spull_Dev *device;
        int status;
        unsigned long flags;

        /* If we are already processing requests, don't do any more now. */
        if (spull_busy)
            return;

        while (1) {
            INIT_REQUEST; /* returns when queue is empty */

            /* Which "device" are we using? */
            device = spull_locate_device(CURRENT);
            if (device == NULL) {
                end_request(0);
                continue;
            }
            spin_lock_irqsave(&device->lock, flags);

            /* Perform the transfer and clean up. */
            status = spull_transfer(device, CURRENT);
            spin_unlock_irqrestore(&device->lock, flags);
            /* ... and wait for the timer to expire -- no end_request(1) */
            spull_timer.expires = jiffies + spull_irq;
            add_timer(&spull_timer);
            spull_busy = 1;
            return;
        }
    }

    /* this is invoked when the timer expires */
    void spull_interrupt(unsigned long unused)
    {
        unsigned long flags;

        spin_lock_irqsave(&io_request_lock, flags);
        end_request(1); /* This request is done - we always succeed */

        spull_busy = 0; /* We have io_request_lock, no request conflict */
        if (!QUEUE_EMPTY) /* more of them? */
            spull_irqdriven_request(NULL); /* Start the next transfer */
        spin_unlock_irqrestore(&io_request_lock, flags);
    }

If you try to run the interrupt-driven flavor of the spull module, you'll barely notice the added delay. The device is almost as fast as it was before because the buffer cache avoids most data transfers between memory and the device. If you want to perceive how a slow device behaves, you can specify a bigger value for irq= when loading spull.
Backward Compatibility
Much has changed with the block device layer, and most of those changes happened between the 2.2 and 2.4 stable releases. Here is a quick summary of what was different before. As always, you can look at the drivers in the sample source, which work on 2.0, 2.2, and 2.4, to see how the portability challenges have been handled.
The block_device_operations structure did not exist in Linux 2.2. Instead, block drivers used a file_operations structure just like char drivers. The check_media_change and revalidate methods used to be a part of that structure. The kernel also provided a set of generic functions -- block_read, block_write, and block_fsync -- which most drivers used in their file_operations structures. A typical 2.2 or 2.0 file_operations initialization looked like this:

    struct file_operations sbull_bdops = {
        read:               block_read,
        write:              block_write,
        ioctl:              sbull_ioctl,
        open:               sbull_open,
        release:            sbull_release,
        fsync:              block_fsync,
        check_media_change: sbull_check_change,
        revalidate:         sbull_revalidate
    };

In 2.2 and previous kernels, the request function was stored in the blk_dev global array. Initialization required a line like

    blk_dev[major].request_fn = sbull_request;

    void (*request) (void);

Also, all queues had active heads, so blk_queue_headactive did not exist.

    #ifdef RO_IOCTLS
    static inline int blk_ioctl(kdev_t dev, unsigned int cmd, unsigned long arg)
    {
        int err;

        switch (cmd) {
        case BLKRAGET: /* return the read-ahead value */
            if (!arg) return -EINVAL;
            err = !access_ok(VERIFY_WRITE, arg, sizeof(long));
            if (err) return -EFAULT;
            PUT_USER(read_ahead[MAJOR(dev)], (long *) arg);
            return 0;

        case BLKRASET: /* set the read-ahead value */
            if (!capable(CAP_SYS_ADMIN)) return -EACCES;
            if (arg > 0xff) return -EINVAL; /* limit it */
            read_ahead[MAJOR(dev)] = arg;
            return 0;

        case BLKFLSBUF: /* flush */
            if (!capable(CAP_SYS_ADMIN)) return -EACCES; /* only root */
            fsync_dev(dev);
            invalidate_buffers(dev);
            return 0;

        RO_IOCTLS(dev, arg);
        }
        return -ENOTTY;
    }
    #endif /* RO_IOCTLS */

Finally, register_disk did not exist until Linux 2.4. There was, instead, a function called resetup_one_dev, which performed a similar function:

    resetup_one_dev(struct gendisk *gd, int drive);

register_disk is emulated in sysdep.h with the following code:

    static inline void register_disk(struct gendisk *gdev, kdev_t dev,
                                     unsigned minors,
                                     struct file_operations *ops, long size)
    {
        if (!gdev)
            return;
        resetup_one_dev(gdev, MINOR(dev) >> gdev->minor_shift);
    }

One final thing worth keeping in mind: although nobody really knows what will happen in the 2.5 development series, a major block device overhaul is almost certain. Many people are unhappy with the design of this layer, and there is a lot of pressure to redo it.
Quick Reference
The most important functions and macros used in writing block drivers are summarized here. To save space, however, we do not list the fields of struct request, struct buffer_head, or struct gendisk, and we omit the predefined ioctl commands.
#include <linux/fs.h>
int register_blkdev(unsigned int major, const char *name, struct block_device_operations *bdops);
int unregister_blkdev(unsigned int major, const char *name);
These functions are in charge of device registration in the module's initialization function and device removal in the cleanup function.
#include <linux/blkdev.h>
blk_init_queue(request_queue_t *queue, request_fn_proc *request);
blk_cleanup_queue(request_queue_t *queue);
The first function initializes a queue and establishes the request function; the second is used at cleanup time.
BLK_DEFAULT_QUEUE(major)
This macro returns a default I/O request queue for a given major number.
struct blk_dev_struct blk_dev[MAX_BLKDEV];
This array is used by the kernel to find the proper queue for a given request.
int read_ahead[];
int max_readahead[][];
read_ahead contains block-level read-ahead values for every major number. A value of 8 is reasonable for devices like hard disks; the value should be greater for slower media. max_readahead contains filesystem-level read-ahead values for every major and minor number, and is not usually changed from the system default.
int max_sectors[][];
int blksize_size[][];
int blk_size[][];
int hardsect_size[][];
These two-dimensional arrays are indexed by major and minor number. The driver is responsible for allocating and deallocating the row in the matrix associated with its major number. max_sectors limits the size of a single request in sectors, blksize_size holds the size of device blocks in bytes (usually 1 KB), blk_size holds the size of each minor device in kilobytes (not blocks), and hardsect_size holds the size of the hardware sector in bytes.
MAJOR_NR
DEVICE_NAME
DEVICE_NR(kdev_t device)
DEVICE_INTR
#include <linux/blk.h>
These macros must be defined by the driver before it includes <linux/blk.h>, because they are used within that file. MAJOR_NR is the major number for the device, DEVICE_NAME is the name of the device to be used in error messages, DEVICE_NR returns the minor number of the physical device referred to by a device number, and DEVICE_INTR is a little-used symbol that points to the device's bottom-half interrupt handler.
spinlock_t io_request_lock;
The spinlock that must be held whenever an I/O request queue is being manipulated.
struct request *CURRENT;
INIT_REQUEST;
end_request(int status);
INIT_REQUEST checks the next request on the queue and returns if there are no more requests to execute. end_request is called at the completion of a block request.
struct request *blkdev_entry_next_request(struct list_head *head);
struct request *blkdev_next_request(struct request *req);
struct request *blkdev_prev_request(struct request *req);
blkdev_dequeue_request(struct request *req);
blkdev_release_request(struct request *req);
blk_queue_headactive(request_queue_t *queue, int active);
Indicates whether the first request in the queue is being actively processed by the driver or not.
void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);
Provides a function to handle block I/O requests directly out of the kernel.
end_that_request_first(struct request *req, int status, char *name);
end_that_request_last(struct request *req);
Handle the stages of completing a block I/O request. end_that_request_last is only called when all buffers in the request have been processed -- that is, when end_that_request_first returns 0.
bh->b_end_io(struct buffer_head *bh, int status);
int blk_ioctl(kdev_t dev, unsigned int cmd, unsigned long arg);
A utility function that implements most of the standard block device ioctl commands.
int check_disk_change(kdev_t dev);
#include <linux/genhd.h>
struct gendisk;
struct gendisk *gendisk_head;
The generic hard disk allows Linux to support partitionable devices easily. The gendisk structure describes a generic disk; gendisk_head is the beginning of a linked list of structures describing all of the disks on the system.
void register_disk(struct gendisk *gd, int drive, unsigned minors, struct block_device_operations *ops, long size);
This function scans the partition table of the disk and rewrites gendisk->part to reflect the new partitioning.
© 2001, O'Reilly & Associates, Inc.