Partition Surprise 0.1: Surprise Online Partition Resizing & Moving And Ext2 Resizing: Kernel part

10.2 Kernel part

Kernel support has two almost independent (they actually share only some useful macros and generic functions) parts:

Partition moving and resizing
Ext2 resizing

Code for both parts resides in linux/fs/surprise/ in kernel tree (surprise/src/online/ in surprise tree) and there are of course some patches to other kernel files.

Partition moving and resizing

Interface

Interfaces to both these operations are through ioctl(2) function. An ioctl(2) must be called on the root device of the disk (ie. /dev/hda) to be manipulated (source disk in case of move). The ioctl number is BLKPG defined in <linux/blkpg.h>. See also this header file for definition of struct blkpg_ioctl_arg which must be provided as a third argument of an ioctl(2). Field op in this argument has to be set to BLKPG_MOVE_PARTITON for partition moving and to BLKPG_RESIZE_PARTITION for partition resizing. Partition move uses as its argument (fields data and datalen in struct blkpg_ioctl_arg) struct blkpg_move_partition (defined in the header file mentioned above) and partition resize uses struct blkpg_partition. See header file for description of fields of those structures.

Changed devices

We keep struct changed_major for each major. To be more precise we have pointer to this structure for each major and once Surprise needs to do something with the major it allocates the space. In the struct changed_major we keep request queue for requests on moved partition and also function which used to return request queue before we changed that function to ours move_part_get_queue. In the structure is also array of pointers to struct changed_disk. We have one pointer per minor. When move of partition is requested we allocate struct changed_disk for the disk on which partition lies and set pointers from all minors which belong to that list to the allocated structure. struct changed_disk contains device number of partition which is moved, pointer to struct gendisk describing the disk, wait queue for waiting when partition is moved / resized, wait queue for waiting when one moving phase is done, list of buffers currently moved protected by semaphore and struct move_map_int which keeps information about interval of blocks which are already moved. The array of struct changed_major and also all arrays of struct changed_disk pointers and also pointed structures are protected by spinlock cd_lock. The only exception is struct move_map_int which is protected by mi_lock. We don't need to keep cd_lock while messing with mapped intervals because we only allocate and never deallocate structures.

Everytime any change in device's partition table or any data move is done it must be protected by appropriate struct changed_disk.

Provided functions operating with changed disks are these:

struct changed_dev *new_changed_dev(struct gendisk *gd, kdev_t moveddev, int blocks, int flags);

which creates new structure for device moveddev (waiting if this device is already busy). Note: moveddev might be (and usually is) just a partition (ie. /dev/sda4) but operation on the partition will block any other operations on the disk (ie. new_changed_dev() on /dev/sda1 will block if we already move /dev/sda4). Function will also allocate array for blocks buffer head pointers if flag NCD_ALLOC_BUFS was specified. If flag NCD_MAP_DISK was specified, new_changed_dev() will set appropriate flag in struct changed_disk and so that move_part_get_queue() knows it should map the partition blocks.

struct changed_dev *get_changed_dev(kdev_t dev);

will return struct changed_disk for given device. It returns NULL if there's not any structure.

void free_changed_dev(struct changed_disk *cd);

should be called when operation on partition is finished. This function will reinitialize struct changed_disk and wakes up all waiters.

void wait_move_done(struct changed_disk *cd);

is a bit special. It has more to do with the moving itself. This function will wait till one phase of moving on specified device is done -- it will insert caller into wait queue and starts uninterruptible sleep. After one phase of moving is done all waiters in the queue are woken up.

Implementation of partition resize

Partition resize (function int resize_partition(kdev_t rdev, struct blkpg_partition *part);) is implemented quite straightforwardly. It will get struct changed_disk for the device through new_changed_dev(), check for potential partition conflict when growing partition or invalidate buffers when shrinking partition (to junk buffers which would be after end of partition) and then alter kernel information about partitions (fields sizes[MINOR(dev)] and part[MINOR(dev)].nr_sects in corresponding struct gendisk).

Implementation of partition move

The idea behind partition moving is the following: We install our own function move_part_get_queue() into struct blk_dev for major to which belongs moved partition. When request queue for some device with this major is requested we check our changed_disk array and if device (partition) is currently moved we return our request queue from struct changed_major. Otherwise we either call function which was in the blk_dev before or we return default request queue.

In our request queue we have only make_request() function set. It is our function move_part_make_request(). We don't need other functions as we in make_request() forward requests to various other queues.

move_part_make_request() first gets struct changed_disk for appropriate device. If it finds out that currently moved device isn't the device of given buffer (this could happen if there would be some kind of race -- something happened after we returned the request queue and before the make_request() was called) it just returns success and further processing in generic_make_request will get proper queue. After this check we wait through wait_block_moved() until move phase is finished if the buffer is currently moved. Then we map the block (buffer) to new block and device and return success.

wait_block_moved() simply gets semaphore from array of moved buffer and scans the array. If it finds buffer in it it will release the semaphore and call wait_move_done() which will wait until current moving phase is finished (see above). If buffer is not in the array it will just release the semaphore and return.

The implementation of block mapping is in blockmap.c. For each mapped interval we know its first sector, last sector, first sector of partition this interval belongs to and destination device (root device of a destination disk) and position of first sector of mapped interval on destination device. So our mapping function move_map() just checks whether given block is in mapped interval and if so it will change appropriately the fields b_rdev and b_rsector.

And now how moving itself (implemented in move.c) works. After some checks we get struct changed_disk for source device -- this device will be mapped. We check for possible partition conflicts and if moving on different partition, we create the destination partition in kernel partition information (it is stored in struct gendisk). Then we call do_move_partition() which does the moving and after that we delete the source partition if it isn't also a destination partition.

Function do_move_partition() first checks whether we should move blocks from partition start or partition end (we have to prevent overwriting of blocks we are going to move...). Than we create mapped interval for moved device - it will contain no blocks in the beginning. And after that we start moving. We move COPYSTEP sectors in one phase.

The moving in one phase does function move_blocks(). It will first get buffer heads for all blocks to move. As we don't want to disturb buffer cache nor we want anybody writing the buffer back to disk earlier than we decide. We use our own function gettmpblk() which will act almost same as standard getblk(). It won't just search in buffer cache whether buffer is there nor it will hash the created buffer. I personally dislike to have this special function but I haven't found out any cleaner solution. After we get all the buffers we submit all buffer for reading (we have to wait till IO on corresponding buffer in buffer cache is done before submitting read because otherwise we could get inconsistent data). After all reads are submitted we wait on IO completition. When all reads are completed we update the mapped interval so that it'll now contain also the freshly read buffers and start submitting buffers for writing. Now all buffers will be mapped by our move_part_make_request() function and so will be written to appropriate location on destination disk (the move_part_make_request() checks whether buffer it is mapping is the buffer move_blocks() is using and if so it won't wait till the phase finishes - otherwise we would have nice deadlocks :-)). After all writes are submitted and completed we discard the buffers so they won't make problems in buffer cache. After discarding the buffers we just wake up processes waiting for end of current phase and return from move_blocks().

After all blocks are moved time for rewriting references to source device comes. This part of code is a bit magical. Rewriting is done by function move_push_int() which will call function rewrite_dev_refs() and also remove the old mapped interval after it has rewritten all the references (the interval is not needed any more). For first we search if there is any struct vfsmnt for the device (eg. if there is any filesystem mounted on the device). If yes we rewrite device references in superblock and struct vfsmnt to new device. Then we go through all inode lists and rewrite references in i_dev to new device. We also scan all pages in page cache belonging to the inode and rewrite b_dev in buffers covering those pages. We have to do this as not all buffers need to be on LRU lists in buffer cache (create_empty_buffers() doesn't insert buffers to LRU lists). From now on no new inodes nor quotas should be created with the old device (OK, at least for ext2 this is true). Then we go through all struct dquot and rewrite device references (we do this pass only in quota is enabled in the kernel). After that we wait till all getblk() finishes. We have to do this as there might be some getblk() in process and it might have our old device in its local variables and so it will create buffer on old device (ugly ugly ugly). After all getblk()s returned we go through all buffer LRU lists and rewrite device references in buffers (in b_dev). We don't alter any swap-related structures and so you can't swap on moved partition. This might also change in future.

Another task for future is investigating LVM as it will probably make things a lot easier than they are now (especially rewriting of refernces) and it seems it will be accepted to 2.4 kernel.

Ext2 resizing

Interface

Interface to ext2 resizing is done through mount command (I have to admit I was inspired by one other resizer). Simply you can remount filesystem on new size :-). After resizing patches ext2 recognizes new remount option resize with one numeric parameter - new size in blocks.

Relocation of data

Both filesystem shrinking and growing needs to relocate some data blocks. When shrinking we need to move data from groups which will be deleted. When growing we need to move data in order to make place for descriptor blocks which must be situated in the beginning of group.

Data relocation is handled through function rewrite_fs_block_refs() implemented in ext2-resize.c. We go through all inodes in filesystem and if inode is used (bit in inode bitmap is set) and inode is regular file, directory or symlink with allocated block we relocate the data blocks with function rewrite_inode_block_refs(). Note that we can't miss any reference as inodes can't be moved between inode tables and newly created inodes have references just to blocks which won't be relocated.

First we mark inode as being rewritten (flag EXT2_REWRITING_FL) and then we go through all block references. For each reference we call rewrite_block_ref(). This function calls specified callback function map() which will return whether this block should really be relocated and where it should be put. We also pass additional information like goal to mapping function so it can allocate blocks for relocation cleverly. When block should be relocated we either move buffer of the block in buffer cache to new hash queue and change its b_blocknr or we find proper page in page cache and change block number of buffer there (the action depends whether we are relocating block of directory or regular file). Then we mark buffer as dirty so bdflush will write it to new place sometimes. Note that we don't have to worry about existence of buffers for block we are relocating our buffer to as in that case ext2 would have problems too (buffer cache vs. page cache races). After all blocks are mapped we remove EXT2_REWRITING_FL flag.

Now to mapping functions. Shrinking has function shrink_mapper() for this purpose. This function checks whether given block is after new end of filesystem and if so it calls ext2_alloc_block() to return it new block. As we have reserved space for purposes of relocation (see shrinking for details) we know we will be successful (we say ext2_alloc_block() with additional argument that it should allcate block from reserved space). We also decrease block quota as this block is in fact already accounted to the user.

Grow mapper is implemented by function grow_mapper(). It is quite similar to shrink_mapper(). It only relocates blocks which lie on place of future descriptor blocks or on place to which we have to shift inode table.

Error handling

When error occures while resizing it's usually not good (I'm not counting the safety checks in the beginning. When they fail everything should remain alright). The behaviour quite depends on the situation when error occured. For example when we are moving inodes and we are unable to read some inode or block we just continue. The problems we cause by this are hopefully smaller than when we would leave inodes half moved. On places where we just can't ignore the error we at least try to return filesystem to the state in which fsck would be able to repair it -- i.e. we return old values to superblock. We also immediately make filesystem readonly so that nobody can mess with it and we mark that the filesystem has errors. This functionality is implemented in surprise_ext2_err() and restore_ext2_super().

Ext2 shrinking

Function ext2_shrink() will first do checks whether there is enough free space and free inodes on filesystem. Then it will count new number of groups, number of inodes, new number of free blocks and inodes, and will set those numbers to super block. Actually it sets number of free blocks and inodes to the values as they should be after all data will be moved. This is done to prevent other processes to eat up free space for moved data and so breaking the resizing (as current function ext2_new_block() and ext2_new_inode() doesn't directly check number of free blocks or number of free inodes we have to add these checks to the functions). It also set bits in block bitmap after end of new last group and removes bitmaps of deleted groups from memory. So from now on filesystem looks (at least from superblock point of view) like shrinked and so no new inodes / blocks will be allocated in areas which will be removed. As there are in fact some data after end of new filesystem we have to patch ext2_free_block(), ext2_free_inode() and ext2_read_inode() so that they won't go crazy when manipulating data after end of filesystem.

After altering superblock and bitmaps we start relocating of inodes This is done by move_inodes() function. When relocating inodes we scan through all inodes in groups which will be deleted. When inode is used (bit in bitmap set) we allocate new inode and swap inode numbers (and also hash chains and group numbers) of new inode and old inode. Then alter inode which will be written to old location in following way: we set EXT2_RESIZED_FL in flags and store new inode number in i_blocks. Then we mark both inodes dirty. So data of old inode are placed to new location and on the old location we have reference to the new one. We use it when directory lookup is done - references from directory structure are still to old inode number and we need ext2_lookup() to return new inode. So ext2_lookup() is patched that way that when it finds out that inode it got has EXT2_RESIZED_FL set it will get inode number from i_blocks field and load new inode.

After all inodes are moved we start rewriting references to inodes. We go through all directories (we scan inode table to prevent races) in filesystem. For each directory we scan all its data and when we find directory entry pointing to inode after end of filesystem we resolve the reference (load the old inode and get new inode number) and change inode number to new number. Note that we can't miss any directory entry even when move is done because before move is done source entry is looked up and so we resolve the reference and get the new inode and hence the new inode number is written to created directory entry.

When all references to inodes are rewritten we relocate data blocks. See Relocation of data for details.

Then we sync all inodes so that we don't lose data while invalidating after remount. We also free any unused descriptor blocks. After we return from ext2_shrink() we write all superblocks on their places and finish resizing.

Ext2 growing

Function ext2_shrink() first checks whether there is enough space in the filesystem for growing. It sounds a bit strange but we need to add new descriptor blocks to each group for new groups to be accessible. As we don't need to add any blocks when just growing last group we actually check whether there would be enough space when we grow last group to its new size.

After the checks are done we grow the last group -- we update number of blocks and free blocks in filesystem and group and clear bits in block bitmap. When growing of last group is done we continue with adding of new groups (of course only when we haven't filesystem large enough). At first we decrease number of free blocks so that we reserve space for new descriptor blocks (see reserving space while shrinking for more details).

Then we start allocating blocks in the beginning of groups for new descriptors. At first we analyze for each group which structures (like inode table or bitmaps) have to be moved to gain needed space. After that we mark all blocks in wanted area (we include also space for moved inode table or bitmaps in the area) as used and then move any data from the area (see Relocation of data for more details). Now when we know there are only structures like bitmaps or inode table in the area we start moving them. We start moving from structure which is last in the group so that we don't overwrite other structures.

Moving of bitmaps is quite simple. We just read bitmap to the memory, move buffer to new location (rehash the buffer, change b_blocknr), change reference in superblock, and mark buffer dirty. Moving of inode table is more difficult. We have two special entries in superblock for moving inode tables -- s_iblocks_shift and s_moved_inode_block. Second entry says which block is actually moved and the first one says how much it is moved. So when we are moving blocks in inode table we keep those values uptodate and so ext2_read_inode() and ext2_update_inode() can work properly. Moving itself is simple -- we read in some blocks, change buffers under them and mark them dirty so they will be flushed to disk to new location.

When we have all space for group descriptors prepared it's time to create them. This is done in function add_descriptors(). This function just update information kept in memory in superblock (actually the descriptors from first group) but function write_all_supers() called later will copy it to all groups. For each new group we set properly (we have to check whether the group is sparse or not) bitmap blocks, inode table position, and number of free blocks. After setting is done we mark blocks with descriptors uptodate and dirty.

The last thing needed before we can start using new groups is initializing them. This does function init_groups(). As all groups are initialized mostly same we create template bitmaps in memory. Then we go through groups and in each group get the blocks with bitmaps, copy template bitmaps to them (we have to check whether the group is sparse or not and set bits in block bitmap for superblock and descriptors in that case), and mark them dirty. Then we zero all blocks in inode table. After that initialization of group is done.

When groups are initialized we set new values (inodes count, blocks count, free blocks count, free inodes count, groups count, number of descriptor blocks) in superblock and clear resizing state. Then we sync all inodes so that data in them won't get trashed in invalidate_inodes(). After we return from ext2_grow() write_all_supers() is called so superblocks and group descriptors in all groups are updated.

Next Previous Contents