Kernel support has two almost independent (they actually share only some useful macros and generic functions) parts:
Code for both parts resides in
linux/fs/surprise/ in kernel tree
surprise/src/online/ in surprise tree) and there are of course
some patches to other kernel files.
Interfaces to both these operations are through
ioctl(2) must be called on the root device of the disk (ie.
/dev/hda) to be manipulated (source disk in case of move). The
ioctl number is
BLKPG defined in
<linux/blkpg.h>. See also this header file for definition of
struct blkpg_ioctl_arg which must be provided as a third argument
op in this argument has to be
BLKPG_MOVE_PARTITON for partition moving and to
BLKPG_RESIZE_PARTITION for partition resizing. Partition move uses
as its argument (fields
struct blkpg_move_partition (defined in the header file mentioned
above) and partition resize uses
See header file for description of fields of those structures.
struct changed_major for each major. To be more precise
we have pointer to this structure for each major and once Surprise needs
to do something with the major it allocates the space. In the
we keep request queue for requests on moved partition and also function which used to
return request queue before we changed that function to ours
In the structure is also array of pointers to
struct changed_disk. We have one
pointer per minor. When move of partition is requested we allocate
for the disk on which partition lies and set pointers from all minors which belong to
that list to the allocated structure.
struct changed_disk contains device number
of partition which is moved, pointer to
struct gendisk describing the disk, wait
queue for waiting when partition is moved / resized, wait queue for waiting when one
moving phase is done, list of buffers currently moved protected by semaphore and
struct move_map_int which keeps information about interval of blocks which are
already moved. The array of
struct changed_major and also all arrays of
struct changed_disk pointers and also pointed structures are protected by
cd_lock. The only exception is
struct move_map_int which is
mi_lock. We don't need to keep
cd_lock while messing with mapped
intervals because we only allocate and never deallocate structures.
Everytime any change in device's partition table or any data move is done it must
be protected by appropriate
Provided functions operating with changed disks are these:
which creates new structure for device
(waiting if this device is already busy). Note:
moveddev might be (and usually
is) just a partition (ie.
/dev/sda4) but operation on the partition
will block any other operations on the disk (ie.
/dev/sda1 will block if we already move
Function will also allocate array for
blocks buffer head pointers if flag
NCD_ALLOC_BUFS was specified. If flag
NCD_MAP_DISK was specified,
new_changed_dev() will set appropriate flag in
and so that
move_part_get_queue() knows it should map the partition blocks.
struct changed_disk for given device. It returns
NULL if there's not
should be called when
operation on partition is finished. This function will reinitialize
and wakes up all waiters.
is a bit special. It has more to do with the moving itself. This function will wait till one phase of moving on specified device is done -- it will insert caller into wait queue and starts uninterruptible sleep. After one phase of moving is done all waiters in the queue are woken up.
Partition resize (function
int resize_partition(kdev_t rdev, struct
blkpg_partition *part);) is implemented quite straightforwardly. It will
struct changed_disk for the device through
new_changed_dev(), check for potential
partition conflict when growing partition or invalidate buffers when shrinking
partition (to junk buffers which would be after end of partition) and then
alter kernel information about partitions (fields
part[MINOR(dev)].nr_sects in corresponding
The idea behind partition moving is the following: We install our own function
struct blk_dev for major to which belongs moved
partition. When request queue for some device with this major is requested we check
changed_disk array and if device (partition) is currently moved we return
our request queue from
struct changed_major. Otherwise we either call function
which was in the
blk_dev before or we return default request queue.
In our request queue we have only
make_request() function set. It is our
move_part_make_request(). We don't need other functions as we in
make_request() forward requests to various other queues.
move_part_make_request() first gets
struct changed_disk for appropriate
device. If it finds out that currently moved device isn't the device of given buffer
(this could happen if there would be some kind of race -- something happened after
we returned the request queue and before the
make_request() was called) it just
returns success and further processing in
generic_make_request will get proper
queue. After this check we wait through
wait_block_moved() until move phase is
finished if the buffer is currently moved. Then we map the block (buffer) to new block and
device and return success.
wait_block_moved() simply gets semaphore from array of moved buffer and
scans the array. If it finds buffer in it it will release the semaphore and
wait_move_done() which will wait until current moving phase is
above). If buffer is not in the
array it will just release the semaphore and return.
The implementation of block mapping is in
blockmap.c. For each mapped
interval we know its first sector, last sector, first sector of partition this
interval belongs to and destination device (root device of a destination disk)
and position of first sector of mapped interval on destination device. So our
move_map() just checks whether given block is in mapped
interval and if so it will change appropriately the fields
And now how moving itself (implemented in
move.c) works. After some
checks we get
struct changed_disk for source device -- this device will be mapped.
We check for possible partition conflicts and if moving on different
partition, we create the destination partition in kernel partition information (it is
struct gendisk). Then we call
which does the moving and after that we delete the source partition if it isn't also
a destination partition.
do_move_partition() first checks whether we should move blocks
from partition start or partition end (we have to prevent overwriting of blocks we
are going to move...). Than we create mapped interval for moved device - it will
contain no blocks in the beginning. And after that we start moving. We move
sectors in one phase.
The moving in one phase does function
It will first get buffer heads for all blocks to move. As we don't want to disturb
buffer cache nor we want anybody writing the buffer back to disk earlier than we decide.
We use our own function
gettmpblk() which will act almost same as standard
It won't just search in buffer cache whether buffer is there nor it will
hash the created buffer. I personally dislike to have this special function but I haven't
found out any cleaner solution. After we get all the buffers we submit all buffer for reading
(we have to wait till IO on corresponding buffer in buffer cache is done before submitting
read because otherwise we could get inconsistent data). After all reads are submitted
we wait on IO completition. When all reads are completed we update the mapped interval so that
it'll now contain also the freshly read buffers and start submitting buffers for writing.
Now all buffers will be mapped by our
move_part_make_request() function and
so will be written to appropriate location on destination disk (the
checks whether buffer it is mapping is the buffer
using and if so it won't wait till the phase finishes - otherwise we would have nice
deadlocks :-)). After all writes are submitted and completed we discard the buffers so
they won't make problems in buffer cache. After discarding the buffers we just wake up
processes waiting for end of current phase and return from
After all blocks are moved time for rewriting references to source device comes.
This part of code is a bit magical. Rewriting is done by function
move_push_int() which will call function
rewrite_dev_refs() and also
remove the old mapped interval after it has rewritten all the references (the
interval is not needed any more). For first we search if there is any
struct vfsmnt for the device (eg. if there is any filesystem mounted on
the device). If yes we rewrite device references in superblock and
vfsmnt to new device. Then we go through all inode lists and rewrite
i_dev to new device. We also scan all pages in page cache
belonging to the inode and rewrite
b_dev in buffers covering those pages.
We have to do this as not all buffers need to be on LRU lists in buffer cache
create_empty_buffers() doesn't insert buffers to LRU lists). From now on
no new inodes nor quotas should be created with the old device (OK, at least
for ext2 this is true). Then we go through all
struct dquot and
rewrite device references (we do this pass only in quota is enabled in the
kernel). After that we wait till all
getblk() finishes. We have to do this
as there might be some
getblk() in process and it might have our old
device in its local variables and so it will create buffer on old device (ugly
ugly ugly). After all
getblk()s returned we go through all buffer LRU
lists and rewrite device references in buffers (in
b_dev). We don't alter
any swap-related structures and so you can't swap on moved partition. This
might also change in future.
Another task for future is investigating LVM as it will probably make things a lot easier than they are now (especially rewriting of refernces) and it seems it will be accepted to 2.4 kernel.
Interface to ext2 resizing is done through mount command (I have to admit I
was inspired by one other resizer). Simply you can remount filesystem on new size :-).
After resizing patches ext2 recognizes new remount option
one numeric parameter - new size in blocks.
Both filesystem shrinking and growing needs to relocate some data blocks. When shrinking we need to move data from groups which will be deleted. When growing we need to move data in order to make place for descriptor blocks which must be situated in the beginning of group.
Data relocation is handled through function
ext2-resize.c. We go through all inodes in filesystem
and if inode is used (bit in inode bitmap is set) and inode is regular file,
directory or symlink with allocated block we relocate the data blocks with function
rewrite_inode_block_refs(). Note that we can't miss any reference as inodes
can't be moved between inode tables and newly created inodes have references just
to blocks which won't be relocated.
First we mark inode as being rewritten (flag
and then we go through all block references. For each reference we call
rewrite_block_ref(). This function calls specified callback
map() which will return whether this block should really be
relocated and where it should be put. We also pass additional information like
goal to mapping function so it can allocate blocks for relocation
cleverly. When block should be relocated we either move buffer of the block in
buffer cache to new hash queue and change its
b_blocknr or we find
proper page in page cache and change block number of buffer there (the action depends
whether we are relocating block of directory or regular file). Then we mark
buffer as dirty so bdflush will write it to new place sometimes. Note
that we don't have to worry about existence of buffers for block we are
relocating our buffer to as in that case ext2 would have problems too
(buffer cache vs. page cache races). After all blocks are mapped we remove
Now to mapping functions. Shrinking has function
this purpose. This function checks whether given block is after new end of filesystem
and if so it calls
ext2_alloc_block() to return it new block. As we have
reserved space for purposes of relocation (see
details) we know we will be successful (we say
ext2_alloc_block() with additional
argument that it should allcate block from reserved space). We also decrease block
quota as this block is in fact already accounted to the user.
Grow mapper is implemented by function
grow_mapper(). It is quite
shrink_mapper(). It only relocates blocks which lie on place
of future descriptor blocks or on place to which we have to shift inode table.
When error occures while resizing it's usually not good (I'm not counting the
safety checks in the beginning. When they fail everything should remain alright).
The behaviour quite depends on the situation when error occured. For example when
we are moving inodes and we are unable to read some inode or block we just
continue. The problems we cause by this are hopefully smaller than when we would
leave inodes half moved. On places where we just can't ignore the error we
at least try to return filesystem to the state in which fsck would be able
to repair it -- i.e. we return old values to superblock. We also immediately
make filesystem readonly so that nobody can mess with it and we mark that the
filesystem has errors. This functionality is implemented in
ext2_shrink() will first do checks whether there is
enough free space and free inodes on filesystem. Then it will count new number
of groups, number of inodes, new number of free blocks and inodes, and will set
those numbers to super block.
Actually it sets number of free blocks and inodes to
the values as they should be after all data will be moved. This is done to prevent other
processes to eat up free space for moved data and so breaking the resizing (as
doesn't directly check number of free blocks or number of free inodes we have to
add these checks to the functions). It also set bits in block bitmap after end of
new last group and removes bitmaps of deleted groups from memory. So from now
on filesystem looks (at least from superblock point of view) like shrinked and
so no new inodes / blocks will be allocated in areas which will be removed. As
there are in fact some data after end of new filesystem we have to patch
so that they won't go crazy when manipulating data after end of filesystem.
After altering superblock and bitmaps we start relocating of inodes
This is done by
move_inodes() function. When relocating inodes
we scan through all inodes in groups which will be deleted. When inode is
used (bit in bitmap set) we allocate new inode and swap inode numbers (and also
hash chains and group numbers) of new inode and old inode. Then alter inode which
will be written to old location in following way: we set
in flags and store new inode number in
i_blocks. Then we mark both inodes dirty.
So data of old inode are placed to new location and on the old location we have reference
to the new one. We use it when directory lookup is done - references from directory structure
are still to old inode number and we need
ext2_lookup() to return new inode. So
ext2_lookup() is patched that way that when it finds out that inode it got
EXT2_RESIZED_FL set it will get inode number from
field and load new inode.
After all inodes are moved we start rewriting references to inodes. We go through all directories (we scan inode table to prevent races) in filesystem. For each directory we scan all its data and when we find directory entry pointing to inode after end of filesystem we resolve the reference (load the old inode and get new inode number) and change inode number to new number. Note that we can't miss any directory entry even when move is done because before move is done source entry is looked up and so we resolve the reference and get the new inode and hence the new inode number is written to created directory entry.
When all references to inodes are rewritten we relocate data blocks. See Relocation of data for details.
Then we sync all inodes so that we don't lose data while invalidating after
remount. We also free any unused descriptor blocks. After we return from
ext2_shrink() we write all superblocks on their places and finish
ext2_shrink() first checks whether there is enough space
in the filesystem for growing. It sounds a bit strange but we need to add new
descriptor blocks to each group for new groups to be accessible. As we don't need
to add any blocks when just growing last group we actually check whether there would
be enough space when we grow last group to its new size.
After the checks are done we grow the last group -- we update number of blocks and free blocks in filesystem and group and clear bits in block bitmap. When growing of last group is done we continue with adding of new groups (of course only when we haven't filesystem large enough). At first we decrease number of free blocks so that we reserve space for new descriptor blocks (see reserving space while shrinking for more details).
Then we start allocating blocks in the beginning of groups for new descriptors. At first we analyze for each group which structures (like inode table or bitmaps) have to be moved to gain needed space. After that we mark all blocks in wanted area (we include also space for moved inode table or bitmaps in the area) as used and then move any data from the area (see Relocation of data for more details). Now when we know there are only structures like bitmaps or inode table in the area we start moving them. We start moving from structure which is last in the group so that we don't overwrite other structures.
Moving of bitmaps is quite simple. We just read bitmap to the memory, move buffer to
new location (rehash the buffer, change
b_blocknr), change reference in superblock,
and mark buffer dirty. Moving of inode table is more difficult. We have two special
entries in superblock for moving inode tables --
s_moved_inode_block. Second entry says which block is actually moved
and the first one says how much it is moved. So when we are moving blocks in inode
table we keep those values uptodate and so
ext2_update_inode() can work properly. Moving itself is simple -- we read
in some blocks, change buffers under them and mark them dirty so they will be flushed
to disk to new location.
When we have all space for group descriptors prepared it's time to create them. This
is done in function
add_descriptors(). This function just update information
kept in memory in superblock (actually the descriptors from first group) but function
write_all_supers() called later will copy it to all groups. For each new
group we set properly (we have to check whether the group is sparse or not) bitmap blocks,
inode table position, and number of free blocks. After setting is done we mark blocks with
descriptors uptodate and dirty.
The last thing needed before we can start using new groups is initializing them.
This does function
init_groups(). As all groups are initialized mostly
same we create template bitmaps in memory. Then we go through groups and in each
group get the blocks with bitmaps, copy template bitmaps to them (we have to check
whether the group is sparse or not and set bits in block bitmap for superblock and
descriptors in that case), and mark them dirty. Then we zero all blocks in inode
table. After that initialization of group is done.
When groups are initialized we set new values (inodes count, blocks count, free
blocks count, free inodes count, groups count, number of descriptor blocks) in superblock
and clear resizing state. Then we sync all inodes so that data in them won't get
invalidate_inodes(). After we return from
write_all_supers() is called so superblocks and group descriptors in
all groups are updated.