Kernel support has two almost independent (they actually share only some useful macros and generic functions) parts:
Code for both parts resides in linux/fs/surprise/
in kernel tree
(surprise/src/online/
in surprise tree) and there are of course
some patches to other kernel files.
Interfaces to both these operations are through ioctl(2)
function.
An ioctl(2)
must be called on the root device of the disk (ie.
/dev/hda
) to be manipulated (source disk in case of move). The
ioctl number is BLKPG
defined in
<linux/blkpg.h>
. See also this header file for definition of
struct blkpg_ioctl_arg
which must be provided as a third argument
of an ioctl(2)
. Field op
in this argument has to be
set to BLKPG_MOVE_PARTITON
for partition moving and to
BLKPG_RESIZE_PARTITION
for partition resizing. Partition move uses
as its argument (fields data
and datalen
in struct
blkpg_ioctl_arg
)
struct blkpg_move_partition
(defined in the header file mentioned
above) and partition resize uses struct blkpg_partition
.
See header file for description of fields of those structures.
We keep struct changed_major
for each major. To be more precise
we have pointer to this structure for each major and once Surprise needs
to do something with the major it allocates the space. In the struct changed_major
we keep request queue for requests on moved partition and also function which used to
return request queue before we changed that function to ours move_part_get_queue
.
In the structure is also array of pointers to struct changed_disk
. We have one
pointer per minor. When move of partition is requested we allocate struct changed_disk
for the disk on which partition lies and set pointers from all minors which belong to
that list to the allocated structure. struct changed_disk
contains device number
of partition which is moved, pointer to struct gendisk
describing the disk, wait
queue for waiting when partition is moved / resized, wait queue for waiting when one
moving phase is done, list of buffers currently moved protected by semaphore and
struct move_map_int
which keeps information about interval of blocks which are
already moved. The array of struct changed_major
and also all arrays of
struct changed_disk
pointers and also pointed structures are protected by
spinlock cd_lock
. The only exception is struct move_map_int
which is
protected by mi_lock
. We don't need to keep cd_lock
while messing with mapped
intervals because we only allocate and never deallocate structures.
Everytime any change in device's partition table or any data move is done it must
be protected by appropriate struct changed_disk
.
Provided functions operating with changed disks are these:
which creates new structure for device moveddev
(waiting if this device is already busy). Note: moveddev
might be (and usually
is) just a partition (ie. /dev/sda4
) but operation on the partition
will block any other operations on the disk (ie. new_changed_dev()
on /dev/sda1
will block if we already move /dev/sda4
).
Function will also allocate array for blocks
buffer head pointers if flag
NCD_ALLOC_BUFS
was specified. If flag NCD_MAP_DISK
was specified,
new_changed_dev()
will set appropriate flag in struct changed_disk
and so that move_part_get_queue()
knows it should map the partition blocks.
will return
struct changed_disk
for given device. It returns NULL
if there's not
any structure.
should be called when
operation on partition is finished. This function will reinitialize struct changed_disk
and wakes up all waiters.
is a bit special. It has more to do with the moving itself. This function will wait till one phase of moving on specified device is done -- it will insert caller into wait queue and starts uninterruptible sleep. After one phase of moving is done all waiters in the queue are woken up.
Partition resize (function int resize_partition(kdev_t rdev, struct
blkpg_partition *part);
) is implemented quite straightforwardly. It will
get struct changed_disk
for the device through new_changed_dev()
, check for potential
partition conflict when growing partition or invalidate buffers when shrinking
partition (to junk buffers which would be after end of partition) and then
alter kernel information about partitions (fields sizes[MINOR(dev)]
and
part[MINOR(dev)].nr_sects
in corresponding struct gendisk
).
The idea behind partition moving is the following: We install our own function
move_part_get_queue()
into struct blk_dev
for major to which belongs moved
partition. When request queue for some device with this major is requested we check
our changed_disk
array and if device (partition) is currently moved we return
our request queue from struct changed_major
. Otherwise we either call function
which was in the blk_dev
before or we return default request queue.
In our request queue we have only make_request()
function set. It is our
function move_part_make_request()
. We don't need other functions as we in
make_request()
forward requests to various other queues.
move_part_make_request()
first gets struct changed_disk
for appropriate
device. If it finds out that currently moved device isn't the device of given buffer
(this could happen if there would be some kind of race -- something happened after
we returned the request queue and before the make_request()
was called) it just
returns success and further processing in generic_make_request
will get proper
queue. After this check we wait through wait_block_moved()
until move phase is
finished if the buffer is currently moved. Then we map the block (buffer) to new block and
device and return success.
wait_block_moved()
simply gets semaphore from array of moved buffer and
scans the array. If it finds buffer in it it will release the semaphore and
call wait_move_done()
which will wait until current moving phase is
finished (see
above). If buffer is not in the
array it will just release the semaphore and return.
The implementation of block mapping is in blockmap.c
. For each mapped
interval we know its first sector, last sector, first sector of partition this
interval belongs to and destination device (root device of a destination disk)
and position of first sector of mapped interval on destination device. So our
mapping function move_map()
just checks whether given block is in mapped
interval and if so it will change appropriately the fields b_rdev
and
b_rsector
.
And now how moving itself (implemented in move.c
) works. After some
checks we get struct changed_disk
for source device -- this device will be mapped.
We check for possible partition conflicts and if moving on different
partition, we create the destination partition in kernel partition information (it is
stored in struct gendisk
). Then we call do_move_partition()
which does the moving and after that we delete the source partition if it isn't also
a destination partition.
Function do_move_partition()
first checks whether we should move blocks
from partition start or partition end (we have to prevent overwriting of blocks we
are going to move...). Than we create mapped interval for moved device - it will
contain no blocks in the beginning. And after that we start moving. We move COPYSTEP
sectors in one phase.
The moving in one phase does function move_blocks()
.
It will first get buffer heads for all blocks to move. As we don't want to disturb
buffer cache nor we want anybody writing the buffer back to disk earlier than we decide.
We use our own function gettmpblk()
which will act almost same as standard getblk()
.
It won't just search in buffer cache whether buffer is there nor it will
hash the created buffer. I personally dislike to have this special function but I haven't
found out any cleaner solution. After we get all the buffers we submit all buffer for reading
(we have to wait till IO on corresponding buffer in buffer cache is done before submitting
read because otherwise we could get inconsistent data). After all reads are submitted
we wait on IO completition. When all reads are completed we update the mapped interval so that
it'll now contain also the freshly read buffers and start submitting buffers for writing.
Now all buffers will be mapped by our move_part_make_request()
function and
so will be written to appropriate location on destination disk (the move_part_make_request()
checks whether buffer it is mapping is the buffer move_blocks()
is
using and if so it won't wait till the phase finishes - otherwise we would have nice
deadlocks :-)). After all writes are submitted and completed we discard the buffers so
they won't make problems in buffer cache. After discarding the buffers we just wake up
processes waiting for end of current phase and return from move_blocks()
.
After all blocks are moved time for rewriting references to source device comes.
This part of code is a bit magical. Rewriting is done by function
move_push_int()
which will call function rewrite_dev_refs()
and also
remove the old mapped interval after it has rewritten all the references (the
interval is not needed any more). For first we search if there is any
struct vfsmnt
for the device (eg. if there is any filesystem mounted on
the device). If yes we rewrite device references in superblock and struct
vfsmnt
to new device. Then we go through all inode lists and rewrite
references in i_dev
to new device. We also scan all pages in page cache
belonging to the inode and rewrite b_dev
in buffers covering those pages.
We have to do this as not all buffers need to be on LRU lists in buffer cache
(create_empty_buffers()
doesn't insert buffers to LRU lists). From now on
no new inodes nor quotas should be created with the old device (OK, at least
for ext2 this is true). Then we go through all struct dquot
and
rewrite device references (we do this pass only in quota is enabled in the
kernel). After that we wait till all getblk()
finishes. We have to do this
as there might be some getblk()
in process and it might have our old
device in its local variables and so it will create buffer on old device (ugly
ugly ugly). After all getblk()
s returned we go through all buffer LRU
lists and rewrite device references in buffers (in b_dev
). We don't alter
any swap-related structures and so you can't swap on moved partition. This
might also change in future.
Another task for future is investigating LVM as it will probably make things a lot easier than they are now (especially rewriting of refernces) and it seems it will be accepted to 2.4 kernel.
Interface to ext2 resizing is done through mount command (I have to admit I
was inspired by one other resizer). Simply you can remount filesystem on new size :-).
After resizing patches ext2 recognizes new remount option resize
with
one numeric parameter - new size in blocks.
Both filesystem shrinking and growing needs to relocate some data blocks. When shrinking we need to move data from groups which will be deleted. When growing we need to move data in order to make place for descriptor blocks which must be situated in the beginning of group.
Data relocation is handled through function rewrite_fs_block_refs()
implemented in ext2-resize.c
. We go through all inodes in filesystem
and if inode is used (bit in inode bitmap is set) and inode is regular file,
directory or symlink with allocated block we relocate the data blocks with function
rewrite_inode_block_refs()
. Note that we can't miss any reference as inodes
can't be moved between inode tables and newly created inodes have references just
to blocks which won't be relocated.
First we mark inode as being rewritten (flag EXT2_REWRITING_FL
)
and then we go through all block references. For each reference we call
rewrite_block_ref()
. This function calls specified callback
function map()
which will return whether this block should really be
relocated and where it should be put. We also pass additional information like
goal
to mapping function so it can allocate blocks for relocation
cleverly. When block should be relocated we either move buffer of the block in
buffer cache to new hash queue and change its b_blocknr
or we find
proper page in page cache and change block number of buffer there (the action depends
whether we are relocating block of directory or regular file). Then we mark
buffer as dirty so bdflush will write it to new place sometimes. Note
that we don't have to worry about existence of buffers for block we are
relocating our buffer to as in that case ext2 would have problems too
(buffer cache vs. page cache races). After all blocks are mapped we remove
EXT2_REWRITING_FL
flag.
Now to mapping functions. Shrinking has function shrink_mapper()
for
this purpose. This function checks whether given block is after new end of filesystem
and if so it calls ext2_alloc_block()
to return it new block. As we have
reserved space for purposes of relocation (see
shrinking for
details) we know we will be successful (we say ext2_alloc_block()
with additional
argument that it should allcate block from reserved space). We also decrease block
quota as this block is in fact already accounted to the user.
Grow mapper is implemented by function grow_mapper()
. It is quite
similar to shrink_mapper()
. It only relocates blocks which lie on place
of future descriptor blocks or on place to which we have to shift inode table.
When error occures while resizing it's usually not good (I'm not counting the
safety checks in the beginning. When they fail everything should remain alright).
The behaviour quite depends on the situation when error occured. For example when
we are moving inodes and we are unable to read some inode or block we just
continue. The problems we cause by this are hopefully smaller than when we would
leave inodes half moved. On places where we just can't ignore the error we
at least try to return filesystem to the state in which fsck would be able
to repair it -- i.e. we return old values to superblock. We also immediately
make filesystem readonly so that nobody can mess with it and we mark that the
filesystem has errors. This functionality is implemented in surprise_ext2_err()
and restore_ext2_super()
.
Function ext2_shrink()
will first do checks whether there is
enough free space and free inodes on filesystem. Then it will count new number
of groups, number of inodes, new number of free blocks and inodes, and will set
those numbers to super block.
Actually it sets number of free blocks and inodes to
the values as they should be after all data will be moved. This is done to prevent other
processes to eat up free space for moved data and so breaking the resizing (as
current function ext2_new_block()
and ext2_new_inode()
doesn't directly check number of free blocks or number of free inodes we have to
add these checks to the functions). It also set bits in block bitmap after end of
new last group and removes bitmaps of deleted groups from memory. So from now
on filesystem looks (at least from superblock point of view) like shrinked and
so no new inodes / blocks will be allocated in areas which will be removed. As
there are in fact some data after end of new filesystem we have to patch
ext2_free_block()
, ext2_free_inode()
and ext2_read_inode()
so that they won't go crazy when manipulating data after end of filesystem.
After altering superblock and bitmaps we start relocating of inodes
This is done by move_inodes()
function. When relocating inodes
we scan through all inodes in groups which will be deleted. When inode is
used (bit in bitmap set) we allocate new inode and swap inode numbers (and also
hash chains and group numbers) of new inode and old inode. Then alter inode which
will be written to old location in following way: we set EXT2_RESIZED_FL
in flags and store new inode number in i_blocks
. Then we mark both inodes dirty.
So data of old inode are placed to new location and on the old location we have reference
to the new one. We use it when directory lookup is done - references from directory structure
are still to old inode number and we need ext2_lookup()
to return new inode. So
ext2_lookup()
is patched that way that when it finds out that inode it got
has EXT2_RESIZED_FL
set it will get inode number from i_blocks
field and load new inode.
After all inodes are moved we start rewriting references to inodes. We go through all directories (we scan inode table to prevent races) in filesystem. For each directory we scan all its data and when we find directory entry pointing to inode after end of filesystem we resolve the reference (load the old inode and get new inode number) and change inode number to new number. Note that we can't miss any directory entry even when move is done because before move is done source entry is looked up and so we resolve the reference and get the new inode and hence the new inode number is written to created directory entry.
When all references to inodes are rewritten we relocate data blocks. See Relocation of data for details.
Then we sync all inodes so that we don't lose data while invalidating after
remount. We also free any unused descriptor blocks. After we return from
ext2_shrink()
we write all superblocks on their places and finish
resizing.
Function ext2_shrink()
first checks whether there is enough space
in the filesystem for growing. It sounds a bit strange but we need to add new
descriptor blocks to each group for new groups to be accessible. As we don't need
to add any blocks when just growing last group we actually check whether there would
be enough space when we grow last group to its new size.
After the checks are done we grow the last group -- we update number of blocks and free blocks in filesystem and group and clear bits in block bitmap. When growing of last group is done we continue with adding of new groups (of course only when we haven't filesystem large enough). At first we decrease number of free blocks so that we reserve space for new descriptor blocks (see reserving space while shrinking for more details).
Then we start allocating blocks in the beginning of groups for new descriptors. At first we analyze for each group which structures (like inode table or bitmaps) have to be moved to gain needed space. After that we mark all blocks in wanted area (we include also space for moved inode table or bitmaps in the area) as used and then move any data from the area (see Relocation of data for more details). Now when we know there are only structures like bitmaps or inode table in the area we start moving them. We start moving from structure which is last in the group so that we don't overwrite other structures.
Moving of bitmaps is quite simple. We just read bitmap to the memory, move buffer to
new location (rehash the buffer, change b_blocknr
), change reference in superblock,
and mark buffer dirty. Moving of inode table is more difficult. We have two special
entries in superblock for moving inode tables -- s_iblocks_shift
and
s_moved_inode_block
. Second entry says which block is actually moved
and the first one says how much it is moved. So when we are moving blocks in inode
table we keep those values uptodate and so ext2_read_inode()
and
ext2_update_inode()
can work properly. Moving itself is simple -- we read
in some blocks, change buffers under them and mark them dirty so they will be flushed
to disk to new location.
When we have all space for group descriptors prepared it's time to create them. This
is done in function add_descriptors()
. This function just update information
kept in memory in superblock (actually the descriptors from first group) but function
write_all_supers()
called later will copy it to all groups. For each new
group we set properly (we have to check whether the group is sparse or not) bitmap blocks,
inode table position, and number of free blocks. After setting is done we mark blocks with
descriptors uptodate and dirty.
The last thing needed before we can start using new groups is initializing them.
This does function init_groups()
. As all groups are initialized mostly
same we create template bitmaps in memory. Then we go through groups and in each
group get the blocks with bitmaps, copy template bitmaps to them (we have to check
whether the group is sparse or not and set bits in block bitmap for superblock and
descriptors in that case), and mark them dirty. Then we zero all blocks in inode
table. After that initialization of group is done.
When groups are initialized we set new values (inodes count, blocks count, free
blocks count, free inodes count, groups count, number of descriptor blocks) in superblock
and clear resizing state. Then we sync all inodes so that data in them won't get
trashed in invalidate_inodes()
. After we return from ext2_grow()
write_all_supers()
is called so superblocks and group descriptors in
all groups are updated.