Partition Surprise 0.1: Supported filesystems: Linux ext2 filesystem internals

9.1 Linux ext2 filesystem internals

Ext2 filesystem is implemented using ext2fs library (written by Theodor Ts'o), so it supports all ordinary ext2 filesystems. Low level access to partition is done via IO-managers, the middle level job is done by libext2fs and the filesystem is controlled in high level by these sources.

All the internal functions returning numbers use negative number to signalize error has occured, functions returning pointers use the NULL value. An ERR_OCCURED() macro is used if situations not covered by these rules.

Ext2 headers

The simplest and most basic header file is types.h. It just defines ext2_debug macro and includes general system headers. Every source file imports this header at first.

The next 3 simple headers params.h and error.h/error_chket.h are generated by scripts called in Makefile. They contain declaration of filesystem parameters and error messages.

Very important headers are imported.h, exported.h. They define all internal data structures booth for imported (ext2_ptdata, ext2_dirdata, ext2_eidata) and exported (ext2_ptdata, ext2_exp_dirdata) filesystems.

struct ext2_ptdata {
#ifndef NDEBUG
        int magic;
#endif
       iconv_t conv;            /* conversion tables */         
       ext2_filsys e2_fs;
       ERR_PARAM;

       /* For import:  */
       block_t ff_block;        /* searching free blocks behind this block */

        /* For export:  */
       badblocks_list bb_list;
       int mkfs_done:1;
       char *buf;               /* buffer for sbmap */
};

The ext2_ptdata data structure contains very few attributes, because all important information is stored by libext2fs in ext2_filsys e2_fs. The structure just saves opened handles, indexes and pointers to buffers.

There are no definitions of Ext2 filesystem RAW structures (superblock, inode, etc...). All the job touching these structures is done by libext2fs and the Suprise code doesn't have to touch them.

Ext2 layers

Miscelaneous modules

error.c

Generated by Makefile scripts, contains error table initialisation.

params.c

Generated by gener_params.m4, contains Ext2 fs parameters definition, conversion functions etc...

interface.c

This source contains only the definition of fs_calls type (function callbacks).

1st layer: IO managers

Filesystem needs to be able to access partition in 2 modes:

RAW mode: Part module calls imported filesystem callbacks giving a file descriptor and unsigned int offset (given in blocks). Filesystem accesses partition by reading from/writing to this opened file, but it has to add the offset at first.
GCONV mode: when the conversion is performed, GConv may use any free blocks to store filesystem data and the filesystem must NOT care about it. Accessing partition is done by calling bread/bwrite functions, thay operate on blocks of size BASIC_BLOCKS (= 512b).

The libext2fs doesn't care about physical representation of the filesystem, so it allows using IO-managers. IO-manager is a set of standardized interface routines (callbacks) that can read from/write to the disks.

Two IO-managers are introduced: handle_io.[ch] implements handle_io_manager (for RAW mode) and surprise_io.[ch] implements surprise_io_manager (for GCONV mode). The IO-managers are implemented using cut/paste method from standard unix IO-manager from libext2fs. The first one save a file descriptor and logical offset, the second one saves pointer to GConv in their private data structures. The *_open data structures initializes them and *_close deletes them. The *_set_blksize sets the logical block-size to any multiple of BASIC_BLOCK, *_flush flushes all caches and *_read_blk, *_write_blk do the partition access.

libext2fs doesn't support the Surprise error-system, so all the ERR_PARAMs in surprise_io_manager (using GConv bread, bwrite functions) must be transferred via struct ext2_ptdata. Special macros ERR_PARAM_CHAN(channel), ERR_DECL are used instead standard ERR_PARAM, ERR_DECL.

There's nothing interesting to be noted about the IO-managers. They just call the read, write, bread, bwrite functions of lower layer. The only `hacked' function is surprise_write_blk because it has to support writing less than BASIC_BLOCK bytes. It must read the original block to the internal buffer buf at first, modify it and store it again, because GConv supports access aligned to blocks.

Initialization routines

The inits.[ch] modules contains global initialization routines and implementation of some basic filesystem callbacks.

ext2_init initializes 2 error tables (for libext2fs and itself) and checks few asserts.

ext2_read_super reads a superblock from given filehandle to ext2_super_block data structure and checks a signature and few important parameters. If the filesystem seems to be correct, initial imported filesystem parameters are parsed from the superblock. ext2_fs_identify just calls ext2_read_super and returns whether the filesystem is correct. ext2_fs_getparams fill the default filesystem parameters and calls ext2_read_super too. The imported parameters are returned to the caller.

Hacks to libext2fs

The libext2fs is a very good library, but 1 feature was missing in time of writing Surprise. The ext2fs_bmap function was able to read an inode block numbers or to allocate a new one. A simple setting the block number to given value was missing.

We introduce new ext2fs_sbmap function in sbmap.c module that implements this feature if BMAP_SET flag is set. The body is a bit modified to compute the number of blocks used by the inode in correct way -- BMAP_SET could be usedto allocate a block, to modify its number or to delete it -- but the code stays almost the same.

2nd layer: libext2fs

The middle-level job is done by libext2fs.

3rd layer: the import and export functions

These functions are divided into 4 groups:

getting info about imported filesystem (e2info.[ch])
importing a filesystem (imported.[ch])
checking the possibility of creating exported filesystem and mkfs (mkfs.[ch])
exporting a filesystem (exported.[ch])

Getting info about imported filesystem

The e2info.[ch] implements the ext2_get_info callback. It doesn't implement the fc_getrootdirs callback yet (because it is not needed to).

A static function ext2_proc_sumdir computes the total length of link names in one directory.

ext2_get_info sums the total length of file in the filesystem by browsing all inodes and using libext2fs directory iterators. A filesystem is opened by ext2fs_open and all filesystem dependent parameters are read. Then ino is iterated through all inodes. If the inode is a regular file with at least 1 link, its length is added to the total counter. If the inode is a directory, ext2fs_dir_iterate(...,ext2_proc_sumdir,...) is called to sum the link names lengths in the directory. Now all the information has been parsed and the results are returned.

Importing a filesystem

imported.[ch] implements all imported filesystem callbacks using libext2fs iterators. The algorithm loads everything using the iterators immidiately after the object is opened, and the callbacks just browse the data structure in the memory. All the data blocks of regular files and directory entries are linked into link lists.

The module uses the following types:

struct ext2_blocklist

contains up to EXT2_BLCOUNT==512 extents of regular file blocks. The pointer next points to next record in the link list, until the whole file is browsed.

struct ext2_blkdata

points to head and tail of link list of struct ext2_blocklists. It contains all the information about file placement.

struct ext2_datdata

contains the contents of a file (usually a softlink). All the data will be loaded into buf of size size.

struct ext2_eidata

contains an internal information about a loaded directory entry. The pointer to this will be stored in entry->ei_fsdata. It saves the pointer to internal directory data struct ext2_dirdata *dirdata, to list of the regular file blocks struct ext2_blkdata blkdata and to next loaded directory entry struct entry_info *next (next internal record can be accessed by next->ei_fsdata).

struct ext2_dirdata

contains the link list (struct entry_info *first, *last) of preloaded directory entries plus some usefull pointers. All the information about directory is gathered here.

The code can be divided into these sections:

Initialization routines

The module implements constructor ext2_i_init. It opens a libext2fs handle, conversion table and read the filesystem bitmaps. The destructor ext2_i_done closes the conversion table and the libext2fs filesystem.

Free blocks are read by ext2_freeblocks. It exports a free space between end of filesystem and end of partition at first. Then it browses all the blocks in the bitmap and exports them into target block_list.

Softlink implementation

The contents of a file (usually a softlink) can be read by ext2_load_file function. This function iterates ext2_data_proc through all file blocks. The ext2_data_proc just loads the given blocks into the buffer. ext2_load_softlink recognizes fast softlinks (stored directly in the inode) and ordinary softlinks. After softlink is read, the contents is converted to UCS4.

Handling directory entries

The contents of a directory is loaded by ext2_opendir functions that iterates ext2_proc_direntry through all directory entries. The ext2_proc_direntry function builds a link list of them. It reads the inode and fills the entry_info. Then it converts the directory link to UCS4 by iconv. If the link is a softlink, ext2_load_softlink is called. Internal data structure ext2_eidata is created and filled. The record is appended to the link list. Booth ext2_open_rootdir and ext2_open_subdir just call the ext2_opendir for proper inode (EXT2_ROOT_INO for root directory, dir->ei_fsdata->inode for subdirectory). This works for imported filesystems, the exported one are handled in different way, see Exporting a filesystem. ext2_close_dir deallocates all the link list.

ext2_nextentry just deletes the head of link list and moves the pointer to the next record.

Importing a regular file blocks

The blocks of regular file are read by iteration ext2_blocks_proc. This function just calls ext2_blocks_add to append a block to builded link list. It supports files with holes (if a block number is much higher than the previous one, an amount of INVALID_BLOCKs is inserted at first). ext2_blocks_add appends a block to the link lists. It tests a continuity of the last extent at first. If the block number matches, the extent is enlarged, else a new extent is created. If the array of extents in struct ext2_blocklist is fulfilled, a new record is created and linked at the of the link list struct ext2_blkdata. ext2_getdata reads a list of the file blocks at first (if it wasn't already loaded) by iterating ext2_blocks_proc. All the ext2 blocks are converted to GConv blocks by multiplicating by the block_size/BASIC_BLOCK. The last block is cutted to round the file to correct multiple of BASIC_BLOCKs (not to multiple of block_size). The preloaded extents are copied to target struct block_list until EOF is reached or the target block_list is fulfilled. ext2_getdone deallocates the struct ext2_blkdata linklist.

Handling the hardlinks

The imported hardlinks are handled in an easy way. Every inode with i_links_count>1 will force GConv to call ext2_hardlink_get_imported. This function has a dummy implementation, because it has nothing to do -- all the attributes (ei_inode number) have already been set. ext2_hardlink_hashkey divides the hardlinks into hash groups depending by ei_inode. ext2_hardlink_compare just compares the two inode numbers -- if they are equal, the hardlinks are the same. ext2_hardlink_get_entry creates the struct entry_info and struct ext2_eidata to let GConv call ext2_getdata on the hardlink. It just sets the inode number and blocks count. ext2_hardlink_free_imported just deletes the memory.

Checking the possibility of creating exported filesystem and mkfs

The functions in module mkfs.[ch] get the exported filesystem size, struct fs_info containing miscelaneous sums of sizes of imported filesystem files and checks whether the exported filesystem could be created. It takes care about inode_ratio (maximal count of inodes), block_size (rounding imported file sizes up to multiple of block_size), etc...

Computing the filesystem defaults and wasted space

Static function set_fs_defaults sets the inode_ratio, block_size parameters to optimal value depending on the partition size. It is taken from mkfs.ext2. show_stats displays detailed information about structure of created filesystem to the debug output. ext2fs_size approximates the number of blocks `wasted' by the filesystem. It takes into account these records:

superblock,
group descriptors,
inode and block bitmap,
inode tables,
reserved inodes,
space in lost+found directory.

ext2_available_blocks computes the count of blocks (of given block_size) available for user files.

approximate_ext2dir_size is a heuristic function that computes the amount of space wasted in directories. It gets a count of files and directories and the total length of all filenames and computes the CERTAIN upper bound of wasted blocks. This bound is somewhat big and could be modified later.

fill_other_parameters is an important function that computes all missing exported filesystem parameters that depend on the others. It computes blocks_per_group and one of bytes_per_inode, inodes_count depending on force_inodes value.

Checking the parameters and finding the best values

static int check_ext2_parameters(struct ext2_parameters *pars,struct user_partition *part,struct fs_info *info,...) does all the tests. At first it checks whether the partition is not too small to contain a filesystem (there must by some available blocks). Then it compares whether the inodes_count is grater or equal than the count of exported files and directories. It approximates the direcory size by calling approximate_ext2dir_size and looks whether there are enough free blocks. If this everything is OK, the filesystem could be created.

modify_ext2_parameters is an intelligent function that computes optimal bytes_per_inode, block_size values for given filesystem (if they weren't forced by setting force_... flags). It is a recursive function that browses the block_sizes in 1st level, the bytes_per_inode in 2nd level and calls check_ext2_parameters in 3rd level. Browsing in every level operates in this way: it the parameter is forced, it's value is not changed. If it isn't, we try all the possible values from the smallest (1024/2048/4096 and 1024/2048/.../262144). We never accept values returning error. If we find a value without a warning, we choose it and stop the job. ext2_check just checks the current parameter set by calling check_ext2_parameters. If any warning/error is found, a recommended configuration is probed by calling set_fs_defaults. If it doesn't fit, modify_ext2_parameters is called. It stores a correct MBR partition type (0x83) too.

ext2_getsize, ext2_getsize_struct just call ext2fs_size, approximate_ext2dir_size and compute the total exported file size for given block_size and return the count of blocks.

Fast filesystem rebuild

ext2_superblock is an accelerator. It is called if some filesystem parameters are changed, but the partition type, size and offset is not changed. It checks whether the fast conversion (just replacing the superblock) could be performed. Ext2 filesystem supports replacing these superblock parameters: reserved_percent, max_mount_count, check_interval, error_handling, volume_name. The function reads imported and exported parameters and checks whether all important parameters are equal. If they are not, nonzero is returned. In positive case we read the superblock from imported filesystem, modify the parameters and write it to exported filesystem.

Preparing a filesystem

ext2_prepare_superblock gets the exported filesystem parameters and fill the libext2fs superblock to make the correct mkfs.

ext2_mkfs does the very 1st part of mkfs. It initializes internal data structures, conversion tables and prepares the superblock by calling ext2_prepare_superblock. Then the filesystem is opened by calling ext2_initialize. Now the superblock is almost prepared. We generate an unique uuid by calling uuid_generate, show the statistics and flush the superblock. Then we allocate an empty list of bad blocks and buffer for sbmap function. There are many things to do to finish the mkfs, they will be done in ext2_open_rootdir after all the bad blocks are registered (to force not to use them). ext2_fsdone just deallocates the buffer, bad block list, conversion tables and flushed and closes the filesystem. The job will be complete.

ext2_putbadblocks takes all the extents from given block_list and checks whether they don't the superblock. If they don't, bad blocks are registered to the list (they will be checked again by handle_bad_blocks), else an error is returned.

Finishing mkfs

The rest of the sources is a 2nd phase of mkfs taken from mkfs.ext2. The defined functions are:

handle_bad_blocks

tests whether the bad blocks don't touch the important data structures. If they don't, they are signed as bad in the bitmap using the default iterator bb_iter through the bad block list.

write_inode_tables

clears all the inodes in the partition (filled by zero).

create_root_dir

calls a libray call ext2fs_mkdir on EXT2_ROOT_INO and sets its uid.

create_load_and_found

calls ext2fs_mkdir in the root directory (its alse creates a link) and expands the directory to contain enough free directory entries.

create_bad_block_inode

reserve_inodes

marks all the inodes before first user available inode as used in the inode bitmap.

zap_bootblock

fills the first sector by zero.

The ext2_finish_mkfs function called by ext2_open_rootdir calls all the previous functions in given order and flushes the filesystem.

Exporting a filesystem

Directory functions are a bit modified to work with exported filesystems too. The ext2_open_rootdir calls ext2_finish_mkfs at first (it wasn't already called). The ext2_exp_open_dir is also called instead ext2_opendir, because the internal data structures are completely different for exported filesystems.

The module exported.[ch] introduces new internal data structure struct ext2_exp_dirdata.

struct ext2_exp_dirdata {
        ino_t dir_inode;                        /* of this directory */
        ino_t file_inode;                       /* of current directory entry */
        int write_inode:1;                      /* shall the inode be written? */
        int write_pos;                          /* number of blocks written to file */
        struct ext2_inode inode;                /* filled inode with created file */
        struct extent to_fill;                  /* unfilled block after ext2_putdata */
        struct block_list free_ext2;            /* from last fc_createentry */
        struct block_list *free_blocks; 
};

Directory access functions

New open directory function ext2_exp_open_dir fills the struct ext2_exp_dirdata dir_inode attribute and clean all other attributes. The first directory entry will be read next time. We need not to redefine close directory function, because deallocation of the data structures is enough.

Converting GConv and ext2 extents

ext2_gconv_extent converts a GConv extent (interval of BASIC_BLOCKs) to Ext2 extent (interval of blocks). It can round the interval in booth inner/outer way. blocks_extent_list does the same for the struct block_list.

ext2_align is a callback aligning GConv extent into maximal subextent aligned to blocks. It just calls ext2_gconv_extent and converts the block numbers back to GConv blocks.

Manipulating directory entries

find_unique_name performs a lookup through a directory and appends a random postfix if the filename is already present. This is good when more distinct filenames are converted to the same because they contain wide characters unknown in target codepage.

Static function set_inode_attributes formats less important common attributes (date/time and uid/gid) to directory entry, set_inode_mode sets the correct file mode and size and resets i_links_count, reset_exported_dirdata copies all important parameters from imported struct entry_info *file to exported one (to allow opening directory), resets to_fill extent (none to fulfill yet) and write_post index and sets the new file_inode number. register_inode_in_bitmaps marks the inode as used in the inode bitmap.

ext2_write_softlink is the opposite of ext2_load_softlink. It converts softlink from UCS4 to 8bit charset, stores the data by calling ext2fs_file_write and calls reset_exported_dirdata. It doesn't support fast softlinks yet.

Static function link_type returns a 3bit identificator of link type to save it into directory entry (regular file, directory, etc...). create_unique_linked_inode gets an unique link name and creates and fills the inode and links it into the directory. The directories are created by calling ext2fs_mkdir and the inode is modified by calling set_inode_attributes. Other inodes (regular files, devices, etc...) are handled in this way: a new inode is allocated by calling ext2fs_new_inode, registered by register_inode_in_bitmaps and linked by calling ext2fs_link. The inode mode and attributes are set by set_inode_(mode|attributes) and the softlink is optionally written by ext2_wrote_softlink. In all cases, reset_exported_dirdata is called to make internal data structures consistent. A raw inode will stay loaded in struct ext2_exp_dirdata to be filled by ext2_putdata.

ext2_createentry stores the free_blocks pointer, converts the filename from UCS4 to given codepage by safe_iconv and checks whether it isn't a lost+found subdirectory in root directory. If it is, only the reset_exported_dirdata is called on already existing inode, else an unique name is found by find_unique_name and the inode is properly created and linked by create_unique_linked_inode.

ext2_putdone writes the inode (with updated block numbers) if it was modified and deallocates the converted free_ext2 block list.

Exporting regular file blocks

Few static functions are defined: extent_contains_block returns true if the given block will fit the given extent, block_is_free tests the block bitmap, whether the given block is free and checks the reserved extent (for append optimalizations, see the discussion in this and the next paragraph). ext2_find_free finds a free block. It is a bit complicated, because a given block is tested at first (to prevent file fragmentation) and the block_list of preferred blocks is used after (if set). If none free block is found, all the filesystem is searched. Attention! Some blocks are already used, but they haven't been written to bitmap yet to enable optimalizations. These blocks are always ordered in one extent struct ext2_extent *except and they are never used. The boolean function block_creates_indirect returns true if the given block is the first block that needs to allocate a new indirect block. The ext2_putdata must handle an exception in this case, else the blocks allocated by the file would become corrupted (because sbmap doesn't know anything about reserved blocks).

To optimalize appending blocks to file we remember the last appended blocks extent last and just increment its length if the file is not fragmented. After next fragment is used the previous will be flushed out. append_1_block checks the previous extent and increments its length if the appended block fits. If it doesn't, the function flushes the previous extent by calling ext2fs_sbmap and stores the extent to struct block_list *target. If the output buffer becomes full, it it signalized to the caller to finish immidiately the job. If the appended block is correct, new extent is created (negative values are used for flushing, they don't create a new extent). It must be noted that the allocated blocks have not been stored in bitmap yet, so searching free blocks would return them as free. A workaround is enabled: block_is_free, ext2_find_free functions expects a special reserved block extent containing blocks that must not be returned as free.

ext2_putdata clears the output block buffer last and remembers the free_blocks extent. If a previous unfilled block to_fill is set, it is filled at first (because files are stored in discrete complete blocks) and the pointer to exported blocks is shifted.

The main cycle browses through all exported blocks: variable i contains index of exported extent and j contains index of block in extent i. At first it repairs extent overflow (moves to next one if the previous has been already read) and breaks if all exported extents have already been read. Then it tests whether block_creates_indirect. If it does, the output block cache is flushed by calling append_1_block(-10,...) to make sbmap find a correct free block after it will look for one. Then it gets the block index and checks whether it could be used for file (the best way is NOT moving file blocks to speed up the permutation at the end of conversion, but is not always possible):

It must not be INVALID_BLOCK (GConv signalizes that the block is not available and filesystem must find the best position alone).
It must fit the Ext2 data area.
It must be aligned to block_size.
The exported extent must contain enough blocks to fill the block.
The block must be free (checked by calling block_is_free).

If any of these conditions is not true, a new free block is searched by calling blocks_extent_list and ext2_find_free. Filesystem prefers allocation of the next block to prevent the fragmentation. The output block extent buffer last is stored to prevent ext2_find_free allocate them. A found block is appended by calling append_1_block, which implements the output cache. The exported blocks are skipped and the cycle continues.

The output block buffer is flushed by calling append_1_block(-10) and the exported blocks overflow is checked: if GConv gives the filesystem 5 blocks but the ext2 block contains 8 GConv blocks, a whole block_size was allocated and 3 more GConv blocks have been skipped in the list. These blocks will be stored into to_fill and will be fulfilled next time or cutted after file will be closed.

Creating the hardlinks

Ext2 filesystem supports the hardlinks. When GConv asks the filesystem to create one, it just reads the inode, increments the i_links_count and links the inode to the next directory entry. ext2_hardlink_get_exported allocates new hardlink record containg the only important attribute inode and 1 debugging flag whether the hardlink has been duplicated. hardlink_free_exported deletes the memory only.

ext2_hardlink_create does almost the same job as ext2_createentry. It converts the filename from UCS4 to given codepage and calls find_unique_name. If the hardlinked inode is a directory (which is NOT supported by ext2 filesystem) or another special case has occured, the hardlink is duplicated and the directory entry is created by create_unique_linked_inode. In the most cases the real hardlink will be created: the inode is read, its i_links_count is incremented and the inode is written to the partition, new link is created by ext2fs_link, reset_exported_dirdata is called and the job is finished. The first case continues by calling ext2_hardlink_duplicate which gets a duplicated hardlink record and prepare struct entry_info and struct ext2_exp_dirdata to make ext2_putdata work correctly.

Next Previous Contents