Saturday, 30 August 2014

Create Linux Virtual File System

Problem statement
Write a Python program to create a Virtual File System in a Linux environment.

   A.    History of Linux filesystem
Linux is a Unix-like operating system, which runs on PC-386 computers. It was first implemented as an extension to the Minix operating system [Tanenbaum 1987], and its first versions included support for the Minix filesystem only. The Minix filesystem has two serious limitations: block addresses are stored in 16-bit integers, so the maximal filesystem size is restricted to 64 megabytes, and directories contain fixed-size entries, so the maximal file name length is 14 characters.

In its very early days, Linux was cross-developed under the Minix operating system. It was easier to share disks between the two systems than to design a new filesystem, so Linus Torvalds decided to implement support for the Minix filesystem in Linux. The Minix filesystem was an efficient and relatively bug-free piece of software.
However, the restrictions in the design of the Minix filesystem were too limiting, so people started thinking about and working on the implementation of new filesystems in Linux. In order to ease the addition of new filesystems into the Linux kernel, a Virtual File System (VFS) layer was developed. The VFS layer was initially written by Chris Provenzano, and later rewritten by Linus Torvalds before it was integrated into the Linux kernel.
   B.    Basic filesystem concepts
Every Linux filesystem implements a basic set of common concepts derived from the Unix operating system [Bach 1986]: files are represented by inodes, directories are simply files containing a list of entries, and devices can be accessed by requesting I/O on special files.
                          I.          Inodes
Each file is represented by a structure, called an inode. Each inode contains the description of the file: file type, access rights, owners, timestamps, size, and pointers to data blocks. The addresses of the data blocks allocated to a file are stored in its inode. When a user requests an I/O operation on the file, the kernel code converts the current offset to a block number, uses this number as an index in the block address table, and reads or writes the physical block. Figure 1 below shows the structure of an inode.
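These inode fields are visible from user space through the stat(2) system call. The following Python sketch (the scratch file it creates is arbitrary) prints the same fields the text describes:

```python
import os, stat, time, tempfile

# Create a small scratch file so the example is self-contained,
# then read back its inode fields via stat(2).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

st = os.stat(path)
print("inode number :", st.st_ino)
print("file type    :", "regular" if stat.S_ISREG(st.st_mode) else "other")
print("access rights:", oct(stat.S_IMODE(st.st_mode)))
print("owner uid/gid:", st.st_uid, st.st_gid)
print("size (bytes) :", st.st_size)
print("modified     :", time.ctime(st.st_mtime))

os.unlink(path)
```

Note that the data block pointers themselves are not exposed by stat(2); they stay internal to the filesystem code.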

Figure 1: Structure of an inode
                        II.          Directories
Directories are structured in a hierarchical tree. Each directory can contain files and subdirectories. Directories are implemented as a special type of file. In fact, a directory is a file containing a list of entries, and each entry contains an inode number and a file name. When a process uses a pathname, the kernel code searches the directories to find the corresponding inode number. After the name has been converted to an inode number, the inode is loaded into memory and is used by subsequent requests. Figure 2 represents the association between the inode table and a directory.
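This "list of (inode number, name) entries" view of a directory can be observed from Python: os.scandir exposes each entry's inode number. A small self-contained sketch using a scratch directory:

```python
import os, shutil, tempfile

# Build a scratch directory with two files, then list its entries
# as (inode number, name) pairs -- exactly what a directory stores.
d = tempfile.mkdtemp()
for name in ("alpha.txt", "beta.txt"):
    open(os.path.join(d, name), "w").close()

entries = sorted((e.inode(), e.name) for e in os.scandir(d))
for ino, name in entries:
    print(ino, name)

shutil.rmtree(d)
```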

Figure 2: Association between the inode table and a directory
                      III.          Links
Unix filesystems implement the concept of links. Several names can be associated with an inode; the inode contains a field holding the number of links associated with the file. Adding a link simply consists of creating a directory entry, where the inode number points to the inode, and incrementing the link count in the inode. When a link is deleted, i.e. when one uses the rm command to remove a file name, the kernel decrements the link count and deallocates the inode if the count becomes zero.
This type of link is called a hard link and can only be used within a single filesystem: it is impossible to create cross-filesystem hard links. Moreover, hard links can only point to files: hard links to directories cannot be created, to prevent the appearance of cycles in the directory tree.
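The link count described above can be watched from Python via os.link and the st_nlink field. A minimal sketch using a scratch directory:

```python
import os, shutil, tempfile

# Watch the link count change as a hard link is added and removed.
d = tempfile.mkdtemp()
original = os.path.join(d, "file")
alias = os.path.join(d, "alias")
open(original, "w").close()

count_before = os.stat(original).st_nlink   # 1: a single directory entry
os.link(original, alias)                    # add a second name
count_after = os.stat(original).st_nlink    # 2: same inode, two entries
same_inode = os.stat(original).st_ino == os.stat(alias).st_ino

os.unlink(alias)                            # rm one name...
count_final = os.stat(original).st_nlink    # ...back to 1; inode survives

print(count_before, count_after, count_final, same_inode)
shutil.rmtree(d)
```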
                      IV.          Device special files
In Unix-like operating systems, devices can be accessed via special files. A device special file does not use any space on the filesystem. It is only an access point to the device driver.
Two types of special files exist: character and block special files. The former allows I/O operations in character mode, while the latter requires data to be written in block mode via the buffer cache functions. When an I/O request is made on a special file, it is forwarded to a (pseudo) device driver. A special file is referenced by a major number, which identifies the device type, and a minor number, which identifies the unit.
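The major/minor pair of a special file can also be read with stat(2). A short sketch, assuming a Linux system where /dev/null exists:

```python
import os, stat

# /dev/null is the classic character special file; its identity is the
# (major, minor) device number stored in the inode, not any data blocks.
st = os.stat("/dev/null")

is_char = stat.S_ISCHR(st.st_mode)
major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
print("character device:", is_char)
print("major/minor     :", major, minor)
```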
   C.    The Virtual File System
The Linux kernel contains a Virtual File System layer which is used during system calls acting on files. The VFS is an indirection layer which handles the file oriented system calls and calls the necessary functions in the physical filesystem code to do the I/O.
This indirection mechanism is frequently used in Unix-like operating systems to ease the integration and the use of several filesystem types [Kleiman 1986, Seltzer et al. 1993].
When a process issues a file oriented system call, the kernel calls a function contained in the VFS. This function handles the structure independent manipulations and redirects the call to a function contained in the physical filesystem code, which is responsible for handling the structure dependent operations. Filesystem code uses the buffer cache functions to request I/O on devices. This scheme is illustrated in figure 3.

Figure 3: Logical diagram of the VFS
   D.    The VFS structure
The VFS defines a set of functions that every filesystem has to implement. This interface is made up of a set of operations associated to three kinds of objects: filesystems, inodes, and open files.
The VFS knows about filesystem types supported in the kernel. It uses a table defined during the kernel configuration. Each entry in this table describes a filesystem type: it contains the name of the filesystem type and a pointer on a function called during the mount operation. When a filesystem is to be mounted, the appropriate mount function is called. This function is responsible for reading the superblock from the disk, initializing its internal variables, and returning a mounted filesystem descriptor to the VFS. After the filesystem is mounted, the VFS functions can use this descriptor to access the physical filesystem routines.
A mounted filesystem descriptor contains several kinds of data: information common to every filesystem type, pointers to functions provided by the physical filesystem kernel code, and private data maintained by the physical filesystem code. The function pointers contained in the filesystem descriptors allow the VFS to access the filesystem's internal routines.
Two other types of descriptors are used by the VFS: an inode descriptor and an open file descriptor. Each descriptor contains information related to files in use and a set of operations provided by the physical filesystem code. While the inode descriptor contains pointers to functions that can be used to act on any file (e.g. create, unlink), the file descriptor contains pointers to functions which can only act on open files (e.g. read, write).
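This dispatch through function pointers can be illustrated with a toy Python sketch. This is not kernel code: FakeExt2, FakeVfat, and vfs_read are made-up stand-ins for the real descriptor structures and operation tables.

```python
# A toy model of the VFS indirection: each mounted filesystem supplies
# its own operations object, and the "VFS" layer just looks up the
# right descriptor and calls through it.

class FakeExt2:
    """Hypothetical physical-filesystem code for one fs type."""
    def read(self, path):
        return "ext2 data for " + path

class FakeVfat:
    def read(self, path):
        return "vfat data for " + path

# Mounted-filesystem descriptors: mount point -> operations object
mounts = {"/": FakeExt2(), "/dos": FakeVfat()}

def vfs_read(path):
    # Structure-independent part: pick the longest matching mount
    # point, then redirect to the structure-dependent code behind it.
    mp = max((m for m in mounts if path.startswith(m)), key=len)
    return mounts[mp].read(path)

print(vfs_read("/etc/fstab"))    # handled by FakeExt2
print(vfs_read("/dos/game.exe")) # handled by FakeVfat
```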

   E.    The Linux VFS
The Linux kernel implements the concept of Virtual File System (VFS, originally Virtual Filesystem Switch), so that it is (to a large degree) possible to separate actual "low-level" filesystem code from the rest of the kernel. The API of a filesystem is described below.
This API was designed with things closely related to the ext2 filesystem in mind. For very different filesystems, like NFS, there are all kinds of problems.
Four main objects: superblock, dentries, inodes, files
The kernel keeps track of files using in-core inodes ("index nodes"), usually derived by the low-level filesystem from on-disk inodes.
A file may have several names, and there is a layer of dentries ("directory entries") that represent pathnames, speeding up the lookup operation.
Several processes may have the same file open for reading or writing, and file structures contain the required information such as the current file position.
Access to a filesystem starts by mounting it. This operation takes a filesystem type (like ext2, vfat, iso9660, nfs) and a device and produces the in-core superblock that contains the information required for operations on the filesystem; a third ingredient, the mount point, specifies what pathname refers to the root of the filesystem.
Auxiliary objects
We have filesystem types, used to connect the name of the filesystem to the routines for setting it up (at mount time) or tearing it down (at umount time).
A struct vfsmount represents a subtree in the big file hierarchy - basically a pair (device, mountpoint).
A struct nameidata represents the result of a lookup.
A struct address_space gives the mapping between the blocks in a file and blocks on disk. It is needed for I/O.

   F.     Implementation
You can take a disk file, format it as an ext2 or ext3 filesystem, and then mount it, just like a physical drive. It is then possible to read and write files to this newly mounted device. You can also copy the complete filesystem to another computer, since it is just a file.
This is an excellent way to investigate different filesystems without having to reformat a physical drive, which means you avoid the hassle of moving all your data. This method is quick -- very quick compared to preparing a physical device. You can then read and write files to the mounted device; what is truly great about this technique is that you can explore different filesystems such as ext3 or ext2 without having to purchase an additional physical drive. Since the same file can be mounted on more than one mount point, you can also investigate sync rates.
Creating a filesystem in this manner allows you to set a hard limit on the amount of space used, which, of course, will be equal to the file size. This can be an advantage if you need to move this information to other servers. Since the contents cannot grow beyond the file, you can easily keep track of how much space is being used.
First, create a 20 MB file by executing the following command:

pavan@ubuntu~:$ dd if=/dev/zero of=disk-image count=40960
40960+0 records in
40960+0 records out

You created a 20 MB file because, by default, dd uses a block size of 512 bytes. That makes the size 40960 * 512 = 20971520 bytes.
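The arithmetic can be checked in a couple of lines of Python:

```python
# dd's default block size is 512 bytes, so count=40960 gives exactly 20 MB.
count, block_size = 40960, 512
size_bytes = count * block_size
print(size_bytes, "bytes =", size_bytes // (1024 * 1024), "MB")
```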

pavan@ubuntu~:$ ls -l disk-image
-rw-rw-r--    1 pavan  pavan  disk-image

Next, to format this as an ext3 filesystem, you just execute the following command:

pavan@ubuntu~:$ /sbin/mkfs -t ext3 -q disk-image
mke2fs 1.32 (02-Aug-2014)
disk-image is not a block special device.
Proceed anyway? (y,n) y

You are asked whether to proceed because this is a file, and not a block device. That is OK. We will mount this as a loopback device so that this file will simulate a block device. Next, you need to create a directory that will serve as a mount point for the loopback device.

pavan@ubuntu~:$ mkdir fs

You are now one step away from mounting the filesystem. You just need to find out the next available loopback device number. Normally, loopback devices start at zero (/dev/loop0) and work their way up (/dev/loop1, /dev/loop2, ..., /dev/loopn). An easy way to find out which loopback devices are in use is to look at /proc/mounts, since the mount command may not give you what you need.
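This scan for a free loopback device can be automated in Python. The sketch below takes the text of /proc/mounts as a parameter (first_free_loop is a hypothetical helper, parameterized so it can be tried without root or a real /proc):

```python
import re

def first_free_loop(mounts_text):
    """Return the first /dev/loopN not present in a mounts listing.

    mounts_text is the content of /proc/mounts, passed in as a string.
    """
    used = {int(m) for m in re.findall(r"/dev/loop(\d+)\s", mounts_text)}
    n = 0
    while n in used:
        n += 1
    return "/dev/loop%d" % n

sample = "/dev/loop0 /mnt/a ext3 rw 0 0\n/dev/loop1 /mnt/b ext3 rw 0 0\n"
print(first_free_loop(sample))  # /dev/loop2
print(first_free_loop(""))      # /dev/loop0
```

On a live system you would call it as first_free_loop(open("/proc/mounts").read()).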

pavan@ubuntu~:$ cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw 0 0
/proc /proc proc rw,nodiratime 0 0
none /sys sysfs rw 0 0
/dev/sda1 /boot ext3 rw 0 0
none /dev/pts devpts rw 0 0
/proc/bus/usb /proc/bus/usb usbdevfs rw 0 0
none /dev/shm tmpfs rw 0 0

On my computer, I have no loopback devices mounted, so I'm OK to start with zero. You must do the next command as root, or with an account that has superuser privileges.

pavan@ubuntu~:$ mount -o loop=/dev/loop0 disk-image fs

That's it. You just mounted the file as a device. Now take a look at /proc/mounts; you will see it is using /dev/loop0.

pavan@ubuntu~:$ cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw 0 0
/proc /proc proc rw,nodiratime 0 0
none /sys sysfs rw 0 0
/dev/sda1 /boot ext3 rw 0 0
none /dev/pts devpts rw 0 0
/proc/bus/usb /proc/bus/usb usbdevfs rw 0 0
none /dev/shm tmpfs rw 0 0
/dev/loop0 /home/pavan/junk/fs ext3 rw 0 0

You can now create new files, write to them, read them, and do everything you normally would do on a disk drive. To check the details of the newly created filesystem, use:

pavan@ubuntu~:$ df -h

If you need to unmount the filesystem, just issue the umount command as root. If you need to free the loopback device, execute the losetup command with the -d option. You can execute both commands as follows:

pavan@ubuntu~:$ umount /home/pavan/junk/fs
pavan@ubuntu~:$ losetup -d /dev/loop0

Python source code (Contributed by Nikhil Gupta)
import sys, subprocess, os

# Usage: sudo python vfs.py <image-file> <size> <b|kb|mb> <mount-point>
if len(sys.argv) != 5:
    print("Invalid arguments!!")
    print("Usage: python vfs.py <image-file> <size> <b|kb|mb> <mount-point>")
    sys.exit(1)

if not os.path.exists(sys.argv[4]):
    print("Mount point " + sys.argv[4] + " does not exist!!")
    sys.exit(1)

if sys.argv[3] == "b":
    size = int(float(sys.argv[2]))
elif sys.argv[3] == "kb":
    size = int(float(sys.argv[2]) * 1000)
elif sys.argv[3] == "mb":
    size = int(float(sys.argv[2]) * 1000000)
else:
    print("Wrong block type!!")
    print("Supported blocks are:")
    print("1. Bytes, identified here as 'b'")
    print("2. KiloBytes, identified here as 'kb'")
    print("3. MegaBytes, identified here as 'mb'")
    sys.exit(1)

# Create the disk image of the requested size
f = open(sys.argv[1], "w")
f.truncate(size)
f.close()

# Format it as ext3 (-F: proceed even though it is not a block device)
mkfs = "mkfs -t ext3 -q -F " + sys.argv[1]
subprocess.check_call(mkfs, shell=True)

# Find the first free loopback device by scanning /proc/mounts;
# grep -c exits with status 1 when it finds no match
x = 0
while True:
    freeloop = "loop" + str(x)
    check = "grep -c '" + freeloop + " ' /proc/mounts"
    down = subprocess.call(check, shell=True)
    if down == 1:
        break
    x = x + 1

mount = "mount -o loop=/dev/" + freeloop + " " + sys.argv[1] + " " + sys.argv[4]
flag = subprocess.call(mount, shell=True)
if flag == 0:
    print("Virtual device " + sys.argv[1] + " created successfully at " + sys.argv[4] + "!!")
    print("Details about the created file system:")
    subprocess.check_call("df -hT " + sys.argv[4], shell=True)
else:
    print("Failed!!")

Output screenshots

In this way, we understood the basics of the Linux Virtual File System. We illustrated the logical structure of the VFS with its important components and auxiliary objects, and used a series of Linux commands to create a virtual filesystem on Linux.