How does the Linux VFS work?
In Linux 2. The dentries encode the filesystem tree structure and the names of the files. Thus, the main parts of a dentry are the inode (if any) that belongs to it, the name (the final part of the pathname), and the parent (the name of the containing directory).
There are also the superblocks, the methods, a list of subdirectories, etc. Some of these names were badly chosen and lead to confusion. However, if the dentry is the root of a mounted filesystem i. It points back to the dentry itself in case of the root of a filesystem. A dentry is called negative if it does not have an associated inode. We see that although a dentry represents a pathname, there may be several dentries for the same pathname, namely when overmounting has taken place.
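The relationship between name, parent and inode can be made concrete with a toy model. This is only a sketch in Python, not kernel code: the class ToyDentry, its fields and the inode numbers are all made up for illustration. It assumes the usual kernel convention that the root dentry is its own parent, and it treats a dentry without an inode as negative.

```python
class ToyDentry:
    """Toy stand-in for a dentry: a name, a parent and an optional inode."""

    def __init__(self, name, parent=None, inode=None):
        self.name = name
        self.inode = inode                      # hypothetical inode number, None = negative
        self.children = {}
        # Assumption: a root dentry is its own parent.
        self.parent = parent if parent is not None else self
        if parent is not None:
            parent.children[name] = self

    def is_root(self):
        return self.parent is self

    def is_negative(self):
        return self.inode is None

    def path(self):
        # Walk parent pointers up to the root, collecting names on the way.
        parts = []
        d = self
        while not d.is_root():
            parts.append(d.name)
            d = d.parent
        return "/" + "/".join(reversed(parts))


root = ToyDentry("/", inode=2)
etc = ToyDentry("etc", root, inode=11)
passwd = ToyDentry("passwd", etc, inode=42)
missing = ToyDentry("nosuchfile", etc)          # cached "does not exist" result
```

Because every dentry records its parent, a full pathname can be rebuilt from any dentry by walking upwards, which is what makes calls like getcwd cheap to answer.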
Such dentries have different inodes. Dentries are used to speed up the lookup operation. The file offset can be set by the lseek system call. Note that instead of a pointer to the inode we have a pointer to the dentry; that means that the name used to open a file is known. In particular, system calls like getcwd are possible. It is the list of all files belonging to a given superblock.
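The fact that the offset belongs to the open file, and not to the inode, can be observed from user space. The following is a minimal sketch using Python's os module on a temporary file; the function name offsets_demo is made up for this example. A descriptor created with dup shares the offset of the original, while a second, independent open of the same path starts again at offset 0.

```python
import os
import tempfile


def offsets_demo():
    """Show that the offset lives in the open file, not in the inode."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.lseek(fd, 2, os.SEEK_SET)        # move the offset of this open file to 2
        first = os.read(fd, 3)              # reads from offset 2

        shared = os.dup(fd)                 # same open file, so same offset
        second = os.read(shared, 3)         # continues where the first read stopped
        os.close(shared)

        other = os.open(path, os.O_RDONLY)  # a new open file with its own offset
        third = os.read(other, 3)           # starts at offset 0
        os.close(other)
        return first, second, third
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling offsets_demo() returns three byte strings, one per read, so the effect of lseek and of sharing an open file can be compared side by side.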
A struct vfsmount describes a mount. The definition lives in mount. Long ago 1. There was a linked list of struct vfsmount entries that contained a device number, device name, mount point name, mount flags, superblock pointer, semaphore, file pointers to quota files, and time limits for how long an over-quota situation would be allowed. Nowadays quotas have independent bookkeeping, and a struct vfsmount only describes a mount.
This list does not seem to be protected by a lock. Dentry for the mountpoint. Used e. Keep track of users of this structure. Incremented by mntget, decremented by mntput. Initially 1. It will be 2 for a mount that may be unmounted. Autofs also uses this to test whether a tree is busy. This list is ordered by the order in which the mounts were done, so that one can do the umounts in reverse order.
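The user count and the ordered mount list can be pictured with a small toy model. This is a sketch only: ToyMount and umount_order are invented for illustration, and the Python functions mntget and mntput merely borrow the names mentioned above without reproducing the kernel's locking or its exact counting rules.

```python
class ToyMount:
    """Toy stand-in for a mount record with a user count."""

    def __init__(self, mountpoint):
        self.mountpoint = mountpoint
        self.count = 1                      # the text says the count starts at 1


def mntget(mnt):
    """Take a reference (toy version, no locking)."""
    mnt.count += 1
    return mnt


def mntput(mnt):
    """Drop a reference (toy version, no locking)."""
    mnt.count -= 1


def umount_order(mounts):
    """Return mount points in the reverse of the order they were mounted."""
    return [m.mountpoint for m in reversed(mounts)]


# The list is kept in mount order: parents are mounted before nested mounts.
mounts = [ToyMount("/"), ToyMount("/mnt"), ToyMount("/mnt/usb")]
user = mntget(mounts[2])
count_while_in_use = mounts[2].count
mntput(user)
```

Unmounting in reverse order means a nested mount such as /mnt/usb is always handled before the mount that contains its mount point.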
Semantics of root and pwd are clear. It remains to discuss altroot. The effect is determined at kernel compile time. If that fails the usual root is used. A struct nameidata represents the result of a lookup. The definition lives in fs. The former finds the start of the walk, the latter does the walking. However, the former returns 0 in case it did the walking itself already.
One solution for configuration files in the home directory is typically to pregenerate them and build them into the rootfs. Using bind or overlay mounts is another popular alternative. Running man mount is the best place to learn about bind and overlay mounts, which give embedded developers and system administrators the power to create a filesystem in one path location and then provide it to applications at a second one.
Overlay mounts provide a union between the tmpfs and the underlying filesystem and allow apparent modification to an existing file in a ro-rootfs, while bind mounts can make new empty tmpfs directories show up as writable at ro-rootfs paths.
While overlayfs is a proper filesystem type, bind mounts are implemented by the VFS namespace facility. Based on the description of overlay and bind mounts, no one will be surprised that Linux containers make heavy use of them. Let's spy on what happens when we employ systemd-nspawn to start up a container by running bcc's mountsnoop tool. Here, systemd-nspawn is providing selected files in the host's procfs and sysfs to the container at paths in its rootfs.
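The result of such mounts can also be inspected after the fact by reading /proc/self/mountinfo. The sketch below parses one line of that file following the field layout described in the proc(5) manual page; the function names are made up for this example, and the sample line is the one used in that manual page. The check for a non-"/" root field is only a heuristic for spotting bind mounts of a subdirectory (it is an assumption of this sketch, and other setups such as btrfs subvolumes can trigger it too).

```python
def parse_mountinfo_line(line):
    """Split one /proc/self/mountinfo line into named fields."""
    fields = line.split()
    # Optional fields end at a lone "-" separator; it cannot occur before index 6.
    sep = fields.index("-", 6)
    return {
        "mount_id": int(fields[0]),
        "parent_id": int(fields[1]),
        "root": fields[3],                  # root of the mount within its filesystem
        "mount_point": fields[4],
        "options": fields[5].split(","),
        "optional": fields[6:sep],
        "fstype": fields[sep + 1],
        "source": fields[sep + 2],
        "super_options": fields[sep + 3].split(","),
    }


def looks_like_bind_of_subdir(info):
    """Heuristic only: a mount whose root is not '/' exposes a subdirectory."""
    return info["root"] != "/"


def read_mountinfo(path="/proc/self/mountinfo"):
    """Parse every mount visible to the current process (Linux only)."""
    with open(path) as f:
        return [parse_mountinfo_line(line) for line in f if line.strip()]


sample = "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue"
info = parse_mountinfo_line(sample)
```

On a Linux host, running read_mountinfo() inside and outside a container gives a quick way to compare the two mount namespaces.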
Understanding Linux internals can seem an impossible task, as the kernel itself contains a gigantic amount of code, leaving aside Linux userspace applications and the system-call interface in C libraries like glibc.
The file operations are what makes "everything is a file" actually work, so getting a handle on them is particularly satisfying.
Bind and overlay mounts via Linux namespaces are the VFS magic that makes containers and read-only root filesystems possible. In combination with a study of source code, the eBPF kernel facility and its bcc interface makes probing the kernel simpler than ever before. Much thanks to Akkana Peck and Michael Eager for comments and corrections.
Virtual filesystems in Linux: Why we need them and how they work
Virtual filesystems are the magic abstraction that makes the "everything is a file" philosophy of Linux possible.
Filesystem basics
The Linux kernel requires that for an entity to be a filesystem, it must also implement the open, read, and write methods on persistent objects that have names associated with them.
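As a rough user-space illustration of that uniform interface, the sketch below pushes bytes through a named temporary file and through an anonymous pipe using the same os.write and os.read calls. It is a minimal example, not the author's own demonstration; the function name same_calls_demo is made up, and the pipe is included only to show that the calls look the same even when nothing is stored on disk.

```python
import os
import tempfile


def same_calls_demo():
    """Use the same write/read calls on a regular file and on a pipe."""
    results = {}

    # A named, persistent object: write it, close it, reopen it, read it back.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello file\n")
    finally:
        os.close(fd)
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            results["file"] = os.read(fd, 64)
        finally:
            os.close(fd)
    finally:
        os.unlink(path)

    # A pipe has no name on disk, yet the same calls work on its descriptors.
    r, w = os.pipe()
    try:
        os.write(w, b"hello pipe\n")
        results["pipe"] = os.read(r, 64)
    finally:
        os.close(r)
        os.close(w)

    return results
```

The caller never needs to know what kind of object sits behind a descriptor; the kernel routes each call to the right implementation.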
If we can open, read, and write, it is a file, as this console session shows.
About the author. Her day job is creating a variant of Debian, the Universal Operating System, that runs on big semi-automated trucks for Peloton Technology.
This may happen for data integrity reasons i. The page will be Locked when readpage is called, and should be unlocked and marked uptodate once the read completes.
This is particularly needed if an address space attaches private data to a page, and that data needs to be updated when a page is dirtied. This is called, for example, when a memory mapped page gets modified. The pages are consecutive in the page cache and are locked. The caller will decrement the page refcount and unlock the remaining pages for you. This is essentially just a vector version of readpage. Instead of just one page, several pages are requested.
If anything goes wrong, feel free to give up. This interface is deprecated and will be removed by the end of ; implement readahead instead. Called by the generic buffered write code to ask the filesystem to prepare to write len bytes at the given offset in the file. To be able to swap to a file, the file must have a stable mapping to a block device.
The swap system does not go through the filesystem but instead uses bmap to find out where the blocks in the file are and uses those addresses directly. If a page has PagePrivate set, then invalidatepage will be called when part or all of the page is to be removed from the address space. Any private data associated with the page should be updated to reflect this truncation. If releasepage fails for some reason, it must indicate failure with a 0 return value.
The first is when the VM finds a clean page with no active users and wants to make it a free page. If the filesystem makes such a call, and needs to be certain that all pages are invalidated, then its releasepage will need to ensure this.
Possibly it can clear the PageUptodate bit if it cannot free private data yet. Called by the VM when isolating a movable non-lru page. This is used to compact the physical memory usage. If the VM wants to relocate a page (maybe off a memory card that is signalling imminent failure) it will pass a new page and an old page to this function. Called before freeing a page - it writes back the dirty page. To prevent redirtying the page, it is kept locked during the whole operation.
Called by the VM when reading a file through the pagecache when the underlying blocksize! If the required block is up to date then the read can complete without needing the IO to bring the whole page up to date. Called by the VM when attempting to reclaim a page. The VM uses dirty and writeback information to determine if it needs to stall to allow flushers a chance to complete some IO.
Ordinarily it can use PageDirty and PageWriteback but some filesystems have more complex state (unstable pages in NFS prevent reclaim) or do not set those flags due to locking problems. This callback allows a filesystem to indicate to the VM if a page should be treated as dirty or writeback for the purposes of stalling.
Used for memory failure handling. Setting this implies you deal with pages going away under you, unless you have them locked or reference counts increased. Called when swapon is used on a file to allocate space if necessary and pin the block lookup information in memory. A return value of zero indicates success, in which case this file can be used to back swapspace.
A file object represents a file opened by a process. This describes how the VFS can manipulate an open file. As of kernel 4. Called by the select(2) and poll(2) system calls.
It then calls the open method for the newly allocated file structure. This method is used by the splice(2) system call. The return value should be the number of bytes remapped, or the usual negative error code if errors occurred before any bytes were remapped. Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special support routines in the VFS which will locate the required device driver information.
These support routines replace the filesystem file operations with those for the device driver, and then proceed to call the new open method for the file. This is how opening a device file in the filesystem eventually ends up calling the device driver open method. This describes how a filesystem can overload the standard dentry operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here.
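The hand-over from filesystem file operations to driver file operations can be sketched as a toy dispatch table. Everything here is hypothetical and exists only to show the control flow: ToyInode, ToyFile, the DRIVERS registry and the function names are invented, dictionaries stand in for the operations structure, and the device number (1, 3) is used merely because it is the number of /dev/null on Linux.

```python
class ToyInode:
    """Toy inode: default file operations plus an optional device number."""

    def __init__(self, fops, rdev=None):
        self.i_fop = fops
        self.rdev = rdev                    # (major, minor) for device nodes


class ToyFile:
    """Toy open file: starts out with the operations supplied by the inode."""

    def __init__(self, inode):
        self.inode = inode
        self.f_op = inode.i_fop
        self.log = []


def null_open(f):
    f.log.append("driver open")
    return 0


def null_read(f, count):
    return b""                              # a null-like device always reads as empty


NULL_DRIVER_FOPS = {"open": null_open, "read": null_read}
DRIVERS = {(1, 3): NULL_DRIVER_FOPS}        # hypothetical driver registry


def special_open(f):
    """Stand-in for the support routine: swap in the driver's operations."""
    f.log.append("filesystem open for device node")
    f.f_op = DRIVERS[f.inode.rdev]
    return f.f_op["open"](f)                # then call the new open method


SPECIAL_NODE_FOPS = {"open": special_open}


def toy_open(inode):
    f = ToyFile(inode)
    f.f_op["open"](f)
    return f


dev_file = toy_open(ToyInode(SPECIAL_NODE_FOPS, rdev=(1, 3)))
```

After toy_open returns, every later call through dev_file.f_op lands in the driver's table, which mirrors the replacement step described above.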
This is called whenever a name look-up finds a dentry in the dcache. Most local filesystems leave this as NULL, because all their dentries in the dcache are valid. Network filesystems are different since things can change on the server without the client necessarily being aware of it.
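The difference between a local and a network filesystem can be illustrated with a toy name cache. This is a sketch under simplifying assumptions: ToyDcache is invented for illustration, a plain dictionary plays the role of the server, and passing revalidate=None stands in for leaving the revalidation method as NULL.

```python
class ToyDcache:
    """Toy name cache with an optional revalidation hook."""

    def __init__(self, revalidate=None):
        self.entries = {}
        self.revalidate = revalidate        # None: cached entries are always trusted

    def lookup(self, name, slow_lookup):
        entry = self.entries.get(name)
        if entry is not None and (
            self.revalidate is None or self.revalidate(name, entry)
        ):
            return entry, "hit"
        # Cache miss or failed revalidation: ask the filesystem again.
        entry = slow_lookup(name)
        self.entries[name] = entry
        return entry, "miss"


server = {"a": 1}                           # hypothetical remote state

local = ToyDcache()                         # trusts its cache
net = ToyDcache(revalidate=lambda name, entry: server.get(name) == entry)

first_local = local.lookup("a", server.get)
first_net = net.lookup("a", server.get)

server["a"] = 2                             # the server changes behind the client's back

stale = local.lookup("a", server.get)
fresh = net.lookup("a", server.get)
```

The cache without a hook keeps serving the old entry, while the cache with a hook notices the change and repeats the lookup, at the cost of a check on every hit.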
This is called when a path-walk ends at a dentry that was not acquired by doing a lookup in the parent directory. In this case, we are less concerned with whether the dentry is still fully correct, but rather that the inode is still valid. The first dentry is the parent of the dentry to be compared, the second is the child dentry.
Must be constant and idempotent, should not take locks if possible, and should not store into the dentry. Should not dereference pointers outside the dentry without lots of care eg. Return 1 to delete immediately, or 0 to cache the dentry. Default is NULL which means to always cache a reachable dentry. If you define this method, you must call iput yourself.
Useful for some pseudo filesystems (sockfs, pipefs, …) to delay pathname generation. Real filesystems probably don't want to use it, because their dentries are present in the global dcache hash, so their hash should be an invariant.