vnode - an overview of vnodes
A vnode is an object in kernel memory that speaks the UNIX file interface
(open, read, write, close, readdir, etc.). Vnodes can represent files,
directories, FIFOs, domain sockets, block devices, and character devices.
Each vnode has a set of methods whose names start with the string 'VOP_'.
These methods include VOP_OPEN, VOP_READ, VOP_WRITE, VOP_RENAME, VOP_CLOSE,
and VOP_MKDIR. Many of these methods correspond closely to the equivalent
file system call: open, read, write, rename, etc. Each file system (FFS,
NFS, etc.) provides implementations for these methods.
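The per-file-system methods described above can be pictured as a table of
function pointers attached to each vnode. The following is a minimal
user-space sketch of that idea, not the kernel's actual definitions: the
names toy_vnode, toy_vops, toy_vop_read and the "memfs" file system are
hypothetical, and the real VOP_ calls take argument structures that are
omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for kernel types; not the real layout. */
struct toy_vnode;

struct toy_vops {
	int (*vop_open)(struct toy_vnode *vp);
	int (*vop_read)(struct toy_vnode *vp, char *buf, size_t len);
	int (*vop_close)(struct toy_vnode *vp);
};

struct toy_vnode {
	const struct toy_vops *v_op;	/* methods supplied by the file system */
	void *v_data;			/* file system specific data */
};

/* A made-up "memfs" whose files are NUL-terminated strings in memory. */
static int memfs_open(struct toy_vnode *vp) { (void)vp; return 0; }
static int memfs_close(struct toy_vnode *vp) { (void)vp; return 0; }

static int
memfs_read(struct toy_vnode *vp, char *buf, size_t len)
{
	const char *src = vp->v_data;
	size_t n = strlen(src);

	if (n > len)
		n = len;
	memcpy(buf, src, n);
	return (int)n;		/* number of bytes copied */
}

static const struct toy_vops memfs_vops = {
	.vop_open = memfs_open,
	.vop_read = memfs_read,
	.vop_close = memfs_close,
};

/* Generic code dispatches through the table without knowing the file system. */
static int
toy_vop_read(struct toy_vnode *vp, char *buf, size_t len)
{
	assert(vp != NULL && vp->v_op != NULL);
	return vp->v_op->vop_read(vp, buf, len);
}
```

A second file system would supply its own table; callers of toy_vop_read
would not need to change.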
The Virtual File System (VFS) library maintains a pool of vnodes. File
systems cannot allocate their own vnodes; they must use the functions
provided by the VFS to create and manage vnodes.
Vnode life cycle
When a client of the VFS requests a new vnode, the vnode allocation code
can reuse an old vnode object that is no longer in use. Whether a vnode
is in use is tracked by the vnode reference count (v_usecount). By
convention, each open file handle holds a reference, as do VM objects
backed by files. A vnode with a reference count of 1 or more will not be
deallocated or reused to point to a different file. So, if you want to
ensure that your vnode doesn't become a different file under you, you had
better be sure you hold a reference to it. A vnode that points to a valid
file and has a reference count of 1 or more is called "active".
When a vnode's reference count drops to zero, it becomes "inactive", that
is, a candidate for reuse. An "inactive" vnode still refers to a valid
file, and one can try to reactivate it using vget(9) (caches do this a
lot).
Before the VFS can reuse an inactive vnode to refer to another file, it
must clean all information pertaining to the old file. A cleaned-out vnode
is called a "reclaimed" vnode.
To support forcible unmounts and the revoke(2) system call, the VFS may
"reclaim" a vnode with a positive reference count. The "reclaimed" vnode
is given to the dead file system, which returns errors for most
operations. The reclaimed vnode will not be reused for another file until
its reference count hits zero.
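The three states described in this section can be summarized with a small
classification function. This is an illustrative user-space sketch only:
the life_vnode structure and its reclaimed flag are assumptions made for
the example, and the kernel does not necessarily track the state this way.

```c
#include <assert.h>
#include <stdbool.h>

enum life_state { LIFE_ACTIVE, LIFE_INACTIVE, LIFE_RECLAIMED };

/* Hypothetical, simplified vnode; only the fields needed for the sketch. */
struct life_vnode {
	int  v_usecount;	/* reference count */
	bool reclaimed;		/* assumed flag: old file's state was cleaned out */
};

static enum life_state
life_classify(const struct life_vnode *vp)
{
	assert(vp->v_usecount >= 0);
	if (vp->reclaimed)
		return LIFE_RECLAIMED;
	return vp->v_usecount > 0 ? LIFE_ACTIVE : LIFE_INACTIVE;
}

/*
 * May the vnode object be handed out for a different file?  Only when
 * nobody holds a reference, whether or not it was reclaimed early.
 */
static bool
life_can_reuse(const struct life_vnode *vp)
{
	return vp->v_usecount == 0;
}
```

Note that a vnode reclaimed by a forced unmount or revoke(2) is in the
"reclaimed" state yet is still not reusable while references remain.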
Vnode pool
The getnewvnode(9) function allocates a vnode from the pool, possibly
reusing an "inactive" vnode, and returns it to the caller. The vnode
returned has a reference count (v_usecount) of 1.
The vref(9) call increments the reference count on the vnode. It may only
be called on a vnode with a reference count of 1 or greater. The vrele(9)
and vput(9) calls decrement the reference count. In addition, the vput(9)
call releases the vnode lock.
The vget(9) call, when used on an inactive vnode, will make the vnode
"active" by bumping the reference count to one. When called on an active
vnode, vget(9) increases the reference count by one. However, if the vnode
is being reclaimed concurrently, then vget(9) will fail and return an
error.
The vgone(9) and vgonel(9) functions orchestrate the reclamation of a
vnode. They can be called on both active and inactive vnodes.
When transitioning a vnode to the "reclaimed" state, the VFS will call the
VOP_RECLAIM(9) method. File systems use this method to free any file
system specific data they attached to the vnode.
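The reference-counting rules in this section can be modelled in a few
lines of user-space C. The sketch below is a toy model under stated
assumptions, not the kernel code: the pool_ names are hypothetical, the
reclaiming flag stands in for whatever the kernel uses to detect
reclamation, the reclaim callback stands in for VOP_RECLAIM(9), locking is
ignored, and the error value returned by pool_vget is chosen arbitrarily.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy vnode; the real struct vnode has many more fields. */
struct pool_vnode {
	int   v_usecount;			/* reference count */
	bool  reclaiming;			/* assumed: reclamation begun or done */
	void *v_data;				/* file system specific data */
	void (*reclaim)(struct pool_vnode *);	/* stands in for VOP_RECLAIM */
};

static void
pool_vref(struct pool_vnode *vp)
{
	assert(vp->v_usecount >= 1);	/* only legal on an active vnode */
	vp->v_usecount++;
}

static void
pool_vrele(struct pool_vnode *vp)
{
	assert(vp->v_usecount >= 1);
	vp->v_usecount--;		/* at zero the vnode becomes inactive */
}

/* Works on active and inactive vnodes, but fails during reclamation. */
static int
pool_vget(struct pool_vnode *vp)
{
	if (vp->reclaiming)
		return ENOENT;		/* error value chosen for the sketch */
	vp->v_usecount++;
	return 0;
}

/* Reclaim the vnode, letting the file system drop its private data. */
static void
pool_vgone(struct pool_vnode *vp)
{
	vp->reclaiming = true;
	if (vp->reclaim != NULL)
		vp->reclaim(vp);
}

/* Example file system hook: free whatever was hung off v_data. */
static void
pool_fs_reclaim(struct pool_vnode *vp)
{
	free(vp->v_data);
	vp->v_data = NULL;
}
```

In this model, reclaiming a vnode does not touch its reference count;
existing holders keep their references and release them as usual.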
Vnode locks
The vnode actually has three different types of lock: the vnode lock, the
vnode interlock, and the vnode reclamation lock (VXLOCK).
The vnode lock
The vnode lock and its consistent use accomplish the following:
o  It keeps a locked vnode from changing across certain pairs of VOP_
   calls, thus preserving cached data. For example, it keeps the
   directory from changing between a VOP_LOOKUP call and a VOP_CREATE
   call. The VOP_LOOKUP call makes sure the name doesn't already exist
   in the directory and finds free room in the directory for the new
   entry. The VOP_CREATE call can then go ahead and create the file
   without checking if it already exists or looking for free space.
o  Some file systems rely on it to ensure that only one "thread" at a
   time is calling VOP_ vnode operations on a given file or directory.
   Otherwise, the file system's behavior is undefined.
o  On rare occasions, code will hold the vnode lock so that a series of
   VOP_ operations occurs as an atomic unit. (Of course, this doesn't
   work with network file systems like NFSv2 that don't have any notion
   of bundling a bunch of operations into an atomic unit.)
o  While the vnode lock is held, the vnode will not be reclaimed.
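The lookup-then-create example from the first item can be illustrated
with a toy directory. This is a single-threaded user-space sketch, not
kernel code: toy_dir, toy_lock, toy_lookup and toy_create are hypothetical
names, and the boolean "lock" only checks that the discipline is followed
rather than providing real mutual exclusion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DIR_SLOTS 4

/* Hypothetical toy directory; a NULL name marks a free slot. */
struct toy_dir {
	bool locked;			/* toy stand-in for the vnode lock */
	const char *names[DIR_SLOTS];
	int free_slot;			/* cached by lookup, used by create */
};

static void
toy_lock(struct toy_dir *d)
{
	assert(!d->locked);
	d->locked = true;
}

static void
toy_unlock(struct toy_dir *d)
{
	assert(d->locked);
	d->locked = false;
}

/* Returns true if name exists; otherwise remembers a free slot (-1 if none). */
static bool
toy_lookup(struct toy_dir *d, const char *name)
{
	assert(d->locked);
	d->free_slot = -1;
	for (int i = 0; i < DIR_SLOTS; i++) {
		if (d->names[i] == NULL) {
			if (d->free_slot < 0)
				d->free_slot = i;
		} else if (strcmp(d->names[i], name) == 0) {
			return true;
		}
	}
	return false;
}

/* Trusts the preceding lookup: no duplicate check, no search for space. */
static int
toy_create(struct toy_dir *d, const char *name)
{
	assert(d->locked);
	if (d->free_slot < 0)
		return -1;		/* directory full */
	d->names[d->free_slot] = name;
	d->free_slot = -1;
	return 0;
}
```

Because the lock is held from toy_lookup through toy_create, the cached
free_slot cannot be invalidated by another caller in between, which is the
property the vnode lock provides for the real operations.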
There is a discipline to using the vnode lock. Some VOP_ operations
require that the vnode lock be held before they are called. A description
of this rather arcane locking discipline is in sys/kern/vnode_if.src.
The vnode lock is acquired by calling vn_lock(9) and released by calling
VOP_UNLOCK(9).
A process is allowed to sleep while holding the vnode lock.
The implementation of the vnode lock is the responsibility of the
individual file systems. Not all file systems implement it.
To prevent deadlocks, when acquiring locks on multiple vnodes, the lock
on the parent directory must be acquired before the lock on the child
directory.
Vnode interlock
The vnode interlock (vp->v_interlock) is a spinlock. It is useful on
multi-processor systems for acquiring a quick exclusive lock on the
contents of the vnode. It MUST NOT be held while sleeping. (What fields
does it cover? What about splbio/interrupt issues?)
Operations on this lock are a no-op on uniprocessor systems.
Other Vnode synchronization
The vnode reclamation lock (VXLOCK) is used to prevent multiple processes
from entering the vnode reclamation code. It is also used as a flag to
indicate that reclamation is in progress. The VXWANT flag is set by
processes that wish to be woken up when reclamation is finished.
The vwaitforio(9) call is used to wait for all outstanding write I/Os
associated with a vnode to complete.
Version number/capability
The vnode capability, v_id, is a 32-bit version number on the vnode. Every
time a vnode is reassigned to a new file, the vnode capability is changed.
This is used by code that wishes to keep pointers to vnodes but doesn't
want to hold a reference (e.g., caches). The code keeps both a vnode * and
a copy of the capability. The code can later compare the vnode's
capability to its copy and see if the vnode still points to the same file.
Note: for this to work, memory assigned to hold a struct vnode can only
be used for another purpose when all pointers to it have disappeared.
Since the vnode pool has no way of knowing when all pointers have
disappeared, it never frees memory it has allocated for vnodes.
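The capability check described above can be sketched as a cache entry
that stores a vnode pointer alongside a copy of v_id. This is a user-space
illustration under assumptions: the cap_ names and the fileno field are
hypothetical, and incrementing v_id is just one simple way to model "the
capability is changed" on reassignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy vnode with only the fields the sketch needs. */
struct cap_vnode {
	uint32_t v_id;		/* capability: changes on reassignment */
	int      fileno;	/* assumed stand-in for "which file this is" */
};

struct cap_entry {
	struct cap_vnode *vp;	/* pointer kept without holding a reference */
	uint32_t          id;	/* copy of vp->v_id taken when cached */
};

static void
cap_remember(struct cap_entry *e, struct cap_vnode *vp)
{
	e->vp = vp;
	e->id = vp->v_id;
}

/* Returns the vnode if it still refers to the same file, else NULL. */
static struct cap_vnode *
cap_check(const struct cap_entry *e)
{
	assert(e->vp != NULL);
	return e->vp->v_id == e->id ? e->vp : NULL;
}

/* Models the pool reusing the vnode object for a different file. */
static void
cap_reassign(struct cap_vnode *vp, int new_fileno)
{
	vp->v_id++;
	vp->fileno = new_fileno;
}
```

Dereferencing e->vp in cap_check is safe only because, as the note above
says, the memory behind a struct vnode is never freed; a stale entry
reads a valid vnode whose capability no longer matches.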
Vnode fields
Most of the fields of the vnode structure should be treated as opaque and
only manipulated through the proper APIs. This section describes the
fields that are manipulated directly.
The v_flag attribute contains miscellaneous flags related to various
functions. They are summarized in table ...
The v_tag attribute indicates what file system the vnode belongs to. Very
little code actually uses this attribute and its use is deprecated.
Programmers should seriously consider using more object-oriented
approaches (e.g. function tables). There is no safe way of defining new
v_tags for loadable file systems. The v_tag attribute is read-only.
The v_type attribute indicates what type of file (e.g. directory, regular,
FIFO) this vnode is. This is used by the generic code for various checks.
For example, the read(2) system call returns an error when a read is
attempted on a directory.
The v_data attribute allows a file system to attach a piece of file system
specific memory to the vnode. This contains information about the file
that is specific to the file system.
The v_numoutput attribute indicates the number of pending synchronous and
asynchronous writes on the vnode. It does not track the number of dirty
buffers attached to the vnode. The attribute is used by code like fsync
to wait for all writes to complete before returning to the user. This
attribute must be manipulated at splbio().
The v_writecount attribute tracks the number of write calls pending on the
vnode.
RULES
The vast majority of vnode functions may not be called from interrupt
context. The exceptions are bgetvp and brelvp. The following fields of
the vnode are manipulated at interrupt level: v_numoutput, v_holdcnt,
v_dirtyblkhd, v_cleanblkhd, v_bioflag, v_freelist, and v_synclist. Any
access to these fields should be protected by splbio.
This document first appeared in OpenBSD 2.9.
OpenBSD 3.6 February 22, 2001