Previous Sections
Introduction
The problems
Current implementations
How Vinum addresses the Three Problems
The big picture
As a result of these considerations, Vinum provides a total of four kinds of abstract storage structures:
- At the lowest level is the UNIX disk partition, which Vinum calls a drive. With the exception of a small area at the beginning of the drive, which is used for storing configuration and state information, the entire drive is available for data storage.
- Next come subdisks, which are part of a drive. They are used to build plexes.
- A plex is a copy of the data of a volume. It is built out of subdisks, which may be organized in one of three manners:
  - A concatenated plex uses the address space of each subdisk in turn.
  - A striped plex stripes the data across each subdisk. The subdisks must all have the same size, and there must be at least two subdisks to distinguish it from a concatenated plex.
  - Like a striped plex, a RAID-5 plex stripes the data across each subdisk. The subdisks must all have the same size, and there must be at least three subdisks, since otherwise mirroring would be more efficient.
  Although a plex represents the complete data of a volume, it is possible for parts of the representation to be physically missing, either by design (by not defining a subdisk for parts of the plex) or by accident (as a result of the failure of a drive).
- A volume is a collection of between one and eight plexes. Each plex represents the data in the volume, so more than one plex provides mirroring. As long as at least one plex can provide the data for the complete address range of the volume, the volume is fully functional.
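To summarize the hierarchy, the following sketch models the four objects as C structures. It is purely illustrative: the structure and field names are invented for this description and are not Vinum's actual kernel data structures.

    /*
     * Illustrative only: not Vinum's kernel structures, just a minimal
     * model of the four-level object hierarchy described above.
     */
    #include <sys/types.h>

    enum plexorg { PLEX_CONCAT, PLEX_STRIPED, PLEX_RAID5 };

    struct drive {                  /* a UNIX disk partition */
        char    devname[64];
        off_t   avail;              /* space left after the config area */
    };

    struct subdisk {                /* a contiguous region of one drive */
        struct drive *drive;
        off_t   driveoffset;        /* where it starts on the drive */
        off_t   length;
    };

    struct plex {                   /* one complete copy of a volume's data */
        enum plexorg org;
        off_t   stripesize;         /* striped and RAID-5 plexes only */
        int     nsd;
        struct subdisk *sds;        /* equal-sized for striped and RAID-5 */
    };

    struct volume {                 /* the object the rest of the system sees */
        int     nplexes;            /* between one and eight */
        struct plex *plexes;        /* more than one plex means mirroring */
    };

Reading bottom-up, a volume points to its plexes, each plex to its subdisks, and each subdisk to the drive that backs it.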
RAID-5
Conceptually, RAID-5 is used for redundancy, but in fact the implementation is a kind of striping. This poses a problem for the implementation of Vinum: should RAID-5 be a kind of plex or a kind of volume? In the end, the implementation issues won, and RAID-5 is a plex type. This means that there are two different ways of ensuring data redundancy: either have more than one plex in a volume, or have a single RAID-5 plex. These methods can be combined.
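To make the trade-off concrete, here is a back-of-the-envelope comparison of the two methods for a hypothetical configuration of four 100 GB subdisks. The numbers are arbitrary and the code is plain arithmetic, not part of Vinum:

    #include <stdio.h>

    /*
     * Rough comparison of the two redundancy methods, assuming a
     * configuration of n equal subdisks of s bytes each.  Plain
     * arithmetic, not Vinum code.
     */
    int
    main(void)
    {
        long long n = 4;                            /* number of subdisks */
        long long s = 100LL * 1024 * 1024 * 1024;   /* 100 GB per subdisk */

        /* n single-subdisk plexes mirroring each other in one volume */
        long long mirror_capacity = s;              /* every plex holds all the data */
        long long mirror_failures = n - 1;          /* survives losing n - 1 plexes */

        /* one RAID-5 plex built from the same n subdisks */
        long long raid5_capacity = (n - 1) * s;     /* one subdisk's worth of parity */
        long long raid5_failures = 1;               /* survives losing one subdisk */

        printf("mirrored: %lld bytes usable, tolerates %lld drive failures\n",
            mirror_capacity, mirror_failures);
        printf("RAID-5:   %lld bytes usable, tolerates %lld drive failure\n",
            raid5_capacity, raid5_failures);
        return 0;
    }

Mirroring buys greater fault tolerance at a storage cost that grows with the number of plexes; RAID-5 pays a fixed cost of one subdisk's worth of parity.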
Which plex organization?
Vinum implements only that subset of RAID organizations which makes sense in the framework of the implementation. It would have been possible to implement all RAID levels, but there was no reason to do so. Each of the chosen organizations has unique advantages:
- Concatenated plexes are the most flexible: they can contain any number of subdisks, and the subdisks may be of different lengths. The plex may be extended by adding additional subdisks. They require less CPU time than striped or RAID-5 plexes, though the difference in CPU overhead from striped plexes is not measurable. On the other hand, they are the most susceptible to hot spots, where one disk is very active while others are idle.
- The greatest advantage of striped (RAID-0) plexes is that they reduce hot spots: by choosing an optimally sized stripe (empirically determined to be on the order of 256 kB), the load on the component drives can be made more even; the address mapping is sketched in code after this list. The disadvantages of this approach are (fractionally) more complex code and restrictions on subdisks: they must all be the same size, and extending a plex by adding new subdisks is so complicated that Vinum currently does not implement it. Vinum imposes an additional, trivial restriction: a striped plex must have at least two subdisks, since otherwise it would be indistinguishable from a concatenated plex.
- RAID-5 plexes are effectively an extension of striped plexes. Compared to striped plexes, they offer the advantage of fault tolerance, but the disadvantages of higher storage cost and significantly higher CPU overhead, particularly for writes. The code is an order of magnitude more complex than for concatenated and striped plexes. Like striped plexes, RAID-5 plexes must have equal-sized subdisks and cannot currently be extended. Vinum enforces a minimum of three subdisks for a RAID-5 plex, since any smaller number would not make any sense.
In addition, RAID-5 can be interpreted in two different ways: the data can be striped, as in the Vinum implementation, or it can be written serially, exhausting the address space of one subdisk before starting on the next, effectively a modified concatenated organization. There is no recognizable advantage to this approach, since it does not provide any of the other advantages of concatenation.
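As an illustration of the concatenated and striped mappings discussed above, the following sketch translates a byte offset within a plex into a subdisk number and an offset within that subdisk. It is an assumption-laden sketch, not the actual Vinum mapping code:

    #include <sys/types.h>

    /*
     * Sketch of how a byte offset within a plex could be mapped to a
     * subdisk and an offset within it.  Illustration only.
     */
    struct sdaddr {
        int   sdno;                 /* which subdisk */
        off_t offset;               /* byte offset within that subdisk */
    };

    /* Concatenated: use the address space of each subdisk in turn. */
    struct sdaddr
    concat_map(off_t plexoff, const off_t *sdlen, int nsd)
    {
        struct sdaddr a = { 0, plexoff };

        while (a.sdno < nsd - 1 && a.offset >= sdlen[a.sdno]) {
            a.offset -= sdlen[a.sdno];
            a.sdno++;
        }
        return a;
    }

    /* Striped: successive stripes go to successive subdisks, round robin. */
    struct sdaddr
    stripe_map(off_t plexoff, off_t stripesize, int nsd)
    {
        struct sdaddr a;
        off_t stripe = plexoff / stripesize;        /* which stripe overall */

        a.sdno = (int)(stripe % nsd);               /* the subdisk holding that stripe */
        a.offset = (stripe / nsd) * stripesize      /* full stripes below it on that subdisk */
            + plexoff % stripesize;                 /* plus the offset within the stripe */
        return a;
    }

For example, with three subdisks and a 256 kB stripe, plex offsets 0, 256 kB and 512 kB map to the start of subdisks 0, 1 and 2 respectively, and 768 kB wraps around to subdisk 0 at offset 256 kB.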
These are not the only possible organizations. In addition, the following could have been implemented:
- RAID-4, which differs from RAID-5 only in that all parity data is stored on a specific disk. This simplifies the algorithms somewhat at the expense of drive utilization: the activity on the parity disk is a direct function of the read-to-write ratio (the difference in parity placement is sketched after this list). Since Vinum implements RAID-5, RAID-4's only advantage is nullified.
- RAID-3, effectively an implementation of RAID-4 with a stripe size of one byte. Each transfer involves every disk (with the exception of the parity disk, which is not needed for reads). Without spindle synchronization (where the corresponding sectors pass the heads of each drive at the same time), RAID-3 would be very inefficient. In a multiple-access system, it also causes high latency.
An argument for RAID-3 does exist where a single process requires very high data rates. With spindle synchronization, this would be a potentially useful addition to Vinum.
- RAID-2, which uses two subdisks to store a Hamming code, and which otherwise resembles RAID-3. Compared to RAID-3, it offers a lower data density, higher CPU usage and no compensating advantages.
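The practical difference between RAID-4 and RAID-5 comes down to where the parity block of each stripe is placed. The following sketch shows one common rotation scheme; the exact pattern Vinum uses is not claimed here:

    /*
     * Parity placement for a given stripe number.  RAID-4 dedicates one
     * subdisk to parity, which every write must touch; RAID-5 rotates the
     * parity across the subdisks so the write load is spread.  The rotation
     * shown is one common choice, not necessarily the one Vinum uses.
     */
    int
    raid4_parity_sd(long stripe, int nsd)
    {
        (void)stripe;               /* the parity position never changes */
        return nsd - 1;             /* fixed parity disk: a write bottleneck */
    }

    int
    raid5_parity_sd(long stripe, int nsd)
    {
        return (int)(stripe % nsd); /* a different subdisk for each stripe */
    }

With a fixed parity disk, every write must touch the same drive; rotating the parity spreads that load across all the subdisks.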
Following Sections
Some examples
Increased resilience: RAID-5
Object naming
Startup
Performance issues
The implementation
Driver structure
Availability
Future directions
References