Mbox has a design flaw

People often point out that an mbox file can't be accessed in random order or concurrently. That's true, but it's not unique to mbox; it's true of any format that stores multiple messages in a single sequential file.

Mbox's unique flaw is in how it marks message boundaries within the file. That was a bad design from the beginning. The result is that mbox damages some messages.

Design

Mbox originated in the 1970s, in some early version of Unix. It was invented to store multiple received email messages in a single file. Each message is preceded by a control line that starts with "From ", and is followed by an empty line.

The control line before each message is conventionally called the From_ line; the underscore stands for the space, and distinguishes it from the From: header. The From_ line contains message metadata. Usually it contains the "envelope sender" -- the sender address transmitted through SMTP, outside the message -- and a date and time. In the past there was sometimes a third field. Early on, the sender address was often a UUCP address. Now it's sometimes a simple name, not an Internet mail address. The syntax of the date and time varies, and its time zone is not defined.

The envelope sender and date-time are redundant: the sender is also in the Return-Path: header, and the date-time is in Received:. Those headers probably didn't exist when mbox was invented, but both have now existed for decades.

An RFC 822 message is variable length, and does not include any indication of where it ends. When the message is transferred across the Internet, SMTP marks the end of the message. In an mbox file, the end of a message is not marked; it's implied by the beginning of the next message, detected by the From_ line before it. (The end of the last message is implied by end of file.)

Because the syntax of the From_ line varies, the only reliable way to detect it is to search only for the "From " at the beginning of the line.
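That search can be sketched in a few lines of Python. This is only an illustration of the original mbox rule described so far; the function name split_mbox and the sample addresses are made up for the example.

```python
def split_mbox(text):
    """Split mbox text into messages at every line that starts with "From ".

    Hypothetical helper for illustration. Each returned string includes
    its own From_ line.
    """
    messages = []
    current = None  # lines of the message being collected
    for line in text.splitlines(keepends=True):
        if line.startswith("From "):
            # A From_ line ends the previous message and starts a new one.
            if current is not None:
                messages.append("".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    # The end of the last message is implied by end of file.
    if current is not None:
        messages.append("".join(current))
    return messages


sample = (
    "From a@example.org Thu Jan  1 00:00:00 1970\n"
    "Subject: one\n"
    "\n"
    "first\n"
    "\n"
    "From b@example.org Thu Jan  1 00:00:01 1970\n"
    "Subject: two\n"
    "\n"
    "second\n"
    "\n"
)
print(len(split_mbox(sample)))
```

Note that the function looks at nothing but the first five characters of each line. Everything after "From " is ignored, because its syntax can't be relied on.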

Of course, a message line can start with "From ", too. To prevent that from being mistaken for an mbox control line, any message line that starts with "From " is modified, by adding a '>' at the beginning. However, a message line that starts with ">From " does not get another '>'. This makes the escaping irreversible: it's impossible to know whether the '>' was in the original message. So messages end up damaged.
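The loss of information can be shown in a few lines of Python. The function names here are made up for this sketch; unescape_guess is the best a reader could do, not something the format defines.

```python
def escape_original(lines):
    """Escape as the original mbox does: add '>' to lines starting with "From "."""
    return [">" + line if line.startswith("From ") else line for line in lines]


def unescape_guess(lines):
    """Hypothetical best-effort reverse: strip '>' from lines starting with ">From "."""
    return [line[1:] if line.startswith(">From ") else line for line in lines]


body = [
    "From the beginning, mbox was a bad design.",
    ">From the beginning, mbox was a bad design.",
]
stored = escape_original(body)
restored = unescape_guess(stored)

print(stored[0] == stored[1])  # True: two different lines are stored identically
print(restored == body)        # False: no reader can restore both
```

Two different input lines map to the same stored line, so whatever rule the reader applies, at least one of them comes back wrong.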

The ">From " damage merely irritates human readers, but it breaks cryptographic signatures. It could also break other programmatic data. (Don't attach an mbox file to a message.) It's enough of a problem that mail readers sometimes compose new messages in ways that attempt to protect them from future mbox damage. One of those ways is built into a MIME format: the format=flowed parameter of text/plain (RFC 3676) has the sender put a space in front of any line that starts with "From ".

Example

Here is a sample RFC 822 message.

Return-Path: <programmer@obscurity.org>
From: <programmer@obscurity.org>
To: <chaos@abyss.void>
Message-ID: <0>
Date: Thu, 01 Jan 1970 00:00:00 +0000

From the beginning, mbox was a bad design.
From the beginning, mbox was a bad design.

Below is a Usenet-style quote, as from a message being replied to.

>From the beginning, mbox was a bad design.

Here is that message as it would be stored in an mbox file. Mbox format adds three things: the From_ line at the top, the '>' at the beginning of the first two body lines, and the empty line after the message. Note that the Usenet-style quote does not get a second '>'.

From programmer@obscurity.org Thu Jan  1 00:00:00 1970
Return-Path: <programmer@obscurity.org>
From: <programmer@obscurity.org>
To: <chaos@abyss.void>
Message-ID: <0>
Date: Thu, 01 Jan 1970 00:00:00 +0000

>From the beginning, mbox was a bad design.
>From the beginning, mbox was a bad design.

Below is a Usenet-style quote, as from a message being replied to.

>From the beginning, mbox was a bad design.

Variants

Many variants of mbox have been created over the years. The result is that mbox is not a single file format; it's many related formats, all partially incompatible and more or less indistinguishable.

The major variants change how the end of each message is found, how message lines that start with "From " are escaped, or both. Apparently these were attempts to fix the original problem: messages are damaged by irreversible ">From " escaping.

One early variant was BSD mbox. Rather than recognize an mbox From_ line anywhere, it recognizes it only after an empty line -- the empty line that follows the previous message -- or at the beginning of the file. So, it escapes a message line that starts with "From " only when that message line is preceded by an empty line. This reduces the number of occurrences of damage to messages.
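A minimal Python sketch of that writing rule, assuming the message is already split into lines without their line endings. The function name escape_bsd is made up for the example.

```python
def escape_bsd(message_lines):
    """Escape a "From " line only when the line before it is empty.

    Hypothetical helper for illustration. message_lines is the whole
    message, header and body, one string per line, without line endings.
    """
    out = []
    prev = None  # nothing precedes the first message line within the message
    for line in message_lines:
        if prev == "" and line.startswith("From "):
            out.append(">" + line)
        else:
            out.append(line)
        prev = line
    return out


message = [
    "Date: Thu, 01 Jan 1970 00:00:00 +0000",
    "",
    "From the beginning, mbox was a bad design.",
    "From the beginning, mbox was a bad design.",
    "",
    ">From the beginning, mbox was a bad design.",
]
for line in escape_bsd(message):
    print(line)
```

Only the first "From " line of the body is changed, because only it follows an empty line. The damage is less frequent, but a line that does get escaped is still indistinguishable from one that started with ">From " in the original message.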

Later, Unix System V Release 4 invented a new message header, Content-Length:. Its value is the decimal number of bytes in the body of the message. This is non-standard -- not in the RFCs -- and non-portable, since line endings differ across systems (one character or two), so the byte count of the same body differs too. Two variants of mbox use this header, adding it to messages. One variant escapes message lines as in the original mbox; the other doesn't escape them at all.

Sometime, probably in the 1990s, somebody invented another non-standard message header, Lines:. This one's value is the number of lines in the body of the message. This is used in two mbox variants, with and without escaping of message lines.

Both Content-Length: and Lines: are sometimes generated with incorrect values, probably because they're non-standard and poorly defined. Even if they're originally correct, their values can become wrong when the message is modified: because the headers are non-standard, not all mail software knows about them, and software that doesn't know about them doesn't update them when it changes the message.
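One way a reader might guard against a wrong Content-Length: is to accept the declared length only when the bytes that follow it look like a message boundary. The Python sketch below is an illustration under assumptions, not how any particular mail program works: it assumes LF line endings, and the function name body_end and the sample data are made up.

```python
def body_end(data, body_start, content_length):
    """Return the offset where the body ends, or None if the length looks wrong.

    Hypothetical helper. data is the whole mbox file as bytes; body_start
    is the offset of the first body byte of the current message.
    """
    end = body_start + content_length
    if end > len(data):
        return None  # the declared length runs past end of file
    tail = data[end:end + 6]
    # Plausible only if the body is followed by end of file, by a final
    # empty line, or by the empty line and From_ line of the next message.
    if tail in (b"", b"\n") or tail == b"\nFrom ":
        return end
    return None


data = (
    b"From a@example.org Thu Jan  1 00:00:00 1970\n"
    b"Content-Length: 12\n"
    b"\n"
    b"From me\nbye\n"
    b"\n"
    b"From b@example.org Thu Jan  1 00:00:01 1970\n"
    b"\n"
)
start = data.index(b"\n\n") + 2  # first byte after the header
print(body_end(data, start, 12) == start + 12)  # correct length is accepted
print(body_end(data, start, 10))                # wrong length is rejected
```

A check like this catches many wrong lengths, but not all: a wrong length can land, by coincidence, just before an unescaped "From " line inside the body.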

In the mid-1990s, Rahul Dhesi and some other people proposed fixing the escaping of message lines in the original mbox, to make it reversible. This variant adds '>' to the beginning of any message line that starts with zero or more '>' characters followed by "From ". On read, any message line that starts with one or more '>' followed by "From " has the first '>' removed, restoring the original line. This was a good idea, but 20 years too late.
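The rule is simple enough to sketch in Python. The function names are made up for the example; the two patterns follow the description above.

```python
import re

FROM_LINE = re.compile(r">*From ")    # zero or more '>' followed by "From "
QUOTED_FROM = re.compile(r">+From ")  # one or more '>' followed by "From "


def escape_reversible(line):
    """On write: add one '>' to any line matching FROM_LINE (hypothetical helper)."""
    return ">" + line if FROM_LINE.match(line) else line


def unescape_reversible(line):
    """On read: remove one '>' from any line matching QUOTED_FROM (hypothetical helper)."""
    return line[1:] if QUOTED_FROM.match(line) else line


for original in ["From here", ">From here", ">>From here", "Hello"]:
    stored = escape_reversible(original)
    print(repr(stored), unescape_reversible(stored) == original)
```

Every line that is changed on write gains exactly one '>', and every such line loses exactly one on read, so the round trip restores the original. No stored message line starts with "From ", so a reader of this variant can still find From_ lines by their first five characters.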

All these mbox variants are still in use. Typically each piece of software writes one variant, usually one of these, but sometimes something a little different. And, of course, there are many existing mbox files, some of them quite old.

Variant examples

Here is the RFC 822 message above as it would be stored in the BSD variant of mbox. The second line of the message body is not escaped with '>', because it's not preceded by an empty line.

From programmer@obscurity.org Thu Jan  1 00:00:00 1970
Return-Path: <programmer@obscurity.org>
From: <programmer@obscurity.org>
To: <chaos@abyss.void>
Message-ID: <0>
Date: Thu, 01 Jan 1970 00:00:00 +0000

>From the beginning, mbox was a bad design.
From the beginning, mbox was a bad design.

Below is a Usenet-style quote, as from a message being replied to.

>From the beginning, mbox was a bad design.

Here is a different message, as it would be stored in a different variant of mbox (with Lines:, without escaping).

From programmer@obscurity.org Thu Jan  1 00:00:00 1970
Return-Path: <programmer@obscurity.org>
From: <programmer@obscurity.org>
To: <chaos@abyss.void>
Message-ID: <0>
Date: Thu, 01 Jan 1970 00:00:00 +0000
Lines: 8

Suppose we're talking about mbox format through email.  We might want
to put a sample mbox From_ line in this message, like this:

From nobody@nowhere.invalid Thu Jan  1 00:00:00 1970

That line above is part of this message.  If this message is written
to an mbox file, software that reads that file MUST NOT take that line
above as the beginning of a new message.

Reading mbox

Now consider the problem of writing software to read mbox. It's straightforward to make a program read one variant of mbox -- but not very useful. A useful program has to read all the variants. The program has to know which variant of mbox it's reading, to know how to find the end of each message. What's the syntax of the From_ line? Are message lines that start with "From " escaped? All of them, some of them, none? Does Lines: exist? Can it be trusted?

If the program guesses wrong where a message ends, it will lose data. Typically it loses one whole message, and maybe part of another.

The program could be told by the user which mbox variant a particular file is, but it's unreasonable to burden the user with that. The user rarely knows that -- seldom even knows that there are variants -- and shouldn't have to. Furthermore, a single file could have been appended to by multiple programs, each writing a different variant.

So it has to be automatic.

Can we auto-detect the variant? We could collect statistics on the file -- count lines that start with "From ", ">From ", "Lines:", "Content-Length:" -- and make some guesses. But that's unlikely to work perfectly, and doesn't work at all when the file contains multiple variants. And we can't inspect each message, because we'd have to find where the message ends, which we don't yet know how to do. This probably won't work out.
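For what it's worth, the counting itself is trivial. Here is a minimal Python sketch, with a made-up function name; it only counts, and makes no attempt at the guessing.

```python
MARKERS = ("From ", ">From ", "Lines:", "Content-Length:")


def count_markers(text):
    """Count the lines that start with each marker (hypothetical helper)."""
    counts = {marker: 0 for marker in MARKERS}
    for line in text.splitlines():
        for marker in MARKERS:
            if line.startswith(marker):
                counts[marker] += 1
    return counts


sample = (
    "From a@example.org Thu Jan  1 00:00:00 1970\n"
    "Lines: 1\n"
    "\n"
    ">From me to you\n"
    "\n"
)
print(count_markers(sample))
```

The counts can't tell a header line from a body line that happens to look like one, which is the same boundary problem all over again.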

A common approach is to recognize whole From_ lines, metadata and all, to (probably) avoid false positives on unescaped message lines. But there are many syntaxes for the mbox From_ line -- some people have found about 20 syntaxes -- and it's unlikely that they've all been found. Software can recognize many known From_ syntaxes, and also use Lines: and Content-Length:, when they're present, but can't trust those lengths too much, because they're sometimes wrong. Software can treat the lengths as estimates, and look for many forms of whole From_ lines near them. All this put together can work pretty well, and does, in some programs. But it still isn't sure to work, and it shouldn't be this hard.
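As an illustration, here is a Python sketch that recognizes just one From_ syntax, the one used in the examples in this article. The pattern and the function name are assumptions made for the example; real software needs many such patterns.

```python
import re

# One From_ syntax only (an assumption for this sketch):
# "From <address> <weekday> <month> <day> <hh:mm:ss> <year>"
ASCTIME_FROM = re.compile(
    r"From (\S+) ([A-Z][a-z]{2}) ([A-Z][a-z]{2}) ([ \d]\d) "
    r"(\d\d:\d\d:\d\d) (\d{4})$"
)


def looks_like_from_line(line):
    """True if the whole line matches the one syntax above (hypothetical helper)."""
    return ASCTIME_FROM.match(line.rstrip("\r\n")) is not None


print(looks_like_from_line("From programmer@obscurity.org Thu Jan  1 00:00:00 1970"))
print(looks_like_from_line("From the beginning, mbox was a bad design."))
print(looks_like_from_line("From nobody@nowhere.invalid Thu Jan  1 00:00:00 1970"))
```

The first and third lines match; the second doesn't. The third is the sample line from the body of the Lines: message above, which is why whole-line matching only probably avoids false positives.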

There shouldn't be 20 ways to find the end of a message. There should be only one way.

Mbox is hopeless

Mbox cannot be fixed. Any fix is necessarily an incompatible change to the file format, and is not understood by existing software. The existing major variants were attempts to fix mbox; they have all failed to solve the overall problem.