Don't let mbox leak out

RFC 822 message format and mbox file format are sometimes confused. That is probably because mbox is very old: it was the only way mail was stored on Unix systems for about twenty years.

The mbox file format stores multiple email messages in a single file, and adds several text constructs in and around those messages. Depending on the variant, mbox adds some or all of these to each message:

  1. From_ line before the message, with "envelope sender" and date-time.
  2. Empty line after the message, for no particular reason.
  3. ">From " escaping of message lines.
  4. The header Content-Length: (value is number of bytes in body of message).
  5. The header Lines: (value is number of lines in body of message).

The metadata in the From_ line is redundant. The envelope sender is in the RFC 822 header Return-Path:; the date-time is in the most recent Received: header.

All of these mbox constructs, including the From_ line, are involved in finding the end of each message.

Here is a sample RFC 822 message, as it would be received from the network:

Return-Path: <programmer@obscurity.org>
From: <programmer@obscurity.org>
To: <chaos@abyss.void>
Message-ID: <0>
Date: Thu, 01 Jan 1970 00:00:00 +0000

From the beginning, mbox was a bad design.
All the attempts to fix it have failed.

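Here is that message as it would be stored in an mbox file, in an mbox variant that adds all the text constructs. The mbox constructs are the From_ line at the top, the headers Lines: and Content-Length:, the ">" in front of "From" in the body, and the empty line after the message.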

From programmer@obscurity.org Thu Jan  1 00:00:17 1970
Return-Path: <programmer@obscurity.org>
From: <programmer@obscurity.org>
To: <chaos@abyss.void>
Message-ID: <0>
Date: Thu, 01 Jan 1970 00:00:00 +0000
Lines: 2
Content-Length: 84

>From the beginning, mbox was a bad design.
All the attempts to fix it have failed.

The mbox constructs are part of mbox format, and should not exist anywhere but in mbox files. They don't exist in a message when it's sent across the network, and shouldn't exist when the message is stored in other file formats. Other file formats have their own ways to find the end of each message.
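To make the constructs concrete, here is a minimal Python sketch that wraps one RFC 822 message in all five of them. The function name to_mbox_entry and its parameters are hypothetical, and it makes several assumptions: LF line endings, a UTC asctime-style date in the From_ line, and Lines: and Content-Length: computed from the body as stored (after ">From " escaping), which is what the sample above suggests. Real mbox variants differ on these points.

```python
import time


def to_mbox_entry(message: bytes, envelope_sender: str, delivered: float) -> bytes:
    """Wrap one RFC 822 message (LF line endings) in mbox constructs."""
    header, _, body = message.partition(b"\n\n")
    header = header.rstrip(b"\n")

    # Construct 3: ">From " escaping of body lines that start with "From ".
    lines = body.splitlines(keepends=True)
    escaped = [b">" + ln if ln.startswith(b"From ") else ln for ln in lines]
    body = b"".join(escaped)

    # Constructs 4 and 5: headers describing the body as stored (assumption).
    extra = b"Lines: %d\nContent-Length: %d" % (len(escaped), len(body))

    # Construct 1: From_ line with envelope sender and date-time.
    stamp = time.asctime(time.gmtime(delivered)).encode("ascii")
    from_line = b"From %s %s\n" % (envelope_sender.encode("ascii"), stamp)

    # Construct 2: the final b"\n" is the empty line after the message.
    return from_line + header + b"\n" + extra + b"\n\n" + body + b"\n"
```

Called with the sample message, the envelope sender programmer@obscurity.org, and a delivery time of 17 seconds after the epoch, this produces the stored entry shown above, including Lines: 2 and Content-Length: 84.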

However, sometimes when software stores mail in other mail file formats, it adds some mbox constructs. That is a bug. Outside of mbox files, the mbox constructs are at best useless clutter, and sometimes misleading or damaging.

Maildir

Let's look at a concrete example: maildir. In a maildir tree, each message is in a separate file. End of file marks the end of the message. That is simple and reliable. Adding other ways to mark the end of the message is useless. A From_ line is useless; Lines: is useless. (Remember that the data fields in the mbox From_ line are redundant.) Yet both of those mbox lines sometimes appear in maildir message files.

Furthermore, they're invalid. A maildir message file contains an RFC 822 message, but the From_ line and the header Lines: are not RFC 822. The From_ line is not even valid RFC 822 syntax. Lines: is valid syntax, but undefined -- RFC 822 does not define it.

Furthermore, the value of Lines: is sometimes wrong, because it's non-standard and poorly defined. Sometimes the original value is wrong; sometimes the message is modified but Lines: is not updated.
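A quick way to see whether mbox has leaked into a maildir is to scan each message file for these two constructs. The following Python sketch is only an illustration: find_mbox_leaks is a hypothetical name, it assumes LF line endings, and it counts body lines naively to flag a Lines: value that no longer matches.

```python
def find_mbox_leaks(raw: bytes) -> list[str]:
    """Report mbox constructs found in one maildir message file."""
    leaks = []

    # A From_ line starts with "From " (with a space), unlike the From: header.
    if raw.startswith(b"From "):
        leaks.append("From_ line")
        raw = raw.partition(b"\n")[2]  # skip it so the header block parses

    header, _, body = raw.partition(b"\n\n")
    body_lines = len(body.splitlines())

    for line in header.splitlines():
        name, colon, value = line.partition(b":")
        # Folded continuation lines start with whitespace, so they never match.
        if colon and name.lower() == b"lines":
            claimed = value.strip().decode("ascii", "replace")
            note = ""
            if claimed != str(body_lines):
                note = " (stale: body has %d lines)" % body_lines
            leaks.append("Lines: %s%s" % (claimed, note))
    return leaks
```

An empty result means the file holds a plain RFC 822 message, at least as far as these two constructs go.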

Software example

I wrote some software to read and write email in various file formats. There's a class for each format, with methods that read and write a message at a time. The read methods return an object holding a message as straight RFC 822; the write methods take such a message object as an argument. Then there's a small application that converts stored mail among the formats: read with one class, write with another.

On read, file format constructs are removed; on write they're added.
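The shape of that design can be sketched in a few lines of Python. This is not the software described here; the names MailStore, read_messages, write_message, convert, and ListStore are all hypothetical, and ListStore is just an in-memory stand-in for a real format class.

```python
from typing import Iterator, Protocol


class MailStore(Protocol):
    """One class per file format; messages cross the boundary as plain RFC 822."""

    def read_messages(self) -> Iterator[bytes]:
        """Yield each stored message with the format's constructs removed."""
        ...

    def write_message(self, message: bytes) -> None:
        """Store one message, adding the format's constructs."""
        ...


class ListStore:
    """In-memory stand-in for a format that has no constructs of its own."""

    def __init__(self, messages=()):
        self.messages = list(messages)

    def read_messages(self) -> Iterator[bytes]:
        yield from self.messages

    def write_message(self, message: bytes) -> None:
        self.messages.append(message)


def convert(source: MailStore, target: MailStore) -> int:
    """Read with one class, write with another; return the message count."""
    count = 0
    for message in source.read_messages():
        target.write_message(message)
        count += 1
    return count
```

Because only plain RFC 822 passes between the two classes, no format's constructs can reach another format's files.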

The mbox read method removes all the mbox constructs (except ">From " escaping, which can't be undone reliably). The message read is returned without the From_ line, the empty line, Lines:, and Content-Length:. On write, all the mbox constructs are generated as necessary.
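As a rough illustration of that read side, here is a Python sketch that strips the constructs from a single mbox entry. The name from_mbox_entry is hypothetical. It assumes LF line endings, that the entry has already been cut out of the mbox file, and that the variant in use added the empty line after the message. It leaves ">From " escaping alone, for the reason given above.

```python
def from_mbox_entry(entry: bytes) -> bytes:
    """Strip mbox constructs from one mbox entry, leaving the RFC 822 message."""
    # Remove the From_ line.
    if entry.startswith(b"From "):
        entry = entry.partition(b"\n")[2]

    # Remove the empty line after the message (assumed to be present).
    if entry.endswith(b"\n\n"):
        entry = entry[:-1]

    header, sep, body = entry.partition(b"\n\n")

    # Remove Lines: and Content-Length:, including any folded continuations.
    kept = []
    dropping = False
    for line in header.split(b"\n"):
        if line[:1] in (b" ", b"\t"):
            if not dropping:
                kept.append(line)
            continue
        dropping = line.partition(b":")[0].lower() in (b"lines", b"content-length")
        if not dropping:
            kept.append(line)

    return b"\n".join(kept) + sep + body
```

Applied to the stored sample entry, this returns the original headers followed by the body, with the first body line still reading ">From the beginning".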

The dotmail read method returns a message without the dot line after it, and with dot-stuffing undone. The write method writes the message dot-stuffed and followed by a dot line.
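Dot-stuffing is simple enough to sketch in Python. The functions dot_stuff and dot_unstuff are hypothetical names, and the details are assumptions borrowed from the SMTP convention: any line starting with "." gets an extra "." in front, and a line holding only "." ends the message. The sketch also assumes LF line endings and a message that ends with a newline.

```python
def dot_stuff(message: bytes) -> bytes:
    """Return the message dot-stuffed and followed by a dot line."""
    lines = message.splitlines(keepends=True)
    stuffed = [b"." + ln if ln.startswith(b".") else ln for ln in lines]
    return b"".join(stuffed) + b".\n"


def dot_unstuff(stored: bytes) -> bytes:
    """Return the first message in stored, without stuffing or the dot line."""
    out = []
    for line in stored.splitlines(keepends=True):
        if line in (b".\n", b"."):
            break  # the dot line marks the end of the message
        out.append(line[1:] if line.startswith(b".") else line)
    return b"".join(out)
```

Unlike ">From " escaping, this round-trips: a stuffed line always begins with two dots, so a line that is just "." can only be the terminator.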

The maildir class just moves RFC 822 messages in and out of files.

This approach is conceptually clean, and works out nicely -- and does not put mbox constructs anywhere but in mbox files. That's the right thing to do.