BYTE, April 1991
By Christopher Locke
This is the original draft of an article that appeared in Byte magazine under the title "The Dark Side of DIP" (for Document Image Processing). The published version differed slightly.
The moral delivered by many folk tales is that the image -- the seemingly accurate representation of reality -- often lies, usually with dire consequences for those who put their trust in such alluring reflections. In the case of document imaging systems, this ancient caveat is still worth heeding.
If imaging captures picture-perfect digital facsimiles of document pages, as indeed it does, how can these images "lie"? When only a few dozen pages are involved, they don't; each screen image is a faithful replica of the paper document from which it was taken. However, when an imaging system contains thousands or millions of pages, let's just say it rarely reveals the whole truth. The point in question here is not accurate capture, but the later retrieval of those -- and only those -- pages relevant to some knowledge worker's immediate needs. If the system holds back crucial information, in effect, it lies.
Of course, we don't yet have information systems that lie intentionally out of malice. (Maybe this should be the new Turing test for genuine artificial intelligence.) But all information retrieval systems tend to lie by omission: they simply have little way of "knowing" what they contain relative to our queries. In document imaging systems, this deceit by omission is not an infrequent accident, but an inherent attribute of the technology that can cost organizations millions of dollars after the glitz wears off.
As many who have worked with such systems will already have guessed, the fatal flaw here has to do with characterizing the contents of document images such that relevant information can be recalled on demand. The technical term for this challenge is "indexing," and it applies to any form of computerized information retrieval. This seemingly straightforward concept will also be familiar to anyone who has ever used the back pages of a decent reference book.
While it may seem tangential to note that works of fiction do not have indexes, this odd fact is very much to the point. Rightly or not, publishers assume that novels will be read linearly, that is, "users" will become familiar with their contents by "processing" their pages from front to back. This assumption does not hold for non-fiction; potential readers may simply want to check a single fact in a 600-page book, or scan its several relevant pages in search of some highly specific information. Obviously, few of us would have the time or dedication to wade through all that text for so little return. A table of contents helps, but usually not enough. Thus was the index conceived.
Although such an index represents a simple concept, creating one is no laughing matter. Let's say our 600-page tome deals in part with artificial intelligence, and that we give it to ten different indexers to see what they come up with. The first thing we'd notice is that no two resulting indexes would be substantially alike. Because the process requires selectivity, and because of their differing background knowledge, different indexers would notice and emphasize different key words, phrases, concepts and relationships.
In addition, it is unlikely that all would decide to standardize on the same indexing terms. For example, "artificial intelligence," "expert systems," "intelligent machines," "knowledge-based programming," and "automated reasoning" may or may not refer to "the same thing" -- without a careful consideration of context and a good deal of specific knowledge of the field, it would be hard to say. Although there are only five in this example, there could easily be a dozen such terms that were fundamentally synonymous (or close enough). Our ten indexers would not be likely to telepathically settle on a single one; some would use one term, some another.
Even a single individual might not decide to use just one -- grouping all related references under, say, "artificial intelligence" -- but rather list each of the five in separate locations in the book's index, with differing page references for each. Understanding that such an approach would needlessly confuse readers, an experienced indexer would choose a single term under which to supply page references to all related concepts -- say again, "artificial intelligence" -- but would then list the other synonyms in their alphabetical index locations along with a pointer to this primary term. For instance: "Expert Systems, See Artificial Intelligence." Indexers must always accommodate large heterogeneous reader communities, individual members of which may be inclined to look for any one of many possible phrasings when using an index to locate material about a single concept.
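The mechanics of such "See" pointers are simple enough to sketch in a few lines of code. In this hypothetical Python fragment, the term lists and page numbers are invented purely for illustration; a real index would of course be far larger:

```python
# A back-of-book index as two small tables: page references are
# stored only under primary terms; synonyms carry "See" pointers.
# All terms and page numbers here are invented for illustration.

primary_index = {
    "artificial intelligence": [12, 88, 341, 502],
}

see_pointers = {
    "expert systems": "artificial intelligence",
    "intelligent machines": "artificial intelligence",
    "knowledge-based programming": "artificial intelligence",
    "automated reasoning": "artificial intelligence",
}

def look_up(term):
    """Follow a 'See' pointer, if any, then return page references."""
    term = term.lower()
    term = see_pointers.get(term, term)   # redirect synonyms
    return primary_index.get(term, [])

print(look_up("Expert Systems"))   # -> [12, 88, 341, 502]
```

The point of the design is exactly the one made above: page references live in one place only, while every synonym a reader might try still leads there.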
If we substitute "imaged document" for "book," the real problem begins to emerge: How will this document be retrieved? While few if any documents will be 600 pages long, there will be many thousands of shorter ones to search through. Even this simplified example should suggest that indexing is not a low-level task you would wisely hand off to a temp. And it gets worse, not better, as larger volumes of text become involved. As we will see, this caution holds regardless of the technology brought to bear.
Choosing the "right" primary indexing term is no big deal at the level of a single book. The choice can be somewhat arbitrary as long as it is consistently applied. But what happens when many books are involved, for example, at the Library of Congress? This national document repository faces a challenge different from that of back-of-book indexing. However, these problems are related -- and both are intimately connected to the problem of document imaging. LC does not (yet) attempt to provide as detailed a description of its holdings as does a book's index of its contents, but it does attempt to roughly characterize what each book is "about." LC "subject indexers" attach several topic descriptors to every non-fiction work published in America. You can find an example on the copyright page of any modern non-fiction book under the somewhat odd rubric "Library of Congress Cataloging-in-Publication Data." Here's an example:
Dreyfus, Hubert L.
Mind over machine.
Includes index.
1. Artificial Intelligence  2. Computers  3. Expert Systems (Computer Science)  I. Dreyfus, Stuart E.  II. Athanasiou, Tom  III. Title
Q335.D73 1986    006.3    85-20421
ISBN 0-02-908060-6
The terms "1. Artificial Intelligence 2. Computers 3. Expert Systems (Computer Science)" are LC topic descriptors. If you'd like to see more, your local librarian can point you to the three monster volumes that contain the complete "Library of Congress Subject Headings" (or simply LCSH). Or, you can pop "CDMARC Subjects" into your CD-ROM player, MARC being an acronym for "Machine Readable Cataloging" (see box for Library of Congress). This disc is an absolutely terrific bargain, especially for anyone who is currently being pitched by multiple document imaging vendors -- might as well get an early taste of what you're really going to be up against!
There are two fundamental ideas underlying LCSH. The first is to guarantee (or at least encourage) consistency in the selection of indexing terms. The problem here is often called "vocabulary control" and, logically, its results are called "controlled vocabularies." Such agreed-upon lists of valid terms guide subject indexers in characterizing a book's subject matter consistently, i.e., so that multiple synonyms will not be applied at random by multiple indexers. Since such synonyms obviously do exist (otherwise there would be no problem), controlled vocabularies must also provide "See" pointers from terms that are discouraged for indexing use to those that are approved (e.g., "Computer control: See Automation" or "Computer insurance: See Insurance, computer"). In addition, "See also" pointers are provided to suggest related keywords (e.g., "Computer input-output equipment: See also Automatic speech recognition").
The second fundamental idea behind LCSH is to establish not just terms, but categories of terms, and relationships among these categories. "See also" references point to what are technically called "Related Terms" (or RTs). In addition, the LCSH classification includes pointers to "Broader Terms" (BTs) that indicate more inclusive subject categories, as well as to "Narrower Terms" (NTs) which name more specific sub-categories. What begins to emerge here is a quasi-hierarchical representation of a field of knowledge. Though it differs radically from a simple synonym list, this sort of extended conceptual taxonomy is often called a "thesaurus." While "See" pointers and the approved terms they indicate may be the product of somewhat arbitrary decisions, "See also" pointers entail some knowledge of the domain under consideration. In fact, the "subject analysis" which librarians perform to create these categories and relationships is strongly akin to what the AI literature calls "knowledge engineering."
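In data-structure terms, an LCSH-style thesaurus entry is little more than a record carrying these relation types. The following hypothetical Python sketch uses invented headings, not actual LCSH entries, but shows how BT links make the quasi-hierarchy navigable:

```python
# Each approved heading carries Broader, Narrower, and Related Terms,
# plus the discouraged synonyms that "See"-redirect to it ("UF" is
# thesaurus shorthand for "Used For"). Headings here are illustrative.

thesaurus = {
    "Artificial intelligence": {
        "BT": ["Computer science"],
        "NT": ["Expert systems", "Automatic speech recognition"],
        "RT": ["Cognitive science"],
        "UF": ["Intelligent machines", "Automated reasoning"],
    },
    "Expert systems": {
        "BT": ["Artificial intelligence"],
        "NT": [],
        "RT": ["Decision support systems"],
        "UF": ["Knowledge-based systems"],
    },
}

def broaden(term):
    """Walk BT links upward, yielding ever more inclusive headings."""
    while term in thesaurus and thesaurus[term]["BT"]:
        term = thesaurus[term]["BT"][0]
        yield term

print(list(broaden("Expert systems")))
# -> ['Artificial intelligence', 'Computer science']
```

A searcher who starts too narrow can broaden the query by walking up; one who gets swamped can walk down NT links instead -- which is roughly what the function keys on the CDMARC Subjects disc let you do by hand.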
Described in barest outline here, these and related library techniques have evolved over untold thousands of man-years of deep experience in document collection management. Chances are good, though, that you'd really rather not wade into all this, thanks just the same. Here's where those terminally slick document imaging systems come in, right on cue. Their stratospherically high technology promises to solve all your information problems without necessitating such a descent into the Ninth Circle of Cognitive Complexity. But think a minute. Did librarians really come up with such schemes because they get off on complication? Or maybe to indulge a professionally twisted sense of humor? More likely, these techniques exist because they -- or something very much like them -- are crucial to getting the job done. And the job, in this case, is to serve the information needs of communities very much like your own.
If you're seriously considering document imaging, your organization is probably already hitting just such indexing and retrieval hurdles -- they apply as much to paper-based filing systems as to computerized databases -- possibly without much clue as to how deep such problems can quickly get. Don't expect imaging vendors to point them out, either, because their technology offers no substantial relief. Most product literature, and even trade press reporting, makes little mention of indexing as it relates to document imaging, except to acknowledge that the process requires "manual" effort. But note carefully that this is hardly the same kind of manual effort required to type 80 words per minute. We're not talking "Kelly Girls" here; we're talking knowledge engineers.
Part of the problem -- but only part -- is that after it has been "captured" in an imaging system, you don't actually "have" the real document. What you've got is pictures of pages. (If you've ever wanted to cut a longish but particularly relevant passage from a document image and paste it into a report, you know that a picture really is worth a thousand words -- of typing.) Apart from optical character recognition (about which, more later), manual indexing is the only way to tie these pictures to meaningful concepts by which they might later be retrieved.
In fact, document imaging technology suggests many surprising parallels with the now nearly-archaic online database systems once used almost exclusively to search bibliographic records of published periodical literature. In those bygone days, storage was too expensive to allow the full text of documents to be kept online. As with imaging systems today, the documents weren't "really there." Instead, the physical page-pictures were kept only on library shelves, and the database was an automated assistant for paper-based filing and retrieval.
To index records in these online bibliographic systems, some types of consistent information were placed into structured database fields: author, title, journal, publication date, and so on. For books, such information might include Dewey Decimal Classification, LC call number, LC card number and ISBN (several of these more arcane data items are shown in the sample Cataloging-in-Publication record, above). However, providing only this type of standard information puts a tremendous burden of knowledge on searchers. It assumes they know exactly what they're looking for, which is seldom the case. More often, they want to know about something, and are not even sure what that something should be called. It might have fifty possible expressions -- or only one that is so new or technical they've never heard the term used. Only rarely will they know the titles of articles or books in which the concept appears, or the name of an author who has written on the subject.
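A fixed-field bibliographic record of this sort is easy to picture as a data structure. Here's a hypothetical Python sketch, with field values borrowed from the Dreyfus Cataloging-in-Publication example above; the class and function names are my own invention:

```python
from dataclasses import dataclass

@dataclass
class BibRecord:
    """One fixed-field bibliographic record; every attribute is a
    slot the searcher must already know how to fill."""
    author: str
    title: str
    year: int
    lc_call_number: str
    lc_card_number: str
    isbn: str

rec = BibRecord(
    author="Dreyfus, Hubert L.",
    title="Mind over machine",
    year=1986,
    lc_call_number="Q335.D73 1986",
    lc_card_number="85-20421",
    isbn="0-02-908060-6",
)

# Retrieval against such records is exact-match only: you must
# already know the author, title, or number you're looking for.
def find_by_isbn(records, isbn):
    return [r for r in records if r.isbn == isbn]

print(find_by_isbn([rec], "0-02-908060-6")[0].title)   # -> Mind over machine
```

The sketch makes the burden visible: there is no slot for "what is this book about?" -- which is exactly the gap the two remedies below were invented to fill.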
Two remedies were developed to deal with these inherent shortcomings of fixed-field database retrieval. The first is simply to add a new field: Subject. As already described, subject indexing implies both the existence and skilled use of a structured thesaurus of related concept categories into which the elements of a controlled vocabulary have been appropriately placed. From such a thesaurus, an indexer selects keywords that accurately characterize a particular document, and attaches them to that document's database record to assist retrieval. Note that the indexer selecting these subject headings must have sufficient understanding of 1) the document itself, 2) the field of knowledge that forms its context, and 3) the methodology for applying subject headings. Note further that this level of indexing is a fundamental minimum requirement for retrieving picture-pages from a document image-base. Unless you have already developed a comprehensive controlled vocabulary, a domain-specific thesaurus, and a pool of competent indexers, you are either leaving a big hole in your imaging budget or designing an "information" system that will stonewall everyone who uses it.
The second remedy for the inadequacy of fixed-field database descriptors is the development of document abstracts. Each document is summarized in a succinct and cogent paragraph, and the full text of these abstracts is then automatically indexed such that virtually every word serves as a retrieval hook. Applied to good abstracts, this software-driven "inverted-file indexing" technique greatly increases the likelihood of finding relevant material. But note again that succinctness and cogency do not come cheap. Their cost is a function of knowledge -- abstracters being relatively high-level "knowledge workers" in the true sense. Some believe that AI will eventually master the "natural language understanding" problem, thus enabling abstracts to be generated automatically. Some also believe that the moon is made of green cheese. (However, see box for the CLARIT Project and the Text Categorization Shell, both of which successfully use some AI/natural-language techniques to assist in thesaurus construction and indexing.)
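Inverted-file indexing itself is the mechanically easy part, as this minimal Python sketch suggests (the abstracts are invented; writing good ones is the expensive step the paragraph above describes):

```python
# Build an inverted file: every word of every abstract becomes a
# retrieval hook pointing back to the documents that contain it.
import re
from collections import defaultdict

abstracts = {  # document ID -> abstract text (invented examples)
    "doc1": "Expert systems apply artificial intelligence to diagnosis.",
    "doc2": "Survey of optical character recognition accuracy.",
}

inverted = defaultdict(set)
for doc_id, text in abstracts.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        inverted[word].add(doc_id)

def search(*words):
    """Return documents containing all of the given words."""
    sets = [inverted.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(search("artificial", "diagnosis"))   # -> {'doc1'}
```

Note what the machine contributes and what it doesn't: file inversion is a few lines of bookkeeping; the succinct, cogent abstract that makes the index worth searching is pure knowledge work.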
As the cost of storage continues to drop dramatically, online information services -- and especially CD-ROM applications developers -- are moving away from providing rudimentary fixed-field bibliographic citations (with or without subject keywords and searchable abstracts) and toward the delivery of full text. That is, search-it, dump-it, cut-it, paste-it ASCII. At first, users experience such systems as technological marvels. Every word a keyword? Zounds! While the value of full-text databases is hardly to be denigrated, the initial enthusiasm they inspire tends to be rapidly replaced by massive frustration -- for all the reasons given so far: no controlled vocabulary, no thesaurus, no clue. The problems that give rise to this frustration are technically termed "recall" and "precision," which can be roughly translated as: "How am I supposed to read 300 documents before 5 o'clock?" and "That's not what I meant!"
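Both terms have exact definitions: recall is the fraction of all relevant documents your search actually retrieved; precision is the fraction of retrieved documents that are actually relevant. A small Python sketch, with wholly invented numbers chosen to match the 300-documents-before-5-o'clock scenario:

```python
def recall_precision(retrieved, relevant):
    """Compute recall and precision for one query.
    retrieved, relevant: sets of document identifiers."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical query: 300 documents come back, but only 30 of them
# are relevant -- and 50 relevant documents exist in the collection.
retrieved = {f"d{i}" for i in range(300)}
relevant = {f"d{i}" for i in range(270, 320)}
r, p = recall_precision(retrieved, relevant)
print(r, p)   # -> 0.6 0.1
```

In this scenario you've missed 40 percent of what you needed (recall 0.6) and must still wade through nine irrelevant documents for every useful one (precision 0.1) -- the two complaints quoted above, in numeric form.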
Despite all the bad press it's gotten, we're going to eventually have to take "mere semantics" seriously. What documents are "about" simply can't be captured by a few words casually jotted in a database header, nor even by dumping all the discrete words they contain into an inverted-file index. Even for full-text document databases, higher-level tools are required for the type of sophisticated semantic analysis prerequisite to the construction and maintenance of an adequate thesaurus. More important, such tools must be put into the hands of intelligent and knowledgeable people who are not averse to long hours of difficult intellectual work. (See box for American Library Association and Special Library Association -- good sources for locating such people.)
But we're not even seriously considering full-text databases here. We're still looking at picture bases. To overcome this rigid dichotomy, there are hybrid possibilities well worth exploring. If you can afford even more storage than images require (and that's a lot), it may be feasible to make certain types (or portions) of imaged documents full-text-searchable. This involves using optical character recognition to convert page images into ASCII (where such conversion is possible). The ASCII will be "dirty" unless you pay people to correct the inevitable OCR errors, but -- while clean text is certainly preferable -- this isn't absolutely necessary for indexing purposes. Automatic file-inversion techniques can then be used to create an index in which every word correctly captured by the OCR process is linked by a pointer to its corresponding page image. The junk you just throw away.
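The word-to-image linkage works the same way as any other inverted file: index whatever words the OCR pass yields cleanly, throw the junk away, and point each surviving word at its page image. A hypothetical Python fragment (file names and OCR garble invented for illustration):

```python
# Link dirty OCR output back to page images: each cleanly recognized
# word points to the image file it came from; garbled tokens are
# simply dropped rather than corrected.
import re
from collections import defaultdict

ocr_output = {  # page-image file -> (dirty) OCR text, invented
    "page_001.tif": "Expert syst3ms apply artificial intelligence.",
    "page_002.tif": "Optical ch@racter rec0gnition errors abound.",
}

word_to_pages = defaultdict(set)
for image, text in ocr_output.items():
    for token in text.lower().split():
        token = token.strip(".,;:")
        if re.fullmatch(r"[a-z]+", token):    # keep clean words only
            word_to_pages[token].add(image)   # the junk is thrown away

print(sorted(word_to_pages["artificial"]))   # -> ['page_001.tif']
print("syst3ms" in word_to_pages)            # -> False
```

The tolerance for dirt comes from redundancy: a relevant page usually contains its key terms more than once, so losing a few occurrences to OCR garble still leaves enough hooks to retrieve the image.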
While this approach is obviously more involved and costly than straight imaging, retrieval is remarkably improved. Cost falls into better perspective if you consider that effective retrieval is the only rational reason to build information systems of this type in the first place. Of course, having put such a hybrid text/image system together, you'll still be faced with the same challenge that accompanies any full-text retrieval effort: the knowledge engineering that goes into constructing a first-class thesaurus of relevant concepts. In this regard, a product called Topic is an education in itself, and it will work as well in the "dirty OCR"/image scenario as with plain text (see box for Verity, Inc.). Whatever the tools used, information managers would be ill-advised to see this essential knowledge engineering process as an optional step in creating usable document retrieval systems. The real problem for unalloyed imaging systems is that it's not even optional; it's impossible.
So far, I have not admitted that there are situations in which there is no alternative to document image management. OK, I admit it. If you have zillions of forms filled out in handwriting, imaging is better than paper, yes. If you have lots of critical graphic material, ditto. And there are probably many other cases. But just because there may seem to be no practical alternative, the indexing problems described here don't go away. For this reason, handwritten forms will sooner or later have to be converted into machine-readable ASCII text. If the information they contain is valuable enough to justify imaging in the first place, the minimal indexing typically applied to such images will so degrade retrieval that the costs directly resulting from lost business opportunities will make even manual conversion seem cheap by comparison. The more critical such information is to an organization's health and longevity, the sooner the crunch will be felt.
The core issues here are epistemological: How do we know? What constitutes knowledge? Although raising such questions has always been seen as unbusinesslike, if not downright flaky, look for this to change. Despite the glut of verbiage about the Information Age (Society, Economy, Anxiety), something has definitely changed from when the focus of business was almost exclusively on product. We speak less of production and more of productivity; less of quantity and a lot more about quality. These are significant reflections of a new concern with how new things can be done, old things done better, and the human understanding required to do either. Rather than a set of objective facts that can be known in advance, knowledge is the result of such understanding.
In contrast, image processing presupposes that what is important in a document is already known. We type into the index field: this document is about x. But that's today. What about tomorrow? We may want to turn to our information resources to mine a completely different type of ore. How many times do you hear the phrase "for as long as records have been kept..." and notice that the data doesn't go back very far? This doesn't necessarily mean that record keeping began on the date given or that records before that date contain no relevant data. Often, it means that the records kept -- and the method used to keep them -- do not allow retrieval of information which has been "captured," but only recently been perceived as worth knowing about, that is to say: indexed. If it's in there but no one can get it out, it isn't information. By any meaningful definition, information must inform someone.
Take for example "acquired immune deficiency syndrome." Not too long ago, we had no name for this disease. Though it was as deadly then as it is today, medical diagnosticians didn't know what to call it, or even what it was. Today, we know a lot more than we did then about the illness we have since come to call AIDS. But it took longer than it might have to arrive at this knowledge, and meanwhile, an unnamed contagion spread through unsuspected vectors. Granted, the history of AIDS so far has had deep social, political and legal ramifications. Might it also have had a non-trivial technological component? Could medical records maintained in databases -- as images or otherwise -- have been better used to correlate the emergence of an as-yet undiagnosed constellation of symptoms with the histories of patients in which these symptoms were being observed? Such records may very well have contained vital information that wasn't readily accessible because of inherent shortcomings, not in the ability of medical researchers, but in their information systems. If there is no name, there will be no explicit field or index term. If there is no retrieval hook, no documents will be returned in response to queries. If searchers asking intelligent questions are not informed, events will follow their own course, with no one the wiser.
Although somewhat hypothetical, the example is relevant if it makes you ask "What am I missing in my own work?" In fact, the whole area of medical epidemiology forms a good analogy to what many organizations would like to do better in the marketplace today: track potential yet unidentified problems and opportunities, and either head them off before they go critical, or take advantage of them before the competition catches on. Accomplishing this involves pattern recognition of such a high caliber that it is most often referred to as "intuition" and considered a near-mystical skill. Yet people do this sort of thing all the time -- and more could do it more often given software tools which not only enabled but encouraged deeper exploration of documentary evidence.
How much is "worth knowing" about your market? Here, you have ten fields in your customer service database to type in notes on that last bug-report call -- or ten keywords to attach to that image of a letter from an irate consumer. Hell, make it twenty, fifty or five hundred. This sort of structured data -- about characteristics you have already identified as "significant" -- will never be adequate. The development that zaps you is always the one you didn't see coming. And you'll always get zapped if you can't quickly restructure or re-index information in light of more recent intelligence (think CIA, not AI).
Information resources -- document collections especially -- are not just mountains of facts we already know, but research bases to explore for clues about what we don't yet understand. More than that, they are the foundation on which new knowledge will be built by accretion. Detailed annotations and even off-the-cuff comments by knowledgeable people can add enormous value to documents, if -- and only if -- they are also retrievable. Hypertext linkages created between related texts can likewise be high-value-added contributions to greater comprehension, the non-negotiable prerequisite to effective action. But all this requires text, not pictures of words. And not just text, but sophisticated conceptual maps which can be continually "debugged" and incrementally extended to make sense of the words, which themselves are simply tokens of some presumably real territory beyond the document.
Since we opened with the Wicked Witch's lying mirror, let's close with something a little less fabulous and more contemporary. This is legendary PI Philip Marlowe in the process of searching a dead woman's apartment. You can almost hear Bogart's sibilant inflection:
"The cops would have seen all this. They'd have looked at everything like they do, and anything that mattered would be down in a box in property storage with a case tag on it. Still, they didn't know all the things I knew, and I was hoping I might see something that wouldn't have meant anything to them."
Raymond Chandler and Robert B. Parker, Poodle Springs, Berkley Books, 1990
If you read "document image database" for "property storage," the analog of the "case tag" is obviously an index entry. Today, the cops probably just take a picture and scribble a few notes on the back. Like Marlowe, I'd rather get closer to the hard evidence. Decide for myself who's the fairest of them all.
For Further Exploration...

There is no better way to understand the real challenges and opportunities involved in text management than to get your hands dirty in the problem. The following is an eclectic mix of software tools, R&D efforts and professional associations that should provide plenty of insight into the wide range of options full text affords.
Text Management Systems
$995 for Professional Edition
10573 W. Pico Boulevard
Los Angeles, CA 90064
A hybrid text/data base that allows a combination of fixed fields and fully-indexed free text. Includes a full programming language for building applications.
2155 North Freedom Boulevard
Provo, UT 84604
An information storage and retrieval system that enables full-text indexing and search as well as hypertext navigation. Topical/structural segmentation of textbases is a strong feature.
55 Princeton-Hightstown Road
Princeton Junction, NJ 08550
An information system whose primary method of retrieval is the navigation of hypertext networks. Significantly, this is an authoring tool in addition to its presentation capabilities.
Document Structure: Tagging and Recognition
$495 for Mac; $695 for DOS; $1100-2500 for Unix
720 Spadina Avenue
A/E is an SGML editor. If you keep hearing the acronym SGML -- Standard Generalized Markup Language -- but don't know what it's all about, here's a good place to begin learning about a powerful and growing document management/interchange methodology.
$2400 on DOS; $3100 on UNIX
947 Walnut Street
Boulder, CO 80302
While OCR recognizes characters, FastTag's pattern-matching capabilities recognize document structure and convert it into various machine-readable markup formats, SGML among them.
CD-ROM: Tools and Applications
$315/year; updated quarterly
Library of Congress
Customer Services Section
Cataloging Distribution Service
Washington, DC 20541
This amazing disc contains all the subject descriptors used by the Library of Congress (and many information vendors) to describe document content. Function keys enable a sort of hypertext navigation through the thesaurus structure.
The Original Oxford English Dictionary on Compact Disc
Oxford University Press
Electronic Publishing Division
200 Madison Avenue
New York, NY 10016
(212) 679-7300 ext 7370
The OED is the operating system manual set for the English language. The tagged structure and full-text indexing employed here enable searches never possible with the massive 13-volume hardcopy original.
$995/year; updated monthly
One Park Avenue
New York, NY 10016
A highly practical collection of full-text and abstracted articles from hundreds of computer-oriented publications. This new incarnation of the older "Computer Library" CD makes far better use of the subject indexing descriptors attached to each document record.
Tools and Languages
Word processors are inadequate for much of the preprocessing required for serious text management applications. The products listed here are programming tools that may require some effort to master, but will repay that effort in greatly increased control of textual information.
$249 for DOS version
Mortice Kern Systems
35 King Street North
Waterloo, Ontario N2J 2W9
The regular-expression based pattern-matching capabilities of Unix tools like ed, sed, grep and awk constitute a sort of Swiss Army Knife for text processing. The MKS Toolkit provides these and many other extremely useful text manipulation capabilities in a DOS environment.
PolyAwk and Sage Professional Editor
$195 and $295, respectively, for DOS
1700 N.W. 167th Place
Beaverton, OR 97006
Sage offers awk either by itself or as a built-in language for modifying and extending the SPE text editing environment: tools within tools within tools.
Snobol, Spitbol, and Icon
P.O. Box 1123
Salida, CO 81201
These three programming languages employ powerful pattern-matching techniques designed specifically for processing text. Available in various PC, Mac and Unix implementations.
The High End
$15,000 to $150,000
1550 Plymouth Street
Mountain View, CA 94043
This document retrieval system employs "concept retrieval," a shared knowledge base approach which captures information about subject matter of interest to users and stores it as reusable objects called "topics."
Text Categorization Shell
$124,000 (includes installation and training)
Carnegie Group, Inc.
Five PPG Place
Pittsburgh, PA 15222
TCS automatically categorizes text by semantic content rather than simple lexical string occurrences. It has been used by Reuters for subject indexing their online text databases.
(inquire about licensing)
Intelligent Technology Group
115 Evergreen Heights Drive
Pittsburgh, PA 15229
Although a commercial product has not been released, this thesaurus-based text management technology incorporates many of the concepts referred to in this article and is currently licensable.
The CLARIT Project
(inquire about sponsorship arrangements)
Laboratory for Computational Linguistics
Carnegie Mellon University
Pittsburgh, PA 15213
CLARIT stands for Computational Linguistic Approaches to Retrieval and Indexing of Text. This research project has obtained encouraging results, and is open to corporate sponsors seriously exploring text management solutions.
Document Management Expertise
The following are excellent sources for publications on library science, as well as of people who know how to apply the principles discussed here.
American Library Association
50 East Huron Street
Chicago, IL 60611
Special Library Association
1700 18th Street, NW
Washington, DC 20009