Ex Bibliotheca

The life and times of Zack Weinberg.

Wednesday, 11 December 2002

# 6:30 PM

web logs and search engines

These days there's a lot of good content out there in the form of web logs. Unfortunately, it's not indexed well by search engines. The trouble is that the webcrawler comes by and records whatever's on the front page of the log at the time. By the time you go to make a search, a whole bunch more entries will have been added, pushing the entry you searched for off the front page. Entries in the archives may not be indexed at all.

This is despite the fact that most weblogs have 'permanent links': the little blue hashmark at the beginning of this entry is an <a> tag pointing to a URL which will always and forevermore reference this article. (Unless Panix goes belly up or something.) If the search engines knew about the permanent links, they could use them in their indexes, and it'd all work.

Here's how I would implement this. Suppose we invent a labelling scheme which will allow a webcrawler to tell that an <a> tag is a permanent link. I'd use the class attribute, like this:

  <a class="permalink-above" href="...">
  <a class="permalink-below" href="...">

"permalink-above" means that the tag precedes the text it is a permanent link for; "permalink-below" means that it follows that text. (Both styles are used.)

We also need a way to mark the block-level element that contains all the permanent links, so that navigation and page header boilerplate don't get sucked into the permalink indexing mode. For this, we define another class value, "weblog-content", to be put on a block-level element that contains (in the DOM sense) all of the permalink tags. The ambit of each permalink tag then runs as far as the next permalink tag in the appropriate direction (forward for "permalink-above", backward for "permalink-below"), or to the boundary of the weblog-content element, whichever comes first. Permalink tags outside the weblog-content element are ignored. If there is no element explicitly marked weblog-content, the body element gets the job.
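To make the ambit rule concrete, here is a rough sketch of how a crawler might split a page into per-permalink chunks. It is only an illustration, not a finished design: the names PermalinkScanner and split_entries and the /log/... URLs are made up for the example, it uses Python's standard html.parser module, it assumes well-formed markup with a single weblog-content element, and it leaves the permalink's own link text out of the chunk.

```python
from html.parser import HTMLParser


class PermalinkScanner(HTMLParser):
    """Record text runs and permalink tags found inside the content element."""

    def __init__(self, use_body):
        super().__init__()
        self.use_body = use_body    # True: fall back to <body> as the container
        self.found = False          # the container's start tag has been seen
        self.inside = False         # currently inside the container
        self.container_tag = None
        self.depth = 0              # open tags sharing the container's name
        self.in_permalink = False   # used to skip the permalink's own link text
        self.events = []            # ("text", str) or ("above"/"below", href)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if not self.inside:
            if self.use_body:
                is_container = (tag == "body")
            else:
                is_container = ("weblog-content" in classes)
            if is_container and not self.found:
                self.found = self.inside = True
                self.container_tag, self.depth = tag, 1
            return
        if tag == self.container_tag:
            self.depth += 1
        if tag == "a":
            for kind in ("above", "below"):
                if "permalink-" + kind in classes:
                    self.events.append((kind, attrs.get("href")))
                    self.in_permalink = True

    def handle_endtag(self, tag):
        if not self.inside:
            return
        if tag == "a":
            self.in_permalink = False
        if tag == self.container_tag:
            self.depth -= 1
            self.inside = self.depth > 0

    def handle_data(self, data):
        text = " ".join(data.split())
        if self.inside and not self.in_permalink and text:
            self.events.append(("text", text))


def split_entries(html):
    """Map each permalink href to the text in its ambit."""
    # First look for an explicit weblog-content element, then fall back to body.
    for use_body in (False, True):
        scanner = PermalinkScanner(use_body)
        scanner.feed(html)
        scanner.close()
        if scanner.found:
            break
    events = scanner.events
    chunks = {}
    for i, (kind, href) in enumerate(events):
        if kind == "text":
            continue
        step = 1 if kind == "above" else -1   # above: read forward; below: backward
        texts = []
        j = i + step
        # Stop at the next permalink tag or at the container boundary.
        while 0 <= j < len(events) and events[j][0] == "text":
            texts.append(events[j][1])
            j += step
        if step == -1:
            texts.reverse()
        chunks[href] = " ".join(texts)
    return chunks


page = """
<html><body>
<h1>Ex Bibliotheca</h1>
<div class="weblog-content">
  <a class="permalink-above" href="/log/2">#</a> <p>Second entry.</p>
  <a class="permalink-above" href="/log/1">#</a> <p>First entry.</p>
</div>
<p>Footer <a class="permalink-above" href="/ignored">#</a></p>
</body></html>
"""
print(split_entries(page))
# {'/log/2': 'Second entry.', '/log/1': 'First entry.'}
```

The heading and footer fall outside the weblog-content element, so they end up in no chunk, and the permalink in the footer is ignored. A real crawler would also have to cope with sloppy markup, which this sketch does not attempt.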

Search engines should then record each chunk of text in the ambit of a permalink tag as a separate logical document. However, links to the base URL for the weblog should count as links to all of these chunks, for scoring purposes. (This corresponds to the intuitive interpretation of a link to the base URL, which is "I like everything this author says.") A link to an individual permalink counts only for that chunk.
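The scoring rule can be sketched the same way. This is a deliberately naive illustration that just counts inbound links; how a real search engine weights links is not something I know, so treat the plain count as an assumption. The function name inbound_link_scores and the example.org URLs are invented for the example.

```python
def inbound_link_scores(base_url, permalinks, link_targets):
    """Count inbound links per chunk; a base-URL link counts for every chunk."""
    scores = {permalink: 0 for permalink in permalinks}
    for target in link_targets:
        if target == base_url:
            # "I like everything this author says": credit all chunks.
            for permalink in scores:
                scores[permalink] += 1
        elif target in scores:
            # A link to one permalink credits only that chunk.
            scores[target] += 1
        # Links to anything else are irrelevant to this weblog.
    return scores


base = "http://example.org/log/"
entries = [base + "1", base + "2"]
links_seen = [base, base + "1", base + "1", "http://example.org/other"]
print(inbound_link_scores(base, entries, links_seen))
# {'http://example.org/log/1': 3, 'http://example.org/log/2': 1}
```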

Your comments are requested.