pwlog 3.0.0

The purpose of pwlog3 is to enhance the readability and manageability of XCLF weblogs, as rendered by the getclogs program.

When invoked without any command-line options, pwlog3 simply copies its input to its output, possibly with some minor normalization of the URL data. The usefulness of pwlog3 comes from the transformations it performs in response to the various command-line options.

Running pwlog3

If you have already run getclogs and saved the output to a file, you can use this file as input to pwlog3 by typing:

pwlog3 logfilename > newlogfilename

However, if you do not specify an input log file name, pwlog3 will automatically call getclogs for you. In other words, instead of typing

getclogs > logfilename
pwlog3 logfilename > newlogfilename

you can simply type

pwlog3 > newlogfilename

Options

The features of pwlog3 are selected through its option switches. To see a complete list of options and a short help message, type

pwlog3 -h

The options are as follows:

pwlog3 -r

The single least informative aspect of the getclogs output is its use of IP numbers to identify the computers that requested your webpages. The -r option of pwlog3 converts these numbers to hostnames, making it much easier to figure out where your visitors are coming from. Be aware, however, that (a) approximately 10-25% of IP numbers cannot be converted to hostnames, perhaps because the computers haven't been assigned "real names", and (b) the IP-to-name conversion takes time: on a busy site it can take hours, or even days.

Invoking this option on the example XCLF log would result in the following output:


ip68-2-201-90.ph.ph.cox.net - - [01/Aug/2004:00:29:43 -0400] "GET /www.speedskating.com/wl/show.rp HTTP/1.1" 200 17617 "http://www.google.com/search?hl=en&ie=UTF-8&q=speed+skating+in+phoenix&btnG=Google+Search" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707"
ip68-2-201-90.ph.ph.cox.net - - [01/Aug/2004:00:29:43 -0400] "GET /www.speedskating.com/css/doz.css HTTP/1.1" 200 3553 "http://www.speedskating.com/wl/show.rp?id=inline_clubs/united_states" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707"
ip68-2-201-90.ph.ph.cox.net - - [01/Aug/2004:00:29:43 -0400] "GET /www.speedskating.com/gfx/logo/ssk030130a.gif HTTP/1.1" 200 830 "http://www.speedskating.com/wl/show.rp?id=inline_clubs/united_states" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707"
yahoobb219018208079.bbtec.net - guest [01/Aug/2004:00:29:46 -0400] "GET /www.skatecity.com/ HTTP/1.1" 200 1148 "http://www.google.com/search?q=SKATE&btnG=Google+%E6%A4%9C%E7%B4%A2&hl=ja&ie=UTF-8&c2coff=1" "Opera/7.23 (Windows 98; U)  [ja]"
yahoobb219018208079.bbtec.net - guest [01/Aug/2004:00:29:46 -0400] "GET /www.skatecity.com/gfx/rhs_infobahn.gif HTTP/1.1" 200 739 "http://www.skatecity.com/" "Opera/7.23 (Windows 98; U)  [ja]"
yahoobb219018208079.bbtec.net - guest [01/Aug/2004:00:29:46 -0400] "GET /www.skatecity.com/gfx/rhs_speedskating.gif HTTP/1.1" 200 666 "http://www.skatecity.com/" "Opera/7.23 (Windows 98; U)  [ja]"
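If you want to post-process lines like the ones above yourself, the fields can be pulled apart with a regular expression. The following is only a sketch: it assumes that every entry follows the layout visible in the sample (host, two identity fields, bracketed time, quoted request, status, size, quoted referrer, quoted browser string), and the names LINE and parse_line are made up for this illustration. They are not part of pwlog3.

```python
import re

# Assumed layout, inferred from the sample entries only:
# host ident user [time] "request" status size "referrer" "agent"
LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of the fields in one log entry, or None if it does not fit."""
    match = LINE.match(line)
    return match.groupdict() if match else None
```

Entries that do not fit the assumed layout come back as None, so a caller can skip or report them rather than guess at their contents.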

pwlog3 -m: Omit any request coming from any *.panix.com and *.access.net host.
pwlog3 -M: Include only requests coming from within the *.panix.com and *.access.net domains.
pwlog3 -b pattern: Include this log entry only if the request came from a machine whose name matches this pattern (a Perl regexp). Note: This test is made after the IP-to-hostname conversion is attempted, assuming the -r option is turned on.
pwlog3 -B pattern: Omit this log entry if the request came from a machine whose name matches this pattern (a Perl regexp). Note: If you specify any combination of the -b, -B, -m and -M options, only one of them will be evaluated. Preference is in the order just given (i.e., -b wins).
pwlog3 -d somedate: Report only entries occurring on or after the specified date. The format of the specified date must be YYYY:MM:DD; for example, to obtain a report limited to requests on or after August 15, 1997, you would replace somedate with 1997:08:15. Note: Remember that the only web requests which will be checked against the specified date are those from the log file(s) you've specified.
pwlog3 -D somedate: Similar to the -d option except that it reports only entries on or before the specified date.
pwlog3 -f pattern: Skip this log entry if the filename does not include this pattern (a Perl regexp).
pwlog3 -F pattern: Skip this log entry if the filename does include this pattern (a Perl regexp).
pwlog3 -l: Execute getclogs -o and use the result as input for pwlog3. This option is ignored if you specify an input log file.
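The host and date rules above can be easier to follow when written out as code. The sketch below is not how pwlog3 is implemented (pwlog3 uses Perl regexps, and this is Python); it only mirrors the documented behavior. The function names, the keyword arguments named after the switches, and the decision to compare dates exactly as they are written in the log (ignoring the time zone offset) are all assumptions made for the illustration.

```python
import re
from datetime import date

# Hosts inside the *.panix.com and *.access.net domains
PANIX = re.compile(r"\.(panix\.com|access\.net)$")

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def keep_host(host, b=None, B=None, m=False, M=False):
    """Evaluate only one host filter, in the documented order -b, -B, -m, -M."""
    if b is not None:
        return re.search(b, host) is not None   # -b: keep only matching hosts
    if B is not None:
        return re.search(B, host) is None       # -B: drop matching hosts
    if m:
        return PANIX.search(host) is None       # -m: drop Panix hosts
    if M:
        return PANIX.search(host) is not None   # -M: keep only Panix hosts
    return True


def parse_option_date(text):
    """Turn the YYYY:MM:DD form used by -d and -D into a date."""
    year, month, day = (int(part) for part in text.split(":"))
    return date(year, month, day)


def keep_date(stamp, d=None, D=None):
    """stamp is the bracketed log time, e.g. '01/Aug/2004:00:29:43 -0400'."""
    day, month, year = stamp.split(":", 1)[0].split("/")
    when = date(int(year), MONTHS[month], int(day))
    if d is not None and when < parse_option_date(d):
        return False    # earlier than the -d date
    if D is not None and when > parse_option_date(D):
        return False    # later than the -D date
    return True
```

Because the first filter that is set decides the outcome, a call such as keep_host(host, b="cox", B="cox") behaves as if only -b had been given.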

IP-to-Hostname Resolving and the Host Hash File

The -r option in the pwlog and pwstat programs determines the machine names corresponding to the IP numbers in the weblogs by doing a host lookup for each number. However, since most people who hit a good website hit it more than once, doing a lookup for every single entry in a log file would be needlessly repetitious. The programs therefore maintain a file of matching IP numbers and hostnames, and they check this file for a match before actually executing an IP lookup. At present, each user who runs pwlog or pwstat gets a separate file; there is no Panix-wide common file that all pwlog and pwstat users can access. The host file is named .pwhosts, and you will find your copy in your home (login) directory.
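The check-the-file-first idea described above looks roughly like this in code. This is a sketch of the general technique, not the actual pwlog or pwstat code. In particular, the one-pair-per-line file layout ("ip hostname") is a guess made for the example and may not match the real .pwhosts format, and all of the function names are hypothetical.

```python
import socket


def load_cache(path):
    """Read 'ip hostname' pairs, one per line (hypothetical file layout)."""
    cache = {}
    try:
        with open(path) as handle:
            for line in handle:
                parts = line.split()
                if len(parts) == 2:
                    cache[parts[0]] = parts[1]
    except FileNotFoundError:
        pass  # first run: there is no cache file yet
    return cache


def save_cache(path, cache):
    """Write the pairs back out so the next run can reuse them."""
    with open(path, "w") as handle:
        for ip, host in sorted(cache.items()):
            handle.write(f"{ip} {host}\n")


def dns_lookup(ip):
    """Reverse-resolve one address; fall back to the IP itself on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def resolve(ip, cache, lookup=dns_lookup):
    """Consult the cache first and do a real lookup only on a miss."""
    if ip not in cache:
        cache[ip] = lookup(ip)
    return cache[ip]
```

In this sketch, addresses that fail to resolve are cached as themselves, so they are not looked up again on every later hit. Whether the real programs do the same is not stated here.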

The process of converting IP numbers to hostnames can be incredibly slow, whether it occurs in pwlog or in pwstat. In fact, it can be downright maddening if you have a popular site. Lookups for just a couple of days' worth of hits on my own pages can take over an hour. clay once reported that it took about 10 hours to resolve the new hostnames seen during a week of traffic to his site, and that was back in late 1995, when web traffic was a fraction of what it is now.

People with popular sites will also find that their .pwhosts file can get pretty large. Mine, for example, was up to 475 KB by the summer of 1995, after only a few months of traffic to my pages. If you have a popular set of pages, it won't be long before your copy of .pwhosts is into the megabytes. At that point, it's time to ask if you really need to know the names of all the machines visiting your site.

Given all this, you can see why you will save time (and waste less computing time and disk space) if you do not invoke the -r option in either pwlog or pwstat.