pwlog 2.1.1

The goals of pwlog include reducing the length and mystery of a typical getlogs output.

By default, pwlog abbreviates the directory path in filenames and drops the referral information. As of this writing, a "hit" as reported by getlogs includes "/htdocs/userdirs/userid" at the start of the filename for a file in personal webspace and "/htdocs/userid" for a file in corporate webspace. pwlog abbreviates these as "(u)" and "(c)", respectively; FTP transfers appear with an "(f)" prefix, as in the example below. Thus, using the same example log information as in the getlogs description, invoking pwlog without any of its options set would result in:


3819	WWW	182	1998:09:01:01:24:58	(u)/Skate	209.240.199.53	301	www1
3819	WWW	203	1998:09:01:03:25:47	(u)/Skate	204.244.93.232	301	www2
3819	FTP	9600	1998:09:01:04:16:05	(f)/pub/incoming/newfile.txt	166.84.197.198	200	ftp
3819	WWW	15130	1998:09:01:05:23:41	(c)/blur/index.cgi	24.112.48.33	200	web4
3819	WWW	43	1998:09:01:05:23:43	(c)/blur/gfx/spacer.GIF	24.112.48.33	200	web4
3819	WWW	43	1998:09:01:05:23:43	(c)/blur/gfx/spacer.GIF	24.112.48.33	200	web4
3819	WWW	191	1998:09:01:05:23:44	(c)/blur/banner.cgi	24.112.48.33	302	web4
3819	WWW	43	1998:09:01:05:23:44	(c)/blur/gfx/spacer.GIF	24.112.48.33	200	web4
3819	WWW	162	1998:09:01:06:04:56	(c)/skatecity/robots.txt	204.123.9.47	200	web4
3819	WWW	4301	1998:09:01:06:18:17	(c)/blur/article.cgi	193.13.129.79	200	web4
3819	WWW	6665	1998:09:01:07:11:49	(c)/skatecity/ah/	195.133.10.89	200	web4
3819	WWW	1343	1998:09:01:07:11:52	(c)/skatecity/ah/gfx/uchronia.sml.GIF	195.133.10.89	200	web4
3819	WWW	911	1998:09:01:07:11:54	(c)/skatecity/ah/gfx/intro.GIF	195.133.10.89	200	web4
3819	WWW	9723	1998:09:01:08:08:51	(c)/blur/resources/reviews.cgi	155.78.124.187	200	web4
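
The abbreviation rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not pwlog's actual code; the function name abbreviate_path and the userid "alice" are made up for the example.

```python
def abbreviate_path(path: str, userid: str) -> str:
    """Shorten a getlogs-style path the way the text above describes.

    Hypothetical helper: pwlog's real implementation is not shown in
    this document.
    """
    personal = f"/htdocs/userdirs/{userid}"
    corporate = f"/htdocs/{userid}"
    # Check the personal prefix first, then the corporate one.
    if path == personal or path.startswith(personal + "/"):
        return "(u)" + path[len(personal):]
    if path == corporate or path.startswith(corporate + "/"):
        return "(c)" + path[len(corporate):]
    # Anything else is left untouched.
    return path


print(abbreviate_path("/htdocs/userdirs/alice/Skate", "alice"))
print(abbreviate_path("/htdocs/alice/blur/index.cgi", "alice"))
```

Paths that match neither prefix are returned unchanged in this sketch.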

Running pwlog

If you have already run getlogs, the simplest way to create output like the above example is to type:

pwlog logfilename > newlogfilename

However, you can get pwlog to call getlogs for you. In fact, this happens automatically if you specify no input log file name. In other words, instead of typing

getlogs > logfilename
pwlog logfilename > newlogfilename

you can instead just type

pwlog > newlogfilename

Options

Most of the more useful features of pwlog are only available via its option switches. A complete list and a short help message can be obtained by typing

pwlog -h

Notable among these options are:

pwlog -A
Don't bother to do the filepath abbreviating described above.
pwlog -P
In the above example, the first column of the output is the same on every line. Additionally, the final column, which lists the name of the specific Panix machine which serviced the request, is generally of no use to anyone except Panix staffers. The -P option deletes both columns. In addition to reducing the file size of the log, this option makes it a bit more likely that a line will fit on an 80-character screen, making the output somewhat more readable. Using this option on the above example would result in:

WWW     182	1998:09:01:01:24:58	(u)/Skate	209.240.199.53	301
WWW     203	1998:09:01:03:25:47	(u)/Skate	204.244.93.232	301
FTP    9600	1998:09:01:04:16:05	(f)/pub/incoming/newfile.txt	166.84.197.198	200
WWW   15130	1998:09:01:05:23:41	(c)/blur/index.cgi	24.112.48.33	200
WWW      43	1998:09:01:05:23:43	(c)/blur/gfx/spacer.GIF	24.112.48.33	200
WWW      43	1998:09:01:05:23:43	(c)/blur/gfx/spacer.GIF	24.112.48.33	200
WWW     191	1998:09:01:05:23:44	(c)/blur/banner.cgi	24.112.48.33	302
WWW      43	1998:09:01:05:23:44	(c)/blur/gfx/spacer.GIF	24.112.48.33	200
WWW     162	1998:09:01:06:04:56	(c)/skatecity/robots.txt	204.123.9.47	200
WWW    4301	1998:09:01:06:18:17	(c)/blur/article.cgi	193.13.129.79	200
WWW    6665	1998:09:01:07:11:49	(c)/skatecity/ah/	195.133.10.89	200
WWW    1343	1998:09:01:07:11:52	(c)/skatecity/ah/gfx/uchronia.sml.GIF	195.133.10.89	200
WWW     911	1998:09:01:07:11:54	(c)/skatecity/ah/gfx/intro.GIF	195.133.10.89	200
WWW    9723	1998:09:01:08:08:51	(c)/blur/resources/reviews.cgi	155.78.124.187	200
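
As a rough illustration of what -P removes, the following Python sketch drops the first and last tab-separated fields from a line. It assumes the eight-column layout shown in the first example; the function name is hypothetical, and the space-padded alignment seen in the output above is not reproduced.

```python
def drop_constant_columns(line: str) -> str:
    """Remove the first and last tab-separated fields from one log line.

    Assumes the column layout of the default pwlog example; this is
    not pwlog's own code.
    """
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[1:-1])


sample = "3819\tWWW\t182\t1998:09:01:01:24:58\t(u)/Skate\t209.240.199.53\t301\twww1"
print(drop_constant_columns(sample))
```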

pwlog -r
One of the most mysterious things about getlogs output is the use of IP numbers to report the IDs of computers which have requested your webpages. You can convert these numbers to computer names, making it much easier to figure out where your visitors are coming from; just use the -r option with pwlog. However, be aware that (a) approximately 10-25% of IP numbers cannot be converted to hostnames, perhaps because the computers haven't been assigned "real names", and (b) the IP->name conversion takes time and if you have a busy site, it can take a really long time, perhaps on the order of hours, even days. Invoking this option and the -P option on the example log would result in output like the following:

WWW     182	1998:09:01:01:24:58	(u)/Skate	proxy-226.iap.bryant.webtv.net	301
WWW     203	1998:09:01:03:25:47	(u)/Skate	kam1d40.dial.uniserve.ca	301
FTP    9600	1998:09:01:04:16:05	(f)/pub/incoming/newfile.txt	rbs.dialup.access.net	200
WWW   15130	1998:09:01:05:23:41	(c)/blur/index.cgi	pc-403.on.rogers.wave.ca	200
WWW      43	1998:09:01:05:23:43	(c)/blur/gfx/spacer.GIF	pc-403.on.rogers.wave.ca	200
WWW      43	1998:09:01:05:23:43	(c)/blur/gfx/spacer.GIF	pc-403.on.rogers.wave.ca	200
WWW     191	1998:09:01:05:23:44	(c)/blur/banner.cgi	pc-403.on.rogers.wave.ca	302
WWW      43	1998:09:01:05:23:44	(c)/blur/gfx/spacer.GIF	pc-403.on.rogers.wave.ca	200
WWW     162	1998:09:01:06:04:56	(c)/skatecity/robots.txt	vscooter.av.pa-x.dec.com	200
WWW    4301	1998:09:01:06:18:17	(c)/blur/article.cgi	193.13.129.79	200
WWW    6665	1998:09:01:07:11:49	(c)/skatecity/ah/	89.10.133.195.dynamic.dialup.ru	200
WWW    1343	1998:09:01:07:11:52	(c)/skatecity/ah/gfx/uchronia.sml.GIF	89.10.133.195.dynamic.dialup.ru	200
WWW     911	1998:09:01:07:11:54	(c)/skatecity/ah/gfx/intro.GIF	89.10.133.195.dynamic.dialup.ru	200
WWW    9723	1998:09:01:08:08:51	(c)/blur/resources/reviews.cgi	155.78.124.187	200
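
The fallback visible in the example, where an address that cannot be resolved stays as an IP number, can be sketched in Python as follows. This uses the standard library call socket.gethostbyaddr; how pwlog itself performs the lookup is not documented here, and the function name resolve_host and its lookup parameter are assumptions made for the example.

```python
import socket


def resolve_host(ip: str, lookup=socket.gethostbyaddr) -> str:
    """Return the hostname for ip, or ip itself if the lookup fails.

    `lookup` is injectable so the function can be exercised without
    network access; by default it performs a real reverse DNS query.
    """
    try:
        # gethostbyaddr returns (hostname, aliases, addresses).
        return lookup(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return ip
```

Each real lookup is a network round trip, which is why a log with many distinct addresses can take so long to process.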

pwlog -g
"Smash" the filenames of graphics, reducing them to *.gif or *.jpeg as appropriate. For example, foo.gif and bar.GIF would both be converted to "*.gif". This is not necessarily useful when running pwlog, but comes in very handy when running pwstat.
pwlog -q list
Filter log entries by usage types, where "list" can be one or more of c, u, or f. If c, then we want corporate web hits included; if u, include personal web hits; and if f, include ftp transfers. Note: most Panix users do not have both corporate and personal web traffic, but corporate users may want to use this option to generate separate logs for their web and ftp traffic.
pwlog -m
Omit any request coming from a *.panix.com or *.access.net host.
pwlog -M
Include only requests coming from within the *.panix.com and *.access.net domains.
pwlog -b pattern
Include this log entry only if the request came from a machine whose name or IP number matches this pattern (a Perl regexp). Note: This test is made after the IP-to-hostname conversion is attempted, assuming that option is turned on.
pwlog -B pattern
Omit this log entry if the request came from a machine whose name or IP number matches this pattern (a Perl regexp). Note: If you specify any combination of the -b, -B, -m and -M options, only one of them will be evaluated. Preference is in the order just given (i.e., -b wins).
pwlog -d somedate
Report only entries occurring on or after the specified date. The format of the specified date must be YYYY:MM:DD; for example, to obtain a report limited to requests on or after August 15, 1997, you would replace somedate with 1997:08:15. Note: Remember that the only web requests which will be checked against the specified date are those from the log file(s) you've specified.
pwlog -D somedate
Similar to the -d option except that it reports only entries on or before the specified date.
pwlog -f pattern
Skip this log entry if the filename does not include this pattern (a Perl regexp).
pwlog -F pattern
Skip this log entry if the filename does include this pattern (a Perl regexp).
pwlog -l
Execute getlogs -o and use the result as input for pwlog. This option is ignored if you specify an input log file.
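
The date, filename, and graphics options above amount to simple per-entry tests. The Python sketch below shows one way such tests could work. It is not pwlog's code: the function names are hypothetical, Python's re module stands in for Perl regexps (the two dialects differ in places), and two details are assumptions, namely that -g keeps the directory part of the path and that .jpg counts as a JPEG.

```python
import re


def smash_graphic(path: str) -> str:
    """Replace a graphic's filename with *.gif or *.jpeg (cf. -g)."""
    directory, sep, name = path.rpartition("/")
    lowered = name.lower()
    if lowered.endswith(".gif"):
        name = "*.gif"
    elif lowered.endswith((".jpg", ".jpeg")):
        name = "*.jpeg"
    return directory + sep + name


def keep_entry(stamp, path, since=None, until=None, include=None, exclude=None):
    """Decide whether one log entry passes the date and filename tests.

    stamp is a getlogs timestamp such as 1998:09:01:05:23:41; since and
    until use the YYYY:MM:DD format (cf. -d and -D); include and
    exclude are regular expressions (cf. -f and -F).
    """
    day = stamp[:10]  # zero-padded, so plain string comparison works
    if since is not None and day < since:
        return False
    if until is not None and day > until:
        return False
    if include is not None and not re.search(include, path):
        return False
    if exclude is not None and re.search(exclude, path):
        return False
    return True


print(smash_graphic("(c)/blur/gfx/spacer.GIF"))
print(keep_entry("1998:09:01:05:23:41", "(c)/blur/index.cgi", since="1998:09:01"))
```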

Converting to Common Log Format

NOTE: This section is obsolete. If you need your logs in Common Log Format, use getclogs to obtain them.

It may be that you have obtained some handy-dandy third-party stats program which you'd like to use, but you can't because the output from getlogs and the above described output from pwlog aren't in "common log format", which most such programs require. If so, there are two additional pwlog options which you will find of use:

pwlog -k
Use this option to convert getlogs or pwlog output to common log format. (If you specify both the -P and -k options, -k wins.) The result should look something like the following:

209.240.199.53 - - [01/Sep/1998:01:24:58 -0500] "HEAD (u)/Skate HTTP/1.0" 301 182
204.244.93.232 - - [01/Sep/1998:03:25:47 -0500] "HEAD (u)/Skate HTTP/1.0" 301 203
166.84.197.198 - - [01/Sep/1998:04:16:05 -0500] "FTP (f)/pub/incoming/newfile.txt FTP/X.X" 200 9600
24.112.48.33 - - [01/Sep/1998:05:23:41 -0500] "GET (c)/blur/index.cgi HTTP/1.0" 200 15130
24.112.48.33 - - [01/Sep/1998:05:23:43 -0500] "GET (c)/blur/gfx/spacer.GIF HTTP/1.0" 200 43
24.112.48.33 - - [01/Sep/1998:05:23:43 -0500] "GET (c)/blur/gfx/spacer.GIF HTTP/1.0" 200 43
24.112.48.33 - - [01/Sep/1998:05:23:44 -0500] "HEAD (c)/blur/banner.cgi HTTP/1.0" 302 191
24.112.48.33 - - [01/Sep/1998:05:23:44 -0500] "GET (c)/blur/gfx/spacer.GIF HTTP/1.0" 200 43
204.123.9.47 - - [01/Sep/1998:06:04:56 -0500] "GET (c)/skatecity/robots.txt HTTP/1.0" 200 162
193.13.129.79 - - [01/Sep/1998:06:18:17 -0500] "GET (c)/blur/article.cgi HTTP/1.0" 200 4301
195.133.10.89 - - [01/Sep/1998:07:11:49 -0500] "GET (c)/skatecity/ah/ HTTP/1.0" 200 6665
195.133.10.89 - - [01/Sep/1998:07:11:52 -0500] "GET (c)/skatecity/ah/gfx/uchronia.sml.GIF HTTP/1.0" 200 1343
195.133.10.89 - - [01/Sep/1998:07:11:54 -0500] "GET (c)/skatecity/ah/gfx/intro.GIF HTTP/1.0" 200 911
155.78.124.187 - - [01/Sep/1998:08:08:51 -0500] "GET (c)/blur/resources/reviews.cgi HTTP/1.0" 200 9723

pwlog -K
Similar to the -k option, except that it attempts to convert the log file to "extended common log format". Basically, this means also including referral information. Extended common log format should also include the user agent, i.e., browser type, but that data is not available in the getlogs output to start with. The result should look something like the following:

209.240.199.53 - - [01/Sep/1998:01:24:58 -0500] "HEAD (u)/Skate HTTP/1.0" 301 182 "http://www.xs4all.nl:80/~lowlevel/skate/linx.html" "UNKNOWN"
204.244.93.232 - - [01/Sep/1998:03:25:47 -0500] "HEAD (u)/Skate HTTP/1.0" 301 203 "-" "UNKNOWN"
166.84.197.198 - - [01/Sep/1998:04:16:05 -0500] "FTP (f)/pub/incoming/newfile.txt FTP/X.X" 200 9600 "-" "UNKNOWN"
24.112.48.33 - - [01/Sep/1998:05:23:41 -0500] "GET (c)/blur/index.cgi HTTP/1.0" 200 15130 "http://www.yahoo.ca/Recreation/Sports/Skating/Inline_Skating/Magazines/" "UNKNOWN"
24.112.48.33 - - [01/Sep/1998:05:23:43 -0500] "GET (c)/blur/gfx/spacer.GIF HTTP/1.0" 200 43 "http://www.skating.com/" "UNKNOWN"
24.112.48.33 - - [01/Sep/1998:05:23:43 -0500] "GET (c)/blur/gfx/spacer.GIF HTTP/1.0" 200 43 "http://www.skating.com/" "UNKNOWN"
24.112.48.33 - - [01/Sep/1998:05:23:44 -0500] "HEAD (c)/blur/banner.cgi HTTP/1.0" 302 191 "http://www.skating.com/" "UNKNOWN"
24.112.48.33 - - [01/Sep/1998:05:23:44 -0500] "GET (c)/blur/gfx/spacer.GIF HTTP/1.0" 200 43 "http://www.skating.com/" "UNKNOWN"
204.123.9.47 - - [01/Sep/1998:06:04:56 -0500] "GET (c)/skatecity/robots.txt HTTP/1.0" 200 162 "-" "UNKNOWN"
193.13.129.79 - - [01/Sep/1998:06:18:17 -0500] "GET (c)/blur/article.cgi HTTP/1.0" 200 4301 "http://altavista.digital.com/cgi-bin/query?pg=q&kl=XX&q=%22Salomon+inline%22" "UNKNOWN"
195.133.10.89 - - [01/Sep/1998:07:11:49 -0500] "GET (c)/skatecity/ah/ HTTP/1.0" 200 6665 "http://www.yahoo.com/Arts/Humanities/Literature/Genres/" "UNKNOWN"
195.133.10.89 - - [01/Sep/1998:07:11:52 -0500] "GET (c)/skatecity/ah/gfx/uchronia.sml.GIF HTTP/1.0" 200 1343 "http://www.skatecity.com/ah/" "UNKNOWN"
195.133.10.89 - - [01/Sep/1998:07:11:54 -0500] "GET (c)/skatecity/ah/gfx/intro.GIF HTTP/1.0" 200 911 "http://www.skatecity.com/ah/" "UNKNOWN"
155.78.124.187 - - [01/Sep/1998:08:08:51 -0500] "GET (c)/blur/resources/reviews.cgi HTTP/1.0" 200 9723 "http://www.hotbot.com/?SW=web&SM=MC&MT=Rollerblade%2bReviews&DC=10&DE=2&RG=NA&_v=2" "UNKNOWN"

One warning about this conversion process: Besides lacking user agent information, getlogs also does not include the request method (GET, POST or HEAD), so pwlog makes an educated guess when converting to common log format. Basically, it assumes that all web requests are GETs unless the return code is in the 300s, in which case pwlog decides that it's a HEAD. It will not assign the POST method to any entry in the log, which is of course quite wrong if you have a lot of CGI scripts running. This should not be a problem when you are running stats, but we include the warning here just so that you know.
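
The guessing rule and the reshuffling of fields can be sketched in Python. This is an illustration only, not pwlog's code: the function names are hypothetical, the input is assumed to have the eight tab-separated columns of the default pwlog example, and the -0500 offset and the HTTP/1.0 and FTP/X.X protocol strings are simply copied from the sample output above.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def guess_method(service: str, status: int) -> str:
    """Guess the request method, since getlogs does not record it."""
    if service == "FTP":
        return "FTP"
    if 300 <= status <= 399:
        return "HEAD"
    return "GET"  # POST is never guessed, as the text warns


def to_common_log(line: str, tz: str = "-0500") -> str:
    """Convert one pwlog-style line to a common log format line."""
    _id, service, size, stamp, path, host, status, _machine = (
        line.rstrip("\n").split("\t"))
    year, month, day, hh, mm, ss = stamp.split(":")
    when = f"{day}/{MONTHS[int(month) - 1]}/{year}:{hh}:{mm}:{ss} {tz}"
    method = guess_method(service, int(status))
    protocol = "FTP/X.X" if service == "FTP" else "HTTP/1.0"
    return f'{host} - - [{when}] "{method} {path} {protocol}" {status} {size}'


sample = "3819\tWWW\t182\t1998:09:01:01:24:58\t(u)/Skate\t209.240.199.53\t301\twww1"
print(to_common_log(sample))
```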

Also, pwstat does not recognize Common Log format.

IP-to-Hostname Resolving and the Host Hash File

The -r option in the pwlog and pwstat programs determines the machine names corresponding to the IP numbers in the weblogs by doing a host lookup for each number. However, since most people who hit a good website hit it more than once, doing a lookup for every single entry in a log file would be needlessly repetitious. Thus, the pwlog and pwstat programs maintain a file of matching IP numbers and hostnames, and they check this file for a match before actually executing an IP lookup. At present, each user who runs pwlog or pwstat gets a separate copy of this file; there is no Panix-wide common file which all pwlog and pwstat users can access. The name of the host file is .pwhosts, and you will find your copy in your home (login) directory.
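
The check-the-file-first idea is ordinary caching. The Python sketch below keeps the cache in a dictionary and only calls the (slow) lookup on a miss. The names are hypothetical, and because the on-disk format of .pwhosts is not described in this document, saving and loading the cache is left out.

```python
from typing import Callable, Dict


def make_cached_resolver(lookup: Callable[[str], str],
                         cache: Dict[str, str]) -> Callable[[str], str]:
    """Wrap a slow IP-to-hostname lookup with a dictionary cache."""
    def resolve(ip: str) -> str:
        if ip not in cache:
            # Only addresses not seen before trigger a real lookup.
            cache[ip] = lookup(ip)
        return cache[ip]
    return resolve


# Usage with a stand-in lookup that records how often it is called.
calls = []


def fake_lookup(ip: str) -> str:
    calls.append(ip)
    return "host-" + ip


resolve = make_cached_resolver(fake_lookup, {})
resolve("192.0.2.7")
resolve("192.0.2.7")
print(len(calls))
```

Repeat visitors cost nothing after their first lookup, but the cache grows by one entry for every new address it sees.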

The process of converting IP numbers to hostnames can be incredibly slow, whether it occurs in pwlog or in pwstat. In fact, it can be downright maddening if you have a popular site. Lookups for just a couple of days' worth of hits on my own pages can take over an hour. clay once reported that it took about 10 hours to resolve the new hostnames seen during a week of traffic to his site, and that was back in late 1995, when web traffic was a fraction of what it is now.

Persons with popular sites will also find that their .pwhosts file can get pretty large. Mine, for example, was up to 475 KB by the summer of 1995, after only a few months of traffic to my pages. If you have a popular set of pages, it won't be long before your copy of .pwhosts grows into the megabytes. At that point, it's time to ask if you really need to know the names of all the machines visiting your site.

All this said, you may understand why your time is better spent (and less computing time and disk space wasted) if you do not invoke the -r option in either pwlog or pwstat.



Last Modified: Wednesday, 30-Jan-2013 12:14:10 EST
© Copyright 2006-2011 Public Access Networks Corporation