Understanding web logs: getlogs

getlogs, written at Panix, is a simple log-extracting utility for POLF (Panix Oldstyle Log Format) weblogs. The original author was Liz Reynolds. The basic function of the script is to print a listing of traffic on the user's webpages during the current calendar month.

The following is a short example of actual getlogs output. Notice that the output mixes together data pertaining to HTML and text pages, GIF graphics, CGI calls, and personal, corporate and FTP pages.


3819	WWW	182	1998:09:01:01:24:58	/export/httpd/htdocs/userdirs/rbs/Skate	209.240.199.53	301	http://www.xs4all.nl:80/~lowlevel/skate/linx.html	www1
3819	WWW	203	1998:09:01:03:25:47	/export/httpd/htdocs/userdirs/rbs/Skate	204.244.93.232	301	-	www2
3819	FTP	9600	1998:09:01:04:16:05	/pub/incoming/newfile.txt	166.84.197.198	200	-	ftp
3819	WWW	15130	1998:09:01:05:23:41	/export/httpd/htdocs/rbs/blur/index.cgi	24.112.48.33	200	http://www.yahoo.ca/Recreation/Sports/Skating/Inline_Skating/Magazines/	web4
3819	WWW	43	1998:09:01:05:23:43	/export/httpd/htdocs/rbs/blur/gfx/spacer.GIF	24.112.48.33	200	http://www.skating.com/	web4
3819	WWW	43	1998:09:01:05:23:43	/export/httpd/htdocs/rbs/blur/gfx/spacer.GIF	24.112.48.33	200	http://www.skating.com/	web4
3819	WWW	191	1998:09:01:05:23:44	/export/httpd/htdocs/rbs/blur/banner.cgi	24.112.48.33	302	http://www.skating.com/	web4
3819	WWW	43	1998:09:01:05:23:44	/export/httpd/htdocs/rbs/blur/gfx/spacer.GIF	24.112.48.33	200	http://www.skating.com/	web4
3819	WWW	162	1998:09:01:06:04:56	/export/httpd/htdocs/rbs/skatecity/robots.txt	204.123.9.47	200	-	web4
3819	WWW	4301	1998:09:01:06:18:17	/export/httpd/htdocs/rbs/blur/article.cgi	193.13.129.79	200	http://altavista.digital.com/cgi-bin/query?pg=q&kl=XX&q=%22Salomon+inline%22	web4
3819	WWW	6665	1998:09:01:07:11:49	/export/httpd/htdocs/rbs/skatecity/ah/index.html	195.133.10.89	200	http://www.yahoo.com/Arts/Humanities/Literature/Genres/	web4
3819	WWW	1343	1998:09:01:07:11:52	/export/httpd/htdocs/rbs/skatecity/ah/gfx/uchronia.sml.GIF	195.133.10.89	200	http://www.skatecity.com/ah/	web4
3819	WWW	911	1998:09:01:07:11:54	/export/httpd/htdocs/rbs/skatecity/ah/gfx/intro.GIF	195.133.10.89	200	http://www.skatecity.com/ah/	web4
3819	WWW	9723	1998:09:01:08:08:51	/export/httpd/htdocs/rbs/blur/resources/reviews.cgi	155.78.124.187	200	http://www.hotbot.com/?SW=web&SM=MC&MT=Rollerblade%2bReviews&DC=10&DE=2&RG=NA&_v=2	web4

What's being reported here?

ownerid:
The userid of the owner of the pages listed in the report, i.e., the person who just executed getlogs. In this example, the userid is 3819, which corresponds to the login "rbs".
usage:
Method of downloading. In the example above, note that an "FTP" entry has snuck into the list. This will happen if you have a corporate account and have arranged for anonymous ftp service.
bytes:
Number of bytes transferred. Note that if getlogs was run without the -a flag, the output will not reflect the bytes transferred by way of the Squids.
timestamp:
The time at which the access took place, in NYC local time.
filename:
Name of file downloaded. This is the path from the HTML document root directory to the location of the file. The entry containing "/htdocs/userdirs/rbs/Skate/" indicates a hit on the page http://www.panix.com/~rbs/Skate/. Similarly, "/htdocs/rbs/" indicates a hit on user rbs's corporate webspace.
host:
The ID of the machine which requested the webpage. This information tells you only the machine; you cannot find out, for example, the e-mail address of the person who made the request. You'll note that the machine ID is a sequence of numbers, referred to as an IP number. In most cases, the IP number can be translated into a hostname; you can use pwlog with the -r flag to have this translation performed on the output from getlogs. At some organizations especially concerned about security, the IP number (and the corresponding hostname) may refer to an intermediary gatekeeper computer rather than the actual computer which the requesting person is using. As an example, the IP number 199.181.175.201 would mean a hit by the machine gatekeeper.nytimes.com; since that is a firewall machine, the actual requester could be anywhere in the nytimes.com domain.
status:
A code which reflects the completion status of handling the file request, with 200 meaning no error - i.e. the document was served properly. Other codes which you may see are:
  • 301/302 Redirected Request: This most often happens when someone requests a directory index file, but hasn't completely specified the URL. The server sends back the correct URL and the browser then makes the correct request for that. In the example above, this happens when a directory index request is missing the trailing slash. Another possible cause is a CGI script which returns a redirection URL ("Location: http://www.foo.com\n\n") instead of HTML or other content.
  • 304 Not Modified Request: Browser asked whether the file had been changed since a previous request, and finding that it hadn't, did not download another copy.
  • 400 Bad Requests
  • 401 Unauthorized Requests
  • 403 Forbidden Requests
  • 404 Not Found Requests
  • 4xx Client Errors: There was an error serving this request, the error resulting from something on the client end. Client errors outside the 400-404 range are rare.
  • 5xx Server Errors: There was an error serving this request, the error resulting from something on the server end. The most likely cause of such an error is a buggy CGI script, but there are a number of other possible reasons.
referrer:
The URL of the webpage which pointed the browser to the given file on your site. Note that this datum is often not available, and sometimes may be incorrect. The latter case is most likely to arise when someone is viewing a page, and then manually types your URL in. This may result in the page they were looking at before being logged as the referring URL, even if it contains no links to the page on your site.
server:
The name of the actual Panix machine that served the request. In general, for personal websites this name will be www1 or www2. For corporate websites it should remain mostly constant, unless Panix staff has changed it to even the load on the webservers or for other technical reasons.
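
The nine fields described above can be pulled apart mechanically. The following is a minimal Python sketch, not part of getlogs itself. It assumes the fields are separated by tabs, as they appear to be in the sample output; the names LogEntry, parse_line and status_class are illustrative only.

```python
from collections import Counter
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    """One getlogs line; field names follow the descriptions above."""
    ownerid: str
    usage: str
    nbytes: int      # the "bytes" field
    timestamp: str
    filename: str
    host: str
    status: int
    referrer: str
    server: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one line into its nine fields (assumes tab separators)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        return None  # e.g. a blank line or the trailing total line
    try:
        return LogEntry(parts[0], parts[1], int(parts[2]), parts[3],
                        parts[4], parts[5], int(parts[6]), parts[7], parts[8])
    except ValueError:
        return None  # bytes or status was not a number


def status_class(status: int) -> str:
    """Group a status code the same way the list above does."""
    if status == 200:
        return "ok"
    if status in (301, 302):
        return "redirect"
    if status == 304:
        return "not modified"
    if 400 <= status <= 499:
        return "client error"
    if 500 <= status <= 599:
        return "server error"
    return "other"


# Two lines taken from the sample output above.
sample = (
    "3819\tFTP\t9600\t1998:09:01:04:16:05\t/pub/incoming/newfile.txt"
    "\t166.84.197.198\t200\t-\tftp\n"
    "3819\tWWW\t191\t1998:09:01:05:23:44"
    "\t/export/httpd/htdocs/rbs/blur/banner.cgi"
    "\t24.112.48.33\t302\thttp://www.skating.com/\tweb4\n"
)

entries = [e for e in map(parse_line, sample.splitlines()) if e is not None]
tally = Counter(status_class(e.status) for e in entries)
print(tally)
```

Lines that do not split into exactly nine fields are skipped rather than treated as errors, so a saved report can be fed through unchanged.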

The last line of any getlogs output is the total number of bytes contained in the weblog output you received. This is useful if you later want to invoke getlogs with the -s flag, so that it continues where the previous run left off.
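
A saved report can be read back to recover that trailing total. The Python sketch below rests on two assumptions which the text above does not confirm: that the total appears alone on the last non-empty line as a plain decimal number, and that it can be passed unchanged as the N of -s. The function name trailing_total is hypothetical.

```python
from typing import Optional


def trailing_total(report: str) -> Optional[int]:
    """Return the number on the last non-empty line, or None if absent."""
    lines = [ln.strip() for ln in report.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1]
    if not (last.isascii() and last.isdigit()):
        return None  # last line is a log entry, not a bare total
    return int(last)


# One line from the sample output, followed by a made-up total
# (118 is for illustration only, not a real getlogs figure).
report = (
    "3819\tWWW\t162\t1998:09:01:06:04:56"
    "\t/export/httpd/htdocs/rbs/skatecity/robots.txt"
    "\t204.123.9.47\t200\t-\tweb4\n"
    "118\n"
)

total = trailing_total(report)
if total is not None:
    # Command to run next time, under the assumptions stated above.
    print(f"getlogs -s {total}")
```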

getlogs output is sent to "standard output", i.e., your terminal screen. To request that it be sent to a file, you need to use the redirection operator, ">"; e.g.,

getlogs > logfilename

getlogs, without any command-line parameters, returns data only for the current month - i.e. all the transfers that occurred between just after midnight of the first day of the current month and the time when the most recent hourly weblog-processing occurred. (It does not return just a report of all traffic since the last time you ran getlogs.) The -o flag causes getlogs to return data for the previous month.

Options

The following options are available with getlogs:

getlogs -o
Retrieve the logs for the preceding reporting period; i.e., last month. Very useful at the beginning of the month when the logs have just been reset but you need to get a report including the last day or two of the preceding month.
getlogs -c
Retrieve the logs only from the web accelerators (Squids), rather than the "main" web servers.
getlogs -a
Retrieve logs from the web accelerators (Squids) and the web servers. This option is useful for obtaining the count of bytes transferred and verifying it against the billing records, but will not give an accurate hit count.
getlogs -s N
Omits the first N characters of the weblogs. Useful if you only want to look at data which have accumulated since the last time you ran getlogs.
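
When checking a byte count of the kind the -a option is meant for, the bytes column of a saved report can be added up. This is a minimal Python sketch, assuming tab-separated fields with the byte count in the third of nine columns; sum_bytes is a hypothetical helper, not a getlogs feature.

```python
def sum_bytes(lines):
    """Add up the third (bytes) field of every nine-field line."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip anything that is not a log entry, such as the trailing total.
        if len(fields) != 9 or not fields[2].isdigit():
            continue
        total += int(fields[2])
    return total


# Two lines taken from the sample output earlier in this document.
sample_lines = [
    "3819\tWWW\t43\t1998:09:01:05:23:43"
    "\t/export/httpd/htdocs/rbs/blur/gfx/spacer.GIF"
    "\t24.112.48.33\t200\thttp://www.skating.com/\tweb4\n",
    "3819\tWWW\t191\t1998:09:01:05:23:44"
    "\t/export/httpd/htdocs/rbs/blur/banner.cgi"
    "\t24.112.48.33\t302\thttp://www.skating.com/\tweb4\n",
]

print(sum_bytes(sample_lines))
```

Since an open file yields lines when iterated, the same function can be given a file object for a report saved with the ">" redirection.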

Deficiencies:

Note that getlogs returns only the IP numbers (not hostnames) of the requesters' machines. You can use pwlog with the -r flag to attempt to resolve those IP numbers into hostnames.
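
For illustration only, the same kind of translation can be sketched in Python with the standard socket.gethostbyaddr call. This is not a description of how pwlog works, which this page does not cover. The helper resolve_host and its lookup parameter are hypothetical; addresses that fail to resolve are returned unchanged.

```python
import socket


def resolve_host(ip, lookup=socket.gethostbyaddr):
    """Return the hostname for ip, or ip itself if the lookup fails."""
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        return lookup(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return ip


def stub_lookup(ip):
    """Stand-in resolver so the example runs without network access.

    It knows only the address/hostname pair mentioned in the host
    field description above.
    """
    known = {"199.181.175.201": "gatekeeper.nytimes.com"}
    if ip not in known:
        raise socket.herror(1, "Unknown host")
    return (known[ip], [], [ip])


print(resolve_host("199.181.175.201", lookup=stub_lookup))
print(resolve_host("204.123.9.47", lookup=stub_lookup))
```

Calling resolve_host with only an IP number uses the real resolver. Results then depend on the network and on whether a reverse DNS record exists for that address.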


Last Modified: Wednesday, 30-Jan-2013 12:14:11 EST
© Copyright 2006-2011 Public Access Networks Corporation