Understanding web logs: getclogs

getclogs is to XCLF (Extended Common Log Format) weblogs what getlogs is to POLF (Panix Oldstyle Log Format) weblogs. The basic function of the script is to print a listing of traffic on the user's webpages during the current calendar month.

The following is a short example of some actual getclogs output. As in the case of getlogs, the output will mix together data pertaining to HTML and text pages, GIF graphics, CGI calls, and personal, corporate and FTP pages.


68.2.201.90 - - [01/Aug/2004:00:29:43 -0400] "GET http://www.speedskating.com/wl/show.rp?id=inline_clubs/united_states HTTP/1.1" 200 17617 "http://www.google.com/search?hl=en&ie=UTF-8&q=speed+skating+in+phoenix&btnG=Google+Search" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707"
68.2.201.90 - - [01/Aug/2004:00:29:43 -0400] "GET http://www.speedskating.com/css/doz.css HTTP/1.1" 200 3553 "http://www.speedskating.com/wl/show.rp?id=inline_clubs/united_states" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707"
68.2.201.90 - - [01/Aug/2004:00:29:43 -0400] "GET http://www.speedskating.com/gfx/logo/ssk030130a.gif HTTP/1.1" 200 830 "http://www.speedskating.com/wl/show.rp?id=inline_clubs/united_states" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.1) Gecko/20040707"
219.18.208.79 - guest [01/Aug/2004:00:29:46 -0400] "GET http://www.skatecity.com/ HTTP/1.1" 200 1148 "http://www.google.com/search?q=SKATE&btnG=Google+%E6%A4%9C%E7%B4%A2&hl=ja&ie=UTF-8&c2coff=1" "Opera/7.23 (Windows 98; U)  [ja]"
219.18.208.79 - guest [01/Aug/2004:00:29:46 -0400] "GET http://www.skatecity.com/gfx/rhs_infobahn.gif HTTP/1.1" 200 739 "http://www.skatecity.com/" "Opera/7.23 (Windows 98; U)  [ja]"
219.18.208.79 - guest [01/Aug/2004:00:29:46 -0400] "GET http://www.skatecity.com/gfx/rhs_speedskating.gif HTTP/1.1" 200 666 "http://www.skatecity.com/" "Opera/7.23 (Windows 98; U)  [ja]"

What's being reported here?

host:
The ID of the machine which requested the webpage. This information identifies only the machine; you cannot find out, for example, the e-mail address of the person who made the request. You'll note that the machine ID is a sequence of numbers, referred to as an IP number. In most cases, the IP number can be translated into a hostname; you can use pwlog3 with the -r flag to have this translation performed on the output from getclogs. At some organizations especially concerned about security, the IP number (and the corresponding hostname) may refer to an intermediary gatekeeper computer rather than the actual computer which the requesting person is using. As an example, the IP number 199.181.175.201 would mean a hit by the machine gatekeeper.nytimes.com; since that is a firewall machine, the actual requester could be anywhere in the nytimes.com domain.
client user name:
The name of the user on the client computer (defined by RFC 1413), assuming the Web server is extracting that information. At Panix, the server does not provide this information, and a dash (-) appears in its place.
authenticated user name:
If the client authenticated to get the page (as is the case if the page is .htpasswd protected), the username appears in the log (but not the password). Otherwise, a dash (-) appears in its place.
timestamp:
The time at which the access took place, in NYC local time.
file URL:
URL of the file which was accessed.
status:
A code which reflects the completion status of handling the file request, with 200 meaning no error - i.e. the document was served properly. Other codes which you may see are:
  • 301/302 Redirected Request: This most often happens when someone requests a directory index file, but hasn't completely specified the URL. The server sends back the correct URL and the browser then makes the correct request for that. For example, this happens when a directory index request is missing the trailing slash. Another possible cause is a CGI script which returns a redirection URL ("Location: http://www.foo.com\n\n") instead of HTML or other content.
  • 304 Not Modified Request: Browser asked whether the file had been changed since a previous request, and finding that it hadn't, did not download another copy.
  • 400 Bad Requests
  • 401 Unauthorized Requests
  • 403 Forbidden Requests
  • 404 Not Found Requests
  • 4xx Client Errors: There was an error serving this request, the error resulting from something on the client end. Client errors outside the 400-404 range are rare.
  • 5xx Server Errors: There was an error serving this request, the error resulting from something on the server end. The most likely cause of such an error is a buggy CGI script, but there are a number of other possible reasons.
bytes:
Number of bytes transferred. Note that if getclogs was run without the -a flag, the output will not reflect the bytes transferred by way of the web accelerators (Squids).
referrer:
The URL of the webpage which pointed the browser to the given file on your site. Note that this datum is often not available, and sometimes may be incorrect. The latter case is most likely to arise when someone is viewing a page, and then manually types your URL in. This may result in the page they were looking at before being logged as the referring URL, even if it contains no links to the page on your site.
user-agent:
The name and version of the browser or other client software making the request.
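The fields described above can be pulled out of a log line with a regular expression. What follows is a minimal Python sketch, not part of getclogs itself; the pattern is inferred from the sample output shown earlier, and the names LOG_PATTERN, parse_line and status_class are illustrative assumptions.

```python
import re

# Assumed layout, inferred from the sample output on this page:
# host, client user, authenticated user, [timestamp],
# "METHOD URL PROTOCOL", status, bytes, "referrer", "user-agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<client_user>\S+) (?P<auth_user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Treating a dash in the bytes column as zero is an assumption.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


def status_class(status):
    """Group a status code into the broad categories listed above."""
    if 200 <= status < 300:
        return "success"
    if status in (301, 302):
        return "redirect"
    if status == 304:
        return "not modified"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "other"
```

Lines that do not fit the assumed layout (such as the byte total on the last line of the output) make parse_line return None, so a caller can simply skip them.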

The last line of any getclogs output is the total number of bytes contained in the weblog output you received. This is useful if you later want to invoke getclogs with the -s flag, so that the new run continues where the previous run left off.
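One way to use that byte total is to read it back from a saved run and pass it to the next invocation. The sketch below is a hypothetical Python helper; it assumes (this page does not spell out the exact format) that the last non-empty line of the saved output is a plain decimal number. The names read_offset and next_command are illustrative.

```python
def read_offset(path):
    """Return the byte total found on the last non-empty line of a saved run."""
    last = ""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                last = line.strip()
    # Assumption: the total is a bare number with nothing else on the line.
    if not last.isdecimal():
        raise ValueError("last line is not a plain byte count: %r" % last)
    return int(last)


def next_command(path):
    """Build the argument list for a run that skips what was already fetched."""
    return ["getclogs", "-s", str(read_offset(path))]
```

The resulting list could be passed to subprocess.run on a machine where getclogs is installed. If the total turns out to be formatted differently, read_offset raises ValueError rather than guessing.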

getclogs output is sent to "standard output", i.e., your terminal screen. To request that it be sent to a file, you need to use the redirection operator, ">"; e.g.,

getclogs > logfilename

getclogs, without any command-line parameters, returns data only for the current month - i.e. all the transfers that occurred between just after midnight of the first day of the current month and the time when the most recent hourly weblog-processing occurred. (It does not return just a report of all traffic since the last time you ran getclogs.) The -o flag causes getclogs to return data for the previous month.

Options

The following options are available with getclogs:

getclogs -o
Retrieve the logs for the preceding reporting period; i.e., last month. Very useful at the beginning of the month when the logs have just been reset but you need to get a report including the last day or two of the preceding month.
getclogs -c
Retrieve the logs only from the web accelerators (Squids), rather than the "main" web servers.
getclogs -a
Retrieve logs from the web accelerators (Squids) and the web servers. This option is useful for obtaining the count of bytes transferred and verifying it against the billing records, but will not give an accurate hit count.
getclogs -s N
Omits the first N characters of the weblogs. Useful if you only want to look at data which have accumulated since the last time you ran getclogs.
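For the byte-count check mentioned under -a, the bytes column of saved output can be summed directly. This is a rough Python sketch, assuming the log layout shown in the sample output on this page; STATUS_BYTES and total_bytes are illustrative names, and nothing here reflects how billing is actually computed.

```python
import re

# Assumption: the status code and byte count sit between the quoted
# request string and the quoted referrer, as in the sample output.
STATUS_BYTES = re.compile(r'" (\d{3}) (\d+) "')


def total_bytes(lines):
    """Sum the bytes column over every line that matches the assumed layout."""
    total = 0
    for line in lines:
        match = STATUS_BYTES.search(line)
        if match:
            total += int(match.group(2))
    return total
```

Lines without a status and byte count, such as the total on the final line of the output, are skipped rather than counted.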

Deficiencies

Note that getclogs only returns IP numbers (not hostnames) of the requestors' machines. You can use pwlog3 with the -r flag to attempt to resolve those IP numbers into hostnames.
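To illustrate what resolving an IP number involves, here is a small Python sketch using the standard library's reverse lookup. It is not how pwlog3 is implemented; the names hostname_for and resolve_line are assumptions, and the lookup parameter exists only so the function can be exercised without network access.

```python
import socket

# Cache of results so each IP number is looked up at most once.
_cache = {}


def hostname_for(ip, lookup=socket.gethostbyaddr):
    """Return the hostname for an IP number, or the IP itself if lookup fails."""
    if ip not in _cache:
        try:
            # gethostbyaddr returns (hostname, aliases, addresses).
            _cache[ip] = lookup(ip)[0]
        except OSError:
            # No reverse DNS entry, or the resolver is unreachable.
            _cache[ip] = ip
    return _cache[ip]


def resolve_line(line, lookup=socket.gethostbyaddr):
    """Replace the leading host field of a log line with its hostname."""
    host, sep, rest = line.partition(" ")
    return hostname_for(host, lookup) + sep + rest
```

Reverse lookups can be slow and frequently fail, which is why the sketch caches results and falls back to the original IP number.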



Last Modified: Wednesday, 30-Jan-2013 12:14:07 EST
© Copyright 2006-2011 Public Access Networks Corporation