What is the general format of the weblogs available at Panix?
The Panix webserver logs are available in two different formats: the Panix Oldstyle Log Format (POLF) and the Extended Common Log Format (XCLF). In each case, the raw information pertaining to traffic on each webserver is processed once an hour by a program that reads the data from each webserver in turn, separates the information by website-owner, and appends the relevant data to each website-owner's logfile. As a result, the data in your weblog will be almost, but not completely, chronological. The entries will be grouped in chunks that follow each other chronologically, each chunk representing an hour's worth of data. Within each chunk, all the data (for that hour) from one webserver will appear first, then all the data from another, and so on. Thus, if your webpages were accessed from two different webservers in the span of one hour, all the accesses from one of the webservers, S1, will precede in the given chunk all the accesses from the other webserver, S2, even though some of the S1 accesses may have happened later within the hour than some of the S2 accesses.
Most log-processing programs do not require that the logs be in strict chronological order. If, for some reason, you do need your log to be strictly chronological, you can try the weblogsort program (consult its man page for how to use it).
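To illustrate what strict chronological ordering involves, here is a minimal Python sketch that reorders log lines by timestamp. It is not how weblogsort works. It assumes each entry carries a Common Log Format timestamp in square brackets (for example [10/Oct/2000:13:55:36 -0400]), which may not match the exact POLF or XCLF layout. The names entry_time and sort_chronologically are made up for this example.

```python
import re
from datetime import datetime

# Assumption: each entry has a Common Log Format timestamp in square
# brackets, e.g. [10/Oct/2000:13:55:36 -0400]. Adjust for your log format.
TIMESTAMP = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]"
)


def entry_time(line):
    """Return the timezone-aware timestamp of a log line, or None."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    # %b parses English month abbreviations in the default (C) locale.
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def sort_chronologically(lines):
    """Return the parseable lines ordered by timestamp.

    Lines without a recognizable timestamp are dropped. The sort is
    stable, so entries with equal timestamps keep their original order.
    """
    dated = [(entry_time(line), line) for line in lines]
    dated = [(when, line) for when, line in dated if when is not None]
    dated.sort(key=lambda pair: pair[0])
    return [line for _, line in dated]
```

Because the hourly chunks are already in order, only the entries within each chunk actually move when the log is sorted this way.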
How do I find out who's been downloading my webpages?
There are a few generally available utilities you can use for obtaining information from the Panix webserver logs, including which of your pages were accessed and what machine(s) requested them:
What are Squids, and how do they relate to the data in my weblogs?
Panix uses a number of webservers, known as Squids, which serve as web accelerators. The squids cache web pages as they are served, and if a later request is made for a cached page and the actual page has not changed, the page is served from the cache rather than from the "main" web server on which the original page resides. When this happens, duplicate log entries are created for the given request: one for the "main" web server, recording the request, and the other for the squid which actually served the page. The log entry for the "main" web server will not show any bytes transferred.
Both getlogs and getclogs, when invoked without any command-line parameters, will output only those weblog entries which pertain to the "main" web servers. This means that the actual number of "hits" listed in the output will be accurate, and so will the number of requests for each specific file. However, the number of bytes transferred may be highly inaccurate, since the output will not contain any entries pertaining to actual transfers from the squids. The requests that were actually serviced by the squids will look as if no bytes, or very few bytes, were transferred.
If you need to get, from your weblogs, an accurate count of bytes transferred (the byte-count that is reflected in your bill), you can do so by using the -a switch to getlogs or getclogs. Be aware, however, that this output will show an inflated number of hits and file-accesses, since there will be two log entries for each of the requests that were serviced by the squids.
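As a rough illustration of the two kinds of totals, the Python sketch below counts entries and sums byte counts from log lines read as text. It assumes the lines follow the Common Log Format layout (quoted request, then status code, then byte count); the actual POLF and XCLF field layouts are not described here, so the pattern may need adjusting. The function name summarize is made up for this example.

```python
import re

# Assumption: the quoted request is followed by a three-digit status code
# and a byte count, with "-" standing for "no bytes", as in the Common
# Log Format. Adjust the pattern if your log lines differ.
REQUEST = re.compile(r'"(?:[^"\\]|\\.)*" (\d{3}) (\d+|-)')


def summarize(lines):
    """Return (hits, total_bytes) for the lines that match the pattern."""
    hits = 0
    total_bytes = 0
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        hits += 1
        if match.group(2) != "-":
            total_bytes += int(match.group(2))
    return hits, total_bytes
```

Following the explanation above, the hit count would be taken from the default output, and the byte total from the output produced with the -a switch.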
How do I find out which of my webpages are getting traffic?
Once you have found out which of your pages were hit, you can use one of the two Panix utilities currently available for processing log information and generating a meaningful statistics file:
If you would instead prefer to use a third-party webstats program (e.g., WebTrends, FunnelWeb, Analog, etc.), the input format you will want is almost certainly XCLF, which you can obtain either by using getclogs to produce the logs, or by using pwlog to convert from POLF to XCLF.
How do I get user email addresses from the logs?
In short, you don't. At best, the logs can only tell you which machine requested your page(s), or which intermediary "gateway" machine relayed the request. By making use of CGI scripts and Server Side Includes, you can attempt to obtain an e-mail address from an environment variable, but most browsers do not pass this information due to privacy considerations.
Thus, you probably need to create a feedback form for users to fill out and a CGI script to process it. Once that's done, you then have to rely on them to a) bother to fill it out and b) not lie when they do so.
How do I find out where my Web traffic is coming from?
The Panix weblogs, in either format, include "HTTP_REFERER" info, which indicates, for each access to one of your webpages, the webpage (yours or somebody else's) that "sent" the visitor to your webpage. An analysis of this data is included in both pwstat programs.
An alternative approach is to track referrals to specific pages in your website by using Server Side Includes. In short, in whichever pages you want to track referral information, place an SSI that execs a script either by cgi or cmd. For example,
<!--#exec cgi="/pcgi-bin/reflogger.cgi"-->
All your script has to do is record the DOCUMENT_URI and HTTP_REFERER environment variables to a file. You'll probably want to write another script which reduces the content in your referrer log file to something more readable, but that's up to you.
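A minimal sketch of such a script, written in Python, might look like the following. It is only one possible approach: the log location (LOG_PATH) and the tab-separated record layout are assumptions made for illustration, and the script relies on the server passing DOCUMENT_URI and HTTP_REFERER to the included CGI as described above.

```python
#!/usr/bin/env python3
import os
import sys

# Hypothetical location; choose a file the web server is allowed to append to.
LOG_PATH = "/tmp/referrers.log"


def format_record(environ):
    """Build one tab-separated line: document URI, then referring page."""
    fields = []
    for name in ("DOCUMENT_URI", "HTTP_REFERER"):
        value = environ.get(name, "-")
        # Replace tabs and line breaks so a crafted value cannot forge records.
        for char in ("\t", "\r", "\n"):
            value = value.replace(char, " ")
        fields.append(value)
    return "\t".join(fields) + "\n"


def main():
    try:
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(format_record(os.environ))
    except OSError:
        pass  # a logging failure should not break the page being served
    # A CGI program must emit a header block; the body is left empty so
    # nothing visible is added to the page that includes the script.
    sys.stdout.write("Content-Type: text/html\r\n\r\n")


if __name__ == "__main__":
    main()
```

Each access then appends one line to the log, with "-" standing in for a variable the server did not set. A second script can later tally these lines per page or per referring site.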