panix.user.html FAQ
Logs and Analysis
The purpose of pwstat is to produce a meaningful statistics
report from the contents of a POLF weblog,
as presented by either getlogs or pwlog.
(Note: For XCLF weblogs, use pwstat3.)
By default, the report is generated in HTML format, for web display.
It includes the following information:
- Total traffic (in files and bytes delivered) from your site during
the reporting period.
- Analysis of traffic by date.
- Analysis of traffic by hour of the day.
- Analysis of traffic by archive name; i.e., filename. See the pwlog
discussion for an explanation of the "(u)" and "(c)" sequences
which precede the filenames.
- Analysis of traffic by archive type; i.e., the filename extension.
The archive type breakdown can be difficult to interpret if, for
example, you make significant use of CGI scripts to deliver
your site's content, because all such requests are listed under "CGI".
- Analysis of traffic by requesting top-level domain;
e.g., gov, com, uk, etc.
(Note: See the -r option.)
- Analysis of traffic by requesting continent, as determined by
examination of the requesting domain.
(Note: See the -r option.)
The non-country top-level domains are handled as follows:
com is treated as international and reported as "Commercial";
edu, although possibly international, is treated as US and
reported as "North America";
gov is reported as "North America";
mil is reported as "North America";
net is treated as international and reported as "Network";
and
org is treated as international and reported as "Noncommercial
Organization".
- Analysis of traffic by requesting reversed sub-domain; e.g.,
requests from America On-Line users are reported as com.aol.*.
(Note: See the -r option.)
- List of URLs most frequently referring visitors to your site.
- List of domains most frequently referring visitors to your site.
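The non-country top-level domain rules listed above amount to a small lookup table. The following Python sketch is only an illustration of those rules, not pwstat's own code; the names NON_COUNTRY_TLDS and region_for are hypothetical, and country-code domains (whose continent assignments are not given here) simply return None.

```python
# Labels taken from the list of non-country top-level domains above.
NON_COUNTRY_TLDS = {
    "com": "Commercial",
    "edu": "North America",
    "gov": "North America",
    "mil": "North America",
    "net": "Network",
    "org": "Noncommercial Organization",
}

def region_for(hostname):
    """Return the report label for a non-country TLD, or None otherwise."""
    # The top-level domain is whatever follows the last dot.
    tld = hostname.lower().rstrip(".").rsplit(".", 1)[-1]
    return NON_COUNTRY_TLDS.get(tld)

print(region_for("gatekeeper.nytimes.com"))
```

A host that has not been resolved to a name has no top-level domain to inspect, which is why these report sections depend on the -r option.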
The information in these sections is fairly self-explanatory, but
it is perhaps worth stating here that in all but the
total-traffic section, the information is separated into several
columns, including:
- Requests: The number of files delivered in serving the requests
described on this line.
- %Reqs: The requests on this line as a percentage of
all the file requests made during the reporting period.
- Bytes Sent: The number of bytes delivered in serving the requests
described on this line.
- %Byte: The bytes on this line as a percentage of the total
bytes delivered during the reporting period.
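To make the relationship between these columns concrete, here is a minimal Python sketch of how the two percentage columns follow from the raw counts. It is not pwstat's code; the function name with_percentages and the sample figures are invented for illustration.

```python
def with_percentages(rows):
    """Given (label, requests, bytes_sent) rows, add %Reqs and %Byte.

    Returns a list of (label, requests, pct_reqs, bytes_sent, pct_bytes).
    """
    total_reqs = sum(reqs for _, reqs, _ in rows)
    total_bytes = sum(nbytes for _, _, nbytes in rows)
    out = []
    for label, reqs, nbytes in rows:
        # Guard against an empty reporting period (totals of zero).
        pct_reqs = 100.0 * reqs / total_reqs if total_reqs else 0.0
        pct_bytes = 100.0 * nbytes / total_bytes if total_bytes else 0.0
        out.append((label, reqs, pct_reqs, nbytes, pct_bytes))
    return out

# Hypothetical tallies by archive type, for illustration only.
sample = [("html", 120, 480000), ("gif", 60, 240000), ("CGI", 20, 80000)]
for label, reqs, pct_reqs, nbytes, pct_bytes in with_percentages(sample):
    print(f"{label:<6}{reqs:>9}{pct_reqs:>8.2f}{nbytes:>12}{pct_bytes:>8.2f}")
```

Within any one section, the %Reqs column and the %Byte column should each add up to roughly 100, apart from rounding.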
Note on host identification: In the raw weblogs, IP
numbers rather than hostnames are used to identify the machines
requesting your webpages. When you use pwstat with the -r
option, pwstat attempts to resolve those numbers to hostnames, and
saves the resolved IP/hostname pairs in the .pwhosts file
in your directory. During every subsequent run of pwstat, even
without the -r flag, those hosts whose IPs already appear in the
.pwhosts file will, by default, be identified by hostname rather
than by IP. You can use the -R flag to turn this off and to get
all hosts listed by IP number, as if the .pwhosts information
didn't exist.
Running pwstat
If you have already run getlogs or pwlog,
the procedure for obtaining webstats is to simply type:
pwstat logfilename > statfilename
If you do not specify an input file name, pwstat
will automatically call getlogs for you.
In other words, instead of typing
getlogs > logfilename
pwstat logfilename > statfilename
you can just type
pwstat > statfilename
You can also create a statistical extract from more than one input
log file by typing
pwstat logfile1name logfile2name > statfilename
Options
The complete list of pwstat's options is included in its
help message, which you can obtain by typing
pwstat -h
Notable among these options are:
- pwstat -b pattern
- Include only requests from machines whose names include
this pattern (a Perl regexp).
Note: If the -r option is used, this test is made after
the IP-to-hostname conversion is attempted.
- pwstat -B pattern
- Omit requests from machines whose names include
this pattern (a Perl regexp).
Note: If you specify any combination of the -b, -B, -m and -M options,
only one of them will be evaluated. Preference is in the order just
given (i.e., -b always wins).
- pwstat -d somedate
- Omit requests before the specified date.
The format of the date must be YYYY:MM:DD; for example,
to obtain a report limited to requests on or after August 15, 1995,
you would replace somedate with 1995:08:15.
Note: Remember that the only requests which will be checked against
the specified date are those from the log file(s) you've specified.
- pwstat -D somedate
- Omit requests after the specified date.
- pwstat -f pattern
- Include only requests for filenames which include this pattern
(a Perl regexp).
- pwstat -F pattern
- Omit requests for filenames which include this pattern
(a Perl regexp).
- pwstat -g
- "Smash" the filenames of graphics, reducing any filename with
extension bmp, gif, jpg, jpeg or png to (gfx).
This is handy if you have directories full of GIFs and JPEGs that
you don't want to see listed individually in your stats.
- pwstat -j N
- In the list of URLs which most frequently referred visitors to your
site, include only the N most frequent URLs. If this option is
not specified, then the default is 25. If you do not want this
section included in your pwstat report, then specify
pwstat -j 0.
- pwstat -J N
- In the list of domains which most frequently referred visitors to your
site, include only the N most frequent domains. If this option is
not specified, then the default is 25. If you do not want this
section included in your pwstat report, then specify
pwstat -J 0.
- pwstat -k pattern
- In the list of URLs which most frequently referred visitors to your
site, exclude URLs which match this pattern (a Perl regexp).
This option is most useful when you want to exclude referrals from
within your own domain. For example, if your domain were
www.skatecity.com, then you could exclude self-referrals by specifying
pwstat -k 'www\.skatecity\.com'.
- pwstat -K pattern
- In the list of domains which most frequently referred visitors to your
site, exclude domains which match this pattern (a Perl regexp).
- pwstat -l
- Execute getlogs -o
and use the result as input for pwstat. This results in
pwstat output based on the previous getlogs reporting period.
This option is ignored if you specify an input log filename.
- pwstat -m
- Omit any request coming from any *.panix.com or *.access.net host.
- pwstat -M
- Omit any request coming from outside the *.panix.com and *.access.net
domains.
- pwstat -o
- In the reversed sub-domain section of the report, the last
portion of a computer name is normally lopped off; e.g.,
gatekeeper.nytimes.com would just be reported as com.nytimes.*,
as would all requests from everyone else in the nytimes.com
domain. To force hostnames to be completely reported, invoke
the -o option.
- pwstat -q list
- Filter log entries by usage type, where "list" can
be one or more of c, u, or f.
Specify c to include corporate web hits, u to include
personal web hits, and f to include ftp transfers.
Note: Most Panix users do not have both corporate and personal
web traffic, but corporate users may want to use this option
to generate separate reports for their web and ftp traffic.
- pwstat -r
- Turns on IP-to-hostname resolving. In the raw weblogs, the machines
requesting your webpages are normally identified by IP number, and to turn
that number into a computer name, a host lookup must be performed.
Every time you use the -r option to pwstat, the newly
found resolutions of IP numbers into hostnames are appended to
the .pwhosts file in your directory, and these results
are used during subsequent runs of pwstat.
See below for more information on why you should not use
this option unless you really need to know the domains and subdomains
of the computers visiting your site.
- pwstat -R
- Reports all the requesting hosts by IP number, even in cases where
the IPs have been previously resolved to hostnames and the results are
available in the .pwhosts file.
- pwstat -s N
- Execute getlogs -sN
and use the result as input for pwstat. The N is an
integer giving the number of bytes to skip at the beginning of
the getlogs report.
This option is ignored if you specify an input log filename.
- pwstat -t
- Generate a text-only report. The default is an HTML report.
- pwstat -u
- Normally, unresolved IP numbers are listed in the domain and
reversed sub-domain sections of the pwstat report as
simply "Unresolved". To force all IP numbers to be individually
reported in the reversed sub-domain section, invoke this option.
- pwstat -U
- The -u option will likely result in more data than you want,
but perhaps you still want some sort of guess-timate of the number of
different sites visiting your webpages. The -U option
will force partial reporting of unresolved IP numbers, ignoring
the last number in the four-number sequence. For example, the
IP number 166.84.197.198 would be listed as 166.84.197.*, as would
all other machines in the 166.84.197.* network that happened to visit
your site.
- pwstat -y scheme
- The pwstat output includes near the top a line that says
"Approx. Cost of External Transmissions $12.34".
This cost is by default calculated using the formula for
personal web service. The various levels of corporate
web service have different cost formulas, however, and pwstat
has no way of knowing which to use unless you tell it.
Thus, you may specify one of the following schemes:
personal, corporate,
basic, standard or deluxe.
(Note: Panix assesses monthly charges on your total traffic.
If you have invoked any pwstat options which cause it
to skip log entries, then the value calculated
will not correspond with what you are actually charged.)
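The host-reporting options described above (-o, -u and -U) are easier to picture with a small example. The Python sketch below only mimics the behaviour described in this document and is not taken from pwstat; the function names are hypothetical, and the exact form of the -o output (a fully reversed hostname) is an assumption.

```python
def reversed_subdomain(hostname, full=False):
    """Default: 'gatekeeper.nytimes.com' -> 'com.nytimes.*'.

    With full=True (assumed to resemble -o), keep every part of the name.
    """
    labels = hostname.lower().split(".")
    if len(labels) < 2:
        return hostname  # nothing to lop off
    if full:
        return ".".join(reversed(labels))
    # Drop the machine name (first label) and reverse the rest.
    return ".".join(reversed(labels[1:])) + ".*"

def mask_ip(ip):
    """'166.84.197.198' -> '166.84.197.*', as described for -U."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted four-number IP: %r" % ip)
    return ".".join(octets[:3]) + ".*"

def unresolved_label(ip, mode="default"):
    """How an unresolved IP is listed: default, 'u' (-u) or 'U' (-U)."""
    if mode == "u":
        return ip
    if mode == "U":
        return mask_ip(ip)
    return "Unresolved"

print(reversed_subdomain("gatekeeper.nytimes.com"))
print(unresolved_label("166.84.197.198", mode="U"))
```

The masked form groups every machine on the same 166.84.197.* network into one line, which is what makes -U a rough count of visiting sites rather than of individual machines.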
IP-to-Hostname Resolving and the Host Hash File
The way that the -r option in the
pwlog and pwstat programs determines the machine names
corresponding to the IP numbers in the weblogs is to do a host
lookup for each number. However, since most people who hit a
good website hit it more than once,
doing a lookup for every single entry in a log
file would be needlessly repetitious. Thus, the pwlog and
pwstat programs
maintain a file of matching IP numbers and hostnames, and they
check in this file for a match before actually executing an
IP lookup. At present, this file is kept separately for each user who
runs pwlog and pwstat; there is no Panix-wide
common file which all pwlog and pwstat users can
access. The name of the hostfile is .pwhosts, and you will
find your copy in your home (login) directory.
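The check-the-file-first approach just described can be sketched in a few lines of Python. This is an illustration of the idea only: the on-disk format of .pwhosts is not documented here, so the cache is a plain dictionary, the function name resolve_all is hypothetical, and leaving failed lookups out of the cache is an assumption.

```python
import socket

def resolve_all(ips, cache, lookup=None):
    """Resolve IPs to hostnames, consulting `cache` before each lookup.

    Returns the number of real lookups performed.
    """
    if lookup is None:
        # Standard-library reverse lookup; it can be very slow.
        lookup = lambda ip: socket.gethostbyaddr(ip)[0]
    lookups = 0
    for ip in ips:
        if ip in cache:
            continue  # already known, no lookup needed
        lookups += 1
        try:
            cache[ip] = lookup(ip)
        except OSError:
            pass  # unresolved; assumed to be retried on a later run
    return lookups

def fake_lookup(ip):
    """Stand-in resolver with made-up data, so the example needs no network."""
    table = {"166.84.197.198": "host.example.com"}
    if ip not in table:
        raise OSError("no such host")
    return table[ip]

cache = {}
hits = ["166.84.197.198"] * 3 + ["10.0.0.1"]
print(resolve_all(hits, cache, lookup=fake_lookup), cache)
```

Four log entries cost only two lookups here, because the repeat visitor is found in the cache after the first hit.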
The process of converting IP
numbers to hostnames can be incredibly slow,
whether it occurs in pwlog or in pwstat.
In fact, it can be downright maddening if you have a popular site.
Lookup time for just a couple of days' worth of hits on my own pages can be
over an hour.
clay once reported that it took about 10
hours to resolve the new hostnames seen during a week of traffic to
his site, and that was back in late 1995, when web traffic
was a fraction of what it is now.
Persons with popular sites will also find that their .pwhosts
file can get pretty large. Mine, for example, was up to 475 KB by the
summer of 1995, after only a few months of traffic to my pages. If
you have a popular set of pages, it won't be long before your copy
of .pwhosts is into the megabytes. At that point, it's time to
ask if you really need to know the names of all the machines visiting
your site.
All this said, you may understand why your time is better spent
(and less computing time and disk space wasted) if you do not invoke
the -r option in either pwlog or pwstat.
Last modified:
Friday, 22-Jul-2005 08:23:35 EDT
rbs, askanas