DOCUMENTATION
WebLog 2.00 by Darryl C. Burgdorf
http://awsd.com/scripts/weblog/
WebLog is a comprehensive access log analysis tool. It allows you to
keep track of activity on your site by month, week, day and hour, to
monitor total hits, bytes transferred and page views, and to keep track
of your most popular pages. It can also print out secondary reports to
track "user sessions," showing the paths taken through your site by your
visitors and giving you a rough idea of how long they spent looking at
your pages, and to provide you with information on referring sites, the
search engine keywords which brought your visitors and the agents and
platforms they used while visiting.
===========================================
I. THE REPORTS
The primary WebLog access report provides the following information:
A. Long-Term Statistics
1. Monthly Statistics: An overview of site activity (number
of hits, number of bytes transferred, and approximate number
of visitors) per month for each month since you started
running WebLog.
2. Daily Statistics (Past Five Weeks): An overview of site
activity per day for the past five weeks.
3. Day of Week Statistics: An overview of site activity by
weekday, maintained as a running total since you started
running WebLog.
4. Hourly Statistics: An overview of site activity by hour of
the day, maintained as a running total.
5. "Record Book": A simple listing of the days on which your
site had the most hits, transferred the most data and saw
the most visitors.
Each of the "Long-Term Statistics" reports (except the "record
book," of course) lists four pieces of information: Hits,
Bytes, Visits and PViews. The number of "hits" is the total
number of files requested from the server. For example, if a
visitor loads a page which includes four inline graphics, a
total of five hits (one for the page itself and one for each
graphic) will be recorded in the access log. The
number of bytes represents the total amount of information
transferred by the server in filling those requests. (Note that
WebLog automatically factors in a bit extra in its calculations
to allow for the fact that "header" information -- which is not
recorded in the server access log -- is sent by the server along
with each file.) The number of "visits" is an approximation of
the number of actual individual visitors to your Web site. This
is only a *very* rough approximation, and should be regarded as
such. The number of "pviews" is the number of Web pages
viewed by your visitors. Each of the "Long-Term Statistics"
reports also includes a simple "bar graph" representation; the
graph can be configured to reflect whichever of the four items
you're most interested in being able to track "at a glance."
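The relationship between hits, bytes and page views can be sketched as
follows. This is an illustrative Python example, not WebLog's own Perl
code; the request records and the summarize() helper are hypothetical,
and the extra allowance WebLog adds for header data is not modeled. It
treats anything that does not match an image pattern as a page view.

```python
import re

# Hypothetical request records: (path, bytes sent). One page plus its
# four inline graphics makes five requests in the access log.
requests = [
    ("/index.html", 5000),
    ("/img/a.gif", 1200),
    ("/img/b.gif", 800),
    ("/img/c.jpg", 3000),
    ("/img/d.gif", 500),
]

# Same pattern as the documented default for $DetailsFilter.
IMAGE_FILTER = re.compile(r"(\.gif|\.jpg|\.jpeg)")

def summarize(records):
    """Return (hits, bytes, pviews) for a list of (path, size) records."""
    hits = len(records)                              # every request is a hit
    total_bytes = sum(size for _, size in records)   # header overhead ignored
    pviews = sum(1 for path, _ in records if not IMAGE_FILTER.search(path))
    return hits, total_bytes, pviews

print(summarize(requests))  # (5, 10500, 1)
```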
B. Statistics for The Current Month
1. Top N Files by Number of Hits (optional): A list of the
pages most frequently requested.
2. Top N Files by Volume (optional): A list of the pages which
resulted in the greatest number of bytes transferred.
3. Complete File Statistics (optional): A list of all pages
accessed in the current calendar month, with the date of
last access, number of times requested, and total number of
bytes transferred.
4. Top N Most Frequently Requested 404 Files (optional): A
list of the pages people are requesting most often which
don't actually exist on your site.
5. Complete 404 File Not Found Statistics (optional): A
complete list of those nonexistent files.
6. User ID Statistics (optional): A complete list of user IDs
(and the associated second-level domains) utilized by the
visitors to your Web site. Note that this report can, of
course, only be generated if at least part of your Web site
is password protected through your server's default system.
7. "Top Level" Domains: A breakdown of how many visits you've
had from each type of domain (.com, .net, .edu, etc.)
8. Top N Domains by Number of Hits (optional): A list of the
IP addresses (domains) from which people have visited your
site most often.
9. Top N Domains by Volume (optional): A list of the IP
addresses from which people have requested the greatest
amount of information.
10. Complete Domain Statistics (optional): A complete list of
the IP addresses from which people have visited your site
since the beginning of the current calendar month.
Each of the "Current Month" reports resets automatically at the
beginning of each month. This allows you to easily keep track
of things while preventing the report file from reaching too
ridiculous a size over time.
The optional access details report keeps track of "user sessions." It
will show you detailed "tracks" of the paths taken through your site by
visitors for however many days you specify, and will give you overview
information regarding how many unique visitors you've had each day and
how long they seem to be staying around. If logging of referring URLs
is enabled, it will also show you, where possible, where your visitors
came from. Please note that precise tracking of the number of visitors
is impossible; the information in this report is at best a reasonably
close approximation based on the information in your server access log.
The optional referring URL report logs the URLs reported by browsers as
the "referers" directing them to the various listed pages. You should
be aware that this information is far from perfect. Many browsers do
not provide any information on the referring page; even those that do
can at times provide false or misleading data. And the fact that a page
is listed as the referer to a given page does *not* necessarily mean
that it actually contains a link to that page. Of course, this report
is only available if your server log contains the necessary information.
The optional keywords report logs the keywords used by your visitors
to find you in the various Internet search engines and directories. The
major search engines are each listed individually. (Note that not all
search engines provide search keywords in their URLs, and so some are
not listed here.) Again, this report is only available if your server
log contains the necessary information.
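As a rough illustration of how keywords can be recovered from a
referring URL, the Python sketch below pulls the search terms out of
the query string. The parameter name "q" and the example URL are
assumptions; each search engine uses its own parameter, and this is
not a description of how WebLog itself recognizes them.

```python
from urllib.parse import urlsplit, parse_qs

def search_keywords(referer, param="q"):
    """Return the lowercased search terms found in a referring URL."""
    query = urlsplit(referer).query
    values = parse_qs(query).get(param, [])
    # parse_qs decodes "+" and %XX escapes; an absent parameter gives [].
    return values[0].lower().split() if values else []

print(search_keywords("http://search.example.com/find?q=Perl+log+analysis&page=1"))
# ['perl', 'log', 'analysis']
```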
The optional agent and platform reports list the agents (browsers)
and platforms (operating systems) utilized by visitors to your pages.
(Browsers which "spoof" other browsers -- such as MSIE or newer versions
of AOL's browser claiming to be Netscape, or WebTV claiming to be MSIE
claiming to be Netscape -- are identified as what they really are,
rather than as what they claim to be.) The first report details the
agents utilized; the second, the platforms. The third report combines
the data from the first two. The fourth report is a complete and
essentially unprocessed listing of the raw data from the agent log.
Again, of course, these reports are only available if your server log
contains the necessary information.
The referring URL, keywords and agent/platform reports will not
automatically reset, so you'll want to keep an eye on their sizes and
delete them periodically when they start to get too large to handle.
(CAVEAT:
(Like any log analysis software, WebLog is based squarely upon
several unfortunately questionable assumptions. Chief among these
is the assumption that any accesses from a specific IP address
within a reasonably short period of time belong to a single user,
and the assumption that analysis of access logs can actually tell
you anything useful about site visitors, anyway.
(It is possible for different users to access your site with the
same IP address, so a single "user session" might actually reflect
visits from multiple users. As well, thanks to the number of
systems which now employ local caching, it is quite likely that some
of the pages which seem to be accessed only once are in actuality
viewed many times by many different users.
(For more information on these problems, you might want to take
a look at some or all of the following articles:
Getting Real About Usage Statistics
- Tim Stehle
<http://www.wprc.com/wpl/stats.html>
Making Sense of Web Usage Statistics
- Dana Noonan
<http://www.piperinfo.com/pl01/usage.html>
Interpreting WWW Statistics
- Doug Linder
<http://gopher.nara.gov:70/0h/what/stats/webanal.html>
Why Web Usage Statistics are (Worse Than) Meaningless
- Jeff Goldberg
<http://www.cranfield.ac.uk/stats/>
(WebLog also assumes that the time between the loading of one page
and the loading of the next, so long as it is less than 30 minutes,
is actually spent looking at the first page. This is clearly not
necessarily the case. The user could have gotten up to fix himself
lunch or use the bathroom. He could have reloaded another page
already in his browser's cache, or could even have gone to look at
pages on other sites before returning to yours. There is no way of
knowing.
(Finally, WebLog assumes that the average length of time spent
viewing the last -- or only -- page visited in a user session is 30
seconds. Again, there is obviously no way to check the validity of
this assumption.)
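The two timing assumptions described in the caveat can be sketched in
Python as follows. This is only an illustration under stated
assumptions, not WebLog's actual code: the function names are
hypothetical, timestamps are plain seconds, and all of them are taken
to come from a single IP address.

```python
SESSION_GAP = 30 * 60   # a gap of 30 minutes or more starts a new session
LAST_PAGE_TIME = 30     # seconds assumed for the last (or only) page

def split_sessions(timestamps):
    """Group sorted page-load times (seconds) from one IP into sessions."""
    sessions = []
    for t in timestamps:
        if sessions and t - sessions[-1][-1] < SESSION_GAP:
            sessions[-1].append(t)   # close enough: same session
        else:
            sessions.append([t])     # too long a gap: new session
    return sessions

def session_length(session):
    """Time between first and last load, plus the last-page allowance."""
    return (session[-1] - session[0]) + LAST_PAGE_TIME

loads = [0, 120, 300, 4000, 4060]
sessions = split_sessions(loads)
print([session_length(s) for s in sessions])  # [330, 90]
```

Note that the time credited to each page is simply the gap before the
next load, which is exactly the weakness the caveat points out.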
===========================================
II. SETTING UP AND RUNNING WEBLOG
The files that you need are as follows:
weblog.pl: This is the main program file. You don't actually need to
do anything to it; in fact, you don't even have to execute it.
config.pl: This is the configuration file. Everything you need to
change or modify is contained here. This is also the file that you
will execute. (Things are set up this way so that you can effectively
maintain multiple versions of the script, for example if you want to
run separate log analyses for different sites, just by keeping
separate config files for each.)
bar1.gif, bar2.gif, bar3.gif, bar4.gif, bar5.gif and bar6.gif: These
six small graphics files are used to create the bar graphs in the main
access report.
As noted above, the WebLog configuration file, and not the WebLog
program itself, should be executed. (And please note that it should
be executed from the telnet command prompt rather than your browser;
WebLog is *not* a CGI script, and most likely won't run correctly if you
try to access it from your browser.) The configuration file should, of
course, be set executable. Make sure that the first line of the script
matches the location of your system's Perl interpreter. As well, the
following variables need to be defined:
$LogFile: The path (not URL) of the NCSA-format access log file from
which the log reports will be generated. Note that this file is
generated by your server; if you're not sure where to find it or what
it's called, check with your system administrators. It is possible,
though not likely, that you don't actually have access to log data.
If that is the case, then you won't be able to use WebLog at all.
The script can read either standard (aka "common format") log files
or extended (aka "combined") log files. You don't need to specify
the type, as WebLog determines it automatically when it reads the
file. Obviously, if you're dealing only with a standard format log
file, WebLog won't be able to generate agent or referer reports.
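To show what distinguishes the two log formats, here is a minimal
Python sketch that parses one line and reports which format it is in.
The regular expression and the parse_line() helper are assumptions
made for illustration; they are not taken from WebLog, and they do
not handle quotes escaped inside a field.

```python
import re

# Common format: host ident user [time] "request" status bytes
# Combined format adds: "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)

def parse_line(line):
    """Return a dict of fields plus a 'format' key, or None if unparseable."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # The optional trailing group is only present in combined logs.
    entry["format"] = "combined" if entry["referer"] is not None else "common"
    return entry

common = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
combined = common + ' "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'

print(parse_line(common)["format"])    # common
print(parse_line(combined)["format"])  # combined
```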
$IPLog: The path to an optional file in which resolved IP/domain
pairs will be stored. Logging this information will allow WebLog to
run much faster, especially if you're running multiple reports from a
single log file. However, especially on a busy site, the log file
could become *very* large. If you define an IP log file, keep an eye
on its size.
$FileDir: The path of the directory in which the various report files
will be created.
$ReportFile, $DetailsFile, $RefsFile, $KeywordsFile and $AgentsFile:
The file names to be used for each of the five reports WebLog can
generate. All but the first are optional; if you don't assign a
file name, the report simply won't be generated.
$SystemName: The name or description which you want to appear at the
top of your reports (e.g., "WebScripts").
$OrgName and $OrgDomain: The name and domain of the "host" organization
(e.g., ISP and isp.com). If these variables are defined, accesses
from this organization/domain will be counted separately from other
accesses in the details report.
$GraphURL: The URL of the directory containing the bar graph images
(e.g., "http://awsd.com/graphs"). Do NOT include a trailing slash!
$GraphBase: This variable defines the information on which you want the
bar graphs in the main report to be based. It can be set to "hits",
"visits", "pviews" or "bytes"; if left undefined (or defined
incorrectly), graphs will be based on bytes transferred.
$IncludeOnlyRefsTo and $ExcludeRefsTo: Regexes specifying files or
directories to include or ignore in the files lists. For example, to
include only files in a "scripts" subdirectory, $IncludeOnlyRefsTo =
"^/scripts" would suffice. Multiple entries should be "OR"ed
(e.g., $IncludeOnlyRefsTo = "(^/dir1|^/dir2)").
$IncludeOnlyDomain and $ExcludeDomain: Regexes specifying domains to
include or ignore in the log file. If you want your log analysis to
ignore any visits by you to your own site, for example, set the
$ExcludeDomain variable to your own IP address. (Note that even if
you don't ignore your own visits completely, you can still track them
separately in the details report by using the $OrgName and $OrgDomain
variables.)
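The way a pair of include/exclude patterns narrows a list can be
sketched in Python as below. The keep() helper is hypothetical, and
applying the include pattern before the exclude pattern is an
assumption rather than documented WebLog behavior. Simple patterns
such as these behave the same in Python and Perl.

```python
import re

def keep(value, include_only=None, exclude=None):
    """Return True if value passes the include and exclude patterns."""
    if include_only and not re.search(include_only, value):
        return False   # an include pattern is set and does not match
    if exclude and re.search(exclude, value):
        return False   # an exclude pattern is set and matches
    return True

paths = ["/dir1/a.html", "/dir2/b.html", "/other/c.html", "/dir1/private/d.html"]
kept = [p for p in paths
        if keep(p, include_only="(^/dir1|^/dir2)", exclude="/private/")]
print(kept)  # ['/dir1/a.html', '/dir2/b.html']
```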
$IncludeQuery: If this variable is set to "0", any query information
contained in a URL will be stripped as the log file is processed. If
it is set to "1", the information will be retained.
$PrintFiles: A flag specifying whether the lists of accessed files
should be generated. (Normally, of course, you'd want to do so.
However, for example, if you generate a separate access report for
each site on a server, and also a report for the server as a whole,
you might want to suppress the files listings on the server-wide
report.) 0 = no; 1 = yes.
$Print404: A flag specifying whether the "Code 404" file lists should
be printed. 0 = no; 1 = yes.
$PrintUserIDs: A flag specifying whether the User ID list should be
generated. If no portion of your site is password protected, or if
you use a password system other than that which is integral to your
server software (.htaccess in the case of most UNIX systems), then
this list can be turned off, as your log file won't contain any user
IDs, anyway.
$PrintDomains: A flag specifying whether or not to print lists of
visiting IP addresses. 0 = no; 1 = yes. This variable can also be
set to "2" to indicate that you want only second-level domains
tracked. (In other words, for example, one hit each from
user1.foo.com and user2.foo.com will show up simply as two hits
from foo.com, which can greatly reduce the size of your report file,
especially if your site is busy!)
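Collapsing host names to second-level domains can be sketched in
Python like this. It is a deliberately naive illustration, not
WebLog's implementation: the second_level() helper is hypothetical,
it keeps only the last two labels (so a name under a two-part
country suffix would be collapsed too far), and it leaves unresolved
numeric addresses untouched.

```python
from collections import Counter

def second_level(host):
    """Reduce a host name to its last two labels; leave bare IPs alone."""
    parts = host.split(".")
    if all(p.isdigit() for p in parts):   # unresolved IPv4 address
        return host
    return ".".join(parts[-2:])

hosts = ["user1.foo.com", "user2.foo.com", "192.0.2.7", "bar.net"]
counts = Counter(second_level(h) for h in hosts)
print(counts["foo.com"])  # 2
```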
$PrintTopNFiles: The number of files to include in the "Top N Files"
lists. Set to 0 if you don't want to print the lists.
$TopFileListFilter: Regex defining files to exclude from the "Top N
Files" lists. The default value of "(\.gif|\.jpg|\.jpeg|Code 404)"
will filter out most image files and any frequently-requested but non-
existing files.
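A "Top N" list with the default filter applied can be sketched in
Python as follows. The top_n() helper and the sample paths are
hypothetical; only the filter pattern is taken from the default
value quoted above.

```python
import re
from collections import Counter

# The documented default value of $TopFileListFilter.
TOP_FILTER = re.compile(r"(\.gif|\.jpg|\.jpeg|Code 404)")

def top_n(paths, n):
    """Return the n most requested paths that do not match the filter."""
    counts = Counter(p for p in paths if not TOP_FILTER.search(p))
    return counts.most_common(n)

paths = ["/a.html", "/a.html", "/logo.gif", "/logo.gif", "/logo.gif", "/b.html"]
print(top_n(paths, 2))  # [('/a.html', 2), ('/b.html', 1)]
```

Without the filter, the image would head the list even though it is
never viewed on its own.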
$PrintTopNDomains: The number of domains to include in the "Top N
Domains" lists. (This, of course, is irrelevant if you're not
printing domain lists.)
$LogOnlyNew: Setting this variable to "1" will instruct WebLog to
ignore any entries in the log file being analyzed which date from
before the end of the last log file analyzed. If you're afraid that
you might accidentally run the script with the same log file twice in
a row, setting this to "1" will prevent any data duplication. If, on
the other hand, you won't necessarily be analyzing log files in strict
chronological order, you will want to keep this set to "0" so that all
information is parsed.
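The effect of $LogOnlyNew can be pictured with the Python sketch
below, which drops every entry that is not later than the last
timestamp already processed. The only_new() helper is hypothetical,
and whether an entry with exactly the same timestamp is kept or
dropped is an assumption; the documentation does not say.

```python
from datetime import datetime

# Timestamp layout used in NCSA-format access logs.
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def only_new(entries, last_seen):
    """Keep only the timestamps strictly later than last_seen."""
    cutoff = datetime.strptime(last_seen, TIME_FORMAT)
    return [e for e in entries if datetime.strptime(e, TIME_FORMAT) > cutoff]

entries = [
    "10/Oct/2000:13:55:36 -0700",
    "10/Oct/2000:14:05:00 -0700",
    "11/Oct/2000:09:00:00 -0700",
]
print(only_new(entries, "10/Oct/2000:13:55:36 -0700"))
```

Running the same log twice with such a cutoff adds nothing the second
time, which is the duplication safeguard described above.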
$NoSessions: If set to "1" this variable will instruct WebLog *not* to
include visitor counts on the monthly, daily and day-of-week lists.
It will also disable creation of the details report.
$NoResolve: By default, WebLog will attempt to resolve any IP numbers
in the log file to domain names. This can take a while, especially
with larger log files. If you don't want the script to bother -- if,
for example, you don't care whether visitors came from ".com", ".net"
or ".jp" sites, or if your log file already contains resolved domain
names wherever possible, anyway -- just set this variable to "1".
$DetailsFilter: A regex defining files to exclude from the details
report. (It's also used to determine what qualifies as a "page view"
in the main report.) The default value of "(\.gif|\.jpg|\.jpeg)" will
filter out most image files, making it easier to follow which actual
pages were viewed, and allowing a (theoretically) more accurate
tracking of the time spent on each page.
$DetailsDays: The number of days past to include in the details report.
(This, of course, is only relevant if you're actually printing the
details report.) The number cannot be greater than 35.
$refsexcludefrom and $refsexcludeto: If you want references to or from
certain files ignored in the referring URLs report, define them here.
You might want to exclude any references from within the same domain,
for example, so that you can more easily see what *outside* locations
are sending visitors to your site.
$RefsStripWWW: Setting this variable to "1" will instruct the script to
remove the "www" prefix from URLs. If you don't strip those, the same
URL could end up appearing twice in your referring URL list, both as
"www.foo.com" and as "foo.com"; if you *do* strip the prefix, though,
while the lists will be a bit easier to read and interpret, you'll end
up with some URLs which you can't actually follow unless you manually
put the "www" back. (On some systems, for whatever reason, it's
mandatory.)
$RefsMinHits: This variable defines the minimum number of references
that must come from a particular page before that page is included in
the final report. If you have a very busy site, and just want to know
where *most* people are coming from, set it relatively high. On the
other hand, if you have a fairly quiet site, or if you're interested
in tracking all accesses, set it low.
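The combined effect of stripping the "www" prefix and applying a
minimum hit count can be sketched in Python as below. The helper
names and sample URLs are hypothetical, and this is an illustration
of the idea rather than WebLog's own logic.

```python
import re
from collections import Counter

def strip_www(url):
    """Remove a leading "www." from the host part of a URL."""
    return re.sub(r"^(https?://)www\.", r"\1", url)

def referrer_report(urls, min_hits=1, strip=True):
    """Count referring URLs and keep those seen at least min_hits times."""
    counts = Counter(strip_www(u) if strip else u for u in urls)
    return {u: n for u, n in counts.items() if n >= min_hits}

refs = [
    "http://www.foo.com/links.html",
    "http://foo.com/links.html",
    "http://bar.net/",
]
print(referrer_report(refs, min_hits=2))  # {'http://foo.com/links.html': 2}
```

With stripping turned off, the two foo.com entries would be counted
separately and neither would reach a minimum of two.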
$AgentsIgnore: If you wish to ignore references to particular files in
your agents/platforms report, list them here. Eliminating references
to graphic images, for example, will prevent your report from
indicating an overly-high percentage of graphical browsers, since
only hits to actual pages will be included.
===========================================
This documentation assumes that you have at least a general familiarity
with setting up Perl scripts. If you need more specific assistance,
check with your system administrators, consult the WebScripts FAQs
(frequently-asked questions) file <http://awsd.com/scripts/faqs.shtml>,
or ask on the WebScripts Forum <http://awsd.com/scripts/forum/>.
-- Darryl C. Burgdorf