I decided I wanted to see this more easily: I wanted to know what people were finding interesting, based on what Google and their own browsing turned up.
So I wrote a little script to go through the active logfile, pull out the successful page requests, tally them up, and write out hrefs and titles as links.
The actual code appears behind the [read more] link. It should work anywhere, and is reasonably configurable, based on local needs and constraints.
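To make the core step concrete, here is a minimal sketch (separate from topten.pl itself) of how splitting a log line on whitespace yields the requested path and the status code. The line is the start of the sample entry from the end of the script, with the referer and user-agent fields left off for brevity.

```perl
use strict;
use warnings;

# Start of the sample log entry from the end of topten.pl
# (referer and user-agent fields omitted here).
my $line = '168.210.56.98 - - [17/Dec/2003:00:01:33 -0800] '
         . '"GET /movabletype/archives/001299.html HTTP/1.0" 200 11401';

# Splitting on whitespace, field 6 is the request path and field 8 the status.
my ($url, $status) = (split ' ', $line)[6, 8];

print "$url $status\n";
```

The field positions hold only as long as the fields before them contain no extra spaces, which is the case for the sample entry.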
# topten.pl: a simple script to pull the top ten most requested URLs
# from a webserver logfile and output them to a text file, with hrefs and
# accurate titles. Assumes reasonably well-formed HTML (i.e., each page has a title).
# Assumptions:
# o your webserver logs in common log format (CLF).
# o your server supports some way of including snippets or include files (the output
# file's name and location are configurable, but it has to be someplace where you can write
# and the http listener can read)
# o A sample data line appears at the end of the file so you can see what this script is
# expecting.
# written largely by me <paul@paulbeard.org> with able assistance and sanity checking by
# Mark Reed <whose email I will protect>
#
use strict;
use warnings;

my ($weblog_name, $weblog_url, $logfile, $outfile, $key, $i, $url, %URLS, $docroot, $status);
$logfile = '/usr/local/weblogs/httpd-access.log';
$weblog_name = 'cloudy, chance of sun breaks';
$docroot = "/www";
$outfile = "$docroot/includes/topten.html";
open (LOG, "<$logfile") or die "$logfile: $!";
while (<LOG>) {
    ($url, $status) = (split)[6,8];
    next unless defined $url && defined $status;   # skip malformed lines
    $url .= "index.html" if $url eq "/movabletype/";
    next unless $url =~ /\.html?/;                 # count only HTML pages
    next unless $status =~ /^(?:200|201|202|203|302|304)$/;
$URLS{$url}++;
}
close LOG;
open STDOUT, ">$outfile" or die "$outfile: $!";
$i = 0;
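foreach my $key (sort by_hits keys %URLS) {
    last if $i >= 10;    # stop after the ten most requested pages
    my $hits = $URLS{$key};
    my $file = "$docroot$key";
    # a page that cannot be read is skipped rather than aborting the whole run
    open (PAGE, "<$file") or do { warn "$file: $!\n"; next; };
    while (<PAGE>) {
        if (m|title>\Q$weblog_name\E: (.*)</title|) {
            my $title = $1;
            print STDOUT "<a href=\"$key\">$title</a> $hits requests<br />\n";
            last;    # one title per page is enough
        }
    }
    close PAGE;
    $i++;
}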
close STDOUT;
sub by_hits {
$URLS{$b} <=> $URLS{$a}
}
__DATA__
168.210.56.98 - - [17/Dec/2003:00:01:33 -0800] "GET /movabletype/archives/001299.html HTTP/1.0" 200 11401 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Q312461; EvaluNet; Feedreader; .NET CLR 1.0.3705)"