[update: this is what I actually do. Try this, instead. ]
I have seen a couple of sites offer up lists of their “greatest hits” and wanted to do something similar. But groping through 4 years of apache logs wasn’t too appealing.
I hunted around for some ideas on getting apache logs into a MySQL database (lots of people log directly into MySQL, but my stuff here is a tad too brittle for that). I found this page, with a script designed to parse the logs’ lines into their constituent fields, open up a database connection with DBD, and insert all the stuff.
I didn’t find the regex featured in the script to be functional. So I ended up coming up with my own. I’m still working on what to include/exclude. I don’t want to keep track of every time my feeds were retrieved: I want to know when an actual page was requested.
I also borrowed the database schema used by the mod_log_sql project (in contrib/create_tables.sql).
#!/usr/bin/perl use DBD::mysql; #Database options: $dbUser = "loguser" $dbPass = "l0gger" $dbName = "apachelogs" $database = DBI->connect("dbi:mysql:$dbName:localhost", $dbUser, $dbPass); #204.95.98.252 - - [24/Dec/2003:15:23:38 +0000] "GET /archive/writing/2003/08/19 HTTP/1.0" 200 11873 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)" while (<>) { my ($client, $identuser, $authuser, $date, $method, $url, $protocol, $status, $bytes, $referer, $agent) = /(\S+).*? (\S+) (\S+) \[(.*?)\] "(\S*) (\S+) (\S+)" (\S+) (\S+) "(\S+)" "(.*)?"/; next if $status = m/404/; next unless $request_uri = m/\/wordpress|\/movabletype/go; $q = "insert into access_log (remote_host, remote_user, request_time, request_method, request_uri, request_protocol, status, bytes_sent, referer, agent) values(".$database->quote($client).", ".$database->quote($authuser).", '".$date."', ".$database->quote($method).", ".$database->quote($url).", " .$database->quote($protocol).", ".$database->quote($status).", " .$database->quote($bytes).", ".$database->quote($referer).", " .$database->quote($agent).")" my $sth = $database->prepare($q); $sth->execute(); }
and this gets you the top 25 entries: you’ll see what I mean about excluding stuff):
mysql> select request_uri, count(request_uri) as count from access_log group by request_uri order by count desc limit 25; +--------------------------------------------------------------+-------+ | request_uri | count | +--------------------------------------------------------------+-------+ | / | 74 | | /favicon.ico | 64 | | /wordpress/ | 55 | | /wordpress/wp-atom.php | 36 | | /wordpress/index.php/archives/category/education/ | 36 | | /public/calendar/rss/rss.php?cal=CBC&rssview=week | 35 | | /wordpress/wp-content/plugins/wp-hashcash-getkey.php | 29 | | http://www.paulbeard.org | 20 | | /mt/mt-comments.cgi?entry_id=1284 | 15 | | /movabletype/index.atom | 13 | | /wordpress/index.php | 11 | | /movabletype/archives/000608.html | 9 | | /wordpress | 7 | | /wordpress/index.php?p=1260 | 6 | | /wordpress/index.php?p=1569 | 5 | | /wordpress/index.php?p=1514 | 5 | | /wordpress/index.php?p=1653 | 5 | | /wordpress/index.php?p=1681 | 5 | | /wordpress/index.php/archives/2006/02/26/whats-this-about/ | 5 | | /wordpress/index.php/archives/2005/01/15/recipe-saag-paneer/ | 4 | | /wordpress/index.php?p=211 | 4 | | /wordpress/index.php?cat=20 | 4 | | /db/entry.db | 4 | | /wordpress/wp-comments-post.php | 4 | | /wordpress/index.php?m=20020422/ | 4 |
I think I can live with just logging any uri that has wordpress or movabletype in it (just added that bit to the script).
Of course, I’ll soon find out that someone has already done this, but in the event I don’t it might be interesting to make this into a WordPress plugin.
Now playing: Moving the River by Prefab Sprout from the album “Two Wheels Good” | Get it