Apache Log Parsing
Just like Joe Landman, I’ve been annoyed for years by the fact that
there hasn’t been a single Apache log parsing library out there on
the internet. Today I finally decided to do one last Google search, only
to find out that there are a few libraries, but these
are too bloated for my taste (and usage).
Tweaking the Regex
The regex given by Joe is almost perfect. I modified two parts so that
it can parse a few more log lines and is a bit more useful.
Split Request Method and Path
One of the values returned by the original regex is the following:
# c = incoming request (GET, PUT, HEAD, ... with relative URI part)
By splitting the request method and the path, it’s easier to handle these
values later. That is, instead of “GET /abc” we now get “GET” and “/abc”.
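A minimal sketch of that split, assuming a request field of the form `METHOD path PROTOCOL`. The pattern and the names `REQUEST_RE` and `split_request` are made up for illustration and are not necessarily what apachelog uses:

```python
import re

# Hypothetical pattern: method, path, and an optional protocol version.
REQUEST_RE = re.compile(
    r'^(?P<method>\S+) (?P<path>\S+)(?: (?P<protocol>\S+))?$'
)

def split_request(request):
    """Split a request field such as 'GET /abc HTTP/1.1' into (method, path)."""
    match = REQUEST_RE.match(request)
    if match is None:
        # Malformed or empty request field, e.g. a lone '-'
        return None
    return match.group('method'), match.group('path')
```

With this, `split_request('GET /abc HTTP/1.1')` gives the method and the path as two separate values, and a request field that does not fit the pattern yields `None` instead of raising.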
Fix Content-Length Bug
In some cases Apache will redirect a request, i.e., send a Location: header.
These responses don’t have a Content-Length:; instead, they have a dash as
the value in the logs. The original regex incorrectly tries to parse this dash
with \d. Changing this to \S fixes the bug.
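To illustrate the difference, here is a small sketch using a made-up log line and a simplified tail of the pattern (status code plus size only). The names `STRICT`, `LENIENT` and `parse_size` are hypothetical, not part of apachelog:

```python
import re

# Made-up sample entry for a redirect: the size field is '-' instead of a number.
LINE = '127.0.0.1 - - [10/Oct/2009:13:55:36 +0200] "GET /old HTTP/1.1" 302 -'

# Simplified tails of a log pattern: status code followed by the response size.
STRICT = re.compile(r'" (\d{3}) (\d+)$')   # digits only: rejects the dash
LENIENT = re.compile(r'" (\d{3}) (\S+)$')  # any non-whitespace: accepts the dash

def parse_size(value):
    """Convert the logged size to an int, treating '-' as 'no size logged'."""
    return None if value == '-' else int(value)

strict_match = STRICT.search(LINE)    # no match, the whole line would be skipped
lenient_match = LENIENT.search(LINE)  # matches, the size group holds '-'
```

Since \S accepts anything that is not whitespace, the size has to be converted afterwards; `parse_size` is one possible way to do that.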
Because Webalizer limits the number of Search Strings to the top 20
(Search Strings being the queries that people google for before they land on
my website), I decided to make a simple utility based on apachelog.
(Note that I found out about the existing apachelog library only after I
created my own, hence the duplicate name.)
The following is a simple tool that extracts search queries from Google referers.
import apachelog
import sys

if __name__ == '__main__':
    # enumerate each entry in the apache log file
    for entry in apachelog.enumerate(sys.argv):
        # check whether the domain is something-google
        # and contains a query string
        if entry.referer and \
           'google' in entry.referer.netloc and \
           entry.referer.kwargs.get('q', None):
            print(entry.referer.kwargs['q'])
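The `netloc` and `kwargs` attributes used here come from apachelog's referer object. As a rough stand-alone equivalent, assuming Python 3 and only the standard library, the same check can be sketched with `urllib.parse`. The helper name `google_query` is made up for this example:

```python
from urllib.parse import urlparse, parse_qs

def google_query(referer):
    """Return the 'q' search string of a Google referer URL, or None."""
    if not referer or referer == '-':
        # No referer was logged for this request
        return None
    parts = urlparse(referer)
    if 'google' not in parts.netloc:
        return None
    # parse_qs maps each parameter name to a list of decoded values
    values = parse_qs(parts.query).get('q')
    return values[0] if values else None
```

Like the tool itself, this only does a substring check on the host name, so any domain containing "google" counts as a match.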
See, it’s really not that hard to parse some log files. Btw, it turns out
somebody googled for “pointer contain garbege untilit is uninitialized” in
order to get to my website. Now tell me that isn’t cute.
As always, code can be found on github.