Apache Log Parsing

Just like Joe Landman, I’ve been annoyed for years by the fact that
there didn’t seem to be a single Apache log parsing library out there on
the internet. Today I finally decided to do one last Google search, only
to find out that there are a few libraries after all, but these
are too bloated for my taste (and usage).

Tweaking the Regex

The regex as given by Joe is almost perfect. I modified two parts in
order to be able to parse a few more log lines and make it a bit more useful.

Split Request Method and Path

One of the values captured by the original regex is the following:

# c[6] = incoming request (GET, PUT, HEAD, ... with relative URI part)

By splitting the request method and the path, it’s easier to handle these
values later. For example, instead of “GET /abc” we will now get a “GET” and
a “/abc” value.
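
As a rough illustration of the split, here is a minimal sketch. The pattern below is a simplified stand-in that only looks at the quoted request field; it is not Joe's regex, and the names REQUEST_RE and split_request are made up for this example.

```python
import re

# Hypothetical, simplified pattern for the quoted request field only
# (an assumption for illustration, not the original regex).
# The method and the path get separate named groups; the protocol is optional.
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)(?: (?P<protocol>[^"]+))?"'
)


def split_request(line):
    """Return (method, path) from the quoted request field, or None."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    return match.group('method'), match.group('path')


line = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /abc HTTP/1.0" 200 2326'
print(split_request(line))
```

With two separate groups, later code can filter on the method or the path without splitting the string again.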

Fix Content-Length Bug

In some cases Apache will redirect a request, i.e., send a Location: header.
These responses don’t have a Content-Length: header; instead, the logs show a
dash as the value. The original regex incorrectly tries to parse this dash
with \d. Changing this to \S fixes the bug.
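
The difference is easy to see on a single redirect line. This is a minimal sketch that assumes a simplified pattern covering only the status code and size at the end of the request field; STRICT_SIZE, LENIENT_SIZE and response_size are hypothetical names, not part of the original regex or of apachelog.

```python
import re

# Hypothetical patterns (assumptions for illustration) that match the
# closing quote of the request, the status code, and the response size.
STRICT_SIZE = re.compile(r'" (?P<status>\d{3}) (?P<size>\d+)')   # digits only
LENIENT_SIZE = re.compile(r'" (?P<status>\d{3}) (?P<size>\S+)')  # accepts "-" too


def response_size(line):
    """Return the logged size as an int, or None if it is missing or a dash."""
    match = LENIENT_SIZE.search(line)
    if match is None:
        return None
    size = match.group('size')
    # A dash (or anything else non-numeric) means no size was logged.
    return int(size) if size.isdigit() else None


redirect = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /old HTTP/1.0" 302 -'
normal = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /abc HTTP/1.0" 200 2326'

# The digits-only pattern rejects the redirect line entirely.
print(STRICT_SIZE.search(redirect) is None)
print(response_size(redirect))
print(response_size(normal))
```

Since \S also matches things that are not numbers, the sketch converts the captured value to an integer only when it consists of digits.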

Simple Usage

Because Webalizer limits the number of Search Strings to the top 20
(Search Strings being the queries that people google for before they land on
my website), I decided to make a simple utility based on apachelog.

(Note that I found out about the existing apachelog library only after I
created my own, hence the duplicate name.)

The following is a simple tool that extracts search queries from Google referers.

import apachelog
import sys

if __name__ == '__main__':
    # enumerate each entry in the apache log file
    for entry in apachelog.enumerate(sys.argv[1]):
        # check whether the domain is something-google
        # and contains a query string
        if entry.referer and \
                'google' in entry.referer.netloc and \
                entry.referer.kwargs.get('q', None):
            print(entry.referer.kwargs['q'])

See, it’s really not that hard to parse some log files. Btw, it turns out
somebody googled for “pointer contain garbege untilit is uninitialized” in
order to get to my website; now tell me that isn’t cute.

Code

As always, the code can be found on GitHub.
