
Apache Log Parser In Python

Preface
Recently I needed to parse an Apache log file in a Python script, but I could not find a ready-made script online (not lucky *_*).
So I ended up writing this small function myself to parse a standard Apache log file. I am posting it here in case someone finds it useful in the future.

Usage
This function takes one line from an Apache log file and returns a dictionary. Below is the log format I tested the script against (Apache calls this layout the "combined" format: the common log format plus referer and user agent). A "- -" separates the IP address from the rest of the line because I don't log the identd and user name fields.
**************************************************************************************************************
IP_ADDRESS - - [DATE_STRING] "METHOD REQUEST_STRING PROTOCOL" RETURN_CODE BYTE_SENT "REFERER_URL" "USER_AGENT"
**************************************************************************************************************

For example
**************************************************************************************************************
62.185.109.11 - - [23/Jul/2003:10:21:23 -0700] "GET /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333 HTTP/1.1" 200 115239 "http://www.referer.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
**************************************************************************************************************
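The same layout can also be matched with a single regular expression. This is a different technique from the string-slicing function given further down, shown only as a minimal sketch: the names LOG_PATTERN and parse_with_regex are illustrative and not part of the original script, and the pattern assumes that the quoted fields contain no escaped double quotes.

```python
import re

# Illustrative pattern for the log layout described above (an assumption,
# not taken from the original script). Each named group maps to one field.
LOG_PATTERN = re.compile(
    r'(?P<ip_address>\S+) \S+ \S+ '
    r'\[(?P<date_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+) (?P<protocol>[^"]+)" '
    r'(?P<return_code>\d{3}) (?P<return_byte>\d+|-) '
    r'"(?P<refering_url>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_with_regex(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


sample = ('62.185.109.11 - - [23/Jul/2003:10:21:23 -0700] '
          '"GET /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333 HTTP/1.1" '
          '200 115239 "http://www.referer.com/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"')
fields = parse_with_regex(sample)
print(fields['ip_address'], fields['return_code'])
```

Returning None for a non-matching line, instead of raising an exception, makes it easy to skip junk lines in a loop.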
Function
     This function returns a dictionary with the following keys:
               dict = {'ip_address': ipaddress,
                 'date_time' : date,
                 'method' : method,
                 'request' :  request,
                 'protocol' : protocol,
                 'return_code': code,
                 'return_byte': byte,
                 'refering_url': referer,
                 'user_agent' : user_agent
                }
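All values in this dictionary are plain strings. If typed values are needed, something like the sketch below could convert them afterwards. The helper name to_typed is hypothetical, the code assumes Python 3 (for the %z directive) and an English locale (for the month abbreviation), and it is not part of the original script.

```python
from datetime import datetime


def to_typed(fields):
    """Return a copy of a parsed-line dict with typed date, status and size."""
    typed = dict(fields)
    # The timestamp looks like 23/Jul/2003:10:21:23 -0700.
    typed['date_time'] = datetime.strptime(fields['date_time'],
                                           '%d/%b/%Y:%H:%M:%S %z')
    typed['return_code'] = int(fields['return_code'])
    # Apache writes "-" instead of a number when no body bytes were sent.
    size = fields['return_byte']
    typed['return_byte'] = 0 if size == '-' else int(size)
    return typed


example = {'date_time': '23/Jul/2003:10:21:23 -0700',
           'return_code': '200',
           'return_byte': '115239'}
typed = to_typed(example)
print(typed['date_time'].isoformat(), typed['return_code'], typed['return_byte'])
```

The input dictionary is copied first, so the original string values stay available to the caller.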
      

##=========Start ===============================
#!/usr/bin/env python
def parse (log_line) :
     """Parse one Apache log line and return its fields as a dictionary."""
     SB    = "["
     EB    = "]"
     IP_SEPR = "- -"

     output = {}

     #clean empty space at the beginning.
     line = log_line.lstrip()
     #split on the first "- -" only, in case it shows up again later in the line.
     [ip,rest] = line.split(IP_SEPR,1)
     output['ip_address'] = ip.strip()

     #parse the date between the brackets (brackets not included).
     s_bracket = rest.index(SB)
     e_bracket = rest.index(EB)
     date_str = rest[s_bracket+1:e_bracket].strip()
     output['date_time'] = date_str

     #parse request string to get method, request and protocol.
     #the request is the first double-quoted field after the date.
     request_start = rest.index("\"",e_bracket+1)
     request_end = rest.index("\"",request_start+1)

     get_str = rest[request_start+1:request_end].strip()
     [method,request,protocol] = get_str.split(" ")
     output['method']= method
     output['request'] = request
     output['protocol'] = protocol

     #parse return code
     rest = rest[request_end+1:].strip()
     ret_code_e_ind = rest.index(" ")
     ret_code = rest[:ret_code_e_ind]
     output['return_code'] = ret_code

     #parse byte sent
     rest = rest[ret_code_e_ind+1:].lstrip()
     byte_sent_e_ind = rest.index(" ")
     byte_sent = rest[:byte_sent_e_ind]
     output['return_byte'] = byte_sent

     #parse refering url; an empty "" field gives an empty string.
     after_byte_sent = rest[byte_sent_e_ind+1:]
     s_quote_ref_url = after_byte_sent.index("\"")
     after_byte_sent = after_byte_sent[s_quote_ref_url+1:]
     e_quote_ref_url = after_byte_sent.index("\"")
     output['refering_url'] = after_byte_sent[:e_quote_ref_url]

     #parse user agent; an empty "" field gives an empty string.
     after_ref_url = after_byte_sent[e_quote_ref_url+1:]
     s_quote_user_agent = after_ref_url.index("\"")
     after_ref_url = after_ref_url[s_quote_user_agent+1:]
     e_quote_user_agent = after_ref_url.index("\"")
     output['user_agent'] = after_ref_url[:e_quote_user_agent]
     return output
##=======================End =================
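The function above handles one line at a time and fails with an exception on a line that does not fit the format. A small wrapper along the following lines could feed it a whole log and set bad lines aside. This is only a sketch: parse_lines and stand_in_parser are hypothetical names, the stand-in exists just to keep the example self-contained, and the wrapper assumes the parser signals malformed lines with ValueError.

```python
def parse_lines(lines, parser):
    """Run parser over every non-blank line.

    Returns (parsed, failed): parsed is a list of result dictionaries,
    failed is a list of (line_number, raw_line) pairs that raised ValueError.
    """
    parsed, failed = [], []
    for number, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        try:
            parsed.append(parser(line))
        except ValueError:
            failed.append((number, line.rstrip("\n")))
    return parsed, failed


def stand_in_parser(line):
    """Hypothetical stand-in parser: extracts only the IP address."""
    # Unpacking raises ValueError when the "- -" separator is missing.
    ip, _rest = line.split("- -", 1)
    return {'ip_address': ip.strip()}


sample_lines = [
    '62.185.109.11 - - [23/Jul/2003:10:21:23 -0700] "GET / HTTP/1.1" 200 512 "" ""\n',
    '\n',
    'this line is not a log entry\n',
]
parsed, failed = parse_lines(sample_lines, stand_in_parser)
print(len(parsed), "parsed,", len(failed), "failed")
```

With a real log, an open file object can be passed as lines, since iterating over a file yields one line at a time, and the parse function can be passed as parser.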

Testing
I tested the script with the log line above; here is the result.
            ===================== Script ==================================
            #Start
            if __name__=="__main__":
                 input = """
                        62.185.109.11 - - [23/Jul/2003:10:21:23 -0700] "GET /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333 HTTP/1.1" 200 115239 "http://www.referer.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
             """
                 result  = parse(input)
                 for x in result.keys():
                      print("%s --  %s" % (x, result[x]))
            #End
            ===================Output====================================
            refering_url --  http://www.referer.com/
            date_time --  23/Jul/2003:10:21:23 -0700
            protocol  --  HTTP/1.1
            user_agent --  Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
            return_byte --  115239
            ip_address --  62.185.109.11
            request --  /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333
            method --  GET
            return_code --  200
            ==============================================================================
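Once lines are parsed into dictionaries like the one printed above, simple summaries are easy to build. The sketch below counts how often each value of a key occurs. The helper name count_by_key is hypothetical and the records are made-up sample data in the same shape as the parser output, not real log entries.

```python
from collections import Counter


def count_by_key(records, key):
    """Count how often each value of key occurs across parsed records."""
    return Counter(record[key] for record in records)


# Made-up records shaped like the dictionaries returned by the parser.
records = [
    {'ip_address': '62.185.109.11', 'return_code': '200'},
    {'ip_address': '62.185.109.11', 'return_code': '404'},
    {'ip_address': '10.0.0.7', 'return_code': '200'},
]
codes = count_by_key(records, 'return_code')
for code, hits in codes.most_common():
    print(code, hits)
```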
      
Note
Use this script at your own risk. It is not very elegant, but it works for me. I will clean it up later and add more comments to the script. Good luck.