Preface

I recently needed to parse an Apache log file from a Python script. I could not find a ready-made script online (no luck *_*), so I ended up writing this small function myself to parse a line of a standard Apache log file. I am posting it here in case someone finds it useful in the future.

Usage
This function takes one line from an Apache access log and returns a dictionary. The format I tested against is the standard log format with the referer and user agent appended (Apache calls this the "combined" format). A "- -" separates the IP address from the rest of the line because I don't log IdentD or the username. For example:

62.185.109.11 - - [23/Jul/2003:10:21:23 -0700] "GET /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333 HTTP/1.1" 200 115239 "http://www.referer.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

The function returns a dictionary with the following keys:
output = {'ip_address': ipaddress,
          'date_time' : date,
          'method' : method,
          'request' : request,
          'protocol' : protocol,
          'return_code': code,
          'return_byte': byte,
          'refering_url': referer,
          'user_agent' : user_agent
         }
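As a small illustration of working with dictionaries of this shape, the sketch below counts how many records carry each return code. The helper name `count_return_codes` and the sample records are made up for this example; only the key names come from the list above.

```python
from collections import Counter


def count_return_codes(records):
    """Count records per 'return_code' value (hypothetical helper)."""
    return Counter(record['return_code'] for record in records)


# Hand-written sample records for illustration only, not real log data.
records = [
    {'ip_address': '192.0.2.1', 'return_code': '200'},
    {'ip_address': '192.0.2.2', 'return_code': '404'},
    {'ip_address': '192.0.2.1', 'return_code': '200'},
]

for code, count in sorted(count_return_codes(records).items()):
    print("%s -- %d" % (code, count))
```

The return codes are kept as strings here because that is how the parsing function stores them.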
##=========Start ===============================
#!/usr/bin/env python
def parse(line):
    """Parse one Apache access log line into a dictionary of strings."""
    SB = "["
    EB = "]"
    IP_SEPR = "- -"
    output = {}
    # Drop leading whitespace, then split off the IP address at the first "- -".
    line = line.lstrip()
    ip, rest = line.split(IP_SEPR, 1)
    output['ip_address'] = ip.strip()
    # Parse the date; the surrounding brackets are not kept.
    s_bracket = rest.index(SB)
    e_bracket = rest.index(EB)
    output['date_time'] = rest[s_bracket+1:e_bracket].strip()
    # Parse the quoted request string to get method, request and protocol.
    request_start = rest.index('"', e_bracket+1)
    request_end = rest.index('"', request_start+1)
    get_str = rest[request_start+1:request_end].strip()
    method, request, protocol = get_str.split(" ")
    output['method'] = method
    output['request'] = request
    output['protocol'] = protocol
    # Parse the return code.
    rest = rest[request_end+1:].strip()
    ret_code_e_ind = rest.index(" ")
    output['return_code'] = rest[:ret_code_e_ind]
    # Parse the number of bytes sent.
    rest = rest[ret_code_e_ind+1:].lstrip()
    byte_sent_e_ind = rest.index(" ")
    output['return_byte'] = rest[:byte_sent_e_ind]
    # Parse the referring URL: the text between the next pair of quotes.
    # An empty pair of quotes gives an empty string.
    after_byte_sent = rest[byte_sent_e_ind+1:]
    s_quote_ref_url = after_byte_sent.index('"')
    after_byte_sent = after_byte_sent[s_quote_ref_url+1:]
    e_quote_ref_url = after_byte_sent.index('"')
    output['refering_url'] = after_byte_sent[:e_quote_ref_url]
    # Parse the user agent the same way.
    after_ref_url = after_byte_sent[e_quote_ref_url+1:]
    s_quote_user_agent = after_ref_url.index('"')
    after_ref_url = after_ref_url[s_quote_user_agent+1:]
    e_quote_user_agent = after_ref_url.index('"')
    output['user_agent'] = after_ref_url[:e_quote_user_agent]
    return output
##=======================End =================
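For comparison, the same fields can be pulled out with a single regular expression. This is only a minimal sketch, not a replacement for the function above: the pattern `LOG_PATTERN` and the name `parse_with_regex` are my own, it assumes the request, referer and user agent contain no embedded double quotes, and, unlike the function above, it accepts any IdentD and username values rather than requiring "- -".

```python
import re

# Assumed layout:
# ip identd user [date] "method request protocol" code bytes "referer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip_address>\S+) \S+ \S+ '
    r'\[(?P<date_time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<request>\S+) (?P<protocol>[^"]+)" '
    r'(?P<return_code>\S+) (?P<return_byte>\S+) '
    r'"(?P<refering_url>[^"]*)" '
    r'"(?P<user_agent>[^"]*)"'
)


def parse_with_regex(line):
    """Return a dict of the named groups, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.groupdict()


sample = ('62.185.109.11 - - [23/Jul/2003:10:21:23 -0700] '
          '"GET /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333 HTTP/1.1" '
          '200 115239 "http://www.referer.com/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"')
result = parse_with_regex(sample)
print(result['request'])
```

Returning None for a line that does not match makes it easy to skip malformed lines when looping over a whole file, whereas the index-based function raises an exception in that case.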

Testing
I tested the script with the log line above; here is the result. The order of the keys in the output may differ on your Python version.
===================== Script ==================================
#Start
if __name__ == "__main__":
    log_line = """
62.185.109.11 - - [23/Jul/2003:10:21:23 -0700] "GET /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333 HTTP/1.1" 200 115239 "http://www.referer.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
"""
    result = parse(log_line)
    for x in result.keys():
        # This form of print works on both Python 2 and Python 3.
        print("%s -- %s" % (x, result[x]))
#End
===================Output====================================
refering_url -- http://www.referer.com/
date_time -- 23/Jul/2003:10:21:23 -0700
protocol -- HTTP/1.1
user_agent -- Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
return_byte -- 115239
ip_address -- 62.185.109.11
request -- /my_directory/myscript.cgi?param1=11111&param2=2222&param3=33333
method -- GET
return_code -- 200
==============================================================================
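Every value in the result is a string. If you need real types, something like the following sketch converts the date, return code and byte count. The helper name `convert_types` is hypothetical, the `%z` directive in `strptime` needs Python 3, `%b` expects English month abbreviations (the default C locale), and storing a "-" byte count as 0 is my own assumption about how to treat a response with no body.

```python
from datetime import datetime


def convert_types(record):
    """Return a copy of record with date_time, return_code and return_byte converted."""
    converted = dict(record)
    # Example date string: 23/Jul/2003:10:21:23 -0700
    converted['date_time'] = datetime.strptime(
        record['date_time'], '%d/%b/%Y:%H:%M:%S %z')
    converted['return_code'] = int(record['return_code'])
    # Assumption: a "-" byte count (no body sent) is stored as 0.
    byte_count = record['return_byte']
    converted['return_byte'] = 0 if byte_count == '-' else int(byte_count)
    return converted


# Values taken from the output shown above.
record = {'date_time': '23/Jul/2003:10:21:23 -0700',
          'return_code': '200',
          'return_byte': '115239'}
converted = convert_types(record)
print(converted['date_time'].isoformat())
```

The original dictionary is left untouched, so the string values are still available if you need to print the line back out.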

Note
Use this script at your own risk. It is not very elegant, but it works for me. I will clean it up later and add more comments to the script. Good luck.