I'm trying to parse a file of about 200 MB. I initially decided to use Python's re module for this task, but after some further study I found that the BNF grammar-based PyParsing library provides what I'm looking for.
To test my code, I used a 5 MB file and, to my surprise, the code took more than 3 minutes to parse it. Could somebody please review my code and see if I'm making any mistakes?
Here is the file content:
16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xxxdummy L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 unwanted L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xrandom3 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 xtext345 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
Here's the file content that I'm interested in:
16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xxxdummy L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 unwanted L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xrandom3 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
I'm sorry that I can't share the exact log file, as its content is confidential to my company. Further below I've added a second code example that uses a publicly available sample data file.
from pyparsing import *

FILE_PATH = "C:\\User\\Sam\\Desktop\\log.txt"

# base grammar
space = Literal(" ")
digits = Word(nums)
timestamps = Word(nums + ":.")
tags = Word(alphanums + "_-")
brackets = oneOf("{ } [ ] ( )")

# custom grammar
badTchTag = Literal("L1_Rx_TCH")
badTchText = Literal("BAD TCH BLOCK")
badTchExt = Literal("send_report_btch")

# query: timestamp, then skip ahead to the tag, the text and the bracketed count
query = Combine(
    timestamps + space + Suppress(SkipTo(badTchTag)) + badTchTag + space
    + Suppress(SkipTo(badTchText)) + badTchText + brackets + digits + brackets
    + space + badTchExt
)

# parse: read the whole file into memory and scan it for matches
with open(FILE_PATH) as file_ptr:
    try:
        output = query.searchString(file_ptr.read())
        for line in output:
            print(line)
    except ParseException:
        print('not found!')
# no explicit close() needed: the with block closes the file
My real script does a number of other things, but this is the most important part. The call to query.searchString() alone takes more than 3 minutes, even for a 5 MB file.
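For comparison, here is a minimal baseline that uses only the standard-library re module mentioned at the top and reads the log one line at a time instead of loading it whole. It is a sketch under assumptions: each record sits on its own line, and the names BAD_TCH_RE and iter_bad_tch are made up for illustration. The pattern is a hand translation of the query above, not something produced by PyParsing, and unlike the query it yields the whole matching line.

```python
import re

# Hypothetical pattern mirroring the PyParsing query:
# timestamp, anything up to the L1_Rx_TCH tag, anything up to
# "BAD TCH BLOCK", a bracketed number, then send_report_btch.
BAD_TCH_RE = re.compile(
    r"^(?P<time>[\d:.]+) .*?\bL1_Rx_TCH .*?BAD TCH BLOCK"
    r"[\[\](){}](?P<count>\d+)[\[\](){}] send_report_btch"
)


def iter_bad_tch(lines):
    """Yield the lines that look like BAD TCH BLOCK reports."""
    for line in lines:
        # Cheap substring test first, so the regex only runs on candidates.
        if "BAD TCH BLOCK" in line and BAD_TCH_RE.match(line):
            yield line.rstrip("\n")


# Sample records copied from the log excerpt in the question.
sample = [
    "16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch\n",
    "16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj\n",
    "16:31:19.321 unwanted L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch\n",
]
matches = list(iter_bad_tch(sample))
for line in matches:
    print(line)
```

Because iter_bad_tch accepts any iterable of lines, an open file object can be passed to it directly, so the 200 MB file would never have to be held in memory at once.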
[Edit] Another Example:
This dataset (http://opensource.indeedeng.io/imhotep/files/nasa_19950630.22-19950728.12.tsv.gz) contains a single log file (about 141 MB once unzipped). I unzipped it and opened the result in Python to test PyParsing.
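Unzipping first is optional: the standard-library gzip module can stream a .gz file line by line. The following is a minimal sketch, assuming the archive is a plain gzip of one text file and that Latin-1 is the right encoding, as in the code further down. The helper name iter_gz_lines and the temporary demo file are only for illustration.

```python
import gzip
import os
import tempfile


def iter_gz_lines(path, encoding="latin-1"):
    """Yield decoded lines from a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        for line in handle:
            yield line.rstrip("\n")


# Self-contained demo: write a tiny .gz file, then stream it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="latin-1") as out:
    out.write("d104.aa.net - 804571215 GET /images/KSC-logosmall.gif 200 1204\n")
    out.write("199.120.110.21 - 804571217 GET /images/launch-logo.gif 200 1713\n")

lines = list(iter_gz_lines(demo_path))
print(len(lines))
```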
Below is a snapshot of file contents:
unicomp6.unicomp.net - 804571214 GET /shuttle/countdown/count.gif 200 40310
unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif 200 786
unicomp6.unicomp.net - 804571214 GET /images/KSC-logosmall.gif 200 1204
d104.aa.net - 804571215 GET /shuttle/countdown/count.gif 200 40310
d104.aa.net - 804571215 GET /images/NASA-logosmall.gif 200 786
d104.aa.net - 804571215 GET /images/KSC-logosmall.gif 200 1204
129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif 304 0
199.120.110.21 - 804571217 GET /images/launch-logo.gif 200 1713
I wrote the Python code below to extract the lines that contain a GET request for /images/ (* GET /images/*), like these:
unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif
unicomp6.unicomp.net - 804571214 GET /images/KSC-logosmall.gif
d104.aa.net - 804571215 GET /images/NASA-logosmall.gif
d104.aa.net - 804571215 GET /images/KSC-logosmall.gif
129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif
199.120.110.21 - 804571217 GET /images/launch-logo.gif
from pyparsing import *

digits = Word(nums)
first = Word(alphanums + "._-")
space = OneOrMore(" ")
dash = Literal("-")
REQ1 = Literal("GET")
REQ2 = Literal("/images/")
query = Combine(first + space + dash + space + digits + space + REQ1 + space + REQ2 + first)

try:
    # read the whole file into memory and scan it for matches
    with open("C:\\Users\\Sam\\Desktop\\19950630.23-19950801.00.tsv", encoding="Latin-1") as log_file:
        result = query.searchString(log_file.read())
    for item in result:
        print(item)  # print each match, not the whole result list
except Exception as e:
    print(str(e))
This code also takes forever to run, and I have to kill the execution prematurely. Could you please help me identify what I'm doing wrong?
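As a cross-check for this second example, the same lines can be picked out without any grammar by splitting each record on whitespace. The sketch below assumes the fields are separated by spaces or tabs (the file is a .tsv) and appear in the order shown in the snapshot: host, dash, timestamp, method, path, status, size. The function name iter_image_requests is hypothetical.

```python
def iter_image_requests(lines):
    """Yield 'host - timestamp GET path' for GET requests under /images/."""
    for line in lines:
        fields = line.split()  # splits on runs of spaces or tabs
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        host, dash, timestamp, method, path = fields[:5]
        if method == "GET" and path.startswith("/images/"):
            yield f"{host} {dash} {timestamp} {method} {path}"


# Records taken from the snapshot; the last one uses tabs to show both cases.
sample = [
    "unicomp6.unicomp.net - 804571214 GET /shuttle/countdown/count.gif 200 40310\n",
    "unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif 200 786\n",
    "129.94.144.152\t-\t804571217\tGET\t/images/ksclogo-medium.gif\t304\t0\n",
]
wanted = list(iter_image_requests(sample))
for row in wanted:
    print(row)
```

Comparing the output of such a baseline with what the PyParsing query returns on a small slice of the file is one way to check whether the grammar matches anything at all before running it on the full 141 MB.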
Literal, Words, nums and so on. Can you add the necessary imports for us?
SkiptTo and ParceException look like typos to me. Please fix the post so that the code is runnable.