Parsing a file using PyParsing

Question

I'm trying to parse a file of about 200 MB. I initially decided to use Python's re module for this task. However, after some further study, I found that the BNF grammar-based PyParsing provides what I'm looking for.

To test my code, I used a 5 MB file and, to my surprise, the code takes more than 3 minutes to parse it. Could somebody please review my code and see if I'm making any mistakes?

Here is the file content:

16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xxxdummy L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 unwanted L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xrandom3 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L3_Tx_BCH Downlink DumpStack: xdfsfosifjsfj
16:31:19.321 xtext345 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch

Here's the file content that I'm interested in:

16:31:19.321 xxxTIM23 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xxxdummy L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 unwanted L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xrandom3 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch
16:31:19.321 xtext345 L1_Rx_TCH Uplink BAD TCH BLOCK(2) send_report_btch

I'm sorry that I can't share the exact log file, as the content is confidential to my company. I've added a second code example and a sample data file from the internet below.

from pyparsing import *

FILE_PATH = "C:\\User\\Sam\\Desktop\\log.txt"

# base grammar
space = Literal(" ")
digits = Word(nums)
timestamps = Word(nums+":.")
tags = Word(alphanums+"_-")
brackets = oneOf("{ } [ ] ( )")

# custom grammar
badTchTag = Literal("L1_Rx_TCH")
badTchText = Literal("BAD TCH BLOCK")
badTchExt = Literal("send_report_btch")

# query: timestamp, then skip ahead to the tag and to the "BAD TCH BLOCK(n)" text
query = Combine(timestamps+space+Suppress(SkipTo(badTchTag))+badTchTag+space+Suppress(SkipTo(badTchText))+badTchText+brackets+digits+brackets+space+badTchExt)

# parse (the with block closes the file, so no explicit close() is needed)
with open(FILE_PATH) as file_ptr:
    try:
        output = query.searchString(file_ptr.read())
        for line in output:
            print(line)
    except ParseException:
        print('not found!')

The full program does a bunch of other things, but this is the most important part. The call to query.searchString() takes more than 3 minutes even for a 5 MB file.

[Edit] Another Example:

This dataset (http://opensource.indeedeng.io/imhotep/files/nasa_19950630.22-19950728.12.tsv.gz) contains a single log file (about 141 MB once unzipped). I unzipped this file and opened it in Python to test PyParsing.

Below is a snapshot of file contents:

unicomp6.unicomp.net - 804571214 GET /shuttle/countdown/count.gif 200 40310
unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif 200 786
unicomp6.unicomp.net - 804571214 GET /images/KSC-logosmall.gif 200 1204
d104.aa.net - 804571215 GET /shuttle/countdown/count.gif 200 40310
d104.aa.net - 804571215 GET /images/NASA-logosmall.gif 200 786
d104.aa.net - 804571215 GET /images/KSC-logosmall.gif 200 1204
129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif 304 0
199.120.110.21 - 804571217 GET /images/launch-logo.gif 200 1713

I wrote the Python code below to extract the lines that match (* GET /images/*), so that the output looks like this:

unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif
unicomp6.unicomp.net - 804571214 GET /images/KSC-logosmall.gif
d104.aa.net - 804571215 GET /images/NASA-logosmall.gif
d104.aa.net - 804571215 GET /images/KSC-logosmall.gif
129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif
199.120.110.21 - 804571217 GET /images/launch-logo.gif

from pyparsing import *

digits = Word(nums)
first = Word(alphanums+"._-")
space = OneOrMore(" ")
dash = Literal("-")
REQ1 = Literal("GET")
REQ2 = Literal("/images/")

query = Combine(first+space+dash+space+digits+space+REQ1+space+REQ2+first)

try:
    # the with block makes sure the file is closed after reading
    with open("C:\\Users\\Sam\\Desktop\\19950630.23-19950801.00.tsv", encoding="Latin-1") as log_file:
        result = query.searchString(log_file.read())
    for item in result:
        print(item)  # print each match, not the whole result list
except Exception as e:
    print(str(e))

This code also takes forever to run, and I have to kill it before it finishes. Could you please help me identify what I'm doing wrong?

Welcome to Code Review! Would you mind adding more context to your question? What does the whole file look like? What do you need the result to look like? What is the code meant to achieve? I personally don't know the module you're using.
– Grajdeanu Alex, Commented Jul 6, 2017 at 7:38
The code in the post does not run: it's missing definitions of Literal, Word, nums and so on. Can you add the necessary imports for us?
– Gareth Rees, Commented Jul 6, 2017 at 7:41
In case of such egregious performance problems, I'd suspect your grammar as the primary culprit. Could you add the grammar you are using, as well as an extensive description of what exactly you want to do with the parsing?
– Vogel612, Commented Jul 6, 2017 at 8:46
@sircasms: I don't think this code is real, because SkiptTo and ParceException look like typos to me. Please fix the post so that the code is runnable.
– Gareth Rees, Commented Jul 6, 2017 at 14:13
@Gareth Rees and others: I'm sorry. I was at the office, so I couldn't copy-paste the code/log because of company policy. I've corrected the mistakes in the code and added more information about what I want the code to do. I've also added a second example. Thank you.
– SirPunch, Commented Jul 6, 2017 at 15:30

Gareth Rees · Accepted Answer · 2017-07-06 16:06:28Z

When speeding up any code, the usual technique is to try lots of improvements and measure the performance. So let's try some different approaches to parsing this log file and see how long they take when run on this standard test case:

INPUT = '''\
unicomp6.unicomp.net - 804571214 GET /shuttle/countdown/count.gif 200 40310
unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif 200 786
unicomp6.unicomp.net - 804571214 GET /images/KSC-logosmall.gif 200 1204
d104.aa.net - 804571215 GET /shuttle/countdown/count.gif 200 40310
d104.aa.net - 804571215 GET /images/NASA-logosmall.gif 200 786
d104.aa.net - 804571215 GET /images/KSC-logosmall.gif 200 1204
129.94.144.152 - 804571217 GET /images/ksclogo-medium.gif 304 0
199.120.110.21 - 804571217 GET /images/launch-logo.gif 200 1713
'''

First, using searchString as in the post:

def test1():
    return query.searchString(INPUT)

>>> from timeit import timeit
>>> timeit(test1, number=1000)
5.339049099013209

(This is basically the same as the code in the post, but skipping the bit where the input is read from a file, so that we can focus on the performance of the parsing.)
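A single timeit run can be noisy, so when comparing several candidates it may help to repeat each measurement and keep the fastest run. Here is a minimal sketch of such a helper; the names best_time and compare are my own, not part of timeit:

```python
from timeit import repeat


def best_time(func, number=1000, repeats=5):
    """Return the fastest total time in seconds for `number` calls of func."""
    return min(repeat(func, number=number, repeat=repeats))


def compare(candidates, number=1000):
    """Map each candidate's name to its speed-up relative to the first one."""
    times = {f.__name__: best_time(f, number) for f in candidates}
    baseline = times[candidates[0].__name__]
    return {name: baseline / t for name, t in times.items()}
```

Calling compare with a list of test functions, baseline first, would return the speed-up factor of each one relative to that baseline.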

Now, the problem with searchString is that it searches: if it doesn't find a match starting at character n in the input, then (in the worst case) it will try again starting at character n+1, and if that fails, try again starting at character n+2, and so on.
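To make that cost concrete, here is a rough sketch that counts match attempts under each strategy. It uses the standard-library re module as a stand-in for the grammar, which is an assumption on my part: PyParsing's internals differ, and PATTERN, SAMPLE and both function names are illustrative only.

```python
import re

# Illustrative stand-in for the grammar in the question (not PyParsing itself).
PATTERN = re.compile(r"[\w.-]+ +- +\d+ +GET +/images/[\w.-]+")

SAMPLE = (
    "unicomp6.unicomp.net - 804571214 GET /shuttle/countdown/count.gif 200 40310\n"
    "unicomp6.unicomp.net - 804571214 GET /images/NASA-logosmall.gif 200 786\n"
    "d104.aa.net - 804571215 GET /images/KSC-logosmall.gif 200 1204\n"
)


def scan_everywhere(text):
    """Try a match at each offset, jumping past a match when one is found."""
    matches, attempts, pos = [], 0, 0
    while pos < len(text):
        attempts += 1
        m = PATTERN.match(text, pos)
        if m:
            matches.append(m.group())
            pos = m.end()
        else:
            pos += 1
    return matches, attempts


def match_line_starts(text):
    """Try exactly one match per line, anchored at the start of the line."""
    matches, attempts = [], 0
    for line in text.splitlines():
        attempts += 1
        m = PATTERN.match(line)
        if m:
            matches.append(m.group())
    return matches, attempts
```

Both functions find the same two matches in SAMPLE, but the per-line version makes only three attempts, while the scanning version makes one attempt for every character that is not inside a match.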

But when parsing a logfile like this, you know that every match has to start at the beginning of a line, and must end before the end of the line. So you can help out PyParsing by splitting the input into lines yourself, and calling parseString (which gives up immediately if there is no match) instead of searchString (which keeps trying as discussed above):

def test2():
    result = []
    for line in INPUT.splitlines():
        try:
            result.append(query.parseString(line))
        except ParseException:
            pass
    return result

>>> timeit(test2, number=1000)
0.8144860619213432

That's about 6½ times as fast as test1.

Now, PyParsing prioritizes flexibility and readability over performance. So it's not necessarily the right tool for a high-performance high-volume application. Perhaps we can do without PyParsing and just split the log entry?

def test3():
    result = []
    for line in INPUT.splitlines():
        _, _, _, method, path, _, _ = line.split()
        if method == 'GET' and path.startswith('/images/'):
            result.append(line)
    return result

>>> timeit(test3, number=1000)
0.011201590998098254

That's about 480 times as fast as test1.
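For a real log of 100 MB or more, the same idea can be applied while streaming the file line by line instead of reading it all into memory. This is only a sketch under assumptions: it supposes every valid entry has exactly seven whitespace-separated fields, as in the sample above, and the function names are hypothetical.

```python
def image_requests(lines):
    """Yield log lines whose request is a GET for a path under /images/."""
    for line in lines:
        fields = line.split()
        # Assumed layout: host, '-', timestamp, method, path, status, size.
        if len(fields) != 7:
            continue  # skip blank or malformed lines instead of raising
        method, path = fields[3], fields[4]
        if method == 'GET' and path.startswith('/images/'):
            yield line.rstrip('\n')


def image_requests_in_file(path, encoding='latin-1'):
    """Stream a log file line by line so it is never fully held in memory."""
    with open(path, encoding=encoding) as log_file:
        yield from image_requests(log_file)
```

Unlike the tuple unpacking in test3, the length check means a truncated or blank line is skipped rather than raising ValueError. Whether silently skipping such lines is acceptable depends on the log, so treat that as a design choice to revisit.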

Thank you, Gareth. I'll alter my code to use the parseString method, and I'll also check the performance without using PyParsing.
– SirPunch, Commented Jul 7, 2017 at 3:38
Gareth, I ran the parsing task without using PyParsing and the complete execution finished in 0.3 seconds, compared to 3.5 minutes with PyParsing. Thank you for your help. :)
– SirPunch, Commented Jul 10, 2017 at 7:33
