
I'm parsing US Patent XML files (downloaded from the Google patent dumps) using Python and BeautifulSoup; the parsed data is exported to a MySQL database.

Each year's data contains close to 200-300K patents, which means parsing 200-300K XML documents.

The server on which I'm running the Python script is pretty powerful - 16 cores, 160 GB of RAM, etc. - but it is still taking close to 3 days to parse one year's worth of data.

I've been learning and using Python for 2 years, so I can get stuff done, but I don't know how to get it done in the most efficient manner. I'm reading up on it.

How can I optimize the script below to make it more efficient?

Any guidance would be greatly appreciated.

Below is the code:

from bs4 import BeautifulSoup
import pandas as pd
from pandas.core.frame import DataFrame
import MySQLdb as db
import os

cnxn = db.connect('xx.xx.xx.xx','xxxxx','xxxxx','xxxx',charset='utf8',use_unicode=True)

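# Split one concatenated dump file into individual XML documents,
# yielding one document string at a time.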
def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

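# Extract the citation fields from one parsed patent document and
# append them to the MySQL table.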
def get_data(soup):
    df = pd.DataFrame(columns = ['doc_id','patcit_num','patcit_document_id_country', 'patcit_document_id_doc_number','patcit_document_id_kind','patcit_document_id_name','patcit_document_id_date','category'])
    if soup.findAll('us-citation'):
        cit = soup.findAll('us-citation')
    else:
        cit = soup.findAll('citation')
    doc_id = soup.findAll('publication-reference')[0].find('doc-number').text
    for x in cit:
        try:
            patcit_num = x.find('patcit')['num']
        except:
            patcit_num = None
        try:
            patcit_document_id_country = x.find('country').text
        except:
            patcit_document_id_country = None   
        try:     
            patcit_document_id_doc_number = x.find('doc-number').text
        except: 
            patcit_document_id_doc_number = None
        try:
            patcit_document_id_kind = x.find('kind').text
        except:
            patcit_document_id_kind = None
        try:
            patcit_document_id_name = x.find('name').text
        except:
            patcit_document_id_name = None
        try: 
            patcit_document_id_date = x.find('date').text
        except:
            patcit_document_id_date = None
        try:
            category = x.find('category').text
        except:
            category = None
        print doc_id
        val = {'doc_id':doc_id,'patcit_num':patcit_num, 'patcit_document_id_country':patcit_document_id_country,'patcit_document_id_doc_number':patcit_document_id_doc_number, 'patcit_document_id_kind':patcit_document_id_kind,'patcit_document_id_name':patcit_document_id_name,'patcit_document_id_date':patcit_document_id_date,'category':category}    
        df = df.append(val, ignore_index=True)
    df.to_sql(name = 'table_name', con = cnxn, flavor='mysql', if_exists='append')
    print '1 doc exported'

i=0

l = os.listdir('/path/')
for item in l:
    f = '/path/'+item
    print 'Currently parsing - ',item
    for xml_string in separated_xml(f):
        soup = BeautifulSoup(xml_string,'xml')
        if soup.find('us-patent-grant'):
            print item, i, xml_string[177:204]          
            get_data(soup)
        else:
            print item, i, xml_string[177:204],'***********************************soup not found********************************************'
        i+=1
print 'DONE!!!'
3 Comments

  • Run it on a small dataset with a profiler to see where your issues are: docs.python.org/2/library/profile.html. Also, remove all try/except clauses and check properly instead; catching errors is expensive. How often is that thing printing to the console? Once per file? Unbuffered printing is also expensive, so print less if possible. You can also parallelize the program to run on more threads; this will only run on one core, one thread. Commented Jul 29, 2015 at 9:18
  • Yes, I'm researching how I can make it use all the CPUs. I'm running the script using nohup so that the output is written to the "nohup.out" file. Also, I need to check whether a particular value exists, because there are millions of records and the script could go bust if the value doesn't exist, unless there's a better way to check. Commented Jul 29, 2015 at 9:24
  • Maybe switch to lxml.html? Some people have found that BeautifulSoup is much slower than lxml.html, e.g.: blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance Commented Jul 29, 2015 at 9:25

2 Answers


Look into a tutorial on multi-threading, because currently that code will run on 1 thread, on 1 core.
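For CPU-bound parsing the GIL limits what plain Python threads can do, so a process pool is often the more practical route. A minimal sketch with the standard-library multiprocessing module (parse_file here is a hypothetical wrapper around the per-file loop from the question; each worker needs to open its own MySQL connection, since connections can't be shared across processes):

import os
from multiprocessing import Pool

def parse_file(path):
    # Placeholder: open a fresh MySQL connection here, then run the
    # separated_xml()/get_data() loop from the question on this one file.
    pass

if __name__ == '__main__':
    files = [os.path.join('/path/', name) for name in os.listdir('/path/')]
    pool = Pool(processes=16)      # roughly one worker per core
    pool.map(parse_file, files)    # each file is parsed in a separate process
    pool.close()
    pool.join()

The same division of work could also be achieved by launching several copies of the script, each given a slice of the file list.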

Remove all the try/except statements and check for missing values explicitly instead. Exceptions are expensive.
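For instance, BeautifulSoup's find() already returns None when a tag is missing, so each bare try/except in get_data can become an explicit check (a sketch using two of the question's fields):

kind_tag = x.find('kind')
patcit_document_id_kind = kind_tag.text if kind_tag is not None else None

date_tag = x.find('date')
patcit_document_id_date = date_tag.text if date_tag is not None else None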

Run a profiler to find the chokepoints, and multi-thread those parts or find a way to do them fewer times.
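One way to do that (a sketch, assuming the existing loop is wrapped in a main() function and pointed at a single input file):

import cProfile
import pstats

def main():
    # run the existing parsing loop here, limited to one input file
    pass

cProfile.run('main()', 'parse.prof')                                   # write profile data to parse.prof
pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(20)    # show the 20 slowest call paths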


5 Comments

16 threads (or 16 separate jobs, each doing 1/16th of the files) will run nearly 16 times as fast -- perhaps 5 hours per "year".
I've read the multi-threading tutorial but I'm still confused about how I could apply it to my scenario. @rickjames - I can get the list of files in the folder and divide them among 16 threads to be parsed; how do I tell the program to run these threads in parallel?
They are threads; they're always running in parallel if there are enough resources available, and as long as they aren't blocked waiting for the same resource. Another thing: I don't know how big your files are, but you might want to read in, say, 160 files and run on them, and then read in another 160 files. Disk access is expensive.
53 files in total, each containing the concatenated text of close to 4,000 individual XML files. Basically the program reads each big file, splits it into individual XML documents and parses the XML. If I were to divide the whole task into 10 threads, I'd allocate 5 files per thread.
Only 53? (Versus 16.) It might be simpler to simply spawn off 53 threads (or processes) and let them contend for resources. The critical resource in this approach would be RAM; "swapping" would slow things down and make this approach not good. (CPU and/or I/O saturation would not be a problem.)

So, you're doing two things wrong. First, you're using BeautifulSoup, which is slow, and second, you're using a "find" call, which is also slow.

As a first cut, look at lxml's ability to pre-compile XPath queries (see the heading "The XPath class" in the lxml docs). That will give you a huge speed boost.
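Roughly, precompiled XPath with lxml looks like this (a sketch; the element names mirror the question's fields, but the exact paths depend on the USPTO DTD in use):

import lxml.etree as ET

# Compile the expressions once, then reuse them for every document.
find_citations = ET.XPath('.//us-citation | .//citation')
find_doc_number = ET.XPath('string(.//publication-reference//doc-number)')

tree = ET.fromstring(xml_string.encode('utf-8'))   # xml_string as in the question
doc_id = find_doc_number(tree)
for cit in find_citations(tree):
    country = cit.findtext('.//country')           # None if the tag is missing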

Alternatively, I've been working on a library called yankee that does this kind of parsing declaratively, using best practices for lxml speed, including precompiled XPath.

Yankee on PyPI | Yankee on GitHub

You could do the same thing with yankee like this:


from yankee.xml import Schema, fields as f

# Create a schema for citations

class Citation(Schema):
    num = f.Str(".//patcit")
    country = f.Str(".//country")
    # ... and so forth for the rest of your fields

# Then create a "wrapper" to get all the citations

class Patent(Schema):
    citations = f.List(".//us-citation|.//citation")

# Then just feed the Schema your lxml.etrees for each patent:

import lxml.etree as ET

schema = Patent()

for xml_string in separated_xml(f):   # one document string at a time, as in the question
    doc = ET.fromstring(xml_string.encode('utf-8'))
    result = schema.load(doc)

The result will look like this:

{
    "citations": [
        {
            "num": "<some value>",
            "country": "<some value>",
        },
        {
            "num": "<some value>",
            "country": "<some value>",
        },
    ]
}


I would also check out Dask to help you multithread it more efficiently. Pretty much all my projects use it.
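For instance, with dask.bag the per-file work can be spread across local worker processes in a few lines (a sketch; parse_file is a hypothetical function that parses one dump file and writes its rows to MySQL):

import os
import dask.bag as dbag

files = [os.path.join('/path/', name) for name in os.listdir('/path/')]

# One partition per file, each handled by a separate worker process.
bag = dbag.from_sequence(files, npartitions=len(files))
bag.map(parse_file).compute(scheduler='processes')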
