I have hundreds of company report .txt files, and I want to extract some information from them. For example, part of one file looks like this:

Mr. Davido will receive a base salary of $700,000 during the initial and any subsequent 
term. The Chief Executive Officer of the Company (the CEO) and the Board (or a committee
thereof) shall review Mr. Davidos base salary at least annually, and may increase it at 
any time in their sole discretion

I am trying to use pyparsing to extract the person's base salary.

code

from pyparsing import * 

# define grammar
digits = "0123456789"
integer = Word( digits )
money = Group("$"+integer+','+integer + Optional(','+integer , ' '))
start = Word("base salary") 
salary = start + money

#search
for t in text:
  result = salary.parseString( text )
print result

This always gives the error:

pyparsing.ParseException: Expected W:(base...) (at char 0), (line:1, col:1)

After some simple tests, I found that this code only matches when the text starts with the phrase, for example:

"base salary $700,000......"

and even then it only finds the first occurrence in the text.

So I was wondering if someone could help me with this and, if possible, also identify the person's name and store the name and salary in a dataframe.

Thank you so much.

  • I am going to go ahead and say you can't. Pyparsing is for structured text, whereas what you have is a natural language problem. NLTK may (MAY!) be the tool to use... though the tool I would use is interns. Commented Oct 12, 2014 at 12:22
  • Thanks a lot @Tritium21, I will give NLTK a try. Commented Oct 12, 2014 at 14:30

1 Answer

I'll answer your specific question first. parseString is used when you have defined a comprehensive grammar that will match everything from the beginning of the text. Since you are trying to pick out a specific phrase from somewhere in the middle of the input line, use searchString or scanString instead.
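As a minimal sketch of that suggestion (not the asker's exact grammar): the version below replaces Word("base salary"), which matches any run of the characters b, a, s, e, l, r, y and space rather than the phrase itself, with two keywords, and calls searchString on the whole text. The names amount and salary_expr, the optional "of", and the integer conversion are assumptions made for illustration.

```python
from pyparsing import CaselessKeyword, Optional, Suppress, Word, nums

text = """Mr. Davido will receive a base salary of $700,000 during the initial and any subsequent
term. The Chief Executive Officer of the Company (the CEO) and the Board (or a committee
thereof) shall review Mr. Davidos base salary at least annually, and may increase it at
any time in their sole discretion"""

# Digits with optional thousands separators, converted to an int ("700,000" -> 700000).
amount = Word(nums, nums + ",")
amount.setParseAction(lambda toks: int(toks[0].replace(",", "")))

# Keywords (rather than Word) so the phrase is matched as whole words,
# with any whitespace, including a line break, allowed between them.
base_salary = CaselessKeyword("base") + CaselessKeyword("salary")

# Assumption: the amount follows the phrase directly, optionally after "of".
salary_expr = (
    Suppress(base_salary)
    + Optional(Suppress(CaselessKeyword("of")))
    + Suppress("$")
    + amount("salary")
)

# searchString scans the whole text and returns one result per match.
matches = salary_expr.searchString(text)
for match in matches:
    print(match["salary"])
```

The second mention of "base salary" in the sample is skipped because no dollar amount follows it. scanString takes the same input and additionally yields the start and end offsets of each match, which helps when you need the surrounding context.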

As pyparsing's author, I will concur with @Tritium21 - unless there are some specific forms and phrases that you can look for, you will tear your hair out trying to parse this kind of natural language input.
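If the reports do reuse a recurring phrasing, one narrow pattern per phrasing is manageable. Here is a sketch under that assumption, for the single form "<title> <surname> will receive a base salary of $<amount>". The list of titles, the single-word surname, and the helper phrase() are all hypothetical choices, and any sentence worded differently is silently skipped.

```python
from pyparsing import And, CaselessKeyword, Suppress, Word, alphas, nums, oneOf

text = """Mr. Davido will receive a base salary of $700,000 during the initial and any subsequent
term. The Chief Executive Officer of the Company (the CEO) and the Board (or a committee
thereof) shall review Mr. Davidos base salary at least annually, and may increase it at
any time in their sole discretion"""


def phrase(words):
    """Match the given words in order, with any whitespace between them."""
    return And([CaselessKeyword(w) for w in words.split()])


# Assumption: a title followed by a one-word surname identifies the person.
title = oneOf("Mr. Ms. Mrs. Dr.")
surname = Word(alphas, alphas + "-")

# Digits with optional thousands separators, converted to an int.
amount = Word(nums, nums + ",")
amount.setParseAction(lambda toks: int(toks[0].replace(",", "")))

# One fixed wording only; add further patterns for other wordings as needed.
record = (
    Suppress(title)
    + surname("name")
    + Suppress(phrase("will receive a base salary of"))
    + Suppress("$")
    + amount("salary")
)

# Collect one plain dict per match.
rows = [
    {"name": match["name"], "salary": match["salary"]}
    for match in record.searchString(text)
]
print(rows)
```

Each row is a plain dict, so the list can be handed to pandas.DataFrame(rows) to get the name and salary table asked for in the question.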

1 Comment

Thank you so much Paul, I will try another toolkit.
