
I am using Python with spaCy as my NLP library. I am new to NLP and am hoping for some guidance on extracting tabular information from text. My goal is to find which types of expenses are frozen and which are not. Any guidance would be highly appreciated.

 TYPE_OF_EXPENSE    FROZEN?       NOT_FROZEN?
  purchase order    frozen           null 
     capital        frozen           null
   consulting       frozen           null
business meetings   frozen           null
 external hires     frozen           null
       KM&L          null         not frozen
      travel         null         not frozen


import spacy

nlp = spacy.load('en_core_web_sm')

# Adjacent string literals are joined into one string, so the text can span lines.
text = (
    "Non-revenue-generating purchase order expenditures will be frozen. "
    "All capital related expenditures are frozen effectively for Q4. "
    "Following spending categories are frozen: Consulting, "
    "(including existing engagements), Business meetings. "
    "Please note that there is a hiring freeze for external hires, "
    "subcontractors and consulting services. "
    "KM&L expenditure will not be frozen. Travel cost will not be on ‘freeze’."
)
doc = nlp(text)

My ultimate goal is to extract this whole table into an Excel file. Even if you can advise on only a few of the categories above, I would be deeply grateful. Thank you very much in advance.

2 Answers


A few questions: are the categories predefined, and will they remain that way? If so, you can simply build a small vocabulary with only those words and work from that. The second thing is to do basic preprocessing first, such as case normalization.

Then split your input into sentences using a sentence tokenizer. Once this is done, split those sentences into tokens; NLTK has a nice tokenizer that lets you define phrases, so "new york" will be tokenized as new_york, and so on. Once each sentence is tokenized, use a window-based method: when you find a matching token in a sentence, look, say, 4 tokens before and after the word frozen to find any negations. So for a sentence you can get tokens like

[All,capital,related,expenditures,are,frozen,effectively,for,Q4]

This hits a match for both the frozen and capital keywords. Check the window before and after frozen for negations: if you find any, mark capital as False for frozen; otherwise mark it as True. A single binary true/false column is enough to record this.
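The steps above can be sketched roughly as follows. This is only a minimal illustration using the standard library, not NLTK: the sentence splitter and tokenizer are simple regexes, and the CATEGORIES vocabulary, the NEGATIONS set and the window size of 4 are assumptions chosen for the sample text in the question.

```python
import re

# Assumed vocabulary: phrase to look for -> label to show in the table.
CATEGORIES = {
    "purchase order": "purchase order",
    "capital": "capital",
    "consulting": "consulting",
    "business meetings": "business meetings",
    "external hires": "external hires",
    "km&l": "KM&L",
    "travel": "travel",
}
FREEZE_WORDS = {"frozen", "freeze"}
NEGATIONS = {"not", "no", "never"}  # assumed list, extend as needed
WINDOW = 4  # tokens inspected on each side of a freeze word

TEXT = (
    "Non-revenue-generating purchase order expenditures will be frozen. "
    "All capital related expenditures are frozen effectively for Q4. "
    "Following spending categories are frozen: Consulting, "
    "(including existing engagements), Business meetings. "
    "Please note that there is a hiring freeze for external hires, "
    "subcontractors and consulting services. "
    "KM&L expenditure will not be frozen. Travel cost will not be on 'freeze'."
)


def tokenize(sentence):
    """Lowercase and keep runs of letters, digits and '&' (so 'KM&L' survives)."""
    return re.findall(r"[a-z0-9&]+", sentence.lower())


def classify(text):
    """Return {label: True if frozen, False if not frozen}."""
    table = {}
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = tokenize(sentence)
        hits = [i for i, tok in enumerate(tokens) if tok in FREEZE_WORDS]
        if not hits:
            continue
        # Negated if any negation word falls inside the window around a hit.
        negated = any(
            tok in NEGATIONS
            for i in hits
            for tok in tokens[max(0, i - WINDOW): i + WINDOW + 1]
        )
        joined = f" {' '.join(tokens)} "
        for phrase, label in CATEGORIES.items():
            if f" {phrase} " in joined:  # whole-token phrase match
                table[label] = not negated
    return table


for label, frozen in classify(TEXT).items():
    print(label, "frozen" if frozen else "not frozen")
```

The resulting dictionary can then be written out as rows with whatever spreadsheet tooling you prefer. Note that the negation is decided per sentence, so a sentence that mixes frozen and non-frozen categories would need a finer rule.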


2 Comments

Thank you very much for your answer; however, I am still struggling. Do you have any guidance with examples for each step you mention above?
Follow nltk.org/book and work through the exercises there to get a handle on basic classical NLP.

If the example you gave is a common example in your work, you can break down the tasks into the following steps:

  1. Define rules with spaCy's rule-based Matcher to describe the sentence patterns. For example, "KM&L expenditure will not be frozen" could be matched by [{"LOWER": {"REGEX": "^.*expenditure"}}, {"LOWER": "will"}, {"LOWER": "not"}, {"LEMMA": "be"}, {"LOWER": "frozen"}] (I didn't test it, so please adjust as needed). You will probably need to write many rules to cover the different phrasings.

  2. Split the paragraphs into sentences using NLTK Tokenize (see example)

  3. For each sentence, apply spaCy's rule-based matching

  4. To extract the TYPE_OF_EXPENSE, take the substring (here KM&L) by counting characters from the front or the back of the matched sentence, depending on which rule matched. For example, in "KM&L expenditure will not be frozen" you can count from the back, because "expenditure will not be frozen" is defined by the rule and so its length is fixed.
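A rough sketch of these steps, assuming spaCy 3.x. To keep it self-contained it departs from the list in two ways: it uses spaCy's own rule-based sentencizer instead of NLTK for the sentence split, and it takes the expense name from token offsets (everything in the sentence before the match) rather than counting characters. The two rule names and the shortened sample text are illustrative only, and the rules use LOWER instead of LEMMA so that a blank pipeline without a trained model is enough.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, plus rule-based sentence splitting.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
NOUN = {"LOWER": {"IN": ["expenditure", "expenditures"]}}
# Illustrative rules; real data will need more of them.
matcher.add(
    "NOT_FROZEN",
    [[NOUN, {"LOWER": "will"}, {"LOWER": "not"}, {"LOWER": "be"}, {"LOWER": "frozen"}]],
)
matcher.add(
    "FROZEN",
    [[NOUN, {"LOWER": {"IN": ["will", "are"]}}, {"LOWER": "be", "OP": "?"}, {"LOWER": "frozen"}]],
)

text = (
    "Non-revenue-generating purchase order expenditures will be frozen. "
    "All capital related expenditures are frozen effectively for Q4. "
    "KM&L expenditure will not be frozen."
)
doc = nlp(text)

rows = []
for match_id, start, end in matcher(doc):
    sentence = doc[start:end].sent
    # Everything in the sentence before the matched rule is the expense text.
    expense = doc[sentence.start:start].text
    rows.append((expense, nlp.vocab.strings[match_id]))

for row in rows:
    print(row)
```

The extracted prefix is raw text, so entries such as the purchase order sentence still carry leading words that you would want to trim or map onto a fixed category list.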

Hope this helps.
