
I'm trying to extract firm names from text data (10-K statement data).

I first tried the NLTK StanfordTagger and extracted every word tagged as an organization. However, it quite often failed to recall all the firm names, and because I apply the tagger to every related sentence, it took a very long time.

So I'm now trying to extract all words that start with a capital letter (or that consist entirely of capital letters).

I found the regex below helpful.

(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+

However, it cannot distinguish the name of a segment from the name of a firm.

For example,

sentence : The Company's customers include, among others, Conner Peripherals Inc.("Conner"), Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.

I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, and Applieds, but not 'Silicon Systems', since it is the name of a segment.

So, I tried using

(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)

However, it still extracts 'Silicon Systems'.
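A minimal sketch to reproduce this, assuming the example sentence is stored in a variable named `sentence` (a name chosen here for illustration) and that only the full match of each `re.finditer` result is kept:

```python
import re

# The example sentence from above, split across literals for readability.
sentence = (
    "The Company's customers include, among others, "
    'Conner Peripherals Inc.("Conner"), Maxtor Corporation ("Maxtor"). '
    "The largest proportion of Applieds consolidated net sales and "
    "profitability has been and continues to be derived from sales of "
    "manufacturing equipment in the Silicon Systems segment to the "
    "global semiconductor industry."
)

# Second regex from the question, with the negative lookahead.
pattern = r"(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)"

# Keep only the full match (group 0) and strip the optional trailing space.
full_matches = [m.group(0).strip() for m in re.finditer(pattern, sentence)]
print(full_matches)  # 'Silicon Systems' is still in this list
```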

Could you help me solve this problem?

(Or do you have any other ideas for extracting only the firm names from the text data?)

Thanks a lot!!!

6
  • If I use your first regex with re.findall on the text I get ['Company', 'Inc.', 'Conner', 'Corporation ', 'Maxtor', 'The ', 'Applieds ', 'Systems ']. This does not really match what you stated you wanted to match in the question, but you also said the regex works well. Am I missing something? Commented Nov 3, 2017 at 3:01
  • @SethMMorton Oh, I think you are looking at the subgroups of the regex result (which is the default behavior of re.findall). I'm using the re.finditer method and capturing only the full match! Commented Nov 3, 2017 at 3:27
  • It's really important to include all relevant information needed so that people can replicate your results and help you. Commented Nov 3, 2017 at 3:34
  • Also, if you don't want to capture that group then you should make it non-capturing so that it is clear you are not going to capture it. Regular expressions are hard enough to read as-is, it is best to add self-documentation whenever possible. Commented Nov 3, 2017 at 3:36
  • @SethMMorton I'm sorry, I thought this was just trivial detail! I will edit the question! Commented Nov 3, 2017 at 3:36

2 Answers

1

You need to capture the whole run of consecutive capitalized words. To do that, make the pattern for an individual capitalized word a non-capturing group (?:...) and wrap the repetition in a capturing group:

>>> re.findall("((?:[A-Z]+[a-zA-Z\-0-9']*\.?\s?)+)+?(?![Ss]egment)",sentence)
["The Company's ", 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation ', 'Maxtor', 'The ', 'Applieds ', '']

2 Comments

It seems my question wasn't clear enough. I tried your regex (regex101.com/r/NCNKMP/1); however, it still captures 'Silicon System'.
Please try it in a Python interpreter! It only captures ['The ', 'Applieds ', '']
0

The NLTK approach, or any machine learning approach, seems better suited here. I can only explain the difficulty and the current issue with the regex approach.

The problem is that the expected matches can contain space-separated phrases, and you want to avoid matching a phrase that is followed by segment. Even if you correct the negative lookahead to (?!\s*[Ss]egment) and make the pattern linear with something like \b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment), you will still match Silicon, a part of the unwanted match, because the engine backtracks to a shorter candidate that is not followed by segment.
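The backtracking behavior described above can be checked with a short sketch. The shortened `text` and the variable names are only for illustration; the pattern is the linear one from the paragraph above:

```python
import re

# Linear pattern with the corrected negative lookahead.
pattern = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?(?!\s+[sS]egment)"
text = "equipment in the Silicon Systems segment to the global industry"

# "Silicon Systems" is rejected by the lookahead, so the engine backtracks
# to the shorter candidate "Silicon", which is followed by " Systems",
# not by " segment", and is therefore accepted.
found = re.findall(pattern, text)
print(found)  # ['Silicon']
```

So the lookahead alone only trims the unwanted phrase; it does not remove it.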

What you might do instead is match these unwanted entities and discard them, keeping only the entities that appear in other contexts by capturing those into Group 1.

See the sample regex for this:

\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?\s+[sS]egment\b|(\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?)

Since it is unwieldy, consider building it dynamically from blocks:

import re
entity_rx = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"
rx = r"{0}\s+[sS]egment\b|({0})".format(entity_rx)
s = "The Company's customers include, among others, Conner Peripherals Inc.(\"Conner\"), Maxtor Corporation (\"Maxtor\"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry."
# Drop the empty strings produced when the "segment" branch matches
matches = list(filter(None, re.findall(rx, s)))
print(matches)
# => ['The Company', 'Conner Peripherals Inc.', 'Conner', 'Maxtor Corporation', 'Maxtor', 'The', 'Applieds']

So, the entity block consists of:

  • \b - a word boundary
  • [A-Z][a-zA-Z0-9-]* - an uppercase letter followed by letters/digits/-
  • (?:\s+[A-Z][a-zA-Z0-9-]*)* - zero or more sequences of
    • \s+ - 1+ whitespace characters
    • [A-Z][a-zA-Z0-9-]* - an uppercase letter followed by letters/digits/-
  • \b - a trailing word boundary
  • \.? - an optional .

Then, this block is used to build

  • {0}\s+[sS]egment\b - the block defined above, followed by
    • \s+ - 1+ whitespace characters
    • [sS]egment\b - either segment or Segment as a whole word
  • | - or
  • ({0}) - Group 1 (what re.findall actually returns): the block defined above.

filter(None, re.findall(rx, s)) drops the empty items from the final list. In Python 2.x it returns a list directly; in Python 3.x it returns a lazy filter object, so wrap it as list(filter(None, re.findall(rx, s))) to get a list.
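As a possible variation, the same match-and-discard idea can be wrapped in a small helper. This is only a sketch: `extract_entities` and its `excluded_words` parameter are hypothetical names introduced here, and it uses re.finditer with Group 1 instead of re.findall plus filter:

```python
import re

# Entity block from the answer above.
ENTITY = r"\b[A-Z][a-zA-Z0-9-]*(?:\s+[A-Z][a-zA-Z0-9-]*)*\b\.?"

def extract_entities(text, excluded_words=("segment", "Segment")):
    """Return capitalized phrases not directly followed by an excluded word."""
    excluded = "|".join(re.escape(w) for w in excluded_words)
    # Branch 1 consumes "<entity> <excluded word>" without capturing it;
    # branch 2 captures every other entity into Group 1.
    rx = re.compile(r"{0}\s+(?:{1})\b|({0})".format(ENTITY, excluded))
    # Group 1 is None whenever branch 1 matched, so those hits are skipped.
    return [m.group(1) for m in rx.finditer(text) if m.group(1)]

sample = 'Conner Peripherals Inc.("Conner") sells to the Silicon Systems segment.'
print(extract_entities(sample))  # ['Conner Peripherals Inc.', 'Conner']
```

Passing more words in `excluded_words` (for example "division") would discard phrases followed by those words as well, but this remains a heuristic and will still pick up sentence-initial words such as The.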
