I'm extracting firm names from text data (10-K statement data).
I first tried the NLTK StanfordTagger and extracted every word tagged as an organization. However, it quite often failed to recall all the firm names, and because I apply the tagger to every related sentence, it took a very long time.
So I'm now trying to extract all words that start with a capital letter (or that consist entirely of capital letters).
I found the regex below helpful:
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+
However, it cannot distinguish the name of a segment from the name of a firm.
For example:
sentence : The Company's customers include, among others, Conner Peripherals Inc.("Conner"), Maxtor Corporation ("Maxtor"). The largest proportion of Applieds consolidated net sales and profitability has been and continues to be derived from sales of manufacturing equipment in the Silicon Systems segment to the global semiconductor industry.
I want to extract Conner Peripherals Inc, Conner, Maxtor Corporation, Maxtor, and Applieds, but not 'Silicon Systems', since it is the name of a segment.
So, I tried using
(?:[A-Z]+[a-zA-Z\-0-9]*\.?\s?)+(?!segment|Segment)
However, it still extracts 'Silicon Systems'.
Could you help me solve this problem?
(Or do you have any ideas on how to extract only firm names from the text data?)
Thanks a lot!
With re.findall on the text I get ['Company', 'Inc.', 'Conner', 'Corporation ', 'Maxtor', 'The ', 'Applieds ', 'Systems ']. This does not really match what you stated you wanted to match in the question, but you also said the regex works well. Am I missing something?