I am trying to extract some data from this page. I would like to extract all the text between two strings (Item 1A RISK FACTORS and Item 1B UNRESOLVED STAFF COMMENTS), but I am having trouble coming up with the right regular expression.

import re
import urllib
import html2text

url = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"
html = urllib.urlopen(url).read()

text = html2text.html2text(html)

regex= '(?<=Item 1A Risk Factors)(.*)(?=Item 1B Unresolved)'

match = re.search(regex, text, flags=re.IGNORECASE)

print match

The code above prints None. Any suggestions?

  • Don't parse HTML with regex? Could you use CSS selectors or an XPath with an actual parser? Commented Apr 5, 2017 at 20:28
  • The HTML source contains neither the string "Item 1A Risk Factors" nor "Item 1B Unresolved". Commented Apr 5, 2017 at 20:30
  • Both "Item 1A Risk Factors" and "Item 1B Unresolved" appear in the rendered text. That's why I remove the HTML tags first and then try to use a regular expression. Hope this makes sense. Commented Apr 5, 2017 at 20:32
  • It's probably worth noting that html2text converts HTML source to valid Markdown, not plain text. Commented Apr 5, 2017 at 20:43

2 Answers

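If you want to use a regex, you can use the code below, which runs on Python 3.5.2. Try printing your "text" to see the actual value around ITEM 1A; it differs from what you see on the web page. In the raw HTML the heading is written as ITEM&#160;1A, with a non-breaking-space entity instead of a plain space, which is why the pattern below spells it out with escaped punctuation (ITEM\&\#160\;1A). Hope this helps.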

import urllib.request
from urllib.error import URLError, HTTPError
import re
import contextlib

mainpage = "https://www.sec.gov/Archives/edgar/data/104169/000010416916000079/wmtform10-kx1312016.htm"

try:
    with contextlib.closing(urllib.request.urlopen(mainpage)) as url:
        htmltext = url.read().decode('utf-8')
        #print(htmltext)
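except HTTPError as e:
    # The server answered, but with an error status code.
    print("HTTPError:", e.code)
except URLError as e:
    # The request never got a response (bad host name, refused connection, ...).
    print("URLError:", e.reason)
else:
    # Match against the raw HTML, where the space in "ITEM 1A" is the entity &#160;.
    results = re.findall(r'(?=ITEM\&\#160\;1A\.(.*)(RISK FACTORS))(.*)(?=ITEM\&\#160\;1B\.(.*)(UNRESOLVED))',htmltext)
    print(results)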
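To illustrate why the entity matters, here is a minimal sketch on a made-up fragment. The `sample` string is hypothetical and only imitates the filing's markup; it is not copied from the page. It assumes you are willing to decode entities first with `html.unescape`, after which `&#160;` becomes a non-breaking space that `\s` matches, and it uses `re.DOTALL` so that `.` can cross line breaks. This is an alternative to the pattern above, not a replacement for it.

```python
import html
import re

# Hypothetical fragment imitating the filing's markup; not the real page.
sample = (
    "<div>ITEM&#160;1A.</div><div>RISK FACTORS</div>\n"
    "<div>The risks described below...</div>\n"
    "<div>ITEM&#160;1B.</div><div>UNRESOLVED STAFF COMMENTS</div>"
)

# A literal space in the pattern does not match the raw entity text.
assert re.search(r"ITEM 1A", sample) is None

# Decode entities: "&#160;" becomes U+00A0 (non-breaking space).
decoded = html.unescape(sample)

# For str patterns \s matches U+00A0; DOTALL lets "." cross newlines.
pattern = re.compile(
    r"ITEM\s+1A\..*?RISK FACTORS(.*?)ITEM\s+1B\..*?UNRESOLVED",
    re.DOTALL,
)
match = pattern.search(decoded)
section = match.group(1) if match else None
print(section)
```

The captured text still contains the surrounding tags, so it would need a tag-stripping pass before it is readable.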

1 Comment

Thanks! @anonyXmous

You could just remove the HTML tags with this regex.

Find:

<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

Replace with nothing: ""

Then run this on the resulting string

1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS

What you want is in capture group 1.
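In Python, the two steps above might be sketched as follows. Note the swap: instead of the long tag-matching regex, this sketch strips tags with the standard library's `html.parser`, which is a different technique from the one in this answer. The section pattern is the one given above, compiled with `re.DOTALL` on the assumption that the stripped text contains newlines. The names `TextExtractor`, `strip_tags`, `extract_risk_factors` and the `sample` string are made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#160; in text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# The section pattern from the answer; DOTALL lets (.*?) span lines.
SECTION_RE = re.compile(
    r"1A\s*\.\s*RISK\s+FACTORS(.*?)1B\s*\.\s*UNRESOLVED\s+STAFF\s+COMMENTS",
    re.DOTALL,
)


def extract_risk_factors(markup):
    """Return capture group 1 of the section pattern, or None."""
    m = SECTION_RE.search(strip_tags(markup))
    return m.group(1) if m else None


# Hypothetical input imitating the filing's structure; not the real page.
sample = (
    "<html><style>p {color: red}</style><body>"
    "<p>ITEM&#160;1A.</p><p>RISK FACTORS</p>\n"
    "<p>The risks described below could matter.</p>\n"
    "<p>ITEM&#160;1B.</p><p>UNRESOLVED STAFF COMMENTS</p>"
    "</body></html>"
)
print(extract_risk_factors(sample))
```

On a real filing the headings may also appear earlier, for example in a table of contents, so the first match is not guaranteed to be the section body; that would need checking against the actual text.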

You could word-wrap the text in your own app, or paste the group 1 string into a document in the http://www.regexformat.com app, then right-click and choose Misc Utilities -> Word Wrap from the context menu.
Enter a value of about 60 for the max line length.

That produces about 5k of wrapped text, like the (truncated) excerpt below.

The risks described below could materially and adversely 
affect our business, results of operations, financial 
condition and liquidity.  Our business operations could also
be affected by additional factors that apply to all 
companies operating in the U.S. and globally.Strategic 
RisksGeneral or macro-economic factors, both domestically 
and internationally, may materially adversely affect our 
financial performance.General economic conditions, globally 
or in one or more of the markets we serve, may adversely 
affect our financial performance.  Higher interest rates, 
lower or higher prices of petroleum products, including 
crude oil, natural gas, gasoline, and diesel fuel, higher 
costs for electricity and other energy, weakness in the 
housing market, inflation, deflation, increased costs of 
essential services, such as medical care and utilities, 
higher levels of unemployment, decreases in consumer 
disposable income, unavailability of consumer credit, higher
consumer debt levels, changes in consumer spending and 
shopping patterns, fluctuations in currency exchange rates, 
higher tax rates, imposition of new taxes and surcharges, 
other changes in tax laws, other regulatory changes, overall
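If you would rather do the word-wrap step in your own code, a minimal sketch with Python's standard `textwrap` module could look like this. The helper name `wrap_section` and the `sample` string are assumptions for illustration; the width of 60 mirrors the max line length suggested above.

```python
import textwrap


def wrap_section(text, width=60):
    """Collapse runs of whitespace, then wrap to at most `width` columns."""
    collapsed = " ".join(text.split())
    return textwrap.fill(collapsed, width=width)


# Short stand-in for the captured group 1 string.
sample = (
    "The risks described below could materially and adversely affect "
    "our business, results of operations, financial condition and liquidity."
)
print(wrap_section(sample))
```

Collapsing whitespace first means any line breaks left over from the HTML do not carry into the wrapped output.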
