
Currently, I'm trying to scrape 10-K submission text files on sec.gov.

Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt

The text document contains things like HTML tags, CSS styles, and JavaScript. Ideally, I'd like to scrape only the content after removing all the tags and styling.

First, I tried the obvious get_text() method from BeautifulSoup. That didn't work out.
Then I tried using a regex to remove everything between < and >. Unfortunately, this didn't fully work either: it leaves some tags, styles, and scripts behind.

Does anyone have a clean solution for me to accomplish my goal?

Here is my code so far:

import requests
import re

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)
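
For reference, here is a small sketch of what goes wrong with that regex. The sample string is made up for illustration and is not taken from the filing; it only assumes that the document contains script blocks and tags that wrap across lines:

```python
import re

# Made-up sample (an assumption, not actual filing content): a script block
# plus a tag whose attributes wrap onto a second line.
sample = '<script>console.log("test");</script>\n<TD\nALIGN="center">Address</TD>'

# The pattern from the question: `.` does not match newlines by default,
# so the tag that spans two lines is never matched.
naive = re.sub(r'<.*?>', '', sample)

# With re.DOTALL the wrapped tag is removed as well, but the text between
# <script> and </script> is still there, because only the tags are stripped.
dotall = re.sub(r'<.*?>', '', sample, flags=re.DOTALL)

print(repr(naive))
print(repr(dotall))
```

In both cases the script body survives, which suggests a parser-based approach is a better fit than a regex here.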

1 Answer

Let's set a dummy string based on the example:

original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""

Now let's remove all the javascript.

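# Note: in recent lxml releases this module lives in the separate
# lxml_html_clean package, which may need to be installed as well.
from lxml.html.clean import Cleaner  # remove javascript

# Delete javascript tags (some other options are left for the sake of example).

cleaner = Cleaner(
    comments=True,  # True = remove comments
    meta=True,      # True = remove meta tags
    scripts=True,   # True = remove script tags
    embedded=True,  # True = remove embedded tags
)
# Passing a string in returns the cleaned HTML as a string.
clean_dom = cleaner.clean_html(original_content)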

(From https://stackoverflow.com/a/46371211/1204332)

And then we can either remove the HTML tags (extract the text) with the HTMLParser library:

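# Python 3 import; in Python 2 this was `from HTMLParser import HTMLParser`.
from html.parser import HTMLParser

# Strip HTML.

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # sets up the parser state (this calls reset())
        self.fed = []
    def handle_data(self, d):
        # Called for text between tags; the tags themselves are ignored.
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any buffered trailing text
    return s.get_data()

text_content = strip_tags(clean_dom)

print(text_content)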

(From: https://stackoverflow.com/a/925630/1204332)
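
If installing lxml is not an option, the two steps above can be roughly combined using only the standard library. This is a minimal sketch, not a tested solution for full filings: the names TextOnlyParser and html_to_text are made up for this example, and it assumes that dropping the contents of script and style elements is enough for your documents.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    # Hypothetical choice of elements whose contents should be dropped.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <SCRIPT> arrives as "script".
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def html_to_text(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.get_text()


dummy = (
    '<script>console.log("test");</script>\n'
    '<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">'
    '(Address of principal executive offices)</FONT></TD>'
)
print(html_to_text(dummy))
```

HTML comments are dropped as well, since the base class ignores them unless handle_comment is overridden.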

Or we could get the text with the lxml library:

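from lxml.html import fromstring

# Use the cleaned HTML here: text_content() also returns the text inside
# <script> elements, so the raw original_content would leak the javascript.
print(fromstring(clean_dom).text_content())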

2 Comments

The fact that we are using a class here is just an implementation detail for this library (HTMLParser). You can see the documentation here: docs.python.org/2/library/htmlparser.html . As you can see in their page, that's how they do it. Classes are handy, have a look when you have the time. :) Good coding, and welcome to Stack Overflow!
I guess the difference lies in the parsers and methods used. lxml is a binding for the C libraries libxml2 and libxslt, whereas HTMLParser is a much simpler, pure-Python solution. For the sake of completeness, I added the lxml option to the answer. If all you need is to strip the HTML tags, you could probably get away with just HTMLParser. In my experience, lxml was often the go-to tool, but I still use HTMLParser for removing HTML tags, as it gets the job done fine.
