Get a clean string from HTML, CSS and JavaScript

Question

Currently, I'm trying to scrape 10-K submission text files on sec.gov.

Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt

The text document contains things like HTML tags, CSS styles, and JavaScript. Ideally, I'd like to scrape only the content after removing all the tags and styling.

First, I tried the obvious get_text() method from BeautifulSoup, but that didn't work.
Then I tried using a regex to remove everything between < and >. Unfortunately, this didn't fully work either: some tags, styles, and scripts remain.

Does anyone have a clean solution for me to accomplish my goal?

Here is my code so far:

import requests
import re

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)

Ivan Chaer · Accepted Answer · 2018-09-05 21:30:42Z


Let's set a dummy string based on the example:

original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""

Now let's remove all the JavaScript.

# On lxml 5.2 and later, this import also needs the separate lxml_html_clean package.
from lxml.html.clean import Cleaner

# Delete script tags (some other options are left for the sake of example).

cleaner = Cleaner(
    comments=True,  # True = remove comments
    meta=True,      # True = remove meta tags
    scripts=True,   # True = remove script tags
    embedded=True,  # True = remove embedded content
)
clean_dom = cleaner.clean_html(original_content)

(From https://stackoverflow.com/a/46371211/1204332)

And then we can either remove the HTML tags (extract the text) with the HTMLParser library:

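# Python 3 module path; in Python 2 this was "from HTMLParser import HTMLParser".
from html.parser import HTMLParser

# Strip HTML.

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # initialise the parser state
        self.fed = []
    def handle_data(self, d):
        # Called for each run of text between tags.
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any buffered text
    return s.get_data()

text_content = strip_tags(clean_dom)

print(text_content)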

(From: https://stackoverflow.com/a/925630/1204332)
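If installing lxml is not an option, the same HTMLParser idea can be extended to skip script and style bodies by itself. The following is only a sketch using the Python 3 standard library; the names TextOnlyParser and html_to_text are made up for this example, and it has not been tried against a full 10-K filing.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def html_to_text(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.get_text()
```

Calling html_to_text(original_content) on the dummy string above should leave only the "(Address of principal executive offices)" text plus the surrounding newlines.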

Or we could get the text with the lxml library:

from lxml.html import fromstring

# Parse the cleaned markup so that script contents are not included in the text.
print(fromstring(clean_dom).text_content())
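Whichever extraction route is used, text pulled out of table-heavy markup usually still contains blank lines, runs of spaces, and non-breaking spaces. A possible tidy-up step is sketched below; normalize_whitespace is a hypothetical helper for this example, not part of lxml or HTMLParser.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs and drop empty lines from extracted text."""
    text = text.replace("\xa0", " ")  # non-breaking space, i.e. a decoded &nbsp;
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

This keeps one line of output per non-empty line of input, so paragraph boundaries survive while the padding goes away.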

edited Sep 5, 2018 at 21:30

answered Sep 5, 2018 at 17:11


2 Comments

Ivan Chaer Over a year ago

The fact that we are using a class here is just an implementation detail of this library (HTMLParser). You can see the documentation here: docs.python.org/2/library/htmlparser.html. As you can see on that page, that's how they do it. Classes are handy, have a look when you have the time. :) Good coding, and welcome to Stack Overflow!

Ivan Chaer Over a year ago

I guess the difference lies in the parsers and methods used. While lxml is a binding for the C libraries libxml2 and libxslt, the HTMLParser library is a much simpler, pure-Python solution. For the sake of completeness, I added the lxml option to the answer. If all you need is to strip the HTML tags, you could perhaps get away with just HTMLParser. In my experience, lxml has often been the go-to tool, but I still use HTMLParser for removing HTML tags, as it gets the job done fine.
