How to strip all HTML, CSS and JS code or tags from an HTML page in Python [duplicate]

Question

Possible Duplicate:
BeautifulSoup Grab Visible Webpage Text
Web scraping with Python

Say I have a very complex HTML page consisting of the usual HTML tags, with CSS and JS mixed in. We might see all the worst cases.

All I want is to strip all of the above tags and code and return the "text".

In simple terms:

<html><body>Text</body></html>

This might contain JS, CSS, etc.

I am trying to use BeautifulSoup, but it's not removing the JS from the code. Now I am thinking of using a regex, but I'm not sure how to do it.

Edit 1

Here is my attempt on a simple Bootstrap HTML page:

from bs4 import BeautifulSoup as bs
import requests

# MY_URL is a placeholder for the URL of the page being fetched
bs(requests.get(MY_URL).text, "html.parser").get_text()

Returned text:

html
Home
Le styles
body {
        padding-top: 10%;
        padding-left: 30%;
      }
HTML5 shim, for IE6-8 support of HTML5 elements
[if lt IE 9]>
      <script src="http://htm...html5.js"></script>
    <![endif]
Home | Under Construction
Sample Page 1
The app
might
face some ........
Firefox
. Ple..
/container
var _gaq = _gaq || [];

  _gaq.push(['_trackPageview']);

  (function() {
    var ga = do...............
  })();

BeautifulSoup should allow you to remove the content of the <script> tag (i.e. the JavaScript), shouldn't it?
– Reinstate Monica -- notmaynard, Commented Jan 15, 2013 at 18:36
No, you don't need regex for that. Show us your code and the HTML, and where you got stuck.
– root, Commented Jan 15, 2013 at 18:39
Take a look at my question above. Actually, I am working on a very big database of URLs, all of them random. I need to extract the "text" of the "body". How do I do it?
– Dennis Ritchie, Commented Jan 15, 2013 at 18:50

het.oosten · Accepted Answer · 2013-01-15 18:52:03Z

1

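Django uses this function to strip tags from text:

import re
# force_unicode is a Django helper (django.utils.encoding in Django versions of that time)

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    return re.sub(r'<[^>]*?>', '', force_unicode(value))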

(You won't need the force_unicode part)
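
For reference, here is a minimal Python 3 sketch of the same regex with the force_unicode call dropped. The sample string is made up for illustration and is not taken from the question's page. It shows that the pattern removes only the tags themselves, so whatever sits between <script> and </script> stays in the result:

```python
import re

def strip_tags(value):
    """Return the given HTML with all tags stripped (text between tags is kept)."""
    return re.sub(r'<[^>]*?>', '', value)

# Hypothetical sample input, not taken from the question's page.
sample = "<p>Text</p><script>var x = 1;</script>"
print(strip_tags(sample))  # prints: Textvar x = 1;
```

So this works for the simple <html><body>Text</body></html> case, but on its own it does not get rid of inline JS or CSS.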

Simon Steinberger Dec 11, 2024 at 8:30

I think this will still leave the JS code in place and will only strip the surrounding <script> tags, which is not really what the OP asked for. Sorry, I don't have a working example, but xml.etree and BeautifulSoup should be able to do what you want.
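
One way this could look with BeautifulSoup is sketched below. It assumes bs4 is installed and uses the standard-library "html.parser" backend. The helper name visible_text and the sample markup are made up for illustration, loosely modelled on the output shown in the question. The idea is to delete the <script> and <style> elements and any HTML comments from the parsed tree before asking for the text:

```python
from bs4 import BeautifulSoup, Comment

def visible_text(html):
    """Return the page text without script/style bodies or HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose content is code rather than page text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Remove HTML comments (IE conditional comments are ordinary comments too).
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # Collapse the whitespace left behind by the removed nodes.
    return " ".join(soup.get_text(separator=" ").split())

# Hypothetical sample page, not the actual page from the question.
sample = """<html><head><style>body { padding-top: 10%; }</style></head>
<body><!-- /container --><p>Home | Under Construction</p>
<script>var _gaq = _gaq || [];</script></body></html>"""
print(visible_text(sample))  # prints: Home | Under Construction
```

Recent BeautifulSoup releases may already skip some of these strings in get_text(), but removing them explicitly keeps the behaviour the same across versions.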
