0

Possible Duplicate:
BeautifulSoup Grab Visible Webpage Text
Web scraping with Python

Say I am a very complex HTML page consisting usual HTML tags, CSS & JS in the middle. We might see all worst cases.

All I want is strip all the above tags/ code and return "text".

In simple terms:

<html><body>Text</body></html>

This might contain JS, CSS etc. etc..

I am trying to use BeautifulSoup but its not removing JS from the code.. Now ,I am thinking to use Regex.. but not sure how to do

edit1

Here is my try on a simple bootstrap html page...

from bs4 import BeautifulSoup as bs
import requests

bs( requests.get(MY-URL).text ).get_text()

$ return text

html
Home
Le styles
body {
        padding-top: 10%;
        padding-left: 30%;
      }
HTML5 shim, for IE6-8 support of HTML5 elements
[if lt IE 9]>
      <script src="http://htm...html5.js"></script>
    <![endif]
Home | Under Construction
Sample Page 1
The app
might
face some ........
Firefox
. Ple..
/container
var _gaq = _gaq || [];

  _gaq.push(['_trackPageview']);

  (function() {
    var ga = do...............
  })();
3
  • 2
    BeautifulSoup should allow you to remove the content of the <script> tag (i.e. the JavaScript), doesn't it? Commented Jan 15, 2013 at 18:36
  • Hh no. you don't need regex for that, show us your code and the html, and where you got stuck. Commented Jan 15, 2013 at 18:39
  • take a look at my above question. Actually, I am working on a very big database of urls..all are random. I need to extract "Text" of "body". How do I do it? Commented Jan 15, 2013 at 18:50

1 Answer 1

1

Django using this function to strip tags from text:

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    return re.sub(r'<[^>]*?>', '', force_unicode(value))

(You won't need the force_unicode part)

Sign up to request clarification or add additional context in comments.

1 Comment

I think, this will still leave the JS code in place and will only strip the surrounding <script> tags. Not really what OP asked. Sorry I don't have a working example, but xml.etree and BeautifulSoup should be able to do what you like.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.