
I don't know much about HTML. How do you extract just the text from a page? For example, if the HTML page reads:

<meta name="title" content="How can I make money at home online? No gimmacks please? - Yahoo! Answers">
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>

I just want to extract this.

How can I make money at home online? No gimmicks please? - Yahoo! Answers

I am using this function, built on the re module:

import re

def striphtml(data):
    # Replace every tag (anything between < and >) with a single space
    p = re.compile(r'<.*?>')
    return p.sub(' ', data)

but it still isn't doing what I intend it to do.

The above function is called as:

for lines in filehandle.readlines():
    #k = str(section[6].strip())
    myFile.write(lines)

    lines = striphtml(lines)
    content.append(lines)

3 Answers

2

Don't use regular expressions for HTML/XML parsing. Try BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) instead.

from BeautifulSoup import BeautifulSoup  # with BeautifulSoup 4: from bs4 import BeautifulSoup
soup = BeautifulSoup('Your resource<title>hi</title>')
soup.title.string  # the text inside <title>, here 'hi'
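
To illustrate why the regex approach is fragile, here is a small sketch that reuses the pattern from the question. The two input strings are made up for illustration only. It suggests two likely reasons the function misbehaves: text stored in an attribute (such as the meta tag's content) disappears along with the tag, and a '>' inside a quoted attribute value ends the match too early.

```python
import re

def striphtml(data):
    # Same pattern as in the question: replace each <...> run with a space
    return re.sub(r'<.*?>', ' ', data)

# Hypothetical inputs, chosen only to show the failure modes
meta_line = '<meta name="title" content="How can I make money at home online?">'
tricky = '<a title="a > b">link</a>'

# The whole <meta> tag is removed, so the text in its content attribute is lost
meta_result = striphtml(meta_line)

# The pattern stops at the first '>', which here sits inside a quoted attribute
tricky_result = striphtml(tricky)

print(repr(meta_result))
print(repr(tricky_result))
```

In the second case part of the attribute value leaks into the output, which a real HTML parser would not do.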

1 Comment

Update: the package is now named bs4, so try: from bs4 import BeautifulSoup
2

Use an HTML parser for that. One option is BeautifulSoup.

To get the text content of the page:

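from BeautifulSoup import BeautifulSoup  # with BeautifulSoup 4: from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html)
# Collect every text node in the document, then join them with spaces
text_nodes = soup.findAll(text=True)
result = ' '.join(text_nodes)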
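
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only a minimal sketch, not a drop-in replacement: the class name TextExtractor and the helper extract_text are names I made up, and skipping script and style content is my own assumption about what "text" should mean.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)

def extract_text(html):
    # Hypothetical helper: feed the markup and join the collected text nodes
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = '<meta name="title" content="x"><title>My title</title><p>Hello <b>world</b></p>'
print(extract_text(sample))
```

Note that feeding the whole document at once, rather than line by line as in the question, lets the parser cope with tags that span several lines.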


1

I usually use lxml (http://lxml.de/) for HTML parsing. It is easy to use, and you can select tags with XPath, which keeps things both simple and fast.

I have a usage example in a script I wrote to read an XML feed and count the words:

https://gist.github.com/1425228

You can also find more examples in the documentation: http://lxml.de/lxmlhtml.html
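
For the page in the question, the XPath approach might look roughly like this. It is a sketch that assumes lxml is installed (pip install lxml); the page string is a shortened, made-up version of the asker's HTML, and the variable names are mine.

```python
from lxml import html as lxml_html  # third-party package, assumed installed

# Shortened, hypothetical stand-in for the page in the question
page = """<html><head>
<meta name="title" content="How can I make money at home online?">
<title>How can I make money at home online? No gimmicks please? - Yahoo! Answers</title>
</head><body><p>Some <b>body</b> text.</p></body></html>"""

tree = lxml_html.fromstring(page)

# Text inside the <title> element
title = tree.xpath("//title/text()")[0]

# Value of the content attribute on the <meta name="title"> tag
meta_title = tree.xpath('//meta[@name="title"]/@content')[0]

# All text inside <body>, with the tags stripped
body_text = tree.xpath("//body")[0].text_content()

print(title)
print(meta_title)
print(body_text)
```

Unlike tag stripping with a regex, this also gives access to text stored in attributes, such as the meta tag's content.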
