I'm trying to work on a project about page ranking.

I want to make an index (dictionary) which looks like this:
file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]
file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]

Fetching links is easy - look for anchor tags.

My question is: how do I fetch the text? The text in the HTML files is not enclosed within any tags like <p>.

Thanks in advance for all the help

2 Answers

Use an HTML parser - something like BeautifulSoup.

1 Comment

Yes, I'm using BeautifulSoup. Unfortunately, I'm unable to parse text that's not enclosed within any tags.

If the text isn't enclosed in tags, is it really HTML?
As Amber says, you'll have an easier job of this using some HTML parser like BeautifulSoup.
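
For the case in the question, where the text is not wrapped in any element, a parser still reports it as a text node. Here is a minimal sketch of that idea. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs. The names PageIndexer and index_page, and the sample page, are made up for illustration:

```python
from html.parser import HTMLParser


class PageIndexer(HTMLParser):
    """Collects visible words and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called for every text node, including text outside any tag.
        if not self._skip_depth:
            self.words.extend(data.split())


def index_page(html):
    """Return [words, links] for one page (hypothetical helper)."""
    parser = PageIndexer()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return [parser.words, parser.links]


# Assumed sample page: mostly bare text, plus two anchors.
html = 'cat ate food <a href="file2.html">drank</a> milk <a href="file3.html"></a>'
index = {"file1.html": index_page(html)}
print(index)
```

The resulting dictionary has the same shape as the index described in the question: one entry per file, holding a list of words and a list of linked files.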

The example below demonstrates a simple method for returning text within tags.
This method works for any tag AFAIK.

>>> from bs4 import BeautifulSoup as bs
>>> html = '''
... <div><a href="/link1">link1 contents</a></div>
... <div><a href="/link2">link2 contents</a></div>
... '''
>>> soup = bs(html, 'html.parser')
>>> for anchor_tag in soup.find_all('a'):
...     print(anchor_tag.contents[0])
... 
link1 contents
link2 contents

Beyond that, you will probably want a dictionary that counts how many times each term appears in an HTML document. defaultdict is good for that kind of thing:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> for anchor_tag in soup.find_all('a'):
...     d[anchor_tag.contents[0]] += 1
... 
>>> d
defaultdict(<class 'int'>, {'link1 contents': 1, 'link2 contents': 1})
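
To count individual terms rather than whole anchor strings, collections.Counter is a close relative of defaultdict(int). This is a rough sketch that assumes the page text has already been extracted into a plain string. The sample text and the naive lowercase tokenizer are illustrative choices, not part of the original answer:

```python
import re
from collections import Counter

# Assumed sample: text already extracted from one HTML page.
page_text = "The cat ate food. The cat drank milk."

# Naive tokenizer: lowercase the text, then keep runs of letters and digits.
terms = re.findall(r"[a-z0-9]+", page_text.lower())

# Counter maps each term to the number of times it occurs.
term_counts = Counter(terms)
print(term_counts["cat"])
```

A Counter returns 0 for terms it has never seen, so lookups for missing words do not raise KeyError.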

Hopefully that gives you some ideas to run with. Come back and open another question if you run into other issues.
