
I have HTML text that looks like many instances of the following structure:

<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>

What I need to do is index each structure, with the DOCNO, headline, and text, so they can later be analysed (tokenised, etc.).

I was thinking of using BeautifulSoup, and this is the code I have so far:

soup = BeautifulSoup(open("AP880212.html").read())
num = soup.findAll('docno')

But this only gives me results of the following format:

<docno> AP880212-0166 </docno>, <docno> AP880212-0167 </docno>, <docno> AP880212-0168 </docno>, <docno> AP880212-0169 </docno>, <docno> AP880212-0170 </docno>

How do I extract the numbers within the <> ? And link them with the headlines and texts?

Thank you very much,

Sasha

3 Answers


To get the contents of the tags:

docnos = soup.findAll('docno')
for docno in docnos:
    print(docno.contents[0])

5 Comments

And if I wanted to link the doc numbers, the titles, and the texts?
You can iterate over soup.findAll('doc'), collecting the tag contents you want for each document, then iterate over its docno tags inside that loop: create the keys in the outer loop and establish the values in the inner one. That is to say, use a nested loop (not a second, separate loop).
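A minimal sketch of that idea (the sample document below is made up, and I'm using bs4's built-in html.parser, which lowercases the tag names):

```python
from bs4 import BeautifulSoup

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<HEAD>Some headline</HEAD>
<TEXT>Lots of text here</TEXT>
</DOC>"""

soup = BeautifulSoup(html, "html.parser")
index = {}
for doc in soup.findAll("doc"):          # outer loop: one pass per <DOC>
    docno = doc.find("docno").getText().strip()
    # inner work: pull the headline and body out of the same <DOC>
    index[docno] = (doc.find("head").getText().strip(),
                    doc.find("text").getText().strip())
print(index)
```

Note that doc.find("text") is used rather than doc.text, because .text is bs4's own "all contained text" property and would shadow the tag lookup.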
Also consider using BeautifulStoneSoup, since this is XML. It's easy to use (just like BeautifulSoup). Do from BeautifulSoup import BeautifulStoneSoup and use it exactly as you would ordinarily.
@That1Guy BeautifulStoneSoup is deprecated in bs4. I admit it's what I first tried as well :). It has been replaced by passing the features="xml" kwarg to the BeautifulSoup constructor.
@PreetKukreti Absolutely right. I was under the impression the OP was using BeautifulSoup 3 (findAll vs find_all in their code). Thank you for the correction. I personally use 3 as I've found bs4 to be slightly buggy.

Something like this:

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
"""

import bs4

d = {}

soup = bs4.BeautifulSoup(html, features="xml")
docs = soup.findAll("DOC")
for doc in docs:
    d[doc.DOCNO.getText()] = (doc.HEAD.getText(), doc.TEXT.getText())

print(d)
#{' XXX-2222 ':
#   ('Reports Former Saigon Officials Released from Re-education Camp',
#    '\nLots of text here\n')}

Note that I pass features="xml" to the constructor. This is because your input contains a lot of non-standard HTML tags. You will probably also want to .strip() the text before you save it into the dictionary, so the keys and values are not whitespace-sensitive (unless that is your intention, of course).
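For instance, a small plain-Python illustration of the stripping, using made-up sample values in the shape of the dictionary above:

```python
# Dictionary as produced above, keys and values still carrying whitespace.
entries = {" XXX-2222 ": ("Some headline", "\nLots of text here\n")}

# Strip surrounding whitespace so lookups like d["XXX-2222"] just work.
cleaned = {key.strip(): tuple(v.strip() for v in values)
           for key, values in entries.items()}
print(cleaned)
```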

Update:

If there are multiple DOCs in the same file and features="xml" seems to limit parsing to one, it's probably because the XML parser expects exactly one root element.

E.g. If you wrap your entire input XML in a single root element, it should work:

<XMLROOT>
    <!-- Existing XML (e.g. list of DOC elements) -->
</XMLROOT>

You can either add this wrapper in the file itself, or (what I would suggest) add it programmatically to the input text before you pass it to BeautifulSoup:

root_element_name = "XMLROOT"  # this can be anything
rooted_html = "<{0}>\n{1}\n</{0}>".format(root_element_name, html)
soup = bs4.BeautifulSoup(rooted_html, features="xml")
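The single-root requirement is easy to demonstrate with the standard library's XML parser (a sketch with a made-up two-document string; BeautifulSoup's XML mode runs into the same rule the answer describes):

```python
import xml.etree.ElementTree as ET

two_docs = "<DOC><DOCNO>A</DOCNO></DOC><DOC><DOCNO>B</DOCNO></DOC>"

# Two top-level elements: not well-formed XML, so parsing fails.
try:
    ET.fromstring(two_docs)
    well_formed = True
except ET.ParseError:
    well_formed = False

# Wrapped in a single root element, the same content parses fine.
root = ET.fromstring("<XMLROOT>{0}</XMLROOT>".format(two_docs))
docnos = [doc.find("DOCNO").text for doc in root.findall("DOC")]
print(well_formed, docnos)
```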

4 Comments

And if I wanted to read a series of .html files out of a directory? Currently I'm doing path = '/TREC-AP88-90-qrels1-50/Docs', then for infile in glob.glob(os.path.join(path, '*.html')): soup = BeautifulSoup(file(infile).read(), features="xml"), and then your code, but it's not giving me the correct results..
In fact, it's the 'xml' part that makes it stop after the first doc.. is there any way to go around that?
@user2070177 See my update. I've tested it with multiple docs and it works.
@user2070177 You could wrap this code in a function that takes the raw HTML text and returns a dictionary, which you can then aggregate into a master dictionary (in the scope of your main directory/file iteration loop) using dict.update(..). This is a much cleaner and better-separated design.
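A hedged sketch of that design (the helper name parse_docs and the directory handling are assumptions, not the poster's actual code; bs4's html.parser is used here, which lowercases the tag names):

```python
import glob
import os
from bs4 import BeautifulSoup

def parse_docs(raw_text):
    # Hypothetical helper: one file's raw text in, {docno: (head, text)} out.
    soup = BeautifulSoup(raw_text, "html.parser")
    result = {}
    for doc in soup.findAll("doc"):
        result[doc.find("docno").getText().strip()] = (
            doc.find("head").getText().strip(),
            doc.find("text").getText().strip())
    return result

# Aggregate every file in the directory into one master dictionary.
master = {}
for infile in glob.glob(os.path.join("/TREC-AP88-90-qrels1-50/Docs", "*.html")):
    with open(infile) as f:
        master.update(parse_docs(f.read()))
```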
docnos = soup.findAll('docno')
for docno in docnos:
    print(docno.renderContents())

You can also use renderContents() to extract information from tags.
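For example (a sketch on a made-up snippet; note that in bs4, renderContents() returns bytes, so you may want to decode the result):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<docno> AP880212-0166 </docno>", "html.parser")
raw = soup.docno.renderContents()   # bytes in bs4
print(raw.decode())
```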

