
I have HTML text that looks like many instances of the following structure:

<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>

What I need to do is index each structure, with the DOCNO, headline, and text, so they can later be analysed (tokenised, etc.).

I was thinking of using BeautifulSoup, and this is the code I have so far:

soup = BeautifulSoup(open("AP880212.html").read())
num = soup.findAll('docno')

But this only gives me results of the following format:

<docno> AP880212-0166 </docno>, <docno> AP880212-0167 </docno>, <docno> AP880212-0168 </docno>, <docno> AP880212-0169 </docno>, <docno> AP880212-0170 </docno>

How do I extract the numbers within the <> ? And link them with the headlines and texts?

Thank you very much,

Sasha

3 Answers


To get the contents of the tags:

docnos = soup.findAll('docno')
for docno in docnos:
    print(docno.contents[0])

5 Comments

And if I wanted to link the doc numbers, the titles, and the texts?
You can iterate over soup.findAll('doc'), collecting the tag contents you want for each document, then iterate over its docno tags inside that loop: create the keys in the outer loop and establish the values in the inner one. That is to say, use a nested loop (not a second, separate loop).
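A minimal sketch of that idea (the sample document below is made up, and I'm using bs4's built-in html.parser, which lowercases the tag names):

```python
from bs4 import BeautifulSoup

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<HEAD>Some headline</HEAD>
<TEXT>Lots of text here</TEXT>
</DOC>"""

soup = BeautifulSoup(html, "html.parser")
index = {}
for doc in soup.findAll("doc"):          # outer loop: one pass per <DOC>
    docno = doc.find("docno").getText().strip()
    # inner work: pull the headline and body out of the same <DOC>
    index[docno] = (doc.find("head").getText().strip(),
                    doc.find("text").getText().strip())
print(index)
```

Note that doc.find("text") is used rather than doc.text, because .text is bs4's own "all contained text" property and would shadow the tag lookup.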
Also consider using BeautifulStoneSoup, since this is XML. It's easy to use (just like BeautifulSoup). Do from BeautifulSoup import BeautifulStoneSoup and use it exactly as you would ordinarily.
@That1Guy BeautifulStoneSoup is deprecated in bs4. I admit it's what I first tried as well :). It has been replaced by passing the features="xml" kwarg to the BeautifulSoup constructor.
@PreetKukreti Absolutely right. I was under the impression the OP was using BeautifulSoup 3 (findAll vs find_all in their code). Thank you for the correction. I personally use 3 as I've found bs4 to be slightly buggy.

Something like this:

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
"""

import bs4

d = {}

soup = bs4.BeautifulSoup(html, features="xml")
docs = soup.findAll("DOC")
for doc in docs:
    d[doc.DOCNO.getText()] = (doc.HEAD.getText(), doc.TEXT.getText())

print(d)
#{' XXX-2222 ':
#   ('Reports Former Saigon Officials Released from Re-education Camp',
#    '\nLots of text here\n')}

Note that I pass features="xml" to the constructor. This is because your input contains a lot of non-standard HTML tags. You will probably also want to .strip() the text before you save it into the dictionary, so the keys and values are not whitespace-sensitive (unless that is your intention, of course).
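For instance, a small plain-Python illustration of the stripping, using made-up sample values in the shape of the dictionary above:

```python
# Dictionary as produced above, keys and values still carrying whitespace.
entries = {" XXX-2222 ": ("Some headline", "\nLots of text here\n")}

# Strip surrounding whitespace so lookups like d["XXX-2222"] just work.
cleaned = {key.strip(): tuple(v.strip() for v in values)
           for key, values in entries.items()}
print(cleaned)
```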

Update:

If there are multiple DOCs in the same file and features="xml" seems to limit parsing to one, it's probably because the XML parser expects exactly one root element.

E.g. If you wrap your entire input XML in a single root element, it should work:

<XMLROOT>
    <!-- Existing XML (e.g. list of DOC elements) -->
</XMLROOT>

You can either add this wrapper in the file itself, or (what I would suggest) add it programmatically to the input text before you pass it to BeautifulSoup:

root_element_name = "XMLROOT"  # this can be anything
rooted_html = "<{0}>\n{1}\n</{0}>".format(root_element_name, html)
soup = bs4.BeautifulSoup(rooted_html, features="xml")
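The single-root requirement is easy to demonstrate with the standard library's XML parser (a sketch with a made-up two-document string; BeautifulSoup's XML mode runs into the same rule the answer describes):

```python
import xml.etree.ElementTree as ET

two_docs = "<DOC><DOCNO>A</DOCNO></DOC><DOC><DOCNO>B</DOCNO></DOC>"

# Two top-level elements: not well-formed XML, so parsing fails.
try:
    ET.fromstring(two_docs)
    well_formed = True
except ET.ParseError:
    well_formed = False

# Wrapped in a single root element, the same content parses fine.
root = ET.fromstring("<XMLROOT>{0}</XMLROOT>".format(two_docs))
docnos = [doc.find("DOCNO").text for doc in root.findall("DOC")]
print(well_formed, docnos)
```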

4 Comments

And if I wanted to read a series of .html files out of a directory? Currently I'm doing path = '/TREC-AP88-90-qrels1-50/Docs', then for infile in glob.glob(os.path.join(path, '*.html')): soup = BeautifulSoup(file(infile).read(), features="xml"), and then your code, but it's not giving me the correct results..
In fact, it's the 'xml' part that makes it stop after the first doc.. is there any way to go around that?
@user2070177 See my update. I've tested it with multiple docs and it works.
@user2070177 You could wrap this code in a function that takes the raw HTML text and returns a dictionary, which you can then aggregate into a master dictionary (in the scope of your main directory/file iteration loop) using dict.update(..). This is a much cleaner and better-separated design.
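A hedged sketch of that design (the helper name parse_docs and the directory handling are assumptions, not the poster's actual code; bs4's html.parser is used here, which lowercases the tag names):

```python
import glob
import os
from bs4 import BeautifulSoup

def parse_docs(raw_text):
    # Hypothetical helper: one file's raw text in, {docno: (head, text)} out.
    soup = BeautifulSoup(raw_text, "html.parser")
    result = {}
    for doc in soup.findAll("doc"):
        result[doc.find("docno").getText().strip()] = (
            doc.find("head").getText().strip(),
            doc.find("text").getText().strip())
    return result

# Aggregate every file in the directory into one master dictionary.
master = {}
for infile in glob.glob(os.path.join("/TREC-AP88-90-qrels1-50/Docs", "*.html")):
    with open(infile) as f:
        master.update(parse_docs(f.read()))
```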
docnos = soup.findAll('docno')
for docno in docnos:
    print(docno.renderContents())

You can also use renderContents() to extract information from tags.
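For example (a sketch on a made-up snippet; note that in bs4, renderContents() returns bytes, so you may want to decode the result):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<docno> AP880212-0166 </docno>", "html.parser")
raw = soup.docno.renderContents()   # bytes in bs4
print(raw.decode())
```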

