I have an HTML file that contains many instances of the following structure:
<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
What I need to do is index each document by its DocNo, headline, and text, so that it can be analysed later (tokenised, etc.).
I was thinking of using BeautifulSoup, and this is the code I have so far:
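from bs4 import BeautifulSoup
# file() only exists in Python 2; open() works in both Python 2 and 3
soup = BeautifulSoup(open("AP880212.html").read(), "html.parser")
num = soup.find_all('docno')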
But this only gives me results of the following format:
<docno> AP880212-0166 </docno>, <docno> AP880212-0167 </docno>, <docno> AP880212-0168 </docno>, <docno> AP880212-0169 </docno>, <docno> AP880212-0170 </docno>
How do I extract just the document numbers inside the tags (e.g. AP880212-0166), without the surrounding <docno> markup? And how do I link each one with its headline and text?
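One way this might be approached, as a minimal sketch rather than a definitive solution: loop over each <doc> element and read its child tags with get_text(), so the number, headline, and text stay together. The sketch assumes the bs4 package with the built-in "html.parser" (other parsers may treat the <head> tag differently); the SAMPLE string, the second document in it, and the index_documents helper are hypothetical stand-ins for the real file and are not part of the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the contents of AP880212.html.
SAMPLE = """
<DOC>
<DOCNO> AP880212-0166 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
<DOC>
<DOCNO> AP880212-0167 </DOCNO>
<HEAD>Second Hypothetical Headline</HEAD>
<TEXT>
More text here
</TEXT>
</DOC>
"""


def index_documents(markup):
    """Return a dict mapping each DOCNO string to its headline and text."""
    # html.parser lowercases tag names, so search for "doc", not "DOC".
    soup = BeautifulSoup(markup, "html.parser")
    index = {}
    for doc in soup.find_all("doc"):
        docno = doc.find("docno")
        if docno is None:
            continue  # skip documents that have no identifier
        head = doc.find("head")
        text = doc.find("text")
        # get_text(strip=True) returns the tag's contents without the markup
        # and without leading or trailing whitespace.
        index[docno.get_text(strip=True)] = {
            "headline": head.get_text(strip=True) if head is not None else "",
            "text": text.get_text(strip=True) if text is not None else "",
        }
    return index


index = index_documents(SAMPLE)
for docno, fields in index.items():
    print(docno, "|", fields["headline"])
```

Keying the dictionary on the DocNo keeps each headline and text attached to its identifier, so later steps such as tokenising can look a document up directly.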
Thank you very much,
Sasha