How can I extract XML text using Python BeautifulSoup?

Question

I'm trying to extract dialog from the Folger Library Shakespeare TEI XML editions. A typical chunk of dialog looks like this:

<sp xml:id="sp-0024" who="#HORATIO">
<speaker xml:id="spk-0024">
<w xml:id="w0003030">HORATIO</w>
</speaker>
<ab xml:id="ab-0024">
<join type="line" xml:id="ftln-0024" n="1.1.24" ana="#short" target="#w0003040 #c0003050 #w0003060 #c0003070 #w0003080 #c0003090 #w0003100 #p0003110"/>
<w xml:id="w0003040" n="1.1.24">A</w>
<c xml:id="c0003050" n="1.1.24"> </c>
<w xml:id="w0003060" n="1.1.24">piece</w>
<c xml:id="c0003070" n="1.1.24"> </c>
<w xml:id="w0003080" n="1.1.24">of</w>
<c xml:id="c0003090" n="1.1.24"> </c>
<w xml:id="w0003100" n="1.1.24">him</w>
<pc xml:id="p0003110" n="1.1.24">.</pc>
</ab>
</sp>

I basically want to get output that looks like this: ['Horatio','A piece of him.'] but for all the dialog of a particular character. In other words, I want to be able to input the Folger Shakespeare TEI XML file and output files like gertrude.txt and horatio.txt each containing all the collected dialog from that particular character.

I can get all the dialog/stage direction/etc of a particular speaker with soup.find_all(who=u'#GERTRUDE') but then I can't seem to do anything else with the results, like drill down further, get the text between the tags, etc, without re-parsing the data all over again. Here's what happens:

>>> gertrude=soup.find_all(who=u'#GERTRUDE')
>>> gertrude.w
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'w'
>>> gertrude.get_text()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'get_text'

Two points: how are you trying to use the data? Can you post a larger sample that would be a useful size for testing against your use case? Yes, we can help you extract the data, but there's an amount of interpretation to do, so knowing how you want to use it is important.
– MattH, Commented May 2, 2013 at 21:37
@MartijnPieters, I'll look into that, thanks. I don't know anything about parsing XML so I just chose the first thing I heard of.
– Jonathan, Commented May 3, 2013 at 1:57
@MattH, Fair enough. I edited my question to make it clearer.
– Jonathan, Commented May 3, 2013 at 1:58
@Jono: Shucks, I was hoping for a larger sample to get a clear picture of the document and how it relates to character dialogue. You're still not making much sense about how you want to use the data. You're saying you want all the lines of dialogue for each character in a named character file. And then you say you know how to get the stage direction. Personally, I'd imagine the relative position of the dialogue and directions is important but you seem to indicate that it isn't.
– MattH, Commented May 3, 2013 at 8:47

Martijn Pieters · Accepted Answer · 2013-05-03 06:57:06Z

BeautifulSoup's .find_all() method returns a ResultSet object, which is a specialized kind of list. You have 0 or more matches, and you need to either loop over that result set or use indexing to get at individual elements contained in the result set:

for speaker in soup.find_all(who=u'#GERTRUDE'):
    print(speaker.get_text())  # each item is a Tag, so Tag methods work here
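
As a rough sketch of how that loop could be extended to collect dialog per character, the example below parses the sample from the question and groups the spoken text by the who attribute. The names dialog_by_speaker and SAMPLE are made up for illustration, and the join element is trimmed. It assumes each speech keeps its words in w, c and pc elements inside ab, as in the sample; stage directions and other structures in the full Folger files are not handled. It uses the built-in "html.parser" only so that it runs without lxml; for real TEI files the "xml" parser (which requires lxml) is probably the better choice.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Sample speech from the question (join element trimmed for brevity).
SAMPLE = """
<sp xml:id="sp-0024" who="#HORATIO">
<speaker xml:id="spk-0024">
<w xml:id="w0003030">HORATIO</w>
</speaker>
<ab xml:id="ab-0024">
<join type="line" xml:id="ftln-0024" n="1.1.24" ana="#short"/>
<w xml:id="w0003040" n="1.1.24">A</w>
<c xml:id="c0003050" n="1.1.24"> </c>
<w xml:id="w0003060" n="1.1.24">piece</w>
<c xml:id="c0003070" n="1.1.24"> </c>
<w xml:id="w0003080" n="1.1.24">of</w>
<c xml:id="c0003090" n="1.1.24"> </c>
<w xml:id="w0003100" n="1.1.24">him</w>
<pc xml:id="p0003110" n="1.1.24">.</pc>
</ab>
</sp>
"""


def dialog_by_speaker(xml_text):
    """Hypothetical helper: map each speaker id to a list of spoken lines."""
    # Assumption: "html.parser" is used to avoid the lxml dependency.
    soup = BeautifulSoup(xml_text, "html.parser")
    speeches = defaultdict(list)
    # find_all() returns a ResultSet, so iterate to reach the individual Tags.
    for sp in soup.find_all("sp"):
        who = sp.get("who", "").lstrip("#")
        for ab in sp.find_all("ab"):
            # Join only word, space and punctuation elements, in document
            # order, so the newlines between tags are not included.
            parts = ab.find_all(["w", "c", "pc"])
            speeches[who].append("".join(el.get_text() for el in parts))
    return dict(speeches)


speeches = dialog_by_speaker(SAMPLE)
print(speeches)
```

From there, each character's list could be joined with newlines and written to a file such as horatio.txt. Filtering with find_all("sp", who="#GERTRUDE") instead of find_all("sp") would restrict the result to a single character.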
