I'm trying to extract dialog from the Folger Library Shakespeare TEI XML editions. A typical chunk of dialog looks like this:

<sp xml:id="sp-0024" who="#HORATIO">
<speaker xml:id="spk-0024">
<w xml:id="w0003030">HORATIO</w>
</speaker>
<ab xml:id="ab-0024">
<join type="line" xml:id="ftln-0024" n="1.1.24" ana="#short" target="#w0003040 #c0003050 #w0003060 #c0003070 #w0003080 #c0003090 #w0003100 #p0003110"/>
<w xml:id="w0003040" n="1.1.24">A</w>
<c xml:id="c0003050" n="1.1.24"> </c>
<w xml:id="w0003060" n="1.1.24">piece</w>
<c xml:id="c0003070" n="1.1.24"> </c>
<w xml:id="w0003080" n="1.1.24">of</w>
<c xml:id="c0003090" n="1.1.24"> </c>
<w xml:id="w0003100" n="1.1.24">him</w>
<pc xml:id="p0003110" n="1.1.24">.</pc>
</ab>
</sp>

I basically want to get output that looks like this: ['Horatio','A piece of him.'] but for all the dialog of a particular character. In other words, I want to be able to input the Folger Shakespeare TEI XML file and output files like gertrude.txt and horatio.txt each containing all the collected dialog from that particular character.

I can get all the dialog/stage direction/etc of a particular speaker with soup.find_all(who=u'#GERTRUDE') but then I can't seem to do anything else with the results, like drill down further, get the text between the tags, etc, without re-parsing the data all over again. Here's what happens:

>>> gertrude=soup.find_all(who=u'#GERTRUDE')
>>> gertrude.w
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'w'
>>> gertrude.get_text()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'ResultSet' object has no attribute 'get_text'
  • Why not use lxml directly? Commented May 2, 2013 at 21:07
  • Two points: how are you trying to use the data, and can you post a larger sample, one big enough to test against your use case? Yes, we can help you extract the data, but there's an amount of interpretation to do, so knowing how you want to use it is important. Commented May 2, 2013 at 21:37
  • @MartijnPieters, I'll look into that, thanks. I don't know anything about parsing XML so I just chose the first thing I heard of. Commented May 3, 2013 at 1:57
  • @MattH, Fair enough. I edited my question to make it clearer. Commented May 3, 2013 at 1:58
  • @Jono: Shucks, I was hoping for a larger sample to get a clear picture of the document and how it relates to character dialogue. You're still not making much sense about how you want to use the data. You're saying you want all the lines of dialogue for each character in a named character file. And then you say you know how to get the stage direction. Personally, I'd imagine the relative position of the dialogue and directions is important but you seem to indicate that it isn't. Commented May 3, 2013 at 8:47

1 Answer

BeautifulSoup's .find_all() method returns a ResultSet object, which is a specialized kind of list. You have 0 or more matches, and you need to either loop over that result set or use indexing to get at individual elements contained in the result set:

for speaker in soup.find_all(who=u'#GERTRUDE'):
    print(speaker.get_text())  # each item is a single Tag, so Tag methods work here
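
To go from looping over matches to the per-character output the question asks for, here is a rough sketch. It is not BeautifulSoup: it swaps in the standard library's xml.etree.ElementTree so it runs without extra installs, and findall()/iter() there also give you many elements that you loop over. The SAMPLE string is a trimmed copy of the snippet from the question wrapped in a made-up <body> root, and local_name() and collect_dialog() are hypothetical helper names. It assumes each <sp> has a single who value and that the spoken text lives in the <w>, <c> and <pc> children of <ab>; it makes no attempt to preserve line breaks.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the sample from the question, wrapped in a hypothetical <body> root.
SAMPLE = """<body>
<sp xml:id="sp-0024" who="#HORATIO">
<speaker xml:id="spk-0024"><w xml:id="w0003030">HORATIO</w></speaker>
<ab xml:id="ab-0024">
<w xml:id="w0003040" n="1.1.24">A</w>
<c xml:id="c0003050" n="1.1.24"> </c>
<w xml:id="w0003060" n="1.1.24">piece</w>
<c xml:id="c0003070" n="1.1.24"> </c>
<w xml:id="w0003080" n="1.1.24">of</w>
<c xml:id="c0003090" n="1.1.24"> </c>
<w xml:id="w0003100" n="1.1.24">him</w>
<pc xml:id="p0003110" n="1.1.24">.</pc>
</ab>
</sp>
</body>"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, so namespaced files also match."""
    return tag.rsplit("}", 1)[-1]


def collect_dialog(root):
    """Map each speaker id (e.g. 'HORATIO') to a list of speech strings."""
    speeches = defaultdict(list)
    for sp in root.iter():
        if local_name(sp.tag) != "sp" or "who" not in sp.attrib:
            continue
        parts = []
        for ab in sp:
            if local_name(ab.tag) != "ab":
                continue  # skip <speaker> and anything else that is not dialog
            for token in ab.iter():
                # words, spacing characters and punctuation carry the text
                if local_name(token.tag) in ("w", "c", "pc"):
                    parts.append(token.text or "")
        # assumption: 'who' holds one reference such as '#HORATIO'
        speeches[sp.get("who").lstrip("#")].append("".join(parts))
    return dict(speeches)


dialog = collect_dialog(ET.fromstring(SAMPLE))
print(dialog)
```

From there, writing horatio.txt, gertrude.txt and so on is an ordinary loop over dialog.items(), opening one file per key. For a full play you would pass ET.parse(path).getroot() to collect_dialog() instead of the sample string.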