I want to extract the text from the subtitle transcript of a YouTube video. I got the XML file using video.google.com, and now I want to pull the text out of it. I tried the following, but I get the error AttributeError: 'NoneType' object has no attribute 'text'. I am including only a sample of the XML file because the full file is too long.

from xml.etree import ElementTree as ET  # cElementTree was removed in Python 3.9
xmlstring  = """<timedtext format="3">
<style type="text/css" id="night-mode-pro-style"/>
<link type="text/css" rel="stylesheet" id="night-mode-pro-link"/>
<head>
<pen id="1" fc="#E5E5E5"/>
<pen id="2" fc="#CCCCCC"/>
<ws id="0"/>
<ws id="1" mh="2" ju="0" sd="3"/>
<wp id="0"/>
<wp id="1" ap="6" ah="20" av="100" rc="2" cc="40"/>
</head>
<body>
<w t="0" id="1" wp="1" ws="1"/>
<p t="30" d="5010" w="1">
<s ac="252">in</s>
<s t="569" ac="252">the</s>
<s t="1080" ac="252">last</s>
<s t="1260" ac="227">video</s>
<s p="2" t="1500" ac="187">we</s>
<s p="2" t="1860" ac="160">started</s>
<s p="2" t="2190" ac="234">talking</s>
</p>
<p t="2570" d="2470" w="1" a="1"></p>
<p t="2580" d="5100" w="1">
<s ac="252">about</s>
<s t="59" ac="227">Markov</s>
<s t="660" ac="252">models</s>
<s p="1" t="1200" ac="217">as</s>
<s t="1379" ac="252">a</s>
<s t="1440" ac="252">way</s>
<s t="1949" ac="252">to</s>
<s t="2009" ac="252">model</s>
</p>
</body>
</timedtext>"""

words = []
root = ET.fromstring(xmlstring)
for page in list(root):
    words.append(page.find('s').text)

text = ' '.join(words)

The text of the video is in the <s> tags, but I am not able to extract it. Any idea what to do? Thanks in advance.

2 Answers

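The s tags are inside the p tags, and the p tags are inside the body tag. find() only looks at direct children, so calling it on each child of the root returns None, which is where the AttributeError comes from. You can change the code slightly: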

words = []
root = ET.fromstring(xmlstring)
body = root.find("body")

for page in body.findall("p"):
    for s in page.findall("s"):
        words.append(s.text)

text = ' '.join(words)
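
To see why the original loop fails, here is a minimal sketch on a shortened version of the sample. The sample string and the variable names are mine, chosen for illustration, and are not from the question.

```python
from xml.etree import ElementTree as ET

# Shortened, hypothetical stand-in for the XML in the question.
sample = """<timedtext format="3">
<head><pen id="1"/></head>
<body>
<w t="0" id="1"/>
<p t="30"><s>in</s><s t="569">the</s></p>
<p t="2570"></p>
<p t="2580"><s>about</s><s t="59">Markov</s></p>
</body>
</timedtext>"""

root = ET.fromstring(sample)

# The direct children of the root are <head> and <body>, not <p>.
child_tags = [child.tag for child in root]

# find("s") only searches direct children, so on <head> it returns None;
# calling .text on that None is what raises the AttributeError.
first_hit = root.find("head").find("s")

# Descending through <body> to each <p> reaches the <s> elements.
words = [s.text for p in root.find("body").findall("p") for s in p.findall("s")]
text = " ".join(words)
```

The empty p element contributes nothing, because findall("s") returns an empty list for it.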

1 Comment

Thanks a lot Mitiku.

You can loop over the s tags directly. The path .//s matches s elements at any depth below the root:

root = ET.fromstring(xmlstring) 
words = [s.text for s in root.findall(".//s")] 
text = ' '.join(words)
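
If you want one line of text per p element rather than a single long string, something along these lines might work. It is only a sketch: caption_lines is a hypothetical helper name, the sample is a shortened stand-in for the XML in the question, and skipping s elements with no text is my assumption about what is wanted.

```python
from xml.etree import ElementTree as ET

# Shortened, hypothetical stand-in for the XML in the question.
sample = """<timedtext format="3">
<body>
<p t="30"><s>in</s><s t="569">the</s></p>
<p t="2570"></p>
<p t="2580"><s>about</s><s t="59">Markov</s></p>
</body>
</timedtext>"""


def caption_lines(xml_text):
    """Return one string per non-empty <p>, joining the text of its <s> children."""
    root = ET.fromstring(xml_text)
    lines = []
    for p in root.iter("p"):
        # s.text is None for an empty <s/>, so filter those out before joining.
        words = [s.text.strip() for s in p.iter("s") if s.text and s.text.strip()]
        if words:
            lines.append(" ".join(words))
    return lines


lines = caption_lines(sample)
text = " ".join(lines)
```

Filtering out None also protects ' '.join from a TypeError if the file contains an s element without text.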

1 Comment

Hmm, this is a lot cleaner. Thanks.
