Extracting text from nested tags in XML using BeautifulSoup in Python

Question

I am trying to extract the text from nested tags. For example, the XML has this form:

<thread id = 1_1>
  <post id = 1>
    <title>
      <ne>MediaPortal</ne> Install Guide
    </title>
    <content>
      <ne>MediaPortal</ne> Install Guide 0. Introduction and pre-requisites 
      <ne>MediaPortal</ne> is an open-source and free full-fledged <ne>HTPC</ne>
      front-end. It does everything you can ask for in a media center: video 
      playback, music playback, photo viewing, weather, TV tuning and recording, 
      etc. It has wide community support and thanks to it's excellent plug-in 
      and  skinning framework, there are lots of community-developed extensions 
      you can  pick and choose to make it your own. It is far more configurable 
      than <ne>Windows Media Center</ne>, and it works out-of-the-box with the 
      <ne>MCE</ne> remote. And because it provides so much more configuration 
      some find it a daunting task to install and configure. Therefore, this 
      guide will help alleviate some of that burden and help get a 
      <ne>MediaPortal</ne> installation up &amp; running. This guide is not 
      intended to replace the wonderful <ne>MediaPortal</ne> documentation, but 
      rather to introduce the AVS community to <ne>MediaPortal</ne> and provide
      a quick and easy set-up guide. If you need more details on configuration
    </content>
  </post>
</thread>

I need to extract the data within the <ne> tags and save it in a separate file. I am able to do that, and I then extract the <ne> tags out of the BeautifulSoup object. Now I want to extract the text from the <title> and <content> tags and put it in a separate file. Please suggest how this can be achieved.

After extracting the <ne> tags out of the soup object, if I do

for title in soup.find_all('title'):
    print(title.string)

then it prints None on the console for <title> tags that contained <ne> tags before I extracted them.

jcollado · Accepted Answer · 2011-11-22 07:13:35Z

From the BeautifulSoup documentation:

For your convenience, if a tag has only one child node,
and that child node is a string, the child node is made
available as tag.string, as well as tag.contents[0].

However, in your case:

>>> t = soup.find('title')
>>> t
<title><ne>MediaPortal</ne> Install Guide</title>

The <title> tag here has two child nodes, an <ne> tag and a string, so tag.string is None. You can still use tag.contents or tag.text:

>>> t.contents
[<ne>MediaPortal</ne>, u' Install Guide']
>>> t.text
u'MediaPortalInstall Guide'
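
For readers on the newer bs4 package, here is a minimal sketch of the whole workflow from the question: save the <ne> values, remove those tags, then read the remaining text. The shortened sample string, the variable names, and the choice of "html.parser" are assumptions made for illustration and are not from the original post. The output shown above appears to come from the older BeautifulSoup 3, which joined the strings without the space; in bs4, get_text() keeps the original whitespace unless strip=True is passed.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the question's <content> element (illustrative).
xml = (
    "<post><content><ne>MediaPortal</ne> is an open-source "
    "<ne>HTPC</ne> front-end.</content></post>"
)

# Assumption: the stdlib-backed "html.parser" is enough for this fragment.
# For a real XML file, BeautifulSoup(xml, "xml") (requires lxml) may fit better.
soup = BeautifulSoup(xml, "html.parser")
content = soup.find("content")

# <content> has several children (tags and strings), so .string is None.
string_value = content.string

# get_text() concatenates every nested string, including those inside <ne>.
full_text = content.get_text()

# Collect the <ne> values first, then remove those tags from the tree.
entities = [ne.get_text() for ne in content.find_all("ne")]
for ne in content.find_all("ne"):
    ne.extract()

# With the <ne> tags gone, only the surrounding text remains.
remaining_text = content.get_text()

print(string_value)
print(full_text)
print(entities)
print(remaining_text)
```

Each of entities and remaining_text can then be written to its own file with an ordinary open(..., "w") call.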

1 Comment

user977815 Over a year ago

Thanks jcollado, t.text worked for me. I was able to pull the text from the <title> and <content> tags after removing the <ne> tags entirely.
