I am trying to extract the text from nested tags. For example, the XML has this form:

<thread id = 1_1>
  <post id = 1>
    <title>
      <ne>MediaPortal</ne> Install Guide
    </title>
    <content>
      <ne>MediaPortal</ne> Install Guide 0. Introduction and pre-requisites 
      <ne>MediaPortal</ne> is an open-source and free full-fledged <ne>HTPC</ne>
      front-end. It does everything you can ask for in a media center: video 
      playback, music playback, photo viewing, weather, TV tuning and recording, 
      etc. It has wide community support and thanks to it's excellent plug-in 
      and  skinning framework, there are lots of community-developed extensions 
      you can  pick and choose to make it your own. It is far more configurable 
      than <ne>Windows Media Center</ne>, and it works out-of-the-box with the 
      <ne>MCE</ne> remote. And because it provides so much more configuration 
      some find it a daunting task to install and configure. Therefore, this 
      guide will help alleviate some of that burden and help get a 
      <ne>MediaPortal</ne> installation up &amp; running. This guide is not 
      intended to replace the wonderful <ne>MediaPortal</ne> documentation, but 
      rather to introduce the AVS community to <ne>MediaPortal</ne> and provide
      a quick and easy set-up guide. If you need more details on configuration
    </content>
  </post>
</thread>

I need to extract the data within the <ne> tags and save it in a separate file. I am able to do that, and afterwards I extract the <ne> tags out of the Beautiful Soup object. Now I want to extract the text from the <title> and <content> tags and put it in a separate file. Please suggest how this can be achieved.

After extracting the <ne> tags from the soup object, if I run

for title in soup.findAll('title'):
    print(title.string)

then it prints None on the console for every title tag that contained <ne> tags before I extracted them.

1 Answer

From the BeautifulSoup documentation:

For your convenience, if a tag has only one child node,
and that child node is a string, the child node is made
available as tag.string, as well as tag.contents[0].
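
A minimal sketch of the quoted rule, assuming the bs4 package (BeautifulSoup 4) and Python's built-in html.parser. The <item> markup is made up for illustration and is not from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <item> holds only a string,
# the second holds an <ne> tag followed by a string.
markup = "<item>Install Guide</item><item><ne>MediaPortal</ne> Install Guide</item>"
soup = BeautifulSoup(markup, "html.parser")

plain, mixed = soup.find_all("item")

print(plain.string)         # single string child, so .string returns it
print(mixed.string)         # two children, so .string is None
print(len(mixed.contents))  # the <ne> tag and the trailing string
```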

However, in your case:

>>> t = soup.find('title')
>>> t
<title><ne>MediaPortal</ne> Install Guide</title>

Here the title tag has two children, an <ne> tag and a string, so tag.string is None. You can still use tag.contents or tag.text:

>>> t.contents
[<ne>MediaPortal</ne>, u' Install Guide']
>>> t.text
u'MediaPortalInstall Guide'
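
The workflow from the question could be sketched roughly as follows, again assuming bs4 and html.parser. The fragment is a shortened, hypothetical version of the <content> element, and the variable names are illustrative. In bs4, get_text() keeps the original whitespace unless strip=True is passed, so the sketch normalizes whitespace explicitly:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened fragment modelled on the <content> element in the question.
markup = """
<content>
  <ne>MediaPortal</ne> is an open-source <ne>HTPC</ne> front-end.
</content>
"""
soup = BeautifulSoup(markup, "html.parser")
content = soup.find("content")

# Text of each <ne> tag, collected before the tags are removed.
entities = [ne.get_text() for ne in content.find_all("ne")]

# All nested strings joined together, with whitespace runs collapsed.
full_text = " ".join(content.get_text().split())

# extract() removes each <ne> tag (and its text) from the tree.
for ne in content.find_all("ne"):
    ne.extract()
remaining_text = " ".join(content.get_text().split())

print(entities)
print(full_text)
print(remaining_text)
```

Each result is a plain string (or a list of strings), so it can be written to a separate file with the usual open() and write() calls.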
1 Comment

Thanks jcollado, t.text worked for me. I was able to pull the text from the <title> and <content> tags after removing the <ne> tags entirely.
