I recently started working with Python and am trying to parse an XML document. Consider the following XML file for reference:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>

Here I want to retrieve the first book tag with all its contents, i.e.

<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <description>An in-depth look at creating applications
  with XML.</description>
</book>

I come from a Scala background, where I can easily do this with:

val node = scala.xml.XML.loadString(str)
val nodeSeq = node \\ "book"
nodeSeq.head.toString()

I have tried doing this with lxml and XPath, but it gets complicated (recursively fetching the content of nested elements) to achieve the above. Is there an easy way to do this in Python? Also, can it be extended to HTML?

TIA

  • Have you tried minidom? It is probably the easiest package for someone coming from a Scala or Java background. Commented Nov 27, 2015 at 9:38

2 Answers

Use lxml with XPath, then serialize the whole element:

from lxml import etree

data = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>"""

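tree = etree.fromstring(data)
books = tree.xpath("//catalog/book")  # or tree.xpath("(//catalog/book)[1]")
first_book = books[0]  # [0] selects the first book
# Serialize the element itself (not its children); with_tail=False drops the whitespace after </book>
print(etree.tostring(first_book, encoding="unicode", with_tail=False))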

Output:

<book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>

This is the XPath to extract only the first book:

//catalog/book[1]

And this is the full code to return the results you want:

from lxml import html

XML = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
</catalog>"""

tree = html.fromstring(XML)
first_book = tree.xpath('//catalog/book[1]')[0]
book_id = first_book.xpath('@id')[0]
author = first_book.xpath('.//author/text()')[0]
title = first_book.xpath('.//title/text()')[0]
genre = first_book.xpath('.//genre/text()')[0]
price = first_book.xpath('.//price/text()')[0]
publish_date = first_book.xpath('.//publish_date/text()')[0]
# Collapse the line break and indentation inside the description into single spaces
description = ' '.join(first_book.xpath('.//description/text()')[0].split())

print("""Book Id:\t\t{}
Author:\t\t\t{}
Title:\t\t\t{}
Genre:\t\t\t{}
Price:\t\t\t{}
Publish Date:\t{}
Description:\t{}""".format(book_id, author, title, genre, price, publish_date, description))

Output:

Book Id:        bk101
Author:         Gambardella, Matthew
Title:          XML Developer's Guide
Genre:          Computer
Price:          44.95
Publish Date:   2000-10-01
Description:    An in-depth look at creating applications with XML.

If you need the same information for every book inside <catalog>, change //catalog/book[1] to //catalog/book and loop over the results to extract each book's fields.
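
A rough sketch of that loop is below. It makes a few assumptions that are not in the code above: it parses with lxml.etree rather than lxml.html (the input is XML), it uses a shortened copy of the catalog, and the names FIELDS and extract_books are hypothetical helpers introduced only for this example.

```python
from lxml import etree

# Shortened copy of the catalog, for illustration only.
XML = """<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
</catalog>"""

# Hypothetical list of child tags to read from each <book>.
FIELDS = ("author", "title")


def extract_books(xml_text):
    """Return one dict per <book>, holding its id and the chosen fields."""
    tree = etree.fromstring(xml_text)
    records = []
    # No [1] predicate here, so the XPath matches every book.
    for book in tree.xpath("//catalog/book"):
        record = {"id": book.get("id")}
        for field in FIELDS:
            # findtext returns the child's text, or the default if it is missing;
            # split/join collapses any line breaks and indentation.
            record[field] = " ".join(book.findtext(field, default="").split())
        records.append(record)
    return records


for record in extract_books(XML):
    print(record)
```

Adding more tag names to FIELDS should be enough to pull the remaining fields such as genre or price.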
