
I have this example xml file

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>

I'd like to extract the contents of the title and content tags.

Which is the better way to extract the data: pattern matching or the xml module? Or is there another approach entirely?


6 Answers


Python already has a built-in XML library, notably ElementTree. For example:

>>> from xml.etree import cElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
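
Note that the fragment in the question has two top-level &lt;page&gt; elements, so it is not a well-formed XML document on its own; that is why this answer wraps it in a &lt;root&gt; element before parsing. A sketch of the same idea using plain ElementTree (cElementTree is deprecated on Python 3.3+) and findtext(), which returns None instead of raising when a tag is missing:

```python
from xml.etree import ElementTree as ET

# The question's fragment has two top-level <page> elements, so it is
# not well-formed XML by itself; wrap it in a single root before parsing.
fragment = """
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>
"""

root = ET.fromstring("<root>%s</root>" % fragment)

# findtext() returns the element's text, or None if the tag is absent
pages = [(p.findtext("title"), p.findtext("content")) for p in root.iter("page")]
print(pages)
```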

Comments
I like this interface: you can index into child tags (root[0][1][0]...), and get an iterator from any node that walks all of its child nodes, e.g. list(root[0][1].itertext()). Super handy!
cElementTree is no longer needed on supported versions of Python (3.3+), use ElementTree.

You can also use BeautifulSoup to extract the text:

from bs4 import BeautifulSoup

data = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

# Collect the text of every <title> tag and every <content> tag
title = [tag.get_text() for tag in soup.find_all("title")]
content = [tag.get_text() for tag in soup.find_all("content")]

# Pair each title with its content and print the tuples
for pair in zip(title, content):
    print(pair)

Output:

('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
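
One caveat with zipping two separate find_all() lists: if any page lacks a title or content tag, the pairs silently shift out of alignment. Iterating per &lt;page&gt; keeps each pair together — a sketch:

```python
from bs4 import BeautifulSoup

data = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

pairs = []
for page in soup.find_all("page"):
    title = page.find("title")
    content = page.find("content")
    # .find() returns None when a tag is absent, so guard before get_text()
    pairs.append((title.get_text() if title else None,
                  content.get_text() if content else None))
print(pairs)
```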

Comments


Code:

from xml.etree import cElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2
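
Note that this answer assumes test.xml wraps the pages in a single root element (otherwise ET.parse() raises a ParseError), and that page.find(...).text raises AttributeError if a tag is missing. A sketch that creates such a file and uses findtext() with a default instead:

```python
from xml.etree import ElementTree as ET

# Hypothetical test.xml content: the question's pages wrapped in one root
with open("test.xml", "w") as f:
    f.write("<root>"
            "<page><title>Chapter 1</title>"
            "<content>Welcome to Chapter 1</content></page>"
            "<page><title>Chapter 2</title>"
            "<content>Welcome to Chapter 2</content></page>"
            "</root>")

tree = ET.parse("test.xml")
# findtext() with a default never raises for a missing tag
titles = [page.findtext("title", default="")
          for page in tree.getroot().findall("page")]
print(titles)
```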

1 Comment

cElementTree is no longer needed on supported versions of Python (3.3+), use ElementTree.

I personally prefer parsing using xml.dom.minidom like so:

In [1]: import xml.dom.minidom

In [2]: x = """\
<root><page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page></root>"""

In [3]: doc = xml.dom.minidom.parseString(x)

In [4]: doc.getElementsByTagName("page")
Out[4]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]

In [5]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
Out[5]: ['Chapter 1', 'Chapter 2']

In [6]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
Out[6]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']

In [7]: for page in doc.getElementsByTagName("page"):
   ...:     for child in page.childNodes:
   ...:         if child.hasChildNodes():
   ...:             for cn in child.childNodes:
   ...:                 if cn.nodeType == cn.TEXT_NODE:
   ...:                     print(cn.wholeText)
   ...:
Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2
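
The nested childNodes walk can be avoided by pairing title and content per page. A minimal sketch (the helper name text_of is my own, not part of minidom):

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<root>"
    "<page><title>Chapter 1</title>"
    "<content>Welcome to Chapter 1</content></page>"
    "<page><title>Chapter 2</title>"
    "<content>Welcome to Chapter 2</content></page>"
    "</root>")

def text_of(page, tag):
    # Text of the first <tag> child element of this page
    return page.getElementsByTagName(tag)[0].firstChild.wholeText

pairs = [(text_of(p, "title"), text_of(p, "content"))
         for p in doc.getElementsByTagName("page")]
print(pairs)
```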

1 Comment

@qed root and doc are the same thing in this case. I updated the code.

I recommend a simple library. There are examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>'''
doc = SimplifiedDoc(html)
pages = doc.pages
print([(page.title.text, page.content.text) for page in pages])

Result:

[('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]

Comments


For working with (navigating, searching, and modifying) XML or HTML data, I have found the BeautifulSoup library very useful. See the BeautifulSoup documentation for installation instructions and further details.

To find tags and extract their attribute values:

from bs4 import BeautifulSoup
data = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF 
CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

soup = BeautifulSoup(data, features="xml")
page_tag = soup.find_all('page')
for each_page in page_tag:
    text_tag = each_page.find_all('text')
    for text_data in text_tag:
        print("Text : ", text_data.text)
        print("Left attribute : ", text_data.get("left"))

Output:

Text :  PALS SOCIETY OF CANADA
Left attribute :  135
Text :  13479 77 AVE
Left attribute :  None
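
Note that features="xml" requires the third-party lxml package; the standard library's "html.parser" also handles simple documents, though it lowercases tag names. Also, .get("left") returns None for a missing attribute, whereas indexing (tag["left"]) raises KeyError — a sketch:

```python
from bs4 import BeautifulSoup

# Second <text> tag deliberately has no "left" attribute
data = '<page><text left="135">PALS</text><text>13479 77 AVE</text></page>'
soup = BeautifulSoup(data, "html.parser")

# .get() is safe for optional attributes; t["left"] would raise KeyError
lefts = [t.get("left") for t in soup.find_all("text")]
print(lefts)
```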

Comments
