
I have this example xml file

<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>

I'd like to extract the contents of the title and content tags.

Which is the better way to extract the data: pattern matching or the xml module? Or is there another approach entirely?


6 Answers


Python already has a built-in XML library, notably ElementTree. For example:

>>> from xml.etree import cElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...  <title>Chapter 2</title>
...  <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
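
Note that the fragment in the question has two top-level &lt;page&gt; elements, so it is not a well-formed XML document on its own; that is why this answer wraps it in a &lt;root&gt; element before parsing. A sketch of the same idea using plain ElementTree (cElementTree is deprecated on Python 3.3+) and findtext(), which returns None instead of raising when a tag is missing:

```python
from xml.etree import ElementTree as ET

# The question's fragment has two top-level <page> elements, so it is
# not well-formed XML by itself; wrap it in a single root before parsing.
fragment = """
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>
"""

root = ET.fromstring("<root>%s</root>" % fragment)

# findtext() returns the element's text, or None if the tag is absent
pages = [(p.findtext("title"), p.findtext("content")) for p in root.iter("page")]
print(pages)
```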

Comments
I like this interface: you can index into child tags (root[0][1][0]...), and get an iterator from any node that walks all of its child nodes, e.g. list(root[0][1].itertext()). Super handy!
cElementTree is no longer needed on supported versions of Python (3.3+), use ElementTree.

You can also use BeautifulSoup to extract the text:

from bs4 import BeautifulSoup

data = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

# Collect the text of every <title> tag and every <content> tag
title = [tag.get_text() for tag in soup.find_all("title")]
content = [tag.get_text() for tag in soup.find_all("content")]

# Pair each title with its content and print the tuples
for pair in zip(title, content):
    print(pair)

Output:

('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
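
One caveat with zipping two separate find_all() lists: if any page lacks a title or content tag, the pairs silently shift out of alignment. Iterating per &lt;page&gt; keeps each pair together — a sketch:

```python
from bs4 import BeautifulSoup

data = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

pairs = []
for page in soup.find_all("page"):
    title = page.find("title")
    content = page.find("content")
    # .find() returns None when a tag is absent, so guard before get_text()
    pairs.append((title.get_text() if title else None,
                  content.get_text() if content else None))
print(pairs)
```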

Comments


Code:

from xml.etree import cElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()

for page in root.findall('page'):
    print("Title: ", page.find('title').text)
    print("Content: ", page.find('content').text)

Output:

Title:  Chapter 1
Content:  Welcome to Chapter 1
Title:  Chapter 2
Content:  Welcome to Chapter 2
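
Note that this answer assumes test.xml wraps the pages in a single root element (otherwise ET.parse() raises a ParseError), and that page.find(...).text raises AttributeError if a tag is missing. A sketch that creates such a file and uses findtext() with a default instead:

```python
from xml.etree import ElementTree as ET

# Hypothetical test.xml content: the question's pages wrapped in one root
with open("test.xml", "w") as f:
    f.write("<root>"
            "<page><title>Chapter 1</title>"
            "<content>Welcome to Chapter 1</content></page>"
            "<page><title>Chapter 2</title>"
            "<content>Welcome to Chapter 2</content></page>"
            "</root>")

tree = ET.parse("test.xml")
# findtext() with a default never raises for a missing tag
titles = [page.findtext("title", default="")
          for page in tree.getroot().findall("page")]
print(titles)
```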

1 Comment

cElementTree is no longer needed on supported versions of Python (3.3+), use ElementTree.

I personally prefer parsing using xml.dom.minidom like so:

In [1]: import xml.dom.minidom

In [2]: x = """\
<root><page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page></root>"""

In [3]: doc = xml.dom.minidom.parseString(x)

In [4]: doc.getElementsByTagName("page")
Out[4]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]

In [5]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
Out[5]: ['Chapter 1', 'Chapter 2']

In [6]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
Out[6]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']

In [7]: for page in doc.getElementsByTagName("page"):
   ...:     for child in page.childNodes:
   ...:         if child.hasChildNodes():
   ...:             for cn in child.childNodes:
   ...:                 if cn.nodeType == cn.TEXT_NODE:
   ...:                     print(cn.wholeText)
   ...:
Chapter 1
Welcome to Chapter 1
Chapter 2
Welcome to Chapter 2
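
The nested childNodes walk can be avoided by pairing title and content per page. A minimal sketch (the helper name text_of is my own, not part of minidom):

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<root>"
    "<page><title>Chapter 1</title>"
    "<content>Welcome to Chapter 1</content></page>"
    "<page><title>Chapter 2</title>"
    "<content>Welcome to Chapter 2</content></page>"
    "</root>")

def text_of(page, tag):
    # Text of the first <tag> child element of this page
    return page.getElementsByTagName(tag)[0].firstChild.wholeText

pairs = [(text_of(p, "title"), text_of(p, "content"))
         for p in doc.getElementsByTagName("page")]
print(pairs)
```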

1 Comment

@qed root and doc are the same thing in this case. I updated the code.

I recommend a simple library. There are examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
 <title>Chapter 2</title>
 <content>Welcome to Chapter 2</content>
</page>'''
doc = SimplifiedDoc(html)
pages = doc.pages
print([(page.title.text, page.content.text) for page in pages])

Result:

[('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]

Comments


For working with (navigating, searching, and modifying) XML or HTML data, I have found the BeautifulSoup library very useful. See the BeautifulSoup documentation for installation instructions and further details.

To find tags and extract their attribute values:

from bs4 import BeautifulSoup
data = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF 
CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

soup = BeautifulSoup(data, features="xml")
page_tag = soup.find_all('page')
for each_page in page_tag:
    text_tag = each_page.find_all('text')
    for text_data in text_tag:
        print("Text : ", text_data.text)
        print("Left attribute : ", text_data.get("left"))

Output:

Text :  PALS SOCIETY OF CANADA
Left attribute :  135
Text :  13479 77 AVE
Left attribute :  None
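
Note that features="xml" requires the third-party lxml package; the standard library's "html.parser" also handles simple documents, though it lowercases tag names. Also, .get("left") returns None for a missing attribute, whereas indexing (tag["left"]) raises KeyError — a sketch:

```python
from bs4 import BeautifulSoup

# Second <text> tag deliberately has no "left" attribute
data = '<page><text left="135">PALS</text><text>13479 77 AVE</text></page>'
soup = BeautifulSoup(data, "html.parser")

# .get() is safe for optional attributes; t["left"] would raise KeyError
lefts = [t.get("left") for t in soup.find_all("text")]
print(lefts)
```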

Comments
