Python XML parsing, lxml, urllib.request

Question

I am stuck trying to parse an XML file retrieved from a URL. My goal is to load the XML into a well-structured object so that I can easily retrieve its data. My current code results in the following error:

>>> tree = etree.parse(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72421)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105883)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106182)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105181)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100131)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94254)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95690)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94722)
OSError: Error reading file '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"

Code:

(scraper) gmf:scr gmf$ python3
Python 3.4.2 (default, Jan  2 2015, 20:14:16) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import urllib.request
>>> from lxml import etree
>>>
>>> opener = urllib.request.build_opener()
>>> f = opener.open('https://nordfront.se/feed')
>>> data = f.read()
>>> f.close()
>>> tree = etree.parse(data)

I would be very thankful for any help.

See related question stackoverflow.com/questions/26163247/….
– Mihai8, commented Jan 30, 2015 at 15:11

unutbu · Accepted Answer · 2015-01-30 17:05:01Z

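In your code, data is the raw content of the feed, so ET.parse tries to treat that content as a file name, which is why the error reads "Error reading file '<?xml ...". Per the docstring (see help(ET.parse)), ET.parse expects its first argument to be one of the following: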

a file name/path

import lxml.etree as ET
tree = ET.parse(filename)

a file object

with open('data.xml') as f:
    tree = ET.parse(f)

a file-like object

import io
tree = ET.parse(io.BytesIO(data))

a URL using the HTTP or FTP protocol

import urllib.request
opener = urllib.request.build_opener()
tree = ET.parse(opener.open(url))

This final option, which passes the file-like response object returned by opener.open(url) directly to ET.parse instead of first reading it into data = f.read(), is probably the one you want to use.
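If the feed has already been read into bytes, as in the question, the file-like option still applies. The following is only a minimal sketch: the sample feed is a made-up stand-in for the real response, and the fallback import exists only so the sketch runs where lxml is not installed.

```python
import io

try:
    import lxml.etree as ET  # the library used in the question
except ImportError:
    # Assumption: the standard-library module is an acceptable stand-in here,
    # since it offers the same parse/getroot/iter/findtext calls.
    import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by f.read()
data = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<rss version="2.0"><channel>'
    b'<title>Example feed</title>'
    b'<item><title>First post</title></item>'
    b'<item><title>Second post</title></item>'
    b'</channel></rss>'
)

# Wrap the bytes in a file-like object so parse() reads the content
# instead of treating it as a file name.
tree = ET.parse(io.BytesIO(data))
root = tree.getroot()

# Collect the title of every <item> element in the feed.
titles = [item.findtext('title') for item in root.iter('item')]
print(root.tag, titles)
```

Passing the response object straight to ET.parse avoids the intermediate data variable, but wrapping the bytes is handy when the content is needed for something else as well.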

Alternatively, when you already have the XML in a string, data, you can use ET.fromstring:

root = ET.fromstring(data)

Note, however, that parse returns an ElementTree, while fromstring returns an Element.
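The difference in return types can be sketched as follows. The sample document is invented for illustration, and the fallback import is an assumption that only keeps the sketch runnable without lxml.

```python
import io

try:
    import lxml.etree as ET  # the library used in the question
except ImportError:
    # Assumption: the standard-library module behaves the same for these calls.
    import xml.etree.ElementTree as ET

# Hypothetical XML content already held in memory as bytes
data = b'<rss version="2.0"><channel><title>Example feed</title></channel></rss>'

# parse() returns an ElementTree; call getroot() to reach the root Element.
tree = ET.parse(io.BytesIO(data))
root_from_parse = tree.getroot()

# fromstring() returns the root Element directly.
root_from_string = ET.fromstring(data)

print(root_from_parse.tag, root_from_string.tag)
print(root_from_string.get('version'))
```

Either way you end up with the same root element, so code that walks the document can start from tree.getroot() or from the result of fromstring.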

edited Jan 30, 2015 at 17:05

answered Jan 30, 2015 at 15:15

unutbu
