I am a bit stuck trying to parse an XML file retrieved from a URL. My goal is to load this XML into a well-structured object so that I can easily retrieve its data. My current code results in the following error:

>>> tree = etree.parse(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72421)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105883)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106182)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105181)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100131)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94254)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95690)
  File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94722)
OSError: Error reading file '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"


Code:

(scraper) gmf:scr gmf$ python3
Python 3.4.2 (default, Jan  2 2015, 20:14:16) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import urllib.request
>>> from lxml import etree
>>>
>>> opener = urllib.request.build_opener()
>>> f = opener.open('https://nordfront.se/feed')
>>> data = f.read()
>>> f.close()
>>> tree = etree.parse(data)


I'm very thankful for your help

1 Answer

Per the docstring (see help(ET.parse)), ET.parse expects its first argument to be one of:

  • a file name/path

    import lxml.etree as ET    
    tree = ET.parse(filename)
    
  • a file object

    # binary mode lets the parser honor the XML encoding declaration
    with open('data.xml', 'rb') as f:
        tree = ET.parse(f)
    
  • a file-like object

    import io
    tree = ET.parse(io.BytesIO(data))
    
  • a URL using the HTTP or FTP protocol

    import urllib.request
    opener = urllib.request.build_opener()
    tree = ET.parse(opener.open(url))
    

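Your call failed because data holds the document's contents, which ET.parse tried to interpret as a file name. The final option, which passes opener.open(url) directly to ET.parse instead of first reading the response with data = f.read(), is probably the one you want.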
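As a minimal sketch of handing the opened response straight to the parser: parse_url below is a hypothetical helper name, not part of lxml, and the fallback to the standard library's xml.etree.ElementTree is only an assumption made so the snippet still runs where lxml is not installed.

```python
import urllib.request

try:
    import lxml.etree as ET
except ImportError:  # assumption: fall back to the stdlib parser if lxml is missing
    import xml.etree.ElementTree as ET


def parse_url(url):
    """Open url and parse the response body as XML (hypothetical helper)."""
    # The response is a file-like object, so parse can read from it directly;
    # the with block closes the connection once parsing is done.
    with urllib.request.urlopen(url) as response:
        return ET.parse(response)
```

Calling tree = parse_url('https://nordfront.se/feed') should then give an ElementTree without ever storing the raw bytes in a separate variable.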

Alternatively, when you already have the XML in a string, data, you can use ET.fromstring:

    root = ET.fromstring(data)

Note, however, that parse returns an ElementTree, while fromstring returns an Element.
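To make that difference concrete, here is a small sketch. The sample feed in data is made up for illustration and stands in for the bytes returned by f.read(); the element names are assumptions about a typical RSS layout, and the fallback import is only there so the snippet runs without lxml.

```python
import io

try:
    import lxml.etree as ET
except ImportError:  # assumption: fall back to the stdlib parser if lxml is missing
    import xml.etree.ElementTree as ET

# Made-up sample feed standing in for the bytes returned by f.read().
data = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<rss version="2.0"><channel>'
    b'<title>Example feed</title>'
    b'<item><title>First post</title></item>'
    b'<item><title>Second post</title></item>'
    b'</channel></rss>'
)

tree = ET.parse(io.BytesIO(data))  # ElementTree wrapping the whole document
root = ET.fromstring(data)         # Element: the root <rss> node itself

# getroot() bridges the two: both now refer to an <rss> element.
assert tree.getroot().tag == root.tag == "rss"

# Collect the title of every <item> under <channel>.
titles = [item.findtext("title") for item in root.findall("channel/item")]
print(titles)  # ['First post', 'Second post']
```

Either object lets you walk the feed with find, findall and findtext; if you start from parse, call getroot() first when you need the root Element.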
