I am having an issue extracting the email address from an XML file using Python 3.

My code is:

import xml.etree.ElementTree as ET
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

data = '''<row>
    <row _id="row-jyi7-56ru_b7km" _uuid="00000000-0000-0000-B614-7FFDD7C1595B" _position="0" _address="https://www.dati.lombardia.it/resource/zzzz-zzzz/row-jyi7-56ru_b7km">
        <codice_regionale>MI1604</codice_regionale>
        <denom_farmacia>Farmacia Varesina</denom_farmacia>
        <indirizzo>VIA VARESINA, 121</indirizzo>
        <localita>Milano</localita>
        <telefono>3480813398</telefono>
        <email>[email protected]</email>
        <caratterizzazione>urbana</caratterizzazione>
        <esenzioni>true</esenzioni>
        <location latitude="45.500881" longitude="9.141339"/>
</row>'''

tree = ET.fromstring(data) #standard ET
results = tree.findall('email') #find the count section in xml

print(results.text)

The error I get is

Traceback (most recent call last):
  File "farmacie.py", line 25, in <module>
    tree = ET.fromstring(data) #standard ET
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/xml/etree/ElementTree.py", line 1321, in XML
    return parser.close()
xml.etree.ElementTree.ParseError: no element found: line 12, column 6

How can I solve this?

    You're missing a closing </row>, or the extra <row> at the start isn't supposed to be there. (Commented Mar 20, 2020 at 12:24)

1 Answer


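There are two separate problems. First, your XML opens <row> twice but closes it only once, so the parser reaches the end of the input while still waiting for a closing tag; that is what "no element found" means. Either remove the extra opening <row> or add the missing </row>. Second, findall() returns a list of elements rather than a single element, so results.text will fail even once the XML parses. You need to pick one item from the list, or print them all out: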

import xml.etree.ElementTree as ET

data = '''<row _id="row-jyi7-56ru_b7km" _uuid="00000000-0000-0000-B614-7FFDD7C1595B" _position="0" _address="https://www.dati.lombardia.it/resource/zzzz-zzzz/row-jyi7-56ru_b7km">
        <codice_regionale>MI1604</codice_regionale>
        <denom_farmacia>Farmacia Varesina</denom_farmacia>
        <indirizzo>VIA VARESINA, 121</indirizzo>
        <localita>Milano</localita>
        <telefono>3480813398</telefono>
        <email>[email protected]</email>
        <caratterizzazione>urbana</caratterizzazione>
        <esenzioni>true</esenzioni>
        <location latitude="45.500881" longitude="9.141339"/>
</row>'''

tree = ET.fromstring(data)  # fromstring() returns the root <row> element
results = tree.findall('email')  # list of all direct <email> children

print(results[0].text)  # text of the first match

Or:

for r in results:
    print(r.text)
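
If you only ever expect one email per record, find() and findtext() may be more convenient than findall(), since they return the first match (or None / a default) instead of a list. A minimal sketch, assuming a shortened record with a placeholder email address; the `pec` tag is a hypothetical name used only to show the missing-element case:

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical record; the email value is a placeholder.
data = '''<row _id="row-1">
    <denom_farmacia>Farmacia Varesina</denom_farmacia>
    <email>farmacia@example.com</email>
</row>'''

root = ET.fromstring(data)

# find() returns the first matching child element, or None if there is none
first = root.find('email')
email = first.text if first is not None else None

# findtext() returns the text directly; 'pec' does not exist here,
# so the default is returned instead
missing = root.findtext('pec', default='')

print(email)
print(repr(missing))
```

Checking for None before reading .text avoids an AttributeError on records that have no email element.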

Update:

In the full dataset the records are nested two levels below the root element, so the path has to spell out both levels. To get all of the emails:

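import xml.etree.ElementTree as ET
import requests

url = 'https://www.dati.lombardia.it/api/views/5dq5-xs9z/rows.xml'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Parse the raw bytes so the parser honours the encoding declared in the XML
tree = ET.fromstring(response.content)
# Records are nested as <row><row>...</row></row> under the root element
results = tree.findall("./row/row/email")

for r in results:
    print(r.text)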

Results (2,684 rows):

[email protected]
[email protected]
[email protected]
[email protected]
...
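
To see why the path needs two row steps, here is a small offline sketch. It assumes the export wraps everything in a root element (called `response` here, which is an assumption) containing an outer <row> that holds one inner <row> per pharmacy; the records and addresses are made up:

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the nested layout; root tag name is an assumption.
sample = '''<response>
  <row>
    <row _id="row-1"><denom_farmacia>A</denom_farmacia><email>a@example.com</email></row>
    <row _id="row-2"><denom_farmacia>B</denom_farmacia></row>
    <row _id="row-3"><denom_farmacia>C</denom_farmacia><email>c@example.com</email></row>
  </row>
</response>'''

root = ET.fromstring(sample)

# Explicit path: root -> outer row -> inner row -> email
by_path = [e.text for e in root.findall('./row/row/email')]

# './/email' searches all descendants, whatever the nesting depth
by_search = [e.text for e in root.findall('.//email')]

# Iterating over records keeps each email paired with its pharmacy;
# findtext() returns None when a record has no <email> element
per_record = {
    rec.findtext('denom_farmacia'): rec.findtext('email')
    for rec in root.findall('./row/row')
}

print(by_path)
print(per_record)
```

Looping over the records rather than the email elements is worth considering if some rows have no email, because a flat list of emails no longer tells you which pharmacy each one belongs to.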

4 Comments

Thanks, now it works. However, when I try to extend the process to a bigger XML file (dati.lombardia.it/api/views/5dq5-xs9z/rows.xml) it still does not work. Any suggestions?
From the dataset you linked, it looks like you might be looking for tree.findall("./row/row/email"). That will pull all the email elements from the entire set.
Thank you very much. However, I am still having issues: if I try to insert the entire dataset I get xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 401, column 60. How should I solve this? Should I import the data through the link, or copy and paste it into the data variable? Thank you again for the help!
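
One way to investigate a "not well-formed (invalid token)" error is to catch the ParseError and look at the line it points to. The sketch below is only a diagnostic aid, not a diagnosis of the dataset above: `locate_parse_error` is a hypothetical helper name, and the broken sample (an unescaped `&`) is just one common cause chosen for illustration.

```python
import xml.etree.ElementTree as ET

def locate_parse_error(xml_text):
    """Return (line, column, offending_line) if parsing fails, else None.

    Hypothetical helper; expects the XML as a str.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based
        lines = xml_text.splitlines()
        offending = lines[line - 1] if 0 < line <= len(lines) else ''
        return line, column, offending
    return None

# Illustrative input: a bare '&' is not allowed in XML text
broken = '<row>\n  <denom_farmacia>Rossi & Bianchi</denom_farmacia>\n</row>'

print(locate_parse_error(broken))
print(locate_parse_error('<row/>'))
```

Printing the offending line usually makes it clear whether the text was damaged while being copied or whether the source itself contains an invalid character.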
