0

I am having issue parsing an xml result using python. I tried using etree.Element(text), but the error says Invalid tag name. Does anyone know if this is actually an xml and any way of parsing the result using a standard package? Thank you!

import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")

text=response.text
print(text)

<?xml version="1.0" ?>
<ExchangeSet xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="https://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_eutils.xsd" ><DocumentSummary uid="1593319917"><SNP_ID>1593319917</SNP_ID><ALLELE_ORIGIN/><GLOBAL_MAFS><MAF><STUDY>SGDP_PRJ</STUDY><FREQ>G=0.5/1</FREQ></MAF></GLOBAL_MAFS><GLOBAL_POPULATION/><GLOBAL_SAMPLESIZE>0</GLOBAL_SAMPLESIZE><SUSPECTED/><CLINICAL_SIGNIFICANCE/><GENES><GENE_E><NAME>FLT3</NAME><GENE_ID>2322</GENE_ID></GENE_E></GENES><ACC>NC_000013.11</ACC><CHR>13</CHR><HANDLE>SGDP_PRJ</HANDLE><SPDI>NC_000013.11:28102567:G:A</SPDI><FXN_CLASS>upstream_transcript_variant</FXN_CLASS><VALIDATED>by-frequency</VALIDATED><DOCSUM>HGVS=NC_000013.11:g.28102568G&gt;A,NC_000013.10:g.28676705G&gt;A,NG_007066.1:g.3001C&gt;T|SEQ=[G/A]|LEN=1|GENE=FLT3:2322</DOCSUM><TAX_ID>9606</TAX_ID><ORIG_BUILD>154</ORIG_BUILD><UPD_BUILD>154</UPD_BUILD><CREATEDATE>2020/04/27 06:19</CREATEDATE><UPDATEDATE>2020/04/27 06:19</UPDATEDATE><SS>3879653181</SS><ALLELE>R</ALLELE><SNP_CLASS>snv</SNP_CLASS><CHRPOS>13:28102568</CHRPOS><CHRPOS_PREV_ASSM>13:28676705</CHRPOS_PREV_ASSM><TEXT/><SNP_ID_SORT>1593319917</SNP_ID_SORT><CLINICAL_SORT>0</CLINICAL_SORT><CITED_SORT/><CHRPOS_SORT>0028102568</CHRPOS_SORT><MERGED_SORT>0</MERGED_SORT></DocumentSummary>
</ExchangeSet>
2
  • It is not clear what do you try to do.. Please explain Commented May 3, 2021 at 9:58
  • Your xml is valid. Please show the code to read the xml and error message. Commented May 3, 2021 at 11:19

2 Answers 2

1

You're using the wrong method to parse your XML. The etree.Element class is for creating a single XML element. For example:

>>> a = etree.Element('a')
>>> a
<Element a at 0x7f8c9040e180>
>>> etree.tostring(a)
b'<a/>'

As Jayvee has pointed how, to parse XML contained in a string you use the etree.fromstring method (to parse XML content in a file you would use the etree.parse method):

>>> response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")
>>> doc = etree.fromstring(response.text)
>>> doc
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}ExchangeSet at 0x7f8c9040e180>
>>>

Note that because this XML document sets a default namespace, you'll need properly set namespaces when looking for elements. E.g., this will fail:

>>> doc.find('DocumentSummary')
>>>

But this works:

>>> doc.find('docsum:DocumentSummary', {'docsum': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'})
<Element {https://www.ncbi.nlm.nih.gov/SNP/docsum}DocumentSummary at 0x7f8c8e987200>
Sign up to request clarification or add additional context in comments.

Comments

0

You can check if the xml is well formed by try converting it:

import requests, sys, json
from lxml import etree
response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=1593319917&report=XML")

text=response.text 
try:
    doc=etree.fromstring(text)
    print("valid")
except:
    print("not a valid xml")

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.