data.xml

<?xml version="1.0" encoding="UTF-8"?>
<ArticleSet>
    <Article>            
        <LastName>Bojarski</LastName>
        <ForeName>-</ForeName>
        <Affiliation>-</Affiliation>            
    </Article>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

SAMPLE CODE

from lxml import etree

dom = etree.parse('data.xml')
root = dom.getroot()

for article in dom.xpath('Article[Affiliation="-"]'):
    root.remove(article)

dom.write('output.xml')

This code deletes every article whose Affiliation is equal to -, i.e. whose affiliation tag looks like <Affiliation>-</Affiliation>. When I write the remaining elements to output.xml, the Unicode character in Genç is serialized as Gen&#231;. I want it stored as it is.

Code's output

<ArticleSet>
    <Article>            
        <LastName>Gen&#231;</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

Required output

<ArticleSet>
    <Article>            
        <LastName>Genç</LastName>
        <ForeName>Yasemin</ForeName>
        <Affiliation>fgjfgnfgn</Affiliation>            
    </Article>
</ArticleSet>

1 Answer

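The write method of the tree returned by etree.parse takes an encoding parameter. Without it, lxml serializes the document as ASCII, so non-ASCII characters such as ç are written as numeric character references (&#231;). Pass encoding='utf-8' to keep them as literal characters. You may also pass xml_declaration=True so that the output document declares its encoding.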

dom.write('output.xml', encoding='utf-8', xml_declaration=True)

See lxml documentation.
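To see the difference without touching any files, here is a minimal sketch that serializes the same tree twice into in-memory buffers. The serialize helper and the shortened XML string are illustrative assumptions, not part of the question's code. The fallback to the standard library's xml.etree.ElementTree is also an addition, there only so the sketch runs where lxml is not installed; it appears to behave the same way for these particular calls.

```python
import io

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib ElementTree API is close enough for this demo.
    import xml.etree.ElementTree as etree

# Shortened version of data.xml, kept in memory as UTF-8 bytes.
xml_source = (
    '<ArticleSet><Article><LastName>Genç</LastName></Article></ArticleSet>'
).encode('utf-8')

tree = etree.parse(io.BytesIO(xml_source))


def serialize(tree, **write_kwargs):
    """Hypothetical helper: run tree.write() into a buffer and return the bytes."""
    buffer = io.BytesIO()
    tree.write(buffer, **write_kwargs)
    return buffer.getvalue()


# No encoding given: ASCII output, ç becomes a numeric character reference.
default_bytes = serialize(tree)

# Explicit UTF-8: ç is written as a literal character, with an XML declaration.
utf8_bytes = serialize(tree, encoding='utf-8', xml_declaration=True)

print(default_bytes)
print(utf8_bytes.decode('utf-8'))
```

Both outputs describe the same document to an XML parser; the encoding argument only changes how the character is written on disk.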

1 Comment

Wow I tried some stuff and this made sense and it actually worked out fine at first try. I got my åäö, thanks!
