My S1000D XML file specifies a DOCTYPE that references a public URL, which in turn references a number of other files containing all the valid character entities. I've tried parsing it with both xml.etree.ElementTree and lxml, and both fail with a parse error:

undefined entity &minus;: line 82, column 652

This happens even though &minus; is a valid entity according to the entity reference specified.

The top of the XML file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dmodule [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>

If you fetch http://www.s1000d.org/S1000D_4-1/ent/ISOEntities, it includes 20 other .ent files, one of which, iso-tech.ent, contains the line:

<!ENTITY minus "&#x2212;"> <!-- MINUS SIGN -->

Line 82 of the XML file, near column 652, contains the following: ....Refer to 70&minus;41....

How can I run a Python script to parse this file without getting the undefined entity error?

Sorry, I don't want to specify something like parser.entity['minus'] = chr(0x2212) for every entity. I did that as a quick fix, but there are many character entity references. I would like the parser to read the entity declarations that the XML itself references.
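For reference, that kind of quick fix can be sketched roughly as follows. This is a minimal sketch, not a recommendation: the sample document and the `para` element are made up for illustration, and it relies on the `entity` dictionary of `xml.etree.ElementTree.XMLParser`, which only covers the names you add by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document reusing the DOCTYPE from the question.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dmodule [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<dmodule><para>Refer to 70&minus;41</para></dmodule>
"""

parser = ET.XMLParser()
# ElementTree does not fetch the external entity files, so each name
# has to be registered manually. U+2212 is MINUS SIGN (note the hex value).
parser.entity["minus"] = chr(0x2212)

root = ET.fromstring(XML, parser=parser)
print(root.find("para").text)
```

Every additional entity name used in the document would need its own entry, which is why this does not scale to the full ISO entity set.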

I'm surprised, but I've gone around the sun and back and haven't found how to do this (or maybe I have but couldn't follow it). If I update my XML file and add <!ENTITY minus "&#x2212;">, it doesn't fail, so it's not the XML.

It fails on the parse. Here's the code I use for ElementTree:

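import os
import xml.etree.ElementTree as ET

fl = os.path.join(pth, fn)  # pth and fn: directory and file name of the XML file
try:
    tree = ET.parse(fl)  # ET.parse returns an ElementTree, not the root element
    root = tree.getroot()
except ET.ParseError as p:
    print("ParseError : ", p)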

Here's the code I use for lxml

import os
from lxml import etree

fl = os.path.join(pth, fn)  # pth and fn: directory and file name of the XML file
try:
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    root = etree.parse(fl, parser=parser)
except etree.XMLSyntaxError as pe:
    print("lxml XMLSyntaxError: ", pe)

I would like the parser to load the entity declarations so that it knows that &minus; and all the other character entities specified in those files are valid.

Thank you so much for your advice and help.

1 Answer

I'm going to answer for lxml. No reason to consider ElementTree if you can use lxml.

I think the piece you're missing is no_network=False in the XMLParser; it's True by default.

Example...

XML Input (test.xml)

<!DOCTYPE doc [
<!ENTITY % ISOEntities PUBLIC 'ISO 8879-1986//ENTITIES ISO Character Entities 20030531//EN//XML' 'http://www.s1000d.org/S1000D_4-1/ent/ISOEntities'>
%ISOEntities;]>
<doc>
    <test>Here's a test of minus: &minus;</test>
</doc>

Python

from lxml import etree

parser = etree.XMLParser(load_dtd=True,
                         no_network=False)

tree = etree.parse("test.xml", parser=parser)

etree.dump(tree.getroot())

Output

<doc>
    <test>Here's a test of minus: −</test>
</doc>

If you wanted the entity reference retained, add resolve_entities=False to the XMLParser.
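To see the difference without touching the network, here is a small sketch that declares the entity inline (an assumption made only to keep the example self-contained; the real document gets it from the external entity files) and parses the same bytes with both settings.

```python
from lxml import etree

# Entity declared in the internal subset so no network or catalog is needed.
XML = b"""<!DOCTYPE doc [
<!ENTITY minus "&#x2212;">
]>
<doc><test>Here's a test of minus: &minus;</test></doc>"""

# Entity references are replaced by their values.
expanded = etree.fromstring(XML, parser=etree.XMLParser(resolve_entities=True))
# Entity references are kept as entity nodes in the tree.
retained = etree.fromstring(XML, parser=etree.XMLParser(resolve_entities=False))

expanded_str = etree.tostring(expanded, encoding="unicode")
retained_str = etree.tostring(retained, encoding="unicode")

print(expanded_str)
print(retained_str)
```

With resolve_entities=False the serialized output should still contain &minus;, which can be useful when the file has to be written back unchanged.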


Also, instead of going out to an external location to resolve the parameter entity, consider setting up an XML Catalog. This will let you resolve public and/or system identifiers to local versions.

Example using same XML input above...

XML Catalog ("catalog.xml" in the directory "catalog test" (space used in directory name for testing))

<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.1//EN" "http://www.oasis-open.org/committees/entity/release/1.1/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <!-- The path in @uri is relative to this file (catalog.xml). -->
    <uri name="http://www.s1000d.org/S1000D_4-1/ent/ISOEntities" uri="./ents/ISOEntities_stackoverflow.ent"/>
</catalog>

Entity File ("ISOEntities_stackoverflow.ent" in the directory "catalog test/ents". Changed the value to "BAM!" for testing)

<!ENTITY minus "BAM!">

Python (Changed no_network to True for additional evidence that the local version of http://www.s1000d.org/S1000D_4-1/ent/ISOEntities is being used.)

import os
from urllib.request import pathname2url
from lxml import etree

# The XML_CATALOG_FILES environment variable is used by libxml2 (which is used by lxml).
# See http://xmlsoft.org/catalog.html.
try:
    xcf_env = os.environ['XML_CATALOG_FILES']
except KeyError:
    # Path to catalog must be a url.
    catalog_path = f"file:{pathname2url(os.path.join(os.getcwd(), 'catalog test/catalog.xml'))}"
    # Temporarily set the environment variable.
    os.environ['XML_CATALOG_FILES'] = catalog_path

parser = etree.XMLParser(load_dtd=True,
                         no_network=True)

tree = etree.parse("test.xml", parser=parser)

etree.dump(tree.getroot())

Output

<doc>
    <test>Here's a test of minus: BAM!</test>
</doc>
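
As a side note, the catalog URL could also be built with pathlib instead of pathname2url. This is a hedged sketch: `catalog_uri` is a hypothetical helper name, and whether the resulting `file:///` form behaves identically to the construction above on your platform is worth verifying.

```python
import os
from pathlib import Path


def catalog_uri(catalog_file):
    """Return an absolute file: URI for a catalog path (spaces become %20)."""
    return Path(catalog_file).resolve().as_uri()


uri = catalog_uri("catalog test/catalog.xml")

# Only set the variable if it is not already defined, mirroring the
# try/except KeyError logic in the answer's code.
os.environ.setdefault("XML_CATALOG_FILES", uri)

print(uri)
```

As in the code above, this would run before the first call to etree.parse.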
11 Comments

Excellent response! Thank you for being so complete, including the advice to use lxml, the advice on catalogs, and the great examples. The catalog solution works fine, but I could not get the network option to work from the office. It must be due to the proxy servers there. I tested it from my home PC and no_network=False worked just as you showed. I looked up how to add proxy info to my environment, but without success. The catalog solution will work fine, though.
@beakerchi - My pleasure. Glad you got it working. :-)
I'm running this exact code and I get "Entity 'minus' not defined".
For my test I created a folder named "so_catalog_testing". Inside of that folder I have the files: "catalog_test.py" and "test.xml". I also have a folder named "catalog test". Inside "catalog test" I have the file "catalog.xml" and a folder named "ents". Inside of the "ents" folder is a file named "ISOEntities_stackoverflow.ent".
It works today after double checking everything and updating lxml. Now I'll try a catalog with "rewritePrefix". Haven't seen anyone solving that, but I'll give it a try.