I am editing an XML file whose declaration gives the encoding as ASCII. I want the resulting file to be encoded as UTF-8 so that I can write Swedish characters like åäö, which I can't do at the moment.

An example file equivalent to mine can be found on the Archivematica wiki.

The resulting SIP.xml that I get after running my program with a copy of the above example file can be reached at this link. The added tag with the åäö text is in the very end of the document.

As the code below shows, I have tried setting the encoding on the transformer, and I have also tried using an OutputStreamWriter to set the encoding. Only when I edited the declaration in the original file to UTF-8 was åäö finally written out. So the problem seems to be the declared encoding of the original file. If I'm not mistaken, changing the declaration from ASCII to UTF-8 shouldn't cause any problems. The question is how to do this within my program. Can I do it after parsing the file to a Document object, or do I need to do something before parsing?

package provklasser;

import java.io.File;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.JOptionPane;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

/**
 *
 * @author 
 */
public class Provklass {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            File chosenFile = new File("myFile.xml");
            //parsing the xml file
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // parse(File) avoids treating a plain file system path as a URI
            Document metsDoc = builder.parse(chosenFile);

            Element agent = (Element) metsDoc.getDocumentElement().appendChild(metsDoc.createElementNS("http://www.loc.gov/METS/","mets:agent"));
            agent.appendChild(metsDoc.createTextNode("åäö"));

            DOMSource source = new DOMSource(metsDoc);

            // write the content into xml file
            File newFile = new File(chosenFile.getParent(), "SIP.xml");

            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

            StreamResult result = new StreamResult(newFile);

            //Writer out = new OutputStreamWriter(new FileOutputStream("SIP.xml"), "UTF-8");
            //StreamResult result = new StreamResult(out);
            transformer.transform(source, result);

        } catch (ParserConfigurationException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (SAXException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (TransformerConfigurationException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (TransformerException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        }

    }



}

UPDATE: metsDoc.getInputEncoding() returns UTF-8, while metsDoc.getXmlEncoding() returns ASCII. If I save the new file, parse it again and build a new Document, I get the same result. So the document itself seems to have the right encoding, but the XML declaration is wrong.
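The two values compared above can be printed side by side with a small probe. This is only a sketch: the class name EncodingProbe and the method describe are made up for illustration, and the input is an in-memory byte array rather than the real file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the original program.
class EncodingProbe {

    /** Parses the bytes and reports the detected and the declared encoding. */
    static String describe(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            // getInputEncoding(): the encoding the parser reports having used.
            // getXmlEncoding(): the value written in the XML declaration.
            return "input=" + doc.getInputEncoding()
                    + ", declared=" + doc.getXmlEncoding();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ASCII\"?><root/>";
        System.out.println(describe(xml.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The declared value is the one that keeps showing up in the output declaration, so it is the one worth watching when trying a fix.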

As a workaround, I now edit the XML as a text file before parsing it, replacing the parsing part above with parseXML(chosenFile.getAbsolutePath()); and using the following methods:

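// Imports needed in addition to those in the listing above
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.StringReader;
import java.util.Scanner;
import org.xml.sax.InputSource;

private String withEditedDeclaration(String fileName) {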
    StringBuilder text = new StringBuilder();
    try {

        String NL = System.getProperty("line.separator");
        try (Scanner scanner = new Scanner(new FileInputStream(fileName))) {
            String line = scanner.nextLine();
            text.append(line.replaceFirst("ASCII", "UTF-8") + NL);
            while (scanner.hasNextLine()) {

                text.append(scanner.nextLine() + NL);
            }
        }

    } catch (FileNotFoundException ex) {
        Logger.getLogger(MetsAdaption.class.getName()).log(Level.SEVERE, null, ex);
    } 
    return text.toString();
}

private void parseXML(String fileName) throws SAXException, IOException, ParserConfigurationException {
    String xmlString = withEditedDeclaration(fileName);

    //parsing the xml file
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource is = new InputSource();
    is.setCharacterStream(new StringReader(xmlString));
    metsDoc = builder.parse(is);
}

It works, but it seems like an ugly solution. I'd be most grateful if anyone knows a better way.
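One possible alternative to editing the file as text is to copy the parsed root element into a freshly created Document before serializing. A Document created in memory has no declared encoding (getXmlEncoding() returns null), so nothing from the old declaration is left to compete with OutputKeys.ENCODING. This is a sketch under the assumption that the stale declaration comes from the parsed Document's declared encoding. The class name FreshDocumentSketch and the method rewriteAsUtf8 are made up for illustration, and the input is a tiny in-memory stand-in for the METS file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of the original program.
class FreshDocumentSketch {

    /** Parses xmlBytes, appends a mets:agent element, returns the result as UTF-8 bytes. */
    static byte[] rewriteAsUtf8(byte[] xmlBytes, String agentText) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document parsed = builder.parse(new ByteArrayInputStream(xmlBytes));

            // The new document carries no declaration from the source file.
            Document fresh = builder.newDocument();
            fresh.appendChild(fresh.importNode(parsed.getDocumentElement(), true));

            Element agent = fresh.createElementNS("http://www.loc.gov/METS/", "mets:agent");
            agent.appendChild(fresh.createTextNode(agentText));
            fresh.getDocumentElement().appendChild(agent);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(fresh), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite the XML", e);
        }
    }

    public static void main(String[] args) {
        String input = "<?xml version=\"1.0\" encoding=\"ASCII\"?>"
                + "<mets:mets xmlns:mets=\"http://www.loc.gov/METS/\"/>";
        // Unicode escapes for åäö keep the example independent of the source file encoding.
        byte[] result = rewriteAsUtf8(input.getBytes(StandardCharsets.US_ASCII), "\u00e5\u00e4\u00f6");
        System.out.println(new String(result, StandardCharsets.UTF_8));
    }
}
```

Note that only the root element is copied here, so comments, processing instructions or a DOCTYPE outside the root element would be lost. If the file has any of those, they would need to be imported as well.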

  • Maybe this could help you: stackoverflow.com/questions/3578395/… Commented Jul 4, 2016 at 15:06
  • @Berger Thank you for your tip. However, I don't think this solves my problem. OutputFormat seems to be deprecated and I've already used transformer.setOutputProperty(OutputKeys.ENCODING, encoding). I think I need to edit the declaration of the document, but I don't know how to do that. Commented Jul 5, 2016 at 6:43
  • It seems like this should work (the DocumentBuilder should honor the XML declaration), which leads me to think that your document might not be OK. Could you check whether your base file really is an ASCII document? That is, not only that its XML prolog says so, but that it is actually true when you look at the bytes. Commented Jul 7, 2016 at 13:42
  • @GPI Thank you. According to Firefox the original document is Windows-1252, so maybe that's the problem. Commented Jul 8, 2016 at 6:33
  • @GPI If I change the declaration of the original doc to Windows-1252 I still get input encoding UTF-8 (xml encoding Windows-1252) after parsing, do you know what causes that? Is DocumentBuilder's default encoding UTF-8? I can't find any information on that. Can I somehow set the encoding used by DocumentBuilder? Commented Jul 8, 2016 at 7:36

1 Answer

I had a similar issue where my XML declaration was originally:

<?xml version="1.0" encoding="windows-1252"?>

But after parsing to a Document and then writing it back to XML as UTF-8, the declared encoding stayed as windows-1252 even though the bytes themselves were UTF-8. I eventually worked out that the TransformerFactory implementation in use was com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl. Changing that to:

org.apache.xalan.processor.TransformerFactoryImpl

from Apache Xalan Java 2.7.1 resulted in the charset in the XML declaration being set correctly, and now I have:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
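For readers who want to try this, here is a minimal sketch of checking which TransformerFactory implementation is in use and asking for the Xalan one by class name. It assumes the Xalan jar is on the classpath and falls back to the platform default when it is not. The class name FactoryChoice and the method preferXalan are made up for illustration.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

// Hypothetical helper class for illustration.
class FactoryChoice {

    static final String XALAN_FACTORY = "org.apache.xalan.processor.TransformerFactoryImpl";

    /** Returns the Xalan factory if it can be loaded, otherwise the platform default. */
    static TransformerFactory preferXalan() {
        try {
            // A null class loader means the thread context class loader is used.
            return TransformerFactory.newInstance(XALAN_FACTORY, null);
        } catch (TransformerFactoryConfigurationError e) {
            // Xalan is not on the classpath, so keep whatever the platform provides.
            return TransformerFactory.newInstance();
        }
    }

    public static void main(String[] args) {
        System.out.println("default:  " + TransformerFactory.newInstance().getClass().getName());
        System.out.println("selected: " + preferXalan().getClass().getName());
    }
}
```

Setting the system property javax.xml.transform.TransformerFactory to the same class name is another standard JAXP way to select the implementation without changing the code that calls TransformerFactory.newInstance().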