I am editing an XML file whose declaration gives the encoding as ASCII. I want the resulting file to be encoded as UTF-8 so that I can write Swedish characters like åäö, which I can't do at the moment.
An example file equivalent to my file can be found at the Archivematica wiki.
The resulting SIP.xml that I get after running my program on a copy of the above example file can be reached at this link. The added tag with the åäö text is at the very end of the document.
As seen in the code below, I have tried setting the encoding on the transformer, and I have also tried using an OutputStreamWriter to set the encoding. In the end I edited the declaration in the original file to UTF-8, and finally åäö was written out. So the problem seems to be the encoding declared in the original file. If I'm not mistaken, changing the declaration from ASCII to UTF-8 shouldn't cause any problems, since ASCII is a subset of UTF-8. The question is: how do I do this within my program? Can I do it after parsing the file to a Document object, or do I need to do something before parsing?
package provklasser;
import java.io.File;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.JOptionPane;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;
/**
 *
 * @author
 */
public class Provklass {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            File chosenFile = new File("myFile.xml");

            //parsing the xml file
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document metsDoc = builder.parse(chosenFile.getAbsolutePath());

            Element agent = (Element) metsDoc.getDocumentElement().appendChild(metsDoc.createElementNS("http://www.loc.gov/METS/","mets:agent"));
            agent.appendChild(metsDoc.createTextNode("åäö"));

            DOMSource source = new DOMSource(metsDoc);

            // write the content into xml file
            File newFile = new File(chosenFile.getParent(), "SIP.xml");
            TransformerFactory transformerFactory = TransformerFactory.newInstance();
            Transformer transformer = transformerFactory.newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StreamResult result = new StreamResult(newFile);
            //Writer out = new OutputStreamWriter(new FileOutputStream("SIP.xml"), "UTF-8");
            //StreamResult result = new StreamResult(out);
            transformer.transform(source, result);
        } catch (ParserConfigurationException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (SAXException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (IOException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (TransformerConfigurationException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        } catch (TransformerException ex) {
            Logger.getLogger(Provklass.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
UPDATE: metsDoc.getInputEncoding() returns UTF-8, while metsDoc.getXmlEncoding() returns ASCII. If I save the new file, parse it again and build a new Document, I get the same result. So the document content seems to have the right encoding, but the XML declaration does not.
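To make that observation easier to reproduce, here is a minimal sketch that parses a tiny in-memory document and prints both values. The class name EncodingProbe and the one-element sample are hypothetical stand-ins for the real METS file. getXmlEncoding() reports what the XML declaration says, while getInputEncoding() reports what the parser used when reading the bytes, and the latter may vary between parser implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EncodingProbe {

    /** Parses raw XML bytes with the same factory settings as the program above. */
    static Document parse(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmlBytes));
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse the sample XML", ex);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the real file; only the declaration matters here.
        byte[] sample = "<?xml version=\"1.0\" encoding=\"ASCII\"?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        Document doc = parse(sample);

        // The value taken from the XML declaration.
        System.out.println("getXmlEncoding():   " + doc.getXmlEncoding());
        // The encoding the parser reports for the input; implementation dependent.
        System.out.println("getInputEncoding(): " + doc.getInputEncoding());
    }
}
```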
Now I edit the XML as a text file before parsing it, replacing the parsing part above with parseXML(chosenFile.getAbsolutePath()); and using the following methods:
// Additional imports needed by these two methods:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.StringReader;
import java.util.Scanner;
import org.xml.sax.InputSource;

// Reads the file as text and rewrites the encoding in the first line (the XML declaration).
private String withEditedDeclaration(String fileName) {
    StringBuilder text = new StringBuilder();
    try {
        String NL = System.getProperty("line.separator");
        // Assumption: the bytes on disk are UTF-8 (plain ASCII is a subset of it).
        try (Scanner scanner = new Scanner(new FileInputStream(fileName), "UTF-8")) {
            String line = scanner.nextLine();
            text.append(line.replaceFirst("ASCII", "UTF-8")).append(NL);
            while (scanner.hasNextLine()) {
                text.append(scanner.nextLine()).append(NL);
            }
        }
    } catch (FileNotFoundException ex) {
        Logger.getLogger(MetsAdaption.class.getName()).log(Level.SEVERE, null, ex);
    }
    return text.toString();
}

// metsDoc is a Document field of the enclosing class.
private void parseXML(String fileName) throws SAXException, IOException, ParserConfigurationException {
    String xmlString = withEditedDeclaration(fileName);

    //parsing the xml file
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    InputSource is = new InputSource();
    is.setCharacterStream(new StringReader(xmlString));
    metsDoc = builder.parse(is);
}
It works, but it seems like an ugly solution. I'd be most grateful if anyone knew a better way.
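One alternative that might avoid editing the file as text is to leave the parsing alone and serialize with the DOM Level 3 Load and Save API instead of a Transformer, since LSOutput lets the caller name the output encoding explicitly. The following is only a sketch under that assumption: the class name LsUtf8Sketch and the in-memory sample are hypothetical, and I have not verified it against the real METS file or against every parser implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

public class LsUtf8Sketch {

    /** Parses raw XML bytes into a namespace-aware DOM document. */
    static Document parse(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmlBytes));
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse the sample XML", ex);
        }
    }

    /** Serializes the document, asking the serializer for UTF-8 output explicitly. */
    static byte[] serializeAsUtf8(Document doc) {
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSOutput output = ls.createLSOutput();
        output.setEncoding("UTF-8");          // requested output encoding
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        output.setByteStream(bytes);          // a FileOutputStream would work the same way
        LSSerializer serializer = ls.createLSSerializer();
        serializer.write(doc, output);
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical sample with an ASCII declaration, standing in for the real file.
        byte[] sample = "<?xml version=\"1.0\" encoding=\"ASCII\"?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        Document doc = parse(sample);
        // \u00e5\u00e4\u00f6 is the text "åäö", escaped so the source file stays pure ASCII.
        doc.getDocumentElement().appendChild(doc.createTextNode("\u00e5\u00e4\u00f6"));

        System.out.println(new String(serializeAsUtf8(doc), StandardCharsets.UTF_8));
    }
}
```

If this behaves as the Load and Save specification describes, the encoding set on the LSOutput takes precedence over the encoding recorded in the parsed Document, so the declaration would not need to be rewritten by hand.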
DocumentBuilder should honor the XML declaration). That leads me to think that your document might not be OK. Could you check whether your base file really is an ASCII document (that is, not only does its XML prolog say so, but the actual bytes are ASCII as well)?
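A quick way to run that check is to scan the raw bytes for anything outside the 7-bit range. This is a minimal sketch; the class name AsciiCheck is hypothetical and the file path is taken from the command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AsciiCheck {

    /** Returns the offset of the first byte outside 0x00-0x7F, or -1 if all bytes are ASCII. */
    static int firstNonAsciiOffset(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            // Java bytes are signed, so values 0x80-0xFF show up as negative numbers.
            if (data[i] < 0) {
                return i;
            }
        }
        return -1;
    }

    /** True when every byte is in the 7-bit ASCII range. */
    static boolean isPureAscii(byte[] data) {
        return firstNonAsciiOffset(data) == -1;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstNonAsciiOffset(data);
        if (offset == -1) {
            System.out.println("All " + data.length + " bytes are 7-bit ASCII.");
        } else {
            System.out.println("First non-ASCII byte at offset " + offset);
        }
    }
}
```

Run it with the path to the original file as the only argument. If it reports a non-ASCII byte, the file's declaration does not match its contents.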