I received an XML file that someone else extracted from a database. The problem is that it contains some strings that prevent the XML from being read correctly. Here is a small part of it:

<gmd:fileIdentifier xmlns:gmx="http://www.isotc211.org/2005/gmx">\r\n    <gco:CharacterString>0211fa18-e0a4-4d2ed26-7580726e593c</gco:CharacterString>\r\n  </gmd:fileIdentifier>\r\n  <gmd:language>\r\n    <gco:CharacterString>eng</gco:CharacterString>\r\n  </gmd:language>\r\n  <gmd:hierarchyLevel>\r\n    <gmd:MD_ScopeCode codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/codelist/ML_gmxCodelists.xml#MD_ScopeCode" codeListValue="dataset" />\r\n  </gmd:hierarchyLevel>\r\n  <gmd:contact>\r\n    <gmd:CI_ResponsibleParty>\r\n      <gmd:organisationName>\r\n        <gco:CharacterString>Research</gco:CharacterString>\r\n      </gmd:organisationName>\r\n      <gmd:contactInfo>\r\n        <gmd:CI_Contact>\r\n          <gmd:address>\r\n            <gmd:CI_Address>\r\n              <gmd:electronicMailAddress>\r\n                <gco:CharacterString>[email protected]</gco:CharacterString>\r\n              </gmd:electronicMailAddress>\r\n            </gmd:CI_Address>\r\n          </gmd:address>\r\n        </gmd:CI_Contact>\r\n      </gmd:contactInfo>\r\n

As you can see, each tag is followed by the string "\r\n", which is the problem. I tried the following bash commands:

string='\r\n'
sed -i 's/$string/''/g' test.xml

But it does not work: the $string variable is not replaced with an empty string.

Could you please tell me what I'm doing wrong?

Thanks in advance.

5 Answers

1

sed treats \r\n in a pattern as a sequence of special characters (carriage return and newline). You need to match it literally instead, exactly as it appears in your input file: a backslash, an r, a backslash, and an n.

Use the following sed approach:

sed 's#\\r\\n##g' test.xml

The output (for your current input fragment):

<gmd:fileIdentifier xmlns:gmx="http://www.isotc211.org/2005/gmx">    <gco:CharacterString>0211fa18-e0a4-4d2ed26-7580726e593c</gco:CharacterString>  </gmd:fileIdentifier>  <gmd:language>    <gco:CharacterString>eng</gco:CharacterString>  </gmd:language>  <gmd:hierarchyLevel>    <gmd:MD_ScopeCode codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/codelist/ML_gmxCodelists.xml#MD_ScopeCode" codeListValue="dataset" />  </gmd:hierarchyLevel>  <gmd:contact>    <gmd:CI_ResponsibleParty>      <gmd:organisationName>        <gco:CharacterString>Research</gco:CharacterString>      </gmd:organisationName>      <gmd:contactInfo>        <gmd:CI_Contact>          <gmd:address>            <gmd:CI_Address>              <gmd:electronicMailAddress>                <gco:CharacterString>[email protected]</gco:CharacterString>              </gmd:electronicMailAddress>            </gmd:CI_Address>          </gmd:address>        </gmd:CI_Contact>      </gmd:contactInfo>
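
As a quick sanity check, here is a minimal sketch that reproduces the situation on a throwaway file. The file name sample.xml and its contents are made up for the demo; only the sed expression comes from the answer above.

```shell
# Hypothetical scratch directory and file, used only for this demo.
workdir=$(mktemp -d)

# %s prints its argument verbatim, so the file holds the literal
# characters backslash-r-backslash-n, not real CR/LF bytes.
printf '%s\n' '<a>\r\n  <b>x</b>\r\n</a>\r\n' > "$workdir/sample.xml"

# Single quotes keep the shell from touching the backslashes; sed then
# sees \\r\\n, which matches the four literal characters.
cleaned=$(sed 's#\\r\\n##g' "$workdir/sample.xml")
printf '%s\n' "$cleaned"
```

Without -i the file itself is left untouched, which makes it easy to inspect the result before overwriting anything.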

1

The following awk command may also help:

awk '{gsub(/\\r\\n/,"")} 1'  Input_file

Explanation: awk's gsub function globally substitutes the literal text \r\n with an empty string. The backslashes are doubled (\\r\\n) so that awk treats them as literal characters rather than as the start of an escape sequence. The trailing 1 is an always-true pattern that prints every line.


1

\r\n are Windows line endings.

I don't know which XML parser or programming language you're using, but try converting the file to Unix format first by running dos2unix your-file.xml, and then feed it to your parser. You can also convert it with most common text editors.

Hope that helps.
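
Whether dos2unix can help depends on what the file really contains. The sketch below is one way to tell the two cases apart; the file names and contents are invented for the demo, and it assumes a grep that supports -c and -F.

```shell
# Hypothetical scratch files, used only for this demo.
workdir=$(mktemp -d)

# %s prints its argument verbatim: this file holds the literal text \r\n.
printf '%s\n' '<a>\r\n</a>\r\n' > "$workdir/literal.xml"
# Here \r\n is part of the format string, so printf emits real CR LF bytes.
printf '<a>\r\n</a>\r\n' > "$workdir/crlf.xml"

cr=$(printf '\r')   # a single real carriage-return byte

# Count lines containing the literal text (-F turns off regex handling).
# "|| true" only hides grep's non-zero exit status when the count is 0.
literal_in_literal=$(grep -c -F '\r\n' "$workdir/literal.xml" || true)
literal_in_crlf=$(grep -c -F '\r\n' "$workdir/crlf.xml" || true)

# Count lines containing a real carriage return.
real_in_literal=$(grep -c "$cr" "$workdir/literal.xml" || true)
real_in_crlf=$(grep -c "$cr" "$workdir/crlf.xml" || true)

echo "literal.xml: literal=$literal_in_literal real=$real_in_literal"
echo "crlf.xml:    literal=$literal_in_crlf real=$real_in_crlf"
```

If only the "literal" count is non-zero, the file has no Windows line endings to convert, and a text substitution is needed instead.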

2 Comments

I'm using Linux and I tried the dos2unix command, but it was not enough. Since I may have to make this substitution in a large number of files, I need to find an automatic way of doing it. Thanks for your hint!
Without seeing your file it's a little difficult to say which bytes are causing problems, but I've come across this several times and solved it easily with dos2unix. For doing it on many files, there's always the good old pipe and/or for loop.
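
A for loop over many files could be sketched as follows. This variant applies the literal-text sed substitution rather than dos2unix; the directory and file names are placeholders created just for the demo.

```shell
# Hypothetical scratch directory with two sample files.
workdir=$(mktemp -d)
printf '%s\n' '<a>\r\n</a>\r\n' > "$workdir/one.xml"
printf '%s\n' '<b>\r\n</b>\r\n' > "$workdir/two.xml"

for f in "$workdir"/*.xml; do
    # Write to a temporary file first and replace the original only if
    # sed succeeded, so a failure never leaves a truncated file behind.
    sed 's/\\r\\n//g' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

cat "$workdir/one.xml" "$workdir/two.xml"
```

With GNU sed, the -i option performs the same in-place rewrite in one step; the temporary-file form is used here only because it does not depend on that option.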
1

Try this:

sed 's/\\r\\n//g' test       # test is the file containing the line


[user@ip check]$ sed 's/\\r\\n//g' test
<gmd:fileIdentifier xmlns:gmx="http://www.isotc211.org/2005/gmx">  <gco:CharacterString>0211fa18-e0a4-4d2ed26-7580726e593c</gco:CharacterString> </gmd:fileIdentifier>  <gmd:language>    <gco:CharacterString>eng</gco:CharacterString>  </gmd:language>  <gmd:hierarchyLevel>    <gmd:MD_ScopeCode codeList="http://standards.iso.org/ittf/PubliclyAvailableStandards/ISO_19139_Schemas/resources/codelist/ML_gmxCodelists.xml#MD_ScopeCode" codeListValue="dataset" />  </gmd:hierarchyLevel>  <gmd:contact>    <gmd:CI_ResponsibleParty>      <gmd:organisationName>        <gco:CharacterString>Research</gco:CharacterString>      </gmd:organisationName>      <gmd:contactInfo>        <gmd:CI_Contact>          <gmd:address>            <gmd:CI_Address>              <gmd:electronicMailAddress>                <gco:CharacterString>[email protected]</gco:CharacterString>              </gmd:electronicMailAddress>            </gmd:CI_Address>          </gmd:address>        </gmd:CI_Contact>      </gmd:contactInfo>


1

The backslash must be escaped, because sed interprets the sequence \r as a carriage return character:

string='\\r\\n'

Also, variables are expanded inside double quotes but not inside single quotes:

sed -i "s/$string//g" test.xml

Note that, in general, an arbitrary string can't be used this way: if it contains /, it terminates the pattern early and changes the meaning of the sed command (an injection). This is a general problem with code generation.
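
One common workaround is to escape the variable before splicing it into the sed command. The following is only a sketch under assumptions: it targets basic regular expressions, it uses | as the delimiter of the escaping step so that / can appear in the bracket list, and the sample string and file are invented for the demo.

```shell
# Hypothetical search string: contains a slash and the literal text \r\n.
string='</b>\r\n'

# Prefix every character that is special in a basic regex, plus the /
# delimiter used further down, with a backslash.
escaped=$(printf '%s\n' "$string" | sed 's|[][\.*^$/]|\\&|g')

# Hypothetical sample file holding the literal characters.
workdir=$(mktemp -d)
printf '%s\n' '<a><b>x</b>\r\n</a>' > "$workdir/sample.xml"

# Double quotes let the shell expand $escaped before sed parses the command.
result=$(sed "s/$escaped//g" "$workdir/sample.xml")
printf '%s\n' "$result"
```

This does not cover strings that span several lines, and a string used on the replacement side of s/// would need a different set of characters escaped.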

1 Comment

Yes, you are right, but even with the escape character it did not work when I used my command.
