Let's say I have an XML document like this:

<records>
    <record>
        <name>Jon</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
        </comment>
    </record>
    <record>
        <name>Jane</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 siblings ]]]>
        </comment>
    </record>
</records>

I need to convert this document to a JSON object using xml2js, but I need to remove the < and > symbols inside the CDATA sections so that they do not break the conversion.

What I have tried

Since I understand that I need to remove these symbols before passing the XML string to the xml2js parser, I have tried variations of the solutions described in the following cases:

I am successful in matching the entire contents of the CDATA tag, but I am not able to match the specific characters that I need to remove. This has to be accomplished in a single regex so I can pass the result to the XML-to-JSON parser.

Any help or pointers would be greatly appreciated. Thanks in advance.

Additional Info

Adding this since the question was voted down due to lack of research evidence.

I tried modifying a regex rule I found in one of the references I mentioned. This is the rule.

\[CDATA\[(.*?)\]\]>

This matches the entire contents of the CDATA tag. This is helpful, but what I need is to replace or remove content within the CDATA tags. Here is how it looks in the regex editor.

[Image: regex editor screenshot]

I then proceeded to modify the rule to match either < or >. Here is the rule that I tried.

\[CDATA\[(.*?)[<>]*\]\]>

This rule matches the following content (not just the <> signs).

    [ Patient with > 2 and < 5 siblings]

Here is how it looks on the regex editor.

[Image: regex editor screenshot]

I hope this gives more clarity about what I am trying to accomplish.

Edit 2:

Here is the error triggered by the code. The relevant error message is invalid closing tag.

[Image: error stack trace screenshot]

Here is line 38 of import.js as referenced in the error trace.

const jsonXml = await parseStringPromise(xml).then((res) => res);

This line uses xml2js to parse the XML document and convert it to a JSON object. Because the CDATA section contains the < and > symbols, I assume that the parser treats them as part of an XML tag that is not closed properly.

  • In what way does your conversion break? Isn't comment: ["\n [ Patient with > 2 and < 5 siblings]\n "] the content you expect? Commented Sep 5, 2021 at 20:28
  • Hey, I am in the process of editing the question to show what I have tried. I have considered two options: 1) Remove the <> symbols altogether 2) Convert them to HTML entities. The XML to JSON conversion breaks because the XML tags use these symbols. I am looking for ways to handle these cases. Commented Sep 5, 2021 at 20:58
  • Are you trying to process the XML document directly with regex? If so, why is this tagged xslt? -- P.S. Do not try to process the XML document directly with regex - see here why: stackoverflow.com/a/1732454 Commented Sep 5, 2021 at 21:22
  • Please show your code using that library and the result you get versus the one you want or explain in which way the "conversion breaks", which error you get. Commented Sep 5, 2021 at 21:41
  • Hello @michael.hor257k. I tagged the question as XSL because this XML file will be styled by an XSL stylesheet. However, the problem is not XSL related. Apologies for that. Martin, I will edit the question with the requested info in a few minutes. Commented Sep 5, 2021 at 22:07

3 Answers

In JavaScript, as it is the language you are using to code, you can use

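const text = `<comment>
   <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
</comment>`
// Capture the body separately so the > that closes ]]> is not stripped
const re = /(\[CDATA\[\[)([^]*?)(]]>)/g
console.log( text.replace(re, (m, open, body, close) => open + body.replace(/[<>]/g, '') + close) )

The (\[CDATA\[\[)([^]*?)(]]>) pattern matches all CDATA blocks that start with [CDATA[[ (as in the sample), even if they span multiple lines, because

  • (\[CDATA\[\[) matches and captures [CDATA[[ substrings
  • ([^]*?) matches and captures zero or more chars, as few as possible
  • (]]>) matches and captures ]]>.

Then, once the match is found, all < and > are removed from the captured body with body.replace(/[<>]/g, ''), and the delimiters are put back unchanged. Calling replace on the whole match instead would also strip the > of the closing ]]> and leave the CDATA section unterminated.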
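
A reusable variant of the same idea, as a minimal sketch. The helper name stripAnglesInCdata is hypothetical, and the pattern is an assumption that widens the match to any <![CDATA[ ... ]]> section rather than only those whose content starts with [. The body is captured on its own so that the delimiters, including the > that ends ]]>, are left untouched.

```javascript
// Hypothetical helper: remove < and > from the body of every CDATA section,
// leaving the <![CDATA[ and ]]> delimiters exactly as they were.
function stripAnglesInCdata(xml) {
  return xml.replace(
    /<!\[CDATA\[([\s\S]*?)\]\]>/g,
    (whole, body) => `<![CDATA[${body.replace(/[<>]/g, '')}]]>`
  );
}

const sample = `<comment>
   <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
</comment>`;

const cleaned = stripAnglesInCdata(sample);
console.log(cleaned);
```

The cleaned string could then be handed to the parser in place of the raw XML. Markup outside CDATA sections is not touched, so the element tags keep their angle brackets.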

I can't reproduce the parsing problem (using the current version of the library, 0.4.23):

var xml2js = require("xml2js")

var xml = `<records>
    <record>
        <name>Jon</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
        </comment>
    </record>
    <record>
        <name>Jane</name>
        <surname>Doe</surname>
        <dob>2001-02-01</dob>
        <comment>
            <![CDATA[[ Patient with > 2 siblings ]]]>
        </comment>
    </record>
</records>`;

// await needs an async context in a CommonJS script, so wrap the calls
(async () => {
    const jsResult = await xml2js.parseStringPromise(xml);

    const jsonResult = JSON.stringify(jsResult);

    console.dir(jsonResult);
})();

That gives

{"records":{"record":[{"name":["Jon"],"surname":["Doe"],"dob":["2001-02-01"],"comment":["\n            [ Patient with > 2 and < 5 siblings]\n        "]},{"name":["Jane"],"surname":["Doe"],"dob":["2001-02-01"],"comment":["\n            [ Patient with > 2 siblings ]\n        "]}]}}

which validates and formats fine at jsonlint.com as

{
    "records": {
        "record": [{
            "name": ["Jon"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 and < 5 siblings]\n        "]
        }, {
            "name": ["Jane"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 siblings ]\n        "]
        }]
    }
}

Alternatively, const jsonResult = JSON.stringify(jsResult, null, 4); also gives readable output:

{
    "records": {
        "record": [{
            "name": ["Jon"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 and < 5 siblings]\n        "]
        }, {
            "name": ["Jane"],
            "surname": ["Doe"],
            "dob": ["2001-02-01"],
            "comment": ["\n            [ Patient with > 2 siblings ]\n        "]
        }]
    }
}
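
As a quick check that < and > need no special handling in JSON itself, here is a small sketch that round-trips one record. The parsedRecord literal is a hand-written stand-in for the object shown above (an assumption, so the snippet runs without xml2js installed).

```javascript
// Hand-written stand-in for one record of the parsed result above
// (an assumption, so that the sketch runs without the library).
const parsedRecord = {
  name: ["Jon"],
  comment: ["\n            [ Patient with > 2 and < 5 siblings]\n        "],
};

// Serialize with 4-space indentation, then parse the string back.
const asJson = JSON.stringify(parsedRecord, null, 4);
const roundTripped = JSON.parse(asJson);

// The angle brackets appear unescaped in the JSON text...
console.log(asJson.includes(">") && asJson.includes("<")); // true
// ...and the comment survives the round trip unchanged.
console.log(roundTripped.comment[0] === parsedRecord.comment[0]); // true
```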

1 Comment

Hello Martin. As you noted previously, this issue has more to do with badly formed XML than with the library. The example given is a contrived case created by me, as I cannot post the entire XML message due to privacy concerns. Again, the xml2js library handles these cases as expected. This is not a library issue.

I am providing an answer to the question using the regex suggested by Wiktor Stribiżew, in case this helps anyone with a similar problem.

(?<=<!\[CDATA\[\[(?:(?!<!\[CDATA\[\[|]]>).)*)[<>]

Thanks Wiktor
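
For anyone applying this pattern in JavaScript, a minimal sketch of its use with String.prototype.replace follows. It assumes a runtime with lookbehind support (ES2018 or later), and the variable names are illustrative. Because . does not match line breaks without the s flag, the pattern as written only reaches characters on the same line as the opening <![CDATA[[.

```javascript
// Matches a single < or > only when it is preceded by <![CDATA[[ with no
// new <![CDATA[[ opener or ]]> terminator starting in between.
const cdataAngles = /(?<=<!\[CDATA\[\[(?:(?!<!\[CDATA\[\[|]]>).)*)[<>]/g;

const xml = `<comment>
    <![CDATA[[ Patient with > 2 and < 5 siblings]]]>
</comment>`;

// Each matched character is replaced with an empty string; the tags and
// the CDATA delimiters fall outside the lookbehind condition and stay.
const sanitized = xml.replace(cdataAngles, '');
console.log(sanitized);
```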

1 Comment

If my answer helped please consider finalizing it properly.
