I have an XML file that contains En Dash and Em Dash characters as part of an element's text. They show up as UTF-8 byte sequences, as follows:

<TextValue>This is an En Dash:  \xE2\x80\x93    This is an Em Dash: \xE2\x80\x94.</TextValue>

I would like to match those UTF-8 hex sequences in JavaScript and replace them with any text I choose.

Could anyone suggest an approach? I tried a RegEx but could not match those sequences, although I can match any other text with a RegEx.

Thank you.

  • Are you in control of the XML? I don't think the \xE2 notation is valid XML. The proper XML encoding for these characters (if not raw bytes) would be &#x2013; and &#x2014; respectively. But if you're stuck with that XML, I guess you'll need some custom parsing/decoding. Commented Aug 17, 2012 at 5:06
  • I am not in control of that XML. I receive it from an upstream system, so I need to accept it as is and fix it up myself. Commented Aug 17, 2012 at 5:07
  • Perhaps your editor goofed up and all is well? Commented Aug 17, 2012 at 5:10

2 Answers

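var text = "<TextValue>This is an En Dash:  \xE2\x80\x93    This is an Em Dash: \xE2\x80\x94.</TextValue>";

// Mis-decoded UTF-8 sequences (one character per byte) and their plain-ASCII replacements
var fromArr = ["\xe2\x80\x98", "\xe2\x80\x99", "\xe2\x80\x9c", "\xe2\x80\x9d", "\xe2\x80\x93", "\xe2\x80\x94", "\xe2\x80\xa6"],
    toArr = ["'", "'", '"', '"', '-', '--', '...'];

for (var i = 0; i < fromArr.length; i++) {
    // split/join replaces every occurrence; the third "flags" argument to
    // String.prototype.replace was a non-standard Firefox extension
    text = text.split(fromArr[i]).join(toArr[i]);
}
alert(text);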

Change to

 var fromArr = ["\xe2\x80\x93", "\xe2\x80\x94"], toArr = [ '-', '--'];

if you do not need the smartquotes and ellipsis
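If keeping a list of byte sequences up to date gets tedious, another option is to decode the whole string first and then replace the real characters. This is only a sketch: it assumes the text really is UTF-8 that was read one byte per character (which is what the \xE2\x80\x93 sequences suggest) and that TextDecoder is available (current browsers and recent Node.js have it; an older embedded script engine may not). decodeMisreadUtf8 is a made-up helper name, not part of any library.

```javascript
// Hypothetical helper: reinterpret a string whose characters are really
// individual UTF-8 bytes (all char codes <= 0xFF) as proper UTF-8 text.
function decodeMisreadUtf8(text) {
    var bytes = new Uint8Array(text.length);
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        if (code > 0xFF) {
            // Not a byte value, so the string was not misread this way; leave it alone.
            return text;
        }
        bytes[i] = code;
    }
    // Invalid byte sequences become U+FFFD by default instead of throwing.
    return new TextDecoder("utf-8").decode(bytes);
}

var decoded = decodeMisreadUtf8("En Dash: \xE2\x80\x93 Em Dash: \xE2\x80\x94.");

// After decoding, the dashes are ordinary characters and a global regex matches them.
var cleaned = decoded.replace(/\u2013/g, "-").replace(/\u2014/g, "--");
```

Once the string is decoded, any other character can be matched by its \uXXXX escape in the same way, so the replacement list no longer depends on how the bytes happened to be misread.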

5 Comments

I am seeing this:<TextValue>This is an En Dash: ? This is an Em Dash: ?.</TextValue>
Also could you please explain why you have so many array entries just to replace two tokens?
Just remove the ones you do not need. I found a list of tokens likely to appear in your code
Yeah, it works in Fiddle. It seems the tool I am using is meddling with standard behavior. This time it is not even replacing.
Even this is working in Fiddle: var text = "<TextValue>This is an En Dash: \xE2\x80\x93 This is an Em Dash: \xE2\x80\x94.</TextValue>"; text = text.replace("\xE2\x80\x93","-","g"); text = text.replace("\xE2\x80\x94","--","g"); alert(text);
I finally got it working by reading the body of the message as UTF-8 and using the following lines to replace the Unicode characters.

body = body.replace(/\u00E1/g,"a");  //LATIN SMALL LETTER A WITH ACUTE
body = body.replace(/\u00E2/g,"a");  //LATIN SMALL LETTER A WITH CIRCUMFLEX
body = body.replace(/\u00E3/g,"a");  //LATIN SMALL LETTER A WITH TILDE
body = body.replace(/\u201D/g,"\"");  //RIGHT DOUBLE QUOTATION MARK
body = body.replace(/\u201C/g,"\"");  //LEFT DOUBLE QUOTATION MARK
body = body.replace(/\u2424/g," ");  //SYMBOL FOR NEWLINE (U+2424); the \n control character itself is U+000A
body = body.replace(/\u000D/g," ");  //CARRIAGE RETURN \r
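The same idea can be written with a lookup table and a single pass, which may be easier to extend as more characters turn up. This is just a sketch: the replacements table and the replaceMapped name are made up for illustration, and the dash entries are added on the assumption that the original en/em dashes should be handled the same way.

```javascript
// Hypothetical lookup table; add whatever characters the upstream system sends.
var replacements = {
    "\u00E1": "a",   // LATIN SMALL LETTER A WITH ACUTE
    "\u00E2": "a",   // LATIN SMALL LETTER A WITH CIRCUMFLEX
    "\u00E3": "a",   // LATIN SMALL LETTER A WITH TILDE
    "\u201C": "\"",  // LEFT DOUBLE QUOTATION MARK
    "\u201D": "\"",  // RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",   // EN DASH (assumed addition)
    "\u2014": "--",  // EM DASH (assumed addition)
    "\r": " "        // CARRIAGE RETURN
};

function replaceMapped(body) {
    // Visit every character that is not printable ASCII or \n,
    // and swap it only if the table has an entry for it.
    return body.replace(/[^\x20-\x7E\n]/g, function (ch) {
        return Object.prototype.hasOwnProperty.call(replacements, ch) ? replacements[ch] : ch;
    });
}

var body = replaceMapped("caf\u00E1 \u201Cx\u201D \u2013 y\r");
```

Characters that are not in the table are left as they are, so nothing is silently dropped.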
