PHP parsing xml file error

Question

I am trying to use SimpleXML to get data from http://rates.fxcm.com/RatesXML. With simplexml_load_file() I sometimes get errors, because the response always has stray strings/numbers before and after the XML document. Example:

2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
    <Rate Symbol="EURUSD">
    <Bid>1.27595</Bid>
    <Ask>1.2762</Ask>
    <High>1.27748</High>
    <Low>1.27385</Low>
    <Direction>-1</Direction>
    <Last>23:29:11</Last>
</Rate>
</Rates>
0

I then decided to use file_get_contents(), strip the strings before and after the XML with substr(), and parse the result with simplexml_load_string(). However, sometimes the random strings also appear between the nodes, like this:

<Rate Symbol="EURTRY">
    <Bid>2.29443</Bid>
    <Ask>2.29562</Ask>
    <High>2.29841</High>
    <Low>2.28999</Low>

137b

 <Direction>1</Direction>
    <Last>23:29:11</Last>
</Rate>
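
For reference, the trimming step described above might look roughly like the following. This is only a sketch: the helper name trimToXml and the sample string are made up for illustration, and it only handles junk before the first < and after the last >, not junk between nodes.

```php
<?php
// Hypothetical helper (not from the original post): keep only the span from
// the first '<' to the last '>' of the raw response.
function trimToXml(string $raw): string
{
    $start = strpos($raw, '<');
    $end   = strrpos($raw, '>');
    if ($start === false || $end === false || $end < $start) {
        return ''; // nothing that looks like markup
    }
    return substr($raw, $start, $end - $start + 1);
}

// Made-up sample in the shape shown in the question.
$raw = "2000<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<Rates><Rate Symbol=\"EURUSD\"><Bid>1.27595</Bid></Rate></Rates>\n0";

$xml = simplexml_load_string(trimToXml($raw));
echo $xml->Rate->Bid, "\n"; // prints 1.27595
```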

My question is: is there any way I can deal with all these random strings in one go with a regex function, regardless of where they are placed? (I think that would be a better idea than contacting the site to get them to serve proper XML files.)

A regex would be much easier to compile if variable-width assertions were allowed in lookbehinds.
– Yes Barry, Commented Nov 19, 2012 at 6:15

Community · Accepted Answer · 2017-05-23 10:24:42Z

I believe preprocessing XML with regular expressions might be just as bad as parsing it with them.

But here is a preg_replace call that removes everything before the first < at the beginning of the string, everything after the last > at the end of the string, and stray non-whitespace text after closing/self-closing tags:

$string = preg_replace( '~
    (?|           # start of alternation where capturing group count starts from
                  # 1 for each alternative
      ^[^<]*      # match non-< characters at the beginning of the string
    |             # OR
      [^>]*$      # match non-> characters at the end of the string
    |             # OR
      (           # start of capturing group $1: closing tag
        </[^>]++> # match a closing tag; note the possessive quantifier (++): it
                  # suppresses backtracking, which is a safe optimization here
                  # because the following bit is mutually exclusive anyway (this
                  # will be used throughout the regex)
        \s++      # and the following whitespace
      )           # end of $1
      [^<\s]*+    # match non-<, non-whitespace characters (the "bad" ones)
      (?:         # start subgroup to repeat for more whitespace/non-whitespace
                  # sequences
        \s++      # match whitespace
        [^<\s]++  # match at least one "bad" character
      )*          # repeat
                  # note that this kind of pattern keeps all whitespace
                  # before the first and after the last "bad" character
    |             # OR
      (           # start of capturing group $1: self-closing tag
        <[^>/]+/> # match a self-closing tag
        \s++      # and the following whitespace
      )
      [^<\s]*+(?:\s++[^<\s]++)*
                  # same as before
    )             # end of alternation
    ~x',
    '$1',
    $input);

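The replacement string '$1' then simply writes back the closing or self-closing tag (and the whitespace after it) if there was one. For the first two alternatives $1 is empty, so the matched text is removed entirely.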
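
A rough sketch of how the call might be used end to end. The pattern follows the one above with the comments stripped, and both tag alternatives use [^<\s]*+ for the stray text. The function name stripStrayText and the sample input are made up for illustration.

```php
<?php
// Hypothetical wrapper around the preg_replace call from the answer.
function stripStrayText(string $input): string
{
    return preg_replace('~
        (?|
          ^[^<]*                      # junk before the first tag
        |
          [^>]*$                      # junk after the last tag
        |
          (</[^>]++>\s++)             # $1: closing tag plus whitespace
          [^<\s]*+(?:\s++[^<\s]++)*   # stray text after it
        |
          (<[^>/]+/>\s++)             # $1: self-closing tag plus whitespace
          [^<\s]*+(?:\s++[^<\s]++)*   # stray text after it
        )
        ~x', '$1', $input);
}

// Made-up sample combining the two kinds of junk shown in the question.
$input = "2000<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
       . "<Rates>\n"
       . "  <Rate Symbol=\"EURTRY\">\n"
       . "    <Bid>2.29443</Bid>\n"
       . "    <Low>2.28999</Low>\n"
       . "\n137b\n\n"
       . "    <Direction>1</Direction>\n"
       . "  </Rate>\n"
       . "</Rates>\n0";

$xml = simplexml_load_string(stripStrayText($input));
echo $xml->Rate['Symbol'], ' ', $xml->Rate->Direction, "\n"; // prints EURTRY 1
```

Note that the stray text is only recognized when it follows a closing or self-closing tag and some whitespace; junk directly after an opening tag would be left in place as element content.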

One of the reasons this approach is not safe is that closing or self-closing tags might occur inside comments or attribute strings. But I can hardly suggest you use an XML parser instead, since your XML parser can't parse the XML either.
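
Since the cleanup is only a heuristic, one option is to check that the cleaned string is actually well-formed before using it. A minimal sketch, with the helper name parseOrNull made up for illustration:

```php
<?php
// Hypothetical helper: returns a SimpleXMLElement, or null when the string
// is not well-formed XML.
function parseOrNull(string $xmlString): ?SimpleXMLElement
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($xmlString);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting
    return $xml === false ? null : $xml;
}

var_dump(parseOrNull('<Rate><Bid>1.2</Bid></Rate>') !== null); // bool(true)
var_dump(parseOrNull('<dire 1371 ction/>') !== null);          // bool(false)
```

A caller could then discard or re-fetch a response for which this returns null, rather than trying to repair it.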

edited May 23, 2017 at 10:24

answered Nov 19, 2012 at 8:46

Martin Ender

4 Comments

Martin Ender Over a year ago

@MichaelLam what do you mean by "just like the file source"? Have a look at cURL to retrieve data from the web.
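
A minimal cURL sketch for that suggestion, assuming the curl extension is installed. The function name fetchBody and the timeout value are illustrative and not from the thread. One possible explanation for the stray tokens (an assumption, not confirmed anywhere in this thread) is that they are chunk-size lines from HTTP chunked transfer encoding, which cURL decodes on its own by default.

```php
<?php
// Hypothetical helper: fetch a URL body with cURL, or return null on failure.
function fetchBody(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects, if any
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
    $body = curl_exec($ch);                          // false on failure
    return is_string($body) ? $body : null;
}

// Usage (performs a network request, so it is left commented out here):
// $body = fetchBody('http://rates.fxcm.com/RatesXML');
// $xml  = $body !== null ? simplexml_load_string($body) : false;
```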

Michael Lam Over a year ago

Hmm, disregarding the previous question: sometimes the XML string I'm using gets corrupted inside a node, for example <dire 1371 ction/>. Is there any way to parse and check, firstly, whether any rubbish strings are before/after the string and, secondly, whether any rubbish strings are within or between nodes, and then remove them?

Martin Ender Over a year ago

@MichaelLam I don't think that is possible. In this specific case 1371 is not a valid attribute name, but say the rubbish turns out to be b137. Who says this is not a dire tag with boolean attributes b137 and ction?

Michael Lam Over a year ago

Not exactly sure what you meant. This is an example of one of the nodes: <Rate Symbol="NZDCHF"> <Bid>.76829</Bid> <Ask>.76865</Ask> <High>.77085</High> <Low>.76711</Low> <Direction>1</Direction> <Last>05:20:01</Last> </Rate> It appears fine when I view the page in the browser, but when I use file_get_contents() and then view the source, sometimes there are weird strings before and after the whole string, and sometimes they are inside the nodes, like the 1371 example. This prevents SimpleXML from loading it.
