I am trying to use SimpleXML to get data from http://rates.fxcm.com/RatesXML. With simplexml_load_file() I got errors at times, because the response always has stray strings/numbers before and after the XML. Example:

2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
    <Rate Symbol="EURUSD">
    <Bid>1.27595</Bid>
    <Ask>1.2762</Ask>
    <High>1.27748</High>
    <Low>1.27385</Low>
    <Direction>-1</Direction>
    <Last>23:29:11</Last>
</Rate>
</Rates>
0

I then decided to fetch the page with file_get_contents(), use substr() to remove the strings before and after, and parse the result as a string with simplexml_load_string(). However, sometimes the random strings also appear in between the nodes, like this:

<Rate Symbol="EURTRY">
    <Bid>2.29443</Bid>
    <Ask>2.29562</Ask>
    <High>2.29841</High>
    <Low>2.28999</Low>

137b

 <Direction>1</Direction>
    <Last>23:29:11</Last>
</Rate>

My question is: is there any way I can deal with all these random strings in one go with a regex function, regardless of where they are placed? (I think that would be a better idea than contacting the site to get them to serve proper XML files.)

  • If this helps. Commented Nov 19, 2012 at 6:12
  • A regex would be much easier to compile if variable-width assertions were allowed in lookbehinds. Commented Nov 19, 2012 at 6:15

1 Answer

I believe preprocessing XML with regular expressions might be just as bad as parsing it.

But here is a preg_replace() call that removes everything before the first < and after the last >, plus any non-whitespace junk that follows a closing or self-closing tag:

$string = preg_replace( '~
    (?|           # branch-reset group: capturing group numbering restarts
                  # at 1 in each alternative
      ^[^<]*      # match non-< characters at the beginning of the string
    |             # OR
      [^>]*$      # match non-> characters at the end of the string
    |             # OR
      (           # start of capturing group $1: closing tag
        </[^>]++> # match a closing tag; note the possessive quantifier (++):
                  # it suppresses backtracking, which is a convenient
                  # optimization because what follows is mutually exclusive
                  # anyway (this is used throughout the regex)
        \s++      # and the following whitespace
      )           # end of $1
      [^<\s]*+    # match non-<, non-whitespace characters (the "bad" ones)
      (?:         # start subgroup to repeat for more whitespace/non-whitespace
                  # sequences
        \s++      # match whitespace
        [^<\s]++  # match at least one "bad" character
      )*          # repeat
                  # note that this kind of pattern keeps all whitespace
                  # before the first and after the last "bad" character
    |             # OR
      (           # start of capturing group $1: self-closing tag
        <[^>/]+/> # match a self-closing tag
        \s++      # and the following whitespace
      )           # end of $1
      [^<\s]*+(?:\s++[^<\s]++)*
                  # same as before
    )             # end of alternation
    ~x',
    '$1',
    $input);

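The replacement '$1' then simply writes back the closing or self-closing tag (together with the whitespace after it) if there was one; for the first two alternatives $1 is empty, so the match is just dropped.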
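To try this out, here is a minimal sketch that runs the pattern on a sample stitched together from the fragments in the question and then hands the result to simplexml_load_string(). The helper name stripStrayTokens() is made up for illustration, and the pattern is a compact, comment-free transcription (with [^<\s]*+ in both tag alternatives), so treat it as an approximation rather than a tested drop-in.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function stripStrayTokens(string $input): string
{
    // Same four alternatives as the commented pattern, written without the x flag.
    $pattern = '~(?|^[^<]*|[^>]*$'
        . '|(</[^>]++>\s++)[^<\s]*+(?:\s++[^<\s]++)*'
        . '|(<[^>/]+/>\s++)[^<\s]*+(?:\s++[^<\s]++)*)~';

    // preg_replace() returns null on a PCRE error; fall back to the raw input.
    return preg_replace($pattern, '$1', $input) ?? $input;
}

// Sample assembled from the two fragments shown in the question.
$raw = <<<'XML'
2000<?xml version="1.0" encoding="UTF-8"?>
<Rates>
    <Rate Symbol="EURTRY">
    <Bid>2.29443</Bid>
    <Ask>2.29562</Ask>
    <High>2.29841</High>
    <Low>2.28999</Low>

137b

 <Direction>1</Direction>
    <Last>23:29:11</Last>
</Rate>
</Rates>
0
XML;

$clean = stripStrayTokens($raw);
$xml = simplexml_load_string($clean);

if ($xml !== false) {
    echo $xml->Rate['Symbol'], ' direction: ', $xml->Rate->Direction, PHP_EOL;
}
```

Note that junk glued directly to a closing tag (no whitespace in between) is not caught, because both tag alternatives require at least one whitespace character after the tag.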

One of the reasons this approach is not safe is that closing or self-closing tags might occur inside comments or attribute strings. But I can hardly suggest you use an XML parser instead, since your XML parser can't parse the XML either.

4 Comments

@MichaelLam what do you mean by "just like the file source"? Have a look at cURL to retrieve data from the web.
Hmm, disregarding the previous question: sometimes the XML string I'm using gets corrupted inside the nodes, for example <dire 1371 ction/>. Is there any way to check, first, whether there are rubbish strings before/after the string and, second, whether there are rubbish strings within or between nodes, and remove them?
@MichaelLam I don't think that is possible. In this specific case 1371 is not a valid attribute name, but say the rubbish turns out to be b137. Who says this is not a dire tag with boolean attributes b137 and ction?
Not exactly sure what you meant. This is an example of one of the nodes: <Rate Symbol="NZDCHF"> <Bid>.76829</Bid> <Ask>.76865</Ask> <High>.77085</High> <Low>.76711</Low> <Direction>1</Direction> <Last>05:20:01</Last> </Rate> It does appear fine when I'm viewing the page, but when I use file_get_contents() and then view the source, sometimes there are weird strings before and after the whole string, and sometimes they are inside the nodes, like the 1371 example. This prevents SimpleXML from loading it.
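
Following up on the cURL suggestion in these comments, here is a rough sketch of fetching the feed with cURL and checking whether the result actually parses before using it. The helper names fetchWithCurl() and tryLoadXml() are made up for illustration, and the timeout value is arbitrary. One guess, not confirmed anywhere in this thread: tokens like 2000, 137b and 0 look like the hexadecimal chunk sizes of HTTP chunked transfer encoding, which libcurl decodes on its own, so they may simply not show up when fetching this way.

```php
<?php
// Hypothetical helper: fetch a URL with the cURL extension and return the body.
function fetchWithCurl(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // seconds; arbitrary value for this sketch
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    return $body;
}

// Hypothetical helper: try to parse a string as XML without emitting warnings.
// Returns [SimpleXMLElement|null, list of libxml error messages].
function tryLoadXml(string $xmlString): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of warning
    libxml_clear_errors();

    $xml = simplexml_load_string($xmlString);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    // Compare against false explicitly: an empty element is falsy but still valid.
    return [$xml === false ? null : $xml, $messages];
}
```

A caller could then do something like [$xml, $errors] = tryLoadXml(fetchWithCurl('http://rates.fxcm.com/RatesXML')); and retry the fetch or log $errors when $xml is null. This detects a corrupted response such as <dire 1371 ction/>; it does not repair it.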
