1

Please consider the following line from an XML file (generated from a third-party source):

<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />

As you can see, the data_value attribute contains a quoted string inside its value, which causes XML validators to giggle and explode.

Any given XML file could have thousands of lines. Is there a way to apply REGEX to a whole file? And, what would the REGEX be to replace quotes with something more benign?

2 Answers

2

There might be other, better solutions, but this is how I made it work:

  • Use preg_match_all with a regex to capture all matches and store them in the array $matches[0].
  • The regex (?<=data_value=").*(?=" \/>) captures everything between data_value=" and " />. The positive lookbehind and lookahead mean that only the value of each data_value attribute is matched, not the surrounding delimiters. Because . does not match newlines, this relies on each record sitting on its own line with data_value as its last attribute.
  • Loop through the items in $matches[0] and do the following:
    1. Replace every double quote " in the match with % (it could be any other string, even an empty one, that doesn't cause further problems) and store the result in a temporary variable $str.
    2. Replace each original match in the whole data string with its modified version, $str.

PHP code:
Remember that because the data consists of XML tags, you need to use "view source" in the browser to see the output. Alternatively, you can use var_dump instead of echo.

<?php
$data = '<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />
<record ObTime="2017-11-10T23:30" data_value="Some Other "Demo Text"  In Here" />';

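// Capture everything between data_value=" and the closing " /> of each record.
// "." does not match newlines, so each line yields its own match.
preg_match_all('#(?<=data_value=").*(?=" />)#i', $data, $matches);

foreach ($matches[0] as $match) {
    // Replace the inner quotes, then swap the fixed value back into the data.
    $str = str_replace('"', '%', $match);
    $data = str_replace($match, $str, $data);
}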
echo $data;
?>

Output:

<record ObTime="2017-05-10T23:30" data_value="Ocean Park %The Sea WX%  WA US" />
<record ObTime="2017-11-10T23:30" data_value="Some Other %Demo Text%  In Here" />
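
The same idea can also be sketched as a single pass with preg_replace_callback, which rewrites only the matched spans instead of searching the whole string for each match again. This is a minimal sketch, not part of the original answer: the helper name replaceInnerQuotes and its $replacement parameter are hypothetical, and it keeps the assumption that each record sits on its own line and ends with " />.

```php
<?php
// Hypothetical helper (name and signature are illustrative assumptions).
function replaceInnerQuotes(string $xml, string $replacement = '%'): string
{
    // Same lookbehind/lookahead idea as above: the text between
    // data_value=" and the closing " /> on each line.
    $pattern = '#(?<=data_value=").*(?=" />)#';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($replacement): string {
            // $m[0] is the attribute value only; swap its inner quotes.
            return str_replace('"', $replacement, $m[0]);
        },
        $xml
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $xml;
}

$data = '<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />' . "\n"
      . '<record ObTime="2017-11-10T23:30" data_value="Some Other "Demo Text"  In Here" />';

$fixed = replaceInnerQuotes($data);
echo $fixed, PHP_EOL;
```

Unlike the loop with str_replace, the callback only touches text inside the matched attribute values, so identical text elsewhere in the file is left alone.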


4 Comments

thank you very much. I'll look at applying this to each file before parsing.
You're welcome and I'm glad it helped.. enjoy coding!
For the new XML data sample, use the regex (?<=data_value=")[^=]+(?=" (?:\w+=)?) instead of the one in my answer; it can capture both data_value attributes.
And if you only want to capture the second data_value, use (?<=data_value=")[^=]+(?=" \/>) instead.
1

With the regex below, you can match only the stray double quotes inside attribute values, so they can be modified separately:

(?:="|"\s+(?:\w+="|\/>))(*SKIP)(?!)|"

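The left side of the alternation matches the quotes that belong to the markup: the =" that opens an attribute value, and a closing " followed by whitespace and either the next attribute or />. (*SKIP)(?!) then forces that match to fail and tells the engine not to retry inside the text it has just consumed, so those quotes are jumped over. Any quote that remains is matched by the right side of the alternation, the bare ".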


PHP code (removing quotes):

echo preg_replace('~(?:="|"\s+(?:\w+="|\/>))(*SKIP)(?!)|"~', '', $xml);
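
As a self-contained illustration, the sketch below applies the same pattern to the sample record from the question, but replaces the stray quotes with the XML entity &quot; instead of removing them. The helper name escapeStrayQuotes and the choice of &quot; are assumptions made for this example, not part of the answer above.

```php
<?php
// Hypothetical helper; the name and the &quot; replacement are assumptions.
function escapeStrayQuotes(string $xml): string
{
    // Left branch: quotes that delimit attributes (=" or a " followed by the
    // next attribute or the tag end) are matched, then rejected and skipped.
    // Right branch: any quote still reachable is a stray one inside a value.
    $pattern = '~(?:="|"\s+(?:\w+="|/>))(*SKIP)(?!)|"~';

    // preg_replace returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, '&quot;', $xml) ?? $xml;
}

$xml = '<record ObTime="2017-05-10T23:30" data_value="Ocean Park "The Sea WX"  WA US" />';
$fixed = escapeStrayQuotes($xml);
echo $fixed, PHP_EOL;
```

To run this over a whole file, the contents could be read with file_get_contents, passed through the helper, and written back with file_put_contents. The pattern still assumes whitespace between a closing quote and the next attribute or />, so it is worth checking the result against real data first.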

8 Comments

That's a nice one, never heard of this (*SKIP)(?!) before, Up Voted!
Wow... SKIP is cool. I'm brand new to REGEX so everything is like magic to me... but this is the first I've seen of SKIP. Thanks for your answer!
@revo, It looks like the quote in the <xml> opening tag is caught as well. This demo has more verbose data to look at. regex101.com/r/toFV9f/4
You may want to change \/> part in regex to [\/?]>.
@TomSawyer Change \s+ to \s*.