I have an XML document with text in attribute values. I can't change how the XML file is generated, but I need to extract the attribute values without losing the \r\n sequences. The XML parser, of course, strips them out.

So I'm trying to replace \r\n in attribute values with entity references. I'm using Perl to do this because of its non-greedy matching, but I need help getting the replacement to happen only within the match. Or I need an easier way to do this :)

Here is what I have so far:

perl -i -pe 'BEGIN{undef $/;} s/m_description="(.*?)"/m_description="$1"/smg' tmp.xml

This matches what I need to work with: (.*?). But I don't know how to expand that pattern to match \r\n inside it and do the replacement in the results. If I knew how many \r\n I had, I could do it, but it seems I need a variable number of capture groups or something like that? There's a lot about regex I don't understand, and it seems like there should be something to do this.

Example:

preceding lines 
stuff m_description="Over
any number
of lines" other stuff
more lines

Should go to:

preceding lines 
stuff m_description="Over&#13;&#10;any number&#13;&#10;of lines" other stuff
more lines

Solution

Thanks to ikegami and ysth for the solution I used, which for 5.14+ is:

perl -i -0777 -pe's/m_description="\K(.*?)(?=")/ $1 =~ s!\n!&#10;!gr =~ s!\r!&#13;!gr /sge' tmp.xml
  • show sample data? What you show isn't XML. Commented Dec 18, 2016 at 18:57
  • You probably want something like this: perl -i -p0e 's/m_description="\K([^"]*)/$1=~s%\r\n%&#13;&#10;%gr/ge' (-0 is roughly the same as BEGIN{undef $/}). Commented Dec 18, 2016 at 19:08
  • I think you need a rolled up copy of the XML spec to prod people with. Almost like XML, but not quite is pretty filthy. A perl one liner will be hard to read. Writing it as a script in which you extract and reformat the description would be easier. Commented Dec 18, 2016 at 19:54
  • Get an XML parser. Using a regex for XML is just ugly. Commented Dec 18, 2016 at 22:50
  • @Robert, They can't. The XML was incorrectly built, and they are trying to fix it so that an XML parser can be used. Commented Dec 19, 2016 at 3:47

2 Answers

The . should already match both \n (because you specify the /s flag) and \r, so (.*?) already spans the line breaks.
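
As a quick check of that claim, here is a minimal sketch. The sample string is made up for illustration and is not taken from the question's file.

```perl
use strict;
use warnings;

# Hypothetical sample string: "a", CR, LF, "b".
my $sample = "a\r\nb";

# Without /s, . matches \r but stops at \n.
my ($no_s) = $sample =~ /a(.*)/;

# With /s, . matches \n as well, so the capture runs to the end.
my ($with_s) = $sample =~ /a(.*)/s;

printf "without /s: %d char(s) captured\n", length $no_s;
printf "with /s:    %d char(s) captured\n", length $with_s;
```

The first capture holds only the \r, while the second holds \r, \n and b.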

To do the replacement in the results, use /e:

perl -i -0777 -pe's/(?<=m_description=")(.*?)(?=")/ my $replacement=$1; $replacement=~s!\n!&#10;!g; $replacement=~s!\r!&#13;!g; $replacement /sge' tmp.xml

I've also changed it to use lookbehind/lookahead to make the code simpler, to use -0777 to set $/ to slurp mode, and to remove the useless /m (the pattern contains no ^ or $, which are the only things /m affects).
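
If the one-liner gets hard to read, the same /e idea can be written out as a small script. This is only a sketch: the sub names encode_breaks and fix_descriptions are made up here, and it consumes the surrounding quotes with [^"]* instead of using lookarounds, which is a readability choice rather than something required.

```perl
use strict;
use warnings;

# Encode line-break characters as numeric character references.
sub encode_breaks {
    my ($value) = @_;
    $value =~ s/\n/&#10;/g;
    $value =~ s/\r/&#13;/g;
    return $value;
}

# Rewrite every m_description="..." value in a document string.
# /e evaluates the replacement as Perl code for each match.
sub fix_descriptions {
    my ($doc) = @_;
    $doc =~ s/m_description="([^"]*)"/'m_description="' . encode_breaks($1) . '"'/ge;
    return $doc;
}

# Hypothetical input modelled on the question's example.
my $in = qq{stuff m_description="Over\r\nany number\r\nof lines" other stuff\n};
print fix_descriptions($in);
```

Line breaks outside the attribute value are left alone, because the substitution only touches the text between the two quotes.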

4 Comments

m_description="\K is more efficient and less noisy than (?<=m_description="). Requires 5.10+
my $replacement=$1; $replacement=~s!\n!&#10;!g; $replacement=~s!\r!&#13;!g; $replacement can also be written as $1 =~ s!\n!&#10;!gr =~ s!\r!&#13;!gr if you have 5.14+
what ikegami said. I was just too lazy to look up those two required perl versions.
Awesome! Thanks so much.
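
For reference, the two spellings discussed in these comments can be compared side by side. This is a sketch on a made-up input string; it assumes Perl 5.14 or later so that /r is available.

```perl
use v5.14;
use warnings;

# Hypothetical input with one CRLF inside the attribute value.
my $in = qq{m_description="a\r\nb"};

# Lookbehind plus an explicit temporary variable.
(my $long = $in) =~ s/(?<=m_description=")(.*?)(?=")/
    my $r = $1; $r =~ s!\n!&#10;!g; $r =~ s!\r!&#13;!g; $r /sge;

# \K (5.10+) plus chained non-destructive /r substitutions (5.14+).
my $short = $in =~ s/m_description="\K(.*?)(?=")/
    $1 =~ s!\n!&#10;!gr =~ s!\r!&#13;!gr /sger;

say $long eq $short ? "same result" : "different result";
say $short;
```

Both forms leave $in untouched here and produce the same encoded string.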

OK, so whilst this looks like an XML problem, it isn't. The XML problem is the person generating it. You should probably give them a prod with a rolled up copy of the spec as your first port of call for "fixing" this.

But failing that - I'd do a two pass approach, where I read the text, find all the 'blobs' that match a description, and then replace them all.

Something like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;

my $text = do { local $/ ;  <DATA> }; 

#filter text for 'description' text: 
my @matches = $text =~ m{m_description=\"([^\"]+)\"}gms;

print Dumper \@matches; 

#Generate a search-and-replace hash
my %replace = map { $_ => s/[\r\n]+/&#13;&#10;/gr } @matches; 
print Dumper \%replace;

#turn the keys of that hash into a search regex
# quotemeta keeps regex metacharacters in the descriptions literal
my $search = join ( "|", map { quotemeta } keys %replace ); 
   $search = qr/\"($search)\"/ms; 

print "Using search regex: $search\n";
#search and replace text block
$text =~ s/m_description=$search/m_description="$replace{$1}"/mgs;

print "New text:\n";
print $text;

__DATA__
preceding lines 
stuff m_description="Over
any number
of lines" other stuff
more lines
