Parsing XML in a Perl program using regex

Question

I have many names in various formats.

Input:

Depsai P.R.N.
D&#x00EA;nis De Castro
John D.J. 
Andrew E.
D.J. JOHN 
JOHN Mical D.J.

I need output like this:

D. P.R.N.
D. C.
J. D.J. 
A. E.
D.J. J.
J. M. D.J.

If the name is like Dênis De Castro, I need the output D. C. If the name contains one of these particles (De|Di|Le|La|Van|Der) in between, that word should not contribute an initial.

    use strict;
    use warnings;
    my $gn = qq(<name>Depsai P.R.N.</name>
                <name>D&#x00EA;nis De Castro</name>
                <name>Andrew E.</name>
                <name>John D.J.</name>
                <name>D.J. John</name>
                <name>John Mical D.J.</name>);
    my @int = $gn =~ m{<name>(.*?)</name>}ig;
    my $ini=();
    foreach my $initial(@int){
        $ini .= "$1\. " while($initial =~ s/(?:^|[ \.\,\;]+)([A-Z])\w*(\b|$)//s);
        $ini =~ s/ $//mi;
        print join("\n",$ini);exit;
    }

Please suggest a regex pattern.
Thanks in advance.

Removing the lowercase letters will give you the desired output.
– Avinash Raj, commented Nov 4, 2014 at 4:43

Praveen · Accepted Answer · 2014-11-27 04:46:22Z

1

You can try the one-liner below:

InputFile:

<name>Depsai P.R.N.</name>
<name>D&#x00EA;nis De Castro</name>
<name>John D.J.</name> 
<name>Andrew E.</name>
<name>D.J. JOHN</name> 
<name>JOHN Mical D.J.</name>
<name>Roc&#x00ED;o</name>

On Windows cmd prompt:

perl -lne "if($_ =~ /<name(>.*?<)\/name>/) {$result = $1; $result =~ s/(\s)(De|Di|Le|La|Van|Der)(\s)/$1$3/g; $result =~ s/((?:>|\s)[A-Z])[^\.]/$1\./g; $result =~ s/.*?(\s*[A-Z]\.\s*).*?/$1/g;$result =~ s/([a-z]|[A-Z][A-Z]).*?<//g;$result =~ s/<//g;print $result;}" InputFile

On Unix:

perl -lne 'if($_ =~ /<name(>.*?<)\/name>/) {$result = $1; $result =~ s/(\s)(De|Di|Le|La|Van|Der)(\s)/$1$3/g; $result =~ s/((?:>|\s)[A-Z])[^\.]/$1\./g; $result =~ s/.*?(\s*[A-Z]\.\s*).*?/$1/g;$result =~ s/([a-z]|[A-Z][A-Z]).*?<//g;$result =~ s/<//g;print $result;}' InputFile

Output:

D. P.R.N.
D. C.
J. D.J. 
A. E.
D.J. J.
J. M. D.J.
R.
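For readers who find the one-liner hard to follow, here is a minimal sketch of the same idea as a small script. It is not the answer's code: the helper name `initials` is hypothetical, and it assumes that only hexadecimal character references such as `&#x00EA;` occur in the input and that a particle is skipped only when it sits between the first and last word.

```perl
use strict;
use warnings;

# Particles skipped when they appear between the first and last word
# (list taken from the question).
my %particle = map { $_ => 1 } qw(De Di Le La Van Der);

# Hypothetical helper: reduce one name to its initials.
sub initials {
    my ($name) = @_;

    # Decode hexadecimal character references such as &#x00EA;
    # (assumption: no other entity forms occur in the input).
    $name =~ s/&#x([0-9A-Fa-f]+);/chr(hex($1))/ge;

    my @words = split ' ', $name;
    my @out;
    for my $i (0 .. $#words) {
        my $word = $words[$i];

        # Skip a particle only when it is neither the first nor the last word.
        next if $i > 0 && $i < $#words && $particle{$word};

        if ($word =~ /^(?:[A-Z]\.)+$/) {
            push @out, $word;    # already initials, e.g. D.J. or P.R.N.
        }
        else {
            push @out, uc(substr($word, 0, 1)) . '.';
        }
    }
    return join ' ', @out;
}

my $xml = join "\n",
    '<name>Depsai P.R.N.</name>',
    '<name>D&#x00EA;nis De Castro</name>',
    '<name>John D.J.</name>',
    '<name>Andrew E.</name>',
    '<name>D.J. JOHN</name>',
    '<name>JOHN Mical D.J.</name>',
    '<name>Roc&#x00ED;o</name>';

# Pull out each <name> element's text and print its initials.
print initials($_), "\n" for $xml =~ m{<name>(.*?)</name>}g;
```

Under these assumptions, Dênis Van Castro would become D. C. and Dênis John La Castro would become D. J. C.; whether that is the intended result is the open question raised in the comments.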

edited Nov 27, 2014 at 4:46

answered Nov 4, 2014 at 5:18


2 Comments

Praveen Over a year ago

You said not to capture words such as (De|Di|Le|La|Van|Der) when they appear in between. Can you tell me the expected output for Dênis Van Castro, and also for Dênis John La Castro?

depsai Over a year ago

@Praveen It is not working for this case: <name>Rocío</name> should come out as <name>R.</name>

vks · Accepted Answer · 2014-11-04 05:33:27Z

0

(?<=[a-zA-Z])[a-zA-Z]+

You can try this. Replace each match with a dot (.). See the demo.

http://regex101.com/r/bB8jY7/12

import re

# Match every run of letters that directly follows another letter.
p = re.compile(r'(?<=[a-zA-Z])[a-zA-Z]+')
test_str = "Depsai P.R.N. \nJohn D.J. \nAndrew E."
subst = "."

result = re.sub(p, subst, test_str)
print(result)
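Since the question is about Perl, the same substitution can be sketched there as well. As far as I know, Perl accepts a fixed-width lookbehind such as `(?<=[a-zA-Z])`, so this sketch keeps the pattern unchanged. The helper name `abbreviate` is hypothetical, and the input is assumed to be plain ASCII with any entities already decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every run of letters that directly follows
# another letter with a single dot, so each word keeps only its first letter.
sub abbreviate {
    my ($text) = @_;

    # (?<=[a-zA-Z]) is a one-character (fixed-width) lookbehind.
    # Work on a copy so the caller's string is left untouched.
    (my $out = $text) =~ s/(?<=[a-zA-Z])[a-zA-Z]+/./g;
    return $out;
}

print abbreviate($_), "\n" for 'Depsai P.R.N.', 'John D.J.', 'Andrew E.';
```

Note that this pattern does not skip particles such as De or Van, and it would mangle undecoded entities like `&#x00EA;`, so it only covers the plain cases.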

edited Nov 4, 2014 at 5:33

answered Nov 4, 2014 at 4:39


6 Comments

depsai Over a year ago

I did not downvote. I have now edited the question. I need a space between the initials only where the name has one, otherwise none. I am working in Perl, and the lookbehind regex is not working for me; it shows an error saying lookbehind is not supported. If a space is present in the name, it should be kept.

vks Over a year ago

@depsai Try now. See the demo.

depsai Over a year ago

Your code works on regex101.com and in RegexBuddy, but in my Perl program the lookbehind regex does not work. Please give a regex that does not use ?<=.

vks Over a year ago

@depsai Try [a-z]+ and replace each match with a dot (.).

depsai Over a year ago

Thanks, it is working. I used ([A-Z])([a-zA-Z]+) and replaced it with $1. (the captured letter followed by a dot).
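The substitution from the last comment can be sketched in Perl as follows. The helper name `abbreviate_capture` is hypothetical, the second capture group is dropped because it is never used, and the input is assumed to be plain ASCII.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the capital letter that starts a word and
# replace the letters after it with a dot. No lookbehind is needed.
sub abbreviate_capture {
    my ($text) = @_;

    # $1 is the captured capital; the literal dot follows it.
    (my $out = $text) =~ s/([A-Z])[a-zA-Z]+/$1./g;
    return $out;
}

print abbreviate_capture($_), "\n" for 'John D.J.', 'D.J. JOHN', 'JOHN Mical D.J.';
```

Existing initials such as D.J. are left alone because each capital there is followed by a dot rather than a letter. Particles (De, Van, and so on) and accented letters are not handled by this pattern.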
