How to match a string using a regular expression

Question

I have a string which contains multiple occurrences of the " ... " where ... is different text.

I am using the "(.*)" regex pattern to split the text into chunks, but this is not working. What would be the correct regex for this?

P.S. The same regex pattern works on iOS using NSRegularExpression, but not on Android using Pattern.

To explain my problem further, I am doing the following:

Pattern p = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);
String[] result = p.split(str);

The result array contains only 1 item, and it is the whole string.

The following is a portion of the file that I am reading:

<BODY>
    <SYNC Start=200>
      <P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
    </SYNC>
    <SYNC Start=2440>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=2560>
      <P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
    </SYNC>
    <SYNC Start=4560>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=66160>
      <P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
    </SYNC>

UPDATE:

Hi everybody, I have found the problem. It was actually the encoding of the file that I was reading. The file was UTF-16 (Little Endian) encoded, and that was what stopped the regex from matching. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
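
The effect described in the update can be reproduced with a small sketch. It assumes the file bytes are available as a byte array; the class name EncodingDemo and the helper findsCue are hypothetical names chosen for illustration, and the sample bytes stand in for the real subtitle file. Decoding UTF-16 (Little Endian) bytes with the wrong charset leaves a NUL character after every ASCII letter, so the tag text in the pattern never appears contiguously.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

class EncodingDemo {
    // Pattern taken from the question.
    static final Pattern CUE =
            Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);

    // Decodes the raw bytes with the given charset and reports
    // whether the pattern finds at least one paragraph.
    static boolean findsCue(byte[] raw, Charset charset) {
        String text = new String(raw, charset);
        return CUE.matcher(text).find();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-16 (Little Endian) subtitle file.
        byte[] raw = "<P Class=ENCC>Hai kawan-kawan.</P>"
                .getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: no match. Right charset: match.
        System.out.println(findsCue(raw, StandardCharsets.UTF_8));
        System.out.println(findsCue(raw, StandardCharsets.UTF_16LE));
    }
}
```

Passing the charset explicitly when decoding is an alternative to converting the file to UTF-8 beforehand.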

@sivaCharan: if this is the string <sync start=200> xxxxxxx </sync> <sync start=2440> yyyyyyy </sync> then yyyyyyy and xxxxxxx should be a match.
– g.revolution, Commented Jun 26, 2012 at 9:08
How exactly are you using the regex? Please paste the code that fails.
– Tim Pietzcker, Commented Jun 26, 2012 at 9:23
@TimPietzcker: I have edited the question and pasted the code that I am using, along with a portion of the file, as the whole file is too big.
– g.revolution, Commented Jun 26, 2012 at 9:34

Community · Accepted Answer · 2017-05-23 12:27:51Z

2

Parsing HTML with regular expressions is not really a good idea. What you should use instead is an HTML parser.

That being said, your issue is most likely that the * operator is greedy. Your question only says that it is not working, so my guess is that the pattern matches from the first opening tag to the very last closing tag. Making the regular expression non-greedy, like so: (.*?) (notice the extra ? after the *), should solve the problem, assuming that the problem is the one I have described.

Even so, I would really recommend that you ditch the regular expression approach and use an appropriate HTML parser.

edited May 23, 2017 at 12:27

CommunityBot

answered Jun 26, 2012 at 9:04

npinti


2 Comments

g.revolution Over a year ago

This is not actually an HTML file, although it uses HTML tags. It is a sort of custom subtitle file that uses HTML tags, and it does not validate either, because of the other non-HTML content in it. I have also tried (.*?), and it is not working either.

npinti Over a year ago

@g.revolution: If that is the case, then I would recommend you provide more information, such as what you actually have, what you are after, and what you are actually getting.

Tim Pietzcker · Accepted Answer · 2012-06-26 09:46:31Z

1

EDIT:

Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:

You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive by default.

Also, by default . does not match newlines, and it's quite likely that your paragraphs do contain newlines.

Therefore, try "(?si)(.*?)". The (?s) modifier allows the dot to match newlines, too, and the (?i) modifier makes the regex case-insensitive.
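
A minimal sketch of how the two modifiers interact, assuming a lowercase tag in the pattern and a paragraph that spans two lines. The class name FlagDemo, the helper found, and the sample input are made up for illustration. Only the combination of (?s) and (?i) finds the paragraph here.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Returns true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Mixed-case tag and a newline inside the paragraph.
        String input = "<P Class=ENCC>line one\nline two</P>";

        // No modifiers: fails on the case difference.
        System.out.println(found("<p class=encc>(.*?)</p>", input));
        // (?i) only: the tag matches, but . cannot cross the newline.
        System.out.println(found("(?i)<p class=encc>(.*?)</p>", input));
        // (?s) only: . crosses the newline, but the case still differs.
        System.out.println(found("(?s)<p class=encc>(.*?)</p>", input));
        // Both modifiers: the paragraph is found.
        System.out.println(found("(?si)<p class=encc>(.*?)</p>", input));
    }
}
```

The same flags can be passed to Pattern.compile as Pattern.DOTALL | Pattern.CASE_INSENSITIVE instead of being embedded in the pattern.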

edited Jun 26, 2012 at 9:46

answered Jun 26, 2012 at 9:20

Tim Pietzcker

3 Comments

g.revolution Over a year ago

I have tried the expression you mentioned too, and it is not working. I am using the Pattern class like this: Pattern p = Pattern.compile("(?s)(.*?)"); String[] result = p.split(str); The result contains only 1 item, and it is the whole string. That is what I am getting.

g.revolution Over a year ago

I have tried with Pattern.CASE_INSENSITIVE and it is still not working.

Tim Pietzcker Over a year ago

It works fine here. Are you aware that when using split(), the regex match will not be part of the result?
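
To illustrate the point about split(): the sketch below collects the captured paragraph text with Matcher.find() and group(1) instead of splitting. The pattern is the one from the question; the class name CueExtractor, the method extract, and the shortened sample text are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CueExtractor {
    // Pattern taken from the question.
    static final Pattern CUE =
            Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);

    // Collects the text captured by group 1 for every match.
    static List<String> extract(String text) {
        List<String> cues = new ArrayList<>();
        Matcher m = CUE.matcher(text);
        while (m.find()) {
            cues.add(m.group(1));
        }
        return cues;
    }

    public static void main(String[] args) {
        String text = "<SYNC Start=200>\n  <P Class=ENCC><i>Kami Tidak Berniat</i></P>\n</SYNC>\n"
                + "<SYNC Start=2440>\n  <P Class=ENCC>&nbsp;</P>\n</SYNC>";

        // The paragraph contents, one entry per match.
        System.out.println(extract(text));

        // split() returns the text between the matches (the SYNC lines),
        // never the paragraphs themselves.
        for (String piece : CUE.split(text)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

If split() returns a single element containing the whole input, the pattern did not match at all, which points to a problem with the input rather than with the choice between split() and find().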

Arcadien · Accepted Answer · 2012-06-26 09:05:53Z

0

The .* may also match <. You can try:

<p class=a>([^<]*)</p>
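
A quick way to check this pattern in Java, as a sketch only: the class name NegatedClassDemo, the helper capture, and the two sample inputs are hypothetical. Because [^<]* refuses to consume any <, the pattern finds a paragraph that contains plain text but not one that contains another tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassDemo {
    static final Pattern P = Pattern.compile("<p class=a>([^<]*)</p>");

    // Returns the captured paragraph text, or null if there is no match.
    static String capture(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Plain text inside the paragraph: captured.
        System.out.println(capture("<p class=a>plain text</p>"));
        // A nested tag inside the paragraph: no match, so null.
        System.out.println(capture("<p class=a><i>italic</i></p>"));
    }
}
```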

answered Jun 26, 2012 at 9:05

Arcadien

2 Comments

Tim Pietzcker Over a year ago

This only works if no other tags occur within a paragraph, which is unlikely.

Arcadien Over a year ago

Are you absolutely sure about your input string? "(.*?)" also works with JavaScript regex.

flec · Accepted Answer · 2012-06-26 09:07:22Z

0

I guess the problem is that your pattern is greedy. You should use this instead:

"<p class=a>(.*?)</p>"

If you have this string:

"<p class=a>first</p><p class=a>second</p>"

your greedy pattern ("(.*)") will match all of it:

"<p class=a>first</p><p class=a>second</p>"

while the non-greedy "(.*?)" only matches

"<p class=a>first</p>"
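
The difference can be checked in Java with a short sketch. The class name GreedyDemo and the helper firstMatch are illustrative names, and the patterns are written out with the <p class=a> tags used in the example string above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<p class=a>first</p><p class=a>second</p>";

        // Greedy: .* runs to the last </p>, so the whole string is one match.
        System.out.println(firstMatch("<p class=a>(.*)</p>", input));

        // Non-greedy: .*? stops at the first </p>.
        System.out.println(firstMatch("<p class=a>(.*?)</p>", input));
    }
}
```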

edited Jun 26, 2012 at 9:07

answered Jun 26, 2012 at 9:01

flec
