
I have a string that contains multiple occurrences of "<p class=a> ... </p>", where ... is different text each time.

I am using the regex pattern "<p class=a>(.*)</p>" to split the text into chunks, but this is not working. What would be the correct regex for this?

P.S. The same regex pattern works on iOS using NSRegularExpression but does not work on Android using Pattern.

To explain my problem further, I am doing the following:

Pattern p = Pattern.compile("(?s)<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE);
String[] result = p.split(str);

The result array contains only one item, and it is the whole string.

The following is a portion of the file that I am reading:

<BODY>
    <SYNC Start=200>
      <P Class=ENCC><i>Cerita, Watak, Adegan dalam</i><br/><i>Drama Ini Rekaan Semata-Mata.</i></P>
    </SYNC>
    <SYNC Start=2440>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=2560>
      <P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>
    </SYNC>
    <SYNC Start=4560>
      <P Class=ENCC>&nbsp;</P>
    </SYNC>
    <SYNC Start=66160>
      <P Class=ENCC>Hai kawan-kawan.<br/>Inilah bandaraya Banting.</P>
    </SYNC>

UPDATE:

Hi everybody, I have found the problem. It was the encoding of the file I was reading: the file was UTF-16 (little endian) encoded, and that was what caused the regex to fail. I changed it to UTF-8 and everything started working. Thanks, everybody, for your support.
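
For later readers, here is a minimal sketch of why a charset mismatch can defeat the regex. It assumes the file's bytes were being decoded with a single-byte-compatible charset such as UTF-8 while the file itself was UTF-16LE (the original post does not show the reading code, so that part is a guess). The class name EncodingDemo and the helper countMatches are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingDemo {
    // Same idea as the pattern in the question: lazy body, dot matches newlines, ignore case.
    static final Pattern P_TAG =
            Pattern.compile("<P Class=ENCC>(.*?)</P>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Counts how many <P Class=ENCC>...</P> blocks the pattern finds in the text.
    static int countMatches(String text) {
        Matcher m = P_TAG.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String original = "<P Class=ENCC>Hai kawan-kawan.</P>\n<P Class=ENCC>&nbsp;</P>";

        // Simulate the file on disk: UTF-16 little-endian bytes (assumed, per the update).
        byte[] fileBytes = original.getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: each ASCII character ends up followed by a NUL character,
        // so the literal "<P Class=ENCC>" never appears contiguously.
        String wrong = new String(fileBytes, StandardCharsets.UTF_8);

        // Matching charset: the text is restored exactly.
        String right = new String(fileBytes, StandardCharsets.UTF_16LE);

        System.out.println("decoded as UTF-8:    " + countMatches(wrong) + " matches");
        System.out.println("decoded as UTF-16LE: " + countMatches(right) + " matches");
    }
}
```

Instead of converting the file, passing the right charset to the reader (for example an InputStreamReader built with StandardCharsets.UTF_16LE) should also work, assuming the file really is UTF-16LE.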

5 Comments
  • Can you provide list of match cases and non-match cases? Commented Jun 26, 2012 at 9:02
  • @nikola: (.*?) is also not working. Commented Jun 26, 2012 at 9:06
  • @sivaCharan: If the string is <sync start=200> <p class=a>xxxxxxx</p> </sync> <sync start=2440> <p class=a>yyyyyyy</p> </sync>, then <p class=a>xxxxxxx</p> and <p class=a>yyyyyyy</p> should each be a match. Commented Jun 26, 2012 at 9:08
  • How exactly are you using the regex? Please paste the code that fails. Commented Jun 26, 2012 at 9:23
  • @TimPietzcker: I have edited the question and pasted the code I am using, along with a portion of the file, since the whole file is too big. Commented Jun 26, 2012 at 9:34

4 Answers

2

Parsing HTML with regular expressions is not really a good idea (reason here). What you should use is an HTML parser such as this.

That being said, your issue is most likely that the * operator is greedy. Your question only says that the pattern is not working, so my guess is that it matches from the first <p class=a> all the way to the very last </p>. Making the regular expression non-greedy, like so: <p class=a>(.*?)</p> (notice the extra ?, which makes the * operator non-greedy), should solve the problem, assuming the problem is the one I have just described.

That being said, I would really recommend you ditch the regular expression approach and use appropriate HTML Parsers.
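
The greedy versus non-greedy difference described above can be checked with a small sketch like the following. The class name GreedyVsLazy, the helper matches, and the sample string are made up for illustration; only java.util.regex is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazy {
    // Returns the full text of every match the regex finds in the input.
    static List<String> matches(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<p class=a>first</p><p class=a>second</p>";

        // Greedy: .* runs to the last </p>, so both paragraphs form one match.
        System.out.println(matches("<p class=a>(.*)</p>", text));

        // Non-greedy: .*? stops at the nearest </p>, so each paragraph is its own match.
        System.out.println(matches("<p class=a>(.*?)</p>", text));
    }
}
```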


2 Comments

This is not actually an HTML file (although it uses HTML tags). It is a sort of custom subtitle file that uses HTML tags, and it does not validate either (because of other non-HTML content in the file). I have also tried <p class=a>(.*?)</p>, and it is not working either.
@g.revolution: If that is the case, then I would recommend you provide more information, such as what you actually have, what you are after, and what you are actually getting.
1

EDIT:

Now that you've posted the code and the text you're matching against, one thing immediately leaps to mind:

You're matching <p class..., but your string contains <P Class.... Regexes are case-sensitive by default.

Also, by default . does not match newlines, and it's quite likely that your paragraphs do contain newlines.

Therefore, try "(?si)<p class=a>(.*?)</p>". The (?s) modifier allows the dot to match newlines, too, and the (?i) modifier makes the regex case-insensitive.
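
A minimal sketch of the effect of the two modifiers, using a made-up class FlagsDemo and helper bodies (neither comes from the question). The sample text deliberately uses mixed case and a newline inside the paragraph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagsDemo {
    // Returns the captured body (group 1) of every paragraph the regex finds.
    static List<String> bodies(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<P Class=a>line one\nline two</P>";

        // No modifiers: fails on the case mismatch.
        System.out.println(bodies("<p class=a>(.*?)</p>", text).size());

        // (?i) only: case is fine, but the dot cannot cross the newline.
        System.out.println(bodies("(?i)<p class=a>(.*?)</p>", text).size());

        // (?si): both problems handled, the paragraph body is captured.
        System.out.println(bodies("(?si)<p class=a>(.*?)</p>", text).size());
    }
}
```

The same flags can also be passed as Pattern.CASE_INSENSITIVE | Pattern.DOTALL in the second argument of Pattern.compile.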

3 Comments

I have tried the expression you mentioned too, and it is not working. I am using the Pattern class like this: Pattern p = Pattern.compile("(?s)<p class=a>(.*?)</p>"); String[] result = p.split(str); The result contains only one item, and it is the whole string. That is what I am getting.
I have tried with Pattern.CASE_INSENSITIVE and it is still not working.
It works fine here. Are you aware that when using split(), the regex match will not be part of the result?
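
To illustrate the point about split(): the matched paragraphs are treated as delimiters and dropped, so the pieces returned are the text between them. If the paragraph contents are what is wanted, a Matcher.find() loop is one way to collect them. This is only a sketch; the class name SplitVsFind and the helper findBodies are made up, and the sample string is the one from the comment under the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitVsFind {
    static final Pattern P = Pattern.compile("(?si)<p class=a>(.*?)</p>");

    // Collects the text inside each <p class=a>...</p> using find() and group(1).
    static List<String> findBodies(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<sync start=200> <p class=a>xxxxxxx</p> </sync> "
                + "<sync start=2440> <p class=a>yyyyyyy</p> </sync>";

        // split() returns what lies between the matches, not the matches themselves.
        for (String piece : P.split(text)) {
            System.out.println("split piece: [" + piece + "]");
        }

        // find() returns the matches; group(1) is the paragraph body.
        System.out.println("found bodies: " + findBodies(text));
    }
}
```

Note that when the pattern matches nothing at all, split() returns a one-element array holding the whole input, which is the symptom described in the question.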
0

The .* may also match <. You can try:

<p class=a>([^<]*)</p>

2 Comments

This only works if no other tags occur within a paragraph, which is unlikely.
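
A small sketch of that limitation, using two paragraphs from the sample in the question. The class name NegatedClassDemo and the helper count are made up; the class attribute is changed to ENCC to match the sample data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // Counts how many times the regex matches in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "<P Class=ENCC><i>Kami Tidak Berniat</i><br/><i>Melukakan Hati Sesiapa.</i></P>"
                + "<P Class=ENCC>&nbsp;</P>";

        // [^<]* cannot cross the nested <i> and <br/> tags, so the first paragraph is missed.
        System.out.println(count("(?i)<p class=encc>([^<]*)</p>", text));

        // The lazy dot accepts nested tags, so both paragraphs are found.
        System.out.println(count("(?si)<p class=encc>(.*?)</p>", text));
    }
}
```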
Are you absolutely sure about your input string? "<p class=a>(.*?)</p>" also works with JavaScript regex.
0

I guess the problem is that your pattern is greedy. You should use this instead.

"<p class=a>(.*?)</p>"

If you have this string:

"<p class=a>first</p><p class=a>second</p>"

Your pattern ("<p class=a>(.*)</p>") will match the whole string in one go:

"<p class=a>first</p><p class=a>second</p>"

While "<p class=a>(.*?)</p>" stops at the nearest </p>, so its first match is only

"<p class=a>first</p>"
