
I have an HTML string produced on Android from a Spannable string:

<p dir="ltr"><b><b><b><b><b>qwert</b></b></b></b></b><b><b><b><b><b><b>y</b></b></b></b></b></b></p>

As you can see, there are multiple occurrences of the same tag.

I have tried trial and error with methods like replaceAll(), but they replace all occurrences.

What I want is this: when I pass a substring to find, let's say "<b>", it should replace, say, the first five consecutive bold tags in the above string with a single "<b>" tag.

Any suggestions?

Required result: <p dir="ltr"><b>qwert</b><b>y</b></p>

  • Link does not work. I have no issue with Android-to-HTML parsing. It's just that I want to process the above string and remove duplicates. Commented Mar 21, 2014 at 6:09
  • What is the output you'd like to get from your sample input? What is the regex you're currently using? Commented Mar 21, 2014 at 6:13
  • I am not familiar with the Matcher class. Please see my edit. I have updated my question. Commented Mar 21, 2014 at 6:16
  • Why two <b> after qwert? Commented Mar 21, 2014 at 6:18

2 Answers

If I understand your problem correctly, you can try this regex then:

(<[^>]+>)\\1+

And replace with:

$1

In code...

String test = "<p dir=\"ltr\"><b><b><b><b><b>qwert</b></b></b></b></b><b><b><b><b><b><b>y</b></b></b></b></b></b></p>";
String out = test.replaceAll("(<[^>]+>)\\1+", "$1");

Output:

<p dir="ltr"><b>qwert</b><b>y</b></p>

(<[^>]+>) matches the first tag it finds and captures it in group 1.

\\1 in the regex (written \1 outside a Java string literal) is a backreference to the first captured tag. The + means one or more repetitions (well, there is a limit, but it is a big number I don't think you need to worry about).

The replacement $1 then also refers to the first captured tag.
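If you need to apply this more than once, it can help to compile the pattern a single time and wrap it in a helper. The following is only a minimal sketch of the same idea using Pattern and Matcher; the class name TagCollapser and the method collapse are made up for illustration and are not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCollapser {
    // Group 1 captures one tag; \1+ requires that exact tag to repeat
    // immediately, one or more times.
    private static final Pattern REPEATED_TAG = Pattern.compile("(<[^>]+>)\\1+");

    // Hypothetical helper: collapses each run of identical adjacent tags
    // into a single copy of that tag.
    public static String collapse(String html) {
        Matcher m = REPEATED_TAG.matcher(html);
        return m.replaceAll("$1");
    }

    public static void main(String[] args) {
        String test = "<p dir=\"ltr\"><b><b><b><b><b>qwert</b></b></b></b></b>"
                + "<b><b><b><b><b><b>y</b></b></b></b></b></b></p>";
        // prints <p dir="ltr"><b>qwert</b><b>y</b></p>
        System.out.println(collapse(test));
    }
}
```

Note that this only collapses tags that are textually identical and directly adjacent; it does not understand HTML nesting.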



4 Comments

I am new to regex patterns. Your code above works fine. Can you explain the process and what the square brackets in the pattern mean?
Okay, < and > mean these symbols themselves. [^>]+ is a character class. It means any character except >, repeated at least once. If I had [^a]+, that would mean any character except a, repeated at least once. Does that help? Is there more you want to ask about?
Yes, thanks. If my string has <b><i><b><i><b>, can I pattern-match alternate "<b>" tags and replace them?
@RahulGupta That could be a problem... which (if it works), will make your example input become: <p dir="ltr"><b>qwert</b>y</p> and I'm not sure that's something you want.
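To make the character class explanation from the comments concrete, here is a small sketch that tests whole strings against <[^>]+> on its own. The class name CharClassDemo and the method isSingleTag are hypothetical names chosen for this example.

```java
public class CharClassDemo {
    // String.matches requires the whole string to match, so this is true
    // only for "<", then one or more characters that are not ">", then ">".
    public static boolean isSingleTag(String s) {
        return s.matches("<[^>]+>");
    }

    public static void main(String[] args) {
        System.out.println(isSingleTag("<b>"));             // true
        System.out.println(isSingleTag("<p dir=\"ltr\">")); // true: attributes contain no ">"
        System.out.println(isSingleTag("<>"));              // false: + needs at least one character
        System.out.println(isSingleTag("<b><b>"));          // false: [^>]+ cannot cross the first ">"
    }
}
```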
You want something like this:

find : (<b>)\1+|(<\/b>)\2+

replace: \1\2

Demo here: http://regex101.com/r/aC6iP4
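In Java the backreferences in the replacement are written with $ rather than \. The sketch below is an adaptation, not the exact pattern above: it folds both alternatives into one group, (</?b>), so the replacement never has to reference a group that did not take part in the match. The class name BoldTagCollapser and the method collapseBold are hypothetical.

```java
public class BoldTagCollapser {
    // (</?b>) captures either <b> or </b>; \1+ then requires the same
    // captured tag to repeat, so opening and closing runs are handled alike.
    public static String collapseBold(String html) {
        return html.replaceAll("(</?b>)\\1+", "$1");
    }

    public static void main(String[] args) {
        String test = "<p dir=\"ltr\"><b><b><b><b><b>qwert</b></b></b></b></b>"
                + "<b><b><b><b><b><b>y</b></b></b></b></b></b></p>";
        // prints <p dir="ltr"><b>qwert</b><b>y</b></p>
        System.out.println(collapseBold(test));
    }
}
```

Unlike the generic tag pattern, this one leaves repeated tags other than <b> and </b> untouched.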
