0

I crawl a site and get some prices from it. Each price comes with its currency (21,00 TL). I need to remove the currency (TL) and the whitespace to its left so that I can convert the string to a double. In short, I should get 21.00. Whatever I did, I couldn't remove that whitespace.

What I get from the crawler:

<b>21,00&nbsp;TL</b>

What I tried:

price_lower_str = price_lower_str.replace("&nbsp;TL","");

and 

price_lower_str = price_lower_str.replace(" TL","");

price_lower_str = price_lower_str.replace("TL","");
price_lower_str = price_lower_str.trim();

But I still couldn't get just 21.00. Can anyone help me?

Thanks

  • How about trim()? price.trim() will remove the trailing whitespace. Commented Apr 17, 2014 at 13:19
  • I already did that: price_lower_str = price_lower_str.trim(); Commented Apr 17, 2014 at 13:21
  • Oh, it was Java syntax, I didn't get it. Commented Apr 17, 2014 at 13:26

3 Answers

1

Quick and dirty, but working :-)

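import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    String str = "<b>21,00&nbsp;TL</b>";
    // Capture the first "digits,digits" run and ignore everything around it.
    Matcher matcher = Pattern.compile(".*?(\\d+,\\d+).*").matcher(str);
    if (matcher.matches()) System.out.println(matcher.group(1).replace(',', '.'));
}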

OUTPUT:

21.00
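
If you want a double rather than a printed string, the same idea could be wrapped up roughly like this. This is only a sketch: the class name PriceExtractor and the method parse are made up for illustration, and it assumes every price looks like "digits,digits". Input without such a number gives an empty result instead of an exception.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceExtractor {
    // One or more digits, a decimal comma, one or more digits.
    private static final Pattern PRICE = Pattern.compile("(\\d+),(\\d+)");

    // Hypothetical helper: returns the first price found, or empty if there is none.
    public static OptionalDouble parse(String html) {
        Matcher m = PRICE.matcher(html);
        if (!m.find()) {
            return OptionalDouble.empty();
        }
        // Rebuild the number with a dot so Double.parseDouble accepts it.
        return OptionalDouble.of(Double.parseDouble(m.group(1) + "." + m.group(2)));
    }

    public static void main(String[] args) {
        System.out.println(parse("<b>21,00&nbsp;TL</b>"));          // holds 21.0
        System.out.println(parse("BLAHBLAH&nbsp;TL").isPresent());   // false
    }
}
```

Using find() instead of matches() means the pattern does not need the leading and trailing .* parts.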

8 Comments

This fails to remove &nbsp; specified by OP's input.
It does not fail, because it matches only the digits, the comma, and the digits. Nevertheless, it is ugly. It is easy for you to verify, right?
I think there was a glitch in your code, probably a copy-paste error. Otherwise, it seems to give the proper requested output.
That's better :) Glitch fixed.
Except the question was how to remove the whitespace, not how to extract the digits. OP hasn't specified that those digits and commas are the only valid values in those cells. What happens when the next input value is BLAHBLAH&nbsp;TL?
1

You're just using the wrong regular expression. Try this:

price_lower_str = price_lower_str.replaceAll("(\\&nbsp;|\\s)+TL", "");

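First, I'm using replaceAll and not just replace as you are: replaceAll treats its first argument as a regular expression, whereas replace matches the text literally. Second, notice the parentheses: I'm replacing EITHER &nbsp; OR \s, which matches any ASCII whitespace character, repeated one or more times. Finally, I'm escaping the ampersand in &nbsp; with a backslash. That escape is harmless but not strictly required, since & is not a metacharacter outside a character class. Having to double every backslash inside a Java string literal is a pain, but welcome to Java regex.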
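
Here is a rough sketch of how that pattern could be used end to end. The class name PriceCleaner and both method names are made up for illustration. The extra \u00A0 alternative is an assumption on my part: if the crawler has already decoded &nbsp; into a real non-breaking space, neither trim() nor Java's default \s will match it, which might explain why the attempts in the question did nothing.

```java
public class PriceCleaner {
    // Strips "TL" together with whatever separates it from the number:
    // the literal entity "&nbsp;", ASCII whitespace, or a decoded non-breaking space.
    public static String stripCurrency(String raw) {
        return raw.replaceAll("(&nbsp;|\\s|\\u00A0)+TL", "").trim();
    }

    // Swaps the decimal comma for a dot so Double.parseDouble accepts the value.
    public static double toDouble(String raw) {
        return Double.parseDouble(stripCurrency(raw).replace(',', '.'));
    }

    public static void main(String[] args) {
        System.out.println(toDouble("21,00&nbsp;TL"));  // entity still encoded
        System.out.println(toDouble("21,00\u00A0TL"));  // entity already decoded
        System.out.println(toDouble("21,00 TL"));       // plain space
    }
}
```

Each call prints 21.0. If you need the text 21.00 again, format the double with String.format and an explicit locale.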

1 Comment

I tested this on your input string 21,00&nbsp;TL and it does work. You need to provide which input you're using that it doesn't work on. What do you have, and what do you expect?
1

Using regexes sounds too heavy for such simple processing, and it's not really efficient in this case. What you could do instead is locate the > that closes the <b> tag and take a substring up to the ampersand.

System.out.println(test.substring(test.indexOf(">")+1, test.indexOf("&")));

You will get 21,00; replace the comma with a dot to get 21.00.
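
A slightly fuller sketch of the same substring idea, with guards so it does not throw when the markers are missing. The class name SubstringPrice and the method extract are hypothetical, and it assumes the input still contains both the <b> tag and the encoded &nbsp; entity.

```java
public class SubstringPrice {
    // Returns the text between the first '>' and the first '&' after it,
    // or null when the input does not have that shape.
    public static String extract(String html) {
        int start = html.indexOf('>');
        int end = html.indexOf('&', start + 1);
        if (start < 0 || end < 0) {
            return null;
        }
        return html.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String test = "<b>21,00&nbsp;TL</b>";
        String price = extract(test);                 // "21,00"
        System.out.println(price.replace(',', '.'));  // prints 21.00
    }
}
```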
