I know there are many topics on this question, but I couldn't find a solution that helps. I can connect to the website and read it line by line in Java; here is my problem. I want to parse specific information from an HTML page. The page contains a 5-day weather forecast. For example, the tags for the first day look like this:

// date of forecast
<th id="ctl00_mpBody_thmGun1" class="arkaTrh">19 April</th>

// min temperature
<td id="ctl00_mpBody_thmMin1" class="minS">8</td>

// max temperature
<td id="ctl00_mpBody_thmMax1" class="maxS">17</td>

The tags for the second day and the following days continue in the same way:
<th id="ctl00_mpBody_thmGun2" class="arkaTrh">20 April</th>
.
.
.

From these tags, I need to extract 19 April, 17 and 8.

3 Answers

FOR THE LOVE OF GOD DO NOT USE A REGEX. I don't know how many times this has to be repeated on SO. You'll end up in a world of pain. Use a parser; there are loads available in Java. Here are some of them:

Jericho

Dom4j

htmlparser

But there are dozens more. Just Google "html parser java" or "java dom parser" or something. Please.

1 Comment

Yeah, I gave up on regex and solved my problem with Jsoup: Elements link = doc.select("th[id=ctl00_mpBody_thmGun" + i + "]");

You could craft a regex like this:

id="ctl00_mpBody_thmGun1"[^>]*?>(.*?)<
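
As a rough sketch of how that pattern might be applied in Java, using only java.util.regex: the class name ForecastRegex and the helper textOfId are made up for illustration, and the sample markup is copied from the question. It assumes the id attribute is written exactly as id="..." and that the cell contains plain text with no nested tags, which is exactly the kind of assumption that makes regex fragile on real pages.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ForecastRegex {

    // Hypothetical helper: returns the text between the opening tag that
    // carries the given id and the next '<', or empty if nothing matches.
    static Optional<String> textOfId(String html, String id) {
        // Pattern.quote keeps any regex metacharacters in the id literal.
        Pattern p = Pattern.compile("id=\"" + Pattern.quote(id) + "\"[^>]*?>(.*?)<");
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample markup taken from the question (first forecast day only).
        String html =
              "<th id=\"ctl00_mpBody_thmGun1\" class=\"arkaTrh\">19 April</th>"
            + "<td id=\"ctl00_mpBody_thmMin1\" class=\"minS\">8</td>"
            + "<td id=\"ctl00_mpBody_thmMax1\" class=\"maxS\">17</td>";

        System.out.println("Date: " + textOfId(html, "ctl00_mpBody_thmGun1").orElse("?"));
        System.out.println("Min:  " + textOfId(html, "ctl00_mpBody_thmMin1").orElse("?"));
        System.out.println("Max:  " + textOfId(html, "ctl00_mpBody_thmMax1").orElse("?"));
    }
}
```

For the other days, the numeric suffix of the id (thmGun2, thmMin2, ...) could be built in a loop.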

But if you want a more robust solution it would be better to sanitize the HTML and select the data with XPath: http://www.ibm.com/developerworks/library/x-javaxpathapi.html
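
A minimal sketch of the XPath route with the JDK's built-in javax.xml.xpath API, assuming the page has already been sanitized into well-formed XML (the wrapping table and tr elements in the sample are an assumption, as are the class name ForecastXPath and the helper textOfId). Raw HTML from a live site will usually not parse this way without a cleanup step first.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class ForecastXPath {

    // Hypothetical helper: returns the text of the first element with the
    // given id, or an empty string if no element matches.
    // Assumes the id contains no quote characters.
    static String textOfId(String xhtml, String id) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate("//*[@id='" + id + "']",
                                  new InputSource(new StringReader(xhtml)));
        } catch (XPathExpressionException e) {
            // Thrown for a bad expression or for input that is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed stand-in for the sanitized page (assumed structure).
        String xhtml =
              "<table><tr>"
            + "<th id=\"ctl00_mpBody_thmGun1\" class=\"arkaTrh\">19 April</th>"
            + "<td id=\"ctl00_mpBody_thmMin1\" class=\"minS\">8</td>"
            + "<td id=\"ctl00_mpBody_thmMax1\" class=\"maxS\">17</td>"
            + "</tr></table>";

        System.out.println("Date: " + textOfId(xhtml, "ctl00_mpBody_thmGun1"));
        System.out.println("Min:  " + textOfId(xhtml, "ctl00_mpBody_thmMin1"));
        System.out.println("Max:  " + textOfId(xhtml, "ctl00_mpBody_thmMax1"));
    }
}
```

This version re-parses the input on every call, which keeps the sketch short; for a whole page it would be cheaper to parse once into a DOM Document and evaluate each expression against that.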

You can use HtmlUnit. It was designed for unit testing web pages but you can use it to parse HTML code. You can get your forecast data using something like this:

final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://url.to.your.data");

// Get the first day's cells. getFirstByXPath returns the first match (or null
// if there is none) already typed, so no cast from an untyped list is needed.
HtmlTableDataCell minTemp = page.getFirstByXPath("//td[@class='minS']");
HtmlTableDataCell maxTemp = page.getFirstByXPath("//td[@class='maxS']");
HtmlTableHeaderCell date = page.getFirstByXPath("//th[@class='arkaTrh']");

// asText() is the older API; newer HtmlUnit releases use asNormalizedText() instead.
System.out.println("Forecast for " + date.asText() + " - Min: " + minTemp.asText() + ", Max: " + maxTemp.asText());

1 Comment

I wrote my answer before you said you need to use regex. I think HtmlUnit is much easier than regex, but if regex is a requirement, my answer is not for you.
