I have some issues with a Java program that extracts information from an HTML table. To get the content of every column I use the following regex:

<td>([^<]*)</td>

This works very well for me. To get the link names I use this:

<a[^>]*>(.*?)</a>

This also works very well. But sometimes I need information from a column that contains a link. So I tried to combine the two regexes with:

<td>([^<]*)</td>|<a[^>]*>(.*?)</a>

I thought it would work like this:

  • It gets everything between <td> and </td>

  • If that content is a link, it returns just the link name

But this is not working. I'm not the best at regex, so I need help combining these two steps.

Thanks very very very much.

  • "I've some issues with a program which is fetching information out of an html table in Java." Don't parse HTML with a regex. Commented Nov 3, 2014 at 20:01
  • What does "this is not working" mean? Please give us the code you are using, and a short reproducible example that shows your problem clearly. Commented Nov 3, 2014 at 20:02
  • I know that a lot of people prefer not to parse HTML with regex. But it has always worked for me, and I know there must be a way to combine the two. Commented Nov 3, 2014 at 20:02

1 Answer

The code I'm using:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("<td>([^<]*)</td>|<a[^>]*>(.*?)</a>");

String line = "Here are the lines saved from the HTML downloader";

Matcher matcher = pattern.matcher(line);
// find() resumes after the previous match, so no explicit start index is needed
while (matcher.find())
   {
        // group(1) is only set when the <td> alternative matched
        System.out.println(matcher.group(1));
   }

This is just a snippet, but that's how it works in general. (Normally the string is stored in an array.)

3 Comments

matcher.group(1) returns null if a link was found.
They are already combined. To merge them into a single capture group, Java would have to support branch reset, which it doesn't. On every match, one of the two groups will be null and the other won't. All you have to do is check which one. And don't confuse null with the empty string.
I have now tried JSoup to parse all of this and it works a lot better, but thank you all for your help!
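
For anyone who still wants the regex route, the "check which group is non-null" advice from the comment above could look roughly like the sketch below. It reuses the pattern from the question unchanged; the class name CellExtractor, the method extract, and the sample row are made up for illustration. Note that the second alternative matches a link anywhere in the line, not only inside a <td>, so this is a minimal sketch rather than a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CellExtractor {
    // Same alternation as in the question:
    // group 1 = plain cell text, group 2 = link text.
    private static final Pattern CELL_OR_LINK =
            Pattern.compile("<td>([^<]*)</td>|<a[^>]*>(.*?)</a>");

    // Hypothetical helper: collects whichever group took part in each match.
    static List<String> extract(String line) {
        List<String> values = new ArrayList<>();
        Matcher matcher = CELL_OR_LINK.matcher(line);
        while (matcher.find()) {
            String cellText = matcher.group(1); // null when the link alternative matched
            String linkText = matcher.group(2); // null when the <td> alternative matched
            // Test for null, not isEmpty(): an empty cell is a valid match with group 1 == "".
            values.add(cellText != null ? cellText : linkText);
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample row: a plain cell, a cell containing a link, and an empty cell.
        String line = "<tr><td>Alice</td><td><a href=\"/bob\">Bob</a></td><td></td></tr>";
        System.out.println(extract(line)); // prints [Alice, Bob, ]
    }
}
```

The empty third cell shows why the null check matters: it matches the first alternative with an empty string in group 1, which is different from group 1 being null.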
