Getting table inside String - Regex java

Question

I have HTML inside a Java string. The string contains many tables, and some of them have div tags inside. I'm trying to extract the tables that contain div tags using regex, but I'm having difficulty with it.

Example of string:

<table>
  Normal table
</table>

<table>   <--- I want to get this table
  <tr>
    <td>
      <div> 
        ...
      </div>
    </td>
  </tr>
  ...
</table>

I tried <table.*<div.*</div>.*</table> as the regex, but it gives me the whole string and not just the second table. I also tried something like <table(.^(</table>))*<div.*</div>.*</table>, but it doesn't work.

**** EDIT **** A simple code

    String test = "<table>Normal table</table><table>   <--- I want to get this table<tr>" +
                  "<td><div>...</div></td></tr>...</table>";

    Pattern pattern = Pattern.compile("<table.*<div.*</div>.*</table>", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(test);
    if (matcher.find())
        System.out.println("Teste " + matcher.group());

You have a problem, so you've decided to use regexes. Now you have two problems.
– Alnitak, Commented May 6, 2015 at 19:02

jakub.petr · Accepted Answer · 2015-05-06 19:53:41Z

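How about using XPath? The code below uses the HtmlCleaner library to turn the HTML into a DOM, then selects the first table that contains a div with the expression //table[.//div]. This should work alright.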

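import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TableParse {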

    private static final String HTML = "<table>\n"+
            "  Normal table\n"+
            "</table>\n"+
            "\n"+
            "<table> \n"+
            "  <tr>\n"+
            "    <td>\n"+
            "      <div> \n"+
            "        ...\n"+
            "      </div>\n"+
            "    </td>\n"+
            "  </tr>\n"+
            "</table>";

    public static void main(String[] args) throws Exception {
        xpath();
    }

    public static void xpath() throws Exception {
        TagNode tagNode = new HtmlCleaner().clean(HTML);
        Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);

        XPath xpath = XPathFactory.newInstance().newXPath();
        Node tableNode = (Node) xpath.evaluate("//table[.//div]", doc, XPathConstants.NODE);

        StringWriter writer = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(tableNode), new StreamResult(writer));
        String xml = writer.toString();

        System.out.println(xml);
    }

}
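Since the question mentions many tables, it may help to see the same XPath expression return every match rather than only the first. The following is a minimal sketch that uses only the JDK, under the assumption that the markup is already well-formed XML (real-world HTML would still need a cleaner such as HtmlCleaner first). The class name AllTablesWithDiv, the method tablesWithDiv, and the <root> wrapper element are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AllTablesWithDiv {

    // Returns the serialized markup of every <table> that has a <div> descendant.
    // Assumption: the input is well-formed XML with a single root element.
    static List<String> tablesWithDiv(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));

            // NODESET asks for all matching nodes; NODE would return only the first.
            NodeList tables = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//table[.//div]", doc, XPathConstants.NODESET);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            List<String> result = new ArrayList<>();
            for (int i = 0; i < tables.getLength(); i++) {
                StringWriter writer = new StringWriter();
                transformer.transform(new DOMSource(tables.item(i)), new StreamResult(writer));
                result.add(writer.toString());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<root>"
                + "<table>Normal table</table>"
                + "<table><tr><td><div>first</div></td></tr></table>"
                + "<table><tr><td><div>second</div></td></tr></table>"
                + "</root>";
        for (String table : tablesWithDiv(markup)) {
            System.out.println(table);
        }
    }
}
```

Note that with nested tables the expression selects every table that has a div anywhere beneath it, so an outer table is returned along with an inner one that holds the div.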

Guillaume · Accepted Answer · 2015-05-06 19:28:28Z

Regular expressions are meant to parse regular languages, which are based on a regular grammar. HTML is not defined by a regular grammar, so please do not use regex to parse HTML.

There are a lot of good and simple HTML parsers for Java; have a look at them. JSoup is a good starting point.
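One concrete way to see the limitation is nesting: a pattern cannot keep count of how many tables are open. The sketch below is only an illustration, with a made-up class name NestedTableRegex and a deliberately simple lazy pattern that is not taken from the question. It stops at the first closing tag and returns an unbalanced fragment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestedTableRegex {

    // Returns the first substring matched by a lazy <table>...</table> pattern,
    // or null when nothing matches.
    static String firstTable(String html) {
        Matcher matcher = Pattern.compile("<table>.*?</table>", Pattern.DOTALL).matcher(html);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String nested = "<table><tr><td><table>inner</table></td></tr></table>";
        // The match ends at the inner </table>, so the outer table is cut short.
        System.out.println(firstTable(nested));
        // prints: <table><tr><td><table>inner</table>
    }
}
```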


karthik manchala · Accepted Answer · 2015-05-06 19:38:12Z

If you still want to use regex for your task even after reading the comments, you can use the following:

<table>(?=(?:(?!</table>)[\\s\\S])*?<div>)[\\s\\S]*?</table>

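Explanation:

After matching <table>, the lookahead scans forward for a <div> tag but refuses to step past a </table> (table end) tag, so only a table that contains a <div> before its own closing tag can match. The lazy [\\s\\S]*?</table> then consumes up to that first closing tag. The backslashes are doubled because the pattern is written as a Java string literal. Note that the pattern expects the literal tags <table> and <div>, without attributes.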

Java code:

String test =  "<table>Normal table</table><table>   <--- I want to get this table<tr>" +
               "<td><div>...</div></td></tr>...</table>";

Pattern pattern = Pattern.compile("<table>(?=(?:(?!</table>)[\\s\\S])*?<div>)[\\s\\S]*?</table>");
Matcher matcher = pattern.matcher(test);
if( matcher.find())
    System.out.println("Teste " + matcher.group());
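To collect every matching table rather than only the first, find() can be called in a loop. The sketch below is a hedged illustration: the class name TableRegexDemo, the helper allMatches, and the four-table input are made up here. It compares the greedy pattern from the question with the lookahead pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TableRegexDemo {

    // Pattern from the question: each .* is greedy, so the match stretches
    // from the first <table to the last </table> in the input.
    static final Pattern GREEDY =
            Pattern.compile("<table.*<div.*</div>.*</table>", Pattern.DOTALL);

    // Pattern from this answer: the lookahead must find <div> before </table>.
    static final Pattern WITH_LOOKAHEAD = Pattern.compile(
            "<table>(?=(?:(?!</table>)[\\s\\S])*?<div>)[\\s\\S]*?</table>");

    // Collects every non-overlapping match of the pattern in the input.
    static List<String> allMatches(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String test = "<table>Normal table</table>"
                + "<table><tr><td><div>a</div></td></tr></table>"
                + "<table>Another normal table</table>"
                + "<table><tr><td><div>b</div></td></tr></table>";

        // One match covering the whole string.
        System.out.println(allMatches(GREEDY, test).size());
        // Two matches, one per table that contains a div.
        for (String table : allMatches(WITH_LOOKAHEAD, test)) {
            System.out.println(table);
        }
    }
}
```

This still assumes flat, attribute-free tables; nested tables or tags such as <table class="x"> would need a different pattern or a real parser.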

edited May 6, 2015 at 19:38

answered May 6, 2015 at 19:24
