I have HTML inside a Java string. The string contains many tables, and some of them have div tags inside. I'm trying to extract the tables that contain div tags using a regex, but I'm having difficulty with it.

Example string:

<table>
  Normal table
</table>

<table>   <--- I want to get this table
  <tr>
    <td>
      <div> 
        ...
      </div>
    </td>
  </tr>
  ...
</table>

I tried <table.*<div.*</div>.*</table> as the regex, but it gives me the whole string and not just the second table. I also tried something like <table(.^(</table>))*<div.*</div>.*</table>, but it doesn't work :(

**** EDIT **** A simple code example:

    String test = "<table>Normal table</table><table>   <--- I want to get this table<tr>" +
                  "<td><div>...</div></td></tr>...</table>";

    Pattern pattern = Pattern.compile("<table.*<div.*</div>.*</table>", Pattern.DOTALL);
    Matcher matcher = pattern.matcher(test);
    if (matcher.find())
        System.out.println("Teste " + matcher.group());

  • You have a problem, so you've decided to use regexes. Now you have two problems. Commented May 6, 2015 at 19:02
  • Please don't do this - parse the HTML instead. Commented May 6, 2015 at 19:19

3 Answers

How about using XPath? This should work:

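// Requires the HtmlCleaner library (org.htmlcleaner) on the classpath.
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class TableParse {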

    private static final String HTML = "<table>\n"+
            "  Normal table\n"+
            "</table>\n"+
            "\n"+
            "<table> \n"+
            "  <tr>\n"+
            "    <td>\n"+
            "      <div> \n"+
            "        ...\n"+
            "      </div>\n"+
            "    </td>\n"+
            "  </tr>\n"+
            "</table>";

    public static void main(String[] args) throws Exception {
        xpath();
    }

    public static void xpath() throws Exception {
        TagNode tagNode = new HtmlCleaner().clean(HTML);
        Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);

        XPath xpath = XPathFactory.newInstance().newXPath();
        Node tableNode = (Node) xpath.evaluate("//table[.//div]", doc, XPathConstants.NODE);

        StringWriter writer = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(tableNode), new StreamResult(writer));
        String xml = writer.toString();

        System.out.println(xml);
    }

}
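
The code above asks for XPathConstants.NODE, so it returns only the first matching table. If several tables contain a div, XPathConstants.NODESET returns all of them. Here is a minimal sketch of that variation. To stay self-contained it parses with the JDK's DocumentBuilder instead of HtmlCleaner, which assumes the markup is well-formed XML (real-world HTML often is not). The <root> wrapper, the id attributes and the class and method names are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AllTablesWithDiv {

    // Returns the (hypothetical) id attribute of every <table> that has a <div> descendant.
    static List<String> idsOfTablesWithDiv(String wellFormedMarkup) {
        try {
            // Assumption: the markup is well-formed XML, so the JDK parser accepts it.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));

            // NODESET returns every match, not just the first one.
            NodeList tables = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//table[.//div]", doc, XPathConstants.NODESET);

            List<String> ids = new ArrayList<>();
            for (int i = 0; i < tables.getLength(); i++) {
                ids.add(((Element) tables.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<root>"
                + "<table id=\"a\">Normal table</table>"
                + "<table id=\"b\"><tr><td><div>...</div></td></tr></table>"
                + "<table id=\"c\"><tr><td><div>...</div></td></tr></table>"
                + "</root>";
        System.out.println(idsOfTablesWithDiv(markup));
    }
}
```

With HtmlCleaner in place of DocumentBuilder, the same NODESET evaluation should apply to the Document produced by DomSerializer.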

Regular expressions are meant to match regular languages, which are defined by regular grammars. HTML is not defined by a regular grammar, so please do not use regex to parse HTML.

There are a lot of good and simple HTML parsers for Java; have a look at them. JSoup is a good starting point.
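
Nesting is one place where this shows up in practice. The sketch below is only an illustration: the lazy pattern and the nested input are made up for the example, not taken from the question. The match stops at the first closing tag it finds, which belongs to the inner table, so the outer table is cut short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedTableDemo {

    // Returns the first lazy "<table>...</table>" match, or null if there is none.
    static String firstTableMatch(String html) {
        Matcher m = Pattern.compile("<table>.*?</table>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical input: one table nested inside another.
        String html = "<table><tr><td><table>inner</table></td></tr></table>";
        // The match ends at the inner table's closing tag, so the outer table is truncated.
        System.out.println(firstTableMatch(html));
    }
}
```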

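If you still want to use regex for your task even after reading the comments, you can use the following:

<table>(?=(?:(?!</table>)[\s\S])*?<div>)[\s\S]*?</table>

Explanation:

  • After <table>, the lookahead (?=...) checks that a <div> tag follows, while the inner (?!</table>) stops it from scanning past the </table> (table end) tag, so the <div> must belong to the same table.
  • The pattern only matches <table> and <div> tags written without attributes.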

Java code:

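import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TableRegex {
    public static void main(String[] args) {
        String test = "<table>Normal table</table><table>   <--- I want to get this table<tr>" +
                      "<td><div>...</div></td></tr>...</table>";
        // <table>, then a lookahead requiring <div> before the next </table>
        Pattern pattern = Pattern.compile("<table>(?=(?:(?!</table>)[\\s\\S])*?<div>)[\\s\\S]*?</table>");
        Matcher matcher = pattern.matcher(test);
        if (matcher.find())
            System.out.println("Teste " + matcher.group());
    }
}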
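
The snippet above prints only the first match. If the string may hold several tables with a div, calling find() in a loop collects all of them. Here is a rough sketch using the same pattern. The class name, the method name and the sample input are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TablesWithDivRegex {

    // Same pattern as above: <table>, then a lookahead requiring <div> before the next </table>.
    private static final Pattern TABLE_WITH_DIV = Pattern.compile(
            "<table>(?=(?:(?!</table>)[\\s\\S])*?<div>)[\\s\\S]*?</table>");

    // Collects every non-nested, attribute-less table that contains a <div>.
    static List<String> findTablesWithDiv(String html) {
        List<String> tables = new ArrayList<>();
        Matcher matcher = TABLE_WITH_DIV.matcher(html);
        while (matcher.find()) {
            tables.add(matcher.group());
        }
        return tables;
    }

    public static void main(String[] args) {
        // Hypothetical input with four tables, two of which contain a div.
        String html = "<table>Normal table</table>"
                + "<table><tr><td><div>a</div></td></tr></table>"
                + "<table>plain</table>"
                + "<table><div>b</div></table>";
        for (String table : findTablesWithDiv(html)) {
            System.out.println(table);
        }
    }
}
```

As with the single-match version, this still assumes the tables are not nested and carry no attributes.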
