3


Let's say I copy a complete HTML table (where each and every tr and td has extra attributes) into a String. How can I take all the contents (whatever is between the tags) and create a 2D array that is organized like the original table?

For example for this table:

<table border="1">
    <tr align= "center">
        <td align="char">TD1</td>
        <td>td1</td>
        <td align="char">TD1</td>
        <td>td1</td>
    </tr>
    <tr>
        <td>TD2</td>
        <td>tD2</td>
        <td class="bold>Td2</td>
        <td>td2</td>
    </tr>
</table>

I want this array: [image: the cell contents laid out as a 2D array, one row per tr]

PS: I know I can use regex, but it would be extremely complicated. I want a tool like JSoup that can do all the work automatically, without writing much code.

3
  • If the HTML is valid, you can use a SAX XML parser or HTMLCleaner (htmlcleaner.sourceforge.net). There are also many other libraries that help parse HTML; just check this list: java-source.net/open-source/html-parsers Commented Aug 15, 2012 at 10:41
  • Are you actually asking for an algorithm that parses your table string into a data array? Commented Aug 15, 2012 at 10:41
  • I have just added that I want a simple tool like JSoup that does the work automatically, without much code writing or analysis. Commented Aug 15, 2012 at 10:42
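
To illustrate the SAX suggestion from the comments above, here is a minimal sketch using only the JDK's built-in parser. It assumes the markup is well-formed XML; the sample table as posted would be rejected because of its unclosed attribute quote. The class name SaxTableSketch and the method parse are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class SaxTableSketch {

    // Parses well-formed table markup into rows of trimmed cell text.
    static List<List<String>> parse(String xml) {
        List<List<String>> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder cell; // non-null only while inside a <td>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equalsIgnoreCase("tr")) {
                    rows.add(new ArrayList<>());
                } else if (qName.equalsIgnoreCase("td")) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (cell != null) {
                    cell.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equalsIgnoreCase("td") && cell != null) {
                    if (rows.isEmpty()) {
                        rows.add(new ArrayList<>()); // tolerate a <td> outside any <tr>
                    }
                    rows.get(rows.size() - 1).add(cell.toString().trim());
                    cell = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Malformed markup ends up here; real-world HTML often does.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<table border=\"1\">"
                + "<tr align=\"center\"><td align=\"char\">TD1</td><td>td1</td></tr>"
                + "<tr><td>TD2</td><td class=\"bold\">Td2</td></tr>"
                + "</table>";
        System.out.println(parse(xml)); // prints [[TD1, td1], [TD2, Td2]]
    }
}
```

Because SAX stops at the first well-formedness error, this only suits markup you control; for HTML copied from arbitrary pages, a lenient HTML parser is the more forgiving choice.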

5 Answers

13

This is how it could be done using JSoup (seriously, don't use regex for HTML).

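import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// html is the String holding the copied table markup
Document doc = Jsoup.parse(html);
Elements tables = doc.select("table");
for (Element table : tables) {
    Elements trs = table.select("tr");
    // one (possibly ragged) row per <tr>
    String[][] trtd = new String[trs.size()][];
    for (int i = 0; i < trs.size(); i++) {
        // only <td> cells are collected; use "td, th" to include header cells
        Elements tds = trs.get(i).select("td");
        trtd[i] = new String[tds.size()];
        for (int j = 0; j < tds.size(); j++) {
            trtd[i][j] = tds.get(j).text();
        }
    }
    // trtd now contains the desired array for this table
}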

Also, the class attribute value is not closed properly here in your example:

<td class="bold>Td2</td>

it should be

<td class="bold">Td2</td>

1 Comment

What if the HTML table has colspan and rowspan attributes?
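
One possible way to handle spans, sketched under assumptions: suppose each cell's text, rowspan and colspan have already been read (with JSoup, for instance, via attr("rowspan") and attr("colspan") on each td element). The cells can then be placed on a grid, copying a spanning cell's text into every position it covers. The Cell class and toGrid method are hypothetical names for this illustration, and overlapping spans are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JSoup or any other library.
class SpanGridSketch {

    // One parsed cell: its text plus its span attributes (1 when absent).
    static final class Cell {
        final String text;
        final int rowspan;
        final int colspan;

        Cell(String text, int rowspan, int colspan) {
            this.text = (text == null) ? "" : text;
            this.rowspan = Math.max(1, rowspan);
            this.colspan = Math.max(1, colspan);
        }
    }

    // Expands spanning cells so that every covered grid position holds the cell's text.
    static String[][] toGrid(List<List<Cell>> rows) {
        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            while (grid.size() <= r) {
                grid.add(new ArrayList<>());
            }
            int col = 0;
            for (Cell cell : rows.get(r)) {
                // Skip positions already claimed by a rowspan from an earlier row.
                while (get(grid, r, col) != null) {
                    col++;
                }
                for (int dr = 0; dr < cell.rowspan; dr++) {
                    for (int dc = 0; dc < cell.colspan; dc++) {
                        set(grid, r + dr, col + dc, cell.text);
                    }
                }
                col += cell.colspan;
            }
        }
        // Copy into a rectangular array; positions no cell covers stay null.
        int width = 0;
        for (List<String> row : grid) {
            width = Math.max(width, row.size());
        }
        String[][] out = new String[grid.size()][width];
        for (int r = 0; r < grid.size(); r++) {
            for (int c = 0; c < grid.get(r).size(); c++) {
                out[r][c] = grid.get(r).get(c);
            }
        }
        return out;
    }

    private static String get(List<List<String>> grid, int r, int c) {
        if (r >= grid.size() || c >= grid.get(r).size()) {
            return null;
        }
        return grid.get(r).get(c);
    }

    private static void set(List<List<String>> grid, int r, int c, String value) {
        while (grid.size() <= r) {
            grid.add(new ArrayList<>());
        }
        List<String> row = grid.get(r);
        while (row.size() <= c) {
            row.add(null);
        }
        row.set(c, value);
    }

    public static void main(String[] args) {
        List<List<Cell>> rows = new ArrayList<>();
        rows.add(Arrays.asList(new Cell("A", 2, 1), new Cell("B", 1, 2)));
        rows.add(Arrays.asList(new Cell("C", 1, 1), new Cell("D", 1, 1)));
        System.out.println(Arrays.deepToString(toGrid(rows))); // prints [[A, B, B], [A, C, D]]
    }
}
```

Repeating the text in every covered position is a design choice; leaving the extra positions null or empty would work just as well if the consumer expects that.
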
5

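Maybe String.split("<whateverhtmltabletag>") can help you? (Note that split takes a regular expression.)

The StringTokenizer class can also be useful, but only for single-character separators: it treats every character of the delimiter string as a delimiter of its own, so "<br>" would also split on the letters b and r. For a multi-character separator, String.split is the safer choice. Example:

String data = "one<br>two<br>three";
for (String token : data.split("<br>")) {
    System.out.println(token);  // prints one, then two, then three
}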

Another option is to scan the string with indexOf("<tag"); there is an example here: http://forums.devshed.com/java-help-9/parse-html-table-into-2d-arrays-680614.html
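
A minimal sketch of what the indexOf approach might look like, using only the standard library. It assumes lowercase tags, no nested tables, and no '>' inside attribute values; the class and method names are invented for this example and are not taken from the linked thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating plain indexOf scanning.
class IndexOfTableSketch {

    // Returns the text between each opening <tag ...> and the next closing </tag>.
    static List<String> between(String html, String tag) {
        List<String> parts = new ArrayList<>();
        String open = "<" + tag;
        String close = "</" + tag + ">";
        int pos = 0;
        while (true) {
            int start = html.indexOf(open, pos);
            if (start < 0) {
                break;
            }
            // The content begins right after the '>' that ends the opening tag.
            int contentStart = html.indexOf('>', start);
            if (contentStart < 0) {
                break;
            }
            contentStart++;
            int end = html.indexOf(close, contentStart);
            if (end < 0) {
                break;
            }
            parts.add(html.substring(contentStart, end));
            pos = end + close.length();
        }
        return parts;
    }

    // Splits the markup into rows first, then each row into trimmed cell texts.
    static String[][] parse(String html) {
        List<String> rows = between(html, "tr");
        String[][] result = new String[rows.size()][];
        for (int i = 0; i < rows.size(); i++) {
            List<String> cells = between(rows.get(i), "td");
            result[i] = new String[cells.size()];
            for (int j = 0; j < cells.size(); j++) {
                result[i][j] = cells.get(j).trim();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<table><tr align=\"center\"><td align=\"char\">TD1</td><td>td1</td></tr>"
                + "<tr><td>TD2</td><td>tD2</td></tr></table>";
        System.out.println(Arrays.deepToString(parse(html))); // prints [[TD1, td1], [TD2, tD2]]
    }
}
```

This kind of scanning breaks easily on real-world markup, which is the main argument for a proper parser.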

You can also use an HTML parser (like jsoup) and then copy the contents from the table to an array. Here's an example in JavaScript: JavaScript to parse HTML table of numbers into an array

0

Never mind, I found this code on the internet: HtmlTableParser

It actually seems that now I have another problem, but it is not exactly related to this question, so I will open another one.

0

Here is what I have so far. It is not the best solution, but I hope it's helpful: a simple approach using plain string operations.

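import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Assumes every <td>...</td> sits on a line of its own, and so does every </tr>.
public void read_data() {
    File file = new File("_result.xml");
    // try-with-resources closes the reader even if an exception is thrown
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
        String line;
        StringBuilder output = new StringBuilder();

        while ((line = bufferedReader.readLine()) != null) {
            String trimmed = line.trim();

            if (trimmed.startsWith("<td")) {
                // cell text lies between the end of the opening tag and the start of the closing tag
                int a = trimmed.indexOf('>') + 1;
                int b = trimmed.lastIndexOf('<');
                if (b >= a) {
                    output.append(trimmed, a, b).append("|");
                }
            }

            if (trimmed.equals("</tr>")) {
                // end of row: print the collected cells and start over
                System.out.println(output);
                output.setLength(0);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}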

0

For my own needs, I found that in JavaScript the DOM already lets you address a table much like a 2D array. Consider the following code:

document.querySelector("#table").children[0].children[r].children[c].innerText

In the above, r is the row index and c is the column index, and children[0] is the table's first child (typically the tbody). The data can then be accessed just like a 2D array, using the row and column indices.

Here is yet another way, similar to the 2D-array access, but with CSS selectors:

document.querySelector("tr:nth-child(5) td:nth-child(4)")

This finds the cell in the 4th column of the 5th row; note that nth-child counts from 1, not 0.
