I am scraping a table that will ultimately be exported to CSV. There are several cases I may need to consider, such as nested tables and spanned rows/cells, but for now I'm going to ignore those and assume I have a very simple table. By "simple" I mean we just have rows and cells, possibly with an unequal number of cells per row, but still fairly basic in structure.

<table>
  <tr>
    <td>text </td>
    <td>text </td>
  </tr>
  <tr>
    <td>text </td>
  </tr>
</table>

My approach is to simply iterate over the rows and the cells in each row:

String[] rowTxt;
WebElement table = driver.findElement(By.xpath(someLocator));
for (WebElement rowElmt : table.findElements(By.tagName("tr")))
{
    List<WebElement> cols = rowElmt.findElements(By.tagName("td"));
    rowTxt = new String[cols.size()];
    for (int i = 0; i < rowTxt.length; i++)
    {
        rowTxt[i] = cols.get(i).getText();
    }
}

However, this is quite slow. For a table with 218 rows (one CSV line per row), each row having no more than 5 columns, it took 45 seconds to scrape the table.

I tried to avoid iterating over each cell by calling getText on the row element, hoping that the output would be delimited by something, but it wasn't.

Is there a better way to scrape a table?

  • Alternatively, I may consider using Selenium to get the page source and then using Jsoup to do the actual HTML parsing, since I liked Jsoup's performance. Commented Jan 20, 2014 at 21:18

3 Answers

Rather than using Selenium to parse the HTML, I use Jsoup. While Selenium provides functionality for traversing a table, Jsoup is much more efficient. I've decided to use Selenium only for webpage automation and to delegate all parsing tasks to Jsoup.

My approach is as follows:

  1. Get the HTML source for the required element.
  2. Pass that to Jsoup as a string to parse.

The code that I ended up writing was very similar to the Selenium version:

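// Wrap the element's innerHTML in <table> tags so Jsoup sees a complete table.
String source = "<table>" + driver.findElement(By.xpath(locator)).getAttribute("innerHTML") + "</table>";
// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted.
Document doc = Jsoup.parse(source);
List<String[]> csv = new ArrayList<>();
for (Element rowElmt : doc.getElementsByTag("tr"))
{
    // Header rows use <th>; fall back to <td> for ordinary data rows.
    Elements cols = rowElmt.getElementsByTag("th");
    if (cols.isEmpty())
        cols = rowElmt.getElementsByTag("td");

    String[] rowTxt = new String[cols.size()];
    for (int i = 0; i < rowTxt.length; i++)
    {
        rowTxt[i] = cols.get(i).text();
    }
    csv.add(rowTxt);
}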

The Selenium parser takes 5 minutes to read a 1000 row table, while the Jsoup parser takes less than 10 seconds. While I did not spend much time on benchmarking, I am pretty satisfied with the results.
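
Since the rows collected in csv are ultimately meant for a CSV file, here is a minimal sketch of that last step in plain Java. The CsvRows class and its escape and toCsv methods are hypothetical names, not part of the original code, and padding short rows with empty cells is an assumption made so that rows with unequal cell counts still line up:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not from the original answer): turns scraped rows into CSV text. */
final class CsvRows {

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        // Embedded double quotes are doubled, as is conventional for CSV.
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Joins rows into CSV text; short rows are padded with empty cells (an assumption). */
    static String toCsv(List<String[]> rows) {
        int width = 0;
        for (String[] row : rows) {
            width = Math.max(width, row.length);
        }
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < width; i++) {
                if (i > 0) {
                    out.append(',');
                }
                out.append(i < row.length ? escape(row[i]) : "");
            }
            out.append("\r\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String[]> csv = new ArrayList<>();
        csv.add(new String[] {"name", "note"});
        csv.add(new String[] {"a, b"});
        System.out.print(toCsv(csv));
    }
}
```

A field is wrapped in double quotes only when it contains a comma, a quote, or a line break, which is the usual RFC 4180 convention. If ragged rows should stay ragged, the padding loop can simply stop at row.length instead of width.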

2 Comments

Does it provide any feature for logging in to a site, like Selenium does?
One more thing: the website I connect to can't be reached just by navigating to the URL. A link has to be clicked; sending the URL directly just takes you to the wrong page.

It most definitely is slow, no matter whether you use XPath, id or CSS to do your location. That said, if you were to use the Page Object pattern, you could make use of the @CacheLookup annotation. From the source:

  • By default, the element or the list is looked up each and every time a method is called upon it.
  • To change this behaviour, simply annotate the field with the {@link CacheLookup}.

I ran a test using a table of 100 rows and 6 columns; the test queried the text of each and every td element. Without @CacheLookup (the element was located by XPath, as in your case), it took approx. 40 sec. Using cache lookup, it dropped to approx. 20 sec, but that is still too much.
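
Timings like these can be collected with a small wrapper around System.nanoTime. The sketch below is only illustrative: ScrapeTimer, Timed and time are hypothetical names, and the supplier stands in for whatever scraping routine is being measured:

```java
import java.util.function.Supplier;

/** Hypothetical timing helper; the names are illustrative, not from the answers. */
final class ScrapeTimer {

    /** Holds the value a task produced and the wall-clock time it took. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Runs the task once and measures the elapsed time with System.nanoTime. */
    static <T> Timed<T> time(Supplier<T> task) {
        long start = System.nanoTime();
        T value = task.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in workload; a real run would call the table-scraping method here.
        Timed<Integer> result = time(() -> {
            int cells = 0;
            for (int row = 0; row < 100; row++) {
                cells += 6;
            }
            return cells;
        });
        System.out.println(result.value + " cells in " + result.elapsedMillis + " ms");
    }
}
```

A single run like this is only a rough indication; browser start-up and page load should be kept outside the timed task so that only the table traversal is measured.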

Anyway, if you dropped the Firefox driver and ran your tests headless (using HtmlUnit), the speed would increase drastically. Running the same test headless, the times were between 100 and 200 ms, so it could be even faster than Jsoup.

You can check/try my test code here.

1 Comment

I'll have to see whether HtmlUnitDriver supports the site I'm using it on, since I have had a number of javascript-related issues that I had not figured out how to get around. So I went with a browser to handle the javascript for me.

I'm using HtmlAgilityPack, installed as a NuGet package, to parse dynamic HTML tables. It's very fast, and as per this answer you can query the results using LINQ. I've used this to store the result as a DataTable. Here's the public extension method class:

public static class HtmlTableExtensions
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(HtmlTableExtensions));

    /// <summary>
    ///     based on an idea from https://stackoverflow.com/questions/655603/html-agility-pack-parsing-tables
    /// </summary>
    /// <param name="tableBy"></param>
    /// <param name="driver"></param>
    /// <returns></returns>
    public static HtmlTableData GetTableData(this By tableBy, IWebdriverCore driver)
    {
        try
        {
            var doc = tableBy.GetTableHtmlAsDoc(driver);
            var columns = doc.GetHtmlColumnNames();
            return doc.GetHtmlTableCellData(columns);
        }
        catch (Exception e)
        {
            Log.Warn(String.Format("unable to get table data from {0} using driver {1} ",tableBy ,driver),e);
            return null;
        }
    }

    /// <summary>
    ///     Take an HtmlTableData object and convert it into an untyped data table,
    ///     assume that the row key is the sole primary key for the table,
    ///     and the key in each of the rows is the column header
    ///     Hopefully this will make more sense when it's written!
    ///     Expecting overloads for switching column and headers,
    ///     multiple primary keys, non-standard format html tables etc
    /// </summary>
    /// <param name="htmlTableData"></param>
    /// <param name="primaryKey"></param>
    /// <param name="tableName"></param>
    /// <returns></returns>
    public static DataTable ConvertHtmlTableDataToDataTable(this HtmlTableData htmlTableData,
        string primaryKey = null, string tableName = null)
    {
        if (htmlTableData == null) return null;
        var table = new DataTable(tableName);

        foreach (var colName in htmlTableData.Values.First().Keys)
        {
            table.Columns.Add(new DataColumn(colName, typeof (string)));
        }
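        // Only set a primary key when one was supplied; the parameter defaults to null.
        if (primaryKey != null) table.SetPrimaryKey(new[] { primaryKey });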
        foreach (var values in htmlTableData
            .Select(row => row.Value.Values.ToArray<object>()))
        {
            table.Rows.Add(values);
        }

        return table;
    }


    private static HtmlTableData GetHtmlTableCellData(this HtmlDocument doc, IReadOnlyList<string> columns)
    {
        var data = new HtmlTableData();
        foreach (
            var rowData in doc.DocumentNode.SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableRow)
                .Skip(1)
                .Select(row => row.SelectNodes(HtmlAttributes.TableCell)
                    .Select(n => WebUtility.HtmlDecode(n.InnerText)).ToList()))
        {
            data[rowData.First()] = new Dictionary<string, string>();
            for (var i = 0; i < columns.Count; i++)
            {
                data[rowData.First()].Add(columns[i], rowData[i]);
            }
        }
        return data;
    }

    private static List<string> GetHtmlColumnNames(this HtmlDocument doc)
    {
        var columns =
            doc.DocumentNode.SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableRow)
                .First()
                .SelectNodes(XmlExpressions.AllDescendants + HtmlAttributes.TableHeader)
                .Select(n => WebUtility.HtmlDecode(n.InnerText).Trim())
                .ToList();
        return columns;
    }

    private static HtmlDocument GetTableHtmlAsDoc(this By tableBy, IWebdriverCore driver)
    {
        var webTable = driver.FindElement(tableBy);
        var doc = new HtmlDocument();
        doc.LoadHtml(webTable.GetAttribute(HtmlAttributes.InnerHtml));
        return doc;
    }
}

The HtmlTableData object is just an extension of Dictionary:

public class HtmlTableData : Dictionary<string,Dictionary<string,string>>
{

}
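
For readers following along in Java (the language of the question), the same shape can be sketched with nested maps. TableData and fromRows are hypothetical names; the sketch assumes the first row holds the column headers and, unlike the C# code above, fills missing trailing cells with empty strings instead of failing:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical Java counterpart of HtmlTableData: row key -> (column name -> cell text). */
final class TableData {

    /** Treats the first row as headers and keys every later row by its first cell. */
    static Map<String, Map<String, String>> fromRows(List<List<String>> rows) {
        Map<String, Map<String, String>> data = new LinkedHashMap<>();
        if (rows.isEmpty()) {
            return data;
        }
        List<String> columns = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            if (row.isEmpty()) {
                continue; // nothing to use as a row key
            }
            Map<String, String> cells = new LinkedHashMap<>();
            for (int i = 0; i < columns.size(); i++) {
                // Assumption: a row shorter than the header gets empty strings.
                cells.put(columns.get(i), i < row.size() ? row.get(i) : "");
            }
            data.put(row.get(0), cells);
        }
        return data;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("1", "Ann"),
                Arrays.asList("2"));
        System.out.println(fromRows(rows));
    }
}
```

LinkedHashMap keeps rows and columns in table order, which matters if the data is later written back out. As in the C# version, rows that share the same first cell overwrite one another.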

IWebdriverCore driver is a wrapper on IWebDriver or IRemoteWebdriver which exposes either of these interfaces as a readonly property, but you could just replace this with IWebDriver.

HtmlAttributes is a static class holding const values for common HTML attributes, to save on typos when referring to HTML elements/attributes/tags etc. in C# code:

/// <summary>
/// config class holding common Html Attributes and tag names etc
/// </summary>
public static class HtmlAttributes
{
    public const string InnerHtml = "innerHTML";
    public const string TableRow = "tr";
    public const string TableHeader = "th";
    public const string TableCell = "th|td";
    public const string Class = "class";

... }

and SetPrimaryKey is an extension of DataTable which allows easy setting of the primary key for a datatable:-

    public static void SetPrimaryKey(this DataTable table,string[] primaryKeyColumns)
    {
        int size = primaryKeyColumns.Length;
        var keyColumns = new DataColumn[size];
        for (int i = 0; i < size; i++)
        {
            keyColumns[i] = table.Columns[primaryKeyColumns[i]];
        }
        table.PrimaryKey = keyColumns;
    }

I found this to be pretty performant (< 2 ms to parse a 30*80 table), and it's a doddle to use.

2 Comments

Can this provide a login feature for a website, like Selenium does?
One more thing: the website I connect to can't be reached just by navigating to the URL. A link has to be clicked; sending the URL directly just takes you to the wrong page.
