Optimized Algorithm for matching between two (unsorted) data tables?

Question

I am working on some functionality in my application that queries our database and pulls the data to one datatable, then opens an excel file and populates another datatable.

Because the excel file contains no usable ID, I cannot sort the data, and probably cannot use DataTable.Merge().

Here is the code for the matching algorithm I have created.

 private void RunMatchingAlgorithm()
    {
        // Initialize variables
        string partNumber = "";
        DateTime expiration_date = DateTime.Now;
        decimal contract_cost = 0;
        string contract_no = "";

        string partNumber2 = "";
        DateTime expiration_date2 = DateTime.Now;
        decimal contract_cost2 = 0;
        string contract_no2 = "";

        //Get values from DataBase
        for (int i = 0; i < dtFromTableContracts.Rows.Count; i++)
        {
            partNumber2 = dtFromTableContracts.Rows[i]["supplier_part_no"].ToString();
            contract_no2 = dtFromTableContracts.Rows[i]["contract_no"].
            expiration_date2 = Convert.ToDateTime(dtFromTableContracts.Rows[i]["con_end_date"]).Date;

            //Get Values from converted Excel table
            for (int j = 0; j < dtConversion.Rows.Count; j++)
            {
                contract_no = dtConversion.Rows[j]["vend_contract_no"].ToString();

                //If we have even a partial match, check for a part number match
                if (contract_no2.StartsWith(contract_no))
                {
                    partNumber = dtConversion.Rows[j]["vend_item_id"].ToString();

                    //If the values match, populate from both tables
                    if (partNumber == partNumber2)
                    {
                        dtConversion.Rows[j]["wpd_expiration_date"] = expiration_date2.Date;
                        dtConversion.Rows[j]["wpd_cont_cost"] = dtFromTableContracts.Rows[i]["contract_cost"];
                        dtConversion.Rows[j]["wpd_contract_no"] = dtFromTableContracts.Rows[i]["contract_no"];
                        dtConversion.Rows[j]["wpd_item_id"] = dtFromTableContracts.Rows[i]["supplier_part_no"];
                        dtConversion.Rows[j]["wpd_item_no"] = dtFromTableContracts.Rows[i]["item_id"];
                        dtConversion.Rows[j]["discontinued"] = dtFromTableContracts.Rows[i]["discontinued"];
                        dtConversion.Rows[j]["job_no"] = dtFromTableContracts.Rows[i]["job_no"];
                    }
                }
            }
        }
    }

If you're curious, a later method removes any unmatched lines and we display only the matched records in a DGV.

This currently works as expected but if my Big O notation is correct, i'm dealing with O(m*n) which gets quite slow with larger data sets, and is extremely processor intensive.

I am looking for a more efficient way to accomplish this than looping over every single row as some of the excel spreadsheets we work with are close to 40,000 rows. This algorithm takes about 6 minutes to complete with that size of a set.

Of course you can sort - just sort on all the fields rather than just one. — Mark Ransom
– Mark Ransom, Commented Mar 4, 2013 at 19:19
@MarkRansom this would still produce slower than linear algorithm as sorting itself can be slow — Sten Petrov
– Sten Petrov, Commented Mar 4, 2013 at 19:23
Nobody is asking for a linear algorithm. But using a sorted structure to find matching contract_no is faster than this approach. Note: Sorting is O(n log n) — alzaimar
– alzaimar, Commented Mar 4, 2013 at 19:33

Craig Gidney · Accepted Answer · 2013-03-04 19:37:29Z

Oh boy, code with lots of opportunities for simplification. You can reduce the scope of local variables, removing any temptation to assign them unused values. You can also convert For loops to ForEach loops when you don't use the index except to access a collection.

Initial simplification:

private void RunMatchingAlgorithm() {
    foreach (var databaseRow in dtFromTableContracts.Rows) {
        string partNumber2 = databaseRow["supplier_part_no"].ToString();
        string contract_no2 = databaseRow["contract_no"].ToString();
        DateTime expiration_date2 = Convert.ToDateTime(databaseRow["con_end_date"]).Date;

        foreach (var excelRow in dtConversion.Rows) {
            string contract_no = excelRow["vend_contract_no"].ToString();

            //If we have even a partial match, check for a part number match
            if (contract_no2.StartsWith(contract_no)) {
                string partNumber = excelRow["vend_item_id"].ToString();

                //If the values match, populate from both tables
                if (partNumber == partNumber2) {
                    excelRow["wpd_expiration_date"] = expiration_date2.Date;
                    excelRow["wpd_cont_cost"] = databaseRow["contract_cost"];
                    excelRow["wpd_contract_no"] = databaseRow["contract_no"];
                    excelRow["wpd_item_id"] = databaseRow["supplier_part_no"];
                    excelRow["wpd_item_no"] = databaseRow["item_id"];
                    excelRow["discontinued"] = databaseRow["discontinued"];
                    excelRow["job_no"] = databaseRow["job_no"];
                }
            }
        }
    }
}

Come to think of it, this is pretty much the exact case linq queries were designed for. We can transform most of the code into a query:

private void RunMatchingAlgorithm() {
    var matches = from databaseRow in dtFromTableContracts.Rows
                  let partNumber2 = databaseRow["supplier_part_no"].ToString()
                  let contract_no2 = databaseRow["contract_no"].ToString()
                  let expiration_date2 = Convert.ToDateTime(databaseRow["con_end_date"]).Date
                  from excelRow in dtConversion.Rows
                  let contract_no = excelRow["vend_contract_no"].ToString()
                  where contract_no2.StartsWith(contract_no)
                  let partNumber = excelRow["vend_item_id"].ToString()
                  where partNumber == partNumber2
                  select new { databaseRow, excelRow, expiration_date2 }
    foreach (var m in matches) {
        var dst = m.excelRow;
        var src = m.databaseRow;

        dst["wpd_expiration_date"] = m.expiration_date2.Date;
        dst["wpd_cont_cost"] = src["contract_cost"];
        dst["wpd_contract_no"] = src["contract_no"];
        dst["wpd_item_id"] = src["supplier_part_no"];
        dst["wpd_item_no"] = src["item_id"];
        dst["discontinued"] = src["discontinued"];
        dst["job_no"] = src["job_no"];
    }
}

and now I see where an optimization can be applied. We're doing a nested 'from' with a 'where', and that's equivalent to a cross-join. Also, we can cut most of the now-only-used-once temporaries:

private void RunMatchingAlgorithm() {
    var matches = from databaseRow in dtFromTableContracts.Rows
                  join excelRow in dtConversion.Rows
                  on excelRow["vend_item_id"].ToString() equals databaseRow["supplier_part_no"].ToString()
                  where databaseRow["contract_no"].ToString().StartsWith(excelRow["vend_contract_no"].ToString())
                  select new { databaseRow, excelRow }
    foreach (var m in matches) {
        var dst = m.excelRow;
        var src = m.databaseRow;

        dst["wpd_expiration_date"] = Convert.ToDateTime(src["con_end_date"]).Date;
        dst["wpd_cont_cost"] = src["contract_cost"];
        dst["wpd_contract_no"] = src["contract_no"];
        dst["wpd_item_id"] = src["supplier_part_no"];
        dst["wpd_item_no"] = src["item_id"];
        dst["discontinued"] = src["discontinued"];
        dst["job_no"] = src["job_no"];
    }
}

I actually haven't used cross-joins much, but I assume they use a hash table under the hood to have O(n+m) complexity instead of O(n*m). If both tables were in the database, then the database could take advantage of already-constructed hash tables / indexes.

You might also want to consider some sort of generated Linq2SQL class, so you can have type-safe access to row fields.

Thank you very much, I can see now how I over complicated this. After implementation it certainly returns the desired results faster and more efficiently.

Sten Petrov · Accepted Answer · 2013-03-04 19:24:48Z

0

You can hash the values you want to match and have O(m+n) complexity.

You'll have to make a function that parses both sides and creates a pair (or set) of uniform results that can then be hashed. For example if your contract_no has some known-format prefix on one side and no prefix on the other you can remove the prefix and hash that.

answered Mar 4, 2013 at 19:24

Sten Petrov

11.1k1 gold badge45 silver badges63 bronze badges

Comments

Jim Mischel · Accepted Answer · 2013-03-04 19:25:51Z

You should be able to cut the time approximately in half by exiting the inner loop after finding a matching part number. That is, put a break as the last statement in the code that's executed when the part numbers match.

Still, half of forever is still forever.

With just 40,000 rows, you could easily populate a dictionary that contains the contract number and part number as the key, and the conversion table row index as the value. Something like:

Dictionary<string, int> conversionLookup = new Dictionary<string, int>();
for (int i = 0; i < dtConversion.Rows.Count; ++j)
{
    conversionLookup.Add(string.Format("{0}:{1}", 
        dtConversion.Rows[j]["vend_contract_no"].ToString(),
        dtConversion.Rows[j]["vend_item_id"].ToString()), j);
}

Then, your outer loop becomes:

for (int i = 0; i < dtFromTableContracts.Rows.Count; i++)
{
    partNumber2 = dtFromTableContracts.Rows[i]["supplier_part_no"].ToString();
    contract_no2 = dtFromTableContracts.Rows[i]["contract_no"].ToString();
    string lookup = string.Format("{0}:{1}", contract_no2, partNumber2);
    int ix;
    if (conversionLookup(lookup, out ix))
    {
        // update dtConversion.Rows[ix]
    }
}

I'm not familiar with the restrictions on your part numbers (you use StartsWith rather than equals in your inner loop), so you might have to adjust the indexes somewhat.

That should be very fast.

If you can't create keys that use the contract number for a direct lookup, you could do something quite similar by creating a List with the contract numbers and part numbers, and binary search it. That would be O(m log n) rather than O(m * n). Still quite a bit faster.

alzaimar · Accepted Answer · 2013-03-04 19:35:37Z

0

A Trie would be the fastest structure for finding substrings.

answered Mar 4, 2013 at 19:35

alzaimar

4,6321 gold badge22 silver badges31 bronze badges

Comments

Pieter Geerkens · Accepted Answer · 2013-03-04 19:36:23Z

0

If ContractTable.ContractNo is of known constant length, then you have (de facto) a PK-FK relationship between the two tables on:

ContractTable.ContractNo = substring(Conversion.VendContractNo,1,K)
ContractTable.PartNumber = Conversion.PartNumber

where K == Length(ContractTabel.ContractNo).

Indexing both tables on this key structure will allow matches to be made in O(Log N) + O(Log M) time, with the index creation taking O(N * Log N) + O(M * Log M) time. Hashing could improve this even more, depending on the construction of a good hash.

answered Mar 4, 2013 at 19:36

Pieter Geerkens

11.9k2 gold badges35 silver badges54 bronze badges

Collectives™ on Stack Overflow

Optimized Algorithm for matching between two (unsorted) data tables?

5 Answers 5

1 Comment

Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

5 Answers 5

1 Comment

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related