I'm loading text files into my db and trying to do some quick matching between a table that lists names of organizations, and a table that holds the text file and potential matches to those organizations.

I load the file using LOAD DATA CONCURRENT INFILE and don't have any problems with that.

The twist comes from the fact that the field I'm trying to match in the raw text table (occupationoraffiliation) has more than just organization names in it. So I'm trying to use LIKE with wildcards to match the strings.

To match the text, I'm trying to use this query:

UPDATE raw_faca JOIN orgs AS o
    ON raw_faca.org_id IS NULL AND raw_faca.occupationoraffiliation LIKE CONCAT('%',o.org_name,'%')
SET raw_faca.org_id = o.org_id;

I've also tried without CONCAT:

UPDATE raw_faca JOIN orgs AS o
    ON raw_faca.org_id IS NULL AND raw_faca.occupationoraffiliation LIKE ('%' + o.org_name + '%')
SET raw_faca.org_id = o.org_id;

The raw_faca table has ~40,000 rows and the orgs table has ~20,000 rows. I have indexes on all the fields involved. The query has been running for a couple of hours or so, which seems like way too long for the operation. Is the comparison I'm trying to run just that inefficient, or am I doing something spectacularly stupid here? I was hoping to avoid going line-by-line with an external PHP or Python script.

In response to comments below about using MATCH ... AGAINST, I've tried the following query as well:

UPDATE raw_faca JOIN orgs AS o ON raw_faca.org_id IS NULL AND MATCH(raw_faca.occupationoraffiliation) AGAINST (o.org_name IN NATURAL LANGUAGE MODE)
SET raw_faca.org_id = o.org_id; 

And it's giving me this error:

incorrect arguments to AGAINST

Any thoughts?

1 Answer

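A LIKE pattern with a leading wildcard cannot take advantage of any index, because an index can only narrow the search when the start of the string is known. MySQL therefore has to evaluate the LIKE against every pair of rows the join produces.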
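If you want to check this on your own data, you can run EXPLAIN on the SELECT equivalent of the UPDATE. This is only a sketch: the table and column names come from the question, the alias r is mine, and the exact plan depends on your MySQL version and indexes, so no expected output is shown.

```sql
-- Sketch: show the execution plan for the SELECT form of the UPDATE.
-- Table and column names are taken from the question; the alias r is
-- introduced here for readability.
EXPLAIN
SELECT r.occupationoraffiliation, o.org_id
FROM raw_faca AS r
JOIN orgs AS o
  -- The leading '%' means the pattern has no fixed prefix to look up.
  ON r.occupationoraffiliation LIKE CONCAT('%', o.org_name, '%')
WHERE r.org_id IS NULL;
```

If the type column shows ALL for orgs, every candidate row of raw_faca is being compared against a full scan of orgs, which would be consistent with the run time you are seeing.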

6 Comments

In other words, @tchaymore, your query has to examine 40k * 20k = 800M combinations which explains why it is so slow.
Got it -- I have been using external scripts, which face basically the same problem. Any ideas on how to do matching within a field, like I'm trying to do here, in a way that would take advantage of indexes?
@tchaymore: If this is a MyISAM table, you could look into setting up a full-text index.
I have a full-text index on the occupationoraffiliation field. Should I also put one on the two org_id fields? Right now they just have regular indexes.
@tchaymore: Then you should be able to convert your LIKE into something along the lines of ...AND MATCH(raw_faca.occupationoraffiliation) AGAINST (o.org_name IN NATURAL LANGUAGE MODE)...