
I've been searching for a solution and reading books, but I haven't been able to figure this out. The question is rather simple. I have 2 tables.

table_1: "chromosome" and "position", both integers.

table_2: "chromosome", "start" and "end", all integers as well.

I want a query that returns every row from table_1 whose position lies between the start and end of a row in table_2 on the same chromosome. The query looks like this:

SELECT
    table_1.*
FROM
    table_1,
    table_2
WHERE
    table_1.chromosome = table_2.chromosome
        AND table_1.position > table_2.start
        AND table_1.position < table_2.end;

This query works fine, but my tables are large (7092713 and 215909 rows respectively). I indexed (chromosome, position) on table_1 and (chromosome, start, end) on table_2. The weird part is that if I run the queries one by one (Perl DBI, one statement for every row of table_2), it runs a lot faster. I'm not sure where I'm going wrong. Any help would be appreciated.

Jorge Kageyama

2 Answers

1

For the sake of clarity, let's start by recasting your query using the standard JOIN syntax. The query is equivalent but easier to read.

SELECT table_1.*
  FROM table_1 
  JOIN table_2 ON (     table_1.chromosome = table_2.chromosome
                    AND table_1.position > table_2.start
                    AND table_1.position < table_2.end)

Second, it's smart when searching large tables (or any tables for that matter) to avoid * in your SELECT clauses. Using * denies useful data to the optimizer about what you do, or don't, need in your result set. So let us say

SELECT table_1.chromosome, table_1.position

for SELECT.

So, it becomes clear that your result set, and your join, need chromosome and position, and nothing else, from your larger table. Try creating a compound BTREE index on that table, as follows.

CREATE INDEX idx_t1_chr_pos ON table_1 (chromosome, position) USING BTREE;  -- index name is a placeholder

Similarly, try creating an index on table_2 as follows.

CREATE INDEX idx_t2_chr_start_end ON table_2 (chromosome, start, end) USING BTREE;  -- index name is a placeholder

These are called covering indexes. They contain enough columns that the query can be satisfied from the index without having to bounce back to the original table.

BTREE indexes (the default by the way) are inherently ordered. Appropriate records in table_1 can be found by range scans on the index starting with (chromosome,start) and ending with (chromosome,end).
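To make the range-scan idea concrete, here is a minimal sketch in Python. It is not how MySQL implements its indexes: a sorted list of tuples merely stands in for the ordered (chromosome, position) index, and `range_scan` is a hypothetical helper written for this illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical miniature of an ordered (chromosome, position) index:
# a sorted list of tuples stands in for the B-tree's leaf order.
index = sorted([(1, 50), (1, 120), (1, 300), (2, 10), (2, 75), (2, 400)])

def range_scan(index, chromosome, start, end):
    """Return entries for one chromosome with start < position < end."""
    # First entry strictly after (chromosome, start).
    lo = bisect_right(index, (chromosome, start))
    # First entry at or after (chromosome, end).
    hi = bisect_left(index, (chromosome, end))
    # Everything in between is contiguous, so no other entries are visited.
    return index[lo:hi]

hits = range_scan(index, 1, 100, 350)
print(hits)  # [(1, 120), (1, 300)]
```

Two binary searches locate the boundaries, and the matching entries sit next to each other in index order. That is why a single range lookup is cheap once the compound index exists.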

Third, it's possible you're getting a massive combinatorial explosion of rows from table_1 in your result set. You'll get a row for every combination of rows in the two tables that matches your ON() clause. It's hard to know whether that's the case without knowing a lot about your data.

You could try to reduce that combinatorial explosion using

SELECT DISTINCT table_1.chromosome, table_1.position
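As a small illustration of that duplication, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows are invented, and "end" is quoted only because SQLite treats it as a keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (chromosome INTEGER, position INTEGER);
    CREATE TABLE table_2 (chromosome INTEGER, start INTEGER, "end" INTEGER);
    INSERT INTO table_1 VALUES (1, 150), (1, 900);
    -- Two overlapping ranges on chromosome 1 both contain position 150.
    INSERT INTO table_2 VALUES (1, 100, 200), (1, 120, 300);
""")

join_clause = """
    FROM table_1
    JOIN table_2 ON table_1.chromosome = table_2.chromosome
                AND table_1.position > table_2.start
                AND table_1.position < table_2."end"
"""

# Without DISTINCT, the row appears once per matching range.
plain = conn.execute(
    "SELECT table_1.chromosome, table_1.position" + join_clause).fetchall()
# With DISTINCT, the duplicates collapse to a single row.
distinct = conn.execute(
    "SELECT DISTINCT table_1.chromosome, table_1.position" + join_clause).fetchall()

print(plain)     # [(1, 150), (1, 150)]
print(distinct)  # [(1, 150)]
```

If the ranges in table_2 never overlap, both queries return the same rows and DISTINCT only adds work.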

Give this a try. If you're still not getting anywhere, maybe another question with complete table definitions and the results of EXPLAIN will be helpful.
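In MySQL you get the plan by putting EXPLAIN in front of the SELECT. The sketch below shows the same idea with Python's built-in sqlite3 module, whose equivalent is EXPLAIN QUERY PLAN. This is an assumption-laden stand-in: SQLite's planner and output format differ from MySQL's, and the index names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (chromosome INTEGER, position INTEGER);
    CREATE TABLE table_2 (chromosome INTEGER, start INTEGER, "end" INTEGER);
    -- Index names are placeholders chosen for this sketch.
    CREATE INDEX idx_t1 ON table_1 (chromosome, position);
    CREATE INDEX idx_t2 ON table_2 (chromosome, start, "end");
""")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT table_1.chromosome, table_1.position
    FROM table_1
    JOIN table_2 ON table_1.chromosome = table_2.chromosome
                AND table_1.position > table_2.start
                AND table_1.position < table_2."end"
""").fetchall()

# The last column of each row describes one step of the plan in plain text.
for row in plan:
    print(row[-1])
```

What to look for is similar in both systems: which table is scanned in the outer loop, and whether the inner lookup uses the compound index or falls back to a full scan.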


3 Comments

Hi, so first of all, thank you for the reply! I already added indexes to the tables, on position and chromosome, and on chromosome, start, end. Start is always lower than end, and it is my understanding that if I index this way, I can use any leftmost prefix of the index alone. I only used the * for this example, but as you said I only need pos and chromosome, so I'm already using your suggestions :)
I added something about SELECT DISTINCT.
Hey, OK, so I tried that and it still takes forever to run. My data is unique: no two positions are identical, and the data doesn't overlap. The other weird thing is that if I do 1 query per entry from table 2, it is blazing fast. I mean, I can keep doing it this way, I just wanted to handle everything directly with MySQL :( but thanks!
0

Interesting question. Without knowing more about the quantities contained in "position," I would still approach it generally in this way:

First select from table_1 (about 7 million rows) by position, so that each intermediate result is a smaller bin of data. Suppose, for instance, that "position" takes discrete integer values from 2 to 9. Select from table_1 where position equals 2, then select from table_2 where "start" is less than 2 and "end" is greater than 2. Repeat this for each of the 8 values, appending the results to a new table_3.

I am assuming here that table_2 is unique on chromosome and table_1 is not. You therefore end up with chromosomes that can have multiple positions within the same range (a chromosome has one range, but a position can fall anywhere within it). You also can't tell in advance how large the join result will be. It could be quite large, since each of the 7 million rows in table_1 could fall within every range in table_2.

Iterating would allow you to "grow" your results while observing the quality at each point experimentally before committing to the entire loop.

Here is an idea of the query I have in mind (untested):

SELECT t1.chromosome, t1.position, t2.start, t2.end
FROM 
(SELECT table_1.chromosome, table_1.position
  FROM table_1 WHERE table_1.position = 2) AS t1
JOIN
(SELECT table_2.chromosome, table_2.start, table_2.end
  FROM table_2 WHERE table_2.start < 2 AND table_2.end > 2) AS t2
ON
t1.chromosome = t2.chromosome
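The loop around that query might look roughly like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows and the layout of table_3 are assumptions made for this illustration ("end" is quoted because SQLite treats it as a keyword).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (chromosome INTEGER, position INTEGER);
    CREATE TABLE table_2 (chromosome INTEGER, start INTEGER, "end" INTEGER);
    CREATE TABLE table_3 (chromosome INTEGER, position INTEGER,
                          start INTEGER, "end" INTEGER);
    INSERT INTO table_1 VALUES (1, 2), (1, 5), (2, 5), (2, 9);
    INSERT INTO table_2 VALUES (1, 1, 4), (2, 4, 8);
""")

# One pass per distinct position value; each pass appends its matches to table_3.
positions = [row[0] for row in conn.execute(
    "SELECT DISTINCT position FROM table_1 ORDER BY position")]
for pos in positions:
    conn.execute("""
        INSERT INTO table_3
        SELECT t1.chromosome, t1.position, t2.start, t2."end"
        FROM (SELECT chromosome, position FROM table_1 WHERE position = ?) AS t1
        JOIN (SELECT chromosome, start, "end" FROM table_2
              WHERE start < ? AND "end" > ?) AS t2
          ON t1.chromosome = t2.chromosome
    """, (pos, pos, pos))
    # Inspect the running total before committing to the whole loop.
    count = conn.execute("SELECT COUNT(*) FROM table_3").fetchone()[0]
    print(f"position {pos}: table_3 now has {count} rows")

result = conn.execute(
    "SELECT * FROM table_3 ORDER BY chromosome, position").fetchall()
print(result)  # [(1, 2, 1, 4), (2, 5, 4, 8)]
```

This only pays off when "position" has few distinct values. With millions of distinct positions it amounts to one query per row of table_1.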

Good luck, and I hope you find your answer!
