1

I have a table that looks something like this:

Communication: (Calls made)

Timestamp            FromIDNumber ToIDNumber GeneralLocation 
2012-03-02 09:02:30  878          674        Grasslands 
2012-03-02 11:30:01  456          213        Tundra 
2012-03-02 07:02:12  789          654        Mountains
2012-03-02 08:06:08  458          789        Tundra 

I want to create a new table that holds every distinct value appearing in either FromIDNumber or ToIDNumber.

This is the SQL Fiddle for it.

This works:

INSERT INTO CommIDTemp (`ID`)
SELECT DISTINCT Communication.FromIDNumber
FROM Communication
UNION DISTINCT 
SELECT DISTINCT Communication.ToIDNumber
FROM Communication;

and I got:

 ID  
 878
 456
 789
 674
 213
 654
 365

But I wonder if there is a more efficient way, because my dataset has millions and millions of rows and I don't know how UNION DISTINCT performs at that size.

I originally tried something like

INSERT INTO CommIDTemp (`ID`) 
SELECT DISTINCT Communication.FromIDNumber
AND Communication.ToIDNumber 
FROM Communication; 

but that didn't work. Is there a more efficient way to do this? I'm pretty new to SQL, so any help would be greatly appreciated, thank you!

  • A AND B will insert the logical AND of the two values, not both columns. For example, select 'a' and 'b' returns 0. Commented Jun 2, 2015 at 21:50
  • This is a one-time task? So it does not really matter how long it takes? What will you do about adding new values as more data comes in? Commented Jun 5, 2015 at 5:51
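To illustrate the first comment, here is a small sketch of what AND does in a SELECT list. It assumes MySQL (as the backtick quoting in the question suggests) and the Communication table from the question; the alias both_nonzero is made up for the example.

```sql
-- AND is a logical operator: non-zero numbers count as true,
-- so each pair of IDs collapses to a single 1.
SELECT 878 AND 674;   -- 1

-- Non-numeric strings are converted to 0 (with a warning), so the result is 0.
SELECT 'a' AND 'b';   -- 0

-- The original attempt therefore yields ONE column of 0/1 values,
-- not the two ID columns (both_nonzero is a hypothetical alias).
SELECT DISTINCT FromIDNumber AND ToIDNumber AS both_nonzero
FROM Communication;
```

That is why the INSERT ran without a syntax error but did not produce the list of IDs.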

2 Answers

3

First, a caveat: I have no experience with tables this big, so you will have to test the following tips yourself to see whether they actually help in your situation:

1. Create index in the source table

Make sure that both columns FromIDNumber and ToIDNumber have an index, i.e.

ALTER TABLE Communication ADD INDEX (FromIDNumber);
ALTER TABLE Communication ADD INDEX (ToIDNumber);
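One way to check whether the new indexes are actually picked up is to run EXPLAIN on the SELECT part of the statement. This is only a sketch: the plan depends on your MySQL version and data, so treat the output as something to inspect rather than a guaranteed result.

```sql
-- Ask the optimizer how it would execute the UNION.
-- If the single-column indexes are used, the "key" column of the plan
-- should name them instead of showing a full table scan.
EXPLAIN
SELECT FromIDNumber FROM Communication
UNION
SELECT ToIDNumber FROM Communication;
```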

2. Try to remove DISTINCT

I could not find a faster query for your example, though you might try the query without the DISTINCT keyword - using UNION returns only distinct values by definition. So this SQL gives us the same result as your current query:

INSERT INTO CommIDTemp (`ID`)
SELECT FromIDNumber FROM Communication
UNION 
SELECT ToIDNumber FROM Communication;

3. Use a primary key in the temp table

Another approach is to make CommIDTemp.ID a primary key and use INSERT IGNORE - this is especially useful if you want to update the table frequently without deleting its contents:

CREATE TABLE CommIDTemp (ID INT PRIMARY KEY);

INSERT IGNORE INTO CommIDTemp (`ID`)
SELECT FromIDNumber FROM Communication
UNION
SELECT ToIDNumber FROM Communication;
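
As a rough sketch of such a repeated update, the same statement can be restricted to recent rows, and the primary key will silently skip IDs that are already present. The cutoff value below is a made-up example, and this only avoids a full scan if the Timestamp column is indexed, which is an assumption on my part.

```sql
-- Assumes CommIDTemp already exists with ID as its primary key.
-- The cutoff timestamp is a hypothetical example value.
INSERT IGNORE INTO CommIDTemp (`ID`)
SELECT FromIDNumber FROM Communication
WHERE `Timestamp` >= '2012-03-02 09:00:00'
UNION
SELECT ToIDNumber FROM Communication
WHERE `Timestamp` >= '2012-03-02 09:00:00';
```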

1 Comment

UNION defaults to DISTINCT, so that won't make any difference. The other option is UNION ALL, but that may give you duplicates.
2

Performance is mainly going to depend on how the table is indexed. I don't see a way to do everything in one pass so I would suggest separate indexes on FromIDNumber and ToIDNumber. That should make each statement in your union very fast even for a lot of rows.

You can make this faster by using only one DISTINCT. Each DISTINCT requires a sort or a temporary table. You can drop the DISTINCT from each SELECT, and the UNION DISTINCT will still make sure you get distinct values.

INSERT INTO CommIDTemp (`ID`)
SELECT Communication.FromIDNumber
FROM Communication
UNION DISTINCT 
SELECT Communication.ToIDNumber
FROM Communication;

Side Note: UNION ALL is faster than UNION DISTINCT but based on your requirements you need UNION DISTINCT which can be written as simply UNION.
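
To see the difference on a small scale, here is a sketch that counts the rows each variant returns. It assumes Communication contains only the four sample rows from the question; the derived-table aliases all_ids and distinct_ids are made up for the example.

```sql
-- UNION ALL keeps duplicates: 4 From values + 4 To values.
SELECT COUNT(*) FROM (
    SELECT FromIDNumber AS ID FROM Communication
    UNION ALL
    SELECT ToIDNumber FROM Communication
) AS all_ids;        -- 8 with the four sample rows

-- UNION removes duplicates: 789 appears in both columns and is counted once.
SELECT COUNT(*) FROM (
    SELECT FromIDNumber AS ID FROM Communication
    UNION
    SELECT ToIDNumber FROM Communication
) AS distinct_ids;   -- 7 with the four sample rows
```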
