Read large csv file with many duplicate values, drop duplicates while reading

Question

I have the following pandas code snippet that reads all the values found in a specific column of my .csv file.

sample_names_duplicates = pd.read_csv(infile, sep="\t", 
                                      engine="c", usecols=[4],
                                      squeeze=True)

That particular column of my file contains perhaps 20 distinct values at most (sample names), so it would probably be faster if I could drop the duplicates on the fly instead of storing them and then deleting the duplicates afterwards. Is it possible to delete duplicates as they are found?

If not, is there a way to do this more quickly, without having to make the user explicitly name what the sample names in her file are?

Burhan Khalid · Accepted Answer · 2015-03-04 08:50:10Z

3

Not "on the fly", although drop_duplicates should be fast enough for most needs.
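For reference, a minimal sketch of the read-then-deduplicate approach with drop_duplicates. The column layout and sample names are made up for illustration, and the example assumes the file has a header row. It selects the single column with .iloc rather than squeeze=True, since the squeeze argument was removed from read_csv in pandas 2.0.

```python
import io

import pandas as pd

# Hypothetical tab-separated data; the sample name sits in column index 4.
data = (
    "chrom\tstart\tend\tscore\tsample\n"
    "chr1\t10\t20\t0.5\tliver\n"
    "chr1\t30\t40\t0.7\tkidney\n"
    "chr2\t15\t25\t0.1\tliver\n"
)

# Read only the one column, then drop repeats; first-seen order is kept.
column = pd.read_csv(io.StringIO(data), sep="\t", usecols=[4]).iloc[:, 0]
sample_names = column.drop_duplicates().tolist()
```

The duplicates are still read into memory once here; they are only discarded after parsing.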

If you want to do this on the fly, you'll have to manually track duplicates on the particular column:

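import csv

seen = set()      # a set gives O(1) membership tests; a list would be O(n)
dup_scan_col = 3  # zero-based index of the column to check for duplicates
uniques = []      # first row seen for each distinct value

with open('yourfile.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row[dup_scan_col] not in seen:
            uniques.append(row)
            seen.add(row[dup_scan_col])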
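If only the distinct values of the column are needed (the sample names in the question) rather than whole rows, the same idea can be wrapped in a generator. This is a sketch, not a benchmarked solution: the function name unique_column_values and the in-memory sample data are hypothetical, and the column index 4 is assumed from usecols=[4] in the question.

```python
import csv
import io


def unique_column_values(lines, col, delimiter="\t"):
    """Yield each distinct value of column `col` once, in first-seen order."""
    seen = set()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) <= col:
            continue  # skip blank or short rows
        value = row[col]
        if value not in seen:
            seen.add(value)
            yield value


# Hypothetical in-memory data standing in for the real file.
sample = io.StringIO(
    "a\tb\tc\td\tliver\n"
    "a\tb\tc\td\tkidney\n"
    "a\tb\tc\td\tliver\n"
)
names = list(unique_column_values(sample, col=4))
```

With a real file, pass the open file object in place of the StringIO buffer. Memory use then grows with the number of distinct values, not with the number of rows.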

answered Mar 4, 2015 at 8:50

Burhan Khalid


Caumons · Accepted Answer · 2015-03-04 08:51:49Z

1

The result returned by read_csv() is iterable, so you can wrap it in a set() call to remove duplicates. Note that a set loses any ordering you may have had. If you then want sorted output, convert it with list() and call sort().

Unique unordered set example:

sample_names_duplicates = set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True))

Sorted list example:

sample_names = list(set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True)))
sample_names.sort()
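Sorting gives alphabetical order, which is not the same as the order in which the names appear in the file. If first-seen order matters, one option (assuming Python 3.7+, where dicts preserve insertion order) is dict.fromkeys(). The values below are made up for illustration.

```python
# Hypothetical column values, in the order they appear in the file.
values = ["liver", "kidney", "liver", "brain", "kidney"]

# dict keys are unique and keep insertion order, so this preserves first-seen order.
first_seen = list(dict.fromkeys(values))

# For comparison: set() plus sorting yields alphabetical order instead.
alphabetical = sorted(set(values))
```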

edited Mar 4, 2015 at 8:51

answered Mar 4, 2015 at 8:37

Caumons


3 Comments

The Unfun Cat Over a year ago

Will try this, wonder if it does remove the duplicates on the fly.

Caumons Over a year ago

Not on the fly: read_csv() still reads and returns every value, duplicates included. The set() call only removes the duplicates from what was returned.

The Unfun Cat Over a year ago

Accepting this until someone finds a faster way.
