I have the following pandas code snippet that reads all the values found in a specific column of my .csv file.

sample_names_duplicates = pd.read_csv(infile, sep="\t", 
                                      engine="c", usecols=[4],
                                      squeeze=True)

That particular column of my file contains perhaps 20 distinct values at most (sample names), so it would probably be faster to drop the duplicates on the fly instead of storing every value and then deleting the duplicates afterwards. Is it possible to discard duplicates as they are found?

If not, is there a way to do this more quickly, without having to make the user explicitly name what the sample names in her file are?

2 Answers

Not "on the fly", although drop_duplicates should be fast enough for most needs.
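
For reference, a minimal sketch of the drop_duplicates route. The in-memory data and its column names are made up for illustration. Newer pandas releases removed the squeeze argument used in the question, so this sketch selects the single column with .iloc[:, 0] instead:

```python
import io

import pandas as pd

# Hypothetical tab-separated data standing in for the real file;
# the column at zero-based index 4 holds the sample names.
data = (
    "a\tb\tc\td\tsample\n"
    "1\t2\t3\t4\tS1\n"
    "1\t2\t3\t4\tS2\n"
    "1\t2\t3\t4\tS1\n"
)

# Read only the sample-name column, then take it as a Series.
names = pd.read_csv(io.StringIO(data), sep="\t", usecols=[4]).iloc[:, 0]

# drop_duplicates() keeps the first occurrence of each value.
unique_names = names.drop_duplicates().tolist()
print(unique_names)
```

The whole column is still read before the duplicates are dropped, which is why this does not count as "on the fly".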

If you want to do this on the fly, you'll have to manually track duplicates on the particular column:

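import csv

seen = set()        # values already encountered; a set gives fast membership tests
dup_scan_col = 3    # zero-based index of the column to check (4 for the column in the question)
uniques = []        # first row found for each distinct value

with open('yourfile.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        if row[dup_scan_col] not in seen:
            uniques.append(row)
            seen.add(row[dup_scan_col])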
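
If you would rather stay within pandas, a possible middle ground is to read the file in chunks and fold each chunk's unique values into a set, so the full column is never held in memory at once. This is only a sketch: the in-memory data, the tiny chunksize, and the variable names are assumptions for illustration, and whether it is actually faster than a single read depends on the file.

```python
import io

import pandas as pd

# Hypothetical tab-separated data standing in for the real file;
# the column at zero-based index 4 holds the sample names.
data = (
    "a\tb\tc\td\tsample\n"
    "1\t2\t3\t4\tS1\n"
    "1\t2\t3\t4\tS2\n"
    "1\t2\t3\t4\tS1\n"
)

seen = set()
# With chunksize set, read_csv returns an iterator of DataFrames.
# chunksize=2 is deliberately tiny here so the example spans two chunks.
for chunk in pd.read_csv(io.StringIO(data), sep="\t", usecols=[4], chunksize=2):
    # Only the distinct values of each chunk are added to the running set.
    seen.update(chunk.iloc[:, 0].unique())

sample_names = sorted(seen)
print(sample_names)
```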

Because the result returned by read_csv() is iterable, you can wrap it in a set() call to remove duplicates. Note that a set does not preserve order, so you will lose any ordering the values had. If you want the values sorted, convert the set to a list with list() and then call sort() on it.

Unique unordered set example:

sample_names_duplicates = set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True))

Ordered list example:

sample_names = list(set(pd.read_csv(infile, sep="\t", engine="c", usecols=[4], squeeze=True)))
sample_names.sort()
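
To illustrate the ordering point, here is a small sketch that uses a plain list as a stand-in for the values returned by read_csv() (the sample names are made up). If the order of first appearance matters rather than alphabetical order, dict.fromkeys() is one way to deduplicate while keeping it:

```python
# Hypothetical values standing in for the column returned by read_csv().
names = ["S2", "S1", "S2", "S3", "S1"]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this preserves the order in which each name first appears.
first_seen_order = list(dict.fromkeys(names))

# sorted() accepts any iterable, so the set can be sorted in one step.
alphabetical = sorted(set(names))

print(first_seen_order)
print(alphabetical)
```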

3 Comments

Will try this. I wonder whether it removes the duplicates on the fly.
read_csv() still returns the duplicated values; this way we only remove the duplicates from what it returns.
Accepting this until someone finds a faster way.
