I have a large tab-delimited data file that I want to read in Python using pandas' "read_csv" or "read_table" function. When I read this large file it shows me the following error, even after I turn off the "index_col" value.

>>> read_csv("test_data.txt", sep = "\t", header=0, index_col=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 160, in _read
    return parser.get_chunk()
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 613, in get_chunk
    raise Exception(err_msg)
Exception: Implicit index (columns 0) have duplicate values [372, 1325, 1497, 1636, 2486, 2679, 3032, 3125, 4261, 4669, 5215, 5416, 5569, 5783, 5821, 6053, 6597, 6835, 7485, 7629, 7684, 7827, 8590, 9361, 10194, 11199, 11707, 11782, 12397, 15134, 15299, 15457, 15637, 16147, 17448, 17659, 18146, 18153, 18398, 18469, 19128, 19433, 19702, 19830, 19940, 20284, 21724, 22764, 23514, 25095, 25195, 25258, 25336, 27011, 28059, 28418, 28637, 30213, 30221, 30574, 30611, 30871, 31471, .......

I thought I might have duplicate values in my data, so I used grep to redirect some of these values into a file.

 grep "9996744\|9965107\|740645\|9999752" test_data.txt > delnow.txt

Now, when I read this file, it is parsed correctly, as you can see below.

>>> read_table("delnow.txt", sep = "\t", header=0, index_col=None)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns:
0740645                                                                 20  non-null values
M                                                                       20  non-null values
BLACK/CAPE VERDEAN                                                      20  non-null values

What is going on here? I have been struggling to find a solution, but to no avail.

I also tried the 'uniq' command in Unix to see if duplicate lines exist, but could not find any.

Does it have something to do with chunk size?

I am using the following version of pandas:

>>> pandas.__version__
'0.7.3'
>>> 
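One thing to note: 'uniq' only detects *adjacent* duplicate whole lines, while the exception complains about duplicate values in the first *column*. Since the real test_data.txt isn't available here, the following sketch uses a small hypothetical tab-delimited sample to show one way to count first-column duplicates directly:

```python
from collections import Counter
from io import StringIO

# Hypothetical tab-delimited sample standing in for test_data.txt
# (the real file is not available, so column names are made up).
sample = StringIO(
    "id\tsex\trace\n"
    "372\tM\tBLACK/CAPE VERDEAN\n"
    "1325\tF\tWHITE\n"
    "372\tF\tWHITE\n"  # duplicate value in the first column
)

next(sample)  # skip the header line
counts = Counter(line.split("\t")[0] for line in sample)
dupes = [value for value, n in counts.items() if n > 1]
print(dupes)  # -> ['372']
```

Against the actual file you would open it with `open("test_data.txt")` instead of the StringIO sample; any values printed are exactly what pandas 0.7.x is complaining about.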
  • What version of pandas are you on? Can you post some of the data? I think that the newest version of pandas allows for duplicate index values… Commented Sep 19, 2012 at 16:34
  • Is it tab separated? Perhaps you could include the first couple of lines? Commented Sep 19, 2012 at 16:38
  • @hayden, I included a sample file. It is a normal tab-delimited file, nothing special. Commented Sep 19, 2012 at 17:06

1 Answer

I installed the latest version of pandas, and I am able to read the file now.

>>> import pandas
>>> pandas.__version__
'0.8.1'
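For anyone hitting this later: with `index_col=None`, newer pandas keeps a default integer index, so repeated values in column 0 are treated as ordinary data rather than an implicit index. A minimal check, using an inline hypothetical sample (the original test_data.txt isn't available) and the modern `pd.read_csv` spelling:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample mimicking test_data.txt: the first column
# contains a repeated value, which pandas 0.7.x choked on.
data = StringIO(
    "id\tsex\trace\n"
    "372\tM\tBLACK/CAPE VERDEAN\n"
    "1325\tF\tWHITE\n"
    "372\tF\tWHITE\n"
)

# index_col=None leaves pandas with a default integer index, so
# duplicate values in column 0 are just ordinary data.
df = pd.read_csv(data, sep="\t", header=0, index_col=None)
print(len(df))            # -> 3 (all rows read, no exception)
print(df["id"].tolist())  # -> [372, 1325, 372]
```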
