I have a large tab-delimited data file that I want to read in Python using pandas' "read_csv" or "read_table" function. When I read this large file it shows me the following error, even after I turn off the "index_col" value.

>>> read_csv("test_data.txt", sep = "\t", header=0, index_col=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 187, in read_csv
    return _read(TextParser, filepath_or_buffer, kwds)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 160, in _read
    return parser.get_chunk()
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/parsers.py", line 613, in get_chunk
    raise Exception(err_msg)
Exception: Implicit index (columns 0) have duplicate values [372, 1325, 1497, 1636, 2486, 2679, 3032, 3125, 4261, 4669, 5215, 5416, 5569, 5783, 5821, 6053, 6597, 6835, 7485, 7629, 7684, 7827, 8590, 9361, 10194, 11199, 11707, 11782, 12397, 15134, 15299, 15457, 15637, 16147, 17448, 17659, 18146, 18153, 18398, 18469, 19128, 19433, 19702, 19830, 19940, 20284, 21724, 22764, 23514, 25095, 25195, 25258, 25336, 27011, 28059, 28418, 28637, 30213, 30221, 30574, 30611, 30871, 31471, .......

I thought I might have duplicate values in my data, so I used grep to redirect some of these values into a file.

 grep "9996744\|9965107\|740645\|9999752" test_data.txt > delnow.txt

Now, when I read this file, it is parsed correctly, as you can see below.

>>> read_table("delnow.txt", sep = "\t", header=0, index_col=None)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns:
0740645                                                                 20  non-null values
M                                                                       20  non-null values
BLACK/CAPE VERDEAN                                                      20  non-null values

What is going on here? I have been struggling to find a solution, but to no avail.

I also tried the 'uniq' command in Unix to see if duplicate lines exist, but could not find any.

Does it have something to do with chunk size?

I am using the following version of pandas:

>>> pandas.__version__
'0.7.3'
>>> 
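One thing to note: 'uniq' only detects *adjacent* duplicate whole lines, while the exception complains about duplicate values in the first *column*. Since the real test_data.txt isn't available here, the following sketch uses a small hypothetical tab-delimited sample to show one way to count first-column duplicates directly:

```python
from collections import Counter
from io import StringIO

# Hypothetical tab-delimited sample standing in for test_data.txt
# (the real file is not available, so column names are made up).
sample = StringIO(
    "id\tsex\trace\n"
    "372\tM\tBLACK/CAPE VERDEAN\n"
    "1325\tF\tWHITE\n"
    "372\tF\tWHITE\n"  # duplicate value in the first column
)

next(sample)  # skip the header line
counts = Counter(line.split("\t")[0] for line in sample)
dupes = [value for value, n in counts.items() if n > 1]
print(dupes)  # -> ['372']
```

Against the actual file you would open it with `open("test_data.txt")` instead of the StringIO sample; any values printed are exactly what pandas 0.7.x is complaining about.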
  • What version of pandas are you on? Can you post some of the data? I think that the newest version of pandas allows for duplicate index values… Commented Sep 19, 2012 at 16:34
  • Is it tab separated? Perhaps you could include the first couple of lines? Commented Sep 19, 2012 at 16:38
  • @hayden, I included a sample file. It is a normal tab-delimited file, nothing special. Commented Sep 19, 2012 at 17:06

1 Answer

I installed the latest version of pandas, and I am able to read the file now.

>>> import pandas
>>> pandas.__version__
'0.8.1'
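For anyone hitting this later: with `index_col=None`, newer pandas keeps a default integer index, so repeated values in column 0 are treated as ordinary data rather than an implicit index. A minimal check, using an inline hypothetical sample (the original test_data.txt isn't available) and the modern `pd.read_csv` spelling:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample mimicking test_data.txt: the first column
# contains a repeated value, which pandas 0.7.x choked on.
data = StringIO(
    "id\tsex\trace\n"
    "372\tM\tBLACK/CAPE VERDEAN\n"
    "1325\tF\tWHITE\n"
    "372\tF\tWHITE\n"
)

# index_col=None leaves pandas with a default integer index, so
# duplicate values in column 0 are just ordinary data.
df = pd.read_csv(data, sep="\t", header=0, index_col=None)
print(len(df))            # -> 3 (all rows read, no exception)
print(df["id"].tolist())  # -> [372, 1325, 372]
```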
