When I run the following code:

import pandas as pd

with open('data/training.csv', 'r') as f:

    data2 = pd.read_csv(f, sep='\t', index_col=0)
    EventID = pd.date_range('1/1/2000', periods=250000)
    df = pd.DataFrame(data2, index=EventID, columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])

print df[:3]

print(data2)

I get the following output:

            1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  \
2000-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN  
2000-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN  
2000-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN  

            17  18  19  20     
2000-01-01 NaN NaN NaN NaN ... 
2000-01-02 NaN NaN NaN NaN ... 
2000-01-03 NaN NaN NaN NaN ... 

I know the values within the CSV are not all "NaN", so why does the output look like this, and how can I get the correct output with the numbers in each of the rows?

When I comment out the "EventID" and the line that adds the "columns" as such:

import pandas as pd

with open('data/training.csv', 'r') as f:

    df = pd.read_csv(f, sep='\t', index_col=0)
    # EventID = pd.date_range('1/1/2000', periods=250000)
    # df = pd.DataFrame(data2, index=EventID, columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])

print df[:3]

I get the following output in the terminal:

/usr/bin/python2.7 /home/amit/PycharmProjects/HB/Read.py
Empty DataFrame
Columns: []
Index: [100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s, 100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,0.879,1.414,-999.0,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.23358448717,b, 100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,3.776,1.414,-999.0,32.154,-0.705,-2.093,121.409,-0.953,1.052,54.283,-2.186,260.414,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251,2.34738894364,b]

[3 rows x 0 columns]

Process finished with exit code 0

I'm not sure what to make of the "3 rows by 0 columns" part.

  • What happens if you just do pd.read_csv('data/training.csv', sep='\t', index_col=0)? Does this read the values correctly? Also, what happens if you drop the index_col param? It looks unnecessary, seeing as you are assigning a new index later. By default read_csv assumes your csv has no index column, so if that is your intention then remove the param. Commented Jul 7, 2014 at 22:08
  • @EdChum Doing that displays the numerical values, but the dimensions of this output are [3 rows x 0 columns]. Is this to be expected? If so, how would I get the values in one row, or one column, or a subset, etc... Commented Jul 7, 2014 at 22:12
  • I don't understand, how can there be 0 columns and somehow 3 rows? Once the data is loaded you can manipulate the shape however you want. Could you post the raw data and some code that reproduces your error? At the moment it is a bit vague and difficult to understand if we cannot reproduce your problem. Commented Jul 7, 2014 at 22:15
  • @EdChum, I edited my original post to include the "3 rows by 0 columns" part Commented Jul 7, 2014 at 22:22
  • It looks like you have just a single column, is that correct? If so, pass index_col=None to read_csv. You can then still manipulate the data, but that should get you over the first hurdle. Commented Jul 7, 2014 at 22:26
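
The "3 rows x 0 columns" result discussed in these comments can be reproduced with a small sketch. It assumes the file is actually comma-delimited while being read with sep='\t'; the three short rows below are hypothetical stand-ins for the real data, and the code targets Python 3 with a recent pandas rather than the Python 2.7 setup in the question. With the wrong separator each line is parsed as a single field, index_col=0 turns that field into the index, and no columns are left.

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for the real file: comma-delimited, no header row.
raw = (
    "100000,138.47,51.655,s\n"
    "100001,160.937,68.768,b\n"
    "100002,-999.0,162.172,b\n"
)

# Wrong separator: every line is one field, and index_col=0 consumes that field.
wrong = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0, header=None)

# Matching separator: the first field becomes the index, the rest become columns.
right = pd.read_csv(io.StringIO(raw), sep=",", index_col=0, header=None)

print(wrong.shape)
print(right.shape)
```

Comparing the two shapes is a quick way to check whether the separator matches the file before building anything on top of the result.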

1 Answer

I don't know exactly what your data looks like, so I will just use what is shown in the question:

In [76]:

%%file temp.csv
100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s, 100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,0.879,1.414,-999.0,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.23358448717,b, 100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,3.776,1.414,-999.0,32.154,-0.705,-2.093,121.409,-0.953,1.052,54.283,-2.186,260.414,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251,2.34738894364,b

In [77]:
# Check that the file really is tab-delimited rather than comma-delimited
# Change pd.DataFrame(data2, ...) to pd.DataFrame(data2.values, ...)
with open('temp.csv', 'r') as f:
    data2 = pd.read_csv(f, sep=',', index_col=0, header=None)
    EventID = pd.date_range('1/1/2000', periods=1)
    df = pd.DataFrame(data2.values, index=EventID, columns=range(98))

print df[:3]

                0       1       2      3     4        5      6      7   \
2000-01-01  138.47  51.655  97.827  27.98  0.91  124.711  2.666  3.064   

                8       9    ...   88      89     90     91   92   93   94  \
2000-01-01  41.928  197.76   ...    1  44.251  2.053 -2.028 -999 -999 -999   

                95        96 97  
2000-01-01  44.251  2.347389  b  

[1 rows x 98 columns]

pd.DataFrame(data2.values, ...) is the key here. data2 is a DataFrame and has its own index. When you wrap it in a new DataFrame with a new time-series index, pandas tries to align the original index with the new one, but none of the labels match.

Therefore, pd.DataFrame(data2, ...) results in a DataFrame full of NaN. The solution is to pass the values, as a numpy.array, to the constructor: pd.DataFrame(data2.values, ...).
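
The difference between the two constructor calls can be seen in a minimal sketch. The small frame below is a hypothetical stand-in for data2 (made-up column names, three rows), not the real file, and the code assumes a recent pandas.

```python
import pandas as pd

# Hypothetical stand-in for data2: integer event IDs as the index.
data2 = pd.DataFrame(
    [[138.47, 51.655], [160.937, 68.768], [-999.0, 162.172]],
    index=[100000, 100001, 100002],
    columns=["a", "b"],
)
new_index = pd.date_range("1/1/2000", periods=3)

# Passing the DataFrame: pandas aligns by label, no labels match, so every cell is NaN.
aligned = pd.DataFrame(data2, index=new_index, columns=[1, 2])

# Passing the underlying array: values are placed by position under the new labels.
positional = pd.DataFrame(data2.values, index=new_index, columns=[1, 2])

print(aligned)
print(positional)
```

Note that going through .values drops the original index and column labels entirely, so this only makes sense when the new labels are meant to replace the old ones row for row.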
