I have some large csv and xlsx files which I need to set up pandas DataFrames for. I have code which locates these files within the directory (when printed, these show correct pathnames). These paths are then passed to a helper function which is meant to set up the required DataFrames for the files, then the data will be passed to other functions for some manipulation. I intend to have the data written to a file (by loading a template, writing the data to it, and saving this file) once this is completed.

I currently have code like:

from io import StringIO

import pandas
# some set-up functions (which work; verified using print statements)

def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)

When I try print(df), I get:

Empty DataFrame
Columns: [a.csv]
Index: []

and print(StringIO(filename)) gives me:

<_io.StringIO object at 0x004D1990>

However, when I leave out the StringIO() around filename in the function, I get this error:

OSError: File b'a.csv' does not exist

Everywhere that I've been able to find information on this has either just said import and start using, or talks about using read_csv() rather than from_csv() (from this question, which wasn't very helpful here), and even the current pandas docs basically say that it should be as easy as passing the file to pandas.read_csv().

1) I've checked that I have full permissions and that the file is valid and exists. Why am I getting the OSError?

2) When I use StringIO(), why am I still getting an empty DataFrame here? How can I fix this?

Thanks in advance.

  • Why do you need StringIO? won't it just work without this? i.e. pandas.read_csv(filename,.....) Commented Apr 13, 2016 at 16:59
  • As posted in my question, without StringIO I am getting an OSError. I have been unable to discover why, and would appreciate any pointers that could solve that issue (and then maybe it will all work) Commented Apr 13, 2016 at 17:00

1 Answer

I have solved this.

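StringIO was the cause of the empty DataFrame: StringIO(filename) wraps the filename string itself as if it were the file's contents, so read_csv() parsed the text 'a.csv' as a header row with no data. The OSError was a separate path problem. My os.path.is_file() check was not working, and without StringIO I got the error: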

OSError: File b'a.csv' does not exist

It wasn't until I stumbled upon this page from the Python 2.5 docs that I discovered that the call should actually be os.path.isfile(). That is the spelling on every platform, not only Windows; there is no os.path.is_file(). On Windows, os.path is implemented by the ntpath module, which handles the difference in pathnames between systems (Windows uses '\', Unix uses '/').

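Because my paths were wrong (the bare filename was not joined to the directory containing the data, so it was resolved relative to the current working directory), pandas was unable to load the CSV files into DataFrames.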
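To illustrate why the StringIO version produced an empty DataFrame, here is a minimal sketch. It assumes only that the filename is the string "a.csv", as in the question; no file is read from disk at any point.

```python
from io import StringIO

import pandas

# StringIO wraps the *text* "a.csv" in a file-like object.
# Nothing is opened on disk; the buffer's entire content is that one string.
buffer = StringIO("a.csv")

# read_csv sees a single line, treats it as the header row, and finds no data rows.
df = pandas.read_csv(buffer)

print(df.empty)          # True
print(list(df.columns))  # ['a.csv']
```

StringIO is the right tool when you already hold CSV text in memory; when you hold a path, pass the path itself to read_csv().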

By simply changing my code from this:

from io import StringIO

import pandas
# some set-up functions (which work; verified using print statements)

def createDataFrame(filename):
    if filename.endswith('.csv'):
        df = pandas.read_csv(StringIO(filename), skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)

to this:

import os

import pandas
# some set-up functions (which have been updated)

def createDataFrame(filename):
    # config: project settings module (import not shown)
    basepath = config.complete_datasets_dir
    # build the full path so the file is found regardless of the working directory
    fullpath = os.path.join(basepath, filename)

    if filename.endswith('.csv'):
        df = pandas.read_csv(fullpath, skip_blank_lines=True, index_col=False,
                             encoding="utf-8", skipinitialspace=True)
        return df

and appropriately updating the function which calls that function:

def somefunc():
    dfs = []
    data_lists = getInputFiles() # checks data directory for files containing info
    for item in data_lists:
        tdata = createDataFrame(item)  # one DataFrame per input file
        dfs.append(tdata)
    print(dfs)

I was able to get the output I was looking for:

[    1   2   3   4   5   6   7   8   9   10
0  11  12  13  14  15  16  17  18  19  20
1  21  22  23  24  25  26  27  28  29  30
2  31  32  33  34  35  36  37  38  39  40,     1  2  3  4  5  6  7  8  9  10
0  11  12  13  14  15  16  17  18  19  20
1  21  22  23  24  25  26  27  28  29  30]

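which is a list of two DataFrames. The first came from a CSV containing only the numbers 1-40 (on 4 rows total, no headers); the second file contains only the numbers 1-30 (formatted the same way). Because read_csv() treats the first row as the header by default, the first row of each file shows up as the column names; passing header=None would keep it as data.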
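For anyone who wants to catch this kind of path problem before pandas does, here is a rough sketch of a check using os.path.join() and os.path.isfile(). The helper name resolve_data_path and the temporary directory are hypothetical and only there for illustration; they are not part of my actual code.

```python
import os
import tempfile


def resolve_data_path(base_dir, filename):
    """Join base_dir and filename, raising early if no such file exists.

    Hypothetical helper for illustration only.
    """
    full_path = os.path.join(base_dir, filename)
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f"No such file: {full_path}")
    return full_path


# Demo in a throwaway directory that stands in for the real data directory.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "a.csv"), "w", encoding="utf-8") as fh:
        fh.write("1,2,3\n11,12,13\n")

    # The bare name "a.csv" is resolved against data_dir, not the working directory.
    found = resolve_data_path(data_dir, "a.csv")
    found_ok = os.path.isfile(found)

    # A file that is not there fails immediately with the full path in the message.
    try:
        resolve_data_path(data_dir, "missing.csv")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

The path returned by a check like this can be handed straight to pandas.read_csv(), and the error message shows exactly which full path was tried.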

I hope this helps someone in the future.
