I have a CSV file that looks like this:

K1
,Value
M1,0
M2,10
M3,3
K2
,Value,Value,Value
M1,4,6,3
M2,7,3,4
M3,10,2,6
K3
,Value,Value
M1,0,4
M2,10,2
M3,3,7

The file is organized in groups of 5 lines: the group name, a header line, and 3 data rows. For example, the first group is named K1 and is followed by a dataframe with 3 rows and 1 column. The number of rows per group is fixed, but the number of columns varies: K1 has 1 column, K2 has 3 columns and K3 has 2 columns. I want to read the file into a dictionary where each key is a group name (K1, K2 or K3) and each value is the dataframe associated with that group.

A plain read_csv call such as df = pd.read_csv('test.batch.csv') fails with the following error:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    df = pd.read_csv('test.batch.csv')
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/mahmood/.local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2061, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 7, saw 4

I know the file is not properly formatted for read_csv(), so I would like to know whether there is another read function that can handle it. Any ideas?

  • Will you please add a sample dataframe containing your expected output? Commented Nov 26, 2021 at 18:49
  • I added a figure. Does that help? Commented Nov 26, 2021 at 19:11
  • That's what you want the dataframe to look like? Commented Nov 26, 2021 at 19:13
  • So that would really have to be 3 separate dfs...right? Commented Nov 26, 2021 at 19:13
  • I would probably use the csv module to read in a list of lists, then do my own parsing, since you have 3 dataframes in one file, as @user17242583 says. Commented Nov 26, 2021 at 19:26

2 Answers

My idea is to start from an empty intermediate dictionary.

Then read a single line from the input file (the key) and the following 4 lines (the header plus 3 data rows) as the value, and add the pair to the dictionary.

The final step is to "map" this dictionary, using a dictionary comprehension, changing each (string) value into a DataFrame.

To do this, you can call read_csv on each value, wrapping the string in io.StringIO so that it is read as the source content.
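For a single section, the idea can be sketched as follows. This is a minimal illustration, assuming the text of the K1 section from the question has already been collected into a string; the variable name section_text is made up for the example.

```python
import io
import pandas as pd

# Text of one section (header line plus three data lines), as it would be
# stored in the intermediate dictionary under the key "K1".
section_text = ",Value\nM1,0\nM2,10\nM3,3"

# io.StringIO wraps the string in a file-like object that read_csv accepts;
# index_col=0 turns the first (unnamed) column into the index.
df = pd.read_csv(io.StringIO(section_text), index_col=0)
print(df)
```

Note that io must be imported explicitly, otherwise the call fails with a NameError.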

The full code can be, for example:

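import io
import pandas as pd

wrk = {}
with open('Input.csv') as fp:
    while True:
        line = fp.readline()
        if not line:  # end of file
            break
        key = line.strip()  # group name, e.g. K1
        # the header line plus 3 data lines belonging to this group
        txt = [ fp.readline().strip() for i in range(4) ]
        wrk[key] = '\n'.join(txt)
# parse each text block; the first column becomes the index
result = { k: pd.read_csv(io.StringIO(v), index_col=[0]) for k, v in wrk.items() }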

Note, however, a side effect of how read_csv works:

If column names are not unique, pandas appends a dot and a consecutive number to each repeated name.

For example, the content of the K2 key in result is:

    Value  Value.1  Value.2
M1      4        6        3
M2      7        3        4
M3     10        2        6

Or maybe the actual column names within each input "section" are not all the same?
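If the repeated names really are identical and you want them back, one possible workaround is to strip the numeric suffix after reading. This is only a sketch: it assumes that no genuine column name ends in a dot followed by digits, since such a name would be shortened too.

```python
import io
import re
import pandas as pd

# The K2 section from the question, held in a string for illustration.
section = ",Value,Value,Value\nM1,4,6,3\nM2,7,3,4\nM3,10,2,6"

df = pd.read_csv(io.StringIO(section), index_col=0)
mangled = list(df.columns)  # names as deduplicated by read_csv

# Remove a trailing ".<number>" suffix to restore the original header text.
df.columns = [re.sub(r"\.\d+$", "", name) for name in df.columns]
restored = list(df.columns)
```

Keep in mind that with duplicate names, df['Value'] returns all matching columns as a DataFrame rather than a single Series.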

To sum up, this code at least lets you work around the requirement that all rows have the same number of columns when reading a single DataFrame.

2 Comments

With Python 3.8, I get "name 'io' is not defined". I also tried from io import StringIO and got the same error.
In my environment I use import io, and StringIO is then called as io.StringIO.

Here's my implementation, using csv.reader as I mentioned in my comment and parsing the resulting list of lists. Note that I had to temporarily name the index column so that I could set it as the index of the dataframe.

import csv
import pandas as pd
import numpy as np

INPUT_FILE = r"multidf.csv"

my_df_source = {}
rowcount = 0 # no data parsed

# read file into dictionary entries (key: (headers, datarows))
with open(INPUT_FILE, "r", newline="") as f: # file is closed when the block ends
    for row in csv.reader(f):
        if len(row) == 1: # single entry lists are df keys
            if rowcount: # store in-process df, if any
                my_df_source[df_key] = df_headers, df_rows
            df_key = row[0]
            rowcount = 0
        else:
            if rowcount: # rows after row 0 are data
                df_rows.append(row)
            else: # row 0 is headers
                row[0]='index' # create temporary index label
                df_headers = row
                df_rows = []
            rowcount += 1

if rowcount: # store last df
    my_df_source[df_key] = df_headers, df_rows
        
# create dataframes (cell values stay strings, as read by csv.reader)
my_dfs = {}
for key,(headers,data) in my_df_source.items():
    my_dfs[key] = pd.DataFrame(np.array(data),columns=headers).set_index('index')
    my_dfs[key].index.name=None # remove temporary label
    
for key, df in my_dfs.items():
    print(key)
    print(df)
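Since csv.reader returns every cell as a string, the dataframes above hold text rather than numbers. As a variation, here is a minimal sketch that converts the cells to int while building the frames. It assumes every data cell is an integer, that the file starts with a group name, and that the third group is named K3; the helper read_groups and the in-memory SAMPLE string are made up for the illustration.

```python
import csv
import io
import pandas as pd

# In-memory copy of the sample file (assumption: third group is K3).
SAMPLE = """\
K1
,Value
M1,0
M2,10
M3,3
K2
,Value,Value,Value
M1,4,6,3
M2,7,3,4
M3,10,2,6
K3
,Value,Value
M1,0,4
M2,10,2
M3,3,7
"""

def read_groups(handle):
    """Return {group name: DataFrame} with integer cell values."""
    sections = {}
    current = None
    for row in csv.reader(handle):
        if len(row) == 1:    # a lone field is a group name
            current = sections[row[0]] = []
        elif row:            # skip blank lines, collect header and data rows
            current.append(row)
    frames = {}
    for key, (header, *data) in sections.items():
        frames[key] = pd.DataFrame(
            [[int(v) for v in r[1:]] for r in data],  # cells after the label
            index=[r[0] for r in data],               # row labels M1, M2, M3
            columns=header[1:],                       # header without the blank
        )
    return frames

frames = read_groups(io.StringIO(SAMPLE))
for key, df in frames.items():
    print(key)
    print(df)
```

Passing an open file handle instead of io.StringIO(SAMPLE) should work the same way. Building the index directly from the first field also avoids the temporary 'index' label.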
