
I have a question concerning reading bits and pieces of a csv file. When just reading the file using

pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0])

I get:

     EUR     1Y     2Y     3Y
0  2013-09-25  0,198  0,307  0,485
1  2013-09-26  0,204  0,318  0,497
2  2013-09-27  0,204  0,306  0,487
3  2013-09-28  0,204  0,306  0,487
4         USD     1Y     2Y     3Y
5  2013-09-25  0,462  0,571  0,749
6  2013-09-26  0,468  0,582  0,761
7  2013-09-27  0,468   0,57  0,751
8  2013-09-28  0,468   0,57  0,751

As you can see, the data is arranged date-wise, and the data sets follow one another in chunks (here the USD data comes straight after the EUR data). The currency labels trip things up a bit, and everything ends up in one single data frame.

What I would like to have are two separate data frames, as

     EUR     1Y     2Y     3Y
0  2013-09-25  0,198  0,307  0,485
1  2013-09-26  0,204  0,318  0,497
2  2013-09-27  0,204  0,306  0,487
3  2013-09-28  0,204  0,306  0,487

     USD     1Y     2Y     3Y
0  2013-09-25  0,462  0,571  0,749
1  2013-09-26  0,468  0,582  0,761
2  2013-09-27  0,468   0,57  0,751
3  2013-09-28  0,468   0,57  0,751

That is, I would like to separate the currency data sets from one another.

Any suggestions?

2 Answers


Here is an alternative approach to the problem. It reads the csv into a single DataFrame and then uses a bit of data-wrangling to create a currency column:

           currency     1Y     2Y     3Y
date                                    
2013-09-25      EUR  0,198  0,307  0,485
2013-09-26      EUR  0,204  0,318  0,497
2013-09-27      EUR  0,204  0,306  0,487
2013-09-28      EUR  0,204  0,306  0,487
2013-09-25      USD  0,462  0,571  0,749
2013-09-26      USD  0,468  0,582  0,761
2013-09-27      USD  0,468   0,57  0,751
2013-09-28      USD  0,468   0,57  0,751

You could then "split" the DataFrame into smaller DataFrames according to the currency using groupby:

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)

import numpy as np
import pandas as pd

df = pd.read_table('data',sep=';',na_values=[''],thousands='.',decimal=',',
                   names=['date', '1Y', '2Y', '3Y'])
mask = df['date'].str.contains(r'^\s*\D')             # 1
df['currency'] = (df['date']
                  .where(mask, np.nan)                # 2
                  .fillna(method='ffill'))            # 3
df = df.loc[~mask]                                    # 4

print(df)    

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)

  1. Use str.contains to find values in df['date'] which begin with a non-digit. The values are presumed to be currencies. The mask is True on these rows.

    In [120]: mask
    Out[120]: 
    0     True
    1    False
    2    False
    3    False
    4    False
    5     True
    6    False
    7    False
    8    False
    9    False
    Name: date, dtype: bool
    
  2. df['date'].where(mask, np.nan) returns a Series, equal to df['date'] where the mask is True, and np.nan otherwise.
  3. Forward-fill the NaNs with the currency values:

    In [123]: df['date'].where(mask, np.nan).fillna(method='ffill')
    Out[123]: 
    0    EUR
    1    EUR
    2    EUR
    3    EUR
    4    EUR
    5    USD
    6    USD
    7    USD
    8    USD
    9    USD
    Name: date, dtype: object
    
  4. Select only those rows where the mask is False, thus removing the header rows.
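
If you want the two separate DataFrames the question asks for, the groups can be collected into a dict keyed by currency. A minimal sketch (the small frame below just mimics the wrangled result shown above; the values are illustrative):

```python
import pandas as pd

# Toy frame mimicking the wrangled result above (values are illustrative)
df = pd.DataFrame({
    'currency': ['EUR', 'EUR', 'USD', 'USD'],
    '1Y': ['0,198', '0,204', '0,462', '0,468'],
}, index=['2013-09-25', '2013-09-26', '2013-09-25', '2013-09-26'])

# One DataFrame per currency, keyed by the label
dfs = {key: grp.drop(columns='currency') for key, grp in df.groupby('currency')}

print(dfs['EUR'])
print(dfs['USD'])
```

You can then work with `dfs['EUR']` and `dfs['USD']` independently.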

2 Comments

Really nice solution as well. Actually more in line with what I need, as I will eventually need to transfer everything into an Excel pivot table (so the currency column makes things a lot easier). @unutbu However, I can't seem to get the line mask = data['date'].str.match('^\s*\D') to work... I only get () and [] as output for 'mask'...
Solved it by using mask = data['date'].str.contains('^\s*\D') instead of str.match. Thanks again @unutbu

Use the nrows and skiprows parameters of read_csv.

So for the first DataFrame, read just the first 4 rows:

eur = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], nrows=4)

and for the subsequent DataFrame, skip the first 5 rows (the EUR header plus its four data rows):

usd = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], skiprows=5)

That should work.

3 Comments

Very nice solution! @EdChum. You don't happen to know a quick fix to automate nrows and skiprows? E.g. "find the first occurrence of a non-date -> start from there"?
You would have to read the csv iteratively and detect when you didn't receive a date string; unfortunately, there is no built-in method in pandas for this. You could pass the dtype param to explicitly tell pandas what the dtypes of the columns are; it should fail, and you can note where the error occurs.
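The iterative detection described in the comment above could be sketched like this (a hypothetical helper, assuming ISO dates and the ';' separator from the question):

```python
import re

def find_header_rows(lines, sep=';'):
    """Return indices of lines whose first field is not an ISO date,
    i.e. the per-currency header rows."""
    date_re = re.compile(r'\d{4}-\d{2}-\d{2}')
    return [i for i, line in enumerate(lines)
            if not date_re.match(line.split(sep)[0].strip())]

lines = [
    'EUR;1Y;2Y;3Y',
    '2013-09-25;0,198;0,307;0,485',
    '2013-09-26;0,204;0,318;0,497',
    'USD;1Y;2Y;3Y',
    '2013-09-25;0,462;0,571;0,749',
]
print(find_header_rows(lines))  # [0, 3]
```

Consecutive header indices then give you the skiprows/nrows values for each read_csv call.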
The other possibility is to slice the incorrect dataframe to yield 2 correct dataframes, e.g. eur = df.iloc[0:4] and usd = df.iloc[5:10]; you would have to rename the columns for the usd df though.
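A sketch of that slicing approach (the toy frame below stands in for the mis-parsed single DataFrame; row 2 is the embedded USD header):

```python
import pandas as pd

# Toy stand-in for the mis-parsed single DataFrame (row 2 is the USD header)
df = pd.DataFrame({
    'EUR': ['2013-09-25', '2013-09-26', 'USD', '2013-09-25', '2013-09-26'],
    '1Y':  ['0,198', '0,204', '1Y', '0,462', '0,468'],
})

eur = df.iloc[0:2].reset_index(drop=True)
usd = df.iloc[3:].reset_index(drop=True)
usd.columns = ['USD', '1Y']   # relabel using the embedded USD header

print(eur)
print(usd)
```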
