
I have a question concerning reading bits and pieces of a csv file. When just reading the file using

pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0])

I get:

     EUR     1Y     2Y     3Y
0  2013-09-25  0,198  0,307  0,485
1  2013-09-26  0,204  0,318  0,497
2  2013-09-27  0,204  0,306  0,487
3  2013-09-28  0,204  0,306  0,487
4         USD     1Y     2Y     3Y
5  2013-09-25  0,462  0,571  0,749
6  2013-09-26  0,468  0,582  0,761
7  2013-09-27  0,468   0,57  0,751
8  2013-09-28  0,468   0,57  0,751

As you can see, the data is arranged date-wise, and the data sets follow one another in chunks (here the USD data comes straight after the EUR data). The currency labels trip things up a bit, and everything ends up in one single data frame.

What I would like to have are two separate data frames, as

     EUR     1Y     2Y     3Y
0  2013-09-25  0,198  0,307  0,485
1  2013-09-26  0,204  0,318  0,497
2  2013-09-27  0,204  0,306  0,487
3  2013-09-28  0,204  0,306  0,487

     USD     1Y     2Y     3Y
0  2013-09-25  0,462  0,571  0,749
1  2013-09-26  0,468  0,582  0,761
2  2013-09-27  0,468   0,57  0,751
3  2013-09-28  0,468   0,57  0,751

That is, I would like to separate the currency data sets from one another.

Any suggestions?

2 Answers


Here is an alternative approach to the problem. It reads the csv into a single DataFrame and then uses a bit of data-wrangling to create a currency column:

           currency     1Y     2Y     3Y
date                                    
2013-09-25      EUR  0,198  0,307  0,485
2013-09-26      EUR  0,204  0,318  0,497
2013-09-27      EUR  0,204  0,306  0,487
2013-09-28      EUR  0,204  0,306  0,487
2013-09-25      USD  0,462  0,571  0,749
2013-09-26      USD  0,468  0,582  0,761
2013-09-27      USD  0,468   0,57  0,751
2013-09-28      USD  0,468   0,57  0,751

You could then "split" the DataFrame into smaller DataFrames according to the currency using groupby:

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)

import numpy as np
import pandas as pd

df = pd.read_table('data',sep=';',na_values=[''],thousands='.',decimal=',',
                   names=['date', '1Y', '2Y', '3Y'])
mask = df['date'].str.contains(r'^\s*\D')             # 1
df['currency'] = (df['date']
                  .where(mask, np.nan)                # 2
                  .fillna(method='ffill'))            # 3
df = df.loc[~mask]                                    # 4

print(df)    

groups = df.groupby(['currency'])
for key, grp in groups:
    print(grp)

  1. Use str.contains to find values in df['date'] which begin with a non-digit. The values are presumed to be currencies. The mask is True on these rows.

    In [120]: mask
    Out[120]: 
    0     True
    1    False
    2    False
    3    False
    4    False
    5     True
    6    False
    7    False
    8    False
    9    False
    Name: date, dtype: bool
    
  2. df['date'].where(mask, np.nan) returns a Series, equal to df['date'] where the mask is True, and np.nan otherwise.
  3. Forward-fill the NaNs with the currency values:

    In [123]: df['date'].where(mask, np.nan).fillna(method='ffill')
    Out[123]: 
    0    EUR
    1    EUR
    2    EUR
    3    EUR
    4    EUR
    5    USD
    6    USD
    7    USD
    8    USD
    9    USD
    Name: date, dtype: object
    
  4. Select only those rows where the mask is False, thus removing the header rows.
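
If you want the two separate DataFrames the question asks for, the groups can be collected into a dict keyed by currency. A minimal sketch (the small frame below just mimics the wrangled result shown above; the values are illustrative):

```python
import pandas as pd

# Toy frame mimicking the wrangled result above (values are illustrative)
df = pd.DataFrame({
    'currency': ['EUR', 'EUR', 'USD', 'USD'],
    '1Y': ['0,198', '0,204', '0,462', '0,468'],
}, index=['2013-09-25', '2013-09-26', '2013-09-25', '2013-09-26'])

# One DataFrame per currency, keyed by the label
dfs = {key: grp.drop(columns='currency') for key, grp in df.groupby('currency')}

print(dfs['EUR'])
print(dfs['USD'])
```

You can then work with `dfs['EUR']` and `dfs['USD']` independently.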

2 Comments

Really nice solution as well. Actually more in line with what I need, as I will eventually need to transfer everything into an Excel pivot table (so the currency column makes things a lot easier). @unutbu However, I can't seem to get the line mask = data['date'].str.match('^\s*\D') to work... I only get () and [] as output for 'mask'...
Solved it by using mask = data['date'].str.contains('^\s*\D') instead of str.match. Thanks again @unutbu

Use the nrows and skiprows parameters of read_csv.

So for the first DataFrame, read just the first 4 rows:

eur = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], nrows=4)

and for the subsequent DataFrame, skip the first 5 rows (the EUR header plus its four data rows):

usd = pd.read_csv(path, sep=';', na_values=[''], thousands='.', decimal=',', parse_dates=[0], skiprows=5)

That should work.

3 Comments

Very nice solution! @EdChum. You don't happen to know a quick fix to automate nrows and skiprows? E.g. "find the first occurrence of a non-date -> start from there"?
You would have to read the csv iteratively and detect when you didn't receive a date string; unfortunately, there is no built-in method in pandas for this. You could pass the dtype param to explicitly tell pandas what the dtypes of the columns are; it should fail, and you can note where the error occurs.
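The iterative detection described in the comment above could be sketched like this (a hypothetical helper, assuming ISO dates and the ';' separator from the question):

```python
import re

def find_header_rows(lines, sep=';'):
    """Return indices of lines whose first field is not an ISO date,
    i.e. the per-currency header rows."""
    date_re = re.compile(r'\d{4}-\d{2}-\d{2}')
    return [i for i, line in enumerate(lines)
            if not date_re.match(line.split(sep)[0].strip())]

lines = [
    'EUR;1Y;2Y;3Y',
    '2013-09-25;0,198;0,307;0,485',
    '2013-09-26;0,204;0,318;0,497',
    'USD;1Y;2Y;3Y',
    '2013-09-25;0,462;0,571;0,749',
]
print(find_header_rows(lines))  # [0, 3]
```

Consecutive header indices then give you the skiprows/nrows values for each read_csv call.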
The other possibility is to slice the incorrect dataframe to yield 2 correct dataframes, e.g. eur = df.iloc[0:4] and usd = df.iloc[5:10]; you would have to rename the columns for the usd df though.
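A sketch of that slicing approach (the toy frame below stands in for the mis-parsed single DataFrame; row 2 is the embedded USD header):

```python
import pandas as pd

# Toy stand-in for the mis-parsed single DataFrame (row 2 is the USD header)
df = pd.DataFrame({
    'EUR': ['2013-09-25', '2013-09-26', 'USD', '2013-09-25', '2013-09-26'],
    '1Y':  ['0,198', '0,204', '1Y', '0,462', '0,468'],
})

eur = df.iloc[0:2].reset_index(drop=True)
usd = df.iloc[3:].reset_index(drop=True)
usd.columns = ['USD', '1Y']   # relabel using the embedded USD header

print(eur)
print(usd)
```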
