
I am trying to drop some rows from my pandas DataFrame df. It has 180 rows and 2745 columns and looks like this. I want to get rid of the rows whose curv_typ is PYC_RT or YCIF_RT, and I also want to get rid of the geo\time column. I am extracting this data from a CSV file, and I realized that curv_typ,maturity,bonds,geo\time and the values below it, such as PYC_RT,Y1,GBAAA,EA, all end up in one column:

 curv_typ,maturity,bonds,geo\time  2015M06D16   2015M06D15   2015M06D11   \
0                 PYC_RT,Y1,GBAAA,EA        -0.24        -0.24        -0.24   
1               PYC_RT,Y1,GBA_AAA,EA        -0.02        -0.03        -0.10   
2                PYC_RT,Y10,GBAAA,EA         0.94         0.92         0.99   
3              PYC_RT,Y10,GBA_AAA,EA         1.67         1.70         1.60   
4                PYC_RT,Y11,GBAAA,EA         1.03         1.01         1.09 

I decided to try to split this column and then drop the resulting individual columns, but the last line of my code, df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\time'].str.split(',').tolist(), df[1:]).stack(), raises KeyError: 'curv_typ,maturity,bonds,geo\time'.

import os
import urllib2
import gzip
import StringIO
import pandas as pd

baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file="
filename = "data/irt_euryld_d.tsv.gz"
outFilePath = filename.split('/')[1][:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())

compressedFile.seek(0)

decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb') 

with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read())

#Now have to deal with tsv file
import csv

with open(outFilePath,'rb') as tsvin, open('ECB.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    writer = csv.writer(csvout)
    for data in tsvin:
        writer.writerow(data)


csvout = r'C:\Users\Sidney\ECB.csv'  # raw string so the backslashes are not treated as escapes
#df = pd.DataFrame.from_csv(csvout)
df = pd.read_csv(csvout, delimiter=',', encoding="utf-8-sig")
print df
df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\time'].str.split(',').tolist(), df[1:]).stack()
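As a side note, since the first field of each row is comma-separated while the rest of the file is tab-separated, pandas can split on both delimiters in one pass with a regex separator. A sketch on a made-up sample mimicking the layout above (the real file should split the same way, assuming this layout is consistent):

```python
import io
import pandas as pd

# Hypothetical sample: the first field is comma-separated, the rest tab-separated.
raw = (
    "curv_typ,maturity,bonds,geo\\time\t2015M06D16\t2015M06D15\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.24\t-0.24\n"
    "YCIF_RT,Y1,GBAAA,EA\t-0.22\t-0.23\n"
)

# A regex separator splits on both commas and tabs, so the four fused
# fields become four real columns (this requires the python engine).
df = pd.read_csv(io.StringIO(raw), sep=r"[,\t]", engine="python")
print(df.columns.tolist())
```

This avoids the intermediate rewrite of the file entirely.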

Edit: From reptilicus's Answer I used the code below:

#Now have to deal with tsv file
import csv

outFilePath = filename.split('/')[1][:-3] #As in the code above, just put here for reference
csvout = r'C:\Users\Sidney\ECB.tsv'
outfile = open(csvout, "w")
with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)
outfile.close()

df = pd.DataFrame.from_csv("ECB.tsv", sep="\t", index_col=False)

I still get the same exact output as before.
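A note on why this copy loop cannot change anything: f.read() returns the whole file as one string, so the for loop iterates over single characters, and str.replace returns a new string rather than modifying line in place, so the write is a byte-for-byte copy. A minimal illustration of the second point:

```python
line = "PYC_RT,Y1,GBAAA,EA"
line.replace(",", "\t")          # result is discarded; line is unchanged
fixed = line.replace(",", "\t")  # the return value must be captured
print(line)
print(fixed)
```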

Thank You

  • Just looks like you need to read it in differently. It looks like curv_typ, maturity, bonds, geo\time should each have their own column. Try DataFrame.from_csv() too. Commented Jun 18, 2015 at 20:34
  • @reptilicus Thank You. However, I get the same error when using df = pd.DataFrame.from_csv(csvout) instead of pd.read_csv. I am lost as to how to handle this. Commented Jun 18, 2015 at 20:42
  • Oh, I think it's the \t in geo\time perhaps; when you read it in, it might be messing up that column. Commented Jun 18, 2015 at 20:47
  • Try to edit the CSV file and replace geo\time with geo_time or something perhaps? Commented Jun 18, 2015 at 20:48
  • @reptilicus Thanks! I think that might be the way to go. After manually changing this to geo_time in the CSV file, I now get the error ValueError: Shape of passed values is (4, 180), indices imply (4, 179). Do you know why this might be? Commented Jun 18, 2015 at 21:08

2 Answers


The format of that CSV is awful: there is both comma- and tab-separated data in there.

Get rid of the commas first:

tr ',' '\t' < irt_euryld_d.tsv > test.tsv

If you can't use tr, you can just do it in Python:

outfile = open("outfile.tsv", "w")
with open("irt_euryld_d.tsv", "rb") as f:
    for line in f:  # iterate line by line, not over f.read()
        outfile.write(line.replace(",", "\t"))  # str.replace returns a new string
outfile.close()

Then you can load it up nicely in pandas:

In [9]: df = DataFrame.from_csv("test.tsv", sep="\t", index_col=False)

In [10]: df
Out[10]:
    curv_typ maturity    bonds geo\time  2015M06D17   2015M06D16   \
0     PYC_RT       Y1    GBAAA       EA        -0.23        -0.24
1     PYC_RT       Y1  GBA_AAA       EA        -0.05        -0.02
2     PYC_RT      Y10    GBAAA       EA         0.94         0.94
3     PYC_RT      Y10  GBA_AAA       EA         1.66         1.67
In [11]: df[df["curv_typ"] != "PYC_RT"]
Out[11]:
    curv_typ maturity    bonds geo\time  2015M06D17   2015M06D16   \
60   YCIF_RT       Y1    GBAAA       EA        -0.22        -0.23
61   YCIF_RT       Y1  GBA_AAA       EA         0.04         0.08
62   YCIF_RT      Y10    GBAAA       EA         2.00         1.97
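To finish what the question asks for, namely dropping both curv_typ values and the geo\time column, a boolean mask with isin plus drop should do it. A sketch on a small made-up frame (YCSR_RT is a placeholder for whatever other curv_typ values the file contains):

```python
import pandas as pd

# Hypothetical frame with the same shape as the parsed TSV above.
df = pd.DataFrame({
    "curv_typ": ["PYC_RT", "YCIF_RT", "YCSR_RT"],
    "maturity": ["Y1", "Y1", "Y10"],
    "bonds": ["GBAAA", "GBAAA", "GBAAA"],
    "geo\\time": ["EA", "EA", "EA"],
    "2015M06D16": [-0.24, -0.23, 0.10],
})

# Keep only rows whose curv_typ is neither PYC_RT nor YCIF_RT,
# then drop the geo\time column.
out = df[~df["curv_typ"].isin(["PYC_RT", "YCIF_RT"])].drop("geo\\time", axis=1)
print(out)
```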

2 Comments

Thank You. But is there a way to replace the commas with tabs in a script as I need to automate the whole process?
Thank You. I used the code you provided and only made changes to the names of the files, but I still get the Output in the exact same format as before. I've edited the question to show the code that I used. Do you know why this might be?

Maybe this code will work for this issue. I worked with a similar CSV whose data was messed up in the same way.

import pandas as pd

def parse_file(input_file_name, output_file_name):
    with open(input_file_name) as f:
        lines = f.readlines()
    lines = lines[1:]  # skip the header row
    sku_lst = []       # 'sku' is my header
    for line in lines:
        line = line.replace('"', '').replace('\n', '')
        line_splt = line.split(',')
        sku_dict = {}
        sku_dict['sku'] = line_splt[0]
        for elm in line_splt[1:-1]:
            if elm != '':
                elm_splt = elm.split('=')
                try:
                    sku_dict[elm_splt[0]] = elm_splt[1]
                except Exception as e:
                    print(elm, '\n')
                    print(line)
                    print(elm_splt, '\n')
                    print(e, '\n\n\n')
        sku_lst.append(sku_dict)
    # build and write the DataFrame once, after the loop
    output = pd.DataFrame(sku_lst).fillna('')
    output.to_csv(output_file_name)
    return output

Then let's run this function.

df=parse_file('data_mixed.csv','data_cleaned.csv')
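The heart of the function is the key=value split on each row. A self-contained sketch of just that step (the sample rows here are made up):

```python
import pandas as pd

# Hypothetical rows: an id field followed by key=value pairs and a trailing comma.
lines = ['A1,color=red,size=M,', 'A2,color=blue,,']

rows = []
for line in lines:
    parts = line.split(',')
    row = {'sku': parts[0]}        # first field is the row id
    for elm in parts[1:-1]:        # remaining fields are key=value pairs
        if elm != '':
            k, v = elm.split('=')
            row[k] = v
    rows.append(row)

out = pd.DataFrame(rows).fillna('')  # missing keys become empty strings
print(out)
```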

I haven't tried it on file types other than CSV, but if it doesn't work, let me know and I can improve it.

Comments
