
I am trying to drop some rows from my pandas DataFrame df. It has 180 rows and 2745 columns and looks like this. I want to get rid of the rows whose curv_typ is PYC_RT or YCIF_RT, and I also want to get rid of the geo\time column. I am extracting this data from a CSV file, and I realized that curv_typ,maturity,bonds,geo\time and the values below it, such as PYC_RT,Y1,GBAAA,EA, all end up in one column:

 curv_typ,maturity,bonds,geo\time  2015M06D16   2015M06D15   2015M06D11   \
0                 PYC_RT,Y1,GBAAA,EA        -0.24        -0.24        -0.24   
1               PYC_RT,Y1,GBA_AAA,EA        -0.02        -0.03        -0.10   
2                PYC_RT,Y10,GBAAA,EA         0.94         0.92         0.99   
3              PYC_RT,Y10,GBA_AAA,EA         1.67         1.70         1.60   
4                PYC_RT,Y11,GBAAA,EA         1.03         1.01         1.09 

I decided to try to split this column and then drop the resulting individual columns, but the last line of my code, df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\time'].str.split(',').tolist(), df[1:]).stack(), raises KeyError: 'curv_typ,maturity,bonds,geo\time'.

import os
import urllib2
import gzip
import StringIO
import pandas as pd

baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file="
filename = "data/irt_euryld_d.tsv.gz"
outFilePath = filename.split('/')[1][:-3]

response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())

compressedFile.seek(0)

decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb') 

with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read())

#Now have to deal with tsv file
import csv

with open(outFilePath,'rb') as tsvin, open('ECB.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    writer = csv.writer(csvout)
    for data in tsvin:
        writer.writerow(data)


csvout = r'C:\Users\Sidney\ECB.csv'  # raw string so the backslashes are not treated as escapes
#df = pd.DataFrame.from_csv(csvout)
df = pd.read_csv(csvout, delimiter=',', encoding="utf-8-sig")
print df
df_new = pd.DataFrame(df['curv_typ,maturity,bonds,geo\time'].str.split(',').tolist(), df[1:]).stack()
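As a side note, since the first field of each row is comma-separated while the rest of the file is tab-separated, pandas can split on both delimiters in one pass with a regex separator. A sketch on a made-up sample mimicking the layout above (the real file should split the same way, assuming this layout is consistent):

```python
import io
import pandas as pd

# Hypothetical sample: the first field is comma-separated, the rest tab-separated.
raw = (
    "curv_typ,maturity,bonds,geo\\time\t2015M06D16\t2015M06D15\n"
    "PYC_RT,Y1,GBAAA,EA\t-0.24\t-0.24\n"
    "YCIF_RT,Y1,GBAAA,EA\t-0.22\t-0.23\n"
)

# A regex separator splits on both commas and tabs, so the four fused
# fields become four real columns (this requires the python engine).
df = pd.read_csv(io.StringIO(raw), sep=r"[,\t]", engine="python")
print(df.columns.tolist())
```

This avoids the intermediate rewrite of the file entirely.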

Edit: From reptilicus's Answer I used the code below:

#Now have to deal with tsv file
import csv

outFilePath = filename.split('/')[1][:-3] #As in the code above, just put here for reference
csvout = r'C:\Users\Sidney\ECB.tsv'
outfile = open(csvout, "w")
with open(outFilePath, "rb") as f:
    for line in f.read():
        line.replace(",", "\t")
        outfile.write(line)
outfile.close()

df = pd.DataFrame.from_csv("ECB.tsv", sep="\t", index_col=False)

I still get the same exact output as before.
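A note on why this copy loop cannot change anything: f.read() returns the whole file as one string, so the for loop iterates over single characters, and str.replace returns a new string rather than modifying line in place, so the write is a byte-for-byte copy. A minimal illustration of the second point:

```python
line = "PYC_RT,Y1,GBAAA,EA"
line.replace(",", "\t")          # result is discarded; line is unchanged
fixed = line.replace(",", "\t")  # the return value must be captured
print(line)
print(fixed)
```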

Thank You

  • Just looks like you need to read it in differently. It looks like curv_typ, maturity, bonds, geo\time should each have their own column. Try DataFrame.from_csv() too. Commented Jun 18, 2015 at 20:34
  • @reptilicus Thank You. However, I get the same error when using df = pd.DataFrame.from_csv(csvout) instead of pd.read_csv. I am lost as to how to handle this. Commented Jun 18, 2015 at 20:42
  • Oh, I think it's the \t in geo\time perhaps; when you read it in, it might be messing up that column. Commented Jun 18, 2015 at 20:47
  • Try to edit the CSV file and replace geo\time with geo_time or something perhaps? Commented Jun 18, 2015 at 20:48
  • @reptilicus Thanks! I think that might be the way to go. After manually changing this to geo_time in the CSV file, I now get the error ValueError: Shape of passed values is (4, 180), indices imply (4, 179). Do you know why this might be? Commented Jun 18, 2015 at 21:08

2 Answers


The format of that CSV is awful: there is both comma- and tab-separated data in there.

Get rid of the commas first:

tr ',' '\t' < irt_euryld_d.tsv > test.tsv

If you can't use tr, you can just do it in Python:

outfile = open("outfile.tsv", "w")
with open("irt_euryld_d.tsv", "rb") as f:
    for line in f:  # iterate line by line, not over f.read()
        outfile.write(line.replace(",", "\t"))  # str.replace returns a new string
outfile.close()

Then you can load it up nicely in pandas:

In [9]: df = DataFrame.from_csv("test.tsv", sep="\t", index_col=False)

In [10]: df
Out[10]:
    curv_typ maturity    bonds geo\time  2015M06D17   2015M06D16   \
0     PYC_RT       Y1    GBAAA       EA        -0.23        -0.24
1     PYC_RT       Y1  GBA_AAA       EA        -0.05        -0.02
2     PYC_RT      Y10    GBAAA       EA         0.94         0.94
3     PYC_RT      Y10  GBA_AAA       EA         1.66         1.67
In [11]: df[df["curv_typ"] != "PYC_RT"]
Out[11]:
    curv_typ maturity    bonds geo\time  2015M06D17   2015M06D16   \
60   YCIF_RT       Y1    GBAAA       EA        -0.22        -0.23
61   YCIF_RT       Y1  GBA_AAA       EA         0.04         0.08
62   YCIF_RT      Y10    GBAAA       EA         2.00         1.97
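To finish what the question asks for, namely dropping both curv_typ values and the geo\time column, a boolean mask with isin plus drop should do it. A sketch on a small made-up frame (YCSR_RT is a placeholder for whatever other curv_typ values the file contains):

```python
import pandas as pd

# Hypothetical frame with the same shape as the parsed TSV above.
df = pd.DataFrame({
    "curv_typ": ["PYC_RT", "YCIF_RT", "YCSR_RT"],
    "maturity": ["Y1", "Y1", "Y10"],
    "bonds": ["GBAAA", "GBAAA", "GBAAA"],
    "geo\\time": ["EA", "EA", "EA"],
    "2015M06D16": [-0.24, -0.23, 0.10],
})

# Keep only rows whose curv_typ is neither PYC_RT nor YCIF_RT,
# then drop the geo\time column.
out = df[~df["curv_typ"].isin(["PYC_RT", "YCIF_RT"])].drop("geo\\time", axis=1)
print(out)
```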

2 Comments

Thank You. But is there a way to replace the commas with tabs in a script as I need to automate the whole process?
Thank You. I used the code you provided and only made changes to the names of the files, but I still get the Output in the exact same format as before. I've edited the question to show the code that I used. Do you know why this might be?

Maybe this code will work for this issue. I worked with a similar CSV whose data was messed up in the same way.

import pandas as pd

def parse_file(input_file_name, output_file_name):
    with open(input_file_name) as f:
        lines = f.readlines()
    lines = lines[1:]  # skip the header row
    sku_lst = []       # 'sku' is my header
    for line in lines:
        line = line.replace('"', '').replace('\n', '')
        line_splt = line.split(',')
        sku_dict = {}
        sku_dict['sku'] = line_splt[0]
        for elm in line_splt[1:-1]:
            if elm != '':
                elm_splt = elm.split('=')
                try:
                    sku_dict[elm_splt[0]] = elm_splt[1]
                except Exception as e:
                    print(elm, '\n')
                    print(line)
                    print(elm_splt, '\n')
                    print(e, '\n\n\n')
        sku_lst.append(sku_dict)
    # build and write the DataFrame once, after the loop
    output = pd.DataFrame(sku_lst).fillna('')
    output.to_csv(output_file_name)
    return output

Then let's run this function.

df=parse_file('data_mixed.csv','data_cleaned.csv')
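The heart of the function is the key=value split on each row. A self-contained sketch of just that step (the sample rows here are made up):

```python
import pandas as pd

# Hypothetical rows: an id field followed by key=value pairs and a trailing comma.
lines = ['A1,color=red,size=M,', 'A2,color=blue,,']

rows = []
for line in lines:
    parts = line.split(',')
    row = {'sku': parts[0]}        # first field is the row id
    for elm in parts[1:-1]:        # remaining fields are key=value pairs
        if elm != '':
            k, v = elm.split('=')
            row[k] = v
    rows.append(row)

out = pd.DataFrame(rows).fillna('')  # missing keys become empty strings
print(out)
```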

I haven't tried it on file types other than CSV, but if it doesn't work, let me know and I can improve it.

Comments
