I have a multiindex csv with the following format:

 ; ;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017
CO2;;;;;;;;;;;;;;;;;;;
010000 Agriculture and horticulture;AZZ;2312;2249;2165;2102;2034;2095;2106;2067;2060;1935;1985;1983;1893;1865;1750;1728;1777;1736
020000 Forestry;AZZ;40;42;39;43;46;50;49;49;46;52;62;62;67;60;63;66;67;66
030000 Fishing;AZZ;785;767;746;722;645;655;629;580;501;485;472;441;351;384;352;382;387;377
 ; ;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017
More CO2;;;;;;;;;;;;;;;;;;;
010000 Agriculture and horticulture;AZZ;2312;2249;2165;2102;2034;2095;2106;2067;2060;1935;1985;1983;1893;1865;1750;1728;1777;1736
020000 Forestry;AZZ;40;42;39;43;46;50;49;49;46;52;62;62;67;60;63;66;67;66
030000 Fishing;AZZ;785;767;746;722;645;655;629;580;501;485;472;441;351;384;352;382;387;377

So both levels of the MultiIndex are actually on the same column.

I am trying to import it as follows:

df=pd.read_csv('my.csv',sep=";",header=[0],index_col=[0])

But this returns the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 24: invalid start byte

I am not sure what position 24 refers to or how to proceed to import the file.

Here is a link to the file: https://wetransfer.com/downloads/338c3aa2ef68052b45d29c509d5bf82120191009073413/88bc558e72adc48e8683d8af2792d51d20191009073413/81d59b

Desired Output

                                                        2000    2001    2002    2003    ...

CO2         010000 Agriculture and horticulture   AZZ  2312.0  2249.0  2165.0  2102.0   ...
            020000 Forestry                       AZZ    40.0    42.0    39.0    43.0   ...
            030000 Fishing                        AZZ   785.0   767.0   746.0   722.0   ... 
            060000 Extraction of oil and gas      BZ1  2174.0  2190.0  2184.0  2188.0   ... 
            080090 Extraction of gravel and stone BZ2   295.0   332.0   304.0   277.0   ...

                                                       2000    2001    2002    2003     ...

More CO2    010000 Agriculture and horticulture   AZZ  2312.0  2249.0  2165.0  2102.0   ...
            020000 Forestry                       AZZ    40.0    42.0    39.0    43.0   ...
            030000 Fishing                        AZZ   785.0   767.0   746.0   722.0   ... 
            060000 Extraction of oil and gas      BZ1  2174.0  2190.0  2184.0  2188.0   ... 
            080090 Extraction of gravel and stone BZ2   295.0   332.0   304.0   277.0   ... 
  • Debugging the file from pasted text is not easy. Could you upload your file (a few rows that still trigger the error) to gdocs, Dropbox, WeTransfer or similar and share the link? Commented Oct 9, 2019 at 7:27
  • @jezrael here it is: wetransfer.com/downloads/… Commented Oct 9, 2019 at 7:35

2 Answers

You can read the file by specifying the gbk encoding:

df=pd.read_csv('./AirEmissions117.csv',sep=';',encoding='gbk')
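
As for what "position 24" in the error means: it is a byte offset into the data being decoded (possibly a chunk of the file rather than the whole file). A minimal sketch, using made-up bytes rather than the real file, that reproduces the same kind of error and shows how a single-byte encoding reads the same byte:

```python
# Hypothetical bytes: 24 ASCII characters followed by 0xb5, mimicking the traceback.
line = b"Emissions of particles, " + b"\xb5g"

try:
    line.decode("utf-8")
    offset = None
except UnicodeDecodeError as exc:
    offset = exc.start  # byte offset of the byte that could not be decoded
    print(exc)

# ISO-8859-1 maps every byte to exactly one character, so decoding cannot fail.
latin = line.decode("iso-8859-1")
print(offset, latin)
```

In ISO-8859-1 the byte 0xb5 is the micro sign, which would be plausible in a unit label. A multi-byte codec such as gbk may instead combine 0xb5 with the byte that follows it, so it is worth inspecting the decoded labels after reading the file with either encoding.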

For me it works after setting the encoding; some further processing is then necessary:

import pandas as pd

df = pd.read_csv('AirEmissions117.csv',
                 sep=";",
                 encoding="ISO-8859-1",
                 )

# flag rows whose last 5 columns contain only NaN (the group-label rows)
m = df.iloc[:, -5:].isna().all(axis=1)
# insert a new first column holding the group label, forward-filled downwards
df.insert(0, 'type', df.iloc[:, 0].where(m).ffill())
# drop the group-label rows and build the MultiIndex from the first 3 columns
df = df[~m].set_index(df.columns[:3].tolist())
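
The same idea can be tried without the real file. The following is a sketch only, run on a shortened copy of the sample shown in the question (three year columns, two sectors per group). It assumes the file is read with header=None so that the repeated year rows can be detected and dropped; the column names sector and code are invented for illustration and do not come from the file.

```python
import io

import pandas as pd

# Shortened copy of the sample from the question (not the real file).
sample = """ ; ;2000;2001;2002
CO2;;;;
010000 Agriculture and horticulture;AZZ;2312;2249;2165
020000 Forestry;AZZ;40;42;39
 ; ;2000;2001;2002
More CO2;;;;
010000 Agriculture and horticulture;AZZ;2312;2249;2165
020000 Forestry;AZZ;40;42;39
"""

# Read everything as text so the repeated year rows do not disturb parsing.
raw = pd.read_csv(io.StringIO(sample), sep=";", header=None, dtype=str)

# Take the year labels from the first row; "sector" and "code" are assumed names.
years = raw.iloc[0, 2:].tolist()
raw.columns = ["sector", "code"] + years

# Repeated header rows have a blank first field.
is_header = raw["sector"].str.strip().eq("")
# Group-label rows ("CO2", "More CO2") have no values in any year column.
is_group = raw[years].isna().all(axis=1)

# Carry each group label down to the data rows that follow it.
raw["type"] = raw["sector"].where(is_group).ffill()

out = (
    raw[~is_header & ~is_group]
    .set_index(["type", "sector", "code"])
    .astype(float)
)
print(out)
```

Here both index levels that share the first column end up as separate levels (type and sector), with code as a third level. The year columns keep their string labels; they could be converted with out.columns.astype(int) if integer labels are preferred.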

1 Comment

Hi, it imports it but the multiindex problem is unsolved. The row you skip is actually the first level of the multiindex. The issue is that both the first level and the second level are on the same column [0]
