I have two tables (as DataFrames) in Python. One is as follows:

Country     Year    totmigrants
Afghanistan 2000    
Afghanistan 2001    
Afghanistan 2002    
Afghanistan 2003    
Afghanistan 2004    
Afghanistan 2005    
Afghanistan 2006    
Afghanistan 2007    
Afghanistan 2008    
Algeria     2000    
Algeria     2001    
Algeria     2002
...
Zimbabwe    2008

The other is one table per year (9 separate DataFrames in total, 2000-2008):

Year=2000
---------------------------------------
Country    totmigrants  Gender  Total
Afghanistan 73           M     70
Afghanistan              F     3
Albania     11           M     5
Albania                  F     6
Algeria     52           M     44
...
Zimbabwe                 F     1

I want to join them, with the first table as the left side of an outer join. I had this in mind, but it only works for merging on columns:

new=pd.merge(table1,table2,how='left',on=['Country', 'Year'])

What I want is for the total number of migrants, F and M from each yearly DataFrame to appear as new columns in the first table, like this:

Country     Year    totmigrants F  M
Afghanistan 2000      73       3  70
Afghanistan 2001    table3
Afghanistan 2002    table4
Afghanistan 2003    ...
Afghanistan 2004    
Afghanistan 2005    
Afghanistan 2006    
Afghanistan 2007    
Afghanistan 2008    
Algeria     2000    52          8 44
Algeria     2001    table3      ...
Algeria     2002    table4      ...
...
Zimbabwe    2008     ...        ...

Is there a specific method for this merging, or what function do I need to use?

  • Before you merge, try grouping by country and summing totmigrants, or select only the rows that have an integer totmigrants value. Commented Feb 5, 2018 at 22:43
  • The first table does not have any useful data (unless you are not showing us all columns). Why do you need it at all? Commented Feb 5, 2018 at 22:43
  • @DYZ There are 10 more variables in that table; I cannot copy and paste them all. The other variables are ready, and only totmigrants, F and M are missing. I need the table for running a regression. Commented Feb 5, 2018 at 22:54

2 Answers

Here's how to combine the data from the yearly dataframes. Let's assume that the yearly dataframes somehow have been stored in a dictionary:

df = {2000: ..., 2001: ..., ..., 2008: ...}
yearly = []

for N in df.keys():
    # One row per country, one column per gender; a missing gender counts as 0
    tmp = df[N].pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    tmp['Year'] = N # Store the year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)

df = pd.concat(yearly)
print(df)
#Gender       F   M  Year  totmigrants
#Country                              
#Afghanistan  3  70  2000           73
#Albania      6   5  2000           11
#Algeria      0  44  2000           44
#Zimbabwe     1   0  2000            1

Now you can merge df with the first dataframe using ['Country','Year'] as the keys.
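As a rough sketch of that final merge: the snippet below uses small made-up stand-ins for both tables, so `table1`, `combined` and the `gdp` column are hypothetical names and values, not the asker's real data. It assumes the first table carries an empty `totmigrants` placeholder column and that the combined yearly frame has `Country` as its index, as in the printed output above.

```python
import pandas as pd

# Hypothetical stand-in for the first table; "gdp" is an invented placeholder
# for the other variables the real table contains.
table1 = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Albania"],
    "Year": [2000, 2001, 2000],
    "totmigrants": [None, None, None],
    "gdp": [1.0, 2.0, 3.0],
})

# Hypothetical stand-in for the concatenated yearly result (Country is the index).
combined = pd.DataFrame(
    {"F": [3, 6], "M": [70, 5], "Year": [2000, 2000], "totmigrants": [73, 11]},
    index=pd.Index(["Afghanistan", "Albania"], name="Country"),
)

# Drop the empty placeholder column so the merge does not produce
# totmigrants_x / totmigrants_y, and turn the Country index into a column.
merged = table1.drop(columns="totmigrants").merge(
    combined.reset_index(), how="left", on=["Country", "Year"]
)
print(merged)
```

With `how="left"`, every row of the first table is kept, and country/year pairs without yearly data get NaN in `F`, `M` and `totmigrants`. The `Year` values need the same dtype on both sides (both integers or both strings), otherwise the keys will not match.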

3 Comments

This was helpful. When converting the data to a DataFrame, I ran into this problem: in some of the tables the gender rows are merged, e.g. Afghanistan 108.0 M\nF 98\n10 Albania 18.0 M\nF 10\n8
How can I separate them into two distinct rows before converting them into a DataFrame?
This is a different question and requires its own code and sample data.

I am not sure you need the first table. I did the following; I hope it helps.

import numpy as np
import pandas as pd

data2000 = np.array([['','Country','totmigrants','Gender', 'Total'],
['1','Afghanistan', 73, 'M', 70],
['2','Afghanistan', None, 'F', 3],
['3','Albania', 11, 'M', 5],
['4','Albania', None ,'F', 6]])

data2001 = np.array([['','Country','totmigrants','Gender', 'Total'],
['1','Afghanistan', 75, 'M', 60],
['2','Afghanistan', None, 'F', 15],
['3','Albania', 15, 'M', 11],
['4','Albania', None ,'F', 4]])

# and so on
datas = {'2000':data2000, '2001':data2001}
reg_dfs = []
for year,data in datas.items():
    df = pd.DataFrame(data=data[1:,1:],
              index=data[1:,0],
              columns=data[0,1:])

    new=pd.merge(df,df,how='inner',on=['Country'])
    reg_df = new.query('Gender_x == "M" & Gender_y == "F"' )[['Country', 'Total_x', 'Total_y', 'totmigrants_x']]
    reg_df.columns = ['Country', 'M', 'F', 'Total']
    reg_df['Year'] = year
    reg_dfs.append(reg_df)

print(pd.concat(reg_dfs).sort_values(['Country', 'Year']))

#       Country   M   F Total  Year
#1  Afghanistan  70   3    73  2000
#1  Afghanistan  60  15    75  2001
#5      Albania   5   6    11  2000
#5      Albania  11   4    15  2001
