join missing rows from another table columns in Python

Question

I have two tables (as DataFrames) in Python. One is as follows:

Country     Year    totmigrants
Afghanistan 2000    
Afghanistan 2001    
Afghanistan 2002    
Afghanistan 2003    
Afghanistan 2004    
Afghanistan 2005    
Afghanistan 2006    
Afghanistan 2007    
Afghanistan 2008    
Algeria     2000    
Algeria     2001    
Algeria     2002
...
Zimbabwe    2008

The other is one table per year (9 separate DataFrames in total, 2000-2008):

Year=2000
---------------------------------------
Country    totmigrants  Gender  Total
Afghanistan 73           M     70
Afghanistan              F     3
Albania     11           M     5
Albania                  F     6
Algeria     52           M     44
...
Zimbabwe                 F     1

I want to join them, keeping every row of the first table (a left outer join). I had this in mind, but it only merges on columns:

new=pd.merge(table1,table2,how='left',on=['Country', 'Year'])

What I want is for the total number of migrants and the F and M counts from each yearly DataFrame to appear as new columns in the first table, like this:

Country     Year    totmigrants F  M
Afghanistan 2000      73       3  70
Afghanistan 2001    table3
Afghanistan 2002    table4
Afghanistan 2003    ...
Afghanistan 2004    
Afghanistan 2005    
Afghanistan 2006    
Afghanistan 2007    
Afghanistan 2008    
Algeria     2000    52          8 44
Algeria     2001    table3      ...
Algeria     2002    table4      ...
...
Zimbabwe    2008     ...        ...

Is there a specific method for this merging, or what function do I need to use?

Before you merge, try groupby country and sum totalmigrants, or only select rows which have a totalmigrant integer value.
– ConorSheehan1, Commented Feb 5, 2018 at 22:43
The first table does not have any useful data (unless you are not showing us all columns). Why do you need it at all?
– DYZ, Commented Feb 5, 2018 at 22:43
@DYZ There are 10 more variables in that table. I cannot copy and paste them all. The other variables are ready; only totmigrants, F and M are missing. I need it for running a regression.
– Said Akbar, Commented Feb 5, 2018 at 22:54

DYZ · Accepted Answer · 2018-02-05 22:55:40Z


Here's how to combine the data from the yearly dataframes. Let's assume that the yearly dataframes have been stored in a dictionary keyed by year:

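import pandas as pd

# Placeholder: each year maps to its yearly dataframe
df = {2000: ..., 2001: ..., ..., 2008: ...}
yearly = []

for N in df.keys():
    # Reshape so that each gender becomes its own column, indexed by country
    # (recent pandas versions require keyword arguments for pivot)
    tmp = df[N].pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    tmp['Year'] = N # Store the year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)

# Stack the yearly results into one dataframe
df = pd.concat(yearly)
print(df)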
#Gender       F   M  Year  totmigrants
#Country                              
#Afghanistan  3  70  2000           73
#Albania      6   5  2000           11
#Algeria      0  44  2000           44
#Zimbabwe     1   0  2000            1

Now you can merge df with the first dataframe using ['Country','Year'] as the keys.
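
The merge step described above could be sketched as follows. This is only an illustration under stated assumptions: `table1` and `combined` are small hypothetical stand-ins for the first table and for the concatenated yearly data, and the empty `totmigrants` placeholder in the first table is dropped so the merged column does not collide with it.

```python
import pandas as pd

# Hypothetical stand-in for the first table (the real one has more columns).
table1 = pd.DataFrame({
    'Country': ['Afghanistan', 'Afghanistan', 'Albania'],
    'Year': [2000, 2001, 2000],
    'totmigrants': [None, None, None],  # empty placeholder column
})

# Hypothetical stand-in for the combined yearly data, indexed by Country.
combined = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania'],
    'F': [3, 6],
    'M': [70, 5],
    'Year': [2000, 2000],
    'totmigrants': [73, 11],
}).set_index('Country')

# Drop the empty placeholder, then left-join so every row of table1 is kept.
merged = (
    table1.drop(columns='totmigrants')
          .merge(combined.reset_index(), how='left', on=['Country', 'Year'])
)
print(merged)
```

Rows of the first table with no match in the yearly data (Afghanistan 2001 in this sample) end up with NaN in the new columns, which also turns those columns into floats.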


3 Comments

Said Akbar Over a year ago

this was helpful. When converting data to dataframe, I am encountering this problem: in some of tables gender rows are merged. e.g. Afghanistan 108.0 M\nF 98\n10 Albania 18.0 M\nF 10\n8

Said Akbar Over a year ago

how can I separate them into two distinct rows before I can convert them into Dataframe?

DYZ Over a year ago

This is a different question and requires its own code and sample data.
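
For readers who hit the same merged-rows problem, one possible approach is to split the cells on the newline and expand them into separate rows. The sketch below is a guess, not a tested answer to that follow-up: the table layout in `raw` is inferred from the fragment quoted in the comment, and the multi-column form of `DataFrame.explode` used here needs pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical raw table; the layout is guessed from the comment above.
raw = pd.DataFrame({
    'Country': ['Afghanistan', 'Albania'],
    'totmigrants': [108.0, 18.0],
    'Gender': ['M\nF', 'M\nF'],
    'Total': ['98\n10', '10\n8'],
})

# Turn each merged cell into a list, then give every list element its own row.
split = raw.assign(
    Gender=raw['Gender'].str.split('\n'),
    Total=raw['Total'].str.split('\n'),
).explode(['Gender', 'Total'], ignore_index=True)

# The exploded totals are still strings; convert them to integers.
split['Total'] = split['Total'].astype(int)
print(split)
```

Both columns must hold lists of the same length in every row for the paired explode to work, so real data may need cleaning first.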

gtato · Accepted Answer · 2018-02-06 00:35:13Z

I am not sure you need the first table. I did the following; I hope it helps.

import numpy as np
import pandas as pd

data2000 = np.array([['','Country','totmigrants','Gender', 'Total'],
['1','Afghanistan', 73, 'M', 70],
['2','Afghanistan', None, 'F', 3],
['3','Albania', 11, 'M', 5],
['4','Albania', None ,'F', 6]])

data2001 = np.array([['','Country','totmigrants','Gender', 'Total'],
['1','Afghanistan', 75, 'M', 60],
['2','Afghanistan', None, 'F', 15],
['3','Albania', 15, 'M', 11],
['4','Albania', None ,'F', 4]])

# and so on
datas = {'2000':data2000, '2001':data2001}
reg_dfs = []
for year,data in datas.items():
    df = pd.DataFrame(data=data[1:,1:],
              index=data[1:,0],
              columns=data[0,1:])

    new=pd.merge(df,df,how='inner',on=['Country'])
    reg_df = new.query('Gender_x == "M" & Gender_y == "F"' )[['Country', 'Total_x', 'Total_y', 'totmigrants_x']]
    reg_df.columns = ['Country', 'M', 'F', 'Total']
    reg_df['Year'] = year
    reg_dfs.append(reg_df)

# DataFrame.sort was removed from pandas; sort_values is its replacement
print(pd.concat(reg_dfs).sort_values(['Country']))

#       Country   M   F Total  Year
#1  Afghanistan  70   3    73  2000
#1  Afghanistan  60  15    75  2001
#5      Albania   5   6    11  2000
#5      Albania  11   4    15  2001
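
If this result is later merged into the first table, note that `Year` holds strings here because the dictionary keys are strings. The sketch below assumes the first table stores years as integers, which the question does not confirm; `table1` and `result` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical first table with integer years.
table1 = pd.DataFrame({'Country': ['Afghanistan', 'Albania'], 'Year': [2000, 2001]})

# Hypothetical result shaped like the output above; Year holds strings.
result = pd.DataFrame({
    'Country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'M': [70, 60, 5, 11],
    'F': [3, 15, 6, 4],
    'Total': [73, 75, 11, 15],
    'Year': ['2000', '2001', '2000', '2001'],
})

# Align the key dtype with table1 before merging on it.
result['Year'] = result['Year'].astype(int)
merged = table1.merge(result, how='left', on=['Country', 'Year'])
print(merged)
```

Without the cast, pandas refuses to merge a string key against an integer key, so converting one side first avoids that error.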
