Create DataFrame using lists with missing column data

Question

CONTEXT

I am trying to create a DataFrame and fill out columns in that DataFrame based on whether or not the inserted lists have those columns.

Example Data:
Name    Height   Hair Color   Eye Color
Bob     72       Blonde       Blue
George  64                    Green
John             Brown        Brown

The columns in the DataFrame would cover all the variables I want recorded, but if a person does not have information for every column, I'd like to fill in what I can.

Sample Data / Code

import pandas as pd

name = ['Name', 'Bob']      # each list holds the column name, then the value
height = ['Height', '72']   # possible to search for height[0] in columns and place height[1] there?
eye_color = ['Eye Color', 'Brown']

person = [name, height, eye_color]
columns = ['Name', 'Height', 'Hair Color', 'Eye Color'] 

df = pd.DataFrame(person, columns = columns)

Expected Outcome

Name    Height   Hair Color   Eye Color
Bob     72                    Brown

PROBLEM

I want to be able to pass a person through, fill in the columns for which that person has data, and leave the other columns blank. I also want to append more people to the DataFrame in the same fashion. Is this possible?

Please let me know if any additional details would help in answering this question!

I should have asked: are all of the data (all the various rows) in one structure, or do they get added to the DataFrame one at a time?
– wwii, Commented Oct 7, 2020 at 23:55
@wwii All the data is contained in an object. It is slightly more complex than the data provided: in this case I have a person_list of "person" objects, where person = [name, [variable_list]]. Each one contains the person's name and the variable name/value pairs. Ideally, I guess I would use a for loop to go through each person and append to the DataFrame. Let me know if I need to clarify anything more, please! Thanks
– KL_, Commented Oct 8, 2020 at 0:02

wwii · Accepted Answer · 2020-10-08 14:08:16Z

1

You can make an empty DataFrame and just specify the columns.

In [21]: df = pd.DataFrame(columns=['name','a','b','c'])

In [22]: df
Out[22]: 
Empty DataFrame
Columns: [name, a, b, c]
Index: []

Then you can append

In [23]: df = df.append({'name':'bob','c':0},ignore_index=True)

In [24]: df
Out[24]: 
  name    a    b  c
0  bob  NaN  NaN  0

In [25]: df = df.append({'name':'geo','b':'foo'},ignore_index=True)

In [26]: df
Out[26]: 
  name    a    b    c
0  bob  NaN  NaN    0
1  geo  NaN  foo  NaN

Multiple rows:

In [32]: more = [{'name':'qq','b':'apples'},
                 {'name':'wildbill','a':'nickels'},
                 {'name':'lastone','b':'potatoes','c':16}]

In [33]: df = df.append(more,ignore_index=True)

In [34]: df
Out[34]: 
       name        a         b    c
0       bob      NaN       NaN    0
1       geo      NaN       foo  NaN
2        qq      NaN    apples  NaN
3  wildbill  nickels       NaN  NaN
4   lastone      NaN  potatoes   16
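
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append calls in this session fail on current versions. A minimal sketch of the same steps with pd.concat, assuming pandas 2.x (the data mirrors the session above, and building the first rows in a single call is my own shortcut, not part of the original answer):

```python
import pandas as pd

columns = ['name', 'a', 'b', 'c']

# Start from the first rows; columns= fixes the column set and order.
df = pd.DataFrame([{'name': 'bob', 'c': 0},
                   {'name': 'geo', 'b': 'foo'}], columns=columns)

more = [{'name': 'qq', 'b': 'apples'},
        {'name': 'wildbill', 'a': 'nickels'},
        {'name': 'lastone', 'b': 'potatoes', 'c': 16}]

# pd.concat aligns the frames on column names; cells with no value become NaN.
df = pd.concat([df, pd.DataFrame(more)], ignore_index=True)
print(df)
```

pd.concat returns a new frame each time, so when many rows arrive it is usually cheaper to collect the dicts in a list and concatenate once.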

Or, if you can ensure that the dictionaries together cover all the columns, pass the list straight to the constructor:

In [36]: more
Out[36]: 
[{'b': 'apples', 'name': 'qq'},
 {'a': 'nickels', 'name': 'wildbill'},
 {'b': 'potatoes', 'c': 16, 'name': 'lastone'}]

In [37]: pd.DataFrame(more)
Out[37]: 
         a         b     c      name
0      NaN    apples   NaN        qq
1  nickels       NaN   NaN  wildbill
2      NaN  potatoes  16.0   lastone
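
If the dictionaries might not cover every column, the constructor's columns argument can be passed alongside the list. A small sketch, assuming the same 'name', 'a', 'b', 'c' columns as above; here no row supplies 'c', yet the column still appears, filled with NaN, and the columns come out in the requested order:

```python
import pandas as pd

more = [{'name': 'qq', 'b': 'apples'},
        {'name': 'wildbill', 'a': 'nickels'},
        {'name': 'lastone', 'b': 'potatoes'}]   # no row has a 'c' key

# Without columns=, the result would have no 'c' column at all.
df = pd.DataFrame(more, columns=['name', 'a', 'b', 'c'])
print(df)
```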

It looks like the DataFrame constructor will also consume a generator.

In [3]: more
Out[3]: 
[{'b': 'apples', 'name': 'qq'},
 {'a': 'nickels', 'name': 'wildbill'},
 {'b': 'potatoes', 'c': 16, 'name': 'lastone'}]

In [4]: def f():
   ...:     for d in more:
   ...:         yield d
   ...:         

In [5]: pd.DataFrame(f())
Out[5]: 
         a         b     c      name
0      NaN    apples   NaN        qq
1  nickels       NaN   NaN  wildbill
2      NaN  potatoes  16.0   lastone

There is probably a better way.

edited Oct 8, 2020 at 14:08

answered Oct 8, 2020 at 0:04

wwii

3 Comments

KL_ Over a year ago

Is it possible to use the column name associated with the variable (in this case, height[0]) as the input for the variable name? I'd like to loop through a list of Person objects and dynamically fill in the columns based on what columns that person may have, something like a for loop over the variables in the list inside that append call.

KL_ Over a year ago

I guess the better question is if I converted to a dictionary of key,value pairs, could I just replace everything between the { } with the dictionary?

wwii Over a year ago

@Yahtzee - see my last edit - it appends three dictionaries at once.

noah · Accepted Answer · 2020-10-08 00:21:00Z

Are you open to rethinking what a person object is? If so, you should consider using a dict for each person, as below. It makes your life much easier.

import pandas as pd

columns = ['Name', 'Height', 'Hair Color', 'Eye Color'] 
df = pd.DataFrame(columns = columns)

person = {'Name':['Bob'], 'Height':['72'], 'Eye Color': ['Brown']}
person2 = {'Name':['Sue'], 'Height':['48'], 'Eye Color': ['Blue'], 'Hair Color': ['Blonde']}
person3 = {'Name':['Hank'], 'Height':['74'], 'Hair Color': ['Black']}

# add persons... could loop through
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
df = pd.concat([df, pd.DataFrame(person)])
df = pd.concat([df, pd.DataFrame(person2)])
df = pd.concat([df, pd.DataFrame(person3)])
print(df)

   Name Height Hair Color Eye Color
0   Bob     72        NaN     Brown
0   Sue     48     Blonde      Blue
0  Hank     74      Black       NaN

If you don't want to change person you can also just make a simple function to convert it:

def person_to_dict(person):
    person_dict = {}
    for attr in person:
        person_dict[attr[0]]=[attr[1]]
    return person_dict
person = person_to_dict(person)
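
Because each attribute in the question's format is already a [column, value] pair, the built-in dict() does a similar conversion, with plain values rather than one-item lists. A list of such dicts can then go to the constructor in one call. A sketch, assuming a hypothetical people list that holds several persons in the question's format:

```python
import pandas as pd

columns = ['Name', 'Height', 'Hair Color', 'Eye Color']

# Hypothetical sample data: each person is a list of [column, value] pairs.
people = [
    [['Name', 'Bob'], ['Height', '72'], ['Eye Color', 'Brown']],
    [['Name', 'George'], ['Height', '64'], ['Eye Color', 'Green']],
    [['Name', 'John'], ['Hair Color', 'Brown'], ['Eye Color', 'Brown']],
]

# dict() turns a list of two-item pairs into {column: value}.
rows = [dict(person) for person in people]

# Columns a person does not supply are left as NaN.
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Plain values work here because every dict is one row; the one-item lists are only needed when a single dict is passed to pd.DataFrame on its own.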

David Erickson · Accepted Answer · 2020-10-08 00:43:16Z

0

Here is a dynamic list comprehension method using the lists you have created in this example:

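import pandas as pd

name = ['Name', 'Bob']
height = ['Height', '72']
eye_color = ['Eye Color', 'Brown']

person = [name, height, eye_color]
columns = ['Name', 'Height', 'Hair Color', 'Eye Color'] 

# one single-key dict per attribute, kept only if its key is a known column
# (iterate over `columns`; `df` does not exist yet at this point)
df = pd.DataFrame([{i:j} for (i,j) in zip([name[0], height[0], eye_color[0]],
                                          [name[1], height[1], eye_color[1]])
                         for col in columns if i == col], columns=columns)
# collapse the staggered rows into one by dropping the NaNs in each column
df = df.apply(lambda x: pd.Series(x.dropna().values))
df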

    Name    Height  Hair Color  Eye Color
0    Bob        72         NaN      Brown

answered Oct 8, 2020 at 0:43

David Erickson