0

I am in need of help!

I have an Excel file that contains data I am trying to get into a dataframe, but the data is in a table form that is not easy to work with. Example:

enter image description here

I want to ultimately get it into a dataframe of this form:

Meal               Food                              Calories
Breakfast          English Muffins                   120
Breakfast          Peanut Butter Spread              190
Morning Snack      Banana                            90
Morning Snack      Nectarine                         59
...                ...                               ...

And a separate dataframe for the Daily Totals in this form (ignore the 'Date' column for now):

Date          Calories     Fat     Fiber     Carbs     Sodium     Protein     Water
2017-07-01    699          26      16        93        612        32          0
2017-07-02    ...          ...     ...       ...       ...        ...         ...

I am struggling with getting this into a dataframe. Looking at the screenshot of the dataset, it made sense to first store the data into a dictionary, but that leaves me with a bunch of NaN values, due to all the blank cells.

My thoughts on getting the 'Meal' column the way I want it to look is to do a forward fill, but that would mean I would have to use a Series or Dataframe, which I haven't gotten to yet.

This is what I currently have:

df = pd.read_excel('filename.xls', 'Foods')

# create a list to store the dictionaries
food_logs = []

# this is code to reformat the string values in a certain column 
# to get the name of the sheets I need to use in the Excel. This can be ignored
for day in df.values:
    if day[1] != '0':
        foodLogSheetName = 'Food Log ' + day[0].replace('-', '')
        food_logs.append(foodLogSheetName)

# 'foods' is now a list of nested dictionaries (think of everything in the 
# first screenshot as the outer dictionary, and each of the column as the 
# inner dictionary)
foods = [xls.parse(food_log).to_dict() for food_log in food_logs]

This is what 'foods' currently is, if I printed it out with a line between each of the outer dictionaries: enter image description here

I have the option of working with a CSV file instead, but instead of multiple sheets, I would have multiple 'tables' stacked vertically, if that makes sense

I would greatly appreciate any tips anyone can provide, please!

1 Answer 1

1

I think you're on the right track with ffill-ing data. It sounds like you're might just have trouble with working with missing data. From the sample you posted, it looks like you can read the whole thing into a dataframe, drop all empty rows, ffill on the meal column, and then drop any rows that are partially empty (or on a subset.)

import pandas as pd

df = pd.read_excel(file_path_or_buffer, sheet_name=my_sheet_name, **other_kwargs)
# You should have a dataframe that looks like
# Meal               Food                              Calories
# Breakfast          
#                    English Muffins                   120
#                    Peanut Butter Spread              190
# ...
# Next drop totally NaN/empty rows
df.dropna(how='all', inplace=True)
df['Meal'] = df['Meal'].fillna(method='ffill')
# Now you should have something that looks like
# Meal               Food                              Calories
# Breakfast          
# Breakfast          English Muffins                   120
# Breakfast          Peanut Butter Spread              190
# ...
# Drop empty rows, if you need to allow for some sparse data, use the subset argument
df.dropna(how='any', inplace=True)
Sign up to request clarification or add additional context in comments.

1 Comment

With your suggestions, I was able to figure out how to answer my question! Thank you so much!!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.