
So, I have this Python script using pandas that does a few things. It combines two Excel sheets into a new one, and it adds a column to the combined sheet that shows which original file each row came from. Here is the script:

import pandas as pd
from os.path import basename

df = []

# enter your file names via terminal
file1 = raw_input("Enter the path to the first file: ")
file2 = raw_input("Enter the path to the second file: ")

# read each sheet and label its rows with the name of the file it came from
for f in [file1, file2]:
    data = pd.read_excel(f, 'Sheet1')
    data.index = [basename(f)] * len(data)
    df.append(data)

# set the path and name of your final product file
# (raw string so the backslashes in the example path are not treated as escapes)
final = raw_input(r'Where do you want the file, and what do you want to name it? (C:\path_to_file\name_of_file.xlsx): ')

df = pd.concat(df)
df.to_excel(final)

Now, my question is this: let's say we combine two Excel files, so the result looks like this:

                Item           Inv  Price   Sold
dbtest1.xlsx    Banana         50      1    27
dbtest1.xlsx    Grapes         100     3    68
dbtest2.xlsx    Oranges        68      3    17
dbtest2.xlsx    Apples         22      1.5  9
dbtest2.xlsx    Strawberries   245     4    122

And say I want to add this combined file, now called dbtestfinal.xlsx, to another Excel file. The result I'd get is:

                  Item      Inventory   Price   Sold
dbtest3.xlsx      Pork      49          2.99    47
dbtest3.xlsx      Beef      27          1.5     78
dbtest3.xlsx      Chicken   245         1.99    247
dbtestfinal.xlsx  Banana    50          1       27
dbtestfinal.xlsx  Grapes    100         3       68
dbtestfinal.xlsx  Oranges   68          3       17
dbtestfinal.xlsx  Apples    22          1.5     9
dbtestfinal.xlsx  Stra...   245         4       122

I'd like it to be able to maintain the original files it came from, so instead of having just dbtest3.xlsx and dbtestfinal.xlsx, it would have dbtest1,2,3 instead. Is there a way to make it do such a thing?

Also, adding in a column for the date in which the file was added would be great, too!

And one last addition, and this one is likely not trivial: is there a way to have the program detect the same file origin and replace it with the new one? So if you edited dbtest2.xlsx and added/subtracted items, the program would remove the old ones and only input this new file?

Thank you for any suggestions!

1 Answer

Consider this adjusted script. Where before you appended to a list, this script imports each file into its own data frame and concatenates them later. As for your dbtest1, 2, 3 naming: simply name the files that way in your working directory and the script labels the rows accordingly.

Also, nothing is kept in memory after the script finishes, so simply re-import an earlier output file to concatenate further worksheet data frames onto it, as a sort of "running" appended data frame. And because the script imports the current state of the Excel file, it always picks up the most recent data.

Finally, I added some validation and try/except handling, since much of the script relies on user input, which should be checked before processing. There is even a success message and an automated open of the outputted worksheet.
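That "running" re-import can be sketched like this (CSV is used here so the round trip is easy to show in memory; the toy column values are just illustrations, not your real files):

```python
import pandas as pd
from io import StringIO

# earlier output, re-imported: the originfile column survives as a regular column
old = pd.DataFrame({'Item': ['Banana'], 'Inv': [50], 'originfile': ['dbtest1.xlsx']})
master = pd.read_csv(StringIO(old.to_csv(index=False)))

# new worksheet, tagged with its own origin, stacked onto the running frame
new = pd.DataFrame({'Item': ['Pork'], 'Inv': [49], 'originfile': ['dbtest3.xlsx']})
combined = pd.concat([new, master], ignore_index=True)
```

After the round trip, `combined` carries both origin labels, so nothing about the source files is lost between runs.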

import subprocess
import os, sys
from os.path import basename

import pandas as pd

# CSV IMPORT DEFINED FUNCTION
def csvImport(ftype, fpath):
    try:
        if ftype == 1:
            masterdata = pd.read_csv(fpath)
            return masterdata

        if ftype == 2:
            updateddata = pd.read_csv(fpath)
            # tag each row with the file it came from
            updateddata['originfile'] = pd.Series(basename(fpath),
                                                  index=updateddata.index)
            return updateddata

    except Exception as e:
        print "\nUnable to import CSV file. Error {}".format(e)
        sys.exit(1)

# EXCEL IMPORT DEFINED FUNCTION
def xlImport(ftype, fpath):
    try:
        if ftype == 1:
            masterdata = pd.read_excel(fpath, 0)
            return masterdata

        if ftype == 2:
            updateddata = pd.read_excel(fpath, 0)
            # tag each row with the file it came from
            updateddata['originfile'] = pd.Series(basename(fpath),
                                                  index=updateddata.index)
            return updateddata

    except Exception as e:
        print "\nUnable to import Excel file. Error {}".format(e)
        sys.exit(1)

# MASTER FILE USER INPUT DEFINED FUNCTION
def masterfile():
    while True:
        mfile = raw_input("Enter the path to the master file: ")
        if mfile.endswith(".csv"):
            return csvImport(1, mfile)
        elif mfile.endswith(".xlsx"):
            return xlImport(1, mfile)
        else:
            print "\nPlease enter a proper CSV or xlsx format file."

# UPDATED FILE USER INPUT DEFINED FUNCTION
def updatefile():
    while True:
        ufile = raw_input("\nEnter the path to the updated file: ")
        if ufile.endswith(".csv"):
            return csvImport(2, ufile)
        elif ufile.endswith(".xlsx"):
            return xlImport(2, ufile)
        else:
            print "\nPlease enter a proper CSV or xlsx format file."

# CALLING OPENING FUNCTIONS
masterdata = masterfile()
updateddata = updatefile()

# CONCATENATING DATA FRAMES (updated stacked on top of master)
combineddata = pd.concat([updateddata, masterdata])

# REMOVING DUPLICATES (keeps the first, i.e., updated, instance of each Item)
finaldata = combineddata.drop_duplicates(['Item'])

# SETTING FINAL PATH BY USER INPUT
while True:
    final = raw_input("\nWhere do you want the file, and what do you want to name it? "
                      r"(e.g., C:\path_to_file\name_of_file.xlsx): ")
    if final.endswith(".xlsx"):
        break
    else:
        print "\nPlease enter a proper Excel file in xlsx format."

# OUTPUTTING DATA FRAME TO FILE
finaldata.to_excel(final)
print "\nSuccessfully outputted appended data frame to Excel!"

# OPENING OUTPUTTED FILE
# (NOTE: PYTHON STILL RUNS UNTIL SPREADSHEET IS CLOSED)
subprocess.call(final, shell=True)
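The date column asked about in the question isn't covered above, but it is a one-liner you could bolt onto the ftype == 2 branches. A minimal sketch (the 'date_added' column name is just my choice, and the toy frame stands in for your imported data):

```python
import pandas as pd
from datetime import date

# stand-in for the data frame read from the updated file
updateddata = pd.DataFrame({'Item': ['Apples'], 'Price': [1.5]})
updateddata['originfile'] = 'dbtest2.xlsx'
updateddata['date_added'] = date.today().isoformat()  # stamp when this file was merged in
```

Because every row of a given import gets the same stamp, the column records when each origin file was last added.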

12 Comments

I do like the improved script and will certainly use it going forward, but it still doesn't do the things I'm wanting it to do. I want it to maintain the original file name as opposed to giving it the new name. Ultimately, I'd like to change it quite a bit more, wherein it checks recursively with the main file to see if the file being added to it has been added before, then removes all the old entries and instead uses the new one. This is just a crude prototype for now. Either way, I appreciate the input.
Consider my edit. The script now asks for a master file and an updated file. Only the updated file fills in a new 'originfile' column. Then the two are concatenated (updated first, then master) and duplicates by Item are dropped, which removes the master (old) records.
This looks great. My only question now: is there a way to specify a column to use for said duplicates, such as Item? So if it sees two "Apples" it will prioritize the new one with the new stock/inventory/price?
Yes, you can specify one or more columns in a list for duplicates. Here Item is used. Duplicates will keep first instance and drop all instances thereafter. Hence I concatenate (i.e., stack) new on top of old. So always use older dataset as master file and new as updated file. Test it out!
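To see that keep-first behavior concretely (toy frames, not the real files):

```python
import pandas as pd

old = pd.DataFrame({'Item': ['Apples', 'Pork'], 'Price': [1.50, 2.99]})
new = pd.DataFrame({'Item': ['Apples'], 'Price': [1.75]})

# stack new on top of old, then keep only the first instance of each Item
combined = pd.concat([new, old], ignore_index=True)
result = combined.drop_duplicates(['Item'])
```

The new Apples row (1.75) survives and the old one (1.50) is dropped, while Pork, which has no newer counterpart, carries through from the old frame.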
That's brilliant! The only issue I've noticed is, it appears that if it is NOT in an .xlsx format it spams the error message as opposed to just giving it once!
