using python openpyxl to write to an excel spreadsheet (string searches)

Question

The following is my code. I want it to read the excel spreadsheet and use the data in the Warehouse column (i.e. search for substrings in that column's cells) to map and write a specific string to the corresponding cell in the next column called GeneralDescription. My spreadsheet has over 50000 rows. This snippet of code works for classifying two GeneralDescriptions at this point. Eventually I want to be able to easily scale this to cover all possible warehouses. The only thing that is not working and that I need specific help with is that when the string "WORLD WIDE DATA" appears in the Warehouse column, the code does not recognize it. I am assuming because of the all upper case. However, if the string "HUMANRESOURCES Toronto" appears in the Warehouse column, this code works correctly and writes "HumanResources" to the GeneralDescription column. It also recognizes "WWD" and "wwd" and correctly writes "World Wide Data" to the GeneralDescription column. I don't understand why its just that one particular string not is not being recognized unless it has something to do with whitespace. Also in the original speadsheet, there are some integers that identify Warehouses. If i don't delete these, I am unable to iterate through those rows. I need to keep these numbers in there. Any ideas on how i can make this work. Any help is much appreciated.

import openpyxl
import re

wb = openpyxl.load_workbook(filename="Trial_python.xlsx")

ws= wb.worksheets[0]

sheet = wb.active

for i in range(2, 94000):
    if(sheet.cell(row=i, column=6).value !=None):
        if(sheet.cell(row=i, column=6).value.lower()=="world wide data"):
            sheet.cell(row=i, column=7).value="World Wide Data"
        for j in re.findall(r"[\w']+", sheet.cell(row=i, column=6).value
            if(j.lower()=="wwd" or j.lower()=="world wide data"):
                sheet.cell(row=i, column=7).value="World Wide Data"
            if(j.lower()=="humanresources"):
                sheet.cell(row=i,column=7).value="HumanResources"

wb.save(filename="Trial_python.xlsx")

Using sheet['F'] and cell.offset() would make your code a lot more readable. — Charlie Clark
– Charlie Clark, Commented Jul 6, 2018 at 9:39

Community · Accepted Answer · 2020-06-20 09:12:55Z

I'd recommend creating an empty list and as you iterate through the column store each of the values in there with .append(), that should help your code scale a bit better, although i'm sure there will be other more efficient solutions.

I'd also recommend moving away from using == to check for equality and try using is, this link goes into detail about the differences: https://dbader.org/blog/difference-between-is-and-equals-in-python

So your code should look like this:

...
business_list = ['world wide data', 'other_businesses', 'etc']
for i in range(2, 94000):
    if(sheet.cell(row=i, column=6).value is not None):
        if(sheet.cell(row=i, column=6).value.lower() in business_list:
            sheet.cell(row=i, column=7).value = "World Wide Data"
...

Hope that helps

Edit to answer comments below

So to answer your question in comment 2, the business_list = [...] that we created will store anything that you want to check for. ie. if WWD, World Wide Data, 2467, etc. appear then you can check through this list, and if a match is found - which uses the in function - you can then write whatever you like into column 7. (final line of code).

If you want Machine operations or HumanResources or any of these other strings to appear there are a couple of methods that you can complete this. A simple way is to write a check for them like so:

...
business_list = ['world wide data', 'other_businesses', '2467',
                 'central operations', 'humanresources']
for i in range(2, 50000):
    if(sheet.cell(row=i, column=6).value is not None):
        if(sheet.cell(row=i, column=6).value.lower() in business_list:
            if business_list[i].lower() == "humanresources":
                sheet.cell(row = i, column = 7).value = "HumanResources"
            if business_list[i].lower() == "machine operations":
                sheet.cell(row = i, column = 7).value = "Machine Operations"
            else:
                 sheet.cell(row = i, column = 7).value = "World Wide Data"
...

So to explain what is happening here, a list is created with the values that you want to check for, called business_list. you are then iterating through your columns and checking that the cell is not empty with not None:. From here you do an initial check to see if the value of the cell is something that you want to even check for - in business_list: and if it is - you use the index of what it found to identify and update the cell value.

This ensures that you aren't checking for something that might not be there by checking the list first. Since the values that you suggested are one for one, ie. HumanResources for humanresources, Machine Operations for machine operations.

As for scaling, it should be easy to add new checks by adding the new company name to the list, and then a 2 line statement of if this then cell = this.

I use a similar system for a sheet that is roughly 1.2m entries and performance is still fast enough for production, although I don't know how complex yours is. There may be other more efficient means of doing it but this system is simple to maintain in the future as well, hopefully this makes a bit more sense for you. let me know if not and i'll help if possible

EDIT: As for your last comment, I wouldn't assume something like that without doing a check since it could lead to false positives!

Collectives™ on Stack Overflow

using python openpyxl to write to an excel spreadsheet (string searches)

1 Answer 1

Edit to answer comments below

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Edit to answer comments below

Comments

Your Answer

Sign up or log in

Post as a guest

Related