Speeding up python when using nested for and if loops

Question

I have a csv file that has a column called "Authors". In that column, each row has a couple of authors separated by commas. In the code below the function, getAuthorNames gets all the author names in that column and returns an array with all their names.

Then the function authCount counts how many times an individual name appears in the Author column. At first, I was doing it with a couple of hundred rows and had no issues. Now I am trying to do it with 20,000 rows+ and it has taken a couple of hours and still no results. I believe it is the nested for loops and if statement that is causing it to take so long. Any advice on how to speed up the process would help. Should I be using lambda? Is there a built it pandas function that could help?

This is what the input data looks like:

Title,Authors,ID
XXX,"Wang J, Wang H",XXX
XXX,"Wang J,Han H",XXX

And this is what the output would look like

Author,Count
Wang J,2
Wang H,1
Han H,1

Here is the code:

    import pandas as pd


    df = pd.read_csv (r'C:\Users\amos.epelman\Desktop\Pubmedpull3GC.csv')


    def getAuthorNames(dataFrame):
        arrayOfAuthors = []
        numRows = dataFrame.shape[0]

        cleanDF = dataFrame.fillna("0")

        for i in range (0,numRows):
            miniArray = cleanDF.at[i,"Authors"].split(",")
            arrayOfAuthors += miniArray
    
        return arrayOfAuthors


    def authCount(dataFrame):
        authArray = getAuthorNames(dataFrame)
        numAuthors = len(authArray)
        countOfAuth = [0] * numAuthors

        newDF = pd.DataFrame({"Author Name": authArray, "Count": countOfAuth})
        refDF = dataFrame.fillna("0")


        numRows= refDF.shape[0]


        for i in range (0,numAuthors):
            for j in range (0,numRows):
                if newDF.at[i, "Author Name"] in refDF.at[j,"Authors"]:
                    newDF.at[i,"Count"] += 1
            
        sortedDF = newDF.sort_values(["Count"], ascending = False)

        noDupsDF = sortedDF.drop_duplicates(subset ="Author Name", keep = False)

        return noDupsDF




    finalDF = authCount(df)
    file_name = 'GC Pubmed Pull3 Author Names with Count.xlsx'
    finalDF.to_excel(file_name)

what you are doing is a very, very slow way of going through a pandas dataframe. You might try using something like a lambda function to operate on the rows. Something like, newDF["Count"] = newDF.apply(lambda row: some function of row, axis=1) (instead of your for loop) — ekrall
– ekrall, Commented Jan 14, 2022 at 3:12
You should provide a sample of the data (as text) and the expected output — mozway
– mozway, Commented Jan 14, 2022 at 3:18

ekrall · Accepted Answer · 2022-01-14 04:13:41Z

1

you could try using Counter and a lambda function to eliminate your nested for loop over two dataframes, which seems like it would be a slow way to add a new column

from collections import Counter

Then to get the "Counts" column

author_counts = Counter(list(refDF["Authors"]))

newDF["Count"] = newDF.apply(lambda r: author_counts[r["Author Name"]], axis=1)

edited Jan 14, 2022 at 4:13

answered Jan 14, 2022 at 3:53

ekrall

1928 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Amos Epelman Over a year ago

Thanks. I don't think this works fully because the author names in the original file is a list of authors separated by commas. when I see what the result is of author_counts, it tells me the list of authors for one paper rather than individual authors. I think this is the right track though

ekrall Over a year ago

ok, well then you might not end up using Counter. but definitely don't loop over all the rows of two dataframes to make a new column. A lambda function over axis 1 is probably what you want to be looking at, as a first step.

Zach Young · Accepted Answer · 2022-01-14 17:11:47Z

You can do this with the csv reader and collections Counter classes from Python's standard library.

I made a sample CSV with 20K rows of randomly generated names like you described, random_names.csv:

Authors
"Darnel D, Blythe B"
"Wang H, Darnel D, Alice A"
"Wang J, Wang H, Darnel D, Blythe B"
"Han H, Wang J"
"Clarice C, Wang H, Darnel D, Alice A"
"Clarice C, Han H, Blythe B, Wang J"
"Clarice C, Darnel D, Blythe B"
"Clarice C, Wang H, Blythe B"
"Blythe B, Wang J, Darnel D"
...

My code doesn't sort, but points out where to insert your sorting. This ran in under a second (on my M1 Macbook Air):

import csv
from collections import Counter

author_counts = Counter()

with open('random_names.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header

    for row in reader:
        authors = row[0]  # !! adjust for your data
        for author in authors.split(','):
            author_counts.update([author.strip()])

# Sort here
print(author_counts.items())

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Author','Count'])
    writer.writerows(author_counts.items())

It printed out this debug line:

dict_items([('Darnel D', 10690), ('Blythe B', 10645), ('Wang H', 10881), ('Alice A', 10750), ('Wang J', 10613), ('Han H', 10814), ('Clarice C', 10724)])

and saved that as output.csv:

Author,Count
Darnel D,10690
Blythe B,10645
Wang H,10881
Alice A,10750
Wang J,10613
Han H,10814
Clarice C,10724

shullaw · Accepted Answer · 2022-01-15 00:03:03Z

0

# take series of authors and split at comma and expand into dataframe
authors = df['author'].str.split(pat=',', expand=True)
authors.melt().value_counts()

I’m not sure if it is faster, but that should give you the unique values and counts.

Input:

x y z author book
0 0 0 aa,bb,cc l
0 0 0 a,b,c l
0 0 0 aa,bb,c l
0 0 0 aa,b,c l

Output:

variable  value
0         aa       3
2         c        3
1         b        2
          bb       2
0         a        1
2         cc       1
dtype: int64

Update:
This solution sorts the output without saving to file and %%timeit% gives:
7.03 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@ZachYoung solution does not sort and without saving output %%timeit gives:
5.64 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I ran this on a test file with 8000 names

edited Jan 15, 2022 at 0:03

answered Jan 14, 2022 at 4:26

shullaw

1061 silver badge5 bronze badges

3 Comments

Amos Epelman Over a year ago

From my understanding, the unique function on pandas doesn't work on data frames so you would have to select a specific column. In this case, the authors get split out into multiple columns when I use your method but thank you for the help!

shullaw Over a year ago

Maybe the updated answer will work for you

shullaw Over a year ago

I forget you want a count, not a list of unique authors.

Collectives™ on Stack Overflow

Speeding up python when using nested for and if loops

3 Answers 3

2 Comments

Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

2 Comments

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related