I have a dataframe like the one below, with 2 million rows. The sample data can be found here.

enter image description here

The list of matches in each row can contain any numbers between 1 and 761. I want to count the occurrences of every number between 1 and 761 across the whole matches column. For example, the result for the above data would be:

enter image description here

If a particular id is not found, its count should be 0 in the output. I tried a for-loop approach, but it is quite slow.

import pandas as pd

def readData():
    df = pd.read_excel(file_path)

    # pattern_match_count[i] holds the number of rows that contain pattern id i + 1
    pattern_match_count = [0] * 761
    for index, row in df.iterrows():
        matches = row["matches"]

        for pattern_id in range(1, 762):
            if pattern_id in matches:
                pattern_match_count[pattern_id - 1] += 1

    return pattern_match_count

Is there a better approach with pandas that would make this faster?

  • Please provide a reproducible, self-sufficient input, not images. Commented Oct 4, 2022 at 17:26
  • What is the data type of the matches column? Commented Oct 4, 2022 at 17:31
  • The data type of the matches column is list. Commented Oct 4, 2022 at 17:46

2 Answers

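You can use the .explode() method to "explode" the lists into new rows, then count the resulting values with .value_counts().

import pandas as pd

def readData():
    df = pd.read_excel(file_path)
    # Select the "matches" column first, then explode it so each list element gets its own row
    return df.loc[:, "matches"].explode().value_counts()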
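
The question also asks for a count of 0 for ids that never appear, which value_counts() alone does not provide. A minimal sketch of one way to add them, assuming the matches column holds Python lists of integers (the small df here is a stand-in for the real data, and the dropna/astype steps are a precaution against empty lists, not something the original answer uses):

```python
import pandas as pd

# Stand-in for the real data: each row holds a list of match ids.
df = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})

counts = (
    df["matches"]
    .explode()                              # one row per list element
    .dropna()                               # empty lists explode to NaN; drop them
    .astype("int64")                        # exploded values start out as object dtype
    .value_counts()                         # occurrences of each id that is present
    .reindex(range(1, 762), fill_value=0)   # add the missing ids with a count of 0
)
counts.index.name = "match_id"
```

Note that this counts every occurrence, so an id repeated within one row is counted more than once, whereas the loop in the question counts each row at most once per id.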
2 Comments

Just a note that you can apply explode to the series directly, which may save memory for a large dataframe: df.loc[:, "matches"].explode().value_counts()
@Andrew Thanks. Selecting then applying is better memory-wise than applying then selecting. I have edited my answer and corrected a little typo you made in your suggestion.
You can use collections.Counter:

import pandas as pd

df = pd.DataFrame({"matches": [[1,2,3],[1,3,3,4]]})

#df:
#        matches
#0     [1, 2, 3]
#1  [1, 3, 3, 4]

from collections import Counter

C = Counter([i for sl in df.matches for i in sl])
#C:  
#Counter({1: 2, 2: 1, 3: 3, 4: 1})

pd.DataFrame(C.items(), columns=["match_id", "counts"]) 
#   match_id  counts
#0         1       2
#1         2       1
#2         3       3
#3         4       1

If you want zeros for match_ids that aren't in any of the matches, then you can update the Counter object C:

for i in range(1,762):
    if i not in C:
        C[i] = 0
pd.DataFrame(C.items(), columns=["match_id", "counts"]) 
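
The loop above appends the missing ids at the end of the Counter, so the resulting rows are not ordered by match_id. As a hedged variation (not part of the original answer), the sketch below relies on the fact that a Counter returns 0 for missing keys and builds the table directly over the full id range; the names ids and result are illustrative only:

```python
from collections import Counter
from itertools import chain

import pandas as pd

df = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})

# Flatten the lists lazily and count every element.
C = Counter(chain.from_iterable(df["matches"]))

# Indexing a Counter with a missing key returns 0 without inserting it,
# so the full id range can be looked up in order.
ids = range(1, 762)
result = pd.DataFrame({"match_id": list(ids), "counts": [C[i] for i in ids]})
```

This keeps the output sorted by match_id and leaves C unchanged.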