2

I have a list of lists as follws.

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

I also have a list of concepts as follows.

myconcepts = ["method", "standing"]

I want to see how many times each concept in myconcepts is in mylist records. i.e.;

"method" = 2 times in records (i.e. in `52749` and `5274923`)
"standing" = 2 times in records

My current code is as follows.

mycounting = 0
for concept in myconcepts:
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
print(mycounting)

However, my current mylist is very very large and have about 5 million records. myconcepts list have about 10000 concepts.

In my current code it takes nearly 1 minute for a concept to get the count, which is very slow.

I would like to know the most efficient way of doing this in python?

For testing purposes I have attached a small portion of my dataset in: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing

I am happy to provide more details if needed.

5
  • That depends on your usage. If you're going to do this many times, it pays to use a flattened list, and either use a Counter object (a type of dict), or use the count method on the flattened list. These techniques are already documented well on Stack Overflow and elsewhere on line. Commented Nov 12, 2019 at 22:12
  • can you change the lists to sets? Searching sets is O(1) instead of O(n) Commented Nov 12, 2019 at 22:13
  • With that many records you might reconsider your data structure. Commented Nov 12, 2019 at 22:13
  • @Barmar yes, I am happy to change the structure of mylist Commented Nov 12, 2019 at 22:13
  • EmJ, is converting you information into a database a possible answer? If so i'll show you how to add you data to mongodb for a quick search answer. Commented Nov 12, 2019 at 22:18

3 Answers 3

2

Adapting approach 3 from https://www.geeksforgeeks.org/python-count-the-sublists-containing-given-element-in-a-list/

from itertools import chain 
from collections import Counter 

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

myconcepts = ["method", "standing"]

def countList(lst, x):
" Counts number of times item x appears in sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))[x] 

# Use dictionary comprehension to apply countList to concept list
result = {x:countList(mylist, x) for x in myconcepts}
print(result) # {'method':2, 'standing':2}

*Revised current method (compute counts only once) *

def count_occurences(lst):
    " Number of counts of each item in all sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))

cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
print(result) # {'method':2, 'standing':2}

Performance (comparing posted methods using Jupyter Notebook)

Results show this method and Barmar posted method are close (i.e. 36 vs 42 us)

The improvement to the current method reduced to time approximately in half (i.e. from 36 us to 19 us). This improvement should be even more substantial for a larger number of concepts (i.e. problem has > 1000 concepts).

However, the original method is faster at 2.55 us/loop.

Method current method

%timeit { x:countList(mylist, x) for x in myconcepts}
#10000 loops, best of 3: 36.6 µs per loop

Revised current method:

%%timeit
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
10000 loops, best of 3: 19.4 µs per loop

Method 2 (from Barmar post)

%%timeit
r = collections.Counter(flatten(mylist))
{i:r.get(i, 0) for i in myconcepts}
# 10000 loops, best of 3: 42.7 µs per loop

Method 3 (Original Method)

%%timeit

result = {}
for concept in myconcepts:
  mycounting = 0
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
  result[concept] = mycounting
  # 100000 loops, best of 3: 2.55 µs per loop
Sign up to request clarification or add additional context in comments.

15 Comments

@Emj--about to test performance. Will update the post in the next few minutes.
@emj--check my updated post where I added your original method. Surprisingly the original is fastest. Wonder if it depends upon the size of the dataset (i.e. original is faster for small datasets).
@EmJ--good feedback. Let me see if I can remove the overhead of computing sets for every concept.
@emj--check revised method that uses count_occurences. This should be a lot faster for your test case.
@emj--just saw your post. Looks like you have already found a good solution.
|
2

You can flatten the input and then use collections.Counter:

import collections
myconcepts = ["method", "standing"]
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]]
def flatten(d):
  for i in d:
    yield from [i] if not isinstance(i, list) else flatten(i)

r = collections.Counter(flatten(mylist))
result = {i:r.get(i, 0) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}

Edit: record lookup:

result = {i:sum(i in b for _, b in mylist) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}

9 Comments

Thank you for your answer. However, I am looking how many times each concept is in mylist records (not the total count). For example, method appears in two records 52749 and 5274923 (please see my edited question). Thank you :)
@EmJ Are you trying to find the record names that have a corresponding value in myconcepts? The current result {'method': 2, 'standing': 2} is indeed the individual counts for each element in myconcepts.
Thank you for the comment. I want to find the number of records that each concept is in. For example consider a record like this; [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]. It has method 2 times. However, the number of records is 1. :)
@EmJ Please see my recent edit, as I added a solution to find the unique records that store a target value.
Thank you for the edit. However, I only need a count of number of records. For instance, {'method': 2, 'standing': 2} :)
|
1

Change the concept lists to sets, so that searching will be O(1). You can then use intersection to count the number of matches in each set.

import set
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}], 
    [5274920, {"implantology", "dentistry"}], 
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}], 
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"}]
]
myconcepts = {"method", "standing"}
mycounting = 0
for item in mylist:
    mycounting += len(set.intersection(myconcepts, item[1]))
print(mycounting)

If you want to get the counts for each concept separately, you'll need to loop over myconcept, then use the in operator. You can put the results in a dictionary.

mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts}
print(mycounting) // {'standing': 2, 'method': 2}

This will still be more efficient than using a list, because concept in l[1] is O(1).

7 Comments

Thank you for the answer. I get an error saying NameError: name 'intersection' is not defined. Do you know how to fix this? Thank you :)
Should be set.intersection.
By the way, you can do intersections and unions of sets using the overloaded & and | operators; I think it is more readable, personally.
@Barmar I get the answer of your code as 4. But it should be 2 and 2. :)
@kaya3 I knew that, but I thought this would be more self-explanatory.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.