How to efficiently check if an element is in a list of lists in python

Question

I have a list of lists as follws.

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

I also have a list of concepts as follows.

myconcepts = ["method", "standing"]

I want to see how many times each concept in myconcepts is in mylist records. i.e.;

"method" = 2 times in records (i.e. in `52749` and `5274923`)
"standing" = 2 times in records

My current code is as follows.

mycounting = 0
for concept in myconcepts:
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
print(mycounting)

However, my current mylist is very very large and have about 5 million records. myconcepts list have about 10000 concepts.

In my current code it takes nearly 1 minute for a concept to get the count, which is very slow.

I would like to know the most efficient way of doing this in python?

For testing purposes I have attached a small portion of my dataset in: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing

I am happy to provide more details if needed.

That depends on your usage. If you're going to do this many times, it pays to use a flattened list, and either use a Counter object (a type of dict), or use the count method on the flattened list. These techniques are already documented well on Stack Overflow and elsewhere on line. — Prune
– Prune, Commented Nov 12, 2019 at 22:12
can you change the lists to sets? Searching sets is O(1) instead of O(n) — Barmar
– Barmar, Commented Nov 12, 2019 at 22:13
With that many records you might reconsider your data structure. — alex
– alex, Commented Nov 12, 2019 at 22:13
EmJ, is converting you information into a database a possible answer? If so i'll show you how to add you data to mongodb for a quick search answer. — oppressionslayer
– oppressionslayer, Commented Nov 12, 2019 at 22:18

DarrylG · Accepted Answer · 2019-11-13 00:40:08Z

2

Adapting approach 3 from https://www.geeksforgeeks.org/python-count-the-sublists-containing-given-element-in-a-list/

from itertools import chain 
from collections import Counter 

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]]

myconcepts = ["method", "standing"]

def countList(lst, x):
" Counts number of times item x appears in sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))[x] 

# Use dictionary comprehension to apply countList to concept list
result = {x:countList(mylist, x) for x in myconcepts}
print(result) # {'method':2, 'standing':2}

*Revised current method (compute counts only once) *

def count_occurences(lst):
    " Number of counts of each item in all sublists "
    return Counter(chain.from_iterable(set(i[1]) for i in lst))

cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
print(result) # {'method':2, 'standing':2}

Performance (comparing posted methods using Jupyter Notebook)

Results show this method and Barmar posted method are close (i.e. 36 vs 42 us)

The improvement to the current method reduced to time approximately in half (i.e. from 36 us to 19 us). This improvement should be even more substantial for a larger number of concepts (i.e. problem has > 1000 concepts).

However, the original method is faster at 2.55 us/loop.

Method current method

%timeit { x:countList(mylist, x) for x in myconcepts}
#10000 loops, best of 3: 36.6 µs per loop

Revised current method:

%%timeit
cnts = count_occurences(mylist)
result = {x:cnts[x] for x in myconcepts}
10000 loops, best of 3: 19.4 µs per loop

Method 2 (from Barmar post)

%%timeit
r = collections.Counter(flatten(mylist))
{i:r.get(i, 0) for i in myconcepts}
# 10000 loops, best of 3: 42.7 µs per loop

Method 3 (Original Method)

%%timeit

result = {}
for concept in myconcepts:
  mycounting = 0
  for item in mylist:
     if concept in item[1]:
       mycounting = mycounting + 1
  result[concept] = mycounting
  # 100000 loops, best of 3: 2.55 µs per loop

edited Nov 13, 2019 at 0:40

answered Nov 12, 2019 at 22:35

DarrylG

17.2k2 gold badges20 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

15 Comments

DarrylG Over a year ago

@Emj--about to test performance. Will update the post in the next few minutes.

DarrylG Over a year ago

@emj--check my updated post where I added your original method. Surprisingly the original is fastest. Wonder if it depends upon the size of the dataset (i.e. original is faster for small datasets).

DarrylG Over a year ago

@EmJ--good feedback. Let me see if I can remove the overhead of computing sets for every concept.

DarrylG Over a year ago

@emj--check revised method that uses count_occurences. This should be a lot faster for your test case.

DarrylG Over a year ago

@emj--just saw your post. Looks like you have already found a good solution.

|

Ajax1234 · Accepted Answer · 2019-11-13 00:07:10Z

2

You can flatten the input and then use collections.Counter:

import collections
myconcepts = ["method", "standing"]
mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]]
def flatten(d):
  for i in d:
    yield from [i] if not isinstance(i, list) else flatten(i)

r = collections.Counter(flatten(mylist))
result = {i:r.get(i, 0) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}

Edit: record lookup:

result = {i:sum(i in b for _, b in mylist) for i in myconcepts}

Output:

{'method': 2, 'standing': 2}

edited Nov 13, 2019 at 0:07

answered Nov 12, 2019 at 22:13

Ajax1234

71.7k9 gold badges67 silver badges110 bronze badges

9 Comments

EmJ Over a year ago

Thank you for your answer. However, I am looking how many times each concept is in mylist records (not the total count). For example, method appears in two records 52749 and 5274923 (please see my edited question). Thank you :)

Ajax1234 Over a year ago

@EmJ Are you trying to find the record names that have a corresponding value in myconcepts? The current result {'method': 2, 'standing': 2} is indeed the individual counts for each element in myconcepts.

EmJ Over a year ago

Thank you for the comment. I want to find the number of records that each concept is in. For example consider a record like this; [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]. It has method 2 times. However, the number of records is 1. :)

Ajax1234 Over a year ago

@EmJ Please see my recent edit, as I added a solution to find the unique records that store a target value.

EmJ Over a year ago

Thank you for the edit. However, I only need a count of number of records. For instance, {'method': 2, 'standing': 2} :)

|

Barmar · Accepted Answer · 2019-11-13 19:36:57Z

1

Change the concept lists to sets, so that searching will be O(1). You can then use intersection to count the number of matches in each set.

import set
mylist = [
    [5274919, {"report", "porcelain", "firing", "technic"}], 
    [5274920, {"implantology", "dentistry"}], 
    [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}], 
    [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"}]
]
myconcepts = {"method", "standing"}
mycounting = 0
for item in mylist:
    mycounting += len(set.intersection(myconcepts, item[1]))
print(mycounting)

If you want to get the counts for each concept separately, you'll need to loop over myconcept, then use the in operator. You can put the results in a dictionary.

mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts}
print(mycounting) // {'standing': 2, 'method': 2}

This will still be more efficient than using a list, because concept in l[1] is O(1).

edited Nov 13, 2019 at 19:36

answered Nov 12, 2019 at 22:23

Barmar

789k57 gold badges554 silver badges669 bronze badges

7 Comments

EmJ Over a year ago

Thank you for the answer. I get an error saying NameError: name 'intersection' is not defined. Do you know how to fix this? Thank you :)

kaya3 Over a year ago

Should be set.intersection.

kaya3 Over a year ago

By the way, you can do intersections and unions of sets using the overloaded & and | operators; I think it is more readable, personally.

EmJ Over a year ago

@Barmar I get the answer of your code as 4. But it should be 2 and 2. :)

Barmar Over a year ago

@kaya3 I knew that, but I thought this would be more self-explanatory.

|

Collectives™ on Stack Overflow

How to efficiently check if an element is in a list of lists in python

3 Answers 3

15 Comments

9 Comments

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

15 Comments

9 Comments

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related