2

I have a list containing several thousand short strings and a .csv file containing several hundred thousand short strings. All list elements are unique. For each string in the .csv file, I need to check to see if it contains more than one list element.

For example, I have a string:

example_string = "mermaids have braids and tails"

And a list:

example_list = ["me", "ve", "az"]

Clearly the example string contains more than one list item: "me" and "ve". My code needs to indicate this. However, if the list were

example_list = ["ai", "az", "nr"]

then only one list element ("ai") is contained.

I think that the following code will check to see if each line in my .csv file contains at least one list element. However, that doesn't tell me if it contains more than one different list element.

with open("my_file_of_strings.csv", "r") as data:
    for line in data:
        if any(item in line for item in my_list):
            pass  # Do something
1
  • Thanks for all of the helpful, insightful answers! ~♥ Commented Nov 28, 2012 at 0:24

4 Answers

2
with open("my_file_of_strings.csv", "r") as data:
    for line in data:       
        if any(item in i for i in line.split() for item in my_list):
            ...

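If you need to count the matches, use sum(). Note that this counts every (word, item) pair, so a list element that appears in several words of the line is counted more than once.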

with open("my_file_of_strings.csv", "r") as data:
    for line in data:       
        result = sum(item in i for i in line.split() for item in my_list)

1
def contains_multiple(string, substrings):
    count = 0

    for substring in substrings:
        if substring in string:
            count += 1
            if count > 1:
                return True

    return False

for line in data:
    if contains_multiple(line, my_list):
        ...

Not short, but it will exit early as soon as it finds the 2nd match. That may or may not be an important optimization.
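If a shorter form is wanted, the same early exit can be sketched with itertools.islice, which stops consuming a generator after a fixed number of items. This is only an illustrative variant, not a replacement for the loop above; the threshold parameter is an addition that does not appear in the original function, and the sketch relies on the list elements being unique, as the question states.

```python
from itertools import islice

def contains_multiple(string, substrings, threshold=2):
    """Return True if at least `threshold` of the substrings occur in `string`."""
    # The generator yields one value per substring found in the string.
    matches = (1 for substring in substrings if substring in string)
    # islice stops pulling from the generator after `threshold` matches,
    # so the remaining substrings are never tested.
    return sum(islice(matches, threshold)) >= threshold

print(contains_multiple("mermaids have braids and tails", ["me", "ve", "az"]))  # True
print(contains_multiple("mermaids have braids and tails", ["ai", "az", "nr"]))  # False
```

Whether this is clearer than the explicit loop is a matter of taste; the behavior should be the same for threshold=2.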

1 Comment

Works exactly as I had hoped especially with the breaking once a second match was found =). Thanks! ~♥
0

Something like:

with open("my_file_of_strings.csv", "r") as data:
    for line in data:
        if len(set(item for item in my_list if item in line)) > 1:
            pass  # Do something


0

I think the other solutions are better for your purpose, but in case you want to keep track of the number of hits and which ones they were, you could try this:

In [14]: from collections import defaultdict

In [15]: example_list = ["me", "ve", "az"]

In [16]: example_string = "mermaids have braids and tails"

In [17]: d = defaultdict(int)

In [18]: for i in example_list:
   ....:     d[i] += example_string.count(i)
   ....:

In [19]: d
Out[19]: defaultdict(<type 'int'>, {'me': 1, 'az': 0, 've': 1})

And then to get the total number of unique matches:

In [20]: matches = sum(1 for v in d.values() if v)

In [21]: matches
Out[21]: 2
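The same bookkeeping can be applied line by line. The following is a rough sketch rather than a tested solution for the real file: the helper name hit_counts and the sample lines are made up for illustration, and str.count counts non-overlapping occurrences only.

```python
def hit_counts(line, items):
    """Map each list element found in `line` to its number of occurrences."""
    # Elements that do not occur in the line are left out of the result.
    return {item: line.count(item) for item in items if item in line}

example_list = ["me", "ve", "az"]
# Hypothetical stand-in for the lines read from the .csv file.
lines = ["mermaids have braids and tails", "plain text"]

for line in lines:
    hits = hit_counts(line, example_list)
    if len(hits) > 1:  # more than one distinct list element matched
        print(line, "->", hits)
```

Here len(hits) is the number of distinct list elements found, and the values give how often each one occurs.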
