I have a CSV file which has the following content:

Apple,Bat
Apple,Cat
Apple,Dry
Apple,East
Apple,Fun
Apple,Gravy
Apple,Hot
Bat,Cat
Bat,Dry
Bat,Fun
...

I also have a list as follows:

to_remove = ['Fun', 'Gravy', ...]

I would like an efficient way to delete every line of the CSV file that contains any of the words in the list to_remove.

I know one way to do it is to read each line of the CSV file, loop through to_remove to check whether any of its words appears in the line, and write the line to another file if nothing matched.

However, both the CSV file and the to_remove list have many entries (approx. 21000 and 300 respectively), so I want an efficient way of doing this in Python.

I do not have access to a cluster, so map-reduce based approaches are not an option.

2 Comments

  • grep -Ev '(Fun|Gravy)' filename Commented Jan 25, 2014 at 11:44
  • You could try regular expressions or simply parallelise the code. There's only so much you can do. Huge operations will always be huge one way or another. Commented Jan 25, 2014 at 11:50
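
The regular-expression suggestion in these comments can be sketched in Python roughly as follows. This is only an illustration, not code from the commenters: the helper name keep_lines is hypothetical, and it matches substrings anywhere in the line, as the grep alternation does. re.escape is used so that words containing regex metacharacters are treated literally.

```python
import re

def keep_lines(lines, words):
    """Return the lines that contain none of the given words (substring match)."""
    words = list(words)
    if not words:
        # An empty alternation would match every line, so keep everything.
        return list(lines)
    # One compiled alternation, e.g. Fun|Gravy, searched once per line.
    pattern = re.compile('|'.join(re.escape(w) for w in words))
    return [line for line in lines if not pattern.search(line)]
```

With 300 words the pattern is compiled once and reused for every line, so the per-line Python loop over to_remove disappears.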

1 Answer

toremove = ['Fun', 'Gravy']

# Copy every line that contains none of the words in toremove.
with open('test.in', 'r') as fin, open('test.out', 'w') as fout:
    for line in filter(lambda x: not any(word in x for word in toremove), fin.readlines()):
        fout.write(line)

# Show the filtered result.
with open('test.out') as fout:
    print(fout.read())

test.in:

Apple,Bat
Apple,Cat
Apple,Dry
Apple,East
Apple,Fun
Apple,Gravy
Apple,Hot
Bat,Cat
Bat,Dry
Bat,Fun

test.out:

Apple,Bat
Apple,Cat
Apple,Dry
Apple,East
Apple,Hot
Bat,Cat
Bat,Dry
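
If the words to remove always match a whole field exactly (an assumption; the question does not say so), a set lookup may be cheaper than testing each word against each line. The following is a minimal sketch for Python 3 using the standard csv module; the function name drop_rows is hypothetical.

```python
import csv

def drop_rows(in_path, out_path, words):
    """Copy rows whose fields are all absent from words; return rows kept."""
    blocked = set(words)  # set membership tests are O(1) on average
    kept = 0
    with open(in_path, newline='') as fin, open(out_path, 'w', newline='') as fout:
        # lineterminator='\n' keeps plain newlines instead of csv's default '\r\n'
        writer = csv.writer(fout, lineterminator='\n')
        for row in csv.reader(fin):
            # isdisjoint is True when no field of the row is in blocked
            if blocked.isdisjoint(row):
                writer.writerow(row)
                kept += 1
    return kept
```

Unlike the substring test above, this would keep a row such as Funny,Dry, because Funny is not equal to Fun.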

1 Comment

fin.readlines() will read the entire file into memory. Not exactly what the OP wants.
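
One way to address this, sketched here under the same substring-matching behaviour as the answer, is to iterate over the file object directly, which yields one line at a time instead of building a list of all lines. The function name filter_lines is hypothetical.

```python
def filter_lines(in_path, out_path, words):
    """Stream in_path to out_path, skipping lines that contain any word."""
    kept = 0
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:  # the file object is read lazily, line by line
            if not any(word in line for word in words):
                fout.write(line)
                kept += 1
    return kept  # number of lines written
```

For a file of about 21000 short lines either version should fit in memory comfortably; the streaming form mainly matters for much larger inputs.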
