Compare two different files text by text using Python

Question

I am trying to find the words/text that two different files have in common, but I am not getting the result I am looking for.

I have tried comparing the files line by line, but that did not give the result either:

with open('top_1k_domain.txt', 'r') as file1:
    with open('latesteasylist.txt', 'r') as file2:
        same = set(file1).intersection(file2)

same.discard('\n')

with open('some_output_file1.txt', 'w') as file_out:
    for line in same:
        file_out.write(line)

For example, my first file contains the text:

 google.com
 youtube.com
 facebook.com
 doublepimp.com
 uod2quk646.com
 qq.com
 yahoo.com
 tmall.com

whereas the second file contains:

 ||doublepimp.com^$third-party
 ||uod2quk646.com^$third-party
 ....etc

It did not produce the output I am looking for: doublepimp.com and uod2quk646.com should be in some_output_file1.txt, but the file is empty. Can anybody help me out here?

Hello, I hope you are doing well. Could you give us an example of the two files you use, and the desired output, please? Thank you in advance.
– Guillaume Lastecoueres, Commented Mar 23, 2019 at 9:53
The first file contains domain names, whereas the second file contains filter rules. I have to check which domain names have a rule defined for them in the filter list, so I am trying to extract the domain names that are common to both files. Your response will be appreciated @GuillaumeLastecoueres, thanks.
– kashifbilal kashi, Commented Mar 23, 2019 at 11:07

blhsing · Accepted Answer · 2019-03-23 10:52:40Z

1

With set intersection, items from the two sets match only if they are identical, which is not the case for these two files: the lines in the second file contain not just the domain names, but also other AdBlock syntax.
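As a small illustration of the point above, here is a sketch using a few sample lines copied from the question (held in in-memory sets rather than read from the actual files, which is an assumption made for brevity). It shows that an exact-match intersection is empty even though the domain text appears inside a rule.

```python
# Sample lines from the question; each keeps its trailing newline,
# just as iterating over an open file would yield it.
domains = {"doublepimp.com\n", "qq.com\n"}
rules = {"||doublepimp.com^$third-party\n", "||uod2quk646.com^$third-party\n"}

# Set intersection keeps only members that are equal as whole strings.
print(domains & rules)  # prints set()

# A substring test, by contrast, does find the domain inside a rule.
print(any("doublepimp.com" in rule for rule in rules))  # prints True
```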

You should extract the domain name portion from the lines in the second file before you perform a set intersection with lines in the first file:

import re
same = set(file1).intersection((re.findall(r'[a-z0-9.-]+', line) or [''])[0] + '\n' for line in file2)
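For readers who want the whole flow in one place, the following is a minimal sketch of the same idea, not the answer's exact code. The helper names `rule_domain` and `common_domains` are hypothetical, and the sample lists stand in for the two files. It reuses the answer's character class and skips lines where nothing matches, which is the situation that produced the `'NoneType' object has no attribute 'group'` error mentioned in the comments.

```python
import re

# Same character class as the answer: lowercase letters, digits, dots, hyphens.
DOMAIN_RE = re.compile(r'[a-z0-9.-]+')

def rule_domain(line):
    """Return the first domain-like token in a rule line, or None if absent."""
    match = DOMAIN_RE.search(line)
    return match.group(0) if match else None

def common_domains(domain_lines, rule_lines):
    """Return the domains that also appear as the first token of some rule."""
    domains = {line.strip() for line in domain_lines if line.strip()}
    rule_domains = {rule_domain(line) for line in rule_lines}
    return domains & rule_domains

# Sample data taken from the question, plus one blank line.
domain_lines = ["google.com\n", "doublepimp.com\n", "uod2quk646.com\n", "qq.com\n"]
rule_lines = ["||doublepimp.com^$third-party\n", "||uod2quk646.com^$third-party\n", "\n"]

same = common_domains(domain_lines, rule_lines)
print(sorted(same))  # prints ['doublepimp.com', 'uod2quk646.com']
```

Open file objects can be passed in place of the lists, since both are iterables of lines. Note that the pattern only takes the first matching token on each line, so rules that start with something other than the domain would need a different pattern.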

edited Mar 23, 2019 at 10:52

answered Mar 23, 2019 at 10:07


4 Comments

kashifbilal kashi Over a year ago

I am getting an AttributeError: 'NoneType' object has no attribute 'group'. What am I missing here?

blhsing Over a year ago

That's because some of the lines in your second file do not have a domain name at all. I've updated my answer so that those lines are ignored.

kashifbilal kashi Over a year ago

I have another question, and I would be thankful if you could help @blhsing. I am also trying to fetch only the rules of this form: /example.js $script,domain=example.com. Could you write a pattern for this so that I can fetch this type of rule from the filter list?

blhsing Over a year ago

Glad to be of help. That really is out of the scope of this question though. Please ask about this in a new question with formatted code so that people can better help.

mhhollomon · Accepted Answer · 2019-03-23 10:16:04Z

0

The core idea is okay, but since the second file contains more than just the domain, you will need to strip that out first.

||example.com^$third-party will never equal example.com

One possibility:

same = set(file1).intersection(set(x[2:x.index('^')] + '\n' for x in file2))

answered Mar 23, 2019 at 10:16


1 Comment

kashifbilal kashi Over a year ago

It raises an error saying "substring not found". Could you please complete my code @mhhollomon? I am still at the learning stage.
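The "substring not found" message is the `ValueError` that `str.index` raises when a line contains no `^`. One possible way around it, sketched here under the assumption that only rules of the form `||domain^...` are of interest, is to use `str.find`, which returns -1 instead of raising. The helper name `domain_from_rule` and the comment line in the sample data are hypothetical.

```python
def domain_from_rule(line):
    """Return the text between a leading '||' and the first '^', or None."""
    if not line.startswith('||'):
        return None
    end = line.find('^')  # str.find returns -1 rather than raising ValueError
    if end == -1:
        return None
    return line[2:end]

# Two rules from the question plus a hypothetical line that is not a '||' rule.
rule_lines = [
    "! a comment line\n",
    "||doublepimp.com^$third-party\n",
    "||uod2quk646.com^$third-party\n",
]
domain_lines = ["google.com\n", "doublepimp.com\n", "uod2quk646.com\n"]

# Drop the None entries produced by lines that are not '||domain^' rules.
rule_domains = {domain_from_rule(line) for line in rule_lines} - {None}
same = {line.strip() for line in domain_lines} & rule_domains
print(sorted(same))  # prints ['doublepimp.com', 'uod2quk646.com']
```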
