1

I have a pattern compiled as

import logging
import re

pattern_strings = ['\xc2d', '\xa0', '\xe7', '\xc3\ufffdd', '\xc2\xa0', '\xc3\xa7', '\xa0\xa0', '\xc2', '\xe9']
join_pattern = '|'.join(pattern_strings)
pattern = re.compile(join_pattern)

and then I search a file for the pattern with

def find_pattern(path):
    with open(path, 'r') as f:
        for line in f:
            print(line)
            found = pattern.search(line)
            if found:
                print(dir(found))
                # found is a match object; group() returns the matched text
                logging.info('found - ' + found.group())

and the file at path contains

\xc2d 
d\xa0 
\xe7 
\xc3\ufffdd 
\xc3\ufffdd 
\xc2\xa0 
\xc3\xa7 
\xa0\xa0 
'619d813\xa03697' 

When I run this program, nothing happens.

It is not able to catch these patterns. What am I doing wrong here?

Desired output: every line, because each line contains at least one of the patterns.

Update

After changing the regex to

pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']

It is still the same, no output

UPDATE

After changing the regex to

pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
join_pattern = '[' + '|'.join(pattern_strings) + ']'
pattern = re.compile(join_pattern)

Things started to work, but only partially. The lines still not matched are

\xc2\xa0 
\xc3\xa7 
\xa0\xa0 

for which my pattern strings are ['\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0']

4
  • Is it possible the \x is escaped in the file, in which case you need to match \\x? Commented Jul 27, 2012 at 18:21
  • Are you looking for literal backslashes? I agree with Joran - this looks like an escape bug. Commented Jul 27, 2012 at 18:22
  • Yes, I am looking for literal backslashes. Commented Jul 27, 2012 at 18:22
  • Use join_pattern = "("+"|".join(pattern_strings)+")" instead of [ ], since [] only matches single chars. You should also order your list from longest to shortest. Commented Jul 27, 2012 at 18:40
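
To illustrate the last comment, here is a minimal sketch (Python 3 syntax) of the difference between [ ] and ( ). It assumes the text really contains literal backslashes; the names alternatives, char_class and alternation are made up for this example.

```python
import re

# Regex sources for a few of the literals, longest first. Each \\ inside
# a raw string is a regex-escaped backslash: it matches one literal backslash.
alternatives = [r'\\xc2\\xa0', r'\\xa0\\xa0', r'\\xc2', r'\\xa0']

# [...] is a character class: it matches ONE character out of the set
# of all characters listed, so the alternatives are never seen as strings.
char_class = re.compile('[' + '|'.join(alternatives) + ']')

# (...) groups whole alternatives, so each branch is a complete string.
alternation = re.compile('(' + '|'.join(alternatives) + ')')

line = r'\xc2\xa0'
print(char_class.search(line).group())   # one character: the leading backslash
print(alternation.search(line).group())  # the full eight-character literal
```

Putting the longest alternatives first matters because the regex engine takes the first branch that matches at a position, so a short prefix listed early would win over the longer string.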

2 Answers

2

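Escape the \ in the search patterns twice: once for the Python string literal and once for the regex engine. That means r"\\xa0" or "\\\\xa0". With only "\\xa0" (or r"\xa0") the regex engine still receives \xa0 and reads it as the hex escape for a single character, so it never matches the four literal characters in the file.

Do this:

 [r'\\xc2d', r'\\xa0', r'\\xe7', r'\\xc3\\ufffdd', r'\\xc2\\xa0', r'\\xc3\\xa7', r'\\xa0\\xa0', r'\\xc2', r'\\xe9']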

like everyone's been saying to do, except the one guy you listened to...
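
A minimal sketch of the whole search with the backslashes matched literally, in Python 3 syntax. It assumes the file really contains the characters backslash, x, c, 2 and so on. Instead of doubling the backslashes by hand it lets re.escape do it, and it takes an iterable of lines rather than a path so it can be tried without a file; the sample list is copied from the question's input.

```python
import re

# The literal strings from the question, written as raw strings so that
# Python keeps every backslash as an ordinary character.
pattern_strings = [r'\xc2d', r'\xa0', r'\xe7', r'\xc3\ufffdd', r'\xc2\xa0',
                   r'\xc3\xa7', r'\xa0\xa0', r'\xc2', r'\xe9']

# re.escape() makes the regex engine treat each backslash literally;
# the longest alternatives go first so they win over their own prefixes.
ordered = sorted(pattern_strings, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(s) for s in ordered))

def find_pattern(lines):
    """Return (line, matched_text) for every line containing a literal."""
    hits = []
    for line in lines:
        found = pattern.search(line)
        if found:
            hits.append((line.rstrip('\n'), found.group()))
    return hits

sample = [r'\xc2d ', r'd\xa0 ', r'\xe7 ', r'\xc3\ufffdd ', r'\xc2\xa0 ',
          r'\xc3\xa7 ', r'\xa0\xa0 ', r"'619d813\xa03697' "]
for line, text in find_pattern(sample):
    print(line, '->', text)
```

For a real file, pass the open file object instead of the list, for example inside a with open(path) as f: block.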

0

Does your file actually contain \xc2d --- that is, five characters: a backslash followed by c, then 2, then d? If so, your regex won't match it. Each of your regexes will match one or two characters with certain character codes. If you want to match the string \xc2d your regex needs to be \\xc2d.
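
To make the two layers of escaping concrete, here is a small sketch (Python 3 syntax) that tries several spellings of the pattern against the five-character text. The variable names are only for illustration.

```python
import re

text = r'\xc2d'  # five characters: backslash, x, c, 2, d

sources = [
    '\xc2d',          # Python turns \xc2 into ONE character, then 'd'
    '\\xc2d',         # Python yields \xc2d; the regex engine then reads
                      # \xc2 as a hex escape, again one character
    r'\\xc2d',        # regex engine sees \\xc2d: escaped backslash + 'xc2d'
    re.escape(text),  # lets the library add the regex-level escaping
]

results = [re.search(source, text) is not None for source in sources]
for source, matched in zip(sources, results):
    print(repr(source), 'matches' if matched else 'does not match')
```

Only the last two spellings reach the regex engine with a doubled backslash, so only those two match the literal text.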
