
I have a string of characters that includes [a-z] as well as á,ü,ó,ñ,å,... and so on. Currently I am using regular expressions to get every line in a file that includes these characters.

Sample of spanishList.txt:

adan
celular
tomás
justo
tom
átomo
camara
rosa
avion

Python code (charactersToSearch comes from flask @application.route('/<charactersToSearch>')):

print (charactersToSearch)
#'átdsmjfnueó'
...
#encode
charactersToSearch = charactersToSearch.encode('utf-8')
query = re.compile('[' + charactersToSearch + ']{2,}$', re.UNICODE).match
words = set(word.rstrip('\n') for word in open('spanishList.txt') if query(word))
...

When I do this, I am expecting to get the words in the text file that include the characters in charactersToSearch. It works perfectly for words without special characters:

...
#after doing further searching for other conditions, return list of found words.
return '<br />'.join(sorted(set(word for (word, path) in solve())))
>>> adan
>>> justo
>>> tom

Only problem is that it ignores all words in the file that aren't ASCII. I should also be getting tomás and átomo.

I've tried encode, UTF-8, and ur'[...]', but I haven't been able to get it to work for all characters. The file and the program (# -*- coding: utf-8 -*-) are in UTF-8 as well.
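The core of the mismatch can be shown in two asserts (an editorial sketch, not part of the original question; behaves the same in Python 2 and 3):

```python
# -*- coding: utf-8 -*-
# After .encode('utf-8'), á is no longer one character but two bytes,
# so a regex character class built from the encoded string is a class
# of single bytes and can never line up with á as one unit.
accented = u"\xe1"                         # á, code point U+00E1
assert len(accented) == 1                  # one character
assert len(accented.encode("utf-8")) == 2  # two bytes: \xc3 \xa1
```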

  • Have you tried query = re.compile(u'[' + charactersToSearch + ']{2,}$', re.UNICODE).match without encoding charactersToSearch to UTF-8, instead just leaving it as unicode? Commented Jun 25, 2014 at 5:18
  • For clarification, are you considering á to be non-ASCII? It's dec 225 in the extended table. (But can also be represented as a + acute accent) Commented Jun 25, 2014 at 5:22
  • @JoranBeasley Yes. I've tried both ways but every time I get the list of words without any special characters included. Commented Jun 25, 2014 at 5:34
  • @zx81 I'm not sure how to respond. How do I check? Commented Jun 25, 2014 at 5:35

3 Answers


A different tack

I'm not sure how to fix it in your current workflow, so I'll suggest a different route.

This regex matches any character that is neither a whitespace character nor a letter in the extended ASCII range (such as A or é). In other words, if one of your words contains a weird character outside this set, the regex will match.

(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S

Of course this will also match punctuation, but I'm assuming we're only looking at words in an unpunctuated list; otherwise, excluding punctuation is not too hard.

As I see it, your challenge is to define your set.

In Python, you can do something like:

if re.search(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S", subject):
    # Successful match
else:
    # Match attempt failed
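To make the behavior concrete, here is a small check built on that pattern (an editorial sketch; has_weird_char is a hypothetical helper name, not from the answer):

```python
# -*- coding: utf-8 -*-
import re

# A word is flagged as "weird" if it contains any non-space character
# outside the extended-ASCII letter set [a-zÀ-ÿ] (minus ×Þß÷þø).
WEIRD = re.compile(u"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\\S")

def has_weird_char(word):
    return WEIRD.search(word) is not None

assert not has_weird_char(u"tomás")  # á is inside À-ÿ, so it is fine
assert not has_weird_char(u"adan")
assert has_weird_char(u"tom_42")     # _ and digits fall outside the set
```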



I feel your pain. Dealing with Unicode in Python 2.x is a headache.

The problem with that input is that Python sees "á" as the raw byte string '\xc3\xa1' instead of the unicode character u'\xe1'. So you're going to need to sanitize the input before passing the string into your regex.

To change a raw byte string into a unicode string:

char = "á"
## print char yields the infamous, and in Python unparsable, "\xc3\xa1",
## which is probably what the regex is not registering.
bytes_in_string = [byte for byte in char]
string = ''.join('%02x' % ord(byte) for byte in bytes_in_string)
new_unicode_string = unichr(int(string, 16))

There's probably a better way, because this is a lot of operations to get something ready for regex, which I think is supposed to be faster in some way than iterating & 'if/else'ing. Dunno though, not an expert.
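There is indeed a simpler route worth noting here (an editorial sketch, not part of the original answer; str.decode works the same way in Python 2 and 3):

```python
# -*- coding: utf-8 -*-
# The UTF-8 bytes \xc3\xa1 decode directly to the single code point
# U+00E1 (á); no manual hex juggling is needed.
raw = b"\xc3\xa1"
decoded = raw.decode("utf-8")
assert decoded == u"\xe1"   # á as one unicode character
assert len(decoded) == 1
```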

I used something similar to this to isolate the special-char words when I parsed Wiktionary, which was a wicked mess. As far as I can tell you're going to have to comb through that to clean it up anyway, so you may as well just:

for word in file:
    try:
        # In Python 2, calling .encode() on a byte string first decodes
        # it with the ASCII codec, which raises UnicodeDecodeError for
        # any non-ASCII bytes — exactly the words we want to catch.
        word.encode('UTF-8')
    except UnicodeDecodeError:
        your_list_of_special_char_words.append(word)

Hope this helped, and good luck!

On further research found this post:

Bytes in a unicode Python string


So when I try to change from a raw byte string to unicode I get an error. Assuming the input text of áaceimsonñpórxül, bytes_in_string gives me: ['\xc3', '\xa1', 'a', 'c', 'e', 'i', 'm', 's', 'o', 'n', '\xc3', '\xb1', 'p', '\xc3', '\xb3', 'r', 'x', '\xc3', '\xbc', 'l'] and then string prints c3a1616365696d736f6ec3b17c3b37278c3bc6c. Now I can see that, for example, á is made up of \xc3 & \xa1. When I compute new_unicode_string, the error I get says: ValueError: invalid literal for int() with base 10: 'c3a1616365696d736f6ec3b17c3b37278c3bc6c'...as it's not just numbers. Any suggestions?
I was able to fix the issue:

I was able to figure out the issue. After getting the string from the Flask app route, encode it (otherwise it gives you an error), and then decode both charactersToSearch and each word in the file.

charactersToSearch = charactersToSearch.encode('utf-8')

Then decode it in UTF-8. If you leave the previous line out, it gives you an error.

UNIOnlyAlphabet = charactersToSearch.decode('UTF-8')
query = re.compile('[' + UNIOnlyAlphabet + ']{2,}$', re.U).match

Lastly, when reading the UTF-8 file and using query, don't forget to decode each word in the file.

words = set(word.decode('UTF-8').rstrip('\n') for word in open('spanishList.txt') if query(word.decode('UTF-8')))

That should do it. Now the results show regular and special characters.

justo
tomás
átomo
adan
tom
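The decode-everything approach can be sketched end to end with an in-memory stand-in for spanishList.txt (an editorial example; the character set here is chosen so the expected matches are easy to verify by hand, and is not the one from the question):

```python
# -*- coding: utf-8 -*-
import re

# Build the pattern from a unicode character set, then decode each
# UTF-8 line before matching, exactly as in the answer above.
charactersToSearch = u"ámstu"
query = re.compile(u"[" + charactersToSearch + u"]{2,}$", re.U).match
lines = [b"m\xc3\xa1s\n", b"tu\n", b"rosa\n"]   # más, tu, rosa as UTF-8 bytes
words = set(w.decode("UTF-8").rstrip(u"\n") for w in lines
            if query(w.decode("UTF-8")))
assert words == {u"m\xe1s", u"tu"}   # rosa has r and o, which are not in the set
```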

