How to create a regex for tokenizing Java source code in Python

Question

Hello guys, I've been working on an interesting project involving some ML in Python and some Java source code. Basically, I need to tokenize each line of Java code with regular expressions, and sadly I haven't been able to do that.

I've been trying to create my own regular expression pattern for the last couple of days, with lots of googling and YouTube videos, because I didn't know how to do it myself in the beginning (I don't think I do now either :( ). I tried using tokenizing libraries, but they behave inconsistently, sometimes dropping semicolons and brackets and sometimes not.

def stringTokenizer(string):
    tokens = re.findall(r"[\w']+|[""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""\\]", string);
    print(tokens);

stringTokenizer('void addAction(String commandId, IHandler action);');

Initially I wanted to get the following output: ['void', 'addAction', '(', 'String', 'commandId', 'IHandler', 'action', ')', ';'], but sadly this is the closest I got to that result: ['void', 'addAction(', 'String', 'commandId', 'IHandler', 'action);']

If anybody could help, you'd be a lifesaver.

Just thought I should mention regex101 for testing your patterns.
– Buckeye14Guy, Commented Jun 11, 2019 at 19:37
@Wiktor Stribiżew Thanks! Just what I needed. I'm curious, though: could you explain what these patterns match? Also, would it work for any line of code? It clearly works for this one, so I'm guessing it would work for most others as well, but there are always exceptions to the rule.
– Христо Петков, Commented Jun 11, 2019 at 19:39

Wiktor Stribiżew · Accepted Answer · 2019-06-11 19:46:39Z

1

You want to match either chunks of one or more word or apostrophe chars, or single occurrences of any other char except whitespace.

Thus, you need

re.findall(r"[\w']+|[^\w\s']", s)
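As a minimal sketch of how this pattern behaves on the line from the question (the names `TOKEN_RE` and `tokenize` are made up for illustration and are not part of the original answer):

```python
import re

# Pattern from the answer: runs of word chars/apostrophes, or any single
# character that is not a word char, whitespace, or an apostrophe.
TOKEN_RE = re.compile(r"[\w']+|[^\w\s']")

def tokenize(line):
    """Return the list of tokens found in one line of source text."""
    return TOKEN_RE.findall(line)

tokens = tokenize('void addAction(String commandId, IHandler action);')
print(tokens)
# ['void', 'addAction', '(', 'String', 'commandId', ',', 'IHandler', 'action', ')', ';']
```

Note that the comma comes back as its own token. The desired output listed in the question leaves it out, so it would have to be filtered afterwards if that is really what is wanted.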

You might consider using this expression instead when an apostrophe should be part of a word chunk only when it sits between word chars:

re.findall(r"\w+(?:'\w+)*|[^\w\s]", s)
             ^^^^^^^^^^^^

See the regex demo and the regex graph.

Details

[\w']+ - a positive character class that matches one or more word chars (letters, digits, underscores, and some rarer chars that are considered "word") or apostrophes
| - or
[^\w\s'] - a negated character class that matches any 1 char other than word, whitespace chars and single apostrophes.
\w+(?:'\w+)* - matches 1+ word chars followed by 0 or more repetitions of ' and 1+ word chars.
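To make the difference between the two patterns concrete, here is a small sketch that runs both on a Java char literal and then on a compound operator. The names `SIMPLE` and `STRICT` are made up for illustration, and the sample lines are assumptions, not taken from the question.

```python
import re

# First pattern: an apostrophe anywhere is glued into the word chunk.
SIMPLE = re.compile(r"[\w']+|[^\w\s']")
# Second pattern: an apostrophe joins a chunk only between word chars.
STRICT = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

char_literal = "char c = 'a';"
simple_tokens = SIMPLE.findall(char_literal)
strict_tokens = STRICT.findall(char_literal)
print(simple_tokens)  # ['char', 'c', '=', "'a'", ';']
print(strict_tokens)  # ['char', 'c', '=', "'", 'a', "'", ';']

# Both patterns emit punctuation one character at a time,
# so a multi-character operator such as += is split in two.
compound_tokens = STRICT.findall("count += 1;")
print(compound_tokens)  # ['count', '+', '=', '1', ';']
```

So neither pattern is a full Java lexer: multi-character operators, string literals and comments are broken into single-character pieces, which may or may not matter depending on how the tokens are used downstream.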

edited Jun 11, 2019 at 19:46

answered Jun 11, 2019 at 19:41


1 Comment

Христо Петков Over a year ago

Thanks for everything, and most importantly for taking the time to answer the question. I hope this question and answer will be of use to others as well.
