Hello guys, I've been working on an interesting project involving some ML in Python and some Java source code. Basically, I need to tokenize each line of Java code with regular expressions, and sadly I haven't been able to do that.

I've been trying to create my own regular expression pattern for the last couple of days, with lots of googling and YouTubing, because I didn't know how to do it myself in the beginning (I don't think I do now either :( ). I tried using libraries for tokenizing, but those work in really weird ways, like sometimes missing semicolons and brackets and sometimes not.

import re

def stringTokenizer(string):
    tokens = re.findall(r"[\w']+|[""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""\\]", string)
    print(tokens)

stringTokenizer('void addAction(String commandId, IHandler action);')

Initially I wanted to get the following output: ['void', 'addAction', '(', 'String', 'commandId', 'IHandler', 'action', ')', ';'] but sadly this is the closest I got to the result: ['void', 'addAction(', 'String', 'commandId', 'IHandler', 'action);']

If anybody could help, you'd be a lifesaver.

  • Try r"[\w']+|[^\w\s']" Commented Jun 11, 2019 at 19:34
  • Just thought I should mention regex101 for testing your patterns Commented Jun 11, 2019 at 19:37
  • @Wiktor Stribiżew Thanks! Just what I needed. I'm curious, though: could you explain what these patterns detect? Also, would it work for any line of code? It clearly works for this one, so I'm guessing it would work for most others as well, but there are always exceptions to the rule. Commented Jun 11, 2019 at 19:39

1 Answer

You want to match either chunks of one or more word or apostrophe chars, or single occurrences of any other char except whitespace.

Thus, you need

re.findall(r"[\w']+|[^\w\s']", s)
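
As a quick check, here is a minimal sketch that applies this pattern to the line from the question. The function name `tokenize` is only an illustrative choice. Note that the comma comes out as its own token, so it would have to be filtered out afterwards if it is not wanted.

```python
import re

def tokenize(line):
    # Either a run of word chars/apostrophes, or one single char
    # that is not a word char, whitespace, or an apostrophe.
    return re.findall(r"[\w']+|[^\w\s']", line)

tokens = tokenize('void addAction(String commandId, IHandler action);')
print(tokens)
# ['void', 'addAction', '(', 'String', 'commandId', ',', 'IHandler', 'action', ')', ';']
```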

You might consider using this expression instead if an apostrophe should be part of a word chunk only when it appears between word chars:

re.findall(r"\w+(?:'\w+)*|[^\w\s]", s)
             ^^^^^^^^^^^^
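
To see where the two patterns differ, here is a small sketch that runs both on a made-up input containing a char literal and a contraction. The names `LOOSE` and `STRICT` and the sample string are assumptions for illustration only.

```python
import re

LOOSE = r"[\w']+|[^\w\s']"          # apostrophes join any word chunk
STRICT = r"\w+(?:'\w+)*|[^\w\s]"    # apostrophes join only between word chars

s = "char c = 'a'; // don't"

loose_tokens = re.findall(LOOSE, s)
strict_tokens = re.findall(STRICT, s)

print(loose_tokens)
# ['char', 'c', '=', "'a'", ';', '/', '/', "don't"]
print(strict_tokens)
# ['char', 'c', '=', "'", 'a', "'", ';', '/', '/', "don't"]
```

With the second pattern the quotes around 'a' become separate tokens, while don't still stays in one piece.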

See the regex demo.

Details

  • [\w']+ - a positive character class that matches one or more word chars (letters, digits, underscores, and some rarer chars that are considered "word") or apostrophes
  • | - or
  • [^\w\s'] - a negated character class that matches any 1 char other than word, whitespace chars and single apostrophes.
  • \w+(?:'\w+)* matches 1+ word chars followed with 0 or more repetitions of ' and 1+ word chars.
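
Because the second branch always matches exactly one char, multi-character operators and string literals are not kept together. A minimal sketch of that behavior, using a made-up Java line as input:

```python
import re

pattern = r"[\w']+|[^\w\s']"

# Hypothetical input line, chosen to show the splitting behavior.
java_line = 'x += "a b";'

tokens = re.findall(pattern, java_line)
print(tokens)
# ['x', '+', '=', '"', 'a', 'b', '"', ';']
```

So `+=` comes out as two tokens and the string literal is broken up at its quotes and inner whitespace. Whether that matters depends on what the tokens are used for.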

1 Comment

Thanks for everything, and most importantly for taking the time to answer the question. Hope this question/answer will be of use to others as well.
