Hello guys I've been working on an interesting project involving some ML in python and some Java source code. Basically I need to tokenize each line of Java code with regular expressions and sadly I haven't been able to do that.
I've been trying to create my own regular expression pattern for the last couple of days with lots of googling and youtubing because I didn't know how to do it myself in the begging(I don't think do now either :( ). I tried using libraries for tokenizing but those work in really weird ways like sometimes missiing semi-colons and brackets and sometimes not.
def stringTokenizer(string):
tokens = re.findall(r"[\w']+|[""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""\\]", string);
print(tokens);
stringTokenizer('void addAction(String commandId, IHandler action);');
Initially I wanted the to get the following output: ['void', 'addAction', '(', 'String', 'commandId', 'IHandler', 'action', ')', ';'] but saddly this is the closest I got to the result ['void', 'addAction(', 'String', 'commandId', 'IHandler', 'action);']
If anybody could help you'll be a lifesaver.

r"[\w']+|[^\w\s']"