I am trying to preprocess text before passing it to a StanfordCoreNLP server. Some of my text looks like this:

" Conversion of code written in C# to Visual Basic .NET (VB.NET)."

The ".NET" confuses the server: its leading dot is treated as a sentence-ending period, so the single sentence is split into two. I want to replace a '.' that appears at the start of a word with 'DOT' so that the sentence stays intact. Note that I don't want to change anything in 'VB.NET', because StanfordCoreNLP recognizes it as one word (a proper noun).

This is what I tried so far.

print(re.sub(r"\.(\S+)", r"DOT\g<0>", text))

The result looks like this.

Conversion of code written in C# to Visual Basic DOT.NET (VBDOT.NET).

I also tried adding word boundaries to the pattern, r"\b\.(\S+)\b", but that didn't work either.

Any help would be appreciated.

1 Answer

You can use

re.sub(r"\B\.\b", "DOT", text)

See the regex demo.

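The \B\.\b regex matches a dot that is either at the start of the string or immediately preceded by a non-word char, and that is immediately followed by a word char. The dot in "VB.NET" is preceded by a word char, so it is skipped, and the final period is not followed by a word char, so it is skipped too.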
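To make the boundary rule concrete, here is a small sketch that lists every dot in the sample sentence together with its neighbouring characters and whether \B\.\b matches it. The helper name describe_dots is made up for illustration and is not part of the re module.

```python
import re

text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."
pattern = re.compile(r"\B\.\b")

def describe_dots(s):
    """Return (index, char_before, char_after, matched) for every dot in s.

    Hypothetical helper, used only to illustrate the boundary checks.
    """
    # Start offsets of all dots that the pattern actually matches.
    matched_at = {m.start() for m in pattern.finditer(s)}
    rows = []
    for i, ch in enumerate(s):
        if ch == ".":
            before = s[i - 1] if i > 0 else ""
            after = s[i + 1] if i + 1 < len(s) else ""
            rows.append((i, before, after, i in matched_at))
    return rows

for row in describe_dots(text):
    print(row)
```

Only the dot preceded by a space and followed by "N" should be reported as matched; the dot inside "VB.NET" and the sentence-final period should not.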

See the Python demo:

import re
text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."
print(re.sub(r"\B\.\b", "DOT", text))
# => Conversion of code written in C# to Visual Basic DOTNET (VB.NET).
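
If the \B and \b shorthand is hard to read, the same condition can be spelled out with lookarounds. This is only an alternative sketch, not part of the original answer: (?<!\w) stands for "not preceded by a word char" and (?=\w) for "followed by a word char".

```python
import re

text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."

# Original pattern from the answer.
with_boundaries = re.sub(r"\B\.\b", "DOT", text)

# Alternative spelling with explicit lookarounds (an assumption-free
# restatement for this input, shown for readability only).
with_lookarounds = re.sub(r"(?<!\w)\.(?=\w)", "DOT", text)

print(with_boundaries)
print(with_boundaries == with_lookarounds)
```

Both calls should produce the same string for this sentence.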

2 Comments

Could you please explain why \b\.\b doesn't work?
@akalanka \b\.\b matches a . that is located in between two word chars, e.g. b.c, 1.a, _._.
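
A quick way to see the difference is to run both patterns over a few short strings and compare the match positions. The sample strings and the hits helper below are illustrative assumptions, not taken from the original question.

```python
import re

# Illustrative samples: the first three come from the comment above,
# the rest are made up to cover the other cases.
samples = ["b.c", "1.a", "_._", " .NET", "VB.NET", "end."]

def hits(pattern, s):
    """Return the start offsets of all matches of pattern in s."""
    return [m.start() for m in re.finditer(pattern, s)]

for s in samples:
    # Columns: sample, offsets matched by \b\.\b, offsets matched by \B\.\b
    print(repr(s), hits(r"\b\.\b", s), hits(r"\B\.\b", s))
```

\b\.\b only fires when the dot has word chars on both sides, which is exactly the "VB.NET" case the question wants left alone, while \B\.\b only fires for a leading dot such as the one in " .NET".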
