2

I am trying to replace every . (period) that lies within an alphanumeric word with the keyword XXX, in a large text.

For example: I am trying to match a.b.c.d.e ...
Expected output: I am trying to match aXXXbXXXcXXXdXXXe ...

Pattern I used: (\w+)([\.]+)(\w+)
Actual result: I am trying to match aXXXb.cXXXd.e ...

How can I get the expected output with a regex alone, without using any code/stubs?

3 Answers

1

You can use lookarounds:

str = str.replaceAll("(?<=[a-zA-Z0-9])\\.(?=[a-zA-Z0-9])", "XXX");
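A minimal sketch of why the lookaround version behaves differently from the pattern in the question. The question's pattern consumes the word characters on both sides of the period, so the `b` in `a.b` is no longer available as the left side of the next match. Lookarounds are zero-width, so only the period itself is consumed. The class name, method names, and the `$1XXX$3` replacement string are assumptions for illustration; the question does not show the replacement it used.

```java
import java.util.regex.Pattern;

class LookaroundDemo {
    // Pattern from the question: consumes the word characters around the period(s).
    static final Pattern CONSUMING = Pattern.compile("(\\w+)([\\.]+)(\\w+)");
    // Pattern from this answer: asserts the neighbours without consuming them.
    static final Pattern LOOKAROUND =
            Pattern.compile("(?<=[a-zA-Z0-9])\\.(?=[a-zA-Z0-9])");

    // Assumed replacement string; the question does not state which one was used.
    static String withConsuming(String s) {
        return CONSUMING.matcher(s).replaceAll("$1XXX$3");
    }

    static String withLookaround(String s) {
        return LOOKAROUND.matcher(s).replaceAll("XXX");
    }

    public static void main(String[] args) {
        String input = "I am trying to match a.b.c.d.e ...";
        System.out.println(withConsuming(input));  // every second period is skipped
        System.out.println(withLookaround(input)); // every inner period is replaced
    }
}
```

Compiling each pattern once and reusing it, as above, also avoids recompiling on every call when the same replacement is applied to many documents.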

RegEx Demo

Lookaround Reference

3 Comments

Thanks a lot, Anubhava. The regex worked great :) I read that using lookarounds may slow down performance. I have large documents on which I have to run these sorts of regexes. I will give it a try. Thanks again!
This is one of the simpler lookarounds, so I don't see performance being an issue here.
I tried lookarounds on a 3 MB document, which takes around 180 seconds. As an alternative, I tried the following: str = str.replaceAll("((\s|\a|\A|\b)[a-zA-Z0-9]+)([\.])+([a-zA-Z0-9]+(\s|\z|\Z|\b))", "$1XXX$4"); which finishes in less than one second. But the issue is the same again (aXXXb.cXXXd.e). Can you please suggest an alternative solution with high performance?
0

If you want to change every period, why not do something like this:

str = str.replaceAll("\\.", "XXX");

Or the following, if you don't want to change a period at the first or last index of the string:

str = str.replaceAll("\\.", "XXX").replaceAll("^XXX", ".").replaceAll("XXX$", ".");

1 Comment

Thanks Raman. I am only interested in replacing .(period) with XXX within an alphanumeric word, not all periods in a given text. For example: Trying to replace a.b.c.d.e and $.b .... Expected output is: Trying to replace aXXXbXXXcXXXdXXXe and $.b ..... Any help would be appreciated.
0

Solution 1: Match Dots Together + Use Replace Function

If possible, I'd suggest a slightly different approach to using the regex:

@Test
public void test_regex_replace() {
    var input = "I am trying to match a.b.c.d.e ...";
    var expectedOutput = "I am trying to match aXXXbXXXcXXXdXXXe ...";
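    // Each repetition matches a word segment followed by one or more dots; the final \w+ closes the word.
    var regex = Pattern.compile("((\\w+)([\\.]+))+(\\w+)");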
    var output = regex.matcher(input).replaceAll(match -> match.group().replace(".", "XXX"));
    assertEquals(expectedOutput, output);
}

Notice how I changed the pattern from
(\w+) ([\.]+) (\w+)
to
((\w+)([\.]+))+ (\w+)
so that it matches whole words containing multiple dots. Note that it replaces a..b with aXXXXXXb instead of aXXXb; if you want the latter, modify the lambda slightly, e.g.:

regex.matcher(input).replaceAll(match -> match.group().replaceAll("\\.+", "XXX"));
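To make the difference between the two lambdas concrete, here is a small sketch that runs both against an input containing a double dot. It assumes the pattern `((\w+)([\.]+))+(\w+)` from the explanation above and Java 9 or later (for the `Matcher.replaceAll` overload that takes a function); the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class DotRunDemo {
    // Matches a whole word made of segments separated by runs of dots.
    static final Pattern WORD_WITH_DOTS =
            Pattern.compile("((\\w+)([\\.]+))+(\\w+)");

    // Replaces every single dot inside the matched word.
    static String perDot(String s) {
        return WORD_WITH_DOTS.matcher(s)
                .replaceAll(m -> m.group().replace(".", "XXX"));
    }

    // Replaces each run of consecutive dots inside the matched word once.
    static String perRun(String s) {
        return WORD_WITH_DOTS.matcher(s)
                .replaceAll(m -> m.group().replaceAll("\\.+", "XXX"));
    }

    public static void main(String[] args) {
        String input = "a..b c.d ...";
        System.out.println(perDot(input)); // aXXXXXXb cXXXd ...
        System.out.println(perRun(input)); // aXXXb cXXXd ...
    }
}
```

One caveat: the string returned by the lambda is itself interpreted as a replacement string, so `$` and `\` would be special there. That is harmless here because a match can only contain word characters and dots.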

or something that avoids the inner regex call and replaces any run of consecutive dots with a single XXX:

@Test
public void test_regex_replace() {
    final String input = "I am trying to match a.b.c.d.e ...";
    final String expectedOutput = "I am trying to match aXXXbXXXcXXXdXXXe ...";
    final Pattern regex = Pattern.compile("(?:(\\w+)\\.+)+\\w+");
    final String output = regex.matcher(input).replaceAll(match -> {
        final String matchText = match.group();
        final int matchTextLength = matchText.length();
        final var sb = new StringBuilder();
        int lastEnd = 0;
        while (lastEnd < matchTextLength) {
            int endOfWord = lastEnd;
            while (endOfWord < matchTextLength && matchText.charAt(endOfWord) != '.') {
                endOfWord += 1;
            }
            sb.append(matchText, lastEnd, endOfWord);
            int endOfDots = endOfWord;
            // Skip over the whole run of consecutive dots.
            while (endOfDots < matchTextLength && matchText.charAt(endOfDots) == '.') {
                endOfDots += 1;
            }
            if (endOfDots != endOfWord) {
                sb.append("XXX");
            }
            lastEnd = endOfDots;
        }
        return sb.toString();
    });
    assertEquals(expectedOutput, output);
}

This avoids the problem of reusing some characters as both the left and right side of the dot by matching them together. Not sure about the performance, but it does not use any lookarounds, so I expect it to perform rather well.


Solution 2: Using Word Boundary

You mentioned "without using any code/stubs", so the approach above might not fit your problem; in that case you would normally have to use lookarounds. Other than these, the only thing I can think of is using \b (the word boundary symbol) in the regex, like so:

@Test
public void test_regex_replace() {
    final String input = "I am trying to match a.b.c.d.e ...";
    final String expectedOutput = "I am trying to match aXXXbXXXcXXXdXXXe ...";
    final String output = input.replaceAll("\\b\\.+\\b", "XXX");
    assertEquals(expectedOutput, output);
}
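A short sketch of how the word-boundary pattern behaves on the edge cases raised in this thread, compared with the lookaround pattern from the other answer. Note one difference: `\b` is defined in terms of word characters, which include the underscore, while the lookaround pattern lists only `[a-zA-Z0-9]`. The class and method names, and the `x_.y` input, are assumptions added for illustration.

```java
class WordBoundaryDemo {
    // Word-boundary version: a run of dots with a word character on each side.
    static String withWordBoundary(String s) {
        return s.replaceAll("\\b\\.+\\b", "XXX");
    }

    // Lookaround version from the other answer, for comparison.
    static String withLookaround(String s) {
        return s.replaceAll("(?<=[a-zA-Z0-9])\\.(?=[a-zA-Z0-9])", "XXX");
    }

    public static void main(String[] args) {
        // Periods after '$' or in a trailing run are left alone.
        System.out.println(withWordBoundary("Trying to replace a.b.c.d.e and $.b ...."));
        // A run of dots inside a word collapses to a single XXX.
        System.out.println(withWordBoundary("a..b"));
        // Underscore counts as a word character for \b, but not for the lookaround class.
        System.out.println(withWordBoundary("x_.y"));
        System.out.println(withLookaround("x_.y"));
    }
}
```

If underscores must not count as part of a word, the explicit character class in the lookaround version is the safer choice.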
