2

After extracting text from PDFs, I need to clean up some artifacts. I need to replace every string of this form:

String example="the sun was shin- ing  and the sky bl- ue";

with this form:

String fixxed="the sun was shining  and the sky blue";

I'm not an expert in regular expressions. I tried the following, but it gives the wrong result:

String pattern="([\\w])+([\\-])+([\\s])";
String fixxed = text.replaceAll(pattern, "$1");

One important requirement: the substring should be replaced only if the character before '-' is a letter (not a space and not a digit).

1
  • Please check the answers below and let me know if you need more help. Commented Oct 3, 2020 at 15:11

5 Answers

3

Do it as follows:

public class Main {
    public static void main(String[] args) {
        String example = "the sun was shin- ing  and the sky bl- ue";
        example = example.replaceAll("\\-\\s+", "");
        System.out.println(example);
    }
}

Output:

the sun was shining  and the sky blue

1 Comment

Thanks for the reply. I forgot an important specification. I only have to replace the substring if the character before '-' is a letter (not a space and not a number).
2

You can capture the letters before the hyphen and the letter after it, then join them:

public static void main(String[] args) {
    String example = "the sun was shin- ing  and the sky bl- ue a - a 1-2 1 - 2";
    String pattern = "(\\w+)-\\s(\\w)";

    String newExample = example.replaceAll(pattern, "$1$2");
    System.out.println(newExample);
}


Output:

the sun was shining  and the sky blue a - a 1-2 1 - 2
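
One caveat worth checking: \w also matches digits and the underscore, so with the pattern above an input such as 1- 2 would be joined into 12. Below is a minimal sketch of a letters-only variant, assuming ASCII letters are enough for your text. The class and method names are made up for illustration.

```java
public class LetterOnlyJoin {
    // Pattern from the answer above: \w accepts letters, digits and '_'.
    static String joinWordChars(String text) {
        return text.replaceAll("(\\w+)-\\s(\\w)", "$1$2");
    }

    // Letters-only variant: keep the letter before the hyphen ($1), drop the
    // hyphen and one whitespace character, and require a letter right after.
    static String joinLetters(String text) {
        return text.replaceAll("([A-Za-z])-\\s(?=[A-Za-z])", "$1");
    }

    public static void main(String[] args) {
        String sample = "bl- ue 1- 2 a - a";
        System.out.println(joinWordChars(sample)); // blue 12 a - a
        System.out.println(joinLetters(sample));   // blue 1- 2 a - a
    }
}
```

The lookahead leaves the following letter unconsumed, so chained breaks such as bl- u- es are still joined in a single pass.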

4 Comments

Thanks for the reply. I forgot an important requirement: the substring should be replaced only if the character before '-' is a letter (not a space and not a digit). The following examples must not be replaced: a - a , 1 - a , 1 - 1
@Brianzaska The example replaces only when the dash - has a letter directly before it. (\\w*)- stands for "any sequence of word characters, of any length, that ends with a dash". It ignores other parts of the string that have a dash but no letter before it. The example you posted should work with the solution above.
If I try it with "a - a", it replaces the string with "a a", but I don't want that replaced. For instance, the string "a - a 1-2 1 - 2" must be left unchanged. Thanks
@Brianzaska I have updated the post and it should no longer replace a - a
0

You can use the replaceAll() method of String to replace every substring that matches a regular expression.

According to Oracle docs,

replaceAll(String regex, String replacement)

Replaces each substring of this string that matches the given regular expression with the given replacement.

So, in your case you can write:

 String example = "the sun was shin- ing  and the sky bl- ue";
 System.out.println(example.replaceAll("- ",""));

or

String example = "the sun was shin- ing  and the sky bl- ue";
System.out.println(example.replaceAll("\\-\\s+",""));

Both versions print:

 the sun was shining  and the sky blue

1 Comment

Thanks for the reply. I forgot an important requirement: the substring should be replaced only if the character before '-' is a letter (not a space and not a digit). The following examples must not be replaced: a - a , 1 - a , 1 - 1
0

To replace the substring only if the character before - is a word character, you could use lookarounds to assert a word character (\w) on the left and on the right. Note that \w matches letters, digits, and the underscore.

This will turn bl- ue into blue and also turn bl- u- es into blues

(?<=\w)-\s(?=\w)


For example

String example = "the sun was shin- ing and the sky bl- ue or bl- u- es";
System.out.println(example.replaceAll("(?<=\\w)-\\s(?=\\w)", ""));

Output

the sun was shining and the sky blue or blues
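
If the cleanup runs over many extracted strings, the same expression can be compiled once and reused. This is only a sketch: the class name Dehyphenator and the method fix are hypothetical, and the regex is the one shown above.

```java
import java.util.regex.Pattern;

public class Dehyphenator {
    // Compiled once: a hyphen plus one whitespace character, with a word
    // character required on both sides (lookarounds consume nothing).
    private static final Pattern TORN_WORD = Pattern.compile("(?<=\\w)-\\s(?=\\w)");

    // Removes the hyphen and the whitespace after it wherever the pattern matches.
    static String fix(String text) {
        return TORN_WORD.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(fix("the sky bl- ue or bl- u- es")); // the sky blue or blues
        System.out.println(fix("a - a 1-2 1 - 2"));             // a - a 1-2 1 - 2
    }
}
```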

If you don't want to change:

bl-
ue

to

blue

you could use \h, which matches a horizontal whitespace character, instead of \s, which also matches a newline.

(?<=\w)-\h(?=\w)
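
To make the difference concrete, here is a small sketch comparing the two patterns on a string that contains a line break. The class and method names are illustrative only.

```java
public class HorizontalWhitespaceDemo {
    // \s matches any whitespace, including the newline after "bl-".
    static String joinAnyWhitespace(String text) {
        return text.replaceAll("(?<=\\w)-\\s(?=\\w)", "");
    }

    // \h matches only horizontal whitespace (space, tab, ...), so a hyphen
    // at the end of a line is left alone.
    static String joinHorizontalOnly(String text) {
        return text.replaceAll("(?<=\\w)-\\h(?=\\w)", "");
    }

    public static void main(String[] args) {
        String text = "shin- ing\nbl-\nue";
        System.out.println(joinAnyWhitespace(text));  // shining, then blue on the next line
        System.out.println(joinHorizontalOnly(text)); // shining, then bl- and ue on separate lines
    }
}
```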



0

You can use the following solution with text in any natural language, even those using diacritics:

(\p{L}\p{M}*+)-\h(?=\p{L})

Or, with \h+, if there can be more than one space between the hyphen and the following letter:

(\p{L}\p{M}*+)-\h+(?=\p{L})

Replace \h with \s if there can be a line break between the parts of a torn word.

Replace the matches with the $1 backreference, which reinserts the contents of Group 1.

  • (\p{L}\p{M}*+) - Group 1: any Unicode letter followed by zero or more combining marks (diacritics), matched possessively
  • - - a hyphen
  • \h+ / \s+ - one or more horizontal / any whitespace chars
  • (?=\p{L}) - a positive lookahead that requires the next char to be any Unicode letter.

See the Java code:

String text = "the sun was shin- ing  and the sky bl- ue";
System.out.println(text.replaceAll("(\\p{L}\\p{M}*+)-\\s+(?=\\p{L})", "$1"));
// => the sun was shining  and the sky blue
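
As a quick check of the diacritics claim, the sketch below feeds the same pattern a word whose letter before the hyphen carries a combining accent (e followed by U+0301). The class and method names are hypothetical.

```java
public class UnicodeJoinDemo {
    // Group 1 keeps the letter together with any combining marks; the hyphen
    // and the whitespace after it are dropped when a letter follows.
    static String join(String text) {
        return text.replaceAll("(\\p{L}\\p{M}*+)-\\s+(?=\\p{L})", "$1");
    }

    public static void main(String[] args) {
        // "e\u0301" is 'e' plus a combining acute accent.
        String torn = "re\u0301- sume\u0301 1- a";
        // The torn word is joined; "1- a" stays because '1' is not a letter.
        System.out.println(join(torn).equals("re\u0301sume\u0301 1- a")); // true
    }
}
```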
