My example input text file:

92721662,5819.53,2019 - 10 - 10,04332977,5938.30,.00,118.77 -

92721664,5510.56,2019 - 10 - 10,04332978,5623.02,.00,112.46 -

92730321,22805.90,2019 - 10 - 15,04354360,23350.20,.00,544.30 -

The last regex I have tried is:

var requestbody3 = Regex.Replace(requestbody2, @" { 3 ,}[\r\n]", "");

Here requestbody2 is the result of calling File.ReadAllText() on the "testinput.txt" file.

The goal is to remove only the blank lines that contain 3 or more spaces and end with \r\n, leaving the individual data lines without gaps between them.

  • Why not use File.ReadLines().Where(l => !string.IsNullOrWhiteSpace(l)) instead? Regex seems overkill for this. Commented Feb 25, 2020 at 18:32
  • Why would you want to keep a string made of only two whitespaces? Commented Feb 25, 2020 at 18:36

2 Answers

You can avoid Regex entirely for this, which I highly suggest.

Instead of reading your file as one giant string, get the lines using the built-in method File.ReadLines(), then remove the blank lines with LINQ.

So all together your code should just be:

IEnumerable<string> lines = File.ReadLines("testinput.txt").Where(line => !string.IsNullOrWhiteSpace(line));
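If the text is already in memory as a single string, as with requestbody2 in the question, the same whitespace filter can be applied without re-reading the file. The following is only a minimal sketch: the class name BlankLineFilter and the method RemoveBlankLines are hypothetical, and it assumes you are happy to have the surviving lines re-joined with Environment.NewLine (so a trailing newline in the input is dropped).

```csharp
using System;
using System.Linq;

public static class BlankLineFilter
{
    // Hypothetical helper (not part of the original answer):
    // drops every line that is empty or consists only of whitespace.
    public static string RemoveBlankLines(string text)
    {
        var lines = text
            // Split on Windows (CRLF) and Unix (LF) line endings.
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
            // Keep only lines that contain at least one non-whitespace character.
            .Where(line => !string.IsNullOrWhiteSpace(line));

        // Re-join with the platform's line ending; no trailing newline is added.
        return string.Join(Environment.NewLine, lines);
    }
}
```

It could be called as var requestbody3 = BlankLineFilter.RemoveBlankLines(requestbody2);. Unlike the regex in the question, this removes every whitespace-only line, including empty ones and those with fewer than 3 spaces.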

2 Comments

@aloisdgmovingtocodidact.com You mean ||
And if you really want to accept string made of one or two spaces: var requestbody3 = File.ReadLines("testinput.txt").Where(x=> x.Length < 3 || !string.IsNullOrWhiteSpace(x)); (fixed. Ty @Pluto)

The crux of your problem is that the regex contains extraneous white space and isn't behaving as a "three or more" quantifier. Simply don't put spaces inside the curly brackets:

//three or more spaces followed by windows or unix newline
" {3,}\r?\n"

Consider also:

  • use \s instead of a literal space if you want to match any whitespace character; bear in mind that \s also matches tabs, CR and LF
  • don't use [\r\n], because it means "one of CR or LF": if your file has CRLF line endings it will match the CR and remove it but not the LF, so your file will still have new lines but with corrupt/mixed line endings. The correct regex matches 0 or 1 CR followed by 1 LF: \r?\n
  • per Pluto's comment, you could start your regex with a caret, to prevent matching lines that contain some text and then end with 3 or more spaces: ^\s{3,}\r?\n - note that you'll also need to enable RegexOptions.Multiline so that ^ matches at the start of every line - without it the regex engine treats the entire input as one string, so ^ only matches at the start of the file, not the start of each line
  • alternatively you can use a positive lookbehind to ensure that only sequences of spaces preceded by a newline character are matched. The preceding newline is not made part of the match, so it doesn't get replaced: (?<=\n)\s{3,}\r?\n. The downside of this is that it can't match the very first line of the file, so we need yet another extension, to say "match the start of input or a newline, followed by 3+ spaces, followed by LF or CRLF", which is: (^|(?<=\n))\s{3,}\r?\n
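The anchored variant from the list above can be sketched in C# roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the class name WhitespaceLineRegex and the method Strip are hypothetical, and [ \t] is deliberately swapped in for \s so that the quantifier cannot run across line breaks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceLineRegex
{
    // ^        start of a line (because of RegexOptions.Multiline)
    // [ \t]{3,} three or more spaces or tabs (assumption: used here instead of \s)
    // \r?\n    the line's LF or CRLF terminator, removed together with the spaces
    private static readonly Regex BlankLine =
        new Regex(@"^[ \t]{3,}\r?\n", RegexOptions.Multiline);

    // Hypothetical helper: removes whole lines made of 3+ spaces/tabs.
    public static string Strip(string text)
    {
        return BlankLine.Replace(text, "");
    }
}
```

With this pattern, WhitespaceLineRegex.Strip("a\r\n     \r\nb\r\n") returns "a\r\nb\r\n", while a line such as "a   \r\n" that merely ends in spaces is left alone, because the caret requires the run of spaces to begin at the start of a line.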

Overkill, but a nice learning journey. Maybe consider using one of the routes suggested that doesn't use regex :)

3 Comments

Consider adding some carrots to the regular expression salad to avoid merging lines that end with whitespace but are not solely composed of whitespace.
While regex is indeed overkill for the example, the actual file contains thousands of lines. Essentially I am having to clean up an email and all the associated junk that goes with it. I am constrained by the environment to C# and regex tools; no additional libraries, such as htmltidy, may be added or used.
@Six if your file contains thousands of lines then using ReadAllText() is going to be much slower than enumerating line by line using File.ReadLines().
