I want to remove part of a text using a regular expression in C#. The text looks like this:

BEGIN:VNOTE
VERSION:1.1
BODY;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:Penguins are among the most popular of all birds. They only live in and around the South Pole and the continent of Antarctica.No wild penguins live at the North Pole. There are many different kinds of penguins. The largest penguin is called the Emperor Penguin, and the smallest kind of penguin is the Little Blue Penguin. There are 17 different kinds of penguins in all, and none of them can fly

As a result, I want to remove this part of the text:

BEGIN:VNOTE
VERSION:1.1
BODY;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:

The text between BEGIN and PRINTABLE: can vary, so I wrote this code (latest version):

var start = "BEGIN";
var end = "PRINTABLE:";
var regEx = string.Format("{0}(.*|\n){1}", start, end);
var result = Regex.Replace(sourceText, regEx, string.Empty);

But it doesn't work. I tried many different variants of the regex with the same result. Any ideas what my regex should look like?

Thank you for any advice.

4 Comments
  • Perhaps you need to reverse your logic and match what you want to retrieve rather than delete what you don't. Would that be a possible approach? Commented Mar 6, 2016 at 11:37
  • @Filkolev: The case here calls for matching the unwanted part and replacing it with an empty string. That would be simpler. Commented Mar 6, 2016 at 11:44
  • Use the right tool for the job. The encoding doesn't have to be quoted-printable, and you shouldn't just ignore the character set. You'd better use a suitable VCard/VCalendar/VNote parser library that can properly read this format. Commented Mar 6, 2016 at 11:47
  • What are you expecting the (.*|\n) part to achieve? It will match either the .* or the single character \n. It may not even match the \n because normally it would be written as \\n within a regular expression. Another option is to use the @"..." syntax for strings. Commented Mar 6, 2016 at 15:25
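
To illustrate the question raised in the last comment, here is a minimal sketch of what the original pattern does. In a regular (non-verbatim) C# string, "\n" is already a newline character, and the regex engine matches it literally, so the escaping is not the issue. The group (.*|\n) is simply attempted once: either a run of non-newline characters or a single newline, and neither can bridge three lines. The class and method names below are illustrative only, and the repeated-group variant is one possible fix, not the only one.

```csharp
using System.Text.RegularExpressions;

public static class OriginalPatternDemo
{
    // Pattern from the question: the group (.*|\n) is tried only once,
    // so it cannot span from BEGIN on line 1 to PRINTABLE: on line 3.
    public static string WithOriginalPattern(string sourceText)
    {
        return Regex.Replace(sourceText, "BEGIN(.*|\n)PRINTABLE:", string.Empty);
    }

    // Illustrative fix: repeat the group lazily so it can cross line breaks
    // and stops at the first PRINTABLE: it reaches.
    public static string WithRepeatedGroup(string sourceText)
    {
        return Regex.Replace(sourceText, "BEGIN(.|\n)*?PRINTABLE:", string.Empty);
    }
}
```

With the sample text from the question, WithOriginalPattern finds no match and returns the input unchanged, while WithRepeatedGroup returns only the text after PRINTABLE:.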

2 Answers

You should match everything from BEGIN through PRINTABLE:. The following regex does that.

Regex: BEGIN.+?PRINTABLE:

Flags used:

  • g for a global search (replace every match).

  • s to allow . to match newlines.

Replacement to do: Replace with empty string.
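
In C# this might look like the sketch below. .NET has no g flag because Regex.Replace already replaces every match, and the s flag corresponds to RegexOptions.Singleline. The class and method names are made up for the example.

```csharp
using System.Text.RegularExpressions;

public static class VNoteHeader
{
    // Removes everything from BEGIN through the first PRINTABLE: that follows it.
    public static string Strip(string sourceText)
    {
        // Singleline lets '.' match '\n', so the match can span several lines.
        // The lazy '+?' stops at the first PRINTABLE: instead of the last one.
        return Regex.Replace(
            sourceText,
            "BEGIN.+?PRINTABLE:",
            string.Empty,
            RegexOptions.Singleline);
    }
}
```

Calling VNoteHeader.Strip(sourceText) on the text from the question should leave only the body that starts with "Penguins are among".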

Regex101 Demo

Edit #1: Changed the regex to use a lazy quantifier. Thanks to Jan for the edit.

5 Comments

  • The dot-star soup in single-line mode consumes everything to the end of the string and then backtracks. I changed your solution to use a lazy quantifier; do you see the significant reduction in steps (428 vs. 71)? It's sometimes more important to come to an end (and fail, eventually) than to try every possible step.
  • @Jan: Thanks, I realize that now. That is a significant improvement over my solution.
  • To make it clear: your solution takes 60590 steps, and the lazy quantifier still takes 71 steps.
  • A lazy quantifier is not only more efficient, but also seems more appropriate. What if the text that is to remain after the replacement contains the string "PRINTABLE:"? A greedy quantifier will then remove a portion of the result and lead to a wrong answer.
  • @Filkolev: Yes, fortunately that's not the case here. In that case, some other logic would have to be applied based on the content.
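
The greedy versus lazy point from the comments can be checked with a small sketch. It assumes a note whose body happens to contain "PRINTABLE:" itself; that sample body and the class and method names are invented for the illustration.

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Greedy: '.+' runs to the end of the input and backtracks to the LAST PRINTABLE:.
    public static string RemoveGreedy(string text)
    {
        return Regex.Replace(text, "BEGIN.+PRINTABLE:", string.Empty, RegexOptions.Singleline);
    }

    // Lazy: '.+?' expands one character at a time and stops at the FIRST PRINTABLE:.
    public static string RemoveLazy(string text)
    {
        return Regex.Replace(text, "BEGIN.+?PRINTABLE:", string.Empty, RegexOptions.Singleline);
    }
}
```

For a body such as "The word PRINTABLE: appears here too", the greedy version also eats the start of the body and leaves " appears here too", while the lazy version keeps the body intact.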

I've been parsing text files for 40 years and have used Regex a lot. Regex will not always work and is not the best tool for parsing every type of text file. The state-machine technique below will work and is better suited to parsing files into multiple sections.

using System;
using System.IO;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        // Parser states: first look for the BEGIN: line, then for the BODY; line.
        public enum State
        {
            FIND_BEGIN,
            FIND_END
        }

        static void Main(string[] args)
        {
            string input =
                "BEGIN:VNOTE\n" +
                "VERSION:1.1\n" +
                "BODY;CHARSET=UTF-8;ENCODING=QUOTED-PRINTABLE:Penguins are among the most popular of all birds. They only live in and around the South Pole and the continent of Antarctica.No wild penguins live at the North Pole. There are many different kinds of penguins. The largest penguin is called the Emperor Penguin, and the smallest kind of penguin is the Little Blue Penguin. There are 17 different kinds of penguins in all, and none of them can fly\n";
            StringReader reader = new StringReader(input);

            // Collects the header section (BEGIN: through the BODY; prefix).
            StringBuilder builder = new StringBuilder();
            StringWriter writer = new StringWriter(builder);

            // Receives the text that follows the header, which is what the question asks for.
            string body = "";

            State state = State.FIND_BEGIN;
            string inputLine = "";
            bool end = false;
            while ((inputLine = reader.ReadLine()) != null)
            {
                switch (state)
                {
                    case State.FIND_BEGIN:
                        // Ignore everything before the BEGIN: line.
                        if (inputLine.StartsWith("BEGIN:"))
                        {
                            writer.WriteLine(inputLine);
                            state = State.FIND_END;
                        }
                        break;
                    case State.FIND_END:
                        if (inputLine.StartsWith("BODY;"))
                        {
                            // Split the BODY line at its first colon:
                            // the prefix belongs to the header, the rest is the body.
                            int split = inputLine.IndexOf(':') + 1;
                            writer.WriteLine(inputLine.Substring(0, split));
                            body = inputLine.Substring(split);
                            end = true;
                        }
                        else
                        {
                            writer.WriteLine(inputLine);
                        }
                        break;
                }
                if (end) break;
            }
            string output = builder.ToString(); // the header section

            Console.WriteLine(body); // the text with the header removed
        }
    }
}
