0

I have a text file, and I need to remove some trailing delimiters. The text file looks like this:

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
                2|'Nguyen Van B'|'Nguyen| Van B'|39";
string result = @"1|'Nguyen Van A'|'Nguyen Van A'|39
                  2|'Nguyen Van B'|'Nguyen Van B'|39";

I want to remove the char "|" In the string "Nguyen Van | A" and "Nguyen | Van B"

So I think the best way is to do a Regex replace? Can anyone help me with this regex?

Thanks

9
  • 1
    So you have a text file that is pipe delimited and you allow pipes within an entry when the entry is quoted? Am I understanding that correctly? Also are you allowed to use a third party library? Commented Mar 11, 2017 at 11:49
  • I would write it by hand instead of fighting against regexes... In the end it is a simple state machine. Commented Mar 11, 2017 at 11:53
  • BULK INSERT Product FROM 'D:\product.data' WITH ( FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' ); I use BULK INSERT SQL Server to import file but have errors in these fields. Bulk load data conversion error (type mismatch or invalid character for the specified codepage). Commented Mar 11, 2017 at 11:59
  • Dear xanatos, Currently, I am in the data interface of the two systems so everything is import/export automatically. Thanks Commented Mar 11, 2017 at 12:04
  • @NguyễnQuang - I am also sort of confused now. Are you doing this in code to prepare your bulk insert file? The default FIELDQUOTE character is " so you will potentially have other issues, not just the extraneous pipe characters? Maybe it would be better if you wrote the question stating the actual problem rather than your proposed workaround, as there are potentially better ways to solve it. Commented Mar 11, 2017 at 12:08

2 Answers 2

1

The regex should be:

(?<=^[^']*'([^']*'[^']*')*[^']*)\|

to be used Multiline... so

var rx = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|", RegexOptions.Multiline);

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39

2|'Nguyen Van B'|'Nguyen| Van B'|39";

string replaced = rx.Replace(text, string.Empty);

Example: http://ideone.com/PTdsg5

I strongly suggest against using it... To explain why... Try to comprehend the regular expression. If you can comprehend it, then you can use it :-)

I would write a simple state machine that counts ' and replaces the | when the counted ' is odd.

Sign up to request clarification or add additional context in comments.

4 Comments

Because the BULK INSERT command does not accept these record. I need to clean it before using the BULK INSERT. stackoverflow.com/questions/12902110/… The solution solved the problem I encountered. However, for large file data, the process runs for a long time. Do you have a better way? Such as remove char | of content begins with string |' and the end is string '| Instead of counts ' and replaces the | when the counted ' is odd.
@NguyễnQuang - You should post this as a new question and be sure to post the relevant code (i.e. how you are opening and parsing the file, where you are calling the regex.Replace and how you are writing out the new file) to that question.
Thanks for looking for a solution here: stackoverflow.com/questions/13305139/…
@NguyễnQuang - Please post a new question with the code you are using. I am assuming that you are loading the whole file and using this Regex with the multiline mode. Post the new question and I can post an answer to help you.
0

You mentioned using the multiline regex is taking too long and asked about the state machine approach. So here is some code using a function to perform the operation (note, the function could probably use a little cleaning, but it shows the idea and works faster than the regex). In my testing, using the regex without multiline, I could process 1,000,000 lines (in memory, not writing to a file) in about 34 seconds. Using the state-machine approach it was about 4 seconds.

string RemoveInternalPipe(string line)
{
    int count = 0;
    var temp = new List<char>(line.Length);
    foreach (var c in line)
    {
        if (c == '\'')
        {
            ++count;
        }
        if (c == '|' && count % 2 != 0) continue;
        temp.Add(c);
    }
    return new string(temp.ToArray());
};

File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => RemoveInternalPipe(x)));

To compare the performance against the Regex version (without the multiline option), you could run this code:

var regex = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");
File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => regex.Replace(x, string.Empty));

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.