I have a text file that looks like this:

FieldA    FieldB    FieldC    FieldD  FieldE
  001       中文                15%     语言
  002       法文      20        12%     外文 
  003       英文      21                外文
  004     西班牙语               10%     外文

So basically I have read the file in and split it into lines. Now I would like to use a regex to split each line into fields. As you can see, some fields in a column are empty, and the fields may not be fixed width, but they are separated by at least one whitespace character. Some fields contain Chinese characters.

May I know how to do this? Thanks.

3 Comments

  • How do you know that 外文 goes to column FieldE, not to FieldD? Commented Aug 22, 2015 at 9:42
  • That is the thing: I need the regex to know that there are 5 fields. The last field, FieldE, is always Chinese, while FieldD is a percentage or empty. Commented Aug 22, 2015 at 9:43
  • What are the fields separated with? Commented Aug 22, 2015 at 10:08

2 Answers

using System.Text.RegularExpressions;

string s = "001       中文                15%     语言";
Match m = Regex.Match(s,
    @"^\s*" +               // Skip leading whitespace (the lines in the file are indented)
    @"(?<A>\d*)\s*" +       // Field A: any number of digits
    @"(?<B>\p{L}*)\s*" +    // Field B: any number of letters (\p{L} includes Chinese characters)
    @"(?<C>\d*)\s+" +       // Field C: any number of digits, followed by at least one whitespace
    @"(?<D>(\d+%)?)\s*" +   // Field D: one or more digits followed by '%', or nothing
    @"(?<E>\p{L}*)");       // Field E: any number of letters
string fieldA = m.Groups["A"].Value;    // "001"
string fieldB = m.Groups["B"].Value;    // "中文"
string fieldC = m.Groups["C"].Value;    // ""
string fieldD = m.Groups["D"].Value;    // "15%"
string fieldE = m.Groups["E"].Value;    // "语言"

All fields are optional. If a field is not present, it will be captured as an empty string, like in fieldC above.
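To show how this might be applied to every line of the file, here is a minimal sketch written as C# top-level statements (C# 9 or later). The helper name `ParseLine` and the `lines` array are hypothetical and only for illustration. The sketch assumes each line may be indented, as in the question's sample, so the pattern starts with `^\s*`.

```csharp
using System;
using System.Text.RegularExpressions;

// The five named groups from the answer above, anchored with ^\s* so that
// indentation is skipped before Field A is tried.
var pattern = new Regex(
    @"^\s*(?<A>\d*)\s*(?<B>\p{L}*)\s*(?<C>\d*)\s+(?<D>(\d+%)?)\s*(?<E>\p{L}*)");

// Hypothetical helper: returns the five fields in order, "" for a missing one.
string[] ParseLine(string line)
{
    Match m = pattern.Match(line);
    if (!m.Success)
        throw new FormatException("Unrecognized line: " + line);
    return new[]
    {
        m.Groups["A"].Value, m.Groups["B"].Value, m.Groups["C"].Value,
        m.Groups["D"].Value, m.Groups["E"].Value
    };
}

// Sample rows copied from the question (header row left out).
string[] lines =
{
    "  001       中文                15%     语言",
    "  002       法文      20        12%     外文 ",
    "  003       英文      21                外文",
    "  004     西班牙语               10%     外文"
};

// Print each row with '|' between fields so empty fields stay visible.
foreach (string line in lines)
    Console.WriteLine(string.Join("|", ParseLine(line)));
```

For the four sample rows this should print `001|中文||15%|语言`, `002|法文|20|12%|外文`, `003|英文|21||外文` and `004|西班牙语||10%|外文`. Without the leading `^\s*`, an indented line can still match, but the digits end up in Field C instead of Field A.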

2 Comments

This will fail if some fields are missing.
@Nikolay: The original version would only have failed if Field A was missing. In the updated version, all fields are optional.

/\s*(\d*)\s*([^\d\s]*)\s*(\d*)\s\s*(\d*%?)\s*([^\d\s]*)/

Here is a regex that will capture all of the content you want; use it on each line.

\s*         //any number of whitespace
(\d*)       //any number of digits
\s*         //any number of whitespace
([^\d\s]*)  //any number of characters that aren't whitespace or digits
\s*         //any number of whitespace
(\d*)\s     //any number of digits with a space after it
\s*         //any number of whitespace
(\d*%?)     //any number of digits with an optional %
\s*         //any number of whitespace
([^\d\s]*)  //any number of characters that aren't whitespace or digits
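Since the question is about C#, here is a rough sketch of how this pattern could be used there, written as top-level statements (C# 9 or later). The helper name `SplitFields` is hypothetical. The surrounding slashes are dropped because C# takes the pattern as a plain string.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from this answer, without the /.../ delimiters.
var fieldPattern = new Regex(
    @"\s*(\d*)\s*([^\d\s]*)\s*(\d*)\s\s*(\d*%?)\s*([^\d\s]*)");

// Hypothetical helper: capture groups 1..5 correspond to FieldA..FieldE.
string[] SplitFields(string line)
{
    Match m = fieldPattern.Match(line);
    if (!m.Success)
        throw new FormatException("Unrecognized line: " + line);

    var fields = new string[5];
    for (int i = 0; i < fields.Length; i++)
        fields[i] = m.Groups[i + 1].Value;   // group 0 is the whole match
    return fields;
}

// One row from the question, where FieldD is empty.
string[] row = SplitFields("  003       英文      21                外文");
Console.WriteLine(string.Join("|", row));
```

This should print `003|英文|21||外文`. Note that `[^\d\s]` accepts any character that is not a digit or whitespace, including punctuation, so it is looser than a letters-only class such as `\p{L}`.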
