
I am new to regex and would really appreciate any help here. I have to parse a file containing strings in the following two formats (the main difference being that the second string has an extra "_"-separated segment in the middle):

  1. Abc_p123 abc_ghi_data

    OR

  2. Abc_de*_p123 abc_ghi_data

I could write a regex to match the first and second strings separately:

  1. data_lst = re.findall(r'([a-zA-Z0-9]+_p\d{3})\s.*_data.*', content, re.IGNORECASE)
  2. data_lst = re.findall(r'([a-zA-Z0-9]+_[a-zA-Z]+_p\d{3})\s.*_data.*', content, re.IGNORECASE)

Can someone show me how to combine the two findall regexes so that a single pattern works with both strings? I can still build one combined list by appending the results of the second findall call to the first list. However, I am sure there is a way to handle it in one findall call. I tried ".*" in the middle, but that gives an error.

Please advise. Thanks,

3
  • Are you saying that _de* is optional? Commented Nov 2, 2020 at 20:35
  • To match an optional part, you can use a question mark. Commented Nov 2, 2020 at 20:36
  • Yes, in some cases we have it and in other cases we don't (then it is same as string1) Commented Nov 2, 2020 at 20:36

3 Answers

2

You were very close:

([a-zA-Z0-9]+(?:_[a-zA-Z]+\*)?_p\d{3})\s.*_data.*

Here is the important part:

(?:_[a-zA-Z]+\*)?

It says: optionally match an underscore, followed by one or more letters (a-z or A-Z), followed by a literal asterisk.

https://regex101.com/r/5XCsPK/1
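
As a rough check, the pattern can be run through re.findall in Python. This is only a sketch: the sample `content` string is an assumption modeled on the two formats in the question, not the asker's real file.

```python
import re

# Hypothetical sample input built from the two formats shown in the question.
content = "Abc_p123 abc_ghi_data\nAbc_de*_p123 abc_ghi_data\n"

# The optional non-capturing group (?:_[a-zA-Z]+\*)? covers the "_de*" part.
pattern = r'([a-zA-Z0-9]+(?:_[a-zA-Z]+\*)?_p\d{3})\s.*_data.*'

data_lst = re.findall(pattern, content)
print(data_lst)  # ['Abc_p123', 'Abc_de*_p123']
```

Because the optional group is non-capturing, findall still returns a flat list of strings from the single capture group.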


1

You could try

([a-zA-Z0-9]+(_[a-zA-Z]+)?_p\d{3})\s.*_data.*

I replaced _[a-zA-Z]+ with (_[a-zA-Z]+)? to make it optional.

And if you don't want the extra capture group, add ?: like so: (?:_[a-zA-Z]+)?

Demo: https://regex101.com/r/5xynlx/2
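
The choice between the two forms matters with re.findall, which returns one tuple per match as soon as the pattern has more than one capture group. A small sketch of the difference follows; the sample `content` string is an assumption, with plain letters standing in for "de*".

```python
import re

# Hypothetical sample input; "de" stands in for the optional middle part.
content = "Abc_p123 abc_ghi_data\nAbc_de_p123 abc_ghi_data\n"

capturing = r'([a-zA-Z0-9]+(_[a-zA-Z]+)?_p\d{3})\s.*_data.*'
non_capturing = r'([a-zA-Z0-9]+(?:_[a-zA-Z]+)?_p\d{3})\s.*_data.*'

# Two capture groups: findall yields tuples, with '' for an unmatched group.
with_tuples = re.findall(capturing, content)
print(with_tuples)  # [('Abc_p123', ''), ('Abc_de_p123', '_de')]

# One capture group: findall yields a flat list of strings.
flat = re.findall(non_capturing, content)
print(flat)  # ['Abc_p123', 'Abc_de_p123']

# A literal asterisk in the middle part is not covered by [a-zA-Z]+.
star = re.findall(non_capturing, "Abc_de*_p123 abc_ghi_data")
print(star)  # []
```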

3 Comments

The pattern misses the * in Abc_de*_p123 abc_ghi_data
Nope, the OP said he successfully matches the second string using ([a-zA-Z0-9]+_[a-zA-Z]+_p\d{3})\s.*_data.* so I think * meant 'any random list of letters' (which I agree was misleading, but if he really wanted to match the star, he wouldn't say ([a-zA-Z0-9]+_[a-zA-Z]+_p\d{3})\s.*_data.* is working)
Ok, I see what you mean. In that case it is misleading.
0

Use

([a-zA-Z0-9]+(?:_[a-zA-Z0-9*]+)?_p\d{3})\s.*_data

See proof

Explanation

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [a-zA-Z0-9]+             any character of: 'a' to 'z', 'A' to
                             'Z', '0' to '9' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
--------------------------------------------------------------------------------
      _                        '_'
--------------------------------------------------------------------------------
      [a-zA-Z0-9*]+            any character of: 'a' to 'z', 'A' to
                               'Z', '0' to '9', '*' (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )?                       end of grouping
--------------------------------------------------------------------------------
    _p                       '_p'
--------------------------------------------------------------------------------
    \d{3}                    digits (0-9) (3 times)
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  _data                    '_data'
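
A quick way to try this pattern from Python is sketched below. The sample lines are assumptions: the two formats from the question plus a variant without the asterisk, since the character class [a-zA-Z0-9*] accepts both.

```python
import re

# Hypothetical sample input: no middle part, middle part with "*", and without.
content = (
    "Abc_p123 abc_ghi_data\n"
    "Abc_de*_p123 abc_ghi_data\n"
    "Abc_de_p123 abc_ghi_data\n"
)

pattern = r'([a-zA-Z0-9]+(?:_[a-zA-Z0-9*]+)?_p\d{3})\s.*_data'

matches = re.findall(pattern, content)
print(matches)  # ['Abc_p123', 'Abc_de*_p123', 'Abc_de_p123']
```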
