3

I have a long text containing company names and IDs. I would like to split the string into a list in which each item ends with an ID. Every ID consists of 5 digits and always appears in the same format, matching the regex \(ID\:\d{5}\):

text = "Company A, Inc(ID:12345), some-company, X (ID:12324), Some Special Company Z (ID:34324)"

What I would like to get is the following:

["Company A, Inc (ID:12345)", "some-company, X (ID:12324)", "Some Special Company Z (ID:34324)"]

Is there a way to do it with Regex? Thanks in advance!

3 Answers

2

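Try re.findall with a lazy match up to each ID:

import re
a = re.findall(r'(.*?\(ID\:\d{5}\))', text)
print(a)

Output (the second and third items keep the leading comma and space):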

['Company A, Inc(ID:12345)',
 ', some-company, X (ID:12324)',
 ', Some Special Company Z (ID:34324)']
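
The leading ", " on the later items in the output above can be removed in a second pass. A minimal sketch, assuming that no company name legitimately starts with a comma or a space (the variable names raw and items are only illustrative):

```python
import re

text = "Company A, Inc(ID:12345), some-company, X (ID:12324), Some Special Company Z (ID:34324)"

# Lazy match up to and including each ID; later matches start at the ", " separator.
raw = re.findall(r'.*?\(ID:\d{5}\)', text)

# Strip any leading commas and spaces left over from the separator.
items = [item.lstrip(', ') for item in raw]
print(items)
```

Note that lstrip(', ') removes every leading comma or space character, not just one ", " pair, which is fine for this input but is an assumption about the data.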

1

Would you try the following:

import re
text = "Company A, Inc(ID:12345), some-company, X (ID:12324), Some Special Company Z (ID:34324)"

a = re.split(r'(?<=\(ID:\d{5}\)),\s*', text)
print(a)

Output:

['Company A, Inc(ID:12345)', 'some-company, X (ID:12324)', 'Some Special Company Z (ID:34324)']

Explanation of the regex r'(?<=\(ID:\d{5}\)),\s*':

  • (?<=pattern) is a positive lookbehind assertion. It is zero-width, so the text it matches is not consumed by the split and stays in the resulting list.
  • \(ID:\d{5}\) is the ID format you describe.
  • ,\s* matches a comma followed by zero or more whitespace characters. This part is the delimiter, so it is not included in the result.
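
As a rough check of the points above, the sketch below wraps the same pattern in a helper (split_companies is a hypothetical name, and the sample string is made up) and shows that commas inside a company name are left alone, while commas right after an ID are treated as delimiters with or without trailing whitespace:

```python
import re

# The lookbehind is fixed-width (10 characters), which Python's re module requires.
ID_SPLIT = re.compile(r'(?<=\(ID:\d{5}\)),\s*')

def split_companies(text):
    """Split only on commas that directly follow an (ID:ddddd) marker."""
    return ID_SPLIT.split(text)

# Made-up input: commas inside names, and separators with 0, 1 or 3 spaces.
parts = split_companies("A, B (ID:00001),C (ID:00002),   D, E(ID:00003)")
print(parts)
```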

0

You can optionally match a comma and 1 or more whitespace chars. Then match at least a single non whitespace char for the company name until the first occurrence of the id pattern.

Note that you don't have to escape the colon; \: and : match the same character:

(?:,\s+)?(\S.*?\(ID:\d{5}\))

Explanation

  • (?:,\s+)? Optionally match a comma and 1+ whitespace chars
  • ( Capture group 1
    • \S.*? Match a non-whitespace char followed by 0+ chars, as few as possible
    • \(ID:\d{5}\) Match (ID: 5 digits and )
  • ) Close group

Example

import re

text = "Company A, Inc(ID:12345), some-company, X (ID:12324), Some Special Company Z (ID:34324)"
print(re.findall(r"(?:,\s+)?(\S.*?\(ID:\d{5}\))", text))

Output

['Company A, Inc(ID:12345)', 'some-company, X (ID:12324)', 'Some Special Company Z (ID:34324)']
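
The same pattern can also be written with re.VERBOSE so that each part of the explanation sits next to the piece of regex it describes. This is only a sketch of an alternative layout (the name PATTERN is illustrative); the regex itself is unchanged:

```python
import re

PATTERN = re.compile(r"""
    (?:,\s+)?        # optionally match a comma and 1+ whitespace chars
    (                # capture group 1
      \S.*?          # a non-whitespace char, then as few chars as possible
      \(ID:\d{5}\)   # the literal text (ID: then 5 digits and a closing paren
    )                # close group 1
""", re.VERBOSE)

text = "Company A, Inc(ID:12345), some-company, X (ID:12324), Some Special Company Z (ID:34324)"

# With one capture group, findall returns only the group contents.
result = PATTERN.findall(text)
print(result)
```

In verbose mode unescaped whitespace in the pattern is ignored, which is safe here because the pattern only uses \s and \S rather than literal spaces.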
