I have some text like this:

CustomerID:1111,

text1

CustomerID:2222,

text2

CustomerID:3333,

text3

CustomerID:4444,

text4

CustomerID:5555,

text5

Each text has multiple lines.

I want to store the customer id and the text for each id in tuples (e.g. (1111, text1), (2222, text2), etc).

First, I use the expression below:

re.findall('CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)

However, I only get every other record: (1111, text1), (3333, text3), (5555, text5), ...

5 Answers

2
re.findall(r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL)

findall returns only the captured groups. Use a lookahead to stop the non-greedy quantifier. It is also recommended to write the pattern as a raw string (the r prefix). If you don't use a lookahead, the CustomerID: that starts the next record is consumed by the current match, so the next record cannot be matched. A lookahead only checks for the marker without consuming it, which leaves it available for the following match.
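To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened sample. The three-record `rawtxt` string is an assumption made up for illustration; it is not the asker's real data.

```python
import re

# Shortened sample in the same layout as the question (assumed data).
rawtxt = (
    "CustomerID:1111,\n\ntext1\n\n"
    "CustomerID:2222,\n\ntext2\n\n"
    "CustomerID:3333,\n\ntext3\n"
)

# Consuming pattern: each match swallows the next "CustomerID:" marker,
# so the record that starts at that marker can no longer be found.
consuming = re.findall(r'CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)

# Lookahead pattern: the next marker is only checked, never consumed.
lookahead = re.findall(
    r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL
)

print([cid for cid, _ in consuming])
print(lookahead)
```

On this sample the consuming pattern finds only the first id, while the lookahead version returns all three (id, text) pairs.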

6 Comments

What is the function of re.DOTALL?
@SIslam By default, . does not match \n (newline). With this flag it does, so .* can now match across multiple lines.
Ah! here with and without re.DOTALL prints the same!
@SIslam Because the \s* parts of the pattern already cover the newlines around each single-line text.
So in this case do we need re.DOTALL? Thanks
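Whether the flag is needed depends on the text. A small sketch, assuming a made-up `sample` in which the first record spans two lines, suggests that re.DOTALL does matter once a text contains an internal newline:

```python
import re

# Assumed sample: the first record's text spans two lines.
sample = (
    "CustomerID:1111,\n\nline one\nline two\n\n"
    "CustomerID:2222,\n\nonly line\n"
)
pattern = r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)'

# With DOTALL, "." can cross the newline inside the first text.
with_dotall = re.findall(pattern, sample, re.DOTALL)

# Without DOTALL, "." stops at the internal newline, so the first
# record cannot reach the next marker and is dropped.
without_dotall = re.findall(pattern, sample)

print(with_dotall)
print(without_dotall)
```

Here the version without the flag silently loses the two-line record, so for the multi-line texts described in the question the flag looks like the safer choice.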
2

Actually, there is no need for a regex here:

>>> with open('file') as f:
...     rawtxt = [i.strip() for i in f if i != '\n']
...
>>> l = []
>>> for i in [rawtxt[i:i+2] for i in range(0, len(rawtxt), 2)]:
...     l.append((i[0][11:-1], i[1]))
...
>>> l
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]

If you need 1111, 2222, etc. to be ints, use l.append((int(i[0][11:-1]), i[1])) instead of l.append((i[0][11:-1], i[1])).
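The slicing above pairs lines two at a time, which assumes each text occupies exactly one line. For texts that span several lines, one possible regex-free variant is to start a new record whenever a line begins with CustomerID:. The function name `parse_records` and the sample data are hypothetical, and this sketch drops blank lines inside a text, which may or may not be what you want.

```python
def parse_records(lines):
    """Group lines into (customer_id, text) tuples without a regex."""
    prefix = 'CustomerID:'
    records = []
    current_id, buffer = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            # A new header closes the previous record, if there is one.
            if current_id is not None:
                records.append((current_id, '\n'.join(buffer)))
            current_id = line[len(prefix):].rstrip(',')
            buffer = []
        elif line and current_id is not None:
            # Non-blank line belonging to the current record.
            buffer.append(line)
    if current_id is not None:
        records.append((current_id, '\n'.join(buffer)))
    return records


# Assumed sample with one two-line text.
sample = "CustomerID:1111,\n\nline one\nline two\n\nCustomerID:2222,\n\ntext2\n"
result = parse_records(sample.splitlines())
print(result)
```

The same function accepts an open file object in place of `sample.splitlines()`, since it only iterates over lines.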

1

Given:

>>> txt='''\
... CustomerID:1111,
... 
... text1
... 
... CustomerID:2222,
... 
... text2
... 
... CustomerID:3333,
... 
... text3
... 
... CustomerID:4444,
... 
... text4
... 
... CustomerID:5555,
... 
... text5'''

You can do:

>>> [re.findall(r'^(\d+),\s+(.+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1')], [('2222', 'text2')], [('3333', 'text3')], [('4444', 'text4')], [('5555', 'text5')]]

If it is multiline text, you can do:

>>> [re.findall(r'^(\d+),\s+([\s\S]+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1\n\n')], [('2222', 'text2\n\n')], [('3333', 'text3\n\n')], [('4444', 'text4\n\n')], [('5555', 'text5')]]
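Both results above are lists of one-element lists, and the multiline version keeps the trailing newlines. If a flat list of tuples is preferred, a sketch along these lines might work. It swaps re.findall for re.match per block and adds a lazy quantifier with a trailing \s*$ to trim whitespace; the sample `txt` is assumed for illustration.

```python
import re

# Assumed sample; the second text spans two lines.
txt = (
    "CustomerID:1111,\n\ntext1\n\n"
    "CustomerID:2222,\n\ntext2 line a\ntext2 line b\n"
)

pairs = []
for block in txt.split('CustomerID:'):
    # Lazy [\s\S]+? plus \s*$ keeps internal newlines but drops
    # the whitespace at the end of each block.
    m = re.match(r'(\d+),\s+([\s\S]+?)\s*$', block)
    if m:  # the empty block before the first marker does not match
        pairs.append((m.group(1), m.group(2)))

print(pairs)
```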

1

Another simple option might be:

>>> re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]

Edit: If needed (for less regularly ordered data), use filter:

filter(lambda x: len(x)>1,re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt))

0

re.findall is not the best tool for this: a pattern that ends on the next CustomerID: consumes that marker, so the record that starts there can no longer be matched.

A tool practically created for this is re.split. The parentheses capture the id number, so re.split keeps it in the result while the "CustomerID:" text itself is dropped. A second line stitches the tokens into tuples the way you wanted:

toks = re.split(r'CustomerID:(\d{4}),\n', t)
list(zip(toks[1::2], toks[2::2]))  # list() so Python 3 shows the pairs, not a zip object

EDIT: corrected index in zip(). Sample output after correction:

[('1111', 'text1\n'),
 ('2222', 'text2\n'),
 ('3333', 'text3\n'),
 ('4444', 'text4\n'),
 ('5555', 'text5')]
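As a variation on the split approach, the sketch below relaxes \d{4} to \d+, leaves the newline out of the separator, and strips each text afterwards, so it also copes with blank lines around the text. The sample string `t` is assumed for illustration.

```python
import re

# Assumed sample with blank lines around each text, as in the question.
t = "CustomerID:1111,\n\ntext1\n\nCustomerID:2222,\n\ntext2\n"

# The capturing group makes re.split keep each id in the token list:
# ['', '1111', '\n\ntext1\n\n', '2222', '\n\ntext2\n']
toks = re.split(r'CustomerID:(\d+),', t)

# Odd positions hold ids, the following even positions hold the texts.
pairs = [(cid, body.strip()) for cid, body in zip(toks[1::2], toks[2::2])]

print(pairs)
```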

2 Comments

This is not what OP wants, your expression returns [('1111', '2222'), ('2222', '3333'), ('3333', '4444'), ('4444', '5555')]
@SIslam ... toks[2::2] instead of toks[3::2]. I will correct it
