regex text between two string python

Question

I have some text like this:

CustomerID:1111,

text1

CustomerID:2222,

text2

CustomerID:3333,

text3

CustomerID:4444,

text4

CustomerID:5555,

text5

Each text has multiple lines.

I want to store the customer id and the text for each id in tuples (e.g. (1111, text1), (2222, text2), etc).

First, I use the expression below:

re.findall('CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)

However, I only get (1111, text1), (3333, text3), (5555, text5).....

vks · Accepted Answer · 2015-11-19 04:47:59Z

2

re.findall(r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL)

findall returns only the captured groups. Use a lookahead to stop the non-greedy quantifier. It is also recommended to write regexes as raw strings (the r prefix). Without the lookahead, the CustomerID: that ends one match is consumed, so the next record can no longer be matched and is skipped. A lookahead avoids this because it checks what follows without consuming any of the string.
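To make the difference concrete, here is a minimal sketch that runs both patterns on a shortened sample. The four-record string is made up for illustration; the two patterns are the ones from the question and from this answer.

```python
import re

# Shortened, made-up sample in the same layout as the question.
rawtxt = (
    "CustomerID:1111,\n\ntext1\n\n"
    "CustomerID:2222,\n\ntext2\n\n"
    "CustomerID:3333,\n\ntext3\n\n"
    "CustomerID:4444,\n\ntext4\n"
)

# Consuming version: the trailing "CustomerID:" becomes part of each match,
# so the search resumes after it and every other record is skipped.
consuming = re.findall(r'CustomerID:(\d+)(.*?)CustomerID:', rawtxt, re.DOTALL)
consumed_ids = [cid for cid, _ in consuming]

# Lookahead version: (?=...) only checks what follows without consuming it.
lookahead = re.findall(
    r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)', rawtxt, re.DOTALL
)

print(consumed_ids)
print(lookahead)
```

On this sample the first pattern finds only 1111 and 3333, while the lookahead version returns all four pairs.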

edited Nov 19, 2015 at 4:47

answered Nov 19, 2015 at 4:32

vks

6 Comments

Learner Over a year ago

What is the function of re.DOTALL?

vks Over a year ago

@SIslam By default, . does not match \n (newline). With this flag it does, so .* can now match across multiple lines.

Learner Over a year ago

Ah! here with and without re.DOTALL prints the same!

vks Over a year ago

@SIslam Because here the \n (newline) characters are already covered by \s.

Learner Over a year ago

So in this case do we need re.DOTALL? Thanks
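The exchange above can be checked with a small sketch. It assumes a made-up sample in which each text really spans two lines, which is the situation the question describes; the pattern is the one from this answer.

```python
import re

# Hypothetical sample: each record's text spans two lines.
rawtxt = (
    "CustomerID:1111,\nline a\nline b\n"
    "CustomerID:2222,\nline c\nline d\n"
)
pattern = r'CustomerID:(\d+),\s*(.*?)\s*(?=CustomerID:|$)'

# Without re.DOTALL, "." cannot cross the newline inside a text block.
without_flag = re.findall(pattern, rawtxt)

# With re.DOTALL, "." also matches "\n", so the whole block is captured.
with_flag = re.findall(pattern, rawtxt, re.DOTALL)

print(without_flag)
print(with_flag)
```

With single-line texts the two calls give the same result because \s absorbs the surrounding newlines. Once a text contains an internal newline, the flag is needed.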

Remi Guan · Accepted Answer · 2015-11-19 04:38:28Z

2

Actually, there is no need for regex here:

>>> with open('file') as f:
...     rawtxt = [i.strip() for i in f if i != '\n']
...
>>> l = []
>>> for i in [rawtxt[i:i+2] for i in range(0, len(rawtxt), 2)]:
...     l.append((i[0][11:-1], i[1]))
...
>>> l
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]

If you need 1111, 2222, etc. to be int, use l.append((int(i[0][11:-1]), i[1])) instead of l.append((i[0][11:-1], i[1])).
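The pairing above assumes exactly one line of text per id. If the texts span several lines, as the question says, one regex-free option is to group lines under the most recent header. This is only a sketch: parse_records is a hypothetical helper name, and it drops blank lines inside a text, which may or may not be what you want.

```python
def parse_records(lines):
    """Group non-blank lines under the most recent 'CustomerID:' header."""
    prefix = 'CustomerID:'
    records = []
    current_id, body = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(prefix):
            # A new header closes the previous record, if there is one.
            if current_id is not None:
                records.append((current_id, '\n'.join(body)))
            current_id = line[len(prefix):].rstrip(',')
            body = []
        elif line and current_id is not None:
            body.append(line)
    # Flush the last record.
    if current_id is not None:
        records.append((current_id, '\n'.join(body)))
    return records


# Made-up sample with a two-line text for the first id.
sample = "CustomerID:1111,\n\nfirst line\nsecond line\n\nCustomerID:2222,\n\ntext2\n"
records = parse_records(sample.splitlines())
print(records)
```

Because parse_records accepts any iterable of lines, an open file object could be passed in place of sample.splitlines().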

answered Nov 19, 2015 at 4:38

Remi Guan

dawg · Accepted Answer · 2015-11-19 04:52:17Z

1

Given:

>>> txt='''\
... CustomerID:1111,
... 
... text1
... 
... CustomerID:2222,
... 
... text2
... 
... CustomerID:3333,
... 
... text3
... 
... CustomerID:4444,
... 
... text4
... 
... CustomerID:5555,
... 
... text5'''

You can do:

>>> [re.findall(r'^(\d+),\s+(.+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1')], [('2222', 'text2')], [('3333', 'text3')], [('4444', 'text4')], [('5555', 'text5')]]

If it is multiline text, you can do:

>>> [re.findall(r'^(\d+),\s+([\s\S]+)', block) for block in txt.split('CustomerID:') if block]
[[('1111', 'text1\n\n')], [('2222', 'text2\n\n')], [('3333', 'text3\n\n')], [('4444', 'text4\n\n')], [('5555', 'text5')]]
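The results above are nested one-element lists, and the multiline version keeps trailing newlines. If a flat list of tuples is preferred, a possible variation is to use re.match on each block and strip the text. This is a sketch on a made-up two-record sample, not part of the original answer.

```python
import re

# Made-up sample; the first text spans two lines.
txt = "CustomerID:1111,\n\ntext1\nmore\n\nCustomerID:2222,\n\ntext2"

pairs = []
for block in txt.split('CustomerID:'):
    # re.match anchors at the start of the block; the empty first block
    # produced by split() simply fails to match and is skipped.
    m = re.match(r'(\d+),\s+([\s\S]+)', block)
    if m:
        pairs.append((m.group(1), m.group(2).strip()))

print(pairs)
```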

answered Nov 19, 2015 at 4:52

dawg

Learner · Accepted Answer · 2015-11-19 06:32:40Z

1

Another simple option may be:

>>> re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)
[('1111', 'text1'), ('2222', 'text2'), ('3333', 'text3'), ('4444', 'text4'), ('5555', 'text5')]

Edit: If needed (for less regularly ordered data), use filter. On Python 3, wrap it in list() to get a list rather than an iterator:

list(filter(lambda x: len(x)>1, re.findall(r'(\b\d+\b),\s*(\btext\d+\b)', rawtxt)))

edited Nov 19, 2015 at 6:32

answered Nov 19, 2015 at 5:38

Learner

Muposat · Accepted Answer · 2015-11-19 15:29:10Z

0

re.findall is not the best tool for this, since quantifiers are greedy by default and a pattern like .* will try to gobble up all the subsequent CustomerID entries along with the text.

A tool practically created for this is re.split. The parentheses form a capture group, so the id number is kept in the result while the "CustomerID:" marker itself is dropped. A second line stitches the tokens into tuples the way you wanted:

import re
toks = re.split(r'CustomerID:(\d{4}),\n', t)  # t holds the raw text
list(zip(toks[1::2], toks[2::2]))  # list() is needed on Python 3, where zip is lazy

EDIT: corrected index in zip(). Sample output after correction:

[('1111', 'text1\n'),
 ('2222', 'text2\n'),
 ('3333', 'text3\n'),
 ('4444', 'text4\n'),
 ('5555', 'text5')]
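For the layout in the question, where a blank line follows each id line, a slightly more tolerant variant may be easier to work with. The sketch below swaps in \d+ and \s* for the fixed \d{4} and \n, and strips each text; these changes and the three-record sample are assumptions for illustration, not the original answer's code.

```python
import re

# Made-up sample in the question's layout (blank line after each id line).
t = (
    "CustomerID:1111,\n\ntext1\n\n"
    "CustomerID:2222,\n\ntext2\n\n"
    "CustomerID:3333,\n\ntext3"
)

# The capture group makes re.split keep the ids in the result list.
toks = re.split(r'CustomerID:(\d+),\s*', t)

# toks[0] is whatever precedes the first header (an empty string here);
# after that, ids and texts alternate.
pairs = [(cid, text.strip()) for cid, text in zip(toks[1::2], toks[2::2])]

print(pairs)
```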

edited Nov 19, 2015 at 15:29

answered Nov 19, 2015 at 6:05

Muposat

2 Comments

Learner Over a year ago

This is not what OP wants, your expression returns [('1111', '2222'), ('2222', '3333'), ('3333', '4444'), ('4444', '5555')]

Muposat Over a year ago

@SIslam ... toks[2::2] instead of toks[3::2]. I will correct it
