2

I have a string like this:

s=

"(2021-06-29T10:53:42.647Z) [Denis]: hi
(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING
(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane 
(2021-06-29T11:58:29.053Z) [Nicholas]: 
(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#
(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021
(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##"

I want to extract the message text from each line. The expected output is:

comments=['hi','TA FOR SHOWING','how are you bane',' ','#END_REMOTE#','VAL 01JUL2021','##ENDED AT 08:07 GMT##'] 

What I have tried is:

comments=re.findall(r']:\s+(.*?)\n',s) 

The regex works well otherwise, but I'm not able to get the blank text as ''.

3
  • You have to exclude matching the ] like ]:\s+([^]\n]*)$ Commented Nov 9, 2021 at 11:33
  • Could you please provide the code you use to process your text? The string literal you provided does not compile. Commented Nov 9, 2021 at 11:33
  • @Thefourthbird I did... I will surely do so for the rest. Commented Nov 11, 2021 at 7:03

4 Answers

1

You can exclude ] from the capture group instead. If you also want to match the value on the last line, assert the end of the line with $ (together with re.MULTILINE) instead of requiring a newline with \n.

Note that \s can match a newline, and so can the negated character class [^]]*.

]:\s+([^]]*)$


import re

regex = r"]:\s+([^]]*)$"

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
    "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

print(re.findall(regex, s, re.MULTILINE))

Output

['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##'] 

If you don't want to cross lines:

]:[^\S\n]+([^]\n]*)$
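To see what "crossing lines" means in practice, here is a minimal sketch comparing the two patterns. The input is made up for illustration (it is not from the question): a blank message followed by a line that has no "[name]:" prefix. The variable names are likewise assumptions.

```python
import re

cross = r"]:\s+([^]]*)$"            # \s+ and [^]]* may both consume "\n"
no_cross = r"]:[^\S\n]+([^]\n]*)$"  # whitespace and capture stay on one line

# Made-up sample: a blank message, then a line without a "[name]:" prefix.
sample = "[A]: \nplain text\n[B]: x"

cross_hits = re.findall(cross, sample, re.MULTILINE)
no_cross_hits = re.findall(no_cross, sample, re.MULTILINE)

print(cross_hits)     # ['plain text', 'x']
print(no_cross_hits)  # ['', 'x']
```

With the first pattern, the blank message picks up the text of the following line; the second pattern reports it as an empty string.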



1

You could collect everything after the colon into a list, using capture group 1.

re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s) 

Then loop over the list, replacing every empty element with a space.

>>> import re
>>>
>>> # Explicit "\n" escapes keep the trailing space on the blank message line,
>>> # which the pattern needs in order to match that line.
>>> s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
...      "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
...      "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
...      "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
...      "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
...      "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
...      "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")
>>>
>>> talk = [re.sub('^$', ' ', w) for w in re.findall(r'(?m):[ \t]+(.*?)[ \t]*$',s)]
>>> print(talk)
['hi', 'TA FOR SHOWING', 'how are you bane', ' ', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
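The replacement step can also be written without a second regex. This is a small sketch that assumes the same pattern as above; `w or ' '` falls back to a space because an empty string is falsy in Python. The three-line sample is shortened from the question.

```python
import re

s = ("(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#")

# Empty captures become a single space; everything else is kept as-is.
talk = [w or ' ' for w in re.findall(r'(?m):[ \t]+(.*?)[ \t]*$', s)]
print(talk)  # ['how are you bane', ' ', '#END_REMOTE#']
```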


0

Is this what you want?

comments = re.findall(r']:\s(.*?)\n',s)

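\s+ means one or more whitespace characters, and that includes newlines, so on the blank line it also consumes the line break. If there is always exactly one space after the colon, use \s instead of \s+.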
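A short sketch of the difference, using four lines from the question's sample. The third pattern is an assumed variant, not part of the answer: it also accepts the end of the string, because the last line has no trailing newline.

```python
import re

s = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
     "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
     "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

# \s+ swallows " \n" on the blank line, so the capture becomes the next line.
greedy = re.findall(r']:\s+(.*?)\n', s)
# \s takes only the single space, so the blank message is captured as ''.
single = re.findall(r']:\s(.*?)\n', s)
# Assumed variant: stop at a newline or at the end of the string.
with_end = re.findall(r']:\s(.*?)(?:\n|$)', s)

print(greedy)    # ['hi', '(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#']
print(single)    # ['hi', '', '#END_REMOTE#']
print(with_end)  # ['hi', '', '#END_REMOTE#', '##ENDED AT 08:07 GMT##']
```

Both of the first two patterns skip the final line, since they require a \n that is not there.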


0

With the samples you have shown, please try the following regex.

^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$


Explanation: a detailed breakdown of the regex above.

^\(\d{4}-\d{2}-\d{2}  ##Match the start of the line, then ( followed by 4 digits, -, 2 digits, -, 2 digits.
T(?:\d{2}:){2}        ##Match T, then a non-capturing group (2 digits followed by a colon) repeated 2 times.
\d{2}\.\d{3}Z\)\s+    ##Match 2 digits, a dot, 3 digits, Z and ), followed by whitespace.
\[[^]]*\]:\s+         ##Match a literal [, everything up to the first ], then ], a colon and whitespace.
([^)]*)$              ##Capture group 1: any characters other than ), up to the end of the line.

With Python 3:

import re
regex = r"^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$"
varVal = ("(2021-06-29T10:53:42.647Z) [Denis]: hi\n"
    "(2021-06-29T10:54:53.693Z) [Nicholas]: TA FOR SHOWING\n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: how are you bane \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: \n"
    "(2021-06-29T11:58:29.053Z) [Nicholas]: #END_REMOTE#\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: VAL 01JUL2021\n"
    "(2021-06-30T08:07:42.029Z) [Denis]: ##ENDED AT 08:07 GMT##")

print(re.findall(regex, varVal, re.MULTILINE))

With the samples shown by the OP, the output is:

['hi', 'TA FOR SHOWING', 'how are you bane ', '', '#END_REMOTE#', 'VAL 01JUL2021', '##ENDED AT 08:07 GMT##']
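One thing worth checking before relying on this pattern: because the capture group is [^)]*, a message that itself contains ) will not match at all. The sketch below shows this on a made-up three-line input (the messages and the third timestamp are invented for illustration). The `relaxed` pattern is a hypothetical variant, not the answer's regex: it allows only horizontal whitespace after the colon and then captures the rest of the line.

```python
import re

# The answer's pattern, unchanged.
strict = r"^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:\s+([^)]*)$"
# Hypothetical variant: horizontal whitespace only, then the rest of the line.
relaxed = r"^\(\d{4}-\d{2}-\d{2}T(?:\d{2}:){2}\d{2}\.\d{3}Z\)\s+\[[^]]*\]:[^\S\n]+(.*)$"

# Made-up sample: the first message contains parentheses.
sample = ("(2021-06-29T10:53:42.647Z) [Denis]: see (note)\n"
          "(2021-06-29T10:54:53.693Z) [Nicholas]: \n"
          "(2021-06-29T10:55:00.000Z) [Denis]: ok")

strict_hits = re.findall(strict, sample, re.MULTILINE)
relaxed_hits = re.findall(relaxed, sample, re.MULTILINE)

print(strict_hits)   # ['', 'ok']
print(relaxed_hits)  # ['see (note)', '', 'ok']
```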
