4

I need help with the following pattern; I have been struggling with it for many hours. I have text like this:

<<12/24/2015 00:00  userrrr>>
********** Text all char and symbols ************
<<12/24/2015 00:00 CET userr>>
Text all char and symbols
<<12/24/2015 00:00 GMT+1 userrrr>> Text in same line
<<12/24/2015 00:00 CET userrr>>
Text all characters and symbols
<<12/24/2015 00:00 GMT+1 userrrrrrr>> Text in same line
More Text all characters and symbols
<<12/24/2015 00:00 CET userrrrr>>
More text all characters and symbols
<<12/24/2015 00:00 CET userrrrrrrrrrr>>
More Text all characters and symbols

Using the pattern:

(\<<)(\d{2}/\d{2}/\d{4}\s\d{2}:\d{2})(.*?(?=>>))(>>)

The datetime and everything between the arrows is matched correctly. Unfortunately, I cannot find a way to extract the text between the patterns. The final groups should look like (left_arrows), (datetime), (user), (right_arrows), (text). The closest I got was by using:

(\<<)(\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s\D{3}.*?(?=\s))\s(.*?(?=>>))(>>)((?s).*?(?=<<\d{2}/\d{2}))

But it doesn't match the first and the last entries correctly. (I checked the result on pythex.org.)

3
  • What do you want to extract? Are you sure line.startswith("<<") could not do most of what you want? Commented Dec 24, 2015 at 11:51
  • 3 groups: (datetimeoffset), (User), (Text between patterns). Right now I am failing to extract the text between patterns; I don't have an issue with the first 2 groups. Commented Dec 24, 2015 at 11:57
  • Why is this tagged with BeautifulSoup?.. Commented Dec 24, 2015 at 13:39

2 Answers

1
(\<<)(\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s\D{0,3}.*?(?=\s))\s(.*?(?=>>))(>>)((?s).*?(?=<<\d{2}/\d{2}|$))
                                                                                                ^^

You need to add |$ to the final lookahead so that the last block, which is not followed by another header, also matches. See the demo:

https://regex101.com/r/fM9lY3/51
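The effect of the `|$` alternative is easier to see on a much smaller input. The following is a minimal sketch, not the pattern from this answer: the two-record sample string and the simplified `<<word>>` header are assumptions made purely for illustration. It uses `[\s\S]` instead of an inline `(?s)` in the middle of the pattern, because recent Python versions reject global inline flags that are not at the start of the expression.

```python
import re

# Hypothetical two-record sample; headers are simplified to <<word>>.
sample = "<<a>> one\n<<b>> two"

# Lookahead without |$: the last record has no following "<<", so it is dropped.
without_end = re.findall(r"<<(\w+)>>([\s\S]*?)(?=<<)", sample)

# Lookahead with |$: the end of the string also terminates a record.
with_end = re.findall(r"<<(\w+)>>([\s\S]*?)(?=<<|$)", sample)

print(without_end)
print(with_end)
```

Only the second call returns both records, which is the same reason the last block of the original text was missing.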


7 Comments

Hi vks, the last line works now, thanks for the info. While it works in your demo, in my terminal and at the pythex link the "********** Text all char and symbols ************" block fails to match.
@Zars Try (\<<)(\d{2}/\d{2}/\d{4}\s\d{2}:\d{2}\s\D{0,3}.*?(?=\s))\s(.*?(?=>>))(>>)([\s\S]*?(?=<<\d{2}/\d{2}|$))
@vsk No, the second one does not work either. I am still wondering why it works in regex101 but not in pythex or my terminal...
@Zars Can you show the code where you are implementing this?
@vsk I just figured out that the pattern works for <<12/24/2015 00:00 CET userr>> but not for <<12/24/2015 00:00 userrrr>>, where the timezone is missing. To see my implementation, check the link in my first post. I think I am getting closer. Thanks.
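One way to handle the missing timezone noted in the last comment is to make the timezone its own optional group. This is a sketch under stated assumptions, not the pattern from the answer: it assumes the timezone (when present) and the user name each contain no whitespace and no `>`, and the names `RECORD_RE` and `parse_records` are hypothetical.

```python
import re

# Assumption: timezone (if present) and user contain no whitespace or '>'.
RECORD_RE = re.compile(
    r"<<(\d{2}/\d{2}/\d{4} \d{2}:\d{2})"   # datetime
    r"[ \t]+(?:([^\s>]+)[ \t]+)?"          # optional timezone token
    r"([^\s>]+)>>"                         # user
    r"([\s\S]*?)"                          # text, may span several lines
    r"(?=<<\d{2}/\d{2}/\d{4}|\Z)"          # stop at next header or end of input
)

def parse_records(text):
    """Return (datetime, timezone or None, user, text) tuples."""
    return [
        (m.group(1), m.group(2), m.group(3), m.group(4).strip())
        for m in RECORD_RE.finditer(text)
    ]

sample = """<<12/24/2015 00:00  userrrr>>
********** Text all char and symbols ************
<<12/24/2015 00:00 CET userr>>
Text all char and symbols
<<12/24/2015 00:00 GMT+1 userrrr>> Text in same line
"""

records = parse_records(sample)
for record in records:
    print(record)
```

Keeping the timezone in a separate optional group means a header without one still matches, and the timezone simply comes back as `None`.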
0

I think the easiest way would be to go over the file line by line and match each line against different regexes, one for header lines and one for text lines. But if you really need to get it in one shot, you could do:

(\<<)(\d{2}/\d{2}/\d{4}\s\d{2}:\d{2})(.*?(?=>>))(>>)\n\*+([^\*]+)\*+\n
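The line-by-line approach mentioned first could look roughly like the sketch below. It is only an illustration: `HEADER_RE` and `group_lines` are hypothetical names, a single header regex is used and every non-matching line is treated as text, and everything between the datetime and `>>` (timezone plus user) is kept together in one `who` field rather than split further.

```python
import re

# Header: <<MM/DD/YYYY HH:MM [timezone] user>> optionally followed by text.
HEADER_RE = re.compile(r"<<(\d{2}/\d{2}/\d{4} \d{2}:\d{2})\s*(.*?)>>(.*)")

def group_lines(lines):
    """Attach each non-header line to the most recent header."""
    records = []
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            stamp, who, trailing = match.groups()
            record = {"datetime": stamp, "who": who.strip(), "text": []}
            if trailing.strip():
                # Text that follows the header on the same line.
                record["text"].append(trailing.strip())
            records.append(record)
        elif records:  # lines before the first header are ignored
            records[-1]["text"].append(line)
    return records

sample = """<<12/24/2015 00:00  userrrr>>
********** Text all char and symbols ************
<<12/24/2015 00:00 GMT+1 userrrrrrr>> Text in same line
More Text all characters and symbols"""

records = group_lines(sample.splitlines())
for record in records:
    print(record)
```

Because each line is examined on its own, there is no need for a lookahead to find where the text of one record ends.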
