Problem Introduction
I've fried my brain trying to get negative lookaheads/lookbehinds to work. For the last example input, my current solution returns no match (see the expected output table). I'm struggling to match the title part of the string when it contains a year that is not at the end of the string. To be clear, I only want to capture the year if it is at the end of the string. The current regex fails on the last example because the title group excludes every "Q" and every digit, i.e. it matches NOT("Q" OR "\d"). What I actually want is for it to exclude only a "Q" followed by a single digit, i.e. NOT("Q" followed by "\d{1}"). Any tips or suggestions are greatly appreciated. Note: I'm using Python 3.8.
Example Input
AXP - Earnings call Q2 2021
AXP - Conference call 2021
BAC,BAC.PE,BAC.PL,BACRP,BML.PL,BML.PJ,BML.PH,BML.PG,BAC.PB,BAC.PK,BAC.PM,BAC.PN - Earnings call Q2 2021
GM - General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference
AXP - American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference
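The period is always of the form Q[1-4]. Both period and year are optional; if they occur, they are at the end of the string. symbol and title are always present and always separated by " - ". When several comma-separated symbols are given (as in the third example), I only want the first one.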
Expected Output
| symbol | title | period | year |
|---|---|---|---|
| AXP | Earnings call | Q2 | 2021 |
| AXP | Conference call | | 2021 |
| BAC | Earnings call | Q2 | 2021 |
| GM | General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference | | |
| AXP | American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference | | |
What I've Tried
r"^(?P<symbol>[^\,]{1,8})(\,[A-Z\.]+)*\s\-\s(?P<title>[^Q\d]*)\s?(?P<period>Q\d)?\s?(?P<year>19|20\d{2})$"
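For reference, here is a minimal sketch of one possible direction that avoids lookarounds altogether. It is not a definitive fix: it assumes the rules stated above (period is Q[1-4], year is 19xx or 20xx, both optional and only at the end) and uses a lazy title group so the optional trailing groups get the first chance at the end of the string. The names PATTERN and parse are made up for this example. Two things it changes relative to the attempt above: the year alternation is grouped as (?:19|20)\d{2}, because 19|20\d{2} is read as "19" or "20\d{2}", and the year group is made optional.

```python
import re

# Hypothetical names (PATTERN, parse) chosen for this sketch only.
PATTERN = re.compile(
    r"""
    ^(?P<symbol>[^,\s]+)               # first symbol only
    (?:,[^,\s]+)*                      # further comma-separated symbols (not captured)
    \s-\s                              # the " - " separator
    (?P<title>.+?)                     # lazy: take as little as possible
    (?:\s(?P<period>Q[1-4]))?          # optional period such as Q2
    (?:\s(?P<year>(?:19|20)\d{2}))?    # optional four-digit year
    $                                  # period/year only count at the very end
    """,
    re.VERBOSE,
)


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groupdict() if match else None


samples = [
    "AXP - Earnings call Q2 2021",
    "AXP - Conference call 2021",
    "BAC,BAC.PE,BAC.PL,BACRP,BML.PL,BML.PJ,BML.PH,BML.PG,BAC.PB,BAC.PK,BAC.PM,BAC.PN - Earnings call Q2 2021",
    "GM - General Motors Company (GM) Presents at Deutsche Bank AutoTech Conference",
    "AXP - American Express Company (AXP) Management Presents at Barclays 2020 Global Financial Services Conference",
]

for line in samples:
    print(parse(line))
```

Because the title is lazy and the pattern is anchored with $, a year in the middle of the title (such as "Barclays 2020 Global ...") stays in the title, while missing groups come back as None. One caveat of this approach: a title that genuinely ends in a year-like number would have that number captured as year, which the stated rules cannot distinguish anyway.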