Disclaimer: This is my first post. Feel free to give me feedback on how I should or shouldn't have formatted this question. Thanks!
I'm looking to pull data out of text blocks by capturing each record that begins with a date followed by a colon. I have successfully used regular expressions to capture an observation date, its colon, and any text that follows, up to the period before the next date.
For example:
1999-01-01: 10 birds observed.
The problem is that some of my data contains site names, each followed by a colon, inside the observation text that comes after the observation date and its colon. This 'sitename: data' sub-pattern can occur zero or more times within the block following the observation date.
For example:
1999-01-01: BS-001: 5 birds observed. All in good health. BS-002: 5 birds observed, some in poor health.
What pattern should I use to capture all text after the date format and colon, including the potential site names, their colons, and related data up to the period before the next observation date?
I currently extract the simple observation data (without multiple sites within them) by date and observation using the following pattern:
import re
pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')
The code above lets me pull out observation dates that could be in a variety of forms. Using periods as part of the pattern is tricky since observation data could be one or many sentences.
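To make the limitation concrete, here is a small check of that pattern against the two one-line examples above. The variable names are only for illustration. The character class after the date contains neither ':' nor '.', so the match stops at the first site-name colon.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')

simple = '1999-01-01: 10 birds observed.'
multi_site = ('1999-01-01: BS-001: 5 birds observed. All in good health. '
              'BS-002: 5 birds observed, some in poor health.')

simple_matches = pattern.findall(simple)
multi_matches = pattern.findall(multi_site)

print(simple_matches)  # the whole single-sentence record
print(multi_matches)   # cut off at the colon that follows the first site name
```

The second result contains only the date and the first site name, which is the behavior I need to get past.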
Here is an example of the text I am trying to search and split out. Each new match should begin with an observation date, so in the data below there should be 3 matches returned (2013-04-13: data, 2017-01-01: data, and 2018-07-04: data):
2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched. 2017-01-01: 23 individuals observed. Egg masses were not present. 2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.
Ideally the output would look like this:
2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched.
2017-01-01: 23 individuals observed. Egg masses were not present.
2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.
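The output above could be produced by cutting the text at each date marker rather than by describing the observation text itself. The following is a minimal sketch under one loud assumption: every record starts with a date written exactly as YYYY-MM-DD followed by a colon. My real dates can take other forms, so the date expression would need widening. `DATE_START` and `split_observations` are hypothetical names, and the sample is a shortened version of the text above.

```python
import re

# Assumption: a record begins with 'YYYY-MM-DD:'. Site names such as
# 'BS-440:' or 'Smith-019:' do not fit this shape, so they are not split points.
DATE_START = re.compile(r'\b\d{4}-\d{2}-\d{2}:')

def split_observations(text):
    """Return one string per record, each beginning at a date marker."""
    starts = [m.start() for m in DATE_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]

sample = (
    '2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. '
    'BS-443: 3 egg masses observed in vernal pool habitat. '
    '2017-01-01: 23 individuals observed. Egg masses were not present. '
    '2018-07-04: BS-440: All individuals took a break from breeding '
    'for the long holiday weekend.'
)

records = split_observations(sample)
for record in records:
    print(record)
```

Slicing between marker positions means periods, quotes, parentheses, and embedded 'sitename:' colons inside a record never need to be listed in a character class.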