Python: Using Regex to Capture Sub-Patterns within a Pattern

Question

Disclaimer: This is my first post. Feel free to give me feedback on how I should or shouldn't have formatted this question. Thanks!

I'm looking to pull out data from text blocks by capturing anything that matches a pattern of a date format followed by a colon. I have successfully used regular expressions to capture information including an observation date, a colon, and any text that follows up to the period before the next date.

For example:
1999-01-01: 10 birds observed.

The problem is that some of my data contains site names, each followed by a colon, inside the observation text that comes after the date and its colon. This sub-pattern of 'sitename: data' could occur zero or many times within the block following the observation date.

For example:
1999-01-01: BS-001: 5 birds observed. All in good health. BS-002: 5 birds observed, some in poor health.

What pattern should I use to capture all text after the date format and colon, including the potential site names, their colons, and related data up to the period before the next observation date?

I currently extract the simple observation data (without multiple sites within them) by date and observation using the following pattern:

pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')

The code above lets me pull out observation dates that could be in a variety of forms. Using periods as part of the pattern is tricky since observation data could be one or many sentences.
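To make the problem concrete, here is a small sketch that runs the pattern above against the two-site example from earlier in this question. The variable names `sample` and `matches` are only for illustration. Because the character class does not include a colon, the match appears to stop right before the colon that follows the first site name.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')

# The two-site example from the question (illustrative variable name).
sample = (
    "1999-01-01: BS-001: 5 birds observed. All in good health. "
    "BS-002: 5 birds observed, some in poor health."
)

# The character class has no ':' in it, so matching stops at the colon
# after the first site name instead of running to the end of the entry.
matches = pattern.findall(sample)
print(matches)
```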

Here is an example of the text I am trying to search and split out. Each new match should begin with an observation date, so in the data below there should be 3 matches returned (2013-04-13: data, 2017-01-01: data, and 2018-07-04: data):

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched. 2017-01-01: 23 individuals observed. Egg masses were not present. 2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.

Ideally the output would look like this:

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched.

2017-01-01: 23 individuals observed. Egg masses were not present.

2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.

glibdud · Accepted Answer · 2018-03-06 13:52:06Z

2

Basically, it sounds like you want to separate your text into fields that start with a date and end just before a date or the end of the text. Here's one possibility:

\d{4}-\d\d-\d\d:           # date with colon
.*?                        # the minimal amount of any characters required to match
(?=                        # positive lookahead (match text but don't consume it)
   \d{4}-\d\d-\d\d:        # date with colon
  |                        # or
   $                       # end of text
)                          # end lookahead

Use it in conjunction with re.findall():

re.findall(r'\d{4}-\d\d-\d\d:.*?(?=\d{4}-\d\d-\d\d:|$)', mytext)

Run against your sample text above:

['2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
  Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk
  old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing
  in the masses were AMJE-like). BS-443: 3 egg masses observed in
  vernal pool habitat. A few egg masses may have been missed due to
  poor light conditions. Smith-019: 250 egg masses observed in
  vernal pool habitat. Observer searched only portions abutting the 
  road (SW margin of pool). Many AMJE masses observed attached
  to herbaceous vegetation and difficult to differentiate from
  one another. AMJE egg-mass count is a rough estimate within
  area searched. ',
 '2017-01-01: 23 individuals observed. Egg masses were not present. ',
 '2018-07-04: BS-440: All individuals took a break from breeding for
  the long holiday weekend.']
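As a runnable sketch of the same idea, the commented pattern can be compiled directly with re.VERBOSE. A few things here are assumptions rather than part of the original answer: the re.DOTALL flag (added in case an observation spans several lines), the shortened sample text, the names `ENTRY`, `sample_text` and `entries`, and the final strip() that drops the trailing space each match otherwise keeps.

```python
import re

# Same pattern as above, written with re.VERBOSE so comments can stay inline.
# re.DOTALL is an assumption: it lets '.' cross line breaks.
ENTRY = re.compile(
    r"""
    \d{4}-\d\d-\d\d:         # date with colon
    .*?                      # as few characters as possible
    (?=                      # lookahead: stop just before ...
        \d{4}-\d\d-\d\d:     #   the next date with colon
      |                      #   or
        $                    #   the end of the text
    )
    """,
    re.VERBOSE | re.DOTALL,
)

# Shortened, made-up stand-in for the real observation text.
sample_text = (
    "2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
    "2017-01-01: 23 individuals observed. Egg masses were not present. "
    "2018-07-04: BS-440: All individuals took a break."
)

# Each match keeps the whitespace before the next date, so strip it off.
entries = [match.strip() for match in ENTRY.findall(sample_text)]

for entry in entries:
    print(entry)
```

Site names such as BS-443 stay inside their entry because only a full date followed by a colon ends a match.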

answered Mar 6, 2018 at 13:52


1 Comment

MrChancey Over a year ago

Thanks, there are a few great solutions here but your solution included an explanation that helped me understand the regex symbols more. Thank you!

Casimir et Hippolyte · Accepted Answer · 2018-03-06 13:54:45Z

2

You can try replacing every run of whitespace that is followed by a date with two newline characters:

s = re.sub(r'\s+(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)

This way you don't match the first date at the beginning of the string.

If you are not sure that each date is preceded by whitespace, you can also write it like this:

s = re.sub(r'\s*(?!^)(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)
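If a list is wanted instead of one reformatted string, the result of the substitution can be split on the blank line. This is only a sketch: the sample text is a shortened, made-up stand-in, and the names `DATE_AHEAD`, `separated` and `records` are illustrative.

```python
import re

# Lookahead for the date format, taken from the first substitution above.
DATE_AHEAD = r'(?=\d{4}-*\s*&*\d+-*\d*:)'

# Shortened, made-up stand-in for the real observation text.
sample_text = (
    "2013-04-13: BS-440: 10 egg masses observed. "
    "2017-01-01: 23 individuals observed. "
    "2018-07-04: BS-440: No breeding activity."
)

# Replace the whitespace in front of each later date with a blank line ...
separated = re.sub(r'\s+' + DATE_AHEAD, "\n\n", sample_text)

# ... then cut the string at those blank lines to get one record per date.
records = separated.split("\n\n")

for record in records:
    print(record)
```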

edited Mar 6, 2018 at 13:54

answered Mar 6, 2018 at 13:49


Srdjan M. · Accepted Answer · 2018-03-06 13:58:57Z

2

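You can use split() with a regex that matches a space followed by a lookahead for the date, " (?=\d{4}-\d{2}-\d{2})". Use a raw string so that \d is not treated as a string escape:

output = re.compile(r" (?=\d{4}-\d{2}-\d{2})").split(text)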
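A self-contained sketch of the split approach follows. Two details are assumptions that differ slightly from the line above: `\s+` is used instead of a single space so that tabs or line breaks before a date also work, and the lookahead includes the trailing colon. The sample text and the names `sample_text` and `records` are made up for illustration.

```python
import re

# Shortened, made-up stand-in for the real observation text.
sample_text = (
    "2013-04-13: BS-440: 10 egg masses observed. "
    "2017-01-01: 23 individuals observed. "
    "2018-07-04: BS-440: No breeding activity."
)

# Split on whitespace that is immediately followed by a date and a colon.
# The lookahead does not consume the date, so each record keeps its own date.
records = re.split(r"\s+(?=\d{4}-\d{2}-\d{2}:)", sample_text)

for record in records:
    print(record)
```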


answered Mar 6, 2018 at 13:58


1 Comment

MrChancey Over a year ago

Clever, I would not have thought to split on the date pattern but that makes sense since it is fairly consistent.

Roland W · Accepted Answer · 2018-03-06 13:48:56Z

1

You can use pattern.split:

pattern = re.compile(r'(\d{4}-\d{2}-\d{2})')
parts = pattern.split(string)

This yields

['', '2013-04-13', ': BS-440: 10 egg masses observed...', ...]

If the pattern contains capturing parentheses, their contents are interleaved with the split parts of the input string. Because the start of the string matches a date, the first part is empty. So ''.join(parts[1:3]) yields the first entry, ''.join(parts[3:5]) the second, and so on.
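The interleaved list can also be paired up into (date, text) rows, which may suit a table with a date field and a data field. This is a sketch under assumptions: the sample text is made up, and the names `sample_text` and `rows` are illustrative.

```python
import re

# Shortened, made-up stand-in for the real observation text.
sample_text = (
    "2013-04-13: BS-440: 10 egg masses observed. "
    "2017-01-01: 23 individuals observed. "
    "2018-07-04: BS-440: No breeding activity."
)

pattern = re.compile(r'(\d{4}-\d{2}-\d{2})')
parts = pattern.split(sample_text)

# parts[0] is the empty string before the first date. After that, dates sit
# at odd indices and the text that follows each date at even indices.
rows = [
    (date, text.lstrip(': ').strip())  # drop the leading ': ' and outer spaces
    for date, text in zip(parts[1::2], parts[2::2])
]

for date, text in rows:
    print(date, '|', text)
```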

answered Mar 6, 2018 at 13:48


1 Comment

MrChancey Over a year ago

Thank you, splitting the data into these parts may actually be useful if I end up putting them into a table with a date field, and a biological data field.
