
Disclaimer: This is my first post. Feel free to give me feedback on how I should or shouldn't have formatted this question. Thanks!

I'm looking to pull out data from text blocks by capturing anything that matches a pattern of a date format followed by a colon. I have successfully used regular expressions to capture information including an observation date, a colon, and any text that follows up to the period before the next date.

For example:
1999-01-01: 10 birds observed.

The problem that I am having is that some of my data contains site names followed by a colon within the observation data that follows that observation date and first colon. This sub-pattern of 'sitename: data' could occur zero or many times within the block following the observation date.

For example:
1999-01-01: BS-001: 5 birds observed. All in good health. BS-002: 5 birds observed, some in poor health.

What pattern should I use to capture all text after the date format and colon, including the potential site names, their colons, and related data up to the period before the next observation date?

I currently extract the simple observation data (without multiple sites within them) by date and observation using the following pattern:

pattern = re.compile(r'(\d\d\d\d\-*\s*\&*\d+\-*\d*:[A-Za-z0-9\s\,\(\)\;\"\-]*\.*)')  

The code above lets me pull out observation dates that could be in a variety of forms. Using periods as part of the pattern is tricky since observation data could be one or many sentences.

Here is an example of the text I am trying to search and split out. Each new match should begin with an observation date, so in the data below there should be 3 matches returned (2013-04-13: data, 2017-01-01: data, and 2018-07-04: data):

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched. 2017-01-01: 23 individuals observed. Egg masses were not present. 2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.

Ideally the output would look like this:

2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat. Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing in the masses were AMJE-like). BS-443: 3 egg masses observed in vernal pool habitat. A few egg masses may have been missed due to poor light conditions. Smith-019: 250 egg masses observed in vernal pool habitat. Observer searched only portions abutting the road (SW margin of pool). Many AMJE masses observed attached to herbaceous vegetation and difficult to differentiate from one another. AMJE egg-mass count is a rough estimate within area searched.

2017-01-01: 23 individuals observed. Egg masses were not present.

2018-07-04: BS-440: All individuals took a break from breeding for the long holiday weekend.


4 Answers


Basically, it sounds like you want to separate your text into fields that start with a date and end just before a date or the end of the text. Here's one possibility:

\d{4}-\d\d-\d\d:           # date with colon
.*?                        # the minimal amount of any characters required to match
(?=                        # positive lookahead (match text but don't consume it)
   \d{4}-\d\d-\d\d:        # date with colon
  |                        # or
   $                       # end of text
)                          # end lookahead

Use it in conjunction with re.findall():

re.findall(r'\d{4}-\d\d-\d\d:.*?(?=\d{4}-\d\d-\d\d:|$)', mytext)

Run against your sample text above:

['2013-04-13: BS-440: 10 egg masses observed in vernal pool habitat.
  Observer noted 3 of the AMJE masses had firm jelly, akin to a 3-wk
  old AMMA mass, but "bumpier" on outside (membrane and embryo-spacing
  in the masses were AMJE-like). BS-443: 3 egg masses observed in
  vernal pool habitat. A few egg masses may have been missed due to
  poor light conditions. Smith-019: 250 egg masses observed in
  vernal pool habitat. Observer searched only portions abutting the 
  road (SW margin of pool). Many AMJE masses observed attached
  to herbaceous vegetation and difficult to differentiate from
  one another. AMJE egg-mass count is a rough estimate within
  area searched. ',
 '2017-01-01: 23 individuals observed. Egg masses were not present. ',
 '2018-07-04: BS-440: All individuals took a break from breeding for
  the long holiday weekend.']
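
As a rough, runnable sketch of the commented pattern, it can be compiled with re.VERBOSE so the comments stay in the code. The sample text here is a shortened, made-up stand-in for the question's data, the names sample, record_pattern and records are only illustrative, and re.DOTALL is an added assumption so that records containing line breaks would still match:

```python
import re

# Shortened, made-up sample in the same shape as the question's data.
sample = (
    "2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
    "2017-01-01: 23 individuals observed. Egg masses were not present. "
    "2018-07-04: BS-440: All individuals took a break."
)

record_pattern = re.compile(
    r"""
    \d{4}-\d\d-\d\d:        # date with colon
    .*?                     # as few characters as possible
    (?=                     # lookahead: stop just before...
        \d{4}-\d\d-\d\d:    # ...the next date with colon
      |                     # or
        $                   # ...the end of the text
    )
    """,
    re.VERBOSE | re.DOTALL,  # DOTALL is an assumption: lets '.' cross newlines
)

# Each match keeps the whitespace before the next date, so strip it off.
records = [match.strip() for match in record_pattern.findall(sample)]
for record in records:
    print(record)
```

Because the pattern has no capturing groups, findall() returns the whole matched text for each record.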

1 Comment

Thanks, there are a few great solutions here but your solution included an explanation that helped me understand the regex symbols more. Thank you!

You can try replacing every run of whitespace that is followed by a date with two newline characters:

s = re.sub(r'\s+(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)

This way you don't match the first date at the beginning of the string.

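If you are unsure whether each date is preceded by whitespace, you can also write it like this. The leading (?<!\s) keeps Python 3.7+ from replacing a second, empty match immediately after the whitespace it has just consumed:

s = re.sub(r'(?<!\s)\s*(?!^)(?=\d{4}-*\s*&*\d+-*\d*:)', "\n\n", s)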
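
Here is a minimal sketch of the substitution idea on a shortened, made-up sample. It assumes every date is in strict YYYY-MM-DD form and is preceded by whitespace, so it uses a simpler date pattern than the one above; the names sample, date_ahead, separated and entries are illustrative only:

```python
import re

# Shortened, made-up sample in the same shape as the question's data.
sample = (
    "2013-04-13: BS-440: 10 egg masses observed. BS-443: 3 egg masses observed. "
    "2017-01-01: 23 individuals observed. Egg masses were not present. "
    "2018-07-04: BS-440: All individuals took a break."
)

# Whitespace that is immediately followed by "YYYY-MM-DD:".
# The lookahead checks for the date without consuming it.
date_ahead = r"\s+(?=\d{4}-\d{2}-\d{2}:)"

# Replace that whitespace with a blank line between observations.
separated = re.sub(date_ahead, "\n\n", sample)
print(separated)

# The blank lines also make it easy to get a list afterwards.
entries = separated.split("\n\n")
```

The first date is left alone because nothing precedes it, and site names such as BS-443 are untouched because they do not look like a date.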



You can use split() and regex (?=\d{4}-\d{2}-\d{2})

output = re.compile(r" (?=\d{4}-\d{2}-\d{2})").split(text)
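
One thing to watch with splitting on a bare date: a date mentioned inside the observation text would also start a new piece. The sketch below, on a made-up sample, compares the pattern above with a stricter variant that also requires the trailing colon. The stricter pattern and the names loose and strict are my own additions for illustration:

```python
import re

# Made-up sample in which the second observation mentions an earlier date.
sample = (
    "2013-04-13: BS-440: 10 egg masses observed. "
    "2017-01-01: 23 individuals observed, fewer than on 2013-04-13. "
    "2018-07-04: BS-440: No activity."
)

# Split on a space followed by any date (the pattern from the answer).
loose = re.compile(r" (?=\d{4}-\d{2}-\d{2})").split(sample)

# Split only where the date is followed by a colon, i.e. where it
# starts a new observation. \s+ also tolerates newlines or tabs.
strict = re.compile(r"\s+(?=\d{4}-\d{2}-\d{2}:)").split(sample)

print(len(loose), len(strict))
for record in strict:
    print(record)
```

If the data never mentions dates inside the observation text, both patterns give the same result.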


1 Comment

Clever, I would not have thought to split on the date pattern but that makes sense since it is fairly consistent.

You can use pattern.split:

pattern = re.compile(r'(\d{4}-\d{2}-\d{2})')
parts = pattern.split(string)

This yields

['', '2013-04-13', ': BS-440: 10 egg masses observed...', ...]

If the pattern contains capturing parentheses, their contents are interleaved with the split parts of the input string. Because the start of the string matches a date, the first part is empty. So ''.join(parts[1:3]) yields the first entry, ''.join(parts[3:5]) the second, and so on.
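
As a sketch of how the interleaved parts could be paired up into (date, data) rows, assuming a shortened, made-up sample: the colon is moved outside the capturing group here so it is dropped from both fields, which is a small change from the pattern above, and the names sample and rows are illustrative:

```python
import re

# Shortened, made-up sample in the same shape as the question's data.
sample = (
    "2013-04-13: BS-440: 10 egg masses observed. "
    "2017-01-01: 23 individuals observed. "
    "2018-07-04: BS-440: No activity."
)

# The date is captured, so split() keeps it; the colon is matched
# outside the group, so it is discarded.
pattern = re.compile(r"(\d{4}-\d{2}-\d{2}):")
parts = pattern.split(sample)

# parts[0] is whatever precedes the first date (empty here); after that
# the list alternates date, data, date, data, ...
rows = [(date, data.strip()) for date, data in zip(parts[1::2], parts[2::2])]

for date, data in rows:
    print(date, "|", data)
```

Each tuple then maps directly onto a table row with a date field and a data field.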

1 Comment

Thank you, splitting the data into these parts may actually be useful if I end up putting them into a table with a date field, and a biological data field.
