I have developed a regex pattern to parse bibliographies in scientific articles. We use the AMA citation style; a journal citation can look like this:

"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24(7): 3057-3067."

or without issue number:

"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24: 3057-3067."

or with only the first page (an electronic article number):

"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24(7): 3057."

or with only the volume number (if published ahead of print):

"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24."

My pattern matches all these cases and captures all the data (the backslashes are doubled because the pattern is a Java string literal):

(.*?)\\.(.*?)\\.(.*?)(?<year>\\d+)\\s*?;?\\s*?(?:(?<volume>\\d+))?(?:\\((?<issue>\\d+)\\))?\\s*?(?::\\s*?(?<fpage>\\d+|[A-Za-z]+\\d+))?(?:[\\-\\–](?<lpage>\\d+))?\\.

The problem is that authors consistently put whitespace around the hyphen between the first and last page numbers. Can this pattern be changed to match that as well?

"Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. Psychological distress, health, and socio-economic factors in caregivers of terminally ill patients: a nationwide population-based cohort study. Support Care Cancer. 2016; 24(7): 3057 - 3067."

Here is an example where you can see that the pattern matches this incorrectly.
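To make the mismatch concrete, here is a minimal Java sketch that runs the pattern above against the citation with and without spaces around the hyphen. The class name `SpacedRangeProblem`, the `describe` helper, and the `PREFIX` constant are made up for illustration, and the en dash is written as a Unicode escape so the source does not depend on file encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpacedRangeProblem {
    // The pattern from the question; only the en dash (U+2013) is spelled as an escape.
    static final Pattern ORIGINAL = Pattern.compile(
            "(.*?)\\.(.*?)\\.(.*?)(?<year>\\d+)\\s*?;?\\s*?(?:(?<volume>\\d+))?"
            + "(?:\\((?<issue>\\d+)\\))?\\s*?(?::\\s*?(?<fpage>\\d+|[A-Za-z]+\\d+))?"
            + "(?:[\\-\u2013](?<lpage>\\d+))?\\.");

    // Authors, title and journal shared by every example citation.
    static final String PREFIX =
            "Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. "
            + "Psychological distress, health, and socio-economic factors in caregivers "
            + "of terminally ill patients: a nationwide population-based cohort study. "
            + "Support Care Cancer. ";

    // Summarises the named groups; a group that did not take part in the match is null.
    static String describe(String citation) {
        Matcher m = ORIGINAL.matcher(citation);
        if (!m.find()) {
            return "no match";
        }
        return "year=" + m.group("year") + ", volume=" + m.group("volume")
                + ", fpage=" + m.group("fpage") + ", lpage=" + m.group("lpage");
    }

    public static void main(String[] args) {
        System.out.println(describe(PREFIX + "2016; 24(7): 3057-3067."));
        System.out.println(describe(PREFIX + "2016; 24(7): 3057 - 3067."));
    }
}
```

With the spaced range, the only way for the pattern to reach the final period is to let the lazy third group absorb everything up to the last number, so that number ends up in `year` and the other named groups stay unset.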

  • Doesn't it work to replace [\\-\\–] near the end with something like (?:\\-|\\–| \\- | \\– ) ? Commented Mar 30, 2017 at 13:46
  • Nope. It all goes to group 3. Commented Mar 30, 2017 at 13:51
  • Something like regex101.com/r/rUhFs0/1 .. just been playing around. Commented Mar 30, 2017 at 13:52
  • That could also be used, but it is not an ideal match for me. Commented Mar 30, 2017 at 14:08

1 Answer

The correct regex is:

(.*?)\.(.*?)\.(.*?)(?<year>\d+)\s*?;?\s*?(?:(?<volume>\d+))?(?:\((?<issue>\d+)\))?\s*?(?::\s*?(?<fpage>\d+|[A-Za-z]+\d+))?(?:[ ]*[\-\–][ ]*(?<lpage>\d+))?\.

It allows optional spaces on either side of the hyphen or en dash before the last page number, which fixes your issue. You can check it at https://regex101.com/r/RAdNgb/2.
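As a rough Java sketch of how this could be wired up (the class `CitationParser`, its `parse` helper, and the `PREFIX` constant are hypothetical names, not part of the question): the backslashes are doubled for the Java string literal, the en dash is written as a Unicode escape, and the character class lists just the hyphen and the en dash.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CitationParser {
    // Same pattern as above, with "[ ]*" on both sides of the page-range separator.
    static final Pattern CITATION = Pattern.compile(
            "(.*?)\\.(.*?)\\.(.*?)(?<year>\\d+)\\s*?;?\\s*?(?:(?<volume>\\d+))?"
            + "(?:\\((?<issue>\\d+)\\))?\\s*?(?::\\s*?(?<fpage>\\d+|[A-Za-z]+\\d+))?"
            + "(?:[ ]*[\\-\u2013][ ]*(?<lpage>\\d+))?\\.");

    // Authors, title and journal shared by the example citations from the question.
    static final String PREFIX =
            "Nielsen MK, Neergaard MA, Jensen AB, Bro F, Guldin MB. "
            + "Psychological distress, health, and socio-economic factors in caregivers "
            + "of terminally ill patients: a nationwide population-based cohort study. "
            + "Support Care Cancer. ";

    // Returns {year, volume, issue, fpage, lpage}; missing parts are null.
    // Returns null when the citation does not match at all.
    static String[] parse(String citation) {
        Matcher m = CITATION.matcher(citation);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group("year"), m.group("volume"), m.group("issue"),
            m.group("fpage"), m.group("lpage")
        };
    }

    public static void main(String[] args) {
        String[] tails = {
            "2016; 24(7): 3057-3067.",    // full citation
            "2016; 24: 3057-3067.",       // no issue number
            "2016; 24(7): 3057.",         // first page only
            "2016; 24.",                  // volume only (ahead of print)
            "2016; 24(7): 3057 - 3067."   // spaces around the hyphen
        };
        for (String tail : tails) {
            System.out.println(Arrays.toString(parse(PREFIX + tail)));
        }
    }
}
```

The sketch uses `find()` rather than `matches()`, so it would also accept trailing text after the final period; switch to `matches()` if each input string is expected to hold exactly one citation.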


3 Comments

An answer needs to be self-contained. If your link ever becomes stale, your answer is effectively useless. Include your solution in the answer itself.
@VGR: Done, buddy!
Yep, it works. I forgot that I can put a space inside square brackets to match the optional whitespace :)
