Regex - Get text between two strings

Question

I have a large text file which contains many abstracts (7k of them). I want to separate them. They have the following properties:

a number at the beginning, immediately followed by a period

123.

and it always ends in:

[PubMed - indexed for MEDLINE]

It would be even better if I can get the title and abstract out of the separated string. I am fine if I have to split the articles first then split the texts.

In the example the title is the third line:

Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.

The abstract is on the 8th line:

Cardiopulmonary bypass (CPB) causes reperfusion injury...

I have tried to use the following code for this text

Regex:

[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE

Text:

1. Br J Biomed Sci. 2015;72(3):93-101.

Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.

Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.

Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had 
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.

PMID: 26510263  [PubMed - indexed for MEDLINE]

Just split by [PubMed - indexed for MEDLINE] then. Can't tell you how, since you didn't tag by programming language.
– Amadan, Commented Nov 27, 2015 at 5:55
I just spent about 30 minutes trying to come up with a couple of regexes for you, but was unable to do so. I think you would be better off extracting the text using a language like Java or Python. Regular expressions are not the best solution for every problem.
– Tim Biegeleisen, Commented Nov 27, 2015 at 6:06
In addition, you cannot know how many lines (and therefore how many newline symbols) the abstract and body may have.
– Tim Biegeleisen, Commented Nov 27, 2015 at 6:07
In addition to what Amadan said, if you always have the same structure (journal-title-authors-abstract-id), you can split on 2+ consecutive newlines to get the title and abstract. DEMO
– Mariano, Commented Nov 27, 2015 at 7:22
Alternatively, here is a regex that should parse the whole input text.
– Wiktor Stribiżew, Commented Nov 27, 2015 at 8:59

Community · Accepted Answer · 2017-05-23 12:15:10Z

1

Two useful solutions have been proposed by Mariano and stribizhev:

Mariano's solution: Use the `split` method on the marker that ends every record

(?m)\[PubMed - indexed for MEDLINE\]$

DEMO : http://ideone.com/Qw5ss2

Requires Java 4+
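As a rough illustration of the split approach, here is a minimal Java sketch. It combines the end-marker pattern above with Mariano's comment about splitting each record on two or more consecutive newlines. The class and method names (`AbstractSplitter`, `splitRecords`, `titleAndAbstract`) are made up for this example, and the sample text is invented. It assumes every record has exactly the journal/title/authors/abstract/id layout, so records with extra blocks would need different indexes. The sketch uses generics for brevity, so it needs a newer Java than the pattern itself does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AbstractSplitter {
    // End-of-record marker; (?m) lets $ match at the end of any line.
    static final Pattern END_MARKER =
            Pattern.compile("(?m)\\[PubMed - indexed for MEDLINE\\]$");

    // Two or more consecutive line breaks separate the fields of a record.
    static final Pattern BLANK_LINES = Pattern.compile("(?:\\r?\\n){2,}");

    /** Splits the whole text into one trimmed string per record. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        for (String chunk : END_MARKER.split(text)) {
            String record = chunk.trim();
            if (!record.isEmpty()) {   // skip the leftover after the last marker
                records.add(record);
            }
        }
        return records;
    }

    /** Returns {title, abstract}; assumes journal/title/authors/abstract/id order. */
    static String[] titleAndAbstract(String record) {
        String[] fields = BLANK_LINES.split(record);
        if (fields.length < 5) {
            throw new IllegalArgumentException("Unexpected record layout");
        }
        return new String[] { fields[1], fields[3] };
    }

    public static void main(String[] args) {
        // Invented sample data in the same shape as the question's text.
        String text = "1. Journal A. 2015.\n\nTitle one.\n\nAuthor A.\n\nAbstract one.\n\n"
                + "PMID: 1  [PubMed - indexed for MEDLINE]\n\n"
                + "2. Journal B. 2015.\n\nTitle two.\n\nAuthor B.\n\nAbstract two.\n\n"
                + "PMID: 2  [PubMed - indexed for MEDLINE]\n";
        for (String record : splitRecords(text)) {
            String[] parts = titleAndAbstract(record);
            System.out.println("TITLE: " + parts[0]);
            System.out.println("ABSTRACT: " + parts[1]);
        }
    }
}
```

A title wrapped over several lines stays in one field here, because single newlines do not split a field.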

stribizhev's solution: Fully extract data from the text

(?m)^\s*\d+\..*\R{2}                 # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*)  # Get title
\R{2}                                # Get to the authors
[^\n]*(?:\n(?!\R)[^\R]*)*            # Consume authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) #Grab abstract

DEMO: https://regex101.com/r/sG2yQ2/2

Requires Java 8+
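The pattern above is written in free-spacing style, as tested on regex101. The following is a hedged Java adaptation, not stribizhev's exact code. It builds the pattern from string pieces so that Java comments replace the inline `#` comments, writes the spaces in the marker literally, and uses `\n` in place of `\R` inside the lookahead and the character class, since Java's `Pattern` may not accept `\R` in a character class. The class name `AbstractExtractor`, the method `extract`, and the sample text are invented for illustration, and `\n` line endings are assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbstractExtractor {
    static final Pattern RECORD = Pattern.compile(
            "(?m)^\\s*\\d+\\..*\\R{2}"                    // numbered journal line, then a blank line
            + "(?<title>[^\\n]*(?:\\n(?!\\n)[^\\n]*)*)"   // title, possibly wrapped over several lines
            + "\\R{2}"                                    // blank line before the authors
            + "[^\\n]*(?:\\n(?!\\n)[^\\n]*)*"             // authors: consumed, not captured
            + "(?<abstract>[^\\[]*(?:\\[(?!PubMed - indexed for MEDLINE\\])[^\\[]*)*)"); // up to the marker

    /** Returns one {title, abstract} pair per record found in the text. */
    static List<String[]> extract(String text) {
        List<String[]> results = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            // As in the original pattern, the abstract group runs up to the
            // marker, so it still ends with the "PMID: ..." text.
            results.add(new String[] { m.group("title").trim(), m.group("abstract").trim() });
        }
        return results;
    }

    public static void main(String[] args) {
        // Invented sample data; the first title is wrapped over two lines.
        String text = "1. Journal A. 2015.\n\nTitle one,\ncontinued.\n\nAuthor A.\n\nAbstract one.\n\n"
                + "PMID: 1  [PubMed - indexed for MEDLINE]\n\n"
                + "2. Journal B. 2015.\n\nTitle two.\n\nAuthor B.\n\nAbstract two.\n\n"
                + "PMID: 2  [PubMed - indexed for MEDLINE]\n";
        for (String[] pair : extract(text)) {
            System.out.println("TITLE: " + pair[0]);
            System.out.println("ABSTRACT: " + pair[1]);
        }
    }
}
```

Unlike the split approach, this walks the whole text with a single `Matcher`, so no intermediate list of record strings is needed.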

edited May 23, 2017 at 12:15

CommunityBot

answered Nov 27, 2015 at 9:41

Stephan


4 Comments

Wiktor Stribiżew Over a year ago

Why post others' solutions? If you do, explain why they work.

Stephan Over a year ago

@stribizhev Why are useful solutions buried in the comments? IMO, this is not their place. They deserve to be in an answer, not a comment.

Wiktor Stribiżew Over a year ago

If they work, OP would inform the user who commented, and then I or Mariano could post our solutions with explanations. If it does not work for OP - why post on others' behalf?

Stephan Over a year ago

@stribizhev OP didn't say if it works or not and the question would become one of the many questions without answers haunting SO. BTW, if a full answer is provided and OP accepts it, I will remove my answer.

Mariano · Accepted Answer · 2015-11-27 09:14:34Z

1

Try this:

"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"

The first group should capture the title, and the second the abstract.

edited Nov 27, 2015 at 9:14

Mariano

answered Nov 27, 2015 at 8:04

Dmitry

1 Comment

Mariano Over a year ago

You need to use two backslashes to write an escape in a Java string. In your regex, group 1 matches only the first line of the title, and group 2 also matches the authors and part of the ID. regex101.com/r/rT0tG1/1
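To make the comment concrete, here is a small Java sketch of the answer's pattern written as a string literal with doubled backslashes. The class name `EscapingDemo`, the helper `groups`, and the sample record are invented for this illustration. On a record whose title wraps onto a second line, group 1 should hold only the first title line and group 2 should start with the authors, which is the behaviour the comment describes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapingDemo {
    // The answer's pattern as a Java string literal: each regex backslash is
    // doubled so that a single backslash reaches the regex engine.
    static final Pattern ATTEMPT = Pattern.compile(
            "^[0-9]+\\..*\\s+(.*)\\s+.*\\s+((?:\\s|.)*?)\\[PubMed - indexed for MEDLINE\\]");

    /** Returns {group 1, group 2}, or an empty array when nothing matches. */
    static String[] groups(String record) {
        Matcher m = ATTEMPT.matcher(record);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Invented record with a title wrapped over two lines.
        String record = "1. Journal A. 2015.\n\nEffects of propofol\nand isoflurane.\n\n"
                + "Sayed S, Blann AD.\n\nCardiopulmonary bypass causes injury.\n\n"
                + "PMID: 1  [PubMed - indexed for MEDLINE]";
        String[] g = groups(record);
        System.out.println("GROUP 1: " + g[0]);
        System.out.println("GROUP 2: " + g[1]);
    }
}
```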
