I have a large text file which contains many abstracts (7k of them). I want to separate them. They have the following properties:

Each one starts with a number immediately followed by a period:

123.

and it always ends in:

[PubMed - indexed for MEDLINE]

It would be even better if I could also get the title and abstract out of each separated string. I am fine with splitting the articles first and then splitting each article's text.

In the example the title is the third line:

Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.

The abstract is on the 8th line:

Cardiopulmonary bypass (CPB) causes reperfusion injury...

I have tried to use the following code for this text

Regex:

[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE

Text:

1. Br J Biomed Sci. 2015;72(3):93-101.

Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.

Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.

Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had 
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.

PMID: 26510263  [PubMed - indexed for MEDLINE]
Comments
  • Just split by [PubMed - indexed for MEDLINE] then. Can't tell you how, since you didn't tag by programming language. Commented Nov 27, 2015 at 5:55
  • I just spent about 30 minutes trying to come up with a couple of regexes for you, but was unable to do so. I think you would be better extracting out the text using a language like Java or Python. Regular expressions are not the best solution for every problem. Commented Nov 27, 2015 at 6:06
  • In addition, you cannot know how many lines (and therefore how many newline symbols) the abstract and body may have. Commented Nov 27, 2015 at 6:07
  • In addition to what Amadan said, if you always have the same structure (journal-title-authors-abstract-id), you can split on 2+ consecutive newlines to get the title and abstract. DEMO Commented Nov 27, 2015 at 7:22
  • Alternatively, here is a regex that should parse the whole input text. Commented Nov 27, 2015 at 8:59
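
The blank-line split suggested in the comments could look roughly like the sketch below. It assumes each record has already been separated and always has the five blocks journal, title, authors, abstract, id, in that order. The class and method names (`RecordFields`, `titleAndAbstract`, `unwrap`) are made up for illustration.

```java
class RecordFields {
    // Returns {title, abstract} for one record.
    // Assumption: blocks are separated by one or more blank lines and appear
    // in the order journal, title, authors, abstract, id.
    static String[] titleAndAbstract(String record) {
        // A line break followed by at least one (possibly whitespace-only) empty line.
        String[] blocks = record.trim().split("\\r?\\n(?:[ \\t]*\\r?\\n)+");
        if (blocks.length < 4) {
            throw new IllegalArgumentException("Unexpected record layout");
        }
        return new String[] { unwrap(blocks[1]), unwrap(blocks[3]) };
    }

    // Joins hard-wrapped lines into a single line.
    static String unwrap(String s) {
        return s.trim().replaceAll("\\s*\\n\\s*", " ");
    }
}
```

Records that lack an abstract or contain extra blocks would need additional checks; this only covers the layout shown in the question.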

2 Answers

Two useful solutions have been proposed by Mariano and stribizhev:

Mariano's solution: Use the split method with the marker that ends every record

(?m)\[PubMed - indexed for MEDLINE\]$

DEMO : http://ideone.com/Qw5ss2

Java 4+
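
A minimal sketch of how that split might be used from Java, assuming the whole file has already been read into a string. The class and method names (`AbstractSplitter`, `splitRecords`) are hypothetical, and the code uses generics and the diamond operator, so as written it needs Java 7 or later even though the regex itself does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class AbstractSplitter {
    // The marker that ends every record; (?m) lets $ match at each line end.
    static final Pattern END_MARKER =
        Pattern.compile("(?m)\\[PubMed - indexed for MEDLINE\\]$");

    // Splits the full text into records, dropping whitespace-only leftovers.
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        for (String chunk : END_MARKER.split(text)) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);
            }
        }
        return records;
    }
}
```

Note that the marker itself is removed by the split, so each record ends with its `PMID: ...` text.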

stribizhev's solution: Fully extract data from the text

(?m)^\s*\d+\..*\R{2}                 # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*)  # Get title
\R{2}                                # Get to the authors
[^\n]*(?:\n(?!\R)[^\n]*)*            # Consume authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) #Grab abstract

DEMO: https://regex101.com/r/sG2yQ2/2

Java 8+
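
Here is a rough Java adaptation of the same idea, not the exact regex above. It assumes every record follows the journal-title-authors-abstract-id layout from the question, normalizes line endings to `\n` first so that plain `\n\n` can stand in for `\R{2}`, and stops the abstract at the `PMID:` line. The names `AbstractExtractor`, `extract` and `unwrap` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AbstractExtractor {
    static final Pattern RECORD = Pattern.compile(
          "(?m)^\\s*\\d+\\..*\\n\\n"                          // "123. Journal ..." line
        + "(?<title>[^\\n]*(?:\\n(?!\\n)[^\\n]*)*)"           // title: lines up to a blank line
        + "\\n\\n"
        + "[^\\n]*(?:\\n(?!\\n)[^\\n]*)*"                     // authors (not captured)
        + "\\n\\n"
        + "(?<abstract>[\\s\\S]*?)"                           // abstract, as short as possible
        + "\\n\\nPMID:.*\\[PubMed - indexed for MEDLINE\\]"); // end of record

    // Returns one {title, abstract} pair per record found in the text.
    static List<String[]> extract(String text) {
        String normalized = text.replace("\r\n", "\n");
        List<String[]> result = new ArrayList<>();
        Matcher m = RECORD.matcher(normalized);
        while (m.find()) {
            result.add(new String[] {
                unwrap(m.group("title")), unwrap(m.group("abstract")) });
        }
        return result;
    }

    // Joins hard-wrapped lines into a single line.
    static String unwrap(String s) {
        return s.trim().replaceAll("\\s*\\n\\s*", " ");
    }
}
```

Named groups require Java 7 or later. Records that deviate from the assumed layout (for example, no author line) will either be skipped or be captured incorrectly, so it is worth comparing the number of matches against the expected number of abstracts.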


4 Comments

Why post others' solutions? If you do, explain why they work.
@stribizhev Why are useful solutions buried in the comments? IMO, this is not their place. They deserve to be in an answer, not a comment.
If they work, OP would inform the user who commented, and then I or Mariano could post our solutions with explanations. If it does not work for OP - why post on others' behalf?
@stribizhev OP didn't say if it works or not and the question would become one of the many questions without answers haunting SO. BTW, if a full answer is provided and OP accepts it, I will remove my answer.

Try this:

"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"

First group would be title. Second would be abstract.

1 Comment

In Java you need to use 2 backslashes for each escape. In your regex, group 1 matches only the first line of the title, and group 2 also matches the authors and part of the id. regex101.com/r/rT0tG1/1
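
To illustrate both points from this comment, here is a small sketch that writes the answer's regex as a Java string literal (every regex backslash doubled) and exposes what the two groups capture. The class and method names (`EscapingDemo`, `groups`) are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapingDemo {
    // The regex from the answer; each regex backslash is written as \\ in Java source.
    static final Pattern ANSWER_REGEX = Pattern.compile(
        "^[0-9]+\\..*\\s+(.*)\\s+.*\\s+((?:\\s|.)*?)\\[PubMed - indexed for MEDLINE\\]");

    // Returns {group 1, group 2}, or null if the record does not match.
    static String[] groups(String record) {
        Matcher m = ANSWER_REGEX.matcher(record);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }
}
```

With a two-line title, group 1 holds only the first title line, and group 2 starts at the authors and runs through the `PMID:` text. The `(?:\s|.)*?` construct can also be slow or overflow the stack on long inputs in Java, so this is meant only to show the behaviour, not as a recommended pattern.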
