I would like to use bash on a file to extract text that lies between two strings. There are already some answers to this, e.g.:

Print text between two strings on the same line

But I would like to do this for multiple occurrences, sometimes on the same line, sometimes on separate lines. For example, starting with a file like this:

\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.

I would like to retrieve:

brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

I imagine sed should be able to do this but I'm not sure where to start.

It's always better to show enough context to give some perspective on the complexity of the problem. What anubhava showed when I commented was for a simpler input. I would probably use a marginally modified version of his (PCRE-enabled) grep command that puts the \parencite before the open brace, and then filter the output with sed to remove the unwanted material. Commented Jan 13, 2016 at 16:01

3 Answers

This two-stage extraction might be easier to understand, and it does not need Perl regex.

$ grep -o "parencite{[^}]*}" cite | sed 's/parencite{//;s/}//'
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Or, as always, awk to the rescue!

$ awk -F'[{}]' -v RS=" " '/parencite/{print $2}' cite
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963
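
The grep/sed pipeline above can be tried end to end with a small script. This is only a sketch: the function name `extract_keys` and the temporary sample file are made up for illustration, and the sample text is the one from the question.

```shell
#!/usr/bin/env bash
# Sketch only: extract_keys is a hypothetical helper name.

# Stage 1: grep -o prints each "parencite{...}" match on its own line.
# Stage 2: sed strips the leading "parencite{" and the trailing "}".
extract_keys() {
  grep -o 'parencite{[^}]*}' "$@" | sed 's/parencite{//;s/}//'
}

# Recreate the sample from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
\section{The rock outcrop pools experimental system} \label{intro:rockpools}
contain pools at their summit \parencite{brendonck_pools_2010} that have weathered into the rock over time \parencite{bayly_aquatic_2011} through chemical weathering after water collecting at the rock surface \parencite{lister_microgeomorphology_1973}.
Classification depends on dimensions \parencite{twidale_gnammas_1963}.
EOF

extract_keys "$sample"   # prints the four citation keys, one per line
rm -f "$sample"
```

Because the pattern requires the literal text `parencite{`, the `\section{...}` and `\label{...}` groups on the first line are skipped.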

Using grep -oP:

grep -oP '\\parencite\{\K[^}]+' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

Or using GNU awk:

awk -v FPAT='\\\\parencite{[^}]+' '{for (i=1; i<=NF; i++) {
    sub(/\\parencite{/, "", $i); print $i}}' file
brendonck_pools_2010
bayly_aquatic_2011
lister_microgeomorphology_1973
twidale_gnammas_1963

6 Comments

Thank you, this gets me some of the way. I updated the example because there are other things in the file with the {} string that I do not wish to print. Could you explain how grep is being told to use the "{" and "}"? When I use grep -oP 'parencite{\K[^}]+' file it isn't working...
@Shearn: which system are you on? Do you have GNU grep or another PCRE-enabled grep? You might need to escape the {, for example. You need to read the manual (depressingly) carefully. When you say "it isn't working", what are the symptoms? Complaints about the regex? Simply not returning anything? When you report 'not working', you need to be explicit — what you see may not be what others see.
Totally agreed with @JonathanLeffler, "isn't working" doesn't really tell us what is not working. Also, regarding your edited question, why are 2 values inside {...} not in output?
@Shearn { and } are special and need to be escaped: \\parencite\{\K[^}]+ or (?<=\\parencite\{).+?(?=\}) work with grep -oP
Sorry @JonathanLeffler and @anubhava for being vague. I am running Ubuntu 15.04. By 'not working' I meant that there was no output. @anubhava, the two values to which you refer are not intended to be in the output, only those between parencite{ and }. @glenn jackman, thank you for the explanation, grep -oP '\\parencite\{\K[^}]+' file works perfectly
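
The two escaped patterns mentioned in the comments can be compared side by side. This is a sketch that assumes a grep built with PCRE support (the -P option, as in GNU grep on Ubuntu); the function names `keys_with_k` and `keys_with_lookaround` and the one-line sample are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes grep supports -P (PCRE); function names are hypothetical.

# \K drops everything matched so far from the reported match,
# so only the text after "\parencite{" is printed.
keys_with_k() {
  grep -oP '\\parencite\{\K[^}]+' "$@"
}

# Lookbehind and lookahead assert the delimiters without consuming them;
# .+? is lazy, so the match stops at the first closing brace.
keys_with_lookaround() {
  grep -oP '(?<=\\parencite\{).+?(?=\})' "$@"
}

line='see \parencite{k1} and \textbf{bold} then \parencite{k2}.'
printf '%s\n' "$line" | keys_with_k            # prints k1 and k2
printf '%s\n' "$line" | keys_with_lookaround   # prints k1 and k2
```

In both patterns `\\` matches the literal backslash and `\{` the literal brace, so other brace groups such as `\textbf{bold}` are ignored.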

This might work for you (GNU sed):

sed '/\\parencite{\([^}]*\)}/!d;s//\n\1\n/;s/^[^\n]*\n//;P;D' file

Delete any lines that don't contain the required string. Surround the first occurrence with newlines and remove everything up to and including the first newline. Print up to and including the following newline, then delete what was printed and repeat on the rest of the line.
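
The same one-liner can be written as a commented multi-line script to follow the steps. This is a sketch that assumes GNU sed (it relies on `\n` in the replacement and inside a bracket expression); the function name `extract_keys_sed` and the one-line sample are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU sed; extract_keys_sed is a hypothetical name.
extract_keys_sed() {
  sed '
    # 1. Drop the pattern space unless it still holds a \parencite{...}.
    /\\parencite{\([^}]*\)}/!d
    # 2. An empty regex reuses the last one: wrap the first key in newlines.
    s//\n\1\n/
    # 3. Remove everything up to and including the first newline.
    s/^[^\n]*\n//
    # 4. Print up to the next newline, which is the key on its own.
    P
    # 5. Delete what was printed and rerun the script on the remainder
    #    without reading a new input line.
    D
  ' "$@"
}

printf '%s\n' 'a \parencite{k1} b \parencite{k2}.' | extract_keys_sed
# prints k1 and k2, one per line
```

The loop ends for each input line when step 1 finds no further `\parencite{...}` in what is left and deletes it.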
