Joining multiple lines based on a regex

Question

I have the output of a pandoc conversion to HTML which looks like this:

foo

bar

<blockquote>

That's one small step for man, one giant leap for mankind

A new line and another quote

</blockquote>

baz

I'd like to make it like this:

foo

bar<blockquote>That's one small step for man, one giant leap for mankind

A new line and another quote</blockquote>baz

(Because block quotes are rendered separately anyway so I don't need the extra new lines.)

I started trying with sed and ended up with this awk:

'/./ {printf "%s%s", $0, ($1 ~ /^$/ && $2 ~ /<\/?blockquote>/) ? OFS : ORS}'

Which does part of what I want, but is a bit too advanced for me to understand how to modify.

In words I think the rule I want is: if the next line is blank and the one after matches /<\/?blockquote>/, then print current line, next line, and the one after without any separators, and then move on.

Something tells me the awk is meant to work on more than just six lines of text. If that is the case, please explain exactly how the data should be handled.
– Hauke Laging, Commented Apr 29, 2023 at 21:11
There are tools for working with HTML and XML documents, but what you show is just a fragment that looks like neither HTML nor XML (the foo and bar can't be written like that; foo must be part of some other node, possibly a p node, while bar could be the value of the body node that you don't show, which, if this is HTML, ought to be part of a root html node).
– Kusalananda ♦, Commented Apr 29, 2023 at 21:47
@Kusalananda correct. This is after stripping out <p> tags. The full command I use to produce this is pandoc file.org -t html | gsed 's:</\?p>::g' | gsed 's:$:\n:g'
– Josh Friedlander, Commented Apr 29, 2023 at 22:00

Ed Morton · Accepted Answer · 2023-05-02 00:20:32Z

3

Using GNU awk for multi-char RS, RT, gensub(), and \s and without reading the whole file into memory at one time:

$ awk -v RS='\\s*</?blockquote>\\s*' '{ORS=gensub(/\s+/,"","g",RT)} 1' file
foo

bar<blockquote>That's one small step for man, one giant leap for mankind

A new line and another quote</blockquote>baz

edited May 2, 2023 at 0:20

answered May 1, 2023 at 13:50


Wow, that is some black belt awk! It didn't work with my system (macOS) awk, but with GNU awk (installed via Homebrew) it is perfect.
– Josh Friedlander, Commented May 1, 2023 at 18:15


Gilles Quénot · Accepted Answer · 2023-05-02 16:03:54Z

With a Perl one-liner:

Perl >= 5.36:

$ perl -gpe 's/(\w+)\n\n(<\/?blockquote\b[^\n]+)\s*\n/$1$2/g' file

Or Perl < 5.36:

$ perl -0777 -pe 's/(\w+)\n\n(<\/?blockquote\b[^\n]+)\s*\n/$1$2/g' file

foo<blockquote>That's one small step for man, one giant leap for mankind

A new line and another quote</blockquote>bar

-g or -0777 reads the whole file into memory
's///' is the substitution skeleton, exactly like sed
$1$2 are the two captured groups, like \1\2 with sed

The regular expression matches as follows:

Node	Explanation
`(`	group and capture to $1:
`\w+`	word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
`)`	end of $1
`\n`	'\n' (newline)
`\n`	'\n' (newline)
`(`	group and capture to $2:
`</?blockquote`	'<' + optional '/' + 'blockquote'
`\b`	word boundary anchor
`[^\n]+`	any character except: '\n' (newline) (1 or more times (matching the most amount possible))
`)`	end of $2
`\s*`	whitespace (\n, \r, \t, \f, and " ") (0 or more times (matching the most amount possible))
`\n`	'\n' (newline)

Many thanks. Such a succinct and powerful command! To fit my exact input I altered the first capturing group to (\S+:?\.?)
– Josh Friedlander, Commented Apr 29, 2023 at 22:17

Hauke Laging (edited by Cbhihe) · Accepted Answer · 2023-04-30 15:23:44Z

1

awk 'BEGIN { waiting_for_tag=1; };
     NF==0 { next; };
     $1 ~ "</?blockquote>" { printf "%s",$1; waiting_for_tag=0; next; };
     waiting_for_tag==1 { printf "%s",$0; next; }; 
     { printf "%s\n",$0; waiting_for_tag=1; }' input
foo<blockquote>That's one small step for man, one giant leap for mankind
A new line and another quote</blockquote>bar

edited Apr 30, 2023 at 15:23 by Cbhihe

answered Apr 29, 2023 at 21:26 by Hauke Laging

Thanks, this works really well except that it also strips blank lines not adjacent to a <blockquote> tag (for example, after "mankind").
– Josh Friedlander, Commented Apr 29, 2023 at 21:55
@joshFriedlander: All you need to do is modify HaukeLaging's answer by adding a newline (\n) in the last printf block: { printf "%s\n\n",$0;...}.
– Cbhihe, Commented Apr 30, 2023 at 15:17
@Cbhihe no, the NF==0 { next; } would still strip all blank lines and the other printfs would need some work too even if you did that.
– Ed Morton, Commented May 1, 2023 at 13:54
@EdMorton, yes, NF==0 {next} strips blank lines by not printing them, hence the suggested { printf "%s\n\n",$0;...} would print two newlines to compensate, but only for any 2nd citation following </?blockquote>. Anyway, we all agree that this solution is far from optimal, being ultra-specialized to a particular example format and length. Yours is more general (and arguably more readable) as it does not touch anything that is between the first and last non-empty records located inside \\s*</?blockquote>\\s*.
– Cbhihe, Commented May 1, 2023 at 14:33
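
The blank-line problem discussed in these comments can also be handled without GNU extensions. What follows is only a sketch, not the approach of any answer above. It assumes each blockquote tag sits alone on its line, and the function name join_blockquote_lines is made up for the example. Blank lines are held back instead of being skipped outright. They are printed again only when plain text follows plain text, and dropped when a tag is on either side.

```shell
# Hypothetical helper: reads stdin, writes stdout. Plain POSIX awk only.
join_blockquote_lines() {
    awk '
    # Blank line: remember it and decide later whether to print it.
    NF == 0 { blanks++; next }
    {
        is_tag = ($0 ~ /^[ \t]*<\/?blockquote>[ \t]*$/)
        if (is_tag)
            gsub(/[ \t]/, "")           # strip padding around the tag
        if (!is_tag && !prev_tag) {
            # Plain text after plain text: end the previous line and
            # restore the blank lines that were held back.
            if (seen) printf "\n"
            while (blanks-- > 0) printf "\n"
        }
        printf "%s", $0                 # no newline yet: a tag may follow
        prev_tag = is_tag; seen = 1; blanks = 0
    }
    END { if (seen) printf "\n" }
    '
}

# Usage with the sample from the question.
join_blockquote_lines <<'EOF'
foo

bar

<blockquote>

That's one small step for man, one giant leap for mankind

A new line and another quote

</blockquote>

baz
EOF
```

On the sample this should keep the blank line after "foo" and after "mankind" while gluing "bar" and "baz" to the tags. Only one record is held at a time, so the whole file is never read into memory.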

