I have the output of a pandoc conversion to HTML which looks like this:

foo

bar

<blockquote>

That's one small step for man, one giant leap for mankind

A new line and another quote

</blockquote>

baz

I'd like to make it like this:

foo

bar<blockquote>That's one small step for man, one giant leap for mankind

A new line and another quote</blockquote>baz

(Block quotes are rendered separately anyway, so I don't need the extra newlines.)

I started trying with sed and ended up with this awk:

'/./ {printf "%s%s", $0, ($1 ~ /^$/ && $2 ~ /<\/?blockquote>/) ? OFS : ORS}'

Which does part of what I want, but is a bit too advanced for me to understand how to modify.

In words I think the rule I want is: if the next line is blank and the one after matches /<\/?blockquote>/, then print current line, next line, and the one after without any separators, and then move on.
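
The rule described above can be sketched as a small line-by-line state machine. This is only a sketch under assumptions: the function name collapse_blockquote_gaps is made up for illustration, each blockquote tag is assumed to sit alone on its own line (as in the sample), and blank lines at the very start or end of the input are dropped. It uses only POSIX awk features, so it should not need GNU awk.

```shell
# Hypothetical helper: glue <blockquote> / </blockquote> lines to the
# text before and after them, keeping all other blank lines.
collapse_blockquote_gaps() {
  awk '
    # Blank line: do not print yet, just count it.
    /^[[:space:]]*$/ { blanks++; next }

    # Tag line: print it right after the previous text, discarding any
    # blank lines held back, and glue the next text line to it.
    /^<\/?blockquote>$/ {
      printf "%s", $0
      blanks = 0; glue = 1; seen = 1
      next
    }

    # Ordinary text line.
    {
      if (seen && !glue) {
        # End the previous line, then replay the blank lines held back.
        for (i = 0; i <= blanks; i++) printf "\n"
      }
      printf "%s", $0
      blanks = 0; glue = 0; seen = 1
    }

    # Terminate the last line.
    END { if (seen) printf "\n" }
  ' "$@"
}
```

It reads standard input when no file is given, so it could go at the end of an existing pipeline, for example `... | collapse_blockquote_gaps`.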

3
  • Something tells me the awk shall not work on six lines of text only. If that is the case then please explain exactly how to handle the data. Commented Apr 29, 2023 at 21:11
  • There are tools for working with HTML and XML documents, but what you show is just a fragment which doesn't look like either HTML or XML (the foo and bar can't be written like that; foo must be part of some other node, possibly a p node, while bar could be the value of the body node that you don't show, which, if this is HTML, ought to be part of a root html node). Commented Apr 29, 2023 at 21:47
  • @Kusalananda correct. This is after stripping out <p> tags. The full command I use to produce this is pandoc file.org -t html | gsed 's:</\?p>::g' | gsed 's:$:\n:g' Commented Apr 29, 2023 at 22:00

3 Answers 3

3

Using GNU awk for multi-char RS, RT, gensub(), and \s and without reading the whole file into memory at one time:

$ awk -v RS='\\s*</?blockquote>\\s*' '{ORS=gensub(/\s+/,"","g",RT)} 1' file
foo

bar<blockquote>That's one small step for man, one giant leap for mankind

A new line and another quote</blockquote>baz
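
To see why the one-liner works, here is a small sketch that prints what ends up in RT for each record once the surrounding whitespace is stripped. It assumes GNU awk is installed under the name gawk (on some systems it is plain awk), and show_terminators is a made-up name for illustration.

```shell
# Hypothetical helper: show the whitespace-stripped record terminator (RT)
# that GNU awk sees after each record when RS is the blockquote regexp.
show_terminators() {
  gawk -v RS='\\s*</?blockquote>\\s*' '
    # RT holds the exact text that matched RS, including the blank lines
    # around the tag; gensub() removes that whitespace.
    { printf "record %d ends with [%s]\n", NR, gensub(/\s+/, "", "g", RT) }
  ' "$@"
}
```

The stripped terminator is exactly what the answer assigns to ORS, so each record is printed followed by the bare tag instead of the tag plus its blank lines. The last record has an empty RT because nothing after it matches RS.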
1
  • Wow, that is some black belt awk! It didn't work with my system (macOS) awk, but with GNU awk (installed via homebrew) it is perfect. Commented May 1, 2023 at 18:15
2

With a Perl one-liner:

Perl >= 5.36:

$ perl -gpe 's/(\w+)\n\n(<\/?blockquote\b[^\n]+)\s*\n/$1$2/g' file

Or Perl < 5.36:

$ perl -0777 -pe 's/(\w+)\n\n(<\/?blockquote\b[^\n]+)\s*\n/$1$2/g' file
foo<blockquote>That's one small step for man, one giant leap for mankind

A new line and another quote</blockquote>bar

  • -g or -0777 reads the whole file into memory
  • s/// is the substitution operator, just like in sed
  • $1$2 are the two captured groups, like \1\2 in sed

The regular expression matches as follows:

Node              Explanation
(                 group and capture to $1:
  \w+             word characters (a-z, A-Z, 0-9, _), 1 or more times, as many as possible
)                 end of $1
\n                '\n' (newline)
\n                '\n' (newline)
(                 group and capture to $2:
  <\/?blockquote  '<' + optional '/' + 'blockquote'
  \b              word boundary anchor
  [^\n]+          any character except '\n' (newline), 1 or more times, as many as possible
)                 end of $2
\s*               whitespace (\n, \r, \t, \f, and " "), 0 or more times, as many as possible
\n                '\n' (newline)
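
As a side note, the same pattern can be written with a different delimiter so that the slash in the closing tag needs no escaping. This is only a sketch: join_blockquotes is a made-up function name, and it keeps the limitation that the line before a tag must end in a word character for (\w+) to match.

```shell
# Hypothetical helper: same substitution, written with s{...}{...} so the
# "/" in </blockquote> does not clash with the delimiter.
join_blockquotes() {
  # -0777 slurps the whole input so that \n\n can match across lines.
  perl -0777 -pe 's{(\w+)\n\n(</?blockquote\b[^\n]+)\s*\n}{$1$2}g' "$@"
}
```

With the default s/// delimiter, the slash inside the pattern has to be written as \/ instead.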
1
  • Many thanks. Such a succinct and powerful command! To fit my exact input I altered the first capturing group to (\S+:?\.?) Commented Apr 29, 2023 at 22:17
1
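awk 'BEGIN { waiting_for_tag = 1 }      # text before a tag is printed without a newline
     NF == 0 { next }                   # skip every blank line
     # tag line: print the tag with no newline; the next text line will end with one
     $1 ~ "</?blockquote>" { printf "%s", $1; waiting_for_tag = 0; next }
     # text while waiting for a tag: no newline, so a following tag attaches to it
     waiting_for_tag == 1 { printf "%s", $0; next }
     # first text line after a tag: end it with a newline and wait for the next tag
     { printf "%s\n", $0; waiting_for_tag = 1 }' input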
foo<blockquote>That's one small step for man, one giant leap for mankind
A new line and another quote</blockquote>bar
4
  • Thanks, this works really well except that it also strips blank lines not adjacent to a <blockquote> tag (for example, after "mankind"). Commented Apr 29, 2023 at 21:55
  • @joshFriedlander: All you need to do is modify HaukeLaging's answer by adding a newline (\n) in the last printf block: { printf "%s\n\n",$0;...}. Commented Apr 30, 2023 at 15:17
  • @Cbhihe no, the NF==0 { next; } would still strip all blank lines and the other printfs would need some work too even if you did that. Commented May 1, 2023 at 13:54
  • @EdMorton, yes, NF==0 {next} strips blank lines by not printing them, hence the suggested { printf "%s\n\n",$0;...} would print two newlines to compensate, but only for any 2nd citation following </?blockquote>. --- Anyway we all agree that this solution is far from optimal, being ultra-specialized to a particular example format and length. Yours is more general (and arguably more readable) as it does not touch anything that is between the first and last non-empty records located inside \\s*</?blockquote>\\s*. Commented May 1, 2023 at 14:33
