
I used the curl command to download an HTML file from homeoint.org/books/boericmm/d.htm and saved it as d.htm.

The relevant part looks like this:

      <p><font size="2"><a href="d/dam.htm" target="_top">DAM</a> ------&gt;
      DAMIANA (TURNERA)<br>
      <a href="d/daph.htm" target="_top">DAPH</a> ------&gt; DAPHNE INDICA<br>
      <a href="d/dig.htm" target="_top">DIG</a> ------&gt; DIGITALIS PURPUREA
      (DIGITALIS)<br>
      <a href="d/dios.htm" target="_top">DIOS</a> ------&gt; DIOSCOREA VILLOSA<br>
      <a href="d/diosm.htm" target="_top">DIOSM</a> ------&gt; DIOSMA LINCARIS<br>
      <a href="d/diph.htm" target="_top">DIPH</a> ------&gt; DIPHTHERINUM<br>
      <a target="_top" href="d/dol.htm">DOL</a> ------&gt; DOLICHOS PRURIENS
      (DOLICHOS PURIENS - MUCUNA)<br>
      <a href="d/dor.htm" target="_top">DOR</a> ------&gt; DORYPHORA
      DECEMLINEATA (DORYPHORA)<br>
      <a href="d/dros.htm" target="_top">DROS</a> ------&gt; DROSERA
      ROTUNDIFOLIA (DROSERA)<br>
      <a href="d/dubo-m.htm" target="_top">DUBO-M</a> ------&gt; DUBOISIA
      MYOPOROIDES (DUBOISIA)<br>
      <a href="d/dulc.htm" target="_top">DULC</a> ------&gt; DULCAMARA<br>
      &nbsp;</font></p>

I need to grep the text between

"&gt;" and "<br>"

The output I need is:

 DAMIANA (TURNERA)
 DAPHNE INDICA
 DIGITALIS PURPUREA (DIGITALIS)
 DIOSCOREA VILLOSA
 DIOSMA LINCARIS
 DIPHTHERINUM
 DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
 DORYPHORA DECEMLINEATA (DORYPHORA)
 DROSERA ROTUNDIFOLIA (DROSERA)
 DUBOISIA MYOPOROIDES (DUBOISIA)
 DULCAMARA

I am trying to use this grep command:

cat d.htm | grep -o -P '(?<=&gt; ).*(?=<br>)'

but my output is not complete (grep matches line by line, so the entries that wrap onto a second line in the HTML are never matched).


4 Answers


Use lynx to render the HTML into text, then sed to delete everything up to the last "> " on each line, printing only the lines on which that substitution was actually made:

$ lynx --dump 'http://homeoint.org/books/boericmm/d.htm' | sed -n 's/.*> //p'
DAMIANA (TURNERA)
DAPHNE INDICA
DIGITALIS PURPUREA (DIGITALIS)
DIOSCOREA VILLOSA
DIOSMA LINCARIS
DIPHTHERINUM
DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
DORYPHORA DECEMLINEATA (DORYPHORA)
DROSERA ROTUNDIFOLIA (DROSERA)
DUBOISIA MYOPOROIDES (DUBOISIA)
DULCAMARA

If you run into issues with lynx inserting line breaks, increase the width of the "page" from the default 80 to some higher number with --width (see the lynx manual).
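
For example (a sketch; the exact width is arbitrary, anything wider than the longest rendered line will do):

$ lynx --dump --width=1024 'http://homeoint.org/books/boericmm/d.htm' | sed -n 's/.*> //p'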


With GNU awk for multi-char RS:

awk -v RS='&gt;|<br>' '!(NR%2){$1=$1; print}' file
DAMIANA (TURNERA)
DAPHNE INDICA
DIGITALIS PURPUREA (DIGITALIS)
DIOSCOREA VILLOSA
DIOSMA LINCARIS
DIPHTHERINUM
DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
DORYPHORA DECEMLINEATA (DORYPHORA)
DROSERA ROTUNDIFOLIA (DROSERA)
DUBOISIA MYOPOROIDES (DUBOISIA)
DULCAMARA
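
To spell out what the one-liner does, here is the same logic written out with comments (assuming the page was saved as d.htm, as in the question):

awk -v RS='&gt;|<br>' '
    # With RS set to this alternation, GNU awk splits the input so that
    # the even-numbered records are exactly the text between a "&gt;"
    # and the following "<br>".
    !(NR % 2) {
        $1 = $1    # rebuilding the record collapses newlines and space runs
        print
    }
' d.htm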

Use tr to remove the line breaks (tr -d $'\n') and to squeeze repeated spaces (tr -s ' '); then you can easily grep:

curl 'http://www.homeoint.org/books/boericmm/d.htm' \
| tr -d $'\n' \
| tr -s ' ' \
| grep -Po '&gt; *\K[^<]*'

Output:

DAMIANA (TURNERA)
DAPHNE INDICA
DIGITALIS PURPUREA (DIGITALIS)
DIOSCOREA VILLOSA
DIOSMA LINCARIS
DIPHTHERINUM
DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
DORYPHORA DECEMLINEATA (DORYPHORA)
DROSERA ROTUNDIFOLIA (DROSERA)
DUBOISIA MYOPOROIDES (DUBOISIA)
DULCAMARA

(Your grep would also work after this preprocessing, but your .* is greedy; you need the non-greedy .*?, otherwise a single match runs from the first &gt; to the last <br> on the joined line.)
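
For instance, after the same tr preprocessing, the original lookaround pattern should work with a non-greedy quantifier:

curl 'http://www.homeoint.org/books/boericmm/d.htm' \
| tr -d $'\n' \
| tr -s ' ' \
| grep -Po '(?<=&gt; ).*?(?=<br>)'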


You can use Python + BeautifulSoup to parse the website.

This is not very beautiful, as this website's HTML code is worst practice, but it works.


Put this in a file called script.py:

#!/usr/bin/env python3
import re

import requests
from bs4 import BeautifulSoup


def parse(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # The remedy list is the fifth <p> on the page. Collapse all whitespace
    # runs to single spaces, then grab everything between an escaped "&gt; "
    # (str() re-escapes the ">" in the text) and the next tag.
    paragraph = " ".join(str(soup.find_all("p")[4]).split())
    for item in re.findall('(?<=&gt; )[^<]*', paragraph):
        print(item)


parse('http://homeoint.org/books/boericmm/d.htm')

To get all the pages (which is what I think you want to do), replace the last line with:

import string

for c in string.ascii_lowercase:
    parse('http://homeoint.org/books/boericmm/' + c + '.htm')

Then run it with python script.py or python3 script.py.

Of course, you need to have the third-party dependencies installed (requests and bs4; re is part of the standard library).
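
If they are missing, installing them with pip should be enough (beautifulsoup4 is the package that provides the bs4 module):

pip install requests beautifulsoup4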
