
I used the curl command to download an HTML file from homeoint.org/books/boericmm/d.htm and saved it as d.htm.

The relevant part looks like this:

      <p><font size="2"><a href="d/dam.htm" target="_top">DAM</a> ------&gt;
      DAMIANA (TURNERA)<br>
      <a href="d/daph.htm" target="_top">DAPH</a> ------&gt; DAPHNE INDICA<br>
      <a href="d/dig.htm" target="_top">DIG</a> ------&gt; DIGITALIS PURPUREA
      (DIGITALIS)<br>
      <a href="d/dios.htm" target="_top">DIOS</a> ------&gt; DIOSCOREA VILLOSA<br>
      <a href="d/diosm.htm" target="_top">DIOSM</a> ------&gt; DIOSMA LINCARIS<br>
      <a href="d/diph.htm" target="_top">DIPH</a> ------&gt; DIPHTHERINUM<br>
      <a target="_top" href="d/dol.htm">DOL</a> ------&gt; DOLICHOS PRURIENS
      (DOLICHOS PURIENS - MUCUNA)<br>
      <a href="d/dor.htm" target="_top">DOR</a> ------&gt; DORYPHORA
      DECEMLINEATA (DORYPHORA)<br>
      <a href="d/dros.htm" target="_top">DROS</a> ------&gt; DROSERA
      ROTUNDIFOLIA (DROSERA)<br>
      <a href="d/dubo-m.htm" target="_top">DUBO-M</a> ------&gt; DUBOISIA
      MYOPOROIDES (DUBOISIA)<br>
      <a href="d/dulc.htm" target="_top">DULC</a> ------&gt; DULCAMARA<br>
      &nbsp;</font></p>

I need to grep the text between

"&gt;" and "<br>"

The output I need is:

 DAMIANA (TURNERA)
 DAPHNE INDICA
 DIGITALIS PURPUREA (DIGITALIS)
 DIOSCOREA VILLOSA
 DIOSMA LINCARIS
 DIPHTHERINUM
 DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
 DORYPHORA DECEMLINEATA (DORYPHORA)
 DROSERA ROTUNDIFOLIA (DROSERA)
 DUBOISIA MYOPOROIDES (DUBOISIA)
 DULCAMARA

I am trying to use this grep command:

cat d.htm | grep -o -P '(?<=&gt; ).*(?=<br>)'

but my output is not complete (grep matches line by line, so the entries that wrap onto a second line in the HTML are never matched).


4 Answers


Use lynx to render the HTML into text, then sed to delete everything up to the last "> " on each line, printing only the lines on which that substitution was actually made:

$ lynx --dump 'http://homeoint.org/books/boericmm/d.htm' | sed -n 's/.*> //p'
DAMIANA (TURNERA)
DAPHNE INDICA
DIGITALIS PURPUREA (DIGITALIS)
DIOSCOREA VILLOSA
DIOSMA LINCARIS
DIPHTHERINUM
DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
DORYPHORA DECEMLINEATA (DORYPHORA)
DROSERA ROTUNDIFOLIA (DROSERA)
DUBOISIA MYOPOROIDES (DUBOISIA)
DULCAMARA

If you run into issues with lynx inserting line breaks, increase the width of the "page" from the default 80 to some higher number with --width (see the lynx manual).
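
For example (a sketch; the exact width is arbitrary, anything wider than the longest rendered line will do):

$ lynx --dump --width=1024 'http://homeoint.org/books/boericmm/d.htm' | sed -n 's/.*> //p'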


With GNU awk for multi-char RS:

awk -v RS='&gt;|<br>' '!(NR%2){$1=$1; print}' file
DAMIANA (TURNERA)
DAPHNE INDICA
DIGITALIS PURPUREA (DIGITALIS)
DIOSCOREA VILLOSA
DIOSMA LINCARIS
DIPHTHERINUM
DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
DORYPHORA DECEMLINEATA (DORYPHORA)
DROSERA ROTUNDIFOLIA (DROSERA)
DUBOISIA MYOPOROIDES (DUBOISIA)
DULCAMARA
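
To spell out what the one-liner does, here is the same logic written out with comments (assuming the page was saved as d.htm, as in the question):

awk -v RS='&gt;|<br>' '
    # With RS set to this alternation, GNU awk splits the input so that
    # the even-numbered records are exactly the text between a "&gt;"
    # and the following "<br>".
    !(NR % 2) {
        $1 = $1    # rebuilding the record collapses newlines and space runs
        print
    }
' d.htm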

Use tr to remove the line breaks (tr -d $'\n') and to squeeze repeated spaces (tr -s ' '); then you can easily grep:

curl 'http://www.homeoint.org/books/boericmm/d.htm' \
| tr -d $'\n' \
| tr -s ' ' \
| grep -Po '&gt; *\K[^<]*'

Output:

DAMIANA (TURNERA)
DAPHNE INDICA
DIGITALIS PURPUREA (DIGITALIS)
DIOSCOREA VILLOSA
DIOSMA LINCARIS
DIPHTHERINUM
DOLICHOS PRURIENS (DOLICHOS PURIENS - MUCUNA)
DORYPHORA DECEMLINEATA (DORYPHORA)
DROSERA ROTUNDIFOLIA (DROSERA)
DUBOISIA MYOPOROIDES (DUBOISIA)
DULCAMARA

(Your grep would also work after this preprocessing, but your .* is greedy; you need the non-greedy .*?, otherwise a single match runs from the first &gt; to the last <br> on the joined line.)
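
For instance, after the same tr preprocessing, the original lookaround pattern should work with a non-greedy quantifier:

curl 'http://www.homeoint.org/books/boericmm/d.htm' \
| tr -d $'\n' \
| tr -s ' ' \
| grep -Po '(?<=&gt; ).*?(?=<br>)'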


You can use Python + BeautifulSoup to parse the website.

This is not very beautiful, as this website's HTML code is worst practice, but it works.


Put this in a file called script.py:

#!/usr/bin/env python3
import re

import requests
from bs4 import BeautifulSoup


def parse(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    # The remedy list is the fifth <p> on the page. Collapse all whitespace
    # runs to single spaces, then grab everything between an escaped "&gt; "
    # (str() re-escapes the ">" in the text) and the next tag.
    paragraph = " ".join(str(soup.find_all("p")[4]).split())
    for item in re.findall('(?<=&gt; )[^<]*', paragraph):
        print(item)


parse('http://homeoint.org/books/boericmm/d.htm')

To get all the pages (which is what I think you want to do), replace the last line with:

import string

for c in string.ascii_lowercase:
    parse('http://homeoint.org/books/boericmm/' + c + '.htm')

Then run it with python script.py or python3 script.py.

Of course, you need to have the third-party dependencies installed (requests and bs4; re is part of the standard library).
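
If they are missing, installing them with pip should be enough (beautifulsoup4 is the package that provides the bs4 module):

pip install requests beautifulsoup4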
