0

For example, when filtering an HTML file in which every line follows this pattern:

<a href="xxxxxx" style="xxxx"><i>some text</i></a>

how can I get the content of href, and how can I get the text between <i> and </i>?

3
  • Use xmlstarlet. stackoverflow.com/questions/1732348/… Commented Dec 21, 2010 at 5:15
  • @Ignacio Vazquez-Abrams: Does xmlstarlet work with HTML too? Commented Dec 21, 2010 at 5:32
  • @Gumbo: You'd have to shove it through HTML Tidy first, but that's not too big a deal. And it's more a matter of the option not existing, not the underlying libraries being unable to handle it. Commented Dec 21, 2010 at 5:33

3 Answers

1

cut -d'"' -f2 file
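
The command above only covers the href value. For the text between <i> and </i>, the same idea can be sketched with > and < as delimiters. This assumes every line matches the question's pattern exactly; the field numbers and the variable names (line, href, text) are illustrative and tied to that pattern.

```shell
# Sample line from the question; in practice the input would come from the file.
line='<a href="xxxxxx" style="xxxx"><i>some text</i></a>'

# Field 2 when splitting on double quotes is the href value.
href=$(printf '%s\n' "$line" | cut -d'"' -f2)

# Field 3 when splitting on '>' is 'some text</i'; cutting on '<' drops the tail.
text=$(printf '%s\n' "$line" | cut -d'>' -f3 | cut -d'<' -f1)

printf '%s\n%s\n' "$href" "$text"
```

Any change in attribute order or nesting shifts the field positions, so this only holds for input as regular as the example.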

FYI: Just about every other HTML/regexp post on Stack Overflow explains why extracting values from HTML with anything other than an HTML parser is a bad idea. You may want to read some of those. This one, for example.


0

If href is always the second space-separated token in a line, then you can try:

grep "href" file | cut -d' ' -f2 | cut -d'=' -f2
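
Note that the pipeline above keeps the surrounding double quotes and does not return the <i> text. As a minimal sketch, assuming each line holds exactly one link in the question's form, a single sed substitution can capture both values. The function name extract_link is hypothetical, and the usual caveat applies: this is pattern matching, not HTML parsing.

```shell
# Hypothetical helper: prints "href<TAB>text" for each line that matches
# <a href="..." ...><i>...</i></a>; lines that do not match are skipped.
extract_link() {
  tab=$(printf '\t')   # literal tab, so the replacement does not rely on \t support
  sed -n "s|.*<a href=\"\([^\"]*\)\"[^>]*><i>\([^<]*\)</i></a>.*|\1${tab}\2|p" "$@"
}

# Usage: read from stdin, or pass file names as arguments.
printf '%s\n' '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' | extract_link
```

The tab-separated output can be split further with cut -f1 and cut -f2.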


0

Here's how to do it using xmlstarlet (optionally with tidy):

# extract content of href and <i>...</i>
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
xmlstarlet sel -T -t -m "//a" -v @href -n -v i -n

# using tidy & xmlstarlet
echo '<a href="xxxxxx" style="xxxx"><i>some text</i></a>' |
tidy -q -c -wrap 0 -numeric -asxml -utf8 --merge-divs yes --merge-spans yes 2>/dev/null | 
xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -T -t -m "//x:a" -v @href -n -v . -n
