I need to extract specific meta information from an HTML document using a Linux command.

For example, given an HTML document containing

<meta content="2017-12-26" name="lastmod"/>

I need to extract 2017-12-26 from this meta tag.

I have a set of articles in the 'test' folder, which I am iterating over to get the title and meta information.

I am able to get the title but not the meta information.

The code I am trying:

    DOC_FOLDER_PATH=test"/"

        for i in `find $DOC_FOLDER_PATH -type f -name "*.htm*"`
        do
          title_to_get=$(grep "<title>" $i | tail -1)
          title_to_get=$(echo $title_to_get | sed 's/<title>//g' | sed 's/<\/title>//g')
          echo "Title: "$title_to_get

          last_modify_date=$(grep "<meta name='lastmod' $i | tail -1)
          last_modify_date=$(echo $last_modify_date | sed 's/<meta//g' | sed 's/<\">>//g')
          echo 'content'$last_modify_date
        done

I am getting title_to_get but not last_modify_date. How can I get the last_modify_date?

I hope I have made the question clear. Please help me.

1 Answer

The attributes content and name may appear in either order in a meta tag. Your expression (<meta name='lastmod') expects name to come first and to use single quotes, while in your document it comes second and uses double quotes:

<meta content="2017-12-26" name="lastmod"/>

With sed you can check whether lastmod is present at all, and then pick just the value of the content attribute:

    echo '<meta content="2017-12-26" name="lastmod"/>' | sed -rn "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p"
    2017-12-26

So your code

    last_modify_date=$(grep "<meta name='lastmod' $i | tail -1)
    last_modify_date=$(echo $last_modify_date | sed 's/<meta//g' | sed 's/<\">>//g')

could be improved to

    last_modify_date=$(sed -rn "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p" "$i")
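Put together, the whole loop from the question could look roughly like the sketch below. It is only an illustration: the function names extract_title, extract_lastmod and print_meta_info are made up here, -E is used as the portable spelling of GNU sed's -r, and the title is assumed to sit on a single line.

```shell
# Sketch only: extract_title, extract_lastmod and print_meta_info are
# hypothetical helper names, not part of the original script.

# Print the text between <title> and </title> (last match wins),
# assuming both tags sit on the same line.
extract_title() {
    sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$1" | tail -n 1
}

# Print the content value of the lastmod meta tag (last match wins).
extract_lastmod() {
    sed -En "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p" "$1" | tail -n 1
}

# Walk a folder and report both values for every .htm/.html file.
print_meta_info() {
    find "$1" -type f -name '*.htm*' | while IFS= read -r file
    do
        echo "Title: $(extract_title "$file")"
        echo "Last modified: $(extract_lastmod "$file")"
    done
}
```

Calling print_meta_info test would then print both values for every matching file in the test folder. Reading the file names with while read, instead of a for loop over backticks, keeps paths that contain spaces intact.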

There are some pitfalls to mention:

Next time the date might be written as 2017/12/26, or in the classic continental form 26.12.2017, or in one of a zillion other formats, which the pattern [0-9-]+ would capture only partially or not at all.
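If other separators have to be tolerated, one option is to widen the character class. This is a minimal sketch, assuming the date consists only of digits separated by -, / or . characters. The function name extract_lastmod_any is hypothetical, and # is used as the delimiter of the s command so that / needs no escaping.

```shell
# Sketch only: extract_lastmod_any is a hypothetical helper name.
# Accepts digits separated by '-', '/' or '.', e.g. 2017-12-26,
# 2017/12/26 or 26.12.2017. It does not validate the date itself.
extract_lastmod_any() {
    sed -En "/<meta .*name=.lastmod./ s#.*content=.([0-9./-]+).*#\1#p" "$1"
}
```

Formats that contain letters or spaces, such as month names, would still need a different pattern.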

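The pattern ".([0-9-]+).*" is agnostic about single or double quotes, because the leading . matches any character, and might work flawlessly. You could restrict the possibilities for error further, such as an unquoted content=2017-12-26 (where the . would swallow the first digit), by allowing only ["'] in that position, but I don't know exactly how to escape these characters inside the shell quoting, so you will have to experiment.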
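One way to write the stricter variant is to keep the quote characters in a shell variable, so that they don't have to be escaped inside the sed script itself. This is a sketch under that assumption; q and extract_lastmod_quoted are names made up for the example.

```shell
# Sketch only: q and extract_lastmod_quoted are hypothetical names.
extract_lastmod_quoted() {
    # Bracket expression matching exactly one single or double quote.
    q="[\"']"
    # Require a quote on both sides of lastmod and of the date value.
    sed -En "/<meta .*name=${q}lastmod${q}/ s/.*content=${q}([0-9-]+)${q}.*/\1/p" "$1"
}
```

With this version an unquoted content=2017-12-26 prints nothing instead of a date with its first digit cut off. It does not check that the opening and closing quotes are of the same kind.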

If the tag is split across lines you're doomed, because sed works line by line:

<meta content="2017-12-26" 
      name="lastmod"/>
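A possible workaround for split tags is to join all lines first and then put every meta tag on a line of its own before running the sed command. The sketch below assumes a grep that supports -o (GNU and BSD grep do; it is not required by POSIX), and the function name extract_lastmod_multiline is hypothetical.

```shell
# Sketch only: extract_lastmod_multiline is a hypothetical helper name.
extract_lastmod_multiline() {
    # Turn every newline into a space so that a tag spread over several
    # lines becomes a single line, then print each <meta ...> tag on its
    # own line and extract the date as before.
    tr '\n' ' ' < "$1" |
        grep -o '<meta[^>]*>' |
        sed -En "/name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p"
}
```

This still treats a tag inside an HTML comment as a real one.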

And with comments, too:

<!-- that's no longer valid:
    <meta content="2017-12-26" 
          name="lastmod"/>
-->

But often it is sufficient to check your results, for example 'exactly one lastmod date shall be found', and to react to changes in the input format.
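Such a check could be sketched as follows. The function name lastmod_or_fail and the wording of the warning are assumptions made for this example.

```shell
# Sketch only: lastmod_or_fail is a hypothetical helper name.
# Prints the date and returns 0 only if exactly one lastmod date is found.
lastmod_or_fail() {
    matches=$(sed -En "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p" "$1")
    # grep -c . counts the non-empty lines, i.e. the number of dates found.
    # "|| true" keeps the zero-match case from aborting a script run with set -e.
    count=$(printf '%s\n' "$matches" | grep -c . || true)
    if [ "$count" -ne 1 ]; then
        echo "warning: expected exactly one lastmod date in $1, found $count" >&2
        return 1
    fi
    printf '%s\n' "$matches"
}
```

Inside the loop this could be used as last_modify_date=$(lastmod_or_fail "$i") || continue, so that files with zero or several dates are reported on stderr and skipped.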

Most HTML pages don't comply exactly with the standards, so using an XML parser might not work either. But have a look at xmlstarlet to see how to parse XML. It's very useful in general and might help with this problem too.

2 Comments

Thank you for your time. In <meta content="2017-12-26" name="lastmod"/>, the date is dynamic, so I need to get the content from the meta tag whatever its value is.
Well, I assumed so, and the sed command I introduced extracts it.
