How to extract specific meta information from html document using Regex

Question

I need to extract specific meta information from an HTML document using a Linux command.

For example, an HTML document contains:

<meta content="2017-12-26" name="lastmod"/>

I need to extract 2017-12-26 from this meta tag.

I have a set of articles in the 'test' folder, which I am iterating over to get the title and meta information.

I am able to get the title but not the meta information.

The code I am trying:

    DOC_FOLDER_PATH=test"/"

        for i in `find $DOC_FOLDER_PATH -type f -name "*.htm*"`
        do
          title_to_get=$(grep "<title>" $i | tail -1)
          title_to_get=$(echo $title_to_get | sed 's/<title>//g' | sed 's/<\/title>//g')
          echo "Title: "$title_to_get

          last_modify_date=$(grep "<meta name='lastmod' $i | tail -1)
          last_modify_date=$(echo $last_modify_date | sed 's/<meta//g' | sed 's/<\">>//g')
          echo 'content'$last_modify_date
        done

I am getting title_to_get but not last_modify_date. How can I get last_modify_date?

I hope I have made the question clear. Please help me.

user unknown · Accepted Answer · 2018-04-10 12:39:29Z

1

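The attributes content and name may appear in any order in the meta tag, but your expression (<meta name='lastmod') expects name='lastmod' to come directly after <meta, while in your document it comes second. It also expects single quotes, while your document uses double quotes: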

<meta content="2017-12-26" name="lastmod"/>

With sed you can first check whether lastmod is present on the line at all, and then pick out just the value of content:

echo '<meta content="2017-12-26" name="lastmod"/>'| sed -rn "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p"
2017-12-26

So your code

last_modify_date=$(grep "<meta name='lastmod' $i | tail -1)
last_modify_date=$(echo $last_modify_date | sed 's/<meta//g' | sed 's/<\">>//g')

could be improved to

 last_modify_date=$(sed -rn "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p" "$i")
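Put together with the loop from the question, this might look like the following sketch. The function name print_meta_info and the output labels are made up for illustration, and the title is extracted with sed instead of the original grep/echo chain. It assumes GNU sed (for -r) and file names without newlines.

```shell
# Hypothetical helper: print title and lastmod date for every *.htm* file
# below the folder given as $1 (the question uses "test/").
print_meta_info() {
  find "$1" -type f -name '*.htm*' | while IFS= read -r file; do
    # Take the text between <title> and </title> on the same line.
    title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$file" | tail -n 1)
    # The sed command from this answer: only lines with name=lastmod are used.
    lastmod=$(sed -rn "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p" "$file" | tail -n 1)
    printf 'Title: %s\n' "$title"
    printf 'Last modified: %s\n' "$lastmod"
  done
}

# Usage: print_meta_info test/
```

Reading the file names with while/read instead of for i in `find ...` keeps paths with spaces intact.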

There are some pitfalls to mention:

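Maybe next time the date is written as 2017/12/26, or in the classic continental form 26.12.2017, or in one of the zillion other formats. The character class [0-9-] would then capture only the part of the date before the first other character.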
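If such formats are expected, the character class can be widened. A minimal sketch, assuming that only digits, "-", "/" and "." occur in the date; the function name extract_lastmod_loose is made up, and "|" is used as the s delimiter so that "/" can appear in the class:

```shell
# Hypothetical helper: like the sed command above, but the date may also
# contain "/" and ".". Reads the files given as arguments, or stdin.
extract_lastmod_loose() {
  sed -rn "/<meta .*name=.lastmod./ s|.*content=.([0-9./-]+).*|\1|p" "$@"
}

# Usage: extract_lastmod_loose page.html
```

This does not validate the date; it only passes through whatever characters of that set follow content=.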

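The pattern ".([0-9-]+).*" is agnostic about single or double quotes, because the dot matches either, and might work flawlessly. But the dot also matches any other character, so an unquoted content=2017-12-26 would lose its first digit. You can restrict the possibilities for error further by using the bracket expression ["'] instead of the dot, but I don't know exactly how to escape these characters inside the shell quoting, so you will have to try it out.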
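One way to write the quote class, as a sketch: inside a double-quoted shell string a double quote is written as \" and a single quote needs no escaping, so the bracket expression becomes [\"']. The function name extract_lastmod_quoted is made up for illustration.

```shell
# Hypothetical helper: require a single or double quote around both values.
# Reads the files given as arguments, or stdin if there are none.
extract_lastmod_quoted() {
  sed -rn "/<meta .*name=[\"']lastmod[\"']/ s/.*content=[\"']([0-9-]+)[\"'].*/\1/p" "$@"
}

echo '<meta content="2017-12-26" name="lastmod"/>' | extract_lastmod_quoted
```

The pattern does not check that the opening and closing quote are of the same kind.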

With linebreaks you're doomed:

<meta content="2017-12-26" 
      name="lastmod"/>
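A possible workaround for this particular case, as a sketch only: join all lines into one, cut out each meta tag, and then apply the same sed expression. The function name extract_lastmod_multiline is made up, and grep -o is an extension (available in GNU and BSD grep), not POSIX.

```shell
# Hypothetical helper: $1 is the HTML file.
extract_lastmod_multiline() {
  # Turn every newline into a space so that a tag is never split,
  # then print each <meta ...> tag on a line of its own.
  tr '\n' ' ' < "$1" |
    grep -oE '<meta [^>]*>' |
    sed -rn "/name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p"
}

# Usage: extract_lastmod_multiline page.html
```

This only addresses line breaks inside a tag.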

And with comments, too:

<!-- that's not longer valid:
    <meta content="2017-12-26" 
          name="lastmod"/>
-->

But often it is sufficient to check your results, for example 'exactly one lastmod date shall be found', and to react to changes in the input format.
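Such a check could be sketched like this. The function name lastmod_or_fail and the wording of the message are made up for illustration; the sed command is the one from this answer.

```shell
# Hypothetical helper: print the lastmod date of the file $1, or complain
# on stderr and return 1 unless exactly one date was found.
lastmod_or_fail() {
  result=$(sed -rn "/<meta .*name=.lastmod./ s/.*content=.([0-9-]+).*/\1/p" "$1")
  # Count the non-empty lines; "|| true" because grep exits with 1 on zero matches.
  count=$(printf '%s' "$result" | grep -c . || true)
  if [ "$count" -ne 1 ]; then
    printf 'expected exactly one lastmod date in %s, found %s\n' "$1" "$count" >&2
    return 1
  fi
  printf '%s\n' "$result"
}

# Usage: last_modify_date=$(lastmod_or_fail "$i") || continue
```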

Most HTML pages don't comply exactly with the standards, so using an XML parser might not work either. But have a look at xmlstarlet to see how to parse XML. It's very useful in general and might help with this problem too.

edited Apr 10, 2018 at 12:39

answered Apr 10, 2018 at 11:52

user unknown


2 Comments

Vishnu Sharma Over a year ago

Thank you for your time. <meta content="2017-12-26" name="lastmod"/>. The date in this meta tag is dynamic, so I need to get the content from the meta tag based on this.

user unknown Over a year ago

Well, I assumed so, and the sed command I introduced extracts it.
