I need to extract specific meta information from an HTML document using Linux commands.
For example, given an HTML document containing
<meta content="2017-12-26" name="lastmod"/>
I need to extract 2017-12-26 from this meta tag.
I have a set of articles in the 'test' folder, which I am iterating over to get the title and meta information.
I am able to get the title but not the meta information.
The code I am trying:
DOC_FOLDER_PATH=test"/"
for i in `find $DOC_FOLDER_PATH -type f -name "*.htm*"`
do
title_to_get=$(grep "<title>" $i | tail -1)
title_to_get=$(echo $title_to_get | sed 's/<title>//g' | sed 's/<\/title>//g')
echo "Title: "$title_to_get
last_modify_date=$(grep "<meta name='lastmod' $i | tail -1)
last_modify_date=$(echo $last_modify_date | sed 's/<meta//g' | sed 's/<\">>//g')
echo 'content'$last_modify_date
done
I am getting title_to_get but not last_modify_date. How can I get last_modify_date?
I hope I have made the question clear. Please help me.
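
For reference, here is a minimal sketch of one way the date could be pulled out. Two things in the loop above stand out: the grep pattern on the last_modify_date line opens a double quote that is never closed, and it searches for `<meta name='lastmod'` (single quotes, name first), while the sample tag uses double quotes and puts content before name. The sketch matches the tag regardless of attribute order. The function names get_lastmod and print_lastmods are hypothetical, and it assumes each meta tag sits on a single line with double-quoted attributes and that grep supports -o (GNU grep does). A regex-based approach like this is fragile on arbitrary HTML, so treat it as a starting point rather than a robust parser.

```shell
#!/bin/sh
# Sketch only: get_lastmod and print_lastmods are hypothetical helper names.
# Assumes each <meta> tag sits on one line and uses double-quoted attributes,
# as in the sample tag, and that grep supports -o (GNU grep does).

# Print the content value of the last <meta ... name="lastmod" ...> tag in file $1.
get_lastmod() {
    # grep -o isolates just the matching tag, whatever the attribute order;
    # sed then prints only the text between content=" and the next quote.
    grep -o '<meta[^>]*name="lastmod"[^>]*>' "$1" \
        | tail -n 1 \
        | sed -n 's/.*content="\([^"]*\)".*/\1/p'
}

# Print "file: date" for every *.htm* file under directory $1.
print_lastmods() {
    # Reading find's output line by line keeps file names with spaces intact.
    find "$1" -type f -name '*.htm*' | while IFS= read -r file
    do
        echo "$file: $(get_lastmod "$file")"
    done
}
```

With these definitions, calling `print_lastmods test/` would print one line per matching file, and `last_modify_date=$(get_lastmod "$i")` could replace the two last_modify_date lines inside the existing loop.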