1

I've got a text file containing the HTML source of a web page. Some of its lines contain data-adid="...", and these are the lines I'd like to capture. To do that, I use:

Id=$(grep -m 10 -A 1 "data-adid" Textfile)

to get the first ten results. The variable Id contains the following:

<article class="aditem" data-adid="1234567890" <div class="aditem-image"> --
<article class="aditem" data-adid="2134567890" <div class="aditem-image"> --
<article class="aditem" data-adid="2134567890" <div class="aditem-image"> --
...

I would like to get the following output:

id="1234567890" id="2134567890" id="3124567890"

When using the grep command, I only manage to get the numbers, e.g.

Id2=$(echo $Id | grep -oP '(?<=data-adid=").*?(?=")')

gets 1234567890 2134567890 3124567890

When trying

Id2=$(echo $Id | grep -oP '(?<=data-ad).*?(?=")')

this will only give me id= id= id=

How could the code be changed to get the desired output?

6
  • Could you please fix your sample input: remove the "...." and post clearer input, so that we can better understand your question? Commented Sep 8, 2020 at 21:10
  • Also, could you please share a sample of your Input_file, or explain how you are creating the variable? If possible, we could read the Input_file directly and get the values rather than using variables. Some more information would help us help you here. Commented Sep 8, 2020 at 21:14
  • @RavinderSingh13 I edited the question, hoping that things are clearer now. Commented Sep 8, 2020 at 21:22
  • "the html-source of a web page": then use an HTML-aware (i.e. XML-aware) tool to extract the data, such as xmllint or xmlstarlet. Commented Sep 8, 2020 at 21:22
  • @KamilCuk I curled the web page; until now it worked fine for extracting both links and dates. Would it be easier using xmllint or xmlstarlet? I hadn't heard of them until now. Commented Sep 8, 2020 at 21:24

3 Answers

2

HTML should really be handled with tools that understand HTML well, but since the OP asks for shell tools, I would go with awk for this one. Written and tested at https://ideone.com/EpU1aW

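echo "$var" |
awk '
# Act only on lines that contain a data-adid="..." attribute.
match($0,/data-adid="[^"]*"/){
  val=substr($0,RSTART,RLENGTH)   # the matched text, e.g. data-adid="123"
  sub(/^data-ad/,"",val)          # drop the data-ad prefix, leaving id="..."
  print val
  val=""
}
'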
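
As a commenter suggested, awk can also read the file directly instead of going through a variable. The following is only a sketch under assumptions: the temporary file stands in for the question's Textfile, the sample lines are made up, and first_ids is a hypothetical helper name. It stops after a maximum number of matches (ten by default, like grep -m 10) and prints the results on one line.

```shell
# Hypothetical sample file standing in for the question's Textfile.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
<article class="aditem" data-adid="1234567890"
<div class="aditem-image">
<article class="aditem" data-adid="2134567890"
<div class="aditem-image">
EOF

# first_ids FILE [MAX] is an illustrative helper, not part of the answer above:
# it prints up to MAX (default 10) id="..." values, separated by spaces.
first_ids() {
  awk -v max="${2:-10}" '
    match($0, /data-adid="[^"]*"/) {
      # Skip the 7 characters of "data-ad" so that id="..." remains.
      printf "%s%s", (n++ ? " " : ""), substr($0, RSTART + 7, RLENGTH - 7)
      if (n >= max) exit
    }
    END { if (n) print "" }
  ' "$1"
}

ids=$(first_ids "$tmpfile")
printf '%s\n' "$ids"
rm -f "$tmpfile"
```

Like the original script, this picks up only the first data-adid attribute on each line.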

2

The lookbehind matches only data-ad, so the lazy match stops at the first ". Match the id= part too, including the opening " and everything up to the next ". I also see no reason to use fancy lookarounds: just match the string and output only the matched part.

grep -oP 'data-ad\Kid="[^"]*"'

That should be enough. Note that an unquoted $Id undergoes word splitting and most probably should be quoted. Also, HTML cannot be parsed reliably with regular expressions, so you should most probably use HTML-aware tools instead.
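
A minimal sketch of the quoting point, assuming GNU grep (needed for -P). The sample value of Id is made up to resemble the question's data, and extract_ids is a hypothetical helper name used only for illustration.

```shell
# Hypothetical sample data modelled on the question's grep output.
Id='<article class="aditem" data-adid="1234567890" <div class="aditem-image">
--
<article class="aditem" data-adid="2134567890" <div class="aditem-image">'

# extract_ids is an illustrative helper, not part of the answer above.
# Quoting "$1" keeps the line breaks intact; grep -o prints one match per
# line, and paste joins those lines with single spaces.
extract_ids() {
  printf '%s\n' "$1" | grep -oP 'data-ad\Kid="[^"]*"' | paste -sd ' ' -
}

Id2=$(extract_ids "$Id")
printf '%s\n' "$Id2"
```

The paste step is only there because the question asks for all values on a single line; without it, grep -o prints one match per line.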


0

With any sed:

$ sed 's/.*data-ad\(id="[^"]*"\).*/\1/' file
id="1234567890"
id="2134567890"
id="2134567890"
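
If the input also contains lines without data-adid (for example the context lines and "--" separators that grep -A 1 produces), the command above passes them through unchanged. A possible variant, sketched here with made-up sample input, combines -n with the p flag so that only substituted lines are printed:

```shell
# Hypothetical input resembling grep -A 1 output, including the "--"
# separators and context lines that carry no data-adid attribute.
input='<article class="aditem" data-adid="1234567890"
<div class="aditem-image">
--
<article class="aditem" data-adid="2134567890"
<div class="aditem-image">'

# -n suppresses automatic printing; the p flag prints only the lines on
# which the substitution succeeded, so non-matching lines are dropped.
result=$(printf '%s\n' "$input" | sed -n 's/.*data-ad\(id="[^"]*"\).*/\1/p')
printf '%s\n' "$result"
```

Both -n and the p flag are standard sed features, so this should still work "with any sed".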
