extract certain string from variable

Question

I've got a text file containing the HTML source of a web page. It contains lines with data-adid="...", and I'd like to capture these lines. To do so, I use:

Id=$(grep -m 10 -A 1 "data-adid" Textfile)

to get the first ten results. The variable Id contains the following:

<arcicle class="aditem" data-adid="1234567890" <div class="aditem-image"> -- 
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
<arcicle class="aditem" data-adid="2134567890" <div class="aditem-image"> --
...

I would like to get the following output:

id="1234567890" id="2134567890" id="3124567890"

When using the grep command, I only manage to get the numbers, e.g.

Id2=$(echo $Id | grep -oP '(?<=data-adid=").*?(?=")')

gets 1234567890 2134567890 3124567890

When trying

Id2=$(echo $Id | grep -oP '(?<=data-ad).*?(?=")')

this will only give me id= id= id=

How could the code be changed to get the desired output?

Could you please fix your sample input: remove the "...." and post clearer input, so that we can better understand your question?
– RavinderSingh13, Commented Sep 8, 2020 at 21:10
Also, could you please share a sample of your Input_file, or show how you are creating the variable? If possible, we could read the Input_file directly and get the values rather than using variables. Any additional information will help us help you.
– RavinderSingh13, Commented Sep 8, 2020 at 21:14
@RavinderSingh13 I edited the question, hoping that things are clearer now.
– X3nion, Commented Sep 8, 2020 at 21:22
"the html-source of a web page": then use an HTML-aware (i.e. XML-aware) tool to extract the data, such as xmllint or xmlstarlet.
– KamilCuk, Commented Sep 8, 2020 at 21:22
@KamilCuk I curled the web page; until now it worked fine for extracting both links and dates. Would it be easier using xmllint or xmlstarlet? I hadn't heard of them until now.
– X3nion, Commented Sep 8, 2020 at 21:24

RavinderSingh13 · Accepted Answer · 2020-09-08 21:48:45Z

2

HTML is best handled with tools that understand it, but since the OP asks for a solution using shell tools, I would go with awk for this one. Written and tested at https://ideone.com/EpU1aW

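echo "$var" |
awk '
match($0,/data-adid="[^"]*"/){         # line contains data-adid="..."
  val=substr($0,RSTART,RLENGTH)        # keep only the matched text
  sub(/^data-ad/,"",val)               # drop the leading data-ad
  print val                            # prints id="..."
  val=""
}
'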
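
The question asks for all values on one line, while the awk command prints one per line. A minimal sketch of one way to join them, assuming the variable is named Id as in the question; the sample data below is made up for illustration, and paste is only one of several ways to join lines:

```shell
# Sample data standing in for the grep result (illustrative values only).
Id='<article class="aditem" data-adid="1234567890" <div class="aditem-image">
--
<article class="aditem" data-adid="2134567890" <div class="aditem-image">'

# Extract id="..." from each matching line, then join the lines with spaces.
ids=$(printf '%s\n' "$Id" |
  awk 'match($0, /data-adid="[^"]*"/) {
         val = substr($0, RSTART, RLENGTH)   # data-adid="..."
         sub(/^data-ad/, "", val)            # strip the leading data-ad
         print val
       }' |
  paste -s -d ' ' -)

printf '%s\n' "$ids"
```

With this sample input the last line prints id="1234567890" id="2134567890".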

edited Sep 8, 2020 at 21:48

answered Sep 8, 2020 at 21:32


KamilCuk · Accepted Answer · 2020-09-08 21:28:01Z

2

The pattern data-ad matches only data-ad. Match the id= part too: an opening ", followed by everything up to the next ". I also see no reason to use fancy lookarounds; just match the whole string and output only the part you want.

grep -oP 'data-ad\Kid="[^"]*"'

That should be enough. Note that the unquoted $Id undergoes word splitting and most probably should be quoted. Also, HTML cannot be parsed reliably with regex, so you should most probably use HTML-aware tools instead.
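
The effect of quoting can be checked with a small experiment. This is a sketch under assumptions: the sample data is invented for illustration, the shell is bash or another POSIX sh, and grep is GNU grep built with PCRE support (needed for -P and \K):

```shell
# Sample data standing in for the grep result (illustrative values only).
Id='<article class="aditem" data-adid="1234567890" <div class="aditem-image">
--
<article class="aditem" data-adid="2134567890" <div class="aditem-image">'

# Unquoted: word splitting turns the newlines into single spaces.
unquoted_lines=$(echo $Id | wc -l)

# Quoted: the original line structure is preserved.
quoted_lines=$(printf '%s\n' "$Id" | wc -l)

# \K discards everything matched so far, so only id="..." is printed.
Id2=$(printf '%s\n' "$Id" | grep -oP 'data-ad\Kid="[^"]*"')
printf '%s\n' "$Id2"
```

Here the unquoted expansion yields a single line and the quoted one yields three. The grep result is the same in this case, but quoting keeps the input intact for line-oriented tools.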

answered Sep 8, 2020 at 21:28


Ed Morton · Accepted Answer · 2020-09-08 22:32:23Z

0

With any sed:

$ sed 's/.*data-ad\(id="[^"]*"\).*/\1/' file
id="1234567890"
id="2134567890"
id="2134567890"
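
One caveat, in case the input also contains lines without data-adid (for example the -- separators that grep -A 1 emits): sed prints such lines unchanged by default. A minimal sketch that suppresses them with -n and the p flag, using invented sample data:

```shell
# Sample data including a non-matching separator line (illustrative values only).
input='<article class="aditem" data-adid="1234567890" <div class="aditem-image">
--
<article class="aditem" data-adid="2134567890" <div class="aditem-image">'

# -n turns off automatic printing; the p flag prints only lines
# on which the substitution succeeded, so the "--" line is dropped.
ids=$(printf '%s\n' "$input" | sed -n 's/.*data-ad\(id="[^"]*"\).*/\1/p')
printf '%s\n' "$ids"
```

This uses only basic regular expressions, so it should behave the same in any POSIX sed.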

answered Sep 8, 2020 at 22:32
