I am currently trying to extract URLs embedded in a call-to-action button within videos on Twitter. An example:

Twitter Video

When utilising Chrome Inspect, I can relatively easily spot what I'm after:

[Screenshot: Chrome Inspect view with the call-to-action link highlighted]

Now I'm trying to scrape that highlighted link in Python. I couldn't find any way to get it from the Twitter API, so I switched to BeautifulSoup. However, when I search the page for links, the one I'm after does not show up:

In[23]: url = "https://amp.twimg.com/v/a693e53f-a6a3-4ff1-b06e-7c5402db0e06"
In[24]: resp = requests.get(url).content 
In[25]: soup = BeautifulSoup(resp, 'lxml') 
In[26]: soup.find_all('a')
Out[26]: 
[<a href="https://twitter.com/unibet" target="_blank">@unibet</a>,
<a class="download-btn" id="app-download"><img id="whiteLogo"      
src="https://amp.twimg.com/amplify-web-player/prod/styles/img/twitter_logo_white.png"/></a>]

Any idea what I could do to extract that embedded URL? Any help is much appreciated!

1 Answer

The data is created dynamically via an Ajax request. You can pull the URL of the XML from the original page's meta tag with name="twitter:amplify:vmap", then request that data, which is XML like:

<?xml version="1.0" encoding="utf-8"?>
<vmap:VMAP xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns:tw="http://twitter.com/schema/videoVMapV2.xsd" xmlns:vmap="http://www.iab.net/vmap-1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="vast3.xsd">
<vmap:Extensions>
<vmap:Extension>
<tw:amplify>
<tw:content contentId="745543706946658305" ownerId="143820595" stitched="false">
<tw:cta_watch_now url="https://www.unibet.co.uk/stan/campaign.do?cmpId=1042109&amp;affiliateId=52&amp;affId=5211000020&amp;adID=LINC_E2_T9&amp;unibetTarget=/luckisnocoincidence"/>
<MediaFiles>
<MediaFile>
              http://amp.twimg.com/prod/multibr_v_1/video/2016/06/22/09/745543706946658305-libx264-main-2028k.mp4?5LiXscTGA2BYvqh2cKP8uTkru1N%2Fj8exRYhB9PbbFpM%3D
            </MediaFile>
</MediaFiles>
<tw:videoVariants>
<tw:videoVariant content_type="application/x-mpegURL" url="https://video.twimg.com/amplify_video/745543706946658305/pl/st7wblyZRtiYtYP9.m3u8?expiration=1466688540&amp;hmac=cb919c7cbe840ad38f8892f430695245991b19022d3359a68f724754171a5874"/>
<tw:videoVariant bit_rate="320000" content_type="video/mp4" url="https://video.twimg.com/amplify_video/745543706946658305/vid/320x180/JST5dEfLU99QyWle.mp4?expiration=1466688540&amp;hmac=0dc8d5a53cba3228ad6b01d766bf0ad0b8c8504b9cba5db93dd62e379cdad9dc"/>
<tw:videoVariant content_type="application/dash+xml" url="https://video.twimg.com/amplify_video/745543706946658305/pl/st7wblyZRtiYtYP9.mpd?expiration=1466688540&amp;hmac=74a2b83bdc0020957b7d8603a66ae514425e25c05b546108d7667fe7345afbfb"/>
<tw:videoVariant bit_rate="2176000" content_type="video/mp4" url="https://video.twimg.com/amplify_video/745543706946658305/vid/1280x720/U7ucLbF_u4E8CYBQ.mp4?expiration=1466688540&amp;hmac=5207d3904cb34b9fc21a584e2f47247e6e0f9a97cacb0ae5721b5f1fd9167916"/>
<tw:videoVariant bit_rate="832000" content_type="video/mp4" url="https://video.twimg.com/amplify_video/745543706946658305/vid/640x360/Zopai0yZTfHhyq6W.mp4?expiration=1466688540&amp;hmac=fd736bdd53b487f2a881b583cd2e39610365d82970a9a0ed6c695c5eb44476b2"/>
</tw:videoVariants>
</tw:content>
</tw:amplify>
</vmap:Extension>
</vmap:Extensions>
<!-- We only support linear start (preroll) for now -->
<vmap:AdBreak breakId="preroll3" breakType="linear" timeOffset="start">
<vmap:AdSource allowMultipleAds="false" followRedirects="false" id="0">
<vmap:VASTData>
<VAST>
</VAST>
</vmap:VASTData>
</vmap:AdSource>
</vmap:AdBreak>
</vmap:VMAP>

So we just need to pull the url from that:

from bs4 import BeautifulSoup
import requests

url = "https://amp.twimg.com/v/a693e53f-a6a3-4ff1-b06e-7c5402db0e06"
resp = requests.get(url).content
soup = BeautifulSoup(resp, 'lxml')

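# The meta tag's content attribute holds the URL of the VMAP XML document.
# The attribute value is quoted because it contains colons.
vmap_url = soup.select_one('meta[name="twitter:amplify:vmap"]')["content"]
# Fetch the VMAP document and parse it as XML (the "xml" parser requires lxml).
soup2 = BeautifulSoup(requests.get(vmap_url).content, "xml")

# The call-to-action link is the url attribute of the cta_watch_now element.
print(soup2.find("cta_watch_now")["url"])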

That then gives us the link:

https://www.unibet.co.uk/stan/campaign.do?cmpId=1042109&affiliateId=52&affId=5211000020&adID=LINC_E2_T9&unibetTarget=/luckisnocoincidence
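If you want to try the two steps without hitting the network, here is a minimal sketch that uses only the standard library instead of BeautifulSoup. The helper names (find_vmap_url, extract_cta_url) and the sample HTML/XML strings are made up for illustration; only the meta tag name, the tw namespace URI and the cta_watch_now element come from the page and the XML shown above. It also returns None rather than raising when the meta tag or the CTA element is missing, which may be the case for videos without a button.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace URI copied from the xmlns:tw declaration in the VMAP document.
TW_NS = {"tw": "http://twitter.com/schema/videoVMapV2.xsd"}


class VmapMetaParser(HTMLParser):
    """Remembers the content of <meta name="twitter:amplify:vmap">."""

    def __init__(self):
        super().__init__()
        self.vmap_url = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name") == "twitter:amplify:vmap":
            self.vmap_url = attr_map.get("content")


def find_vmap_url(html_text):
    """Return the VMAP URL from the player page HTML, or None if absent."""
    parser = VmapMetaParser()
    parser.feed(html_text)
    return parser.vmap_url


def extract_cta_url(vmap_xml):
    """Return the url attribute of tw:cta_watch_now, or None if absent."""
    root = ET.fromstring(vmap_xml)
    node = root.find(".//tw:cta_watch_now", TW_NS)
    return None if node is None else node.get("url")


# Hypothetical, trimmed-down inputs standing in for the real responses.
SAMPLE_HTML = (
    '<html><head>'
    '<meta name="twitter:amplify:vmap" content="https://example.com/vmap.xml">'
    '</head></html>'
)
SAMPLE_VMAP = (
    '<vmap:VMAP xmlns:tw="http://twitter.com/schema/videoVMapV2.xsd" '
    'xmlns:vmap="http://www.iab.net/vmap-1.0">'
    '<vmap:Extensions><vmap:Extension><tw:amplify><tw:content>'
    '<tw:cta_watch_now url="https://example.com/landing?a=1&amp;b=2"/>'
    '</tw:content></tw:amplify></vmap:Extension></vmap:Extensions>'
    '</vmap:VMAP>'
)

print(find_vmap_url(SAMPLE_HTML))    # https://example.com/vmap.xml
print(extract_cta_url(SAMPLE_VMAP))  # https://example.com/landing?a=1&b=2
```

With real data you would pass requests.get(url).text to find_vmap_url and requests.get(vmap_url).content to extract_cta_url. Note that the XML parser turns &amp; back into &, which is why the printed link differs from the raw XML.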

2 Comments

Perfect, this really helps! Is this a "standard" way of dynamically calling embedded media content, e.g. would it work similarly on Facebook?
No worries. Unfortunately, pretty much every site is different, so you would need to monitor the requests to see exactly what is happening. Chrome tools or Firebug are essential when it comes to scraping: if you open Chrome tools and look at the XHR tab under the Network tab, you can see the GET request.
