Scrape Twitter Embedded URL via Python

Question

I'm currently trying to extract URLs embedded in a call-to-action button within videos on Twitter. An example:

When utilising Chrome Inspect, I can relatively easily spot what I'm after:

Now I'm trying to scrape that highlighted link in Python. I couldn't find any way to get it from the Twitter API, so I switched to BeautifulSoup. But when I search for all links, it isn't among them:

In[23]: url = "https://amp.twimg.com/v/a693e53f-a6a3-4ff1-b06e-7c5402db0e06"
In[24]: resp = requests.get(url).content 
In[25]: soup = BeautifulSoup(resp, 'lxml') 
In[26]: soup.find_all('a')
Out[26]: 
[<a href="https://twitter.com/unibet" target="_blank">@unibet</a>,
<a class="download-btn" id="app-download"><img id="whiteLogo"      
src="https://amp.twimg.com/amplify-web-player/prod/styles/img/twitter_logo_white.png"/></a>]

Any idea what I could do to extract that embedded URL? Any help is much appreciated!

Padraic Cunningham · Accepted Answer · 2016-06-23 12:29:45Z

The data is created dynamically via an AJAX request. You can pull the URL of the XML from the original page's meta tag with name="twitter:amplify:vmap", then request that data, which is XML like:

<?xml version="1.0" encoding="utf-8"?>
<vmap:VMAP xmlns:esi="http://www.edge-delivery.org/esi/1.0" xmlns:tw="http://twitter.com/schema/videoVMapV2.xsd" xmlns:vmap="http://www.iab.net/vmap-1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="vast3.xsd">
<vmap:Extensions>
<vmap:Extension>
<tw:amplify>
<tw:content contentId="745543706946658305" ownerId="143820595" stitched="false">
<tw:cta_watch_now url="https://www.unibet.co.uk/stan/campaign.do?cmpId=1042109&amp;affiliateId=52&amp;affId=5211000020&amp;adID=LINC_E2_T9&amp;unibetTarget=/luckisnocoincidence"/>
<MediaFiles>
<MediaFile>
              http://amp.twimg.com/prod/multibr_v_1/video/2016/06/22/09/745543706946658305-libx264-main-2028k.mp4?5LiXscTGA2BYvqh2cKP8uTkru1N%2Fj8exRYhB9PbbFpM%3D
            </MediaFile>
</MediaFiles>
<tw:videoVariants>
<tw:videoVariant content_type="application/x-mpegURL" url="https://video.twimg.com/amplify_video/745543706946658305/pl/st7wblyZRtiYtYP9.m3u8?expiration=1466688540&amp;hmac=cb919c7cbe840ad38f8892f430695245991b19022d3359a68f724754171a5874"/>
<tw:videoVariant bit_rate="320000" content_type="video/mp4" url="https://video.twimg.com/amplify_video/745543706946658305/vid/320x180/JST5dEfLU99QyWle.mp4?expiration=1466688540&amp;hmac=0dc8d5a53cba3228ad6b01d766bf0ad0b8c8504b9cba5db93dd62e379cdad9dc"/>
<tw:videoVariant content_type="application/dash+xml" url="https://video.twimg.com/amplify_video/745543706946658305/pl/st7wblyZRtiYtYP9.mpd?expiration=1466688540&amp;hmac=74a2b83bdc0020957b7d8603a66ae514425e25c05b546108d7667fe7345afbfb"/>
<tw:videoVariant bit_rate="2176000" content_type="video/mp4" url="https://video.twimg.com/amplify_video/745543706946658305/vid/1280x720/U7ucLbF_u4E8CYBQ.mp4?expiration=1466688540&amp;hmac=5207d3904cb34b9fc21a584e2f47247e6e0f9a97cacb0ae5721b5f1fd9167916"/>
<tw:videoVariant bit_rate="832000" content_type="video/mp4" url="https://video.twimg.com/amplify_video/745543706946658305/vid/640x360/Zopai0yZTfHhyq6W.mp4?expiration=1466688540&amp;hmac=fd736bdd53b487f2a881b583cd2e39610365d82970a9a0ed6c695c5eb44476b2"/>
</tw:videoVariants>
</tw:content>
</tw:amplify>
</vmap:Extension>
</vmap:Extensions>
<!-- We only support linear start (preroll) for now -->
<vmap:AdBreak breakId="preroll3" breakType="linear" timeOffset="start">
<vmap:AdSource allowMultipleAds="false" followRedirects="false" id="0">
<vmap:VASTData>
<VAST>
</VAST>
</vmap:VASTData>
</vmap:AdSource>
</vmap:AdBreak>
</vmap:VMAP>

So we just need to pull the url from that:

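from bs4 import BeautifulSoup
import requests

url = "https://amp.twimg.com/v/a693e53f-a6a3-4ff1-b06e-7c5402db0e06"
resp = requests.get(url).content
soup = BeautifulSoup(resp, 'lxml')

# The meta tag's content attribute holds the URL of the VMAP XML document.
# The attribute value is quoted in the selector because it contains colons.
xml = soup.select_one('meta[name="twitter:amplify:vmap"]')["content"]
# Fetch the XML and parse it with the "xml" parser (requires lxml).
soup2 = BeautifulSoup(requests.get(xml).content, "xml")

print(soup2.find("cta_watch_now")["url"])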

That then gives us the link:

https://www.unibet.co.uk/stan/campaign.do?cmpId=1042109&affiliateId=52&affId=5211000020&adID=LINC_E2_T9&unibetTarget=/luckisnocoincidence
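
If you would rather not depend on BeautifulSoup and lxml, the same two steps can be sketched with the standard library only. This is a rough sketch, not a tested scraper for the live site: the helper names find_vmap_url and find_cta_url, and the SAMPLE_HTML and SAMPLE_VMAP strings with their example.invalid URLs, are made up for illustration. Only the meta tag name and the tw namespace URI come from the page and the XML shown above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns:tw declaration in the VMAP document.
TW_NS = "http://twitter.com/schema/videoVMapV2.xsd"


class _VmapMetaFinder(HTMLParser):
    """Remembers the content of <meta name="twitter:amplify:vmap">."""

    def __init__(self):
        super().__init__()
        self.vmap_url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "twitter:amplify:vmap":
            self.vmap_url = attrs.get("content")


def find_vmap_url(html):
    """Return the VMAP URL from the page's meta tag, or None if missing."""
    finder = _VmapMetaFinder()
    finder.feed(html)
    return finder.vmap_url


def find_cta_url(vmap_xml):
    """Return the url attribute of tw:cta_watch_now, or None if absent."""
    root = ET.fromstring(vmap_xml)
    node = root.find(f".//{{{TW_NS}}}cta_watch_now")
    return None if node is None else node.get("url")


# Hypothetical stand-ins for the two HTTP responses (not real Twitter data).
SAMPLE_HTML = (
    '<html><head><meta name="twitter:amplify:vmap" '
    'content="https://example.invalid/vmap.xml"/></head></html>'
)
SAMPLE_VMAP = """<vmap:VMAP xmlns:tw="http://twitter.com/schema/videoVMapV2.xsd"
           xmlns:vmap="http://www.iab.net/vmap-1.0">
  <vmap:Extensions><vmap:Extension><tw:amplify><tw:content>
    <tw:cta_watch_now url="https://example.invalid/landing?a=1&amp;b=2"/>
  </tw:content></tw:amplify></vmap:Extension></vmap:Extensions>
</vmap:VMAP>"""

print(find_vmap_url(SAMPLE_HTML))
print(find_cta_url(SAMPLE_VMAP))
```

Against the real page you would pass requests.get(url).text to find_vmap_url and the body of the second response to find_cta_url. Both helpers return None instead of raising when the tag or element is missing, which makes it easier to spot a page that has no call-to-action button. The XML parser also turns &amp; back into & for you, which is why the printed link is directly usable.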

Perfect, this really helps! Is this a "standard" way of dynamically calling embedded media content, e.g. would it work similarly on Facebook?
No worries. Unfortunately, pretty much every site is different, so you would need to monitor the requests to see exactly what is happening. Chrome DevTools or Firebug are essential when it comes to scraping: if you open Chrome DevTools and look at the XHR filter under the Network tab, you can see the GET request.
