Extract specific text from URL Python

Question

I'm trying to extract specific text from a list of URLs returned by a page. I'm using Python 2.7 with requests and BeautifulSoup.

I need to find the latest URL, which is the one with the highest number after "DF_" (for example "DF_7", where 7 is the highest number among the URLs below). That URL will then be downloaded. Note that new files are added each day, which is why I need to check for the one with the highest number.

Once I find the highest number in the list of URLs, I need to join the base "https://service.rl360.com/scripts/customer.cgi/SC/servicing/" to the URL with the highest number. The final result should look like this: https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending

The URLs all follow this pattern, with only the DF_ number incrementing each time.

Is this the right approach? If so, how do I go about doing this?

Thanks

import base
import requests
import zipfile, StringIO, re
from lxml import html
from bs4 import BeautifulSoup

from base import os

from django.conf import settings

# Fill in your details here to be posted to the login form.
payload = {
    'USERNAME': 'xxxxxx',
    'PASSWORD': 'xxxxxx',
    'option': 'login'
}

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    # Log in; the session keeps the cookies for the next request.
    p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)

    # An authorised request.
    r = s.get('https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Folder=DataDownloads&SortField=ExpiryDays&SortOrder=Ascending', stream=True)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    table = soup.find('table')
    links = table.find_all('a')
    print links

<a class="tabletd" href="downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending">
– Nic Palvie, Commented Aug 14, 2017 at 10:27
Is the link you want always the last link on the page with the class tabletd?
– Dan-Dev, Commented Aug 14, 2017 at 10:28

Dan-Dev · Accepted Answer · 2017-08-14 11:09:28Z

1

You can go straight to the last link with the class "tabletd" and print its href value like this:

href = soup.find_all("a", {'class':'tabletd'})[-1]['href']
base = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"
print (base + href)
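
If the newest link is not guaranteed to be the last one on the page, a more defensive option is to parse the number after "DF_" out of each href and pick the largest. The following is only a sketch under assumptions: the helper name latest_download_url, the constant names, and the sample hrefs are made up for illustration, and it assumes every relevant href contains "Reference=DF_<number>". The numbers are compared as integers so that DF_10 ranks above DF_7, and urljoin is used instead of string concatenation.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

BASE_URL = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"

# Assumption: the reference always has the form "Reference=DF_<digits>".
DF_PATTERN = re.compile(r"Reference=DF_(\d+)")


def latest_download_url(hrefs, base_url=BASE_URL):
    """Return the absolute URL for the href with the highest DF_ number.

    Returns None when no href matches the pattern.
    """
    numbered = []
    for href in hrefs:
        match = DF_PATTERN.search(href)
        if match:
            # Store the number as an int so 10 sorts above 7.
            numbered.append((int(match.group(1)), href))
    if not numbered:
        return None
    _, best_href = max(numbered)
    return urljoin(base_url, best_href)


# Hypothetical hrefs, shaped like the one shown in the question.
sample_hrefs = [
    "downloads.php?Reference=DF_2&SortField=ExpiryDays&SortOrder=Ascending",
    "downloads.php?Reference=DF_10&SortField=ExpiryDays&SortOrder=Ascending",
    "downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending",
]
print(latest_download_url(sample_hrefs))
```

With the soup object from the question, the list of hrefs could be built with [a['href'] for a in soup.find_all('a', {'class': 'tabletd'})] and passed to the helper.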

edited Aug 14, 2017 at 11:09

answered Aug 14, 2017 at 10:34

Dan-Dev
