I'm trying to extract specific text from a set of URLs returned by a page. I'm using Python 2.7 with requests and BeautifulSoup.

I need to find the latest URL, which is the one with the highest number in its reference: for example "DF_7", where 7 is the highest number among the URLs below. That URL will then be downloaded. New files are added each day, which is why I need to check for the one with the highest number.

Once I have found the highest number in the list of URLs, I need to join "https://service.rl360.com/scripts/customer.cgi/SC/servicing/" to the URL with that number. The final result should look like this: https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending

The URLs all look like this, with only the DF_ number incrementing each time.

Is this the right approach? If so, how do I go about doing it?

Thanks

import base
import requests
import zipfile, StringIO, re
from lxml import html
from bs4 import BeautifulSoup

from base import os

from django.conf import settings

# Fill in your details here to be posted to the login form.
payload = {
    'USERNAME': 'xxxxxx',
    'PASSWORD': 'xxxxxx',
    'option': 'login'
}

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

# Use 'with' to ensure the session context is closed after use.

with requests.Session() as s:
    p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)

    # An authorised request.
    r = s.get('https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Folder=DataDownloads&SortField=ExpiryDays&SortOrder=Ascending', stream=True)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    table = soup.find('table')
    links = table.find_all('a')
    print links
  • Have you got any code for this? Commented Aug 14, 2017 at 10:11
  • yes, i will amend my post now Commented Aug 14, 2017 at 10:13
  • Could you add the links your script prints please? Commented Aug 14, 2017 at 10:25
  • <a class="tabletd" href="downloads.php?Reference=DF_7&amp;SortField=ExpiryDays&amp;SortOrder=Ascending"> Commented Aug 14, 2017 at 10:27
  • Is the link you want always the last link on the page with the class tabletd? Commented Aug 14, 2017 at 10:28

1 Answer


You can go straight to the last link with the class "tabletd" and print its href value like this (this assumes the link you want is always the last one on the page):

href = soup.find_all("a", {'class':'tabletd'})[-1]['href']
base = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"
print (base + href)
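
If the newest file is not guaranteed to be the last link, a possible alternative is to compare the DF_ numbers directly. The following is only a sketch under some assumptions: the helper name `latest_download_url`, the constant `BASE_URL`, and the regular expression are made up for illustration, and it assumes every relevant href contains `Reference=DF_<number>` as in the example link from the comments. It takes a plain list of href strings, compares the numbers as integers (so DF_10 ranks above DF_9), and joins the winner to the base URL with `urljoin`.

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

# Base URL taken from the question.
BASE_URL = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"

# Assumed link format: captures the number in e.g. "Reference=DF_7".
DF_PATTERN = re.compile(r"Reference=DF_(\d+)")


def latest_download_url(hrefs, base_url=BASE_URL):
    """Return the absolute URL for the href with the highest DF_ number.

    Returns None when no href matches the assumed pattern.
    """
    best_number, best_href = -1, None
    for href in hrefs:
        match = DF_PATTERN.search(href)
        if match is None:
            continue  # Not a DF_ download link; skip it.
        number = int(match.group(1))  # Compare numerically, not as text.
        if number > best_number:
            best_number, best_href = number, href
    if best_href is None:
        return None
    return urljoin(base_url, best_href)
```

With the `soup` object from the question, the list of hrefs could be built with something like `[a['href'] for a in soup.find_all('a', {'class': 'tabletd'})]` and passed to the helper.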