I'm parsing a URL in Python. Below are a sample URL and my code. I want to split the part number (74743) out of the URL and write a for loop that takes that number from a list of parts instead. I tried urlparse but couldn't finish, mostly because parts of the URL keep changing. I just want the easiest and fastest way to do this.

Sample URL:

http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is=

(http://example.com/wps/portal) Always fixed

(lYuxDoIwGAYf6f9aqKSjMNQ) Always changing

(74743) Will be taken from a list named Parts

(IntNumberOf=&is=) Also changing depending on the section of the website

Here's the code:

from lxml import html
import requests
import urlparse


Parts = [74743, 85731, 93021]

url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='

parsing = urlparse.urlsplit(url)

print parsing
  • In what way couldn't you 'complete it to the end'? Commented Oct 18, 2015 at 22:09
  • I just want the changing parts of the URL to be ignored, and the number (74743) to be taken from Parts. Commented Oct 18, 2015 at 22:40
  • Yes, but why couldn't you complete it? Commented Oct 18, 2015 at 22:42

1 Answer

>>> import urlparse

>>> url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='

>>> split_url = urlparse.urlsplit(url)
>>> split_url.path
'/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/'

You can split the path into a list of strings using '/', slice the list, and re-join:

>>> path = split_url.path
>>> path.split('/')
['', 'wps', 'portal', 'lYuxDoIwGAYf6f9aqKSjMNQ', '']

Slice off the last two:

>>> path.split('/')[:-2]
['', 'wps', 'portal']

And re-join:

>>> '/'.join(path.split('/')[:-2])
'/wps/portal'
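The same split, slice, and re-join idea can be sketched for Python 3, where the module lives at urllib.parse instead of urlparse. This is only a sketch: the helper name fixed_prefix and its keep parameter are made up for illustration, and it assumes the fixed part of the path is always its first two segments. Dropping empty segments first means it does not matter how many changing segments follow or whether the path ends in a slash.

```python
from urllib.parse import urlsplit


def fixed_prefix(url, keep=2):
    """Return the first `keep` non-empty path segments as a path string.

    `fixed_prefix` and `keep` are hypothetical names, not part of urllib.
    """
    # Drop the empty strings produced by leading and trailing slashes.
    segments = [s for s in urlsplit(url).path.split('/') if s]
    return '/' + '/'.join(segments[:keep])


url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
print(fixed_prefix(url))  # /wps/portal
```

Counting from the front rather than slicing from the end keeps the result stable when the changing part spans several segments.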

To parse the query, use parse_qs:

>>> urlparse.parse_qs(split_url.query)
{'PartNo': ['74743']}

To keep the empty parameters use keep_blank_values=True:

>>> query = urlparse.parse_qs(split_url.query, keep_blank_values=True)
>>> query
{'PartNo': ['74743'], 'is': [''], 'IntNumberOf': ['']}
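For reference, here is a minimal Python 3 sketch of the same query parsing, assuming urllib.parse in place of the Python 2 urlparse module. It shows that every value comes back as a list of strings, so the part number has to be taken from index 0 and converted if an integer is wanted.

```python
from urllib.parse import urlsplit, parse_qs

url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='
raw_query = urlsplit(url).query

# Blank parameters are dropped by default.
without_blanks = parse_qs(raw_query)

# keep_blank_values=True retains them with an empty-string value.
with_blanks = parse_qs(raw_query, keep_blank_values=True)

# Each value is a list of strings, so index into it before converting.
part_no = int(with_blanks['PartNo'][0])
print(part_no)  # 74743
```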

You can then modify the query dictionary:

>>> query['PartNo'] = 85731

And update the original split_url:

>>> import urllib
>>> updated = split_url._replace(path='/'.join(path.split('/')[:-2] +
...                                            ['ASDFZXCVQWER', '']),
...                              query=urllib.urlencode(query, doseq=True))

>>> urlparse.urlunsplit(updated)
'http://example.com/wps/portal/ASDFZXCVQWER/?PartNo=85731&IntNumberOf=&is='
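To tie this back to the loop over Parts from the question, the steps above could be wrapped in a small function. This is a sketch under a few assumptions: it targets Python 3 (urllib.parse), the helper name url_for_part is made up for illustration, and the changing path segment is left exactly as it appears in the template URL, since the question does not say where a new one would come from.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def url_for_part(url, part_no):
    """Return `url` with its PartNo query parameter set to `part_no`.

    `url_for_part` is a hypothetical helper, not a library function.
    """
    split_url = urlsplit(url)
    # Keep blank parameters such as IntNumberOf= and is= in the result.
    query = parse_qs(split_url.query, keep_blank_values=True)
    query['PartNo'] = [str(part_no)]
    # doseq=True writes each list element as its own key=value pair.
    new_query = urlencode(query, doseq=True)
    return urlunsplit(split_url._replace(query=new_query))


parts = [74743, 85731, 93021]
url = 'http://example.com/wps/portal/lYuxDoIwGAYf6f9aqKSjMNQ/?PartNo=74743&IntNumberOf=&is='

urls = [url_for_part(url, part) for part in parts]
for u in urls:
    print(u)
```

On Python 3.7 and later, dictionaries preserve insertion order, so the parameters come out in the same order as in the original URL.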

13 Comments

For the base_path, what if I have more than two '/', like /wps/portal/ut/p/c1/lYuxDoIwGAYf6f9aqKSjMNQ/? How can I deal with it?
@T.M What url? Have you tried the code? If you have another question, ask a new question. Read how to ask first, particularly the section on how to create a Minimal, Complete, Verifiable Example.
Sorry, my computer got jammed. Thanks, I appreciate it. But with this URL: 'example.com/wps/portal/!ut/p/c1/…' the base_path gives me nothing, and "updated" gives me an "Invalid syntax" error.
Apologies, I was using os.path.basename without thinking. I've replaced with an example using str.split.
Thanks, the first part works very well, but "updated" throws a traceback: Traceback (most recent call last): File "solving_url_issue2.py", line 41, in <module> updated = split_url._update(path='/'.join(base_path.split('/')[:-2] + AttributeError: 'SplitResult' object has no attribute '_update'. I tried to find a solution for it but couldn't find any.