
I have parsed an HTML document containing JavaScript with BeautifulSoup, and have managed to isolate the JavaScript within it and convert it into a string. The JavaScript looks like this:

<script>
    [irrelevant javascript code here]
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    [irrelevant javascript code here]
</script>

I am trying to get a list containing only the URLs from this sources array, which would look like this:

urls = ['http://url.com/folder1/v.html', 
        'http://url.com/folder2/v.html', 
        'http://url.com/folder3/v.html']

The domains are unknown IPs, the folder names are of random length and consist of lowercase letters and numbers, and there are 1-5 URLs in each file (usually 3). All that is constant is that they start with http and end with .html.

I decided to use regular expressions for this (I am quite new to them), and my code looks like this: urls = re.findall(r'http://[^t][^\s"]+', document)

The [^t] is there because the document contains other URLs whose domain names start with t. My problem is that there is also a .jpg URL on the same domain as the URLs I am extracting, and it ends up in the urls array along with the others.

Example:

urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
        'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
        'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
        'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']

How would I go about only fetching the html urls?

3 Answers

2

You can use r'"(http.*?)"' to get the URLs within your text:

>>> s="""<script>
...     [irrelevant javascript code here]
...     sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
...     {file:"http://url.com/folder2/v.html",label:"label2"},
...     {file:"http://url.com/folder3/v.html",label:"label3"}],
...     [irrelevant javascript code here]
... </script>"""

>>> import re
>>> re.findall(r'"(http.*?)"', s, re.MULTILINE | re.DOTALL)
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']

And to keep only the .html entries in a list of URLs, you can use str.endswith:

>>> urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
...         'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
...         'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
...         'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']
>>> 
>>> [i for i in urls if i.endswith('.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html', 
 'http://123.45.67.89/alwefaoewifiasdof224a/v.html', 
 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']

As another general and flexible approach for this kind of task, you can use the fnmatch module:

>>> from fnmatch import fnmatch
>>> [i for i in urls if fnmatch(i,'*.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html', 
 'http://123.45.67.89/alwefaoewifiasdof224a/v.html', 
 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html'] 
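
The extraction and the suffix filter can also be combined in one small helper. This is only a sketch under some assumptions: html_urls is a made-up name, the sample text and its ?t=1 query string are invented for illustration, and the URLs are assumed to always sit between double quotes. It checks the path with urllib.parse.urlparse instead of the raw string, so a URL with a trailing query string would still be recognised as an .html URL.

```python
import re
from urllib.parse import urlparse

def html_urls(text):
    """Return quoted http(s) URLs in text whose path ends with .html."""
    # Candidates: anything between double quotes that starts with http:// or https://
    candidates = re.findall(r'"(https?://[^"\s]+)"', text)
    # Compare the path only, so query strings do not hide the .html suffix
    return [u for u in candidates if urlparse(u).path.endswith('.html')]

# Hypothetical sample text, modelled on the question
sample = '''
sources:[{file:"http://123.45.67.89/abc123/v.html",label:"label1"},
{file:"http://123.45.67.89/def456/v.html?t=1",label:"label2"}],
image:"http://123.45.67.89/i/0123/12345/thumb.jpg",
'''

print(html_urls(sample))
```

The .jpg URL is picked up as a candidate but dropped by the path check.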

0

Something like this?

re.findall(r'http://[^t][^\s"]+\.html', document)
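
A quick way to check this idea is to run it on a small sample. The sketch below assumes the character class is meant to exclude whitespace and double quotes ([^\s"]), and it adds a lookahead for the closing quote so the match has to end right at .html; that lookahead is my addition, not part of the pattern above. The document string and the tracker.example domain are invented for the test.

```python
import re

# Hypothetical sample: two wanted URLs, one .jpg on the same host,
# and one URL whose domain starts with "t"
document = '''
sources:[{file:"http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html",label:"label1"},
{file:"http://123.45.67.89/alwefaoewifiasdof224a/v.html",label:"label2"}],
image:"http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg",
link:"http://tracker.example/page.html"
'''

# [^t]      : first character of the host must not be "t"
# [^\s"]+   : rest of the URL, stopping at whitespace or a double quote
# \.html    : URL must end in .html ...
# (?=")     : ... immediately before the closing quote
pattern = re.compile(r'http://[^t][^\s"]+\.html(?=")')

urls = pattern.findall(document)
print(urls)
```

Neither the .jpg URL nor the URL on the "t" domain ends up in the result.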


0

If the format is always the same, i.e. {file:"url", look for the substring between the quotes that follow {file: like so:

s="""<script>
    [irrelevant javascript code here]
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    [irrelevant javascript code here]
</script>"""


import re

# Raw string avoids invalid-escape warnings; ":" and '"' need no escaping
print(re.findall(r'\{file:"(.*?)"', s))
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']

You could also limit the text you search by splitting once on sources:[ and keeping what follows:

s="""<script>
    [irrelevant javascript code here]
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    [irrelevant javascript code here]
</script>"""

import re

# Search only the text after the first "sources:["
print(re.findall(r'\{file:"(.*?)"', s.split("sources:[", 1)[1]))

This discards everything before the first sources:[, presuming there is no other sources:[ in the document.
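
A variation on the same idea, as a rough sketch: capture only the contents of the sources array with one regex, then pull the file values out of that block. It assumes the array closes at the first ] and that no URL contains a ] itself. The function name source_files and the extra file: and image: lines in the sample are made up for illustration. Unlike indexing the result of split with [1], it returns an empty list instead of raising IndexError when sources:[ is missing.

```python
import re

# Hypothetical sample with extra URLs outside the sources array
s = """<script>
    file:"http://url.com/other/ignored.html",
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    image:"http://url.com/i/thumb.jpg",
</script>"""

def source_files(text):
    """Return the file URLs found inside the first sources:[...] array."""
    # DOTALL lets "." cross newlines, since the array spans several lines
    block = re.search(r'sources:\[(.*?)\]', text, re.DOTALL)
    if block is None:
        return []  # no sources array in this text
    return re.findall(r'\{file:"(.*?)"', block.group(1))

print(source_files(s))
```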
