
I have a list of 200k urls, with the general format of:

http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....

The number of / before and after the-headline-of-the-article varies

Here is some sample data:

'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
 'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
 'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
 'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
 'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
 'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
 'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
 'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
 'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
 'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
 'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
 'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
 'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',

I want to extract the-headline-of-the-article only.

i.e.

call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story

I am sure this is possible, but am relatively new with regex in python.

In pseudocode, I was thinking:

  • split everything by /

  • keep only the chunk that contains -

  • replace all - with \s

Is this possible in python (I am a python n00b)?
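The three pseudocode steps above can be sketched directly in Python (a minimal sketch on the first sample URL; note that more than one chunk can contain a hyphen):

```python
url = 'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/'

# Step 1: split everything by '/'
chunks = url.split('/')

# Step 2: keep only the chunks that contain '-'
# (two chunks match here: 'national-news' and the headline)
hyphenated = [c for c in chunks if '-' in c]

# Step 3: replace all '-' with a space
headlines = [c.replace('-', ' ') for c in hyphenated]
print(headlines)
# ['national news', 'call to end affordable care act is immoral says cha president']
```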

3 Comments
  • You have a good algorithm. Why don't you go ahead and implement it? Commented Apr 5, 2019 at 9:15

  • I don't think your algo would be good enough to return the correct segment in the tucson sample. You might need to extract words from each path segment and return the words of the segment with the most words that can be found in an English dictionary. Commented Apr 5, 2019 at 9:26

  • The first and third urls deviate from the pattern the rest follow. Commented Apr 5, 2019 at 9:29

3 Answers

urls = [...]  # your list of 200k urls

for url in urls:
    bits = url.split('/')  # split each url at the '/'
    # keep every chunk that contains a hyphen, replacing hyphens with spaces [1]
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]
    print(bits_with_hyphens)

[1] Note that your algorithm assumes that only one of the fragments after splitting the url will contain a hyphen, which is not correct given your examples. So at [1], I keep all the bits that contain one.

Output:

['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']

PS. I think your algorithm could do with a bit of thought. Problems that I see:

  • more than one bit might contain a hyphen, where:
    • both only contain dictionary words (see first and fourth output)
    • one of them is "clearly" not a headline (see second and third from bottom)
  • spurious string fragments at the end of the real headline: eg "13721842.php", "revenues.asp", "210002719.html"
  • Need to substitute a space for separator characters other than '-' (see fourth output, "General+News")
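For the last point, one option is to normalize the extra separator characters within each chunk before replacing the hyphens (a sketch on the bristolpress URL; '+' is the only extra separator in the samples, so the list may need extending for real data):

```python
url = 'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain'

# Turn '+' into a space as well as '-', within each hyphenated chunk
bits = [bit.replace('+', ' ').replace('-', ' ')
        for bit in url.split('/') if '-' in bit]
print(bits)
# ['BP General News', 'female music art to take center stage at swan day in new britain']
```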

3 Comments

national news is not a part of headlines!
As the answer states, that's on purpose. You can't know in the general case, can you? Though it's probably feasible to come up with a heuristic e.g. to dump any extracted piece with just one hyphen if there are also extracted pieces with more than one.
Is there a way to keep the bit with the greatest number of hyphens, i.e. take the max count? I assume there is a count function for strings in Python.
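Strings do have a `count` method, so the idea in the last comment could look like this (a sketch on the tucson URL; ties, and non-headline chunks that happen to contain more hyphens, would still trip it up):

```python
url = 'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html'

# Keep the path segment with the most hyphens
# (the headline has 5 here; the article_... segment only 4)
best = max(url.split('/'), key=lambda bit: bit.count('-'))
print(best.replace('-', ' '))
# correction trump investigations sater lawsuit story
```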

Here's a slightly different variation which seems to produce good results from the samples you provided.

Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.

import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

for url in urls:
    parts = url.split('/')
    # trim trailing hex runs and file extensions off every hyphenated part
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    # keep the part with the most dashes
    longest = max(trimmed, key=lambda x: x.count('-'))
    print(longest.replace('-', ' '))

Output:

call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision

My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.

2 Comments

Closest solution. I had thought along the same lines, but there are examples where the longest string after the split is NOT the headline bit. I was thinking of adding a count for the chunk with the max count of '-', on the assumption that a headline will have more than three hyphens, while a non-headline chunk occasionally has two.
From your samples it also looks like the last one after the removal of long hex or number string sequences could also usually be the correct one. Without more samples, this is speculative, of course.

The URLs do not follow a consistent pattern; in particular, the first and third URLs are patterned differently from the rest.

Using rsplit():

s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
 'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
 'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
 'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
 'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
 'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
 'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
 'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
 'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
 'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
 'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
 'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
 'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']



for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':           # trailing slash: 1st and 3rd urls
        if url.rsplit('/', 2)[1].isdigit():   # numeric final segment: 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])          # all other urls

OUTPUT:

call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision

2 Comments

Hardcoding a special case is probably not a good idea if this is intended to cope with additional variations which are not exhibited in the samples the OP provided.
@tripleee Indeed, I won't stretch the current approach. Will add an alternative instead.
