Extract URL from text file - Python

Question

I am trying to extract URLs from a text file that contains the source code of a website. I want to get the links inside the href attributes. I wrote some code based on a snippet from Stack Overflow, but I can't get it to work.

with open(sourcecode.txt) as f:
    urls = f.readlines()

urls = ([s.strip('\n') for s in urls ]) 

print(url)

You should probably have a look at HTML parsing libraries like BeautifulSoup.
– gmolau, Commented Jun 21, 2018 at 21:01
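
As a rough sketch of the parser-based approach this comment points to: the example below uses the standard library's html.parser module rather than BeautifulSoup, so it runs without any extra install. The names HrefCollector and extract_hrefs and the sample HTML are made up for illustration and are not part of the original post.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Invented sample input; in the question this text would come from f.read()
sample = '<p><a href="https://example.com/a">A</a> <a class="x" href=\'/b\'>B</a></p>'
print(extract_hrefs(sample))
```

A parser only reports real attributes of real tags, so it is less likely than a regular expression to pick up text that merely looks like a link.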

Ereli · Accepted Answer · 2018-06-21 21:05:54Z

3

Using a regular expression, you can extract all URLs from the text file without looping line by line:

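import re

with open('/home/username/Downloads/Stack_Overflow.html') as f:
    text = f.read()

# The pattern has two groups, so findall returns (full_url, 'http') tuples.
links = re.findall(r'"((http)s?://.*?)"', text)
for url in links:
    print(url[0])  # the outer group holds the full URL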
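
The url[0] index is needed because the pattern contains two capture groups, which makes re.findall return one tuple per match. The sketch below shows this on an invented in-memory string, next to a variant with a single group that returns plain strings. It also shows that this pattern matches any double-quoted http(s) URL, not only those inside href.

```python
import re

# Invented sample; note the second URL sits in a src attribute, not href
sample = '<a href="https://example.com/x">x</a> <img src="http://example.org/y.png">'

# Two capture groups: each match becomes a (full_url, 'http') tuple
two_groups = re.findall('"((http)s?://.*?)"', sample)

# One capture group: each match is just the URL string
one_group = re.findall(r'"(https?://.*?)"', sample)

print(two_groups)
print(one_group)
```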


wpercy · Accepted Answer · 2018-06-21 21:57:03Z

0

You can use regular expressions for this.

import re

with open('sourcecode.txt') as f:
    text = f.read()

href_regex = r'href=[\'"]?([^\'" >]+)'
urls = re.findall(href_regex, text)

print(urls)

You're probably getting an error like NameError: name 'sourcecode' is not defined. This is because the argument you pass to open() needs to be a string (see above).
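
To see what the href pattern captures, here is a small sketch that runs it against an invented HTML snippet instead of a file. Because it keys on href= rather than on http://, it should also return relative and unquoted links.

```python
import re

href_regex = r'href=[\'"]?([^\'" >]+)'

# Invented sample covering double-quoted, single-quoted and unquoted values
sample = (
    '<a href="https://example.com/page">one</a>'
    "<a href='/relative/path'>two</a>"
    '<a href=unquoted.html>three</a>'
)

urls = re.findall(href_regex, sample)
print(urls)
```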

edited Jun 21, 2018 at 21:57

answered Jun 21, 2018 at 21:00


2 Comments

tom786 Over a year ago

NameError: name 're' is not defined

wpercy Over a year ago

re is the regular expression module; it's part of the standard library. You need to add import re at the top of your script.
