I am trying to extract URLs from a text file that contains the source code of a web page. I want to get the links inside the href attributes. I adapted some code from Stack Overflow, but I can't get it to work.

with open(sourcecode.txt) as f:
    urls = f.readlines()

urls = ([s.strip('\n') for s in urls ]) 

print(url)

  • It also gives an error: "source is not defined". Commented Jun 21, 2018 at 21:00
  • You should probably have a look at an HTML parsing library such as BeautifulSoup. Commented Jun 21, 2018 at 21:01

2 Answers

Using a regular expression, you can extract all URLs from the text file in one pass, without looping over it line by line:

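import re

# Read the whole file at once so the regex can scan it in a single pass.
with open('/home/username/Downloads/Stack_Overflow.html') as f:
    urls = f.read()

# The pattern has two capturing groups, so findall returns tuples.
links = re.findall(r'"((http)s?://.*?)"', urls)
for url in links:
    print(url[0])  # first tuple element is the full URL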
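
To illustrate why the loop prints url[0], here is a minimal sketch run on an in-memory string instead of a file. The sample HTML and the variable names are hypothetical, and the single-group pattern is a variant, not part of the answer's original code.

```python
import re

# Hypothetical sample standing in for the file contents.
sample = '<a href="https://example.com/a">A</a> <img src="http://example.org/b.png">'

# Answer's pattern: two capturing groups, so each match is a (url, 'http') tuple.
with_groups = re.findall(r'"((http)s?://.*?)"', sample)

# Variant with a single capturing group: each match is just the URL string.
single_group = re.findall(r'"(https?://.*?)"', sample)

print(with_groups)
print(single_group)
```

Note that both patterns match any double-quoted http(s) URL, not only those inside href attributes, so image and script sources are picked up as well.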

You can use regular expressions for this.

import re

with open('sourcecode.txt') as f:
    text = f.read()

href_regex = r'href=[\'"]?([^\'" >]+)'
urls = re.findall(href_regex, text)

print(urls)

You're probably getting an error like NameError: name 'sourcecode' is not defined. This is because the argument you pass to open() needs to be a string, such as 'sourcecode.txt' (see above).
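
As a comment on the question suggests, an HTML parser tends to be more robust than a regex for this. Below is a minimal sketch that uses the standard library's html.parser rather than BeautifulSoup, only to avoid a third-party dependency. The names HrefCollector and extract_hrefs and the sample HTML are hypothetical, and the sketch collects href values from <a> tags only.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the href values found in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical sample; in practice, pass the text read from sourcecode.txt.
sample = '<p><a href="https://example.com/">home</a> <a href="/about">about</a></p>'
print(extract_hrefs(sample))
```

Unlike the regex, this approach ignores text that merely looks like an href attribute, for example inside a comment or a script string.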

2 Comments

NameError: name 're' is not defined
re is the regular expression module; it's part of the standard library. Add import re at the top of your script.
