I want to capture the text from the link below and save it: http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=CI&version=44&glossary=0

I only need to save the text after .A, so I do not need the rest of the page. Moreover, there are 50 different links at the top of the page, and I want to get the data from all of them.

I have written the code below, but it returns nothing. How can I get specifically the part that I need?

import urllib
import re
htmlfile=urllib.urlopen("http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=CI&version=1&glossary=0")
htmltext=htmlfile.read()
regex='<pre class="glossaryProduct">(.+?)</pre>'
pattern=re.compile(regex)
out=re.findall(pattern, htmltext)
print (out)

I also tried the following, which returns the entire content of the page:

import urllib
file1 = urllib.urlopen('http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=txt&version=1&glossary=0')
s1 = file1.read()
print(s1)

Can you help me do this?

  • Heed one of the commandments of modern programming: do not regex X/HTML content. (Commented Feb 27, 2017 at 19:07)

1 Answer

Your regex is not capturing anything because the content inside the <pre> tag starts with a newline and spans several lines, and by default . does not match newline characters. If you change your compile line to

pattern=re.compile(regex,re.S)

it should work.
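To see the difference the flag makes without downloading anything, here is a minimal sketch. The sample_html string is a made-up stand-in for the real page (the station lines in it are invented for illustration), and the code is written for Python 3 rather than the Python 2 used in the question.

```python
import re

# Made-up stand-in for the downloaded page; the real page is much longer
# and these ".A" lines are invented for illustration only.
sample_html = (
    '<pre class="glossaryProduct">\n'
    '.A STATION1 EXAMPLE LINE ONE\n'
    '.A STATION2 EXAMPLE LINE TWO\n'
    '</pre>'
)

regex = '<pre class="glossaryProduct">(.+?)</pre>'

# Without re.S, "." cannot match the newline right after the opening tag,
# so the pattern never reaches </pre> and nothing is captured.
without_flag = re.findall(regex, sample_html)

# With re.S (also spelled re.DOTALL), "." matches newlines as well.
with_flag = re.findall(regex, sample_html, re.S)

print(without_flag)
print(with_flag)
```

The first print shows an empty list, which is the "returns nothing" behaviour from the question. The second shows a single captured string containing the whole block.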

Also you may want to look at:

https://regex101.com

It shows you EXACTLY what your regex is doing. When I enabled the s flag in the flags box on the right side, the pattern started matching exactly as it should:

Image of regex working with the S flag
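For the other two parts of the question (keeping only the .A text and covering all 50 links), a rough sketch follows. It rests on two assumptions that are not confirmed by the page itself: that "the text after .A" means the lines inside the <pre> block that begin with .A, and that the 50 links correspond to version=1 through version=50 in the URL. The helper names (extract_a_lines, version_urls, save_all) and the output file name are hypothetical, and the code targets Python 3.

```python
import re
from urllib.request import urlopen  # Python 3; the question used Python 2's urllib

# Assumption: only the version number differs between the 50 links.
BASE_URL = ("http://forecast.weather.gov/product.php"
            "?site=NWS&issuedby=FWD&product=RR5&format=CI&version={}&glossary=0")

PRE_PATTERN = re.compile(r'<pre class="glossaryProduct">(.+?)</pre>', re.S)


def extract_a_lines(html):
    """Return the lines inside the <pre> block that start with '.A'."""
    lines = []
    for block in PRE_PATTERN.findall(html):
        for line in block.splitlines():
            # Assumption: the wanted records are the lines beginning with ".A".
            if line.startswith(".A"):
                lines.append(line)
    return lines


def version_urls(count=50):
    """Build one URL per version, numbered 1..count."""
    return [BASE_URL.format(version) for version in range(1, count + 1)]


def save_all(path="rr5_a_lines.txt", count=50):
    """Download every version and write its .A lines to one text file."""
    with open(path, "w") as out:
        for url in version_urls(count):
            html = urlopen(url).read().decode("utf-8", errors="replace")
            for line in extract_a_lines(html):
                out.write(line + "\n")
```

Calling save_all() performs the downloads. extract_a_lines can be tried first on a page you have already fetched, to check that the .A assumption matches what you actually need.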

1 Comment

Thank you. I will check it.
