I am currently trying to scrape some data from a webpage. The data I need is inside a <meta> tag in the HTML source. Scraping the data and saving it to a string with BeautifulSoup is no problem.

The string contains two numbers I want to extract. Each of those numbers (review scores from 1-100) should be assigned to its own variable for further processing.

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

The first value is 79/100 and the second is 86/100, but I only need 79 and 86. So far I have created a regex search to find those values and then used .replace("/100", "") to clean things up.

But with my code, I only get the value for the first regex match, which is 79. I tried getting the second value with m.group(1), but it doesn't work.

What am I missing?

import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

m = re.search("../100", test_str)
if m:
    found = m.group(0).replace("/100", "")
    print(found)

    # output -> 79

Thanks for your help.

Best regards!

3
  • re.findall returns a list of all matches Commented May 21, 2017 at 10:33
  • Are you scraping the web page and then taking the entire HTML source and applying regex to it? I'm asking because your code sample shows no beautifulsoup-related code. Commented May 21, 2017 at 10:35
  • 1
    Thanks! @Tomalak No, I just save the data in a string using meta_description = soup.find("meta", {"name": "rating-data"}). I left out the BeautifulSoup part to keep things simple. Commented May 21, 2017 at 10:49

2 Answers

3
import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"
# Raw string so \d is passed to the regex engine unchanged
m = re.findall(r'\d+(?=/100)', test_str)
# m = ['79', '86']

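I replaced .. with \d+ so the pattern matches one or more digits, which covers scores such as 5 or 100 as well as two-digit scores.

I also use a positive lookahead (?=...), so /100 has to follow the digits but is not part of the match, and the .replace becomes unnecessary.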

Example at Regex101
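
To tie this back to the question, here is a minimal sketch that converts the matches to integers and assigns them to two variables. The names overall and score are placeholders of mine, not from the original post, and the unpacking assumes the string contains exactly two scores. The try/except at the end illustrates why m.group(1) failed: group(1) refers to the first capture group, not the second match, and the pattern "../100" has no capture groups.

```python
import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

# findall returns every non-overlapping match as a list of strings
values = [int(v) for v in re.findall(r"\d+(?=/100)", test_str)]

# Unpack into two variables (assumes exactly two matches were found)
overall, score = values

# re.search stops at the first match, and group(1) asks for a capture
# group that the pattern "../100" does not define, so it raises IndexError
m = re.search("../100", test_str)
try:
    m.group(1)
    group_error = None
except IndexError as exc:
    group_error = str(exc)
```

If the number of scores can vary, it is probably safer to keep the list and check its length before unpacking.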

1 Comment

Np glad I could help :)
3

I don't know why most people are not suggesting named capture groups.

You can do something like the following; the syntax might not be perfect.

import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

# Each named group captures only the digits in front of "/100"
pattern = r"Overall Rating: (?P<rating>\d+)/100.*?Score: (?P<score>\d+)/100"

match = re.search(pattern, test_str)
if match:
    rating = match.group('rating')  # '79'
    score = match.group('score')    # '86'
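
As a possible follow-up, the named groups can be collected in one step with groupdict() and converted to integers. This is only a sketch: named_pattern and scores are names I made up for illustration, and the pattern assumes the labels "Overall Rating:" and "Score:" appear exactly as in the sample string.

```python
import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

# Hypothetical pattern: only the digits sit inside the named groups
named_pattern = re.compile(
    r"Overall Rating: (?P<rating>\d+)/100.*?Score: (?P<score>\d+)/100"
)

match = named_pattern.search(test_str)

# groupdict() maps each group name to the text it matched;
# fall back to an empty dict when the pattern does not match
scores = (
    {name: int(text) for name, text in match.groupdict().items()}
    if match
    else {}
)
```

The dictionary keeps each number attached to its label, which may be easier to work with than positional results when more fields are added later.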
