Find all text in HTML web scrape after a specific string (python3)

Question

I have scraped a website in order to find trending TV shows.

My HTML output looks something like this (obviously much longer, but shortened for brevity's sake):

<span class="btn-utility-container"><a class="btn-utility btn-watchlist " 
data-button="watchlist" data-id="299830" data-name="American Pickers"

I want to find and extract the text that comes after data-name=" and ends at the next "

So in this case, the output would be: American Pickers (without the quotation marks)

For reference, here is my code, which does not work:

import re
import requests

wikis = ["http://www.tvguide.com/trending-tonight/"]
for wiki in wikis:
    website = requests.get(wiki)
    caps = re.findall(b'data-name=">(.|\n)*?<\/">', website.content)  # Relevant line (finds nothing)

Apologies, I was playing around with different code and left it in. Removed. (user3682157, commented Oct 21, 2015 at 20:33)

OneCricketeer · Accepted Answer · 2015-10-21 20:46:37Z


If you just want the name of the show in data-name, you can grab it like this:

caps = re.findall(b'data-name="(.*?)"', website.content)

Or this, if you are not a fan of dot-star:

caps = re.findall(b'data-name="([^"]*?)"', website.content)
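As a quick check, here is a minimal sketch that runs both patterns against the snippet from the question. The `html` variable is a stand-in for `website.content`, so no network request is needed; the variable names are only for illustration.

```python
import re

# Stand-in for website.content: the markup from the question, as bytes.
html = (
    b'<span class="btn-utility-container"><a class="btn-utility btn-watchlist " \n'
    b'data-button="watchlist" data-id="299830" data-name="American Pickers"'
)

# Lazy dot-star: capture as little as possible up to the next double quote.
lazy = re.findall(rb'data-name="(.*?)"', html)

# Negated character class: capture every character that is not a double quote.
negated = re.findall(rb'data-name="([^"]*)"', html)

print(lazy)
print(negated)
```

Both calls return a one-element list holding the bytes value `American Pickers`. One difference worth knowing: without `re.DOTALL`, `.` does not match a newline, so the dot-star version skips an attribute value that spans several lines, while the negated class still captures it.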

edited Oct 21, 2015 at 20:46

answered Oct 21, 2015 at 20:37


3 Comments

user3682157 Over a year ago

This absolutely works, so thank you! Is there a way to remove the b (bytes) prefix attached to each title in the output?

OneCricketeer Over a year ago

Great! Maybe remove the b in front of the 'data-name=...'? I left it in because that was in your initial code.

user3682157 Over a year ago

Figured it out: y = [x.decode('utf-8') for x in caps]
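To spell out the two options from these comments, here is a small sketch. It assumes the page is UTF-8 encoded, and `content` is a stand-in for `website.content`. Note that dropping the `b` from the pattern only works if the input is also a str; with requests, `website.text` provides the decoded body.

```python
import re

# Stand-in for website.content (requests returns the raw body as bytes).
content = b'data-button="watchlist" data-id="299830" data-name="American Pickers"'

# Option 1: match on bytes, then decode each match (the fix from the comment above).
caps = re.findall(rb'data-name="(.*?)"', content)
titles = [x.decode('utf-8') for x in caps]

# Option 2: decode first, then use a str pattern without the b prefix.
# With requests, website.text would play the role of `text` here.
text = content.decode('utf-8')
titles_from_text = re.findall(r'data-name="(.*?)"', text)

print(titles)
print(titles_from_text)

# Mixing the two (str pattern, bytes input) raises a TypeError.
try:
    re.findall(r'data-name="(.*?)"', content)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Both options give a list of plain str titles. Either is fine; decoding once up front is a little tidier when several patterns are run against the same page.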
