I have scraped a website in order to find trending TV shows.

My HTML output looks something like this (obviously much longer, but shortened for brevity's sake):

<span class="btn-utility-container"><a class="btn-utility btn-watchlist " 
data-button="watchlist" data-id="299830" data-name="American Pickers"

I want to find and then extract the data that comes after data-name=" and then ends with the next "

So in this case, the output would be: American Pickers (without the quotation marks)

For reference, here is my code, which does not work:

import re
import requests

wikis = ["http://www.tvguide.com/trending-tonight/"]
for wiki in wikis:
    website = requests.get(wiki)
    caps = re.findall(b'data-name=">(.|\n)*?<\/">', website.content) #Relevant line
  • What are you doing with soup? It isn't used Commented Oct 21, 2015 at 20:32
  • Apologies, was playing around with different code and I left it in -- removed Commented Oct 21, 2015 at 20:33

1 Answer

If you just want the name of the show in data-name, you can grab it like this:

caps = re.findall(b'data-name="(.*?)"', website.content)

Or like this, if you are not a fan of dot-star:

caps = re.findall(b'data-name="([^"]*?)"', website.content)
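To check both patterns without hitting the network, here is a minimal sketch that runs them against the sample markup from the question. The variable names html, lazy, and negated are made up for illustration; html stands in for website.content.

```python
import re

# Sample markup copied from the question, as bytes (like website.content).
html = (
    b'<span class="btn-utility-container"><a class="btn-utility btn-watchlist " \n'
    b'data-button="watchlist" data-id="299830" data-name="American Pickers"'
)

# Lazy dot-star: capture as little as possible up to the next double quote.
lazy = re.findall(b'data-name="(.*?)"', html)

# Negated character class: capture everything that is not a double quote.
negated = re.findall(b'data-name="([^"]*)"', html)

print(lazy)
print(negated)
```

Both calls should return the same list here. The negated class also makes the lazy ? unnecessary, since [^"] can never run past the closing quote.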

3 Comments

Absolutely works, so thank you! Is there a way to remove the b' (bytes) prefix attached to each title in the output?
Great! Maybe remove the b in front of the 'data-name=...'? I left it in because that was in your initial code.
Figured it out: y = [x.decode('utf-8') for x in caps]
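A small sketch of the two ways to end up with plain strings, assuming the page is UTF-8 encoded. The sample values are illustrative. Note that simply dropping the b from the pattern only works if the input is also a str; with requests that would mean matching against website.text rather than website.content, since re refuses to mix a str pattern with bytes input.

```python
import re

# Option 1: match on bytes, then decode each result (assumes UTF-8).
caps = [b'American Pickers']  # illustrative result of the bytes findall
titles = [x.decode('utf-8') for x in caps]

# Option 2: match a str pattern against already-decoded text
# (with requests, this would be website.text instead of website.content).
text = 'data-button="watchlist" data-id="299830" data-name="American Pickers"'
titles_from_text = re.findall(r'data-name="([^"]*)"', text)

print(titles)
print(titles_from_text)
```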
