I am having trouble manipulating some strings. I am scraping data from a website and facing two challenges:

  1. I am scraping unnecessary data because the target website reuses the same class name for several elements. My goal is to isolate this extra data and discard it so that I keep only the data I am interested in.

  2. With the remaining data, I need to split the string so that I can store each piece of information in its own variable.

My initial plan was to use a simple split() call, store the resulting strings in a list, and then keep the parts I want. Unfortunately, every time I do this I end up with 3 separate lists that I cannot manipulate or split further.

Here is the code:

from selenium import webdriver
from bs4 import BeautifulSoup


driver = webdriver.Chrome('\\Users\\rapha\\Desktop\\10Milz\\4. Python\\Python final\\Scrape\\chromedriver.exe')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")

content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )

for infos in soup.find_all('h3', class_='section-title'):
    title = infos.get_text()
    title = ' '.join(title.split()) 
    title_list = []
    title_list = title.split(" | ")
    print(title_list)

Here is the "raw data" retrieved:

Player Results
Tournament Results
Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020

And here is what I would like to achieve:

Variable_1 = Salvatore Caruso
Variable_2 = Brandon Nakashima
Variable_3 = Indian Wells
Variable_4 = 2020

Could you please let me know how to proceed here?

1 Answer

How about this?

It's not so pretty, but it will work as long as there is always a VS. between the player names and a | before the tournament, and the year is always 4 digits.

from selenium import webdriver
from bs4 import BeautifulSoup


driver = webdriver.Chrome('/home/lewis/Desktop/chromedriver')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")

content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )

text = soup.find_all('h3', class_='section-title')[2].get_text().replace("\n","")
while text.find("  ")> -1:
    text = text.replace("  "," ")
text = text.strip()
# split on both separators: first on "VS.", then each part on "|"
split = [st.split("|") for st in text.split("VS.")]
# flatten the nested lists into a single list
flat_list = [item for sublist in split for item in sublist]
# extract the 4-digit year from the end of the last item
flat_list.append(flat_list[-1][-4:])
# remove the year from the 3rd item
flat_list[2] = flat_list[2][:-4]
# strip any leading or trailing whitespace
final_list = [x.strip() for x in flat_list]

print(final_list)

Output:

['Salvatore Caruso', 'Brandon Nakashima', 'Indian Wells', '2020']
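
The same split can also be done with a single regular expression. The following is only a minimal sketch, assuming the heading always has the form "Player A VS. Player B | Tournament YYYY"; the name parse_match_title and the group names are hypothetical and not part of the code above.

```python
import re

# Assumed heading shape: "<player 1> VS. <player 2> | <tournament> <4-digit year>"
MATCH_TITLE = re.compile(
    r"^(?P<player_1>.+?)\s+VS\.\s+(?P<player_2>.+?)\s*\|\s*"
    r"(?P<tournament>.+?)\s+(?P<year>\d{4})$"
)


def parse_match_title(title):
    """Return (player_1, player_2, tournament, year), or None if the shape differs."""
    # Collapse newlines and repeated spaces into single spaces first.
    normalized = " ".join(title.split())
    m = MATCH_TITLE.match(normalized)
    if m is None:
        return None
    return m.group("player_1", "player_2", "tournament", "year")


result = parse_match_title("Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020")
print(result)
```

Headings that do not follow the assumed shape, such as "Player Results", yield None rather than a partial split, so they can be skipped explicitly.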

6 Comments

Thank you, this is quite helpful and does answer part of the challenge. That said, you start with the right string. One of my problems is that I cannot remove the 2 others (['Player Results'] ['Tournament Results']); I always end up dealing with all 3 at once and cannot specifically target the one you used in your example. How should I proceed here?
I've updated it :) Please accept my answer if it helped you.
I just checked the update, but unfortunately the initial string is still not like the one returned by the scrape. You are using a single string that contains the 3 elements, while the scrape returns 3 independent strings that I can hardly merge or separate. If you use my code and print the type of the title variable, you'll see it returns 3 strings, but I cannot target each string specifically to separate or merge them.
See, when using your code I end up with the same problem as with mine: it returns separate lists inherited from the separate string blocks. This is what I get after the line "flat_list = [item for sublist in split for item in sublist]": ['Player Results', 'ults'] ['Tournament Results', 'ults'] ['Salvatore Caruso ', ' Brandon Nakashima ', ' Indian Wells 2020', '2020']. After that, targeting index [2] does not work; the only index that works is [0].
Still the same problem :/ (I also initially assumed it came from the spaces and the \n line breaks, but that does not seem to make a difference.)
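
The three separate lists described in these comments are what you would expect when the splitting code runs inside the for loop: each pass handles one heading on its own, so index [2] never exists within a single pass. One possible way around it, sketched below, is to collect all heading texts first and then pick the one that contains the separators. The hard-coded strings stand in for the values returned by get_text() on each h3, and their exact whitespace is an assumption; find_match_title is a hypothetical helper name.

```python
def normalize(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


def find_match_title(raw_titles):
    """Return the first heading that looks like 'A VS. B | Tournament YYYY', else None."""
    for raw in raw_titles:
        title = normalize(raw)
        if " VS. " in title and "|" in title:
            return title
    return None


# Stand-ins for the three h3 texts from the page (exact whitespace is assumed).
raw_titles = [
    "\n  Player Results  \n",
    "\n  Tournament Results  \n",
    "\n  Salvatore Caruso\n  VS.\n  Brandon Nakashima | Indian Wells 2020  \n",
]

match_title = find_match_title(raw_titles)
print(match_title)
```

In the scraper, raw_titles would presumably be built from the loop in the question, for example by appending infos.get_text() to a list created before the loop, and the splitting would then run once on match_title after the loop ends.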