Retrieve string values with list

Question

I am having trouble manipulating some strings. I am scraping data from a website and I am facing two challenges:

1. I am scraping unnecessary data because the website I target reuses the same class name for several elements. My goal is to isolate this data and delete it so that I keep only the data I am interested in.
2. With the data I keep, I need to split the string in order to store some information in specific variables.

Initially I was planning to use a simple split() call, store each resulting string in a list, and then work with that list to keep the parts that I want. Unfortunately, every time I do this, I end up with 3 separate lists that I cannot manipulate or split.

Here is the code:

from selenium import webdriver
from bs4 import BeautifulSoup


driver = webdriver.Chrome('\\Users\\rapha\\Desktop\\10Milz\\4. Python\\Python final\\Scrape\\chromedriver.exe')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")

content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )

for infos in soup.find_all('h3', class_='section-title'):
    title = infos.get_text()
    title = ' '.join(title.split()) 
    title_list = []
    title_list = title.split(" | ")
    print(title_list)

Here is the raw data retrieved:

Player Results
Tournament Results
Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020

And here is what I would like to achieve:

Variable_1 = Salvatore Caruso
Variable_2 = Brandon Nakashima
Variable_3 = Indian Wells
Variable_4 = 2020

Could you please let me know how to proceed here?

Lewis Morris · Accepted Answer · 2020-08-24 14:36:26Z

How about this?

It's not so pretty, but it will work as long as there is always a "VS." between the two names, a "|" before the tournament, and the year at the end is always 4 digits.

from selenium import webdriver
from bs4 import BeautifulSoup


driver = webdriver.Chrome('/home/lewis/Desktop/chromedriver')
driver.get("https://www.atptour.com/en/scores/2020/7851/MS011/match-stats")

content = driver.page_source
soup = BeautifulSoup(content, "html.parser" )

# take only the third heading (index 2), which holds the match title
text = soup.find_all('h3', class_='section-title')[2].get_text().replace("\n", "")
# collapse runs of spaces into a single space
while text.find("  ") > -1:
    text = text.replace("  ", " ")
text = text.strip()
# split on "VS." first, then split each piece on "|"
split = [st.split("|") for st in text.split("VS.")]
# flatten the nested lists
flat_list = [item for sublist in split for item in sublist]
# extract the 4-digit year from the end of the last item
flat_list.append(flat_list[-1][-4:])
# remove the year from the 3rd item
flat_list[2] = flat_list[2][:-4]
# strip any leading or trailing white space
final_list = [x.strip() for x in flat_list]

print(final_list)

Output:

['Salvatore Caruso', 'Brandon Nakashima', 'Indian Wells', '2020']
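The same assumptions (a "VS." between the names, a "|" before the tournament, a 4-digit year at the end) could also be written as a single regular expression. The following is only a sketch, not part of the original answer: `parse_match_title` and `TITLE_PATTERN` are hypothetical names, and the sample string is copied from the question rather than scraped.

```python
import re

# Hypothetical helper, not part of the original answer.
# Assumes the heading looks like:
#   "<player 1> VS. <player 2> | <tournament> <4-digit year>"
TITLE_PATTERN = re.compile(
    r"^(?P<player_1>.+?)\s+VS\.\s+(?P<player_2>.+?)\s*\|\s*"
    r"(?P<tournament>.+?)\s+(?P<year>\d{4})$"
)


def parse_match_title(text):
    """Return (player_1, player_2, tournament, year), or None if there is no match."""
    # Collapse newlines and repeated spaces into single spaces first.
    normalized = " ".join(text.split())
    match = TITLE_PATTERN.match(normalized)
    if match is None:
        return None
    return match.group("player_1", "player_2", "tournament", "year")


# Sample string taken from the question's raw data.
sample = "Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020"
print(parse_match_title(sample))
print(parse_match_title("Player Results"))  # no "VS." or "|", so this prints None
```

Returning None for text that does not fit the pattern means headings such as "Player Results" can be skipped without relying on their position in the page.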

edited Aug 24, 2020 at 14:36

answered Aug 23, 2020 at 17:11


6 Comments

Raphaël Ambit Over a year ago

Thank you for that, this is quite helpful and does answer part of the challenge. Having said that, you start with the right string. One of the problems I have is that I cannot remove the 2 others (['Player Results'] ['Tournament Results']); I always end up dealing with all 3 at the same time and cannot specifically use the one you used in your example. How should I proceed here?

Lewis Morris Over a year ago

I've updated it :) Please accept my answer if it helped you.

Raphaël Ambit Over a year ago

I just checked the update, but unfortunately the initial string is still not like the one returned by the scrape. You are using a single string that contains the 3 elements, while the scrape returns 3 independent strings that I can hardly merge or separate. If you use my code and print the type of the title variable, you'll see it returns 3 strings, but I cannot target each string specifically to separate or merge them.

Raphaël Ambit Over a year ago

See, when using your code, I end up with the same problem as with mine: it returns separate lists, one for each separate string block. This is what I get after the line "flat_list = [item for sublist in split for item in sublist]": ['Player Results', 'ults'] ['Tournament Results', 'ults'] ['Salvatore Caruso ', ' Brandon Nakashima ', ' Indian Wells 2020', '2020']. After that, targeting index [2] does not work; the only index that works is [0].

Raphaël Ambit Over a year ago

Still the same problem :/ (I also assumed it came from the spaces & the backslash for lines initially, but it does not seem to make a difference)
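One way to avoid handling all three headings at once, sketched here under stated assumptions: collect every heading string first, keep only the ones that contain both "VS." and "|", and split only those. The `headings` list below is hard-coded from the raw data in the question; in the real script it would be built from get_text() on each element returned by find_all. The names `is_match_title` and `split_match_title` are hypothetical.

```python
# Sample strings copied from the question's raw output. In the real script they
# would come from get_text() on each h3 element with class "section-title".
headings = [
    "Player Results",
    "Tournament Results",
    "Salvatore Caruso VS. Brandon Nakashima | Indian Wells 2020",
]


def is_match_title(text):
    """Assumption: only the match heading contains both 'VS.' and '|'."""
    return "VS." in text and "|" in text


def split_match_title(title):
    """Split '<p1> VS. <p2> | <tournament> <year>' into a list of four strings."""
    title = " ".join(title.split())  # collapse newlines and repeated spaces
    players, _, event = title.partition("|")
    player_1, _, player_2 = players.partition("VS.")
    # Assumption: the year is the last space-separated word of the event part.
    tournament, _, year = event.strip().rpartition(" ")
    return [player_1.strip(), player_2.strip(), tournament.strip(), year]


# Filter first, then split, so the other headings never reach the split step.
results = [split_match_title(h) for h in headings if is_match_title(h)]
print(results)
```

Because the filter looks at the content of each string rather than its position, it does not matter whether the match heading is the first, second, or third one found.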
