I'm learning web scraping and found a fun challenge: scraping a table rendered with JavaScript Handlebars from this page: Samsung Knox Devices

I eventually got the output I wanted, but I think it feels "hacky", so I'd appreciate any refinements to make it more elegant.

The desired output is a DataFrame/CSV table with the columns Device, Model_Nums, OS/Platform, and Knox Version. I don't need anything else on the page, and I will split/expand and melt the Model_Nums separately.

import pandas as pd

# Libraries for this task: 
from bs4 import BeautifulSoup
from selenium import webdriver

# Because the target table is built using Javascript handlebars, we have to use Selenium and a webdriver
driver = webdriver.Edge("MY_PATH") # REPLACE WITH >YOUR< PATH!

# Point the driver at the target webpage:
driver.get('https://www.samsungknox.com/en/knox-platform/supported-devices')

# Get the page content
html = driver.page_source
# Typically I'd do something like: soup = BeautifulSoup(html, "lxml")
# Link below suggested the following, which works; I don't know if it matters
sp = BeautifulSoup(html, "html.parser")

# The "table" here is really a bunch of nested divs
tables = sp.find_all("div", class_='table-row')
# https://www.angularfix.com/2021/09/how-to-extract-text-from-inside-div-tag.html
rows = []
for t in tables:
    row = t.text
    rows.append(row)

# These are the div classes within each table-row (from the output of the previous step) that I want:
    # div class="supported-devices pivot-fixed"
    # div class="model"
    # div class="operating-system"
    # div class="knox-version"

# Define div class names:
targets = ["supported-devices pivot-fixed", "model", "operating-system", "knox-version"]

# Create an empty list and loop through each target div class; append to list
data = []
for t in targets:
    hold = sp.find_all("div", class_=t)
    for h in hold:
        row = h.text
        data.append({'column': t, 'value': row}) 

df = pd.DataFrame(data)

# This feels like a hack, but I got stuck and it works, so \shrug/
# Create Series from filtered df based on 'column' value (corresponding to the four "targets" above)
name = pd.Series(df['value'][df['column']=='supported-devices pivot-fixed']).reset_index(drop=True)
model = pd.Series(df['value'][df['column']=='model']).reset_index(drop=True)
os = pd.Series(df['value'][df['column']=='operating-system']).reset_index(drop=True)
knox = pd.Series(df['value'][df['column']=='knox-version']).reset_index(drop=True)
# Concatenate Series into df
df2 = pd.concat([name, model, os, knox], axis=1)

# Make the first row the column names:
new_header = df2.iloc[0] #grab the first row for the header
sam_knox_table = df2[1:] #take the data less the header row
sam_knox_table.columns = new_header #set the header row as the df header

# Bob's your uncle
sam_knox_table.to_csv('sam_knox.csv', index=False)
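
The separate split step for the model numbers mentioned earlier could look roughly like this. It is only a sketch: it uses `DataFrame.explode` instead of the split/expand-and-melt route, the two sample rows are copied from the page, and `explode_models` is a hypothetical helper name.

```python
import pandas as pd

# Two sample rows in the shape of the scraped table.
df = pd.DataFrame({
    "Device": ["Galaxy A52", "Galaxy A52 5G"],
    "Model_Nums": ["SM-A525F, SM-A525M", "SM-A5260"],
})

def explode_models(table, column="Model_Nums"):
    """Return one row per model number instead of one row per device."""
    out = table.copy()
    # Turn "A, B" into ["A", " B"]; whitespace is stripped after exploding.
    out[column] = out[column].str.split(",")
    # One row per list element, with a fresh 0..n-1 index.
    out = out.explode(column).reset_index(drop=True)
    out[column] = out[column].str.strip()
    return out

long_df = explode_models(df)
```

With the sample rows above, `long_df` has three rows: two for Galaxy A52 and one for Galaxy A52 5G.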

1 Answer

To scrape the text from the DEVICE and MODEL CODE columns, use WebDriverWait with visibility_of_all_elements_located() to wait until the cells are visible, collect their text with a list comprehension, and then load the lists into a DataFrame using Pandas. You can use the following locator strategies:

  • Code Block:

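    # Imports needed by this snippet (the driver is created as in the question)
    import pandas as pd
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver.get("https://www.samsungknox.com/en/knox-platform/supported-devices")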
    devices = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-row:not(.table-header) > div.supported-devices")))]
    models = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table-row:not(.table-header) > div.model")))]
    df = pd.DataFrame(data=list(zip(devices, models)), columns=['DEVICE', 'MODEL CODE'])
    print(df)
    driver.quit()
    
  • Console Output:

                   DEVICE                                      MODEL CODE
    0       Galaxy A42 5G          SM-A426N, SM-A426U, SM-A4260, SM-A426B
    1          Galaxy A52                              SM-A525F, SM-A525M
    2       Galaxy A52 5G                                        SM-A5260
    3       Galaxy A52 5G            SM-A526U, SC-53B, SM-A526W, SM-A526B
    4      Galaxy A52s 5G                              SM-A528B, SM-A528N
    ..                ...                                             ...
    371        Gear Sport                                         SM-R600
    372   Gear S3 Classic                                        SM-R775V
    373  Gear S3 Frontier                                        SM-R765V
    374           Gear S2           SM-R720, SM-R730A, SM-R730S, SM-R730V
    375   Gear S2 Classic  SM-R732, SM-R735, SM-R735A, SM-R735V, SM-R735S
    
    [376 rows x 2 columns]
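
The same pattern should extend to the other two columns. Below is a rough sketch, not tested against the live page: it assumes the `operating-system` and `knox-version` class names from the question's `targets` list, assumes every column yields the same number of cells (otherwise `pd.DataFrame` raises a `ValueError`), and expects `driver` to already be on the page. `COLUMNS`, `row_selector` and `scrape_table` are hypothetical names.

```python
import pandas as pd

# Output column name -> div class. The class names come from the question's
# `targets` list and are assumed to match the page.
COLUMNS = {
    "Device": "supported-devices",
    "Model_Nums": "model",
    "OS/Platform": "operating-system",
    "Knox Version": "knox-version",
}

def row_selector(css_class):
    """CSS selector for one column's cells, skipping the header row."""
    return f"div.table-row:not(.table-header) > div.{css_class}"

def scrape_table(driver, timeout=20):
    """Collect the text of each column's cells and assemble a DataFrame."""
    # Imported here so that row_selector can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    data = {}
    for name, css_class in COLUMNS.items():
        # Wait until all cells of this column are visible, then read them.
        cells = wait.until(
            EC.visibility_of_all_elements_located(
                (By.CSS_SELECTOR, row_selector(css_class))
            )
        )
        data[name] = [cell.text for cell in cells]
    return pd.DataFrame(data)
```

After `driver.get(...)`, calling `df = scrape_table(driver)` followed by `df.to_csv('sam_knox.csv', index=False)` would give the four-column CSV, provided the assumptions above hold.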
    
3 Comments

Gorgeous! I do still want the Platform/OS and Knox Version columns, but it looks like I can copy and modify the "devices" and "models" lines above, yes? Assuming so, this will do the job and will definitely help me improve. Much appreciated!
@Steph Given the outline with the first 2 columns, you can accommodate any number of columns.
I was gonna...give me a minute! LOL
