
I have a CSV file with the columns Year, Title, and Author, e.g.:

Year,Title,Author
2018,Becoming,Michelle Obama
2018,Educated,Tara Westover
2018,Grant,Ron Chernow

I want to add two more columns, one for word count and one for page count.

I have written the following script which opens a web page, searches for the book and extracts word count and page count information.

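import time
from selenium import webdriver

# chromedriver: path to the ChromeDriver executable; title: the book title to search for
driver = webdriver.Chrome(chromedriver)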
driver.get('https://www.readinglength.com/')
driver.maximize_window()
driver.implicitly_wait(10)
time.sleep(5)
search_box = driver.find_element_by_id("downshift-0-input")
search_box.send_keys(title)
search_box.submit()
driver.implicitly_wait(10)
word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
print(word_count)
print(page_count)
time.sleep(5)
driver.quit()

I would like to do the following:

Get the title from the csv file and input it into the search. Extract the word count and page count information and add it to the respective row and column. Repeat for every title/row in the csv.

Any help would be greatly appreciated!

Have you had a look at the pandas package? It is very convenient for this. Commented Dec 11, 2019 at 9:13

3 Answers


In Python, the most convenient way to work with CSV files is usually the pandas package. Its read_csv function loads a CSV file: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html The result is a DataFrame, pandas' tabular data type, and from there you can do a lot with your data. For example, https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/ shows how to add columns.

Alternatively, you can read the CSV file with the csv module from Python's standard library; a short tutorial is available here: https://realpython.com/python-csv/
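For illustration, here is a minimal sketch of the standard-library route. The lookup_counts function is a hypothetical stand-in for the web lookup and only returns placeholder values; the new column names "Word Count" and "Page Count" are assumptions as well.

```python
import csv
import io


def lookup_counts(title):
    """Hypothetical stand-in for the web lookup; returns placeholder strings."""
    return "n/a", "n/a"


def add_count_columns(source, target, lookup=lookup_counts):
    """Copy CSV rows from source to target, appending two new columns."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["Word Count", "Page Count"]
    writer = csv.DictWriter(target, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Look up the counts for this row's title and store them in the new columns
        row["Word Count"], row["Page Count"] = lookup(row["Title"])
        writer.writerow(row)


# Usage with in-memory text; real files opened with newline="" work the same way.
sample = io.StringIO("Year,Title,Author\n2018,Becoming,Michelle Obama\n")
result = io.StringIO()
add_count_columns(sample, result)
print(result.getvalue())
```

Passing the lookup as a parameter keeps the file handling separate from the scraping, so the real Selenium function can be swapped in later.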

I hope this is going to help you :)



Using the pandas package seems the most convenient way of doing this. pandas provides the DataFrame class which has nice methods to read and write csv, and also an apply method with which we can create new columns based on values of other columns. Your use case would look something like this (I did not test your code, just pasted it into the fetch_data() function):

import time
import pandas as pd
from selenium import webdriver


def fetch_data(title):
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()  
    driver.implicitly_wait(10)  
    time.sleep(5)  
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    time.sleep(5) 
    driver.quit()

    return word_count, page_count

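def process_file(input_file_path, output_file_path):
    # Use the pd alias from the import above; the bare name "pandas" is not defined here
    df = pd.read_csv(input_file_path)
    # The column name must match the CSV header exactly ("Title" in the sample file)
    df[['word_count', 'page_count']] = df['Title'].apply(fetch_data).apply(pd.Series)

    # index=False keeps the row index from being written as an extra column
    df.to_csv(output_file_path, index=False)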

The main advantage of pandas, fast operations on DataFrames, is pretty much irrelevant in your case because the web scraping takes far more time. Still, I'd say doing it this way with pandas is a convenient, concise, and readable way of writing the code.
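To see how the apply(...).apply(pd.Series) step spreads the returned tuple across two columns without opening a browser, here is a small sketch. fake_fetch is a hypothetical stand-in for fetch_data() and its return values are made up for the demonstration; the column name "Title" is taken from the sample CSV in the question.

```python
import io

import pandas as pd


def fake_fetch(title):
    """Hypothetical stand-in for fetch_data(); returns made-up placeholder values."""
    return f"{len(title)} words", f"{len(title) * 2} pages"


csv_text = "Year,Title,Author\n2018,Becoming,Michelle Obama\n2018,Grant,Ron Chernow\n"
df = pd.read_csv(io.StringIO(csv_text))

# apply() yields a Series of tuples; apply(pd.Series) expands each tuple into columns 0 and 1
counts = df["Title"].apply(fake_fetch).apply(pd.Series)
counts.columns = ["word_count", "page_count"]

# Joining on the shared row index attaches the new columns to the matching rows
df = df.join(counts)
print(df.to_csv(index=False))
```

Naming the columns explicitly and joining on the index is one way to attach the result; assigning to df[['word_count', 'page_count']] directly, as above, is the more compact form.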

5 Comments

Hello Jonathan, thank you for helping! I've tried running the code but I get a syntax error on the file paths. I've tried writing them as follows: (r'C:\Users\A\Desktop\Python\books_nf.csv') ('C:\Users\A\Desktop\Python\books_nf.csv') ('books_nf.csv)
@asd7 Please post your code. Else it is hard to track a SyntaxError :) Before, have a look at stackoverflow.com/questions/58774794/… :)
The error:
  File "C:\Users\Adrian\AppData\Local\Temp\atom_script_tempfiles\f5e52ba0-1cfa-11ea-886e-370ff6a7417f", line 46
    def process_file(r'C:\Users\Adrian\Desktop\Python\books_nf.csv', r'C:\Users\Adrian\Desktop\Python\books_nf.csv'):
    ^
SyntaxError: invalid syntax
The code:
def process_file(r'C:\Users\Adrian\Desktop\Python\books_nf.csv', r'C:\Users\Adrian\Desktop\Python\books_nf.csv'):
    df = pandas.read_csv(r'C:\Users\Adrian\Desktop\Python\books_nf.csv')
    df[['word_count', 'page_count']] = df['Book'].apply(fetch_data).apply(pd.Series)
    df.to_csv(r'C:\Users\Adrian\Desktop\Python\books_nf.csv')
@asd7 You changed the definition of the function. The definition must keep the parameter names; the paths are passed as arguments when you call the function. So, with my code above, just run process_file(r'C:\Users\Adrian\Desktop\Python\books_nf.csv', r'C:\Users\Adrian\Desktop\Python\books_nf.csv'). You might also want to have a look at tutorialspoint.com/python/python_functions.htm to understand what's going on.
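The distinction made in the last comment can be sketched as follows: the definition lists parameter names, and concrete paths appear only in the call. The function body here is a simplified stand-in that just copies the file, and the temporary directory and the output name books_out.csv are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def process_file(input_file_path, output_file_path):
    """Simplified stand-in: the definition lists parameter names, not path strings."""
    shutil.copyfile(input_file_path, output_file_path)


# The concrete paths are supplied when the function is called
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "books_nf.csv"
source.write_text("Year,Title,Author\n2018,Becoming,Michelle Obama\n")
process_file(source, work_dir / "books_out.csv")
print((work_dir / "books_out.csv").read_text())
```

On Windows, the paths can be passed as raw strings (the r'...' form used in the comments) so that backslashes are not treated as escape sequences.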

Something like this should work. Please amend as needed.

import time

import pandas as pd
from selenium import webdriver

def web_search(title: str):
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()  
    driver.implicitly_wait(10)  
    time.sleep(5)  
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    print(word_count)
    print(page_count)
    time.sleep(5) 
    driver.quit()
    return word_count, page_count

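# file: path to the input CSV
df = pd.read_csv(file)

for index, row in df.iterrows():
    # The column name must match the CSV header exactly ("Title" in the sample file)
    title = row['Title']
    print("Retrieving " + str(title))
    word_count, page_count = web_search(title)
    df.loc[index, 'word_count'] = word_count
    df.loc[index, 'page_count'] = page_count

# The new columns only reach disk here; index=False omits the row-index column
df.to_csv('newfile.csv', index=False)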

2 Comments

Hello Julian, Thank you for your help! When I run the code it prints: Retrieving "title" word_count page_count However I haven't been able to write the word and page count into the respective row and column of the csv from which the titles are retrieved. I am not quite sure what the df.loc does, it doesn't affect the output if I comment it. Is it supposed to locate the index to which the word and page count is going to be printed?
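df.loc selects a cell by row label and column label. In the loop, df.loc[index, 'word_count'] = word_count writes the value into the row currently being iterated over. Note that the result is written to newfile.csv, not back to the original file. Please have a look at the pandas documentation here: pandas.pydata.org/pandas-docs/stable/reference/api/…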
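A small sketch of what the df.loc assignment does, using made-up placeholder values in place of the scraped ones (the placeholder dictionary is purely hypothetical):

```python
import io

import pandas as pd

csv_text = "Year,Title,Author\n2018,Becoming,Michelle Obama\n2018,Educated,Tara Westover\n"
df = pd.read_csv(io.StringIO(csv_text))

# Placeholder values standing in for the scraped results (not real data)
placeholder = {"Becoming": ("w1", "p1"), "Educated": ("w2", "p2")}

for index, row in df.iterrows():
    word_count, page_count = placeholder[row["Title"]]
    # df.loc[row_label, column_label] addresses one cell; assigning creates the column if needed
    df.loc[index, "word_count"] = word_count
    df.loc[index, "page_count"] = page_count

# The DataFrame lives in memory; only to_csv() writes the new columns out
output = df.to_csv(index=False)
print(output)
```

If the df.loc lines are commented out, the DataFrame never receives the values, so the printed output looks the same but the saved file has no new columns.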
