Using Python to read data from a CSV file as input and write output to a CSV file

Question

I have a CSV file with the columns Year, Title and Author, e.g.:

Year,Title,Author
2018,Becoming,Michelle Obama
2018,Educated,Tara Westover
2018,Grant,Ron Chernow

I want to add two more columns, one for word count and one for page count.

I have written the following script which opens a web page, searches for the book and extracts word count and page count information.

driver = webdriver.Chrome(chromedriver)
driver.get('https://www.readinglength.com/')
driver.maximize_window()
driver.implicitly_wait(10)
time.sleep(5)
search_box = driver.find_element_by_id("downshift-0-input")
search_box.send_keys(title)
search_box.submit()
driver.implicitly_wait(10)
word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
print(word_count)
print(page_count)
time.sleep(5)
driver.quit()

I would like to do the following:

Get the title from the csv file and input it into the search. Extract the word count and page count information and add it to the respective row and column. Repeat for every title/row in the csv.

Any help would be greatly appreciated!

Did you have a look at the pandas package? This is very convenient. – Jonathan Herrera, Commented Dec 11, 2019 at 9:13

teoML · Accepted Answer · 2019-12-11 09:19:43Z

0

In Python, a common and convenient way to work with .csv files is the pandas package. pandas has a function to read a CSV file: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html From there on, you can do a lot with the data (in pandas it is represented as a data type called DataFrame). See, for example, https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/ for how to add columns.

You can also read the CSV file with Python's built-in csv module; a short tutorial is available here: https://realpython.com/python-csv/
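To make the csv-module route concrete, here is a minimal sketch. It assumes the header row from the question (Year,Title,Author). The lookup_counts() function is a hypothetical stand-in for the Selenium lookup and only returns empty strings, and the in-memory files are there just so the snippet runs on its own.

```python
import csv
import io


def lookup_counts(title):
    """Hypothetical stand-in for the Selenium lookup; returns (word_count, page_count)."""
    return "", ""


def add_counts(reader_file, writer_file):
    reader = csv.DictReader(reader_file)
    # Keep the original columns and append the two new ones.
    fieldnames = list(reader.fieldnames) + ["word_count", "page_count"]
    writer = csv.DictWriter(writer_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["word_count"], row["page_count"] = lookup_counts(row["Title"])
        writer.writerow(row)


# In-memory files keep the sketch self-contained; with real files, open them
# with newline="" as the csv module documentation recommends.
source = io.StringIO(
    "Year,Title,Author\n"
    "2018,Becoming,Michelle Obama\n"
    "2018,Educated,Tara Westover\n"
)
target = io.StringIO()
add_counts(source, target)
print(target.getvalue())
```

With real files you would pass two open file objects instead of the StringIO buffers and replace lookup_counts() with the actual scraping function.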

I hope this is going to help you :)

answered Dec 11, 2019 at 9:19

teoML



Jonathan Herrera · Accepted Answer · 2019-12-11 09:38:46Z

0

Using the pandas package seems the most convenient way of doing this. pandas provides the DataFrame class, which has methods for reading and writing CSV files, and an apply method for creating new columns from the values of other columns. Your use case would look something like this (I did not test your code, just pasted it into the fetch_data() function):

import time
import pandas as pd
from selenium import webdriver


def fetch_data(title):
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()  
    driver.implicitly_wait(10)  
    time.sleep(5)  
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    time.sleep(5) 
    driver.quit()

    return word_count, page_count

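def process_file(input_file_path, output_file_path):
    # pandas is imported as pd above, so the call must go through that alias.
    df = pd.read_csv(input_file_path)
    # The CSV header is "Title" (capitalized); column names are case-sensitive.
    df[['word_count', 'page_count']] = df['Title'].apply(fetch_data).apply(pd.Series)

    # index=False keeps the numeric row index out of the output file.
    df.to_csv(output_file_path, index=False)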

The main advantage of pandas, fast operations on DataFrames, is pretty much irrelevant in your case, because the web scraping takes far more time. Still, I'd say doing it this way with pandas is a convenient, concise and readable way of writing the code.
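The apply step can be tried without a browser. The sketch below is only an illustration: fetch_data_stub() is a hypothetical replacement for fetch_data() that returns placeholder strings, the CSV text is inlined from the question, and the tuples are unpacked by building a DataFrame from a list rather than with .apply(pd.Series), which is a swapped-in alternative that gives the same two columns.

```python
import io

import pandas as pd


def fetch_data_stub(title):
    """Hypothetical stand-in for fetch_data(); returns placeholder strings."""
    return f"words for {title}", f"pages for {title}"


csv_text = (
    "Year,Title,Author\n"
    "2018,Becoming,Michelle Obama\n"
    "2018,Grant,Ron Chernow\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# apply() returns a Series holding one (word_count, page_count) tuple per row.
pairs = df["Title"].apply(fetch_data_stub)

# Unpack the tuples into two named columns, aligned on the original row index.
counts = pd.DataFrame(
    pairs.tolist(), columns=["word_count", "page_count"], index=df.index
)
df = df.join(counts)

print(df.to_csv(index=False))
```

Once this shape works with the stub, swapping the real scraping function back in should only change how long each row takes.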

edited Dec 11, 2019 at 9:38

answered Dec 11, 2019 at 9:32

Jonathan Herrera


5 Comments

asd7 Over a year ago

Hello Jonathan, thank you for helping! I've tried running the code but I get a syntax error on the file paths. I've tried writing them as follows: (r'C:\Users\A\Desktop\Python\books_nf.csv') ('C:\Users\A\Desktop\Python\books_nf.csv') ('books_nf.csv)

Jonathan Herrera Over a year ago

@asd7 Please post your code. Else it is hard to track a SyntaxError :) Before, have a look at stackoverflow.com/questions/58774794/… :)

asd7 Over a year ago

The error: ` File "C:\Users\Adrian\AppData\Local\Temp\atom_script_tempfiles\f5e52ba0-1cfa-11ea-886e-370ff6a7417f", line 46 def process_file(r'C:\Users\Adrian\Desktop\Python\books_nf.csv', r'C:\Users\Adrian\Desktop\Python\books_nf.csv'): ^ SyntaxError: invalid syntax`

asd7 Over a year ago

The code:

def process_file(r'C:\Users\Adrian\Desktop\Python\books_nf.csv', r'C:\Users\Adrian\Desktop\Python\books_nf.csv'):     df = pandas.read_csv(r'C:\Users\Adrian\Desktop\Python\books_nf.csv')     df[['word_count', 'page_count']] = df['Book'].apply(fetch_data).apply(pd.Series)     df.to_csv(r'C:\Users\Adrian\Desktop\Python\books_nf.csv')

Jonathan Herrera Over a year ago

@asd7 You have to pass the paths as arguments when you call the function. You changed the definition of the function. So, with my above code, just run process_file(r'C:\Users\Adrian\Desktop\Python\books_nf.csv', r'C:\Users\Adrian\Desktop\Python\books_nf.csv') You also might want to have a look at tutorialspoint.com/python/python_functions.htm to understand what's going on.
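As a small illustration of the difference between defining and calling a function: the body below is a toy (hypothetical) that only reports the paths it received, and the output file name books_nf_with_counts.csv is an assumed example, not something from the thread.

```python
def process_file(input_file_path, output_file_path):
    """Toy body (hypothetical): just report which paths were received."""
    return f"read {input_file_path}, write {output_file_path}"


# The definition uses parameter names; the concrete paths appear only in the call.
message = process_file("books_nf.csv", "books_nf_with_counts.csv")
print(message)
```

Putting a string literal such as r'C:\...' inside the parentheses of the def line is what triggers the SyntaxError shown above.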

julian · Accepted Answer · 2019-12-12 16:43:08Z

0

Something like this should work. Please amend as needed.

import time

import pandas as pd
from selenium import webdriver

def web_search(title: str):
    driver = webdriver.Chrome(chromedriver)
    driver.get('https://www.readinglength.com/')
    driver.maximize_window()  
    driver.implicitly_wait(10)  
    time.sleep(5)  
    search_box = driver.find_element_by_id("downshift-0-input")
    search_box.send_keys(title)
    search_box.submit()
    driver.implicitly_wait(10)
    word_count = driver.find_element_by_xpath("//div[@class='book-data']//div[2]").text
    page_count = driver.find_element_by_xpath("//div[@class='book-data']//div[4]").text
    print(word_count)
    print(page_count)
    time.sleep(5) 
    driver.quit()
    return word_count, page_count

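df = pd.read_csv(file)  # file: path to the input CSV

for index, row in df.iterrows():
    # The CSV header is "Title"; row.title would not find that column.
    title = row['Title']
    print("Retrieving " + str(title))
    word_count, page_count = web_search(title)
    df.loc[index, 'word_count'] = word_count
    df.loc[index, 'page_count'] = page_count

# The results are written to a new file; the input CSV is left unchanged.
df.to_csv('newfile.csv', index=False)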

edited Dec 12, 2019 at 16:43

answered Dec 11, 2019 at 9:24

julian


2 Comments

asd7 Over a year ago

Hello Julian, Thank you for your help! When I run the code it prints: Retrieving "title" word_count page_count However I haven't been able to write the word and page count into the respective row and column of the csv from which the titles are retrieved. I am not quite sure what the df.loc does, it doesn't affect the output if I comment it. Is it supposed to locate the index to which the word and page count is going to be printed?

julian Over a year ago

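df.loc selects a cell by row label and column name, so df.loc[index, 'word_count'] = word_count writes the value into that row of the DataFrame as we iterate over all rows. This only changes the DataFrame in memory; the CSV is written by df.to_csv, and the output goes to newfile.csv, not to the file the titles were read from. Please have a look at the pandas documentation here: pandas.pydata.org/pandas-docs/stable/reference/api/…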
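A minimal sketch of what the df.loc assignment does, with placeholder strings (hypothetical) instead of scraped values. Creating the two columns up front is my own addition here; it is not required by the answer's code but makes the new columns explicit.

```python
import pandas as pd

df = pd.DataFrame({"Title": ["Becoming", "Educated"]})

# Create the new columns first so every row has a slot to fill.
df["word_count"] = None
df["page_count"] = None

for index, row in df.iterrows():
    # df.loc[row_label, column_name] addresses exactly one cell.
    df.loc[index, "word_count"] = f"words for {row['Title']}"
    df.loc[index, "page_count"] = f"pages for {row['Title']}"

# Nothing is written to disk until to_csv is called with a path;
# without a path it returns the CSV text, which is handy for checking.
print(df.to_csv(index=False))
```

If the new columns do not show up, it is worth checking that you are opening the output file rather than the original input file.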
