Filtering and Format dataframe from Webscrape

Question

I am new to Python but know R decently. I am trying to scrape stock price data from Yahoo. I successfully retrieved the price data and was able to create a dataframe. However, Yahoo also includes rows for when dividends are paid out. For now, I would like to ignore dividends, but I am having trouble filtering the dataframe to remove those rows. I would also like to change the format of the Date column, for example from Mar 14, 2000 to %Y-%m-%d.

From webscrape:

Date           Open            Close
Dec 23, 2019   0.611 Dividend  None
Dec 01, 2019   88.38           88.90

First, I tried to filter on 'None', but that returns an empty dataframe: df.loc[df.Close=='None']

Second, I tried to replace the Dividend part of the Open column using a function similar to gsub in R, but I may have done it incorrectly. The idea is to replace the value in that cell with a marker value, Remove, and then filter on that marker:

re.sub('Dividend','Remove',df.Open,flags=re.I)

In R, I know you can use str(df) to get the structure of a dataframe, and Python uses df.dtypes, but this returned object for me, and I didn't know what to do with that in order to fix the date issue.

Code used for Webscrape:

import pandas as pd
import bs4 as bs
import urllib.request

url = 'https://finance.yahoo.com/quote/VT/history?period1=1547078400&period2=1607558400&interval=1mo&filter=history&frequency=1mo'

source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source, 'lxml')
tr = soup.find_all('tr')

data = []

# formats price data
for table in tr:
    td = table.find_all('td')
    row = [i.text for i in td]
    data.append(row)        

# labels columns
columns = ['Date', 'Open', 'High', 'Low', 'Close', 'AdjClose', 'Volume']

data = data[1:-2]
df = pd.DataFrame(data)
df.columns = columns

Isn't it better to get just the <strong> element?
– GiovaniSalazar, Commented Jan 10, 2020 at 17:45

What do you mean by that?
– Jack Armstrong, Commented Jan 12, 2020 at 19:00

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

This answer should cover your date question. As for filtering, you should probably learn to use df.loc[]. Kaggle has an excellent resource for learning dataframe manipulation in pandas. Granted, I do not use loc in this solution.
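For the date part of the question, one common approach is to parse the strings with pd.to_datetime and then format them with dt.strftime. A minimal sketch, assuming the Date column holds strings like Dec 23, 2019 (the two sample dates from the question) and an English locale for the month abbreviations:

```python
import pandas as pd

# Sample dates copied from the question's scraped output.
df = pd.DataFrame({"Date": ["Dec 23, 2019", "Dec 01, 2019"]})

# Parse "Mon DD, YYYY" strings into real datetimes.
# %b matches abbreviated month names and is locale dependent.
parsed = pd.to_datetime(df["Date"], format="%b %d, %Y")

# Write the dates back as %Y-%m-%d strings, the format asked for.
df["Date"] = parsed.dt.strftime("%Y-%m-%d")

print(df["Date"].tolist())
```

Keeping the parsed datetime column instead of the formatted strings may be preferable if the dates are later used for sorting or resampling.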

Anyways, using apply and lambda functions, we can quickly iterate over every row and make the changes to your Open column as follows.

df['Open'] = df.apply(lambda row: float(row['Open'].split()[0]), axis=1)

I tested this on your dataframe and it works. In this case, df.apply() with axis=1 applies the given function to every row. Here, we have chosen to use a lambda function. You can name the lambda's argument whatever you want; here it is called row, it receives one row of the dataframe at a time, and you can apply any operations you wish to it.

I chose to pull the Open column value for each row with row['Open'], then split that string on whitespace using .split(), and from there you can take the first piece (which we know to be the number) by indexing with [0]. Finally, I wrapped that in a float() cast to make sure it is a float and not a string.
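The steps described above can be tried on a small frame. This is a minimal sketch, assuming Open holds strings such as 0.611 Dividend and 88.38 (the sample values from the question). The .str accessor chain is an alternative way to write the same operation and is not part of the original answer:

```python
import pandas as pd

# Two sample Open values taken from the question.
df = pd.DataFrame({"Open": ["0.611 Dividend", "88.38"]})

# Row-wise version from the answer: split on whitespace, keep the first piece.
by_apply = df.apply(lambda row: float(row["Open"].split()[0]), axis=1)

# Column-wise alternative using the string accessor.
by_str = df["Open"].str.split().str[0].astype(float)

print(by_apply.tolist())
print(by_str.tolist())
```

Note that both versions keep the dividend row and only strip the word Dividend from it, so the dividend amount ends up in the Open column as if it were a price.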

Learning to use apply() and lambda functions together is extremely valuable in pandas. The Kaggle site is also worth checking out, at least for the pandas tutorials.

edited Jun 20, 2020 at 9:12

CommunityBot

answered Jan 10, 2020 at 17:56

Saucy Dumpling

3 Comments

Jack Armstrong Over a year ago

Thank you for the Kaggle suggestion. I have actually done those classes, but still have a lot to retain and review. Your answer for Open strips the Dividend text from the cell, while I need a way to filter out and remove the dividend rows. My idea involves changing the cell value, then filtering on the new cell value.

Saucy Dumpling Over a year ago

Ah, I see. Yes, that is a lot of information for sure. Could you elaborate on what you're trying to achieve with the dividends column?

Jack Armstrong Over a year ago

Dividends is not a column, but a row. Essentially, remove all the rows that contain dividend info.

Jack Armstrong · Accepted Answer · 2020-01-14 01:29:59Z

0

So I found a solution to the dividend issue. Instead of appending the row and then filtering, don't include the row at all. Essentially,

for table in tr:
    td = table.find_all('td')
    row = [i.text for i in td]
    if len(row) > 1:  # skips rows with fewer than two cells, such as the last "*Close price adjusted for splits...." row
        if 'Dividend' not in row[1]:  # skips the dividend rows
            data.append(row)

answered Jan 14, 2020 at 1:29

Jack Armstrong
