
I am new to Python but know R decently. I am trying to scrape stock price data from Yahoo. I successfully retrieved the price data and was able to create a dataframe. However, Yahoo also includes rows for when dividends are paid out. For now I would like to ignore dividends, but I am having trouble filtering the dataframe to remove those rows. I would also like to change the format of the Date column, for example from Mar 14, 2000 to %Y-%m-%d.

From webscrape:

Date           Open            Close
Dec 23, 2019   0.611 Dividend  None
Dec 01, 2019   88.38           88.90

First, I tried to filter on 'None', but that returns an empty dataframe: df.loc[df.Close=='None']

Second, I tried to replace the Dividend part of the Open column with a function similar to gsub in R, but I may have done it incorrectly. The idea is to replace the value in that cell with a new value, Remove, and then filter on this new value:

re.sub('Dividend','Remove',df.Open,flags=re.I)

Within R, I know you can use str(df) to get the structure of a dataframe, and Python uses df.dtypes, but this returned object for me, and I didn't know what to do with that in order to fix the date issue.

Code used for Webscrape:

import pandas as pd
import bs4 as bs
import urllib.request

url = 'https://finance.yahoo.com/quote/VT/history?period1=1547078400&period2=1607558400&interval=1mo&filter=history&frequency=1mo'

source = urllib.request.urlopen(url).read()      
soup =bs.BeautifulSoup(source,'lxml')
tr = soup.find_all('tr')

data = []

# formats price data
for table in tr:
    td = table.find_all('td')
    row = [i.text for i in td]
    data.append(row)        

# labels columns
columns = ['Date', 'Open', 'High', 'Low', 'Close', 'AdjClose', 'Volume']

data = data[1:-2]
df = pd.DataFrame(data)
df.columns = columns
  • Isn't it better to get just the <strong> element? Commented Jan 10, 2020 at 17:45
  • What do you mean by that? Commented Jan 12, 2020 at 19:00

2 Answers

1

This answer should cover your date question. As for filtering, you should probably learn to use the df.loc[] functionality. Kaggle has an excellent resource for learning dataframe manipulation in pandas. Granted, I do not use loc in this solution.
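For the date part, here is a minimal sketch, assuming the Date column holds strings such as Mar 14, 2000. The sample values and the DateText column name are made up for illustration and are not from the scraped page:

```python
import pandas as pd

# Illustrative rows that mimic the scraped Date column (assumed values)
df = pd.DataFrame({"Date": ["Dec 23, 2019", "Dec 01, 2019", "Mar 14, 2000"]})

# Parse the text into real datetimes; %b is the abbreviated month name
df["Date"] = pd.to_datetime(df["Date"], format="%b %d, %Y")

# Optional: format the datetimes back into strings like 2019-12-23
# (DateText is a hypothetical column name used only in this sketch)
df["DateText"] = df["Date"].dt.strftime("%Y-%m-%d")

print(df)
```

Once the column is a datetime, df.dtypes no longer reports object for it, and sorting or filtering by date works as expected.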

Anyway, using apply and a lambda function, we can quickly iterate over every row and change your Open column as follows.

df['Open'] = df.apply(lambda row: float(row['Open'].split()[0]), axis=1)

I tested this on your dataframe and it works. Here, df.apply() with axis=1 applies a function to every row, and we have chosen to use a lambda function. You can name 'row' whatever you want; the lambda simply takes in each row under that name, and you can then apply any operations you wish to it.

I chose to pull the Open column value for each row with row['Open'], then split that string on spaces using .split(), and from there take the first string (which we know to be the number) by indexing with [0]. Finally, I wrapped that in a float() cast to make sure it is a float and not a string.
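The steps above can be tried on a small stand-in for the scraped data. This is only a sketch: the two rows are assumed sample values, and the vectorized variant using the .str accessor is an alternative I am adding for comparison, not something the question requires:

```python
import pandas as pd

# Two illustrative rows: one dividend row and one ordinary price row
df = pd.DataFrame({"Open": ["0.611 Dividend", "88.38"]})

# Row-wise version: split on whitespace, keep the first token, cast to float
by_apply = df.apply(lambda row: float(row["Open"].split()[0]), axis=1)

# Vectorized equivalent using the pandas string accessor
by_str = df["Open"].str.split().str[0].astype(float)

print(by_apply.tolist())
print(by_str.tolist())
```

Both versions strip the word Dividend but keep the row, so they do not by themselves remove dividend entries. Values containing thousands separators, such as 1,234.56, would need the commas removed before the float conversion.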

Learning to use apply() and lambda functions together is extremely valuable in pandas. Also, the Kaggle site is worth checking out, at least for the pandas tutorials.


3 Comments

Thank you for the Kaggle suggestion. I have actually done those classes, but still have a lot to retain and review. Your answer for Open removes the word Dividend, whereas I need a way to filter out and remove the dividend rows. My idea involves changing the cell value, then filtering on the new cell value.
Ah, I see. Yes, they contain a lot of information for sure. Could you elaborate on what you're trying to achieve with the dividends column?
Dividends is not a column, but a row. Essentially, I want to remove all the rows that contain dividend info.
0

So I found a solution for the dividend issue. Instead of appending the row and then filtering, don't include the row at all. Essentially,

for table in tr:
    td = table.find_all('td')
    row = [i.text for i in td]
    if len(row) > 1:  # skips rows with at most one cell, such as the "*Close price adjusted for splits...." footer
        if 'Dividend' not in row[1]:  # skips the dividend rows
            data.append(row)
