I am new to Python but know R decently. I am trying to scrape stock price data from Yahoo Finance. I successfully retrieved the price data and was able to create a dataframe. However, Yahoo also includes rows for when dividends are paid out. For now I would like to ignore dividends, but I am having trouble filtering the dataframe to remove those rows. I would also like to change the format of the Date column, for example from Mar 14, 2000 to %Y-%m-%d (2000-03-14).
From webscrape:
Date Open Close
Dec 23, 2019 0.611 Dividend None
Dec 01, 2019 88.38 88.90
First, I tried to filter on 'None', but that returns an empty dataframe: df.loc[df.Close=='None']
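A possible reason the filter comes back empty: the dividend rows have fewer cells than the price rows, so pandas pads the missing columns with a real missing value (None) rather than the string 'None'. Below is a minimal sketch on a hand-built frame that mirrors the sample above. The name `sample` and its contents are illustrative stand-ins, not the scraped data.

```python
import pandas as pd

# Hand-built stand-in for the scraped data; the dividend row has no Close
# value, so that cell holds a real missing value, not the text 'None'.
sample = pd.DataFrame(
    [["Dec 23, 2019", "0.611 Dividend", None],
     ["Dec 01, 2019", "88.38", "88.90"]],
    columns=["Date", "Open", "Close"],
)

# Comparing against the string 'None' matches nothing.
empty = sample.loc[sample.Close == "None"]

# isna() / notna() test for genuinely missing values instead.
dividends = sample.loc[sample.Close.isna()]
prices = sample.loc[sample.Close.notna()]
```

If that assumption holds for the real dataframe, `df.loc[df.Close.notna()]` should keep only the price rows.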
Second, I tried to replace the Dividend part of the Open column with a function similar to gsub in R, but I may have done it incorrectly. The idea was to replace the value in that cell with a new value, Remove, and then filter on that new value:
re.sub('Dividend','Remove',df.Open,flags=re.I)
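For comparison, re.sub operates on a single string, whereas a dataframe column is a pandas Series, so the closest counterpart to gsub is probably the vectorised `Series.str.replace`. The sketch below uses a hypothetical two-row frame named `sample`; it also shows `Series.str.contains`, which may make the replace-then-filter step unnecessary.

```python
import re
import pandas as pd

# Illustrative stand-in for the scraped Open column.
sample = pd.DataFrame(
    {"Date": ["Dec 23, 2019", "Dec 01, 2019"],
     "Open": ["0.611 Dividend", "88.38"]}
)

# Rough counterpart of gsub: case-insensitive regex replacement in every cell.
replaced = sample.Open.str.replace("dividend", "Remove", flags=re.I, regex=True)

# Alternatively, flag the dividend rows directly and keep everything else.
is_dividend = sample.Open.str.contains("dividend", case=False, na=False)
prices = sample.loc[~is_dividend]
```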
In R, I know you can use str(df) to get the structure of a dataframe. In Python I tried df.dtypes, but it just returned object, and I didn't know what to do with that in order to fix the date issue.
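As far as I understand, a dtype of object usually means the column holds plain Python strings, so the dates would need to be parsed before they can be reformatted. Here is a minimal sketch with `pd.to_datetime`, assuming every Date value looks like "Mar 14, 2000" and the month abbreviations are in English (`%b` depends on the locale). The frame `sample` is a made-up stand-in.

```python
import pandas as pd

# Illustrative stand-in for the scraped Date column (stored as text).
sample = pd.DataFrame({"Date": ["Mar 14, 2000", "Dec 01, 2019"]})

# Parse the text into real datetimes; %b is the abbreviated month name.
parsed = pd.to_datetime(sample["Date"], format="%b %d, %Y")

# Render the datetimes back to text in the desired %Y-%m-%d layout.
sample["Date"] = parsed.dt.strftime("%Y-%m-%d")
```

Keeping the parsed datetime column instead of the formatted strings may be more useful if the data will later be sorted or resampled, since datetimes already display as YYYY-MM-DD.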
Code used for the web scrape:
import pandas as pd
import bs4 as bs
import urllib.request
url = 'https://finance.yahoo.com/quote/VT/history?period1=1547078400&period2=1607558400&interval=1mo&filter=history&frequency=1mo'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source, 'lxml')
tr = soup.find_all('tr')
data = []
# formats price data
for table in tr:
    td = table.find_all('td')
    row = [i.text for i in td]
    data.append(row)
# labels columns
columns = ['Date', 'Open', 'High', 'Low', 'Close', 'AdjClose', 'Volume']
data = data[1:-2]
df = pd.DataFrame(data)
df.columns = columns