Python (venv) 3.12.2 from VSCode (1.96.2) within Anaconda (2.6.4) on an intel iMac, Sequoia 15.1.1.
I'm working on a downloaded model (not my original work) for predicting MLB scores. One of the 17 or 18 different programs in the process retrieves pitcher data by game for multiple seasons. There are 3200+ individual csv files, each with up to several hundred rows x 32 columns.
Each pitchers' games pitched are to be matched to the specific game in which they appeared. This list of games is in a second (single) csv with ALL games for ALL teams listed (100k+ rows x 200 columns).
The match is done by date. This is where my problem begins.
The date format for the "games" file is, for example, 19800421.
The format for the "pitchers" dates are, for example, 4-11-1998 (strings, I believe). However there are also some formatted as 4- 5-1998 (note the space after day-) for games that aren't two digit months. (Not sure why, but there is no way to address the problem at the source given the sheer number of files and rows).
There is a line in the original code that is intended to clean up the first case (dates formatted as 4-11-1998). But there is no code to clean up the second case (4- 5-1998).
When I run the original code, I get the following error message:
ValueError: invalid literal for int() with base 10: '1977-07-26'
This date is the date in the first row of the first pitchers' stats.
I've tried several possible solutions that I found online and those are included in the attached code, commented out.
Here is the minimum amount of code that I believe will demonstrate the problem. This works (or fails, as it were) because I have the entirety of both data sets (pitcher and game data) available, however it's impossible to include all that here. So if more is needed, I'll need to figure out a way to get it to you.
import numpy as np
import pandas as pd
from datetime import datetime
filepath=('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/data/pitcher_data/')
def load_and_process_pitch_df(p_id, filepath):
fname = filepath + 'pitching_data_'+p_id+'.csv'
pitch_df = pd.read_csv(fname)
#ADDED Convert the 'date' column to datetime format
pitch_df['Date'] = pd.to_datetime(pitch_df['Date'])
#ADDED Change the date format to 'YYYYMMDD'
pitch_df['Date'] = pitch_df['Date'].dt.strftime('%Y%m%d')
# Convert date, fix dblhead_num to be 0,1,2
# This line below (Original code) was giving me the following error: ValueError: invalid literal for int() with base 10: '1977-07-26'. This is the first date in the first pitchers' dataframe. There are other dates in the dataframe that are formatted as '8- 6-1977' with a space where a '1' would be for months 10, 11, 12.
# Original code
#pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.replace('-','')).astype(int)
# My first attempt at replacing either "-" or "- "
#pitch_df['Date'] = (pd.to_datetime(pitch_df.Date).astype(str).str.multireplace([('- ','-'), ('','')])).astype(int)
# A second attempt
#for r in (("- ", "-"), ("","")):
# Date = Date.replace(*r).astype(int)
# A third attempt
pitch_df['Date'] = datetime('Date')
pitch_df['Date'].strftime('%Y%m%d').astype(int)
pitch_df.dblhead_num.fillna(0, inplace=True)
pitch_df['dblhead_num'] = pitch_df['dblhead_num'].astype(int)
df=pd.read_csv('/Users/richardcartier/Documents/PythonProjects/pybaseball/BBall_Prediction/df_bp3.csv')
start_pitchers_h = df.pitcher_start_id_h.unique()
start_pitchers_v = df.pitcher_start_id_v.unique()
start_pitchers_all = np.union1d(start_pitchers_h, start_pitchers_v)
pitcher_data_dict = {}
for i, p_id in enumerate(start_pitchers_all):
if i%100==0:
print(i)
pitcher_data_dict[p_id] = load_and_process_pitch_df(p_id, filepath)
This is a sample of the pitching data I'm using:
at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
VS,MIL,A,1,1,0,0,0,9,9,38,0,3,2,2,0,11,0,1,0,0,0,2,0,0,0,1,0,2.00,7-26-1977,
AT,CAL,A,1,1,1,0,0,9,3,31,0,0,0,1,0,7,0,0,0,0,0,1,0,0,0,1,0,1.00,7-31-1977,
AT,OAK,A,1,0,0,0,0,7,5,28,0,1,1,2,0,1,1,0,0,1,0,1,0,0,0,1,0,1.08,8- 6-1977,
VS,CAL,A,1,0,0,0,0,5,8,22,2,6,5,1,0,1,0,1,0,0,0,1,0,2,0,0,1,2.40,8-11-1977,
VS,KC,A,1,0,0,0,0,6.1,7,27,0,3,3,0,0,5,0,0,0,0,0,2,0,0,1,0,0,2.72,8-16-1977,
AT,KC,A,1,0,0,0,0,6.1,8,28,0,3,2,3,1,2,0,1,0,0,0,3,1,0,1,0,0,2.74,8-21-1977,
VS,MIN,A,1,0,0,0,0,7,6,28,1,4,4,2,0,4,0,0,0,0,0,2,0,0,0,0,0,3.08,8-26-1977,
VS,CLE,A,1,0,0,0,0,6,7,24,0,3,3,0,0,5,0,0,0,0,0,1,0,0,0,0,0,3.23,8-31-1977,
AT,TOR,A,1,1,1,0,0,9,3,33,0,0,0,2,0,4,0,0,0,0,0,0,1,0,1,1,0,2.78,9- 5-1977,1
VS,DET,A,1,1,0,0,0,9,5,33,0,1,1,2,0,3,0,0,0,0,0,0,0,1,0,1,0,2.57,9-10-1977,
AT,BAL,A,1,0,0,0,0,5,7,21,1,4,4,2,0,1,0,0,0,0,0,2,1,2,0,0,1,2.86,9-16-1977,
AT,DET,A,1,0,0,0,0,6.1,7,28,1,3,3,1,0,2,1,0,0,0,0,1,1,0,1,0,0,2.96,9-22-1977,
VS,TOR,A,1,0,0,0,0,7.1,10,32,1,5,4,1,0,3,0,0,0,0,0,1,0,2,1,1,0,3.12,9-27-1977,1
VS,MIN,A,1,0,0,0,0,4.1,6,20,2,5,5,2,0,2,0,0,1,0,0,1,0,0,0,0,1,10.38,4-11-1978,
AT,OAK,A,1,0,0,0,0,7,4,26,1,1,1,3,1,2,0,0,0,0,0,0,0,0,2,0,0,4.76,4-16-1978,
AT,MIN,A,1,0,0,0,0,6.1,6,29,0,2,2,3,1,3,1,0,0,0,0,1,0,0,1,1,0,4.08,4-21-1978,
VS,TOR,A,1,0,0,0,0,3.1,7,18,0,3,3,1,0,4,0,0,0,0,0,0,0,0,0,0,0,4.71,4-28-1978,
VS,CLE,A,1,0,0,0,0,6,8,27,1,3,3,2,0,3,0,0,0,0,0,2,0,1,0,1,0,4.67,5- 7-1978,
AT,TOR,A,1,0,0,0,0,7.1,8,34,0,5,5,5,1,3,1,0,1,0,0,1,2,1,0,0,1,4.98,5-16-1978,
AT,MIL,A,1,0,0,0,0,7,7,30,0,1,1,3,0,3,0,0,0,0,0,2,0,1,0,0,0,4.35,5-21-1978,
VS,MIL,A,1,0,0,0,0,4.2,6,23,1,4,4,2,0,5,0,0,0,1,0,1,0,0,0,0,0,4.70,5-26-1978,
AT,CHI,A,1,0,0,0,0,4.1,8,20,0,4,2,2,0,0,0,0,0,0,0,1,0,1,1,0,1,4.65,5-31-1978,
And here is a sample of the game data to which I need to match the pitcher data (file df_bp3.csv in the above code):
date,dblheader_code,day_of_week,team_v,league_v,game_no_v,team_h,league_h,game_no_h,runs_v,runs_h,outs_total,day_night,completion_info,forfeit_info,protest_info,ballpark_id,attendance,game_minutes,linescore_v,linescore_h,AB_v,H_v,2B_v,3B_v,HR_v,RBI_v,SH_v,SF_v,HBP_v,BB_v,IBB_v,SO_v,SB_v,CS_v,GIDP_v,CI_v,LOB_v,P_num_v,ERind_v,ERteam_v,WP_v,balk_v,PO_v,ASST_v,ERR_v,PB_v,DP_v,TP_v,AB_h,H_h,2B_h,3B_h,HR_h,RBI_h,SH_h,SF_h,HBP_h,BB_h,IBB_h,SO_h,SB_h,CS_h,GIDP_h,CI_h,LOB_h,P_num_h,ERind_h,ERteam_h,WP_h,balk_h,PO_h,ASST_h,ERR_h,PB_h,DP_h,TP_h,ump_HB_id,ump_HB_name,ump_1B_id,ump_1B_name,ump_2B_id,ump_2B_name,ump_3B_id,ump_3B_name,ump_LF_id,ump_LF_name,ump_RF_id,ump_RF_name,mgr_id_v,mgr_name_v,mgr_id_h,mgr_name_h,pitcher_id_w,pitcher_name_w,pitcher_id_l,pitcher_name_l,pitcher_id_s,pitcher_name_s,GWRBI_id,GWRBI_name,pitcher_start_id_v,pitcher_start_name_v,pitcher_start_id_h,pitcher_start_name_h,batter1_name_v,batter1_id_v,batter1_pos_v,batter2_name_v,batter2_id_v,batter2_pos_v,batter3_name_v,batter3_id_v,batter3_pos_v,batter4_name_v,batter4_id_v,batter4_pos_v,batter5_name_v,batter5_id_v,batter5_pos_v,batter6_name_v,batter6_id_v,batter6_pos_v,batter7_name_v,batter7_id_v,batter7_pos_v,batter8_name_v,batter8_id_v,batter8_pos_v,batter9_name_v,batter9_id_v,batter9_pos_v,batter1_name_h,batter1_id_h,batter1_pos_h,batter2_name_h,batter2_id_h,batter2_pos_h,batter3_name_h,batter3_id_h,batter3_pos_h,batter4_name_h,batter4_id_h,batter4_pos_h,batter5_name_h,batter5_id_h,batter5_pos_h,batter6_name_h,batter6_id_h,batter6_pos_h,batter7_name_h,batter7_id_h,batter7_pos_h,batter8_name_h,batter8_id_h,batter8_pos_h,batter9_name_h,batter9_id_h,batter9_pos_h,misc_info,acqui_info,season,run_diff,home_victory,run_total,date_dblhead,BATAVG_162_h,BATAVG_162_v,OBP_162_h,OBP_162_v,SLG_162_h,SLG_162_v,OBS_162_h,OBS_162_v,SB_162_h,SB_162_v,CS_162_h,CS_162_v,ERR_162_h,ERR_162_v,BATAVG_30_h,BATAVG_30_v,OBP_30_h,OBP_30_v,SLG_30_h,SLG_30_v,OBS_30_h,OBS_30_v,SB_30_h,SB_30_v,CS_30_h,CS_30_v,ERR_30_h,ERR_30_v,implied_prob_h,implied_prob_v,implied_prob_h_mid,over_under_line,over_under_result
19800409,0,Wed,TOR,AL,1,SEA,AL,1,6,8,51,N,,,,SEA02,22588.0,154,200001030,41010020x,36,9,1,1,2,5,0,0,0,5,0,4,2,0,1,0,8,5,6,6,0,0,24,14,1,0,1,0,34,12,2,1,1,8,2,0,0,2,1,4,0,0,1,0,6,3,6,6,0,0,27,16,2,1,1,0,barnl901,Larry Barnett,evanj901,Jim Evans,mckej901,Jim McKean,hendt901,Ted Hendry,,(none),,(none),mattb101,Bobby Mattick,johnd106,Darrell Johnson,parrm101,Mike Parrott,lemad101,Dave Lemanczyk,heavd101,Dave Heaverlo,cox-t101,Ted Cox,lemad101,Dave Lemanczyk,parrm101,Mike Parrott,grifa001,Alfredo Griffin,6,bailb001,Bob Bailor,9,maybj101,John Mayberry,3,veleo101,Otto Velez,10,hower001,Roy Howell,5,bonnb001,Barry Bonnell,7,boser101,Rick Bosetti,8,garcd001,Damaso Garcia,4,white002,Ernie Whitt,2,cruzj002,Julio Cruz,4,crair001,Rod Craig,8,meyed001,Dan Meyer,7,bochb001,Bruce Bochte,3,hortw101,Willie Horton,10,simpj101,Joe Simpson,9,cox-t101,Ted Cox,5,stinb101,Bob Stinson,2,mendm101,Mario Mendoza,6,,Y,1980,2,1,14,
Because of the size of the game file, I can't include more than this one row, so hopefully this will be enough to show what I'm trying to match.
#A third attemptis not inside the function definition.to_datetime()Editlink and paste it again.strptime()himself, it's happening internally to pandas.