2

I am cleaning data that has years in differing formats. There are seven possible values for the year field of my DataFrame: ['2013-14','2014-15','2015-16','2016-17','22017','22018','22019']. I have resolved the problem by manually dealing with each case, as below:

matchups_df.loc[matchups_df['SEASON_ID'] == '22017', 'SEASON_ID'] = '2017-18'
matchups_df.loc[matchups_df['SEASON_ID'] == '22018', 'SEASON_ID'] = '2018-19'
matchups_df.loc[matchups_df['SEASON_ID'] == '22019', 'SEASON_ID'] = '2019-20'

My question is, why does the code below raise the exception ValueError: invalid literal for int() with base 10: '2016-17'? I have removed the relevant portion from the np.where and used it on a filtered version of the DataFrame to only handle the necessary values, but it raises the same exception. Clearly, I have made some type of syntax eror in converting the string to int, but I haven't been to diagnose where the error lies.

matchups_df.insert(loc = 1, column = 'Season', value = (
    np.where(
    (len(matchups_df.SEASON_ID) == 5),
        (
            (matchups_df.SEASON_ID[1:]) +
            "-" +
            (str((matchups_df.SEASON_ID[3:].astype(int))+1))
        ),

        matchups_df.SEASON_ID
    )
                                                        )
                                                            )
1
  • 1
    "Clearly, I have made some type of syntax eror" No; you have a logical error. If you had a syntax error, then a) the reported error would be a SyntaxError rather than ValueError; and b) it would be reported before any attempt at computation. Commented Nov 12, 2020 at 13:44

2 Answers 2

3

Here is necessary use str methods for check length by Series.str.len and indexing for get all values after first by str[1:], also because both Series are processing for convert to numbers is used to_numeric for avoid errors if no match correct format:

matchups_df = pd.DataFrame({'SEASON_ID':['2013-14','2014-15','2015-16','2016-17',
                                         '22017','22018','22019'],
                            'col':range(7)})
    
print (matchups_df)
  SEASON_ID  col
0   2013-14    0
1   2014-15    1
2   2015-16    2
3   2016-17    3
4     22017    4
5     22018    5
6     22019    6

s = matchups_df.SEASON_ID.astype(str)

s1 = np.where(s.str.len() == 5, 
              s.str[1:] + "-" + pd.to_numeric(s.str[3:], errors='coerce')
                                  .fillna(0).astype(int).add(1).astype(str), 
              matchups_df.SEASON_ID)
matchups_df.insert(loc = 1, column = 'Season', value = s1)

print (matchups_df)
  SEASON_ID   Season  col
0   2013-14  2013-14    0
1   2014-15  2014-15    1
2   2015-16  2015-16    2
3   2016-17  2016-17    3
4     22017  2017-18    4
5     22018  2018-19    5
6     22019  2019-20    6

Another solution with custom function:

def f(x):
    if len(x) == 5:
        return x[1:] + "-" + str(int(x[3:]) + 1)
    else:
        return x
s1 = matchups_df.SEASON_ID.astype(str).apply(f)

matchups_df.insert(loc = 1, column = 'Season', value = s1)
print (matchups_df)

  SEASON_ID   Season  col
0   2013-14  2013-14    0
1   2014-15  2014-15    1
2   2015-16  2015-16    2
3   2016-17  2016-17    3
4     22017  2017-18    4
5     22018  2018-19    5
6     22019  2019-20    6
Sign up to request clarification or add additional context in comments.

Comments

1

The underlying issue here:

matchups_df.SEASON_ID[3:]

matchups_df.SEASON_ID is the entire column (a Series). Slicing with [3:] just takes off the first three rows; but instead you wanted to take off the first three characters of each value. Similarly, len(matchups_df.SEASON_ID) == 5 doesn't depend on the cell value (but instead on the length of the column, so all cells (from the fourth onward - so, the fourth and fifth of five) end up being processed - including the ones with strings like 2016-17.

To get the behaviour you want, the .str helper is provided, as shown in @jezrael's answer.

1 Comment

Thank you! I have marked @jezrael's answer as the accepted answer, as it noted both 1) my mistake in slicing the Series instead of the strings in the Series, and 2) the need to call .str in order to produce the desired behavior. I have upvoted your answer however, as it notes the distinction between slicing a Series and slicing the values in in the Series very clearly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.