Running a Regex loop in a Pandas Dataframe

Question

I currently have a date column that has some issues. I have attempted to fix the problem but cannot come to a conclusion.

Here is the data:

# Import data
df_views = pd.read_excel('PageViews.xlsx')

# Check data types
df_views.dtypes
Out[57]:
Date           object
Customer ID     int64
dtype: object

The date column is not in a 'datetime' data format as expected. Further inspection yields:

df_views.ix[:5]
Date    Customer ID
0   01/25/2016  104064596300
1   02/28/2015  102077474472
2   11/17/2016  106430081724
3   02/24/2016  107770391692
4   10/05/2016  106523680888
5   02/24/2016  107057691592

I quickly check which rows does not follow the proper format xx/xx/xxxx

print (df_views[df_views["Date"].str.len() != 10])
          Date   Customer ID
189513  12/14/  106285770688
189514  10/28/  107520462840
189515  11/01/  102969804360
189516  11/10/  102106417100
189517  02/16/  107810168068
189518  10/25/  102096164504
189519  02/08/  107391760644
189520  02/29/  107353558928
189521  10/24/  107209142140
189522  12/20/  107875461336
189523  12/23/  107736375428
189524  11/12/  106561080372
189525  01/27/  102676548120
189526  11/19/  107733043896
189527  12/31/  107774452412
189528  01/21/  102610956040
189529  01/09/  108052836888
189530  02/21/  106380330112
189531  02/02/  107844459772
189532  12/12/  102006641640
189533  12/16/  106604647688
189534  11/14/  102383102504

I have attempted to create a for loop but cannot figure out how to approach my loop.

Important note: I know that the time period for all observations is between September 2015 through February 2016.

So if the month is 09/10/11/12 - then I can add "2015" to the date, otherwise if the month is 01/02, I can add "2016".

for row in df_views["Date"]:
    if len(row) != 10:
        if row.str.contains("^09|10|11|12\/"):
            row.str.cat("2015")
        elif row.str.contains("^01|02\/"):
            row.str.cat("2016")
    else:
        continue 
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-87-684e121dd62d> in <module>()
      5 for row in df_views["Date"]:
      6     if len(row) != 10:
----> 7         if row.str.contains("^09|10|11|12\/"):
      8             row.str.cat("2015")
      9         elif row.str.contains("^01|02\/"):

AttributeError: 'str' object has no attribute 'str'

row seems to be a string. So all the methods you're trying (contains, cat) are very unlikely to work. — Jean-François Fabre
– Jean-François Fabre ♦, Commented Sep 8, 2016 at 18:34
Avoid using a for loop. Just use the .str operations you're using on the Series directly. You may need to tweak them but it will be faster than explicit iterating. — BrenBarn
– BrenBarn, Commented Sep 8, 2016 at 18:41

Community · Accepted Answer · 2017-05-23 12:22:53Z

1

As @BrenBam has already written in the comment - try to avoid using loops. Pandas gives us tons of vectorized (read fast and efficient) methods:

In [67]: df
Out[67]:
          Date   Customer ID
0   12/14/2001  106285770688
1   10/28/2000  107520462840
2       11/01/  102969804360
3       11/10/  102106417100
4       02/16/  107810168068
5       10/25/  102096164504
6       02/08/  107391760644
7       02/29/  107353558928
8       10/24/  107209142140
9       12/20/  107875461336
10      12/23/  107736375428
11      11/12/  106561080372
12      01/27/  102676548120
13      11/19/  107733043896
14      12/31/  107774452412
15      01/21/  102610956040
16      01/09/  108052836888
17      02/21/  106380330112
18      02/02/  107844459772
19      12/12/  102006641640
20      12/16/  106604647688
21      11/14/  102383102504

In [68]: df.ix[df.Date.str.match(r'^(?:09|10|11|12)\/\d{2}\/$', as_indexer=True), 'Date'] += '2015'

In [69]: df.ix[df.Date.str.match(r'^(?:01|02)\/\d{2}\/$', as_indexer=True), 'Date'] += '2016'

In [70]: df
Out[70]:
          Date   Customer ID
0   12/14/2001  106285770688
1   10/28/2000  107520462840
2   11/01/2015  102969804360
3   11/10/2015  102106417100
4   02/16/2016  107810168068
5   10/25/2015  102096164504
6   02/08/2016  107391760644
7   02/29/2016  107353558928
8   10/24/2015  107209142140
9   12/20/2015  107875461336
10  12/23/2015  107736375428
11  11/12/2015  106561080372
12  01/27/2016  102676548120
13  11/19/2015  107733043896
14  12/31/2015  107774452412
15  01/21/2016  102610956040
16  01/09/2016  108052836888
17  02/21/2016  106380330112
18  02/02/2016  107844459772
19  12/12/2015  102006641640
20  12/16/2015  106604647688
21  11/14/2015  102383102504

edited May 23, 2017 at 12:22

CommunityBot

11 silver badge

answered Sep 8, 2016 at 19:22

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Kevin Over a year ago

Solution works but it ends up adding '2015' or '2016' to the rest of the data frame in scenarios where there wasn't an issue. This regex expression worked on my end: '^(09|10|11|12)\/\d\d\/$'

Collectives™ on Stack Overflow

Running a Regex loop in a Pandas Dataframe

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related