I want to add a row in an existing data frame, where I don't have a matching regex value. For example,

import pandas as pd
import numpy as np
import re

lst = ['Sarah Kim', 'Added on January 21']

df = pd.DataFrame(lst)

df.columns = ['Info']

name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"

for index, row in df.iterrows():
    if re.findall(name_pat, str(row['Info'])):
        print("Name matched")
    elif re.findall(title_pat, str(row['Info'])):
        print("Title matched")
        # re.findall returns a list (never None), so test for an empty list
        if not re.findall(title_pat, str(row['Info'])):
            pass  # Add a row here in the dataframe
    elif re.findall(date_pat, str(row['Info'])):
        print("Date matched")
        if not re.findall(date_pat, str(row['Info'])):
            pass  # Add a row here in the dataframe

So in my dataframe df there is no title, only a name and a date. While looping over df, I want to add an empty placeholder for the missing title.

The output is:

  Info
0 Sarah Kim
1 Added on January 21

My expected output is:

  Info
0 Sarah Kim
1 None
2 Added on January 21

Is there any way that I can add an empty column, or is there a better way?

Edit: The dataset I'm working with is a single column with many rows. The rows follow a repeating structure of "name, title, date". For example,

  Info
0 Sarah Kim
1 Added on January 21
2 Jesus A. Moore
3 Marketer
4 Added on May 30
5 Bobbie J. Garcia
6 CEO
7 Anita Jobe
8 Designer
9 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13

I have sliced the data frame so that I can extract one section at a time, which looks like this:

  Info
0 Sarah Kim
1 Added on January 21

I'm trying to run a loop over each section and, if a date or title is missing, fill in an empty row, so that in the end I will have:

  Info
0 Sarah Kim
1 **NULL**
2 Added on January 21
3 Jesus A. Moore
4 Marketer
5 Added on May 30
6 Bobbie J. Garcia
7 CEO
8 **NULL**
9 Anita Jobe
10 Designer
11 Added on January 3
...
998 Michael B. Reedy
999 Salesman
1000 Added on December 13
  • "Is there any way that I can add an empty column" Yes, have you tried that? The best would be to use vectorized operations for this, you should read the Pandas docs. Commented Feb 13, 2020 at 4:10
  • In any case, there are plenty of resources on the subject, can you clarify what the issue is here? Commented Feb 13, 2020 at 4:11
  • @AMC Can you at least give me the resources on what to research? I don't need an entire code to solve the problem, but more I'm having issues approaching the problem. And yes I tried to add an empty column but none worked. Commented Feb 13, 2020 at 4:15
  • I find the official Pandas documentation to be quite good! Commented Feb 13, 2020 at 4:35
  • Please explain more, not able to understand your issue. Commented Feb 13, 2020 at 8:02

1 Answer

I see you have a long dataframe of information, where each set of information is different. I think your goal is possibly to have a dataframe with 3 columns:

Name, Title and Date

Here is how I would approach this problem, with some code samples. I would take advantage of the df.shift method to tie each name to the rows that follow it, and use your existing dataframe to create a new one.

I am also making some assumptions based on what you have listed above. First, I will assume that only the Title and Date fields can be missing. Second, I will assume that the order of the fields is Name, Title and Date, as you mentioned above.
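To make the shift idea concrete before the full code, here is a tiny sketch of what a negative shift does. The `demo` frame is made up purely for illustration; its `Col` values stand in for the row types that get computed in the second step.

```python
import pandas as pd

# Hypothetical toy column standing in for the classified 'Col' column.
demo = pd.DataFrame({'Col': ['Name', 'Date', 'Name', 'Title', 'Date']})

# shift(-1) moves every value up one row, so each row sees the value after it.
demo['Next_col'] = demo['Col'].shift(-1)
# shift(-2) does the same with a two-row look-ahead.
demo['Next_col2'] = demo['Col'].shift(-2)

print(demo)
```

Rows near the end have nothing to look ahead to, so they get NaN in the shifted columns. That is harmless here because NaN never equals 'Title' or 'Date'.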

# first step: import the libraries and create test data
import re
import pandas as pd

test_list = ['Sarah Kim','Added on January 21','Jesus A. Moore','Marketer','Added on May 30','Bobbie J. Garcia','CEO','Anita Jobe','Designer','Added on January 3']
test_df =pd.DataFrame(test_list,columns=['Info'])

# second step use your regex to get what type of column each info value is

name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"\b(\w*Added on\w*)\b"
title_pat = r"\b(\w*at\w*)\b"

test_df['Col'] = test_df['Info'].apply(lambda x: 'Name' if re.findall(name_pat, x) else ('Date' if re.findall(date_pat,x) else 'Title'))

# third step is to get the next values from our dataframe using df.shift
test_df['Next_col'] = test_df['Col'].shift(-1)
test_df['Next_col2'] = test_df['Col'].shift(-2)
test_df['Next_val1'] = test_df['Info'].shift(-1)
test_df['Next_val2'] = test_df['Info'].shift(-2)

# Now filter to only the names and apply a function to get our name, title and date
new_df = test_df[test_df['Col']=='Name']

def apply_func(row):
    name = row['Info']
    title = None
    date = None
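    if row['Next_col']=='Title':
        title = row['Next_val1']
        # only look two rows ahead when the next row was this record's title,
        # otherwise we could pick up the date of the following person
        if row['Next_col2']=='Date':
            date = row['Next_val2']
    elif row['Next_col']=='Date':
        date = row['Next_val1']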
    row['Name'] = name
    row['Title'] = title
    row['date'] = date
    return row

final_df = new_df.apply(apply_func,axis=1)[['Name','Title','date']].reset_index(drop=True)
print(final_df)

               Name     Title                 date
0  Sarah Kim         None      Added on January 21
1  Jesus A. Moore    Marketer  Added on May 30    
2  Bobbie J. Garcia  CEO       None               
3  Anita Jobe        Designer  Added on January 3 

There is probably a way to do this in fewer lines of code, and I welcome anyone who can make it more efficient, but I believe this should work. If you want to flatten the result back into a single column:

flattened_df = pd.DataFrame(final_df.values.flatten(),columns=['Info'])
print(flattened_df)

                   Info
0   Sarah Kim          
1   None               
2   Added on January 21
3   Jesus A. Moore     
4   Marketer           
5   Added on May 30    
6   Bobbie J. Garcia   
7   CEO                
8   None               
9   Anita Jobe         
10  Designer           
11  Added on January 3 
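
As one possible shorter route to the same result, the sketch below numbers each record with a running count of name rows and then pivots on that number. The names `record`, `wide` and `flat` are made up for illustration, the date check is simplified to the literal text "Added on", and it assumes each record has at most one title and one date (otherwise `pivot` raises on duplicate entries). Missing values come out as NaN rather than None.

```python
import pandas as pd

test_list = ['Sarah Kim', 'Added on January 21', 'Jesus A. Moore', 'Marketer',
             'Added on May 30', 'Bobbie J. Garcia', 'CEO', 'Anita Jobe',
             'Designer', 'Added on January 3']
df = pd.DataFrame(test_list, columns=['Info'])

name_pat = r"^[A-Z][a-z]+,?\s+(?:[A-Z][a-z]*\.?\s*)?[A-Z][a-z]+"
date_pat = r"Added on"  # simplified date check (an assumption for this sketch)

is_name = df['Info'].str.contains(name_pat, regex=True)
is_date = df['Info'].str.contains(date_pat, regex=True)

# Same precedence as before: Name first, then Date, anything else is a Title.
df['Col'] = 'Title'
df.loc[is_date, 'Col'] = 'Date'
df.loc[is_name, 'Col'] = 'Name'

# Every name row starts a new record, so a cumulative sum gives a record id.
df['record'] = is_name.cumsum()

# One row per record, one column per field; absent fields become NaN.
wide = (df.pivot(index='record', columns='Col', values='Info')
          .reindex(columns=['Name', 'Title', 'Date'])
          .reset_index(drop=True))

# Flatten back to a single column with placeholders for the missing fields.
flat = pd.DataFrame(wide.to_numpy().ravel(), columns=['Info'])
print(flat)
```

This avoids the row-wise apply, at the cost of relying on the "at most one title and one date per record" assumption.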
