I have a DataFrame that looks like this...

                                     Variable
0                         Religion - Buddhism
1                            Source: Clickerz
2                            Religion - Islam
3                            Source: SRZ FREE
4   Ethnicity - Mixed - White & Black African

I want to manipulate the Variable column to create a new column that looks like this...

                                        Variable                       New Column
    0                        Religion - Buddhism                         Buddhism
    1                           Source: Clickerz                         Clickerz
    2                           Religion - Islam                            Islam
    3                           Source: SRZ FREE                         SRZ FREE
    4  Ethnicity - Mixed - White & Black African  Mixed - White and Black African

So that I can eventually have a DataFrame that looks like this...

        Variable                       New Column
    0   Religion                         Buddhism
    1     Source                         Clickerz
    2   Religion                            Islam
    3     Source                         SRZ FREE
    4  Ethnicity  Mixed - White and Black African

I want to iterate through the Variable column and manipulate the data to create New Column. I was planning to use multiple if statements to look for a specific word, for example 'Ethnicity' or 'Religion', and then apply a manipulation.

For example...

for row in df['Variable']:
    if 'Religion' in row:
        df['New Column'] = ...
    elif 'Ethnicity' in row:
        df['New Column'] = ...
    elif 'Source' in row:
        df['New Column'] = ...
    else:
        df['New Column'] = 'Not Applicable'

Even though type(row) returns 'str', so each row is a string, this code keeps setting every entry of the new column to 'Not Applicable'. In other words, it does not seem to detect any of the substrings in any row of the DataFrame, even though I can see they are there.

I am sure there is an easy way to do this... please help!

I have also tried the following...

for row in df['Variable']:
    if row.find('Religion') != -1:
        df['New Column'] = ...
    elif row.find('Ethnicity') != -1:
        df['New Column'] = ...
    elif row.find('Source') != -1:
        df['New Column'] = ...
    else:
        df['New Column'] = 'Not Applicable'

And I continue to get all entries of the new column being 'Not Applicable'. Once again it is not finding the string in the existing column.

Is it an issue with the data type or something?

6 Answers

1

You could use a nested for loop:

# For each (index label, cell value) pair in the column
for idx, row in df['column_variable'].items():
    # Set boolean to indicate if a substring was found
    substr_found = False

    # For each substring
    for sub_str in ["substring1", "substring2"]:
        # If the substring is in the row
        if sub_str in row:
            # Execute code; .loc writes to this row only, whereas
            # df['new_column'] = ... would overwrite the whole column
            df.loc[idx, 'new_column'] = ...

            # Substring was found!
            substr_found = True

    # If substring was not found
    if not substr_found:
        # Set invalid code...
        df.loc[idx, 'new_column'] = 'Not Applicable'

1

Updated to match your Dataframe!

import pandas as pd

Your Dataframe

lst = []

for i in ['Religion - Buddhism','Source: Clickerz','Religion - Islam','Source: SRZ FREE','Ethnicity - Mixed - White & Black African']:
    item = [i]
    lst.append(item)

df = pd.DataFrame.from_records(lst)
df.columns = ['variable']
print(df)
                                    variable
0                        Religion - Buddhism
1                           Source: Clickerz
2                           Religion - Islam
3                           Source: SRZ FREE
4  Ethnicity - Mixed - White & Black African

Using a for loop and partial string matching in conjunction with .loc to set the new values

# Series.iteritems() was removed in pandas 2.0; .items() is the equivalent
for x, y in df['variable'].items():
    if 'religion' in y.lower():
        # Split on the first '-' only, so later hyphens stay in the value
        z = y.split('-', 1)
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = z[1].strip()
    if 'source' in y.lower():
        z = y.split(':', 1)
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = z[1].strip()
    if 'ethnicity' in y.lower():
        z = y.split('-', 1)
        df.loc[x, 'variable'] = z[0].strip()
        df.loc[x, 'value'] = z[1].strip()

print(df)
    variable                          value
0   Religion                       Buddhism
1     Source                       Clickerz
2   Religion                          Islam
3     Source                       SRZ FREE
4  Ethnicity  Mixed - White & Black African

1

As much as possible, you should avoid looping through rows when manipulating a DataFrame. This article explains what the more efficient alternatives are.
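One such alternative, as a rough sketch: np.select evaluates a list of boolean conditions in order, much like an if/elif chain, but over the whole column at once. The sample rows come from the question; the "Category" column name and the placeholder labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows taken from the question
df = pd.DataFrame({"Variable": [
    "Religion - Buddhism",
    "Source: Clickerz",
    "Religion - Islam",
    "Source: SRZ FREE",
    "Ethnicity - Mixed - White & Black African",
]})

col = df["Variable"]

# One boolean Series per branch; na=False treats missing cells as "no match"
conditions = [
    col.str.contains("Religion", regex=False, na=False),
    col.str.contains("Ethnicity", regex=False, na=False),
    col.str.contains("Source", regex=False, na=False),
]

# Placeholder labels (hypothetical), one per condition, in the same order
choices = ["religion row", "ethnicity row", "source row"]

# The first matching condition wins; rows matching nothing get the default
df["Category"] = np.select(conditions, choices, default="Not Applicable")
```

Because the conditions are checked in list order, the ordering plays the same role as the order of the if/elif branches in a loop.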

You are basically attempting to translate strings based on some fixed map. Naturally, a dict comes to mind:

substring_map = {
    "at": "pseudo-cat",
    "dog": "true dog",
    "bre": "something else",    
    "na": "not applicable"
}

This map could be read from a file, e.g., a JSON file, in the scenario where you are handling a large number of substrings.
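A minimal sketch of that idea, assuming a JSON file that holds a single flat object. The file name substring_map.json is hypothetical, and the file is written to a temporary directory here only so the example runs on its own:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical file location; created here only to keep the sketch runnable
map_path = Path(tempfile.mkdtemp()) / "substring_map.json"
map_path.write_text(
    json.dumps({
        "at": "pseudo-cat",
        "dog": "true dog",
        "bre": "something else",
        "na": "not applicable",
    }),
    encoding="utf-8",
)

# json.load returns a plain dict and keeps the key order of the file,
# which matters if the first matching substring should win
with map_path.open(encoding="utf-8") as fh:
    substring_map = json.load(fh)
```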

The substring matching logic can now be decoupled from the map definition:

def translate_substring(x):
  for substring, new_string in substring_map.items():
    if substring in x:
      return new_string
  return "not applicable"

Use apply with the 'mapping' function to generate your target column:

df = pd.DataFrame({"name":
  ["cat", "dogg", "breeze", "bred", "hat", "misty"]})

df["new_column"] = df["name"].apply(translate_substring)

# df:
#      name      new_column
# 0     cat      pseudo-cat
# 1    dogg        true dog
# 2  breeze  something else
# 3    bred  something else
# 4     hat      pseudo-cat
# 5   misty  not applicable

This code, applied to pd.concat([df] * 10000) (60,000 rows), runs in 42 ms in a Colab notebook. In comparison, using iterrows takes 3.67 s, so apply is about 87x faster.
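For the data in the question specifically, the split itself can probably be done without any Python-level loop. The sketch below assumes every label is separated from its value by the first "-" or ":" in the cell; the regular expression and the extra "Unlabelled" row are illustrative assumptions, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({"Variable": [
    "Religion - Buddhism",
    "Source: Clickerz",
    "Religion - Islam",
    "Source: SRZ FREE",
    "Ethnicity - Mixed - White & Black African",
    "Unlabelled",  # hypothetical row with no separator
]})

# Group 1: text before the first '-' or ':'; group 2: everything after it.
# Cells that do not match the pattern produce NaN in both groups.
parts = df["Variable"].str.extract(r"^\s*([^:\-]+?)\s*[:\-]\s*(.*\S)\s*$")

# Rows without a separator fall back to a default value / the original text
df["New Column"] = parts[1].fillna("Not Applicable")
df["Variable"] = parts[0].fillna(df["Variable"])
```

Only the first separator is used, so the second hyphen in the Ethnicity row stays in the value.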

0

You can create an empty list, add the new values to it, and then create the new column as the last step:

all_data = []
for row in df["column_variable"]:
    if "substring1" in row:
        all_data.append("Found 1")
    elif "substring2" in row:
        all_data.append("Found 2")
    elif "substring3" in row:
        all_data.append("Found 3")
    else:
        all_data.append("Not Applicable")

df["new column"] = all_data

print(df)

Prints:

      column_variable new column
0  this is substring1    Found 1
1  this is substring2    Found 2
2  this is substring1    Found 1
3  this is substring3    Found 3
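A small sketch of why collecting the values first matters. Assigning a scalar to df["some column"] inside the loop overwrites the entire column on every pass, so only the branch taken for the last row survives. This is a likely reason for a column that ends up with one repeated value, although that cannot be confirmed without the asker's actual data. The column names "overwritten" and "per_row" and the "Unlabelled" row are made up for the demonstration.

```python
import pandas as pd

df = pd.DataFrame({"Variable": [
    "Religion - Buddhism",
    "Source: Clickerz",
    "Unlabelled",  # hypothetical last row that matches no keyword
]})

# Scalar assignment inside the loop: every pass rewrites the whole column
for row in df["Variable"]:
    if "Religion" in row:
        df["overwritten"] = "religion"
    elif "Source" in row:
        df["overwritten"] = "source"
    else:
        df["overwritten"] = "Not Applicable"

# Collecting one label per row keeps each row's own result
labels = []
for row in df["Variable"]:
    if "Religion" in row:
        labels.append("religion")
    elif "Source" in row:
        labels.append("source")
    else:
        labels.append("Not Applicable")
df["per_row"] = labels
```

After running this, every entry of "overwritten" holds the value chosen for the last row, while "per_row" holds one label per row.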

2 Comments

For some reason, when I type in "if 'substring' in row:", it does not find the substring in the row even though it is clearly there. This is the main problem.
@ElliottDavey Please edit your question with a sample of your dataframe.

0

Maybe the shortest way I can think of:

import pandas as pd

# Dummy DataFrame
df = pd.DataFrame([[1, "substr1"], [3, "bla"], [5, "bla"]], columns=["abc", "col_to_check"])

substrings = ["substr1", "substr2", "substr3"]

for subs in substrings:  # Go through all your substrings
    # Check whether any cell in the column contains the substring
    # (testing "subs in list_of_values" would only find exact matches)
    if df["col_to_check"].str.contains(subs, regex=False).any():
        df[subs] = 0  # Fill your new column with whatever you want

0

I wrote a function, string_splitter, and applied it to the column with apply. This solved the issue.

I created the following function to split strings in different ways based on different substrings contained in the cell.

def string_splitter(cell):
    word_list1 = ['Age', 'Disability', 'Religion', 'Gender']
    word_list2 = ['Number shortlisted', 'Number Hired', 'Number Interviewed']

    if any(word in cell for word in word_list1):
        # 'Label - value': keep the part after the first hyphen
        result = cell.split("-")[1].strip()
    elif 'Source' in cell:
        # 'Source: value': keep the part after the colon
        result = cell.split(":")[1].strip()
    elif 'Ethnicity' in cell:
        # The value itself contains a hyphen, so rejoin two pieces
        result_list = cell.split("-")[1:3]
        result = "-".join(result_list).strip()
    elif any(word in cell for word in word_list2):
        result = cell.split(" ")[1].strip()
    elif 'Number of Applicants' in cell:
        result = cell
    else:
        # Fallback so result is always defined (default label from the question)
        result = 'Not Applicable'

    return result

I then passed string_splitter to apply, which calls the function on each cell of the specified column, one row at a time:

df['Answer'] = df['Visual Type'].apply(string_splitter)

string_splitter allowed me to create the New Column.

I then created another function column_formatter to manipulate the Variable column once the New Column had been made. The second function is shown below:

def column_formatter(cell):
    word_list1 = ['Age', 'Gender', 'Ethnicity', 'Religion']
    word_list2 = ['Number of Applicants', 'Number Hired', 'Number shortlisted', 'Number Interviewed']

    if any(word in cell for word in word_list1):
        # Keep the label before the first hyphen
        result = cell.split("-")[0].strip()
    elif 'Source' in cell:
        # Keep the label before the colon
        result = cell.split(":")[0].strip()
    elif 'Disability' in cell:
        result = cell.split(" ")[0].strip()
    elif any(word in cell for word in word_list2):
        result = 'Number of Applicants'
    else:
        result = 'Something wrong here'

    return result

And then called the function in the same way as follows:

df['Visual Type'] = df['Visual Type'].apply(column_formatter)
