Splitting text data into multiple columns using Regex

Question

I'm new to regex and I'd like to split some text data into columns. Each entry in test_data has the same structure: first and last name, university, and country. How can I split this text into three columns (name, university, and country)?

import pandas as pd

test_data = "Bob Smith, São Paulo State University/Department of Production Engineering, Brazil James Smith, São Paulo State University/Department of Production Engineering, Brazil Bob James, São Paulo State University/Department of Production Engineering, Brazil"

test_df = pd.DataFrame([test_data], columns=["test_data"])
split_df = test_df["test_data"].str.split(r'\w+,', expand=True)
split_df.head()

Thanks in advance!

Your test data is bad. Why? There is no delimiter between two entries.
– Sushant, Commented Sep 19, 2019 at 3:26
You're right, but can't the commas be used to separate the data? All entries follow the same format of name, comma, university, comma and country. Thanks.
– kishkebab, Commented Sep 19, 2019 at 3:31
You're missing a comma after each country, so if you try to split by the comma, you end up with the country and the next name in the same list element.
– elembie, Commented Sep 19, 2019 at 3:41
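
To make elembie's point concrete, here is a quick check using the sample string from the question (a minimal sketch; the variable name parts is only illustrative):

```python
test_data = (
    "Bob Smith, São Paulo State University/Department of Production Engineering, Brazil "
    "James Smith, São Paulo State University/Department of Production Engineering, Brazil "
    "Bob James, São Paulo State University/Department of Production Engineering, Brazil"
)

# Split on every comma and trim the surrounding whitespace
parts = [part.strip() for part in test_data.split(",")]

# The third element holds a country and the next name fused together
print(parts[2])
```

Because no comma follows the country, a plain comma split yields seven pieces instead of nine, and two of them mix a country with the following name.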

Life is complex · Accepted Answer · 2019-09-19 14:27:51Z

I am unsure how you are generating your input data and I'm also unsure if the data is consistent in a larger set. This answer is based on the current data set structure without modifications. You should be able to add the final output to a dataframe. If you have issues with that, I will add that piece too.

from pprint import pprint

input_string = 'Bob Smith, São Paulo State University/Department of Production Engineering, Brazil James Smith, São Paulo State University/Department of Production Engineering, Brazil Bob James, São Paulo State University/Department of Production Engineering, Brazil'

def split_string_keep_delimiter(string_to_split, delimiter):
    # Split on the delimiter, then re-append it to every piece except the last
    tokens = string_to_split.split(delimiter)
    result_list = [token + delimiter for token in tokens[:-1]]
    result_list.append(tokens[-1])
    return result_list

# This is going to split your input text on the word Brazil
# the output is a list
split_input = split_string_keep_delimiter(input_string, "Brazil")
pprint(split_input)
# output (pprint wraps long strings; the exact line breaks may differ)
['Bob Smith, São Paulo State University/Department of Production '
 'Engineering, Brazil',
 ' James Smith, São Paulo State University/Department of Production '
 'Engineering, Brazil',
 ' Bob James, São Paulo State University/Department of Production '
 'Engineering, Brazil',
 '']

# This is going to split the previous list at the commas (,).
# the output is a nested list
results = [item.split(',') for item in split_input if item]
print(results)
# output
[['Bob Smith', ' São Paulo State University/Department of Production Engineering', ' Brazil'], [' James Smith', ' São Paulo State University/Department of Production Engineering', ' Brazil'], [' Bob James', ' São Paulo State University/Department of Production Engineering', ' Brazil']]

# This loops through the results and extracts 4 items from each list.
for item in results:
    name = item[0].strip()
    # The second field holds "university/department"; split it once on the first '/'
    university_name, department = item[1].strip().split('/', 1)
    country = item[2].strip()
    print(f'{name} - {university_name} - {department} - {country}')
# output
Bob Smith - São Paulo State University - Department of Production Engineering - Brazil
James Smith - São Paulo State University - Department of Production Engineering - Brazil
Bob James - São Paulo State University - Department of Production Engineering - Brazil
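
The steps above split on the literal word Brazil, which only works while every entry shares that country. A regex can express the same extraction without hard-coding the country. The following is a minimal sketch, assuming (as in the sample) that names and universities contain no commas, that a single '/' separates university from department, and that the country is one word. The pattern and the variable names are illustrative and not part of the original answer.

```python
import re

input_string = (
    "Bob Smith, São Paulo State University/Department of Production Engineering, Brazil "
    "James Smith, São Paulo State University/Department of Production Engineering, Brazil "
    "Bob James, São Paulo State University/Department of Production Engineering, Brazil"
)

# Assumed layout per entry: name, university/department, one-word country
entry_pattern = re.compile(r"\s*([^,]+),\s*([^,/]+)/([^,]+),\s*(\w+)")

# findall returns one 4-tuple per entry because the pattern has four groups
rows = entry_pattern.findall(input_string)

for name, university_name, department, country in rows:
    print(f"{name} - {university_name} - {department} - {country}")
```

A multi-word country would be cut off after its first word under this pattern, so the one-word assumption needs checking against the full dataset.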

Thanks - I was able to work with this. I did come up with an alternative that seems to work with most of my dataset: split_df = test_df["test_data"].str.split(r'(.*?,.*?,\s\w+)', expand=True)
How would you do this for all values in a column instead of for an isolated string?
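
One possible way to handle a whole column is pandas' Series.str.extractall, which applies a regex to every row and returns one output row per match. This is a sketch under the same assumptions as the sample data (no commas inside a name or university, one-word country). The named groups and the two-row frame are illustrative; the rows reuse entries from the question.

```python
import pandas as pd

entry = "São Paulo State University/Department of Production Engineering, Brazil"
test_df = pd.DataFrame(
    {
        "test_data": [
            f"Bob Smith, {entry} James Smith, {entry}",  # two entries in one cell
            f"Bob James, {entry}",                       # one entry in one cell
        ]
    }
)

# Named groups become the column names of the result
pattern = r"\s*(?P<name>[^,]+),\s*(?P<university>[^,]+),\s*(?P<country>\w+)"

# extractall yields a MultiIndex of (original row, match number);
# dropping the "match" level leaves the original row label as the index
split_df = test_df["test_data"].str.extractall(pattern)
split_df = split_df.reset_index(level="match", drop=True)

print(split_df)
```

The index of split_df still points at the source row, so the extracted entries can be traced back to the cell they came from.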

Michael Gardner · Accepted Answer · 2019-09-19 03:52:14Z

If your data were structured so that every field is delimited by a ",", including a comma between the last field of one record and the first field of the next, you could do something like the following.

IN:

import numpy as np
import pandas as pd

test_data = "São Paulo State University/Department of Production Engineering, Brazil, James Smith, São Paulo State University/Department of Production Engineering, Brazil, Bob James, São Paulo State University/Department of Production Engineering, Brazil, Mike Smith"

# Strip the space left after each comma, then group the fields three per row
fields = [field.strip() for field in test_data.split(',')]
df = pd.DataFrame(data=np.array(fields).reshape(-1, 3), columns=['University', 'Country', 'Name'])

OUT:

|   |                            University                           | Country | Name        |
|---|:---------------------------------------------------------------:|---------|-------------|
| 0 | São Paulo State University/Department of Production Engineering | Brazil  | James Smith |
| 1 | São Paulo State University/Department of Production Engineering | Brazil  | Bob James   |
| 2 | São Paulo State University/Department of Production Engineering | Brazil  | Mike Smith  |
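
The reshape(-1, 3) step only works when the number of comma-separated fields is a multiple of three. A small standard-library sketch of the same grouping with an explicit check is shown here; the helper name rows_of_three is hypothetical, and the sample is a two-record excerpt of the string above.

```python
def rows_of_three(text):
    """Group comma-delimited fields into (University, Country, Name) rows."""
    fields = [field.strip() for field in text.split(",")]
    if len(fields) % 3 != 0:
        # Mirrors the failure reshape(-1, 3) would hit on ragged input
        raise ValueError(f"expected a multiple of 3 fields, got {len(fields)}")
    return [tuple(fields[i:i + 3]) for i in range(0, len(fields), 3)]


test_data = (
    "São Paulo State University/Department of Production Engineering, Brazil, James Smith, "
    "São Paulo State University/Department of Production Engineering, Brazil, Bob James"
)

rows = rows_of_three(test_data)
for university, country, name in rows:
    print(f"{name} | {university} | {country}")
```

Running it on the question's original string, which has no comma after each country, would raise the ValueError, which is the situation the commenters describe.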
