0

I'm new to regex and I'd like to split some text data into columns. In test_data below, each entry has the same structure: first/last name, university, and country. How can I split this text into three columns (name, university and country)?

import pandas as pd

test_data = "Bob Smith, São Paulo State University/Department of Production Engineering, Brazil James Smith, São Paulo State University/Department of Production Engineering, Brazil Bob James, São Paulo State University/Department of Production Engineering, Brazil"

test_df = pd.DataFrame([test_data], columns=["test_data"])
split_df = test_df["test_data"].str.split(r'\w+,', expand=True)
split_df.head()

Thanks in advance!

3
  • Your test data is bad. Why? There is no delimiter between two entries. Commented Sep 19, 2019 at 3:26
  • You're right, but can't the commas be used to separate the data? All entries follow the same format of name, comma, university, comma and country. Thanks. Commented Sep 19, 2019 at 3:31
  • You're missing a comma after each country, so if you split on the commas, you end up with the country and the next name in the same list element. Commented Sep 19, 2019 at 3:41
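
To make the point in the comments concrete, here is a minimal sketch in plain Python, using the test_data string from the question. It only illustrates the problem; it is not a fix.

```python
test_data = (
    "Bob Smith, São Paulo State University/Department of Production Engineering, Brazil "
    "James Smith, São Paulo State University/Department of Production Engineering, Brazil "
    "Bob James, São Paulo State University/Department of Production Engineering, Brazil"
)

# Split on every comma and trim the surrounding whitespace.
parts = [part.strip() for part in test_data.split(",")]

# Three entries with three fields each should give 9 parts, but there are only
# 6 commas, so we get 7: each country is fused with the following name.
print(len(parts))  # 7
print(parts[2])    # Brazil James Smith
```

Because nothing separates a country from the next name, any comma-only approach needs an extra rule (for example, "the country is a single word") to find the boundary between entries.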

2 Answers 2

1

I don't know how you are generating your input data, or whether it stays consistent across a larger set, so this answer works with the current data structure as-is, without modifying it. You should be able to load the final output into a dataframe. If you have issues with that, I will add that piece too.

from pprint import pprint

input_string = 'Bob Smith, São Paulo State University/Department of Production Engineering, Brazil James Smith, São Paulo State University/Department of Production Engineering, Brazil Bob James, São Paulo State University/Department of Production Engineering, Brazil'

def split_string_keep_delimiter(string_to_split, delimiter):
  # Split on the delimiter, then re-attach it to every piece except the last
  result_list = []
  tokens = string_to_split.split(delimiter)
  for token in tokens[:-1]:
    result_list.append(token + delimiter)
  result_list.append(tokens[-1])
  return result_list

# This splits your input text after each occurrence of the word Brazil.
# The output is a list; the last element is empty because the input ends with Brazil.
split_input = split_string_keep_delimiter(input_string, "Brazil")
pprint(split_input)
# output (pprint wraps long strings, so the exact line breaks may differ)
# ['Bob Smith, São Paulo State University/Department of Production Engineering, Brazil',
#  ' James Smith, São Paulo State University/Department of Production Engineering, Brazil',
#  ' Bob James, São Paulo State University/Department of Production Engineering, Brazil',
#  '']

# This is going to split the previous list at the commas (,).
# the output is a nested list
results = [item.split(',') for item in split_input if len(item) > 0]
print(results)
# output
[['Bob Smith', ' São Paulo State University/Department of Production Engineering', ' Brazil'], [' James Smith', ' São Paulo State University/Department of Production Engineering', ' Brazil'], [' Bob James', ' São Paulo State University/Department of Production Engineering', ' Brazil']]

# This loops through the results and extracts four fields from each list.
for item in results:
  name = item[0].strip()
  university_name = item[1].strip().split('/')[0]
  department = item[1].strip().split('/')[1]
  country = item[2].strip()
  print(f'{name} - {university_name} - {department} - {country}')
# output
# Bob Smith - São Paulo State University - Department of Production Engineering - Brazil
# James Smith - São Paulo State University - Department of Production Engineering - Brazil
# Bob James - São Paulo State University - Department of Production Engineering - Brazil
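
For the dataframe step mentioned at the top of this answer, one possible sketch follows. It assumes results has the nested-list shape printed above; the column names are my own choice, not something the question specifies.

```python
import pandas as pd

# Nested list with the same shape as `results` above.
affiliation_text = " São Paulo State University/Department of Production Engineering"
results = [
    ["Bob Smith", affiliation_text, " Brazil"],
    [" James Smith", affiliation_text, " Brazil"],
    [" Bob James", affiliation_text, " Brazil"],
]

rows = []
for item in results:
    name, affiliation, country = (field.strip() for field in item)
    # partition() tolerates a missing '/', leaving department as an empty string
    university, _, department = affiliation.partition("/")
    rows.append(
        {"name": name, "university": university,
         "department": department, "country": country}
    )

# Column names are illustrative assumptions.
df = pd.DataFrame(rows, columns=["name", "university", "department", "country"])
print(df)
```

Using partition instead of split('/')[1] avoids an IndexError on entries that have no department part.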

2 Comments

Thanks - I was able to work with this. I did come up with an alternative that seems to work with most of my dataset: split_df = test_df["test_data"].str.split(r'(.*?,.*?,\s\w+)', expand=True)
How would you do this for all values in a column instead of for an isolated string?
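
One way to handle a whole column, sketched here as an alternative to the approach in the answer, is pandas' str.extractall with named groups. It relies on the same assumption as the regex in the comment above: the country is a single word, so \w+ marks the end of each entry. The second row of the sample frame is made-up data added only to show the multi-row case.

```python
import pandas as pd

test_df = pd.DataFrame({
    "test_data": [
        "Bob Smith, São Paulo State University/Department of Production Engineering, Brazil "
        "James Smith, São Paulo State University/Department of Production Engineering, Brazil",
        # Hypothetical extra row, not from the question
        "Ana Lima, University of Lisbon, Portugal",
    ]
})

# name and university: anything up to the next comma; country: one word (assumption)
pattern = r"\s*(?P<name>[^,]+),\s*(?P<university>[^,]+),\s*(?P<country>\w+)"

# extractall returns one row per match across all cells, indexed by (row, match)
split_df = test_df["test_data"].str.extractall(pattern).reset_index(drop=True)
print(split_df)
```

Multi-word countries (for example "United States") would break this pattern, so the data would need a real delimiter between entries in that case.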
1

If your data were better structured, with every field (including the last one of each entry) delimited by a ",", then you could do something like the following.

IN:

import numpy as np
import pandas as pd

test_data = "São Paulo State University/Department of Production Engineering, Brazil, James Smith, São Paulo State University/Department of Production Engineering, Brazil, Bob James, São Paulo State University/Department of Production Engineering, Brazil, Mike Smith"

# Split on ', ' (comma plus space) so the fields carry no leading whitespace
df = pd.DataFrame(data = np.array(test_data.split(', ')).reshape(-1, 3), columns = ['University', 'Country', 'Name'])

OUT:

|   |                            University                           | Country | Name        |
|---|:---------------------------------------------------------------:|---------|-------------|
| 0 | São Paulo State University/Department of Production Engineering | Brazil  | James Smith |
| 1 | São Paulo State University/Department of Production Engineering | Brazil  | Bob James   |
| 2 | São Paulo State University/Department of Production Engineering | Brazil  | Mike Smith  |
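
Note that reshape(-1, 3) only works when the number of fields is an exact multiple of 3; otherwise NumPy raises a ValueError. A small sketch of a guard around that step follows. The helper name fields_to_rows and the placeholder strings are hypothetical, introduced only for illustration.

```python
import numpy as np

def fields_to_rows(text, n_cols=3, sep=", "):
    """Split text on sep and arrange the fields into rows of n_cols (hypothetical helper)."""
    fields = text.split(sep)
    if len(fields) % n_cols != 0:
        # Fail early with a clearer message than NumPy's reshape error
        raise ValueError(
            f"got {len(fields)} fields, which is not a multiple of {n_cols}"
        )
    return np.array(fields).reshape(-1, n_cols)

# Placeholder values standing in for university, country, name
rows = fields_to_rows("u1, c1, n1, u2, c2, n2")
print(rows.shape)  # (2, 3)
```

A single missing or extra comma anywhere in the text shifts every later field into the wrong column, so checking the field count first is a cheap sanity test.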
