df = pd.DataFrame({'columnA': ['apple:50-100(+)', 'peach:75-125(-)', 'banana:100-150(+)']})

I'm new to regular expressions. If I want to split 'apple:50-100(+)' (and the other example strings above) into separate DataFrame columns as shown below, what's the best way to do that?

Desired output:

[Image of the desired output table; not reproduced here.]

  • Can you provide some more context for this? How many strings? Where are the strings? What format do they follow? Commented Dec 10, 2019 at 3:09
  • Many strings in the format, 'apple:50-100(+)' and 'peach:50-100(-)'. They are in a column in a DataFrame. Commented Dec 10, 2019 at 3:11
  • Ah, well that's important information! Could you post an example of the column? Commented Dec 10, 2019 at 3:12
  • Can you share more about the first part of the string? Is it always just a single word, letters a-z? Commented Dec 10, 2019 at 3:20
  • Please don't post images of code/data/Tracebacks. Just copy the text, paste it in your question and format it as code. Commented Dec 10, 2019 at 3:40

3 Answers

4

I can update the regex if you provide more details on the format.

import pandas as pd

df = pd.DataFrame({'columnA': ['apple:50-100(+)', 'peach:75-125(-)', 'banana:100-150(+)']})

pattern = r"(.*):(\d+)-(\d+)\(([+-])\)"

new_df = df['columnA'].str.extract(pattern)

df:

             columnA
0    apple:50-100(+)
1    peach:75-125(-)
2  banana:100-150(+)

new_df:

        0    1    2  3
0   apple   50  100  +
1   peach   75  125  -
2  banana  100  150  +
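
The result above has integer column labels and every value is a string. As a minimal sketch of one way to tidy that up, named groups in the pattern become the column names, and the numeric columns can be converted explicitly. The names fruit, start, end and strand are assumptions made for illustration; the question does not state the desired headers.

```python
import pandas as pd

df = pd.DataFrame({'columnA': ['apple:50-100(+)', 'peach:75-125(-)', 'banana:100-150(+)']})

# Named groups (?P<name>...) become the column names of the extracted frame.
# The names below are hypothetical labels, not taken from the question.
pattern = r"(?P<fruit>.*):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)"

named_df = df['columnA'].str.extract(pattern)

# str.extract returns text, so convert the two numeric columns to integers.
named_df[['start', 'end']] = named_df[['start', 'end']].astype(int)

print(named_df)
```

Rows that do not match the pattern would come back as NaN, in which case the integer conversion would fail and would need handling first.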

4 Comments

This is the correct answer for pandas. TipsyHyena, take a look at the other pandas .str accessors here: pandas.pydata.org/pandas-docs/stable/reference/…
Do you mind directing me to the documentation for this notation here r"(.*):(\d+)-(\d+)\(([+-])\)"? Not familiar with regex.
Best resource to get started with regex imo is regexone.com. Others may have better recommendations
@TipsyHyena I really like Regex101, it's how I wrote the solution for this, here. regular-expressions.info is also nice as a reference/guide.
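
For readers asking about the notation, here is a sketch that spells out the same pattern piece by piece. It uses Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows comments; the variable names are only illustrative.

```python
import re

# The same pattern as in the answer, written out with one piece per line.
pattern = re.compile(r"""
    (.*)      # group 1: any run of characters (the name before the colon)
    :         # a literal colon
    (\d+)     # group 2: one or more digits
    -         # a literal hyphen
    (\d+)     # group 3: one or more digits
    \(        # a literal opening parenthesis (escaped, since parentheses normally mark a group)
    ([+-])    # group 4: a single plus or minus sign
    \)        # a literal closing parenthesis
""", re.VERBOSE)

match = pattern.match('apple:50-100(+)')
print(match.groups())  # ('apple', '50', '100', '+')
```

Each pair of unescaped parentheses is a capture group, and str.extract turns each capture group into one output column.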
1

re.split can be used to split a string wherever a pattern matches. For the example you have given, the following should work:

re.split(r'[\:\-\(\)]+', your_string)

It splits the string on all colons, hyphens and parentheses (":", "-", "(" and ")").

This results in an empty string as the last member of the list. You can either slice it off:

re.split(r'[\:\-\(\)]+', your_string)[:-1]

Or filter out empty values (in Python 3, filter returns an iterator, so wrap it in list to get a list back):

list(filter(None, re.split(r'[\:\-\(\)]+', your_string)))
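
One caveat with the pattern above: because the hyphen is in the delimiter set, a string such as 'peach:75-125(-)' loses its sign, since the whole "(-)" is consumed as a delimiter. The following is a sketch of one possible adjustment, not the answer author's method: it splits on a hyphen only when it sits between two digits, then builds a DataFrame from the pieces. The names splitter, rows and split_df are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'columnA': ['apple:50-100(+)', 'peach:75-125(-)', 'banana:100-150(+)']})

# Split on ':', '(' and ')', and on '-' only when it has a digit on both sides,
# so the '-' inside '(-)' is kept as data rather than treated as a delimiter.
splitter = re.compile(r'[:()]|(?<=\d)-(?=\d)')

# Drop the empty string left behind by the trailing ')'.
rows = [[part for part in splitter.split(s) if part] for s in df['columnA']]

split_df = pd.DataFrame(rows, index=df.index)
print(split_df)
```

This assumes every string follows the name:start-end(sign) layout; strings with a different shape would produce rows of differing length.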

2 Comments

assuming the split is by : - +
Using re.split on the example string yields ['apple', ':', '50', '-', '100', '(', '+', ')', '']. Now how can I transform this list into a DataFrame as in the question? pd.DataFrame(re.split('(\:|\-|\(|\))', 'apple:50-100(+)')) isn't quite right.
0

Here is an alternative:

Python 3.7.5 (default, Oct 17 2019, 12:16:48) 
[GCC 9.2.1 20190827 (Red Hat 9.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import pandas as pd
>>> split_it = re.compile(r'(\w+):(\d+)[-](\d+)\((.)\)')
>>> df = pd.DataFrame(split_it.findall('apple:50-100(+)'))
>>> df
       0   1    2  3
0  apple  50  100  +
>>>
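
To apply the same compiled pattern to a whole DataFrame column rather than a single string, one option is to match each value in turn and collect the groups. This is a sketch under the assumption that the column is named columnA as in the question; the helper parse is hypothetical.

```python
import re
import pandas as pd

split_it = re.compile(r'(\w+):(\d+)[-](\d+)\((.)\)')

df = pd.DataFrame({'columnA': ['apple:50-100(+)', 'peach:75-125(-)', 'banana:100-150(+)']})

def parse(value):
    """Return the four captured groups, or four Nones if the value does not match."""
    match = split_it.fullmatch(value)
    return match.groups() if match else (None, None, None, None)

# One row of captured groups per input string, aligned to the original index.
parsed = pd.DataFrame([parse(v) for v in df['columnA']], index=df.index)
print(parsed)
```

Using fullmatch rather than findall means a string with extra leading or trailing text yields a row of None values instead of a partial match.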

2 Comments

Can this function take a dataframe column as input?
Probably yes but it would be better if you edit your post and show us a real sample of your data though.
