Extracting characters from strings and forming new columns in Python

Question

I have a pandas DataFrame like this:

      Date                   Descriptive          
0  2017-1-1    Time12:30 Id124562 American electronic commerce and cloud computing company based in Seattle     
1  2017-1-2    Time12:40 Id124565 Amazon has separate retail websites for the United States
2  2017-1-3    Time12:45 Id124561 In 2020, Amazon will build a new downtown Seattle building

How can I generate a new DataFrame like this with Python?

         Date        time      id           descriptive
    0  2017-1-1     12:30    124562     American electronic commerce and cloud computing company based in Seattle     
    1  2017-1-2     12:40    124565     Amazon has separate retail websites for the United States
    2  2017-1-3     12:45    124561     In 2020, Amazon will build a new downtown Seattle building

PS: Sorry, I made up this DataFrame to represent the real data-cleaning problem I ran into. The length of the id is fixed at 6. Thanks a lot.

user3483203 · Accepted Answer · 2018-05-26 04:59:55Z

2

join with split using expand=True

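import pandas as pd

cols = ['Time', 'Type', 'Price', 'Id']

# Strip each keyword prefix, then split the remaining values on whitespace.
df.join(
    pd.DataFrame(
        df.Descriptive.str.replace(
            r'(?:{})([^\s]+)'.format('|'.join(cols)),
            r'\1',
            regex=True  # needed on pandas 2.0+, where regex defaults to False
        ).str.split(expand=True).values,
        columns=cols
    )
)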

# Result

       Date                            Descriptive   Time Type Price        Id
0  2017-1-1    Time12:30 Type021 Price11$ Id124562  12:30  021   11$    124562
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12:40  011   11$  12456512
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12:45  031   11$  12456125

edited May 26, 2018 at 4:59

answered May 26, 2018 at 4:52

user3483203


6 Comments

ah bon Over a year ago

ValueError: Shape of passed values is (140, 37339), indices imply (3, 37339)

user3483203 Over a year ago

Then your data looks very different from what you have provided

ah bon Over a year ago

Thanks a lot. Do you know how to deal with this error?

user3483203 Over a year ago

If you provide a row of your actual data I can show you how.

ah bon Over a year ago

Yeah. df.Descriptive is only an example column of my dataset; there are many other columns as well. But I want to slice this column and join the pieces (Time, Type, Price and Id) to the whole dataset.
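The ValueError above is consistent with `str.split(expand=True)` producing more pieces per row than there are column names, which is what happens once the column ends in free text. A minimal sketch, assuming every row starts with a `Time…` token and an `Id…` token followed by the description (as in the question's sample; the output column names are taken from the desired frame in the question), caps the number of splits with `n=2`:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-1-1", "2017-1-2", "2017-1-3"],
    "Descriptive": [
        "Time12:30 Id124562 American electronic commerce and cloud computing company based in Seattle",
        "Time12:40 Id124565 Amazon has separate retail websites for the United States",
        "Time12:45 Id124561 In 2020, Amazon will build a new downtown Seattle building",
    ],
})

# Split on whitespace at most twice, so the free text stays in one piece.
parts = df["Descriptive"].str.split(n=2, expand=True)
parts.columns = ["time", "id", "descriptive"]

# Drop the literal "Time" and "Id" prefixes by position.
parts["time"] = parts["time"].str[len("Time"):]
parts["id"] = parts["id"].str[len("Id"):]

result = df[["Date"]].join(parts)
print(result)
```

This only holds while the first two tokens are always present and in this order; rows that deviate would end up with values in the wrong columns.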


piRSquared · Accepted Answer · 2018-05-26 06:55:22Z

2

New answer

import re

epat = re.compile(r'(\w+?)(\d\S*)')

df.join(pd.DataFrame([dict(epat.findall(y)) for y in df.Descriptive], index=df.index))

       Date                            Descriptive        Id Price   Time Type
0  2017-1-1    Time12:30 Type021 Price11$ Id124562    124562   11$  12:30  021
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12456512   11$  12:40  011
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12456125   11$  12:45  031
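To see what the pattern does on a single string, here is a small standalone sketch. The `free_text` value is an assumed example (a phrase from the question's sample data) to show how unstructured text behaves:

```python
import re

epat = re.compile(r'(\w+?)(\d\S*)')

row = "Time12:30 Type021 Price11$ Id124562"
# Each match is a (key, value) pair: the shortest run of word characters,
# then everything from the first digit up to the next whitespace.
pairs = epat.findall(row)
record = dict(pairs)
print(pairs)
print(record)

# Free text containing digits also matches, with the first digit as the "key".
free_text = "In 2020, Amazon"
stray = epat.findall(free_text)
print(stray)
```

Because any token containing digits yields a pair, free text in the column can generate many unintended keys, and each distinct key becomes its own column in the resulting DataFrame.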

I just think this regex pattern with named groups is elegant:

time = r'Time(?P<Time>\d{1,2}:\d{2}) '
typ_ = r'Type(?P<Type>\d+) '
prc_ = r'Price(?P<Price>\d+)\$ '
id__ = r'Id(?P<Id>\d+)$'
pat = f'{time}{typ_}{prc_}{id__}'
df.join(df.Descriptive.str.extract(pat, expand=True))

       Date                            Descriptive   Time Type Price        Id
0  2017-1-1    Time12:30 Type021 Price11$ Id124562  12:30  021    11    124562
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12:40  011    11  12456512
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12:45  031    11  12456125

edited May 26, 2018 at 6:55

answered May 26, 2018 at 6:05

piRSquared


2 Comments

ah bon Over a year ago

Thanks for the comments and help. In fact, my actual problem is the following: I want to slice df['Descriptive'] into only three parts, namely Time, Type and the rest (including Price and Id), then remove the literal words Time and Type, rename the columns, and concatenate the result with the whole DataFrame. Does this make sense? How should I write this in Python?

ah bon Over a year ago

Thanks. It works, but it creates a new DataFrame of 37339 rows × 10068 columns. As I explained to @coldspeed, the column needs to be separated into three parts first. The last part is a long unstructured phrase, not regular data as in the example.
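For the layout shown in the question (a time, a 6-digit id, then free text), the named-group idea might be adapted as in the sketch below. It assumes the `Time…` and `Id…` tokens always come first and are separated by whitespace; the group names mirror the column names of the desired output.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-1-1", "2017-1-2", "2017-1-3"],
    "Descriptive": [
        "Time12:30 Id124562 American electronic commerce and cloud computing company based in Seattle",
        "Time12:40 Id124565 Amazon has separate retail websites for the United States",
        "Time12:45 Id124561 In 2020, Amazon will build a new downtown Seattle building",
    ],
})

# Two anchored fields, then one catch-all group for the free text.
pattern = (
    r"Time(?P<time>\d{1,2}:\d{2})\s+"
    r"Id(?P<id>\d{6})\s+"
    r"(?P<descriptive>.*)"
)

# Named groups become the column names of the extracted frame.
extracted = df["Descriptive"].str.extract(pattern)
result = df[["Date"]].join(extracted)
print(result)
```

Since the last group captures the whole remainder, digits inside the free text no longer create extra columns. Rows that do not match the pattern get NaN in all three new columns, which makes them easy to find with `result["time"].isna()`.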

niraj · Accepted Answer · 2018-05-26 04:25:05Z

1

You can try something like below:

items = ['Time', 'Type', 'Price', 'Id']
for index, item in enumerate(items):
    df[item] = df['Descriptive'].apply(lambda row: row.split(' ')[index].split(item)[1])

print(df)

Result:

       Date                            Descriptive   Time Type Price        Id
0  2017-1-1    Time12:30 Type021 Price11$ Id124562  12:30  021   11$    124562
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12:40  011   11$  12456512
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12:45  031   11$  12456125

If the for loop is confusing, you can use apply without a loop:

df['Time'] = df['Descriptive'].apply(lambda row: row.split(' ')[0].split('Time')[1])
df['Type'] = df['Descriptive'].apply(lambda row: int(row.split(' ')[1].split('Type')[1]))
df['Price'] = df['Descriptive'].apply(lambda row: row.split(' ')[2].split('Price')[1])
df['Id'] = df['Descriptive'].apply(lambda row: row.split(' ')[3].split('Id')[1])

answered May 26, 2018 at 4:25

niraj


3 Comments

niraj Over a year ago

Is there something wrong, wouldn't the solution work?

ah bon Over a year ago

Thanks. When I try your solution in my dataset, it says: IndexError: list index out of range. Do you know why?

ah bon Over a year ago

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2549             else:
   2550                 values = self.asobject
-> 2551                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552
   2553         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-24-ea7872f2e5a8> in <lambda>(row)
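A likely cause of the IndexError, though it cannot be confirmed without the real data: the positional lookups assume every row has exactly four space-separated tokens in a fixed order. If a row has fewer tokens, or the token at that position lacks the keyword, `split(item)` returns a single-element list and `[1]` fails. A more forgiving sketch looks each keyword up by name and leaves NaN where it is missing (the second row below is a hypothetical malformed example):

```python
import pandas as pd

df = pd.DataFrame({
    "Descriptive": [
        "Time12:30 Type021 Price11$ Id124562",
        "Time12:40 Price11$",  # hypothetical malformed row: Type and Id missing
    ],
})

items = ["Time", "Type", "Price", "Id"]
for item in items:
    # \b pins the keyword to the start of a token; (\S+) captures the value
    # up to the next whitespace. Rows without the keyword get NaN.
    df[item] = df["Descriptive"].str.extract(rf"\b{item}(\S+)", expand=False)

print(df)
```

Rows that come back with NaN are the ones that would have raised in the `apply` version, so this also helps locate the problematic records.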
