I have a pandas DataFrame like this:

      Date                   Descriptive          
0  2017-1-1    Time12:30 Id124562 American electronic commerce and cloud computing company based in Seattle     
1  2017-1-2    Time12:40 Id124565 Amazon has separate retail websites for the United States
2  2017-1-3    Time12:45 Id124561 In 2020, Amazon will build a new downtown Seattle building 

How can I generate a new DataFrame like this with Python?

         Date        time      id           descriptive
    0  2017-1-1     12:30    124562     American electronic commerce and cloud computing company based in Seattle     
    1  2017-1-2     12:40    124565     Amazon has separate retail websites for the United States
    2  2017-1-3     12:45    124561     In 2020, Amazon will build a new downtown Seattle building 

PS: Sorry, I made up this DataFrame to represent the real data-cleaning problem I ran into. The length of the id is fixed at 6. Thanks a lot.

3 Answers

Use join with str.split and expand=True:

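import pandas as pd

cols = ['Time', 'Type', 'Price', 'Id']

# Strip the literal prefixes (Time, Type, ...), then split on whitespace.
# regex=True is required on recent pandas, where str.replace is literal by default.
tokens = df.Descriptive.str.replace(
    r'(?:{})(\S+)'.format('|'.join(cols)), r'\1', regex=True
).str.split(expand=True)

df.join(pd.DataFrame(tokens.values, columns=cols, index=df.index))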

# Result

       Date                            Descriptive   Time Type Price        Id
0  2017-1-1    Time12:30 Type021 Price11$ Id124562  12:30  021   11$    124562
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12:40  011   11$  12456512
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12:45  031   11$  12456125
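
If the text after the tagged fields is free-form, as in the question's sample, an unbounded split produces a different number of tokens in each row, which is one plausible cause of a shape mismatch like the ValueError reported in the comments. Below is a minimal sketch, assuming every row starts with a Time token and an Id token followed by arbitrary text. The small frame and the column names time, id and descriptive are made up from the question's sample.

```python
import pandas as pd

# Hypothetical sample modelled on the question's data.
df = pd.DataFrame({
    "Date": ["2017-1-1", "2017-1-2"],
    "Descriptive": [
        "Time12:30 Id124562 American electronic commerce company",
        "Time12:40 Id124565 Amazon has separate retail websites",
    ],
})

# n=2 stops after two splits, so the free text stays in a single column.
parts = df["Descriptive"].str.split(n=2, expand=True)
parts.columns = ["time", "id", "descriptive"]

# Remove the literal prefixes from the first two columns.
parts["time"] = parts["time"].str.replace("Time", "", regex=False)
parts["id"] = parts["id"].str.replace("Id", "", regex=False)

result = df[["Date"]].join(parts)
print(result)
```

Limiting the split keeps the column count fixed at three no matter how long the trailing text is.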

6 Comments

ValueError: Shape of passed values is (140, 37339), indices imply (3, 37339)
Then your data looks very different from what you have provided
Thanks a lot. Do you know how to deal with this error?
If you provide a row of your actual data I can show you how.
Yeah. df.Descriptive is only an example column of my dataset; there are many other columns as well. But I want to slice this column and join the parts to the whole dataset as Time, Type, Price and Id.

New answer

import re

epat = re.compile(r'(\w+?)(\d\S*)')

# Build one {prefix: value} dict per row and align the new frame on df.index.
df.join(pd.DataFrame([dict(re.findall(epat, y)) for y in df.Descriptive], df.index))

       Date                            Descriptive        Id Price   Time Type
0  2017-1-1    Time12:30 Type021 Price11$ Id124562    124562   11$  12:30  021
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12456512   11$  12:40  011
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12456125   11$  12:45  031
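
To see what the findall pattern does on a single string, here is a small sketch. The first string is a row from the sample above; the second is a hypothetical row shaped like the question's data, with free text after the Id.

```python
import re

epat = re.compile(r"(\w+?)(\d\S*)")

# Fully tagged row: every token is a prefix followed by a value.
tagged = "Time12:30 Type021 Price11$ Id124562"
tagged_fields = dict(epat.findall(tagged))
print(tagged_fields)

# Row with trailing free text (assumed shape, based on the question).
free_text = "Time12:45 Id124561 In 2020, Amazon will build"
free_fields = dict(epat.findall(free_text))
print(free_fields)
```

In the second string the token 2020, also matches the pattern, so the dictionary gains a stray key '2'. Over many rows of free text, each such key becomes its own column, which is consistent with the very wide result mentioned in the comments.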

I just think naming the groups in the regex pattern is elegant:

# Raw strings keep the backslashes in \d and \$ from being read as string escapes.
time = r'Time(?P<Time>\d{1,2}:\d{2}) '
typ_ = r'Type(?P<Type>\d+) '
prc_ = r'Price(?P<Price>\d+)\$ '
id__ = r'Id(?P<Id>\d+)$'
pat = f'{time}{typ_}{prc_}{id__}'
df.join(df.Descriptive.str.extract(pat, expand=True))

       Date                            Descriptive   Time Type Price        Id
0  2017-1-1    Time12:30 Type021 Price11$ Id124562  12:30  021    11    124562
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12:40  011    11  12456512
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12:45  031    11  12456125
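
The same named-group idea can be adapted to the layout shown in the question (a time, a six-digit id, then free text). This is only a sketch under that assumption; the group names time, id and descriptive are my own choice, and the frame is shortened from the question's sample.

```python
import pandas as pd

# Hypothetical sample modelled on the question's data.
df = pd.DataFrame({
    "Date": ["2017-1-1", "2017-1-2", "2017-1-3"],
    "Descriptive": [
        "Time12:30 Id124562 American electronic commerce company",
        "Time12:40 Id124565 Amazon has separate retail websites",
        "Time12:45 Id124561 In 2020, Amazon will build a new building",
    ],
})

# One named group per output column; the id is assumed to be exactly 6 digits.
pat = r"Time(?P<time>\d{1,2}:\d{2})\s+Id(?P<id>\d{6})\s+(?P<descriptive>.*)"

extracted = df["Descriptive"].str.extract(pat, expand=True)
result = df[["Date"]].join(extracted)
print(result)
```

Rows that do not match the pattern come back as NaN in all three new columns instead of raising, which makes irregular rows easy to find afterwards.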

2 Comments

Thanks for the comments and help. In fact, my actual problem is the following: I want to slice df['Descriptive'] into only three parts, Time, Type and everything else (including Price and Id), then remove the literal text "Time" and "Type", rename the columns, and concatenate them with the whole DataFrame. Does this make sense? How should I write this in Python?
Thanks. It works, but it creates a new DataFrame with 37339 rows × 10068 columns. As I explained to @coldspeed, the column needs to be separated into three parts first. The last part is a long unstructured phrase, not regular data like the example.

You can try something like below:

items = ['Time', 'Type', 'Price', 'Id']
for index, item in enumerate(items):
    df[item] = df['Descriptive'].apply(lambda row: row.split(' ')[index].split(item)[1])

print(df)

Result:

       Date                            Descriptive   Time Type Price        Id
0  2017-1-1    Time12:30 Type021 Price11$ Id124562  12:30  021   11$    124562
1  2017-1-2  Time12:40 Type011 Price11$ Id12456512  12:40  011   11$  12456512
2  2017-1-3  Time12:45 Type031 Price11$ Id12456125  12:45  031   11$  12456125

If the for loop is confusing, you can call apply once per column instead:

df['Time'] = df['Descriptive'].apply(lambda row: row.split(' ')[0].split('Time')[1])
# Note: int() drops the leading zero here ('021' becomes 21), unlike the loop version above.
df['Type'] = df['Descriptive'].apply(lambda row: int(row.split(' ')[1].split('Type')[1]))
df['Price'] = df['Descriptive'].apply(lambda row: row.split(' ')[2].split('Price')[1])
df['Id'] = df['Descriptive'].apply(lambda row: row.split(' ')[3].split('Id')[1])
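
Both versions index into the split result by position, so a row with fewer than four space-separated tokens, or a token that lacks the expected prefix, raises IndexError: list index out of range. The following sketch shows one way to locate the short rows before applying the lambdas; the second row is a hypothetical irregular record.

```python
import pandas as pd

df = pd.DataFrame({
    "Descriptive": [
        "Time12:30 Type021 Price11$ Id124562",
        "Time12:40 Type011",  # hypothetical row with missing fields
    ],
})

# Count the space-separated tokens in each row.
token_counts = df["Descriptive"].str.split(" ").str.len()

# Rows with fewer than four tokens would break row.split(' ')[3].
bad_rows = df[token_counts < 4]
print(bad_rows)
```

Inspecting bad_rows on the real data should show which records differ from the assumed four-token layout.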

3 Comments

Is something wrong? Didn't the solution work?
Thanks. When I try your solution in my dataset, it says: IndexError: list index out of range. Do you know why?
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2549         else:
   2550             values = self.asobject
-> 2551             mapped = lib.map_infer(values, f, convert=convert_dtype)
   2552
   2553         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-24-ea7872f2e5a8> in <lambda>(row)
