3

Below is my dataframe having a column that are merged together,

   PLUGS\nDESIGN\nGEAR
0  700\nDaewoo 8000  Gearless   
1  300\nHyundai 4400  Gearless   
2  600\nSTX 2600  Gearless   
3  200\nB170 \nGeared   
4  362 Wenchong 1700 Mk II \nGeared   
5  252\nRichMax 1550  Gearless   
6  220\nCV 1100 Plus \nGeared   
7  232\nOrskov Mk VII  Gearless   
8  119\nKouan 1000  Gearless   
9  100\nHanjin 700  Gearless

I want to split the columns into three different columns namely PLUGS, DESIGN, GEAR. Is there any way to do this?

Below is the code which i tried:

new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
                print(new_df)

expected output:

   PLUGS  DESIGN               GEAR
0  700    Daewoo 8000          Gearless   
1  300    Hyundai 4400         Gearless   
2  600    STX 2600             Gearless   
3  200    B170                 Geared   
4  362    Wenchong 1700 Mk II  Geared   
5  252    RichMax 1550         Gearless   
6  220    CV 1100 Plus         Geared   
7  232    Orskov Mk VII        Gearless   
8  119    Kouan 1000           Gearless   
9  100    Hanjin 700           Gearless
8
  • 2
    What is your original CSV? How did you read the file? @Saran Commented Jul 8, 2021 at 5:51
  • I extracted this information from a pdf using camelot Commented Jul 8, 2021 at 5:52
  • Is it possible to have the raw text and not the dataframe please? Commented Jul 8, 2021 at 6:09
  • Is df["PLUGS\nDESIGN\nGEAR"].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+(.+)$") working on real data? Commented Jul 8, 2021 at 6:20
  • 1
    @KarnKumar Sorry for the late response... Indeed regex seems to work. Thanks for the answer, +1. Commented Jul 12, 2021 at 19:55

2 Answers 2

2

As Suggested in the comment section, the regex should work pretty well here,

DataFrame Sample:

>>> df
                   PLUGS\nDESIGN\nGEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless

Just removing the newline char from the column name to make readability easy for use as well.

>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)

Now, Column name does not have any special cars:

>>> df
                     PLUGS DESIGN GEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless

Now, we can use pandas.Series.str.extract. While using regex method, all the Named groups () will become column names in the result.

As, the named group will become columns with predefined names like 0,1,2 thus we can rename them altogether with desired names to get the desired result as follows:

>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})

Result:

>>> print(df)
  PLUGS                DESIGN      GEAR
0   700          Daewoo 8000   Gearless
1   300         Hyundai 4400   Gearless
2   600             STX 2600   Gearless
3   200                 B170     Geared
4   362  Wenchong 1700 Mk II     Geared
5   252         RichMax 1550   Gearless
6   220         CV 1100 Plus     Geared
7   232        Orskov Mk VII   Gearless
8   119           Kouan 1000   Gearless
9   100           Hanjin 700   Gearless

regex Explanation:

You can check at regex101.com

(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+)

1st Capturing Group (\d+)

    \d matches a digit (equivalent to [0-9])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

2nd Capturing Group ([^\]+)

    Match a single character not present in the list below [^\\]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

3rd Capturing Group ([|^Gear][a-z]+)

Match a single character present in the list below [\|^Gear]
\| matches the character | literally (case sensitive)
^Gear matches a single character in the list ^Gear (case sensitive)
Match a single character present in the list below [a-z]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
Sign up to request clarification or add additional context in comments.

1 Comment

This is nice Sir, one liner with regex.
1

Starting from your Dataframe :

>>> import pandas as pd

>>> df = pd.DataFrame({'PLUGS\nDESIGN\nGEAR': ['700\nDaewoo 8000  Gearless', '300\nHyundai 4400  Gearless', '600\nSTX 2600  Gearless', '200\nB170 \nGeared', '362 Wenchong 1700 Mk II \nGeared', '252\nRichMax 1550  Gearless'], }, 
...                   index = [0, 1, 2, 3, 4, 5]) 
>>> df
    PLUGS\nDESIGN\nGEAR
0   700\nDaewoo 8000 Gearless
1   300\nHyundai 4400 Gearless
2   600\nSTX 2600 Gearless
3   200\nB170 \nGeared
4   362 Wenchong 1700 Mk II \nGeared
5   252\nRichMax 1550 Gearless

You can indeed use the split method on several separators, here \n and space:

>>> df = pd.DataFrame(df['PLUGS\nDESIGN\nGEAR'].str.split('\n| '))
    PLUGS\nDESIGN\nGEAR
0   [700, Daewoo, 8000, , Gearless]
1   [300, Hyundai, 4400, , Gearless]
2   [600, STX, 2600, , Gearless]
3   [200, B170, , Geared]
4   [362, Wenchong, 1700, Mk, II, , Geared]
5   [252, RichMax, 1550, , Gearless]

Then, you can assign the first and last element to the correct column, and the rest to the DESIGN column :

>>> df['PLUGS'] = df['PLUGS\nDESIGN\nGEAR'].str[0]
>>> df['DESIGN'] = df['PLUGS\nDESIGN\nGEAR'].str[1:-1]
>>> df['GEAR'] = df['PLUGS\nDESIGN\nGEAR'].str[-1]
>>> df
    PLUGS\nDESIGN\nGEAR                         PLUGS   DESIGN                      GEAR
0   [700, Daewoo, 8000, , Gearless]             700     [Daewoo, 8000, ]            Gearless
1   [300, Hyundai, 4400, , Gearless]            300     [Hyundai, 4400, ]           Gearless
2   [600, STX, 2600, , Gearless]                600     [STX, 2600, ]               Gearless
3   [200, B170, , Geared]                       200     [B170, ]                    Geared
4   [362, Wenchong, 1700, Mk, II, , Geared]     362     [Wenchong, 1700, Mk, II, ]  Geared
5   [252, RichMax, 1550, , Gearless]            252     [RichMax, 1550, ]           Gearless

The last thing to do is to improve the DESIGN column to map it as a string instead of a list using the join method, and drop the PLUGS\nDESIGN\nGEAR column like so :

>>> df['DESIGN'] = df['DESIGN'].apply(lambda x: ' '.join(map(str, x)))
>>> df.drop(['PLUGS\nDESIGN\nGEAR'], axis=1)
    PLUGS   DESIGN               GEAR
0   700     Daewoo 8000          Gearless
1   300     Hyundai 4400         Gearless
2   600     STX 2600             Gearless
3   200     B170                 Geared
4   362     Wenchong 1700 Mk II  Geared
5   252     RichMax 1550         Gearless

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.