Create new dataframe columns from one column with different values and types

Question

I'm trying to create new columns by the fish species name and have the integer as the value, keeping the indexing to do a dataframe join afterwards.

import pandas as pd
df = pd.read_csv("fishCounts.csv",index_col=0)
countsdf = df[["Fish Count"]].copy()
countsdf.head()
    
Fish Count
0   38 Sand Bass, 16 Sculpin, 10 Blacksmith
1   138 Sculpin, 28 Sand Bass
2   150 Sculpin Released, 102 Sculpin, 40 Sanddab
3   156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3 ...
4   161 Sculpin

countsdf.columns = ["fish"]
countsdf.fish = countsdf.fish.str.split(", ", expand=False)
countsdf.head()

fish
0   [38 Sand Bass, 16 Sculpin, 10 Blacksmith]
1   [138 Sculpin, 28 Sand Bass]
2   [150 Sculpin Released, 102 Sculpin, 40 Sanddab]
3   [156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3...
4   [161 Sculpin]

Here's where I'm not sure where to go. Iterate through the dataframe rows? Make a list of dictionaries? Could I have imported the data differently to make this easier?

Edit: This is what I'm trying to get to.

  Sand Bass   Sculpin   Blacksmith   Sculpin Released  Sanddab  Black Croaker
0        38        16           10
1        28        138
2                  102                            150       40
3        29        156                                                      5
4                  161

don't worry about formatting - since we paste the data back into the IDE based on a delimiter - (comma, space, tab etc) — Umar.H
– Umar.H, Commented Aug 4, 2020 at 21:13

wwnde · Accepted Answer · 2020-08-04 22:40:17Z

something similar to @Manakin

Turn Fish Count int list

df['Fish Count']=df['Fish Count'].str.split(',')

Explode to separate each fish with its id

df2=df.explode('Fish Count')

Create dictionary. Here, I use list comprehension to derive key and value after splitting values in Fish Count by white space after digit

{i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}

Outcome

{'38': 'Sand Bass',
 ' 16': 'Sculpin',
 ' 10': 'Blacksmith',
 '138': 'Sculpin',
 ' 28': 'Sand Bass',
 '150': 'Sculpin Released',
 ' 102': 'Sculpin',
 ' 40': 'Sanddab',
 '156': 'Sculpin',
 ' 29': 'Sand Bass',
 ' 5': 'Black Croaker',
 '161': 'Sculpin'}

Can print if needed

print(pd.DataFrame.from_dict({i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}, orient='index'))

                     0
38           Sand Bass
 16            Sculpin
 10         Blacksmith
138            Sculpin
 28          Sand Bass
150   Sculpin Released
 102           Sculpin
 40            Sanddab
156            Sculpin
 29          Sand Bass
 5       Black Croaker
161            Sculpin

halfer · Accepted Answer · 2023-07-28 12:56:48Z

We can use str.split and str.extract with stack:

s = df['Fish Count'].str.split(',',expand=True).stack()
s.str.extract('(\d+)(\D+)')

yields -

       0                  1
0 0   38          Sand Bass
  1   16            Sculpin
  2   10         Blacksmith
1 0  138            Sculpin
  1   28          Sand Bass
2 0  150   Sculpin Released
  1  102            Sculpin
  2   40            Sanddab
3 0  156            Sculpin
  1   29          Sand Bass
  2    5      Black Croaker
  3    3                ...
4 0  161            Sculpin

then it's up to you in regards to the format you want/need.

i.e

s.str.extract('(\d+)(\D+)').groupby(level=[1]).agg(list)

                          0                                                  1
0  [38, 138, 150, 156, 161]  [ Sand Bass,  Sculpin,  Sculpin Released,  Scu...
1         [16, 28, 102, 29]       [ Sculpin,  Sand Bass,  Sculpin,  Sand Bass]
2               [10, 40, 5]            [ Blacksmith,  Sanddab,  Black Croaker]
3                       [3]                                             [ ...]

or

s.str.extract('(\d+)(\D+)').unstack(1)

     0                                 1                                  
     0    1    2    3                  0           1               2     3
0   38   16   10  NaN          Sand Bass     Sculpin      Blacksmith   NaN
1  138   28  NaN  NaN            Sculpin   Sand Bass             NaN   NaN
2  150  102   40  NaN   Sculpin Released     Sculpin         Sanddab   NaN
3  156   29    5    3            Sculpin   Sand Bass   Black Croaker   ...
4  161  NaN  NaN  NaN            Sculpin         NaN             NaN   NaN

or

s.str.extract('(\d+)(\D+)').values


array([['38', ' Sand Bass'],
       ['16', ' Sculpin'],
       ['10', ' Blacksmith'],
       ['138', ' Sculpin'],
       ['28', ' Sand Bass'],
       ['150', ' Sculpin Released'],
       ['102', ' Sculpin'],
       ['40', ' Sanddab'],
       ['156', ' Sculpin'],
       ['29', ' Sand Bass'],
       ['5', ' Black Croaker'],
       ['3', ' ...'],
       ['161', ' Sculpin']], dtype=object)

which you can turn into a dict.

# actually i'd use fish : num - 
# sorry closed my ide keys can only be unique in a dict.
{num : fish for num, fish in s.str.extract('(\d+)(\D+)').values}

{'38': ' Sand Bass',
 '16': ' Sculpin',
 '10': ' Blacksmith',
 '138': ' Sculpin',
 '28': ' Sand Bass',
 '150': ' Sculpin Released',
 '102': ' Sculpin',
 '40': ' Sanddab',
 '156': ' Sculpin',
 '29': ' Sand Bass',
 '5': ' Black Croaker',
 '3': ' ...',
 '161': ' Sculpin'}

The other answers have helped but I think you're closer to getting me where I'm trying to go. I added an edit to my question to show what I'm aiming for. I'm trying to make a list of dict items keeping each original index but I'm not getting anywhere.

RichieV · Accepted Answer · 2020-08-04 21:48:20Z

1

First you need to explode the lists you made and then you can use extract with regex twice, once to match numbers and then match text.

With the data

data = '38 Sand Bass, 16 Sculpin, 10 Blacksmith\n138 Sculpin, 28 Sand Bass\n150 Sculpin Released, 102 Sculpin, 40 Sanddab\n156 Sculpin, 29 Sand Bass, 5 Black Croaker\n161 Sculpin'
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])

Do

countsdf = df['Fish Count'].str.split(', ')
countsdf = countsdf.explode('Fish Count').rename('fish').to_frame()
countsdf['count'] = countsdf.fish.str.extract('([0-9]+)')
countsdf['species'] = countsdf.fish.str.extract('([a-zA-Z]+[ a-zA-Z]*)')
countsdf.drop('fish', axis=1, inplace=True)

Output

   count           species
0     38         Sand Bass
1     16           Sculpin
2     10        Blacksmith
3    138           Sculpin
4     28         Sand Bass
5    150  Sculpin Released
6    102           Sculpin
7     40           Sanddab
8    156           Sculpin
9     29         Sand Bass
10     5     Black Croaker
11   161           Sculpin

edited Aug 4, 2020 at 21:48

answered Aug 4, 2020 at 20:39

RichieV

5,1832 gold badges13 silver badges24 bronze badges

2 Comments

user58008 Over a year ago

My autocomplete didn't show an explode function and threw this error on running: "'DataFrame' object has no attribute 'explode'"

RichieV Over a year ago

it is part of pandas since 0.25.0 pandas.pydata.org/docs/reference/api/…

user58008 · Accepted Answer · 2020-08-07 05:57:20Z

Using @Manakin's answer to get to this multi-indexed dataframe:

       0                  1
0 0   38          Sand Bass
  1   16            Sculpin
  2   10         Blacksmith
1 0  138            Sculpin
  1   28          Sand Bass
2 0  150   Sculpin Released
  1  102            Sculpin
  2   40            Sanddab
3 0  156            Sculpin
  1   29          Sand Bass
  2    5      Black Croaker
4 0  161            Sculpin

I then renamed columns, stripped leading and ending white-space for 'species', switched column order, and set index names.

s.columns = ['num','species']
s.species = s.species.str.strip()
s = s.reindex(['species','num'],axis=1)
s.index.names = ['a','b']
s.head()

        species     num
a   b       
0   0   Sand Bass   38
1         Sculpin   16
2      Blacksmith   10
1   0     Sculpin   138
1       Sand Bass   28

Then I flattened and reset indices, and dropped b index.

s_flat = s.reset_index()
s_reindexed = s_flat.set_index(['a','species'])
s_reindexed = s_reindexed.drop(columns='b')
s_reindexed.head()

               num
a   species     
0 Sand Bass     38
     Sculpin    16
  Blacksmith    10
1    Sculpin    138
   Sand Bass    28

Finally I unstacked and dropped the columnar multi-index level. I had a Null column I had to remove as well

s_reindexed = s_reindexed.unstack(1)
s_reindexed.columns = s_reindexed.columns.droplevel(0)
s_reset = s_reindexed.drop(columns=np.nan)
s_reset .head()

species     Albacore    Barracuda   Barracuda Released  Bat Ray Released    Black Croaker   Black Seabass Released  Blacksmith  Blue Perch  Bluefin Tuna    Bocaccio ...
a                                                                                   
0                NaN          NaN                  NaN               NaN              NaN                      NaN          10         NaN           NaN         NaN ...
1                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...
2                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...
3                NaN          NaN                  NaN               NaN                5                      NaN         NaN           3           NaN         NaN ...
4                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...

Collectives™ on Stack Overflow

Create new dataframe columns from one column with different values and types

4 Answers 4

Comments

1 Comment

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

Comments

1 Comment

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related