3

I'm trying to create new columns by the fish species name and have the integer as the value, keeping the indexing to do a dataframe join afterwards.

import pandas as pd
df = pd.read_csv("fishCounts.csv",index_col=0)
countsdf = df[["Fish Count"]].copy()
countsdf.head()
    
Fish Count
0   38 Sand Bass, 16 Sculpin, 10 Blacksmith
1   138 Sculpin, 28 Sand Bass
2   150 Sculpin Released, 102 Sculpin, 40 Sanddab
3   156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3 ...
4   161 Sculpin

countsdf.columns = ["fish"]
countsdf.fish = countsdf.fish.str.split(", ", expand=False)
countsdf.head()

fish
0   [38 Sand Bass, 16 Sculpin, 10 Blacksmith]
1   [138 Sculpin, 28 Sand Bass]
2   [150 Sculpin Released, 102 Sculpin, 40 Sanddab]
3   [156 Sculpin, 29 Sand Bass, 5 Black Croaker, 3...
4   [161 Sculpin]

Here's where I'm not sure where to go. Iterate through the dataframe rows? Make a list of dictionaries? Could I have imported the data differently to make this easier?

Edit: This is what I'm trying to get to.

  Sand Bass   Sculpin   Blacksmith   Sculpin Released  Sanddab  Black Croaker
0        38        16           10
1        28        138
2                  102                            150       40
3        29        156                                                      5
4                  161
3
  • can you add your data as text? just copy it and paste it in Commented Aug 4, 2020 at 20:30
  • 1
    @Manakin Copy and pasting had bad formatting, is this ok? Commented Aug 4, 2020 at 20:51
  • don't worry about formatting - since we paste the data back into the IDE based on a delimiter - (comma, space, tab etc) Commented Aug 4, 2020 at 21:13

4 Answers 4

2

something similar to @Manakin

Turn Fish Count int list

df['Fish Count']=df['Fish Count'].str.split(',')

Explode to separate each fish with its id

df2=df.explode('Fish Count')

Create dictionary. Here, I use list comprehension to derive key and value after splitting values in Fish Count by white space after digit

{i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}

Outcome

{'38': 'Sand Bass',
 ' 16': 'Sculpin',
 ' 10': 'Blacksmith',
 '138': 'Sculpin',
 ' 28': 'Sand Bass',
 '150': 'Sculpin Released',
 ' 102': 'Sculpin',
 ' 40': 'Sanddab',
 '156': 'Sculpin',
 ' 29': 'Sand Bass',
 ' 5': 'Black Croaker',
 '161': 'Sculpin'}

Can print if needed

print(pd.DataFrame.from_dict({i:j for i,j in df2['Fish Count'].str.split(r'(?<=\d)\s')}, orient='index'))

                     0
38           Sand Bass
 16            Sculpin
 10         Blacksmith
138            Sculpin
 28          Sand Bass
150   Sculpin Released
 102           Sculpin
 40            Sanddab
156            Sculpin
 29          Sand Bass
 5       Black Croaker
161            Sculpin
Sign up to request clarification or add additional context in comments.

Comments

2

We can use str.split and str.extract with stack:

s = df['Fish Count'].str.split(',',expand=True).stack()
s.str.extract('(\d+)(\D+)')

yields -

       0                  1
0 0   38          Sand Bass
  1   16            Sculpin
  2   10         Blacksmith
1 0  138            Sculpin
  1   28          Sand Bass
2 0  150   Sculpin Released
  1  102            Sculpin
  2   40            Sanddab
3 0  156            Sculpin
  1   29          Sand Bass
  2    5      Black Croaker
  3    3                ...
4 0  161            Sculpin

then it's up to you in regards to the format you want/need.

i.e

s.str.extract('(\d+)(\D+)').groupby(level=[1]).agg(list)

                          0                                                  1
0  [38, 138, 150, 156, 161]  [ Sand Bass,  Sculpin,  Sculpin Released,  Scu...
1         [16, 28, 102, 29]       [ Sculpin,  Sand Bass,  Sculpin,  Sand Bass]
2               [10, 40, 5]            [ Blacksmith,  Sanddab,  Black Croaker]
3                       [3]                                             [ ...]

or

s.str.extract('(\d+)(\D+)').unstack(1)

     0                                 1                                  
     0    1    2    3                  0           1               2     3
0   38   16   10  NaN          Sand Bass     Sculpin      Blacksmith   NaN
1  138   28  NaN  NaN            Sculpin   Sand Bass             NaN   NaN
2  150  102   40  NaN   Sculpin Released     Sculpin         Sanddab   NaN
3  156   29    5    3            Sculpin   Sand Bass   Black Croaker   ...
4  161  NaN  NaN  NaN            Sculpin         NaN             NaN   NaN

or

s.str.extract('(\d+)(\D+)').values


array([['38', ' Sand Bass'],
       ['16', ' Sculpin'],
       ['10', ' Blacksmith'],
       ['138', ' Sculpin'],
       ['28', ' Sand Bass'],
       ['150', ' Sculpin Released'],
       ['102', ' Sculpin'],
       ['40', ' Sanddab'],
       ['156', ' Sculpin'],
       ['29', ' Sand Bass'],
       ['5', ' Black Croaker'],
       ['3', ' ...'],
       ['161', ' Sculpin']], dtype=object)

which you can turn into a dict.

# actually i'd use fish : num - 
# sorry closed my ide keys can only be unique in a dict.
{num : fish for num, fish in s.str.extract('(\d+)(\D+)').values}

{'38': ' Sand Bass',
 '16': ' Sculpin',
 '10': ' Blacksmith',
 '138': ' Sculpin',
 '28': ' Sand Bass',
 '150': ' Sculpin Released',
 '102': ' Sculpin',
 '40': ' Sanddab',
 '156': ' Sculpin',
 '29': ' Sand Bass',
 '5': ' Black Croaker',
 '3': ' ...',
 '161': ' Sculpin'}

1 Comment

The other answers have helped but I think you're closer to getting me where I'm trying to go. I added an edit to my question to show what I'm aiming for. I'm trying to make a list of dict items keeping each original index but I'm not getting anywhere.
1

First you need to explode the lists you made and then you can use extract with regex twice, once to match numbers and then match text.

With the data

data = '38 Sand Bass, 16 Sculpin, 10 Blacksmith\n138 Sculpin, 28 Sand Bass\n150 Sculpin Released, 102 Sculpin, 40 Sanddab\n156 Sculpin, 29 Sand Bass, 5 Black Croaker\n161 Sculpin'
df = pd.DataFrame(data.split('\n'), columns=['Fish Count'])

Do

countsdf = df['Fish Count'].str.split(', ')
countsdf = countsdf.explode('Fish Count').rename('fish').to_frame()
countsdf['count'] = countsdf.fish.str.extract('([0-9]+)')
countsdf['species'] = countsdf.fish.str.extract('([a-zA-Z]+[ a-zA-Z]*)')
countsdf.drop('fish', axis=1, inplace=True)

Output

   count           species
0     38         Sand Bass
1     16           Sculpin
2     10        Blacksmith
3    138           Sculpin
4     28         Sand Bass
5    150  Sculpin Released
6    102           Sculpin
7     40           Sanddab
8    156           Sculpin
9     29         Sand Bass
10     5     Black Croaker
11   161           Sculpin

2 Comments

My autocomplete didn't show an explode function and threw this error on running: "'DataFrame' object has no attribute 'explode'"
it is part of pandas since 0.25.0 pandas.pydata.org/docs/reference/api/…
0

Using @Manakin's answer to get to this multi-indexed dataframe:

       0                  1
0 0   38          Sand Bass
  1   16            Sculpin
  2   10         Blacksmith
1 0  138            Sculpin
  1   28          Sand Bass
2 0  150   Sculpin Released
  1  102            Sculpin
  2   40            Sanddab
3 0  156            Sculpin
  1   29          Sand Bass
  2    5      Black Croaker
4 0  161            Sculpin

I then renamed columns, stripped leading and ending white-space for 'species', switched column order, and set index names.

s.columns = ['num','species']
s.species = s.species.str.strip()
s = s.reindex(['species','num'],axis=1)
s.index.names = ['a','b']
s.head()

        species     num
a   b       
0   0   Sand Bass   38
1         Sculpin   16
2      Blacksmith   10
1   0     Sculpin   138
1       Sand Bass   28

Then I flattened and reset indices, and dropped b index.

s_flat = s.reset_index()
s_reindexed = s_flat.set_index(['a','species'])
s_reindexed = s_reindexed.drop(columns='b')
s_reindexed.head()

               num
a   species     
0 Sand Bass     38
     Sculpin    16
  Blacksmith    10
1    Sculpin    138
   Sand Bass    28

Finally I unstacked and dropped the columnar multi-index level. I had a Null column I had to remove as well

s_reindexed = s_reindexed.unstack(1)
s_reindexed.columns = s_reindexed.columns.droplevel(0)
s_reset = s_reindexed.drop(columns=np.nan)
s_reset .head()

species     Albacore    Barracuda   Barracuda Released  Bat Ray Released    Black Croaker   Black Seabass Released  Blacksmith  Blue Perch  Bluefin Tuna    Bocaccio ...
a                                                                                   
0                NaN          NaN                  NaN               NaN              NaN                      NaN          10         NaN           NaN         NaN ...
1                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...
2                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...
3                NaN          NaN                  NaN               NaN                5                      NaN         NaN           3           NaN         NaN ...
4                NaN          NaN                  NaN               NaN              NaN                      NaN         NaN         NaN           NaN         NaN ...

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.