
I am trying to divide a large data set into smaller parts for an analysis. I have been using a for-loop to divide the data set before implementing the decision trees. Please see a small version of the data set below:

ANZSCO4_CODE          Skill_name              Cluster         date
  1110                  computer                 S              1
  1110                  communication            C              1
  1110                  SAS                      S              2
  1312                  IT support               S              1
  1312                  SAS                      C              2
  1312                  IT support               S              1
  1312                  SAS                      C              1

As a first step, I create an empty dictionary:

d = {}

and the lists:

 list = [1110, 1322, 2111]
 s_type = ['S','C']

Then run the following loop:

for i in list:
    d[i]=pd.DataFrame(df1[df1['ANZSCO4_CODE'].isin([i])] )

The result is a dictionary with 2 data sets inside.

As a next step I would like to subdivide the data sets into S and C. I run the following code:

for i in list:
    d[i]=pd.DataFrame(df1[df1['ANZSCO4_CODE'].isin([i])] )

    for b in s_type:
         d[i]=  d[i][d[i]['SKILL_CLUSTER_TYPE']==b]

As a final result I would expect to have 4 separate data sets: 1110 x S, 1110 x C, 1312 x S and 1312 x C.

However, when I run the second snippet, I get only 2 data sets inside the dictionary, and both are empty.

2 Comments
  • Can you please show me what is in the list variable? Commented Jul 25, 2018 at 4:55
  • @user2906838, sorry I missed that. It is edited now. Commented Jul 25, 2018 at 5:06

2 Answers

Maybe something like this works:

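from collections import defaultdict

# map each code to a plain dict of DataFrames keyed by cluster type;
# defaultdict(pd.DataFrame) raises a ValueError on the assignment below
d = defaultdict(dict)

# don't name your list "list"
anzco_list = [1110, 1312]
s_type = ['S','C']

for i in anzco_list:
    for b in s_type:
        d[i][b] = df1[(df1['ANZSCO4_CODE'] == i) & (df1['SKILL_CLUSTER_TYPE'] == b)]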

Then you can access your DataFrames like this:

d[1110]['S']
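
As a minimal, self-contained sketch of the nested-dictionary idea: the block below rebuilds the sample data from the question and stores one DataFrame per code and cluster type. It assumes the cluster column is called Cluster, as in the sample table (the asker's real data may use a different name), and the name nested is purely illustrative.

```python
from collections import defaultdict

import pandas as pd

# Sample data copied from the question; the column name 'Cluster'
# is an assumption taken from the sample table.
df1 = pd.DataFrame({
    'ANZSCO4_CODE': [1110, 1110, 1110, 1312, 1312, 1312, 1312],
    'Skill_name': ['computer', 'communication', 'SAS', 'IT support',
                   'SAS', 'IT support', 'SAS'],
    'Cluster': ['S', 'C', 'S', 'S', 'C', 'S', 'C'],
    'date': [1, 1, 2, 1, 2, 1, 1],
})

# Each code maps to an ordinary dict, which in turn maps a cluster
# type to the matching subset of rows.
nested = defaultdict(dict)
for code in [1110, 1312]:
    for cluster in ['S', 'C']:
        mask = (df1['ANZSCO4_CODE'] == code) & (df1['Cluster'] == cluster)
        nested[code][cluster] = df1[mask]

print(nested[1110]['S'])
```

Keeping the inner container a plain dict means the assignment never tries to add a DataFrame as a column of another DataFrame.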

2 Comments

Thanks for your support. I'm getting the following error: ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
You can use jezrael's answer below. That seems like a better way to do it.

I think the DataFrames were empty because the data does not contain the values from the list, called L here. (Don't use list as a variable name, because it shadows the Python built-in.)
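
Another possible cause, sketched here under the assumption that the sample data from the question is representative and that the cluster column is named Cluster: the inner loop in the question overwrites d[i] on every pass, so the 'C' filter is applied to rows that were already reduced to 'S'. The names codes and code are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    'ANZSCO4_CODE': [1110, 1110, 1110, 1312, 1312, 1312, 1312],
    'Skill_name': ['computer', 'communication', 'SAS', 'IT support',
                   'SAS', 'IT support', 'SAS'],
    'Cluster': ['S', 'C', 'S', 'S', 'C', 'S', 'C'],
    'date': [1, 1, 2, 1, 2, 1, 1],
})

codes = [1110, 1312]
s_type = ['S', 'C']

d = {}
for code in codes:
    d[code] = df1[df1['ANZSCO4_CODE'] == code]
    for b in s_type:
        # Each pass replaces d[code]: first only 'S' rows are kept, then
        # 'C' rows are searched for among those 'S' rows, leaving nothing.
        d[code] = d[code][d[code]['Cluster'] == b]

# Two keys remain, and both frames have zero rows.
print({code: len(frame) for code, frame in d.items()})
```

Storing each combination under its own key, as in the code that follows, avoids the overwrite.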

from itertools import product

L = [1110, 1312, 2111]
s_type = ['S','C']

Then create all combinations of the two lists:

comb = list(product(L, s_type))
print (comb)
[(1110, 'S'), (1110, 'C'), (1312, 'S'), (1312, 'C'), (2111, 'S'), (2111, 'C')]

Finally, create a dictionary of DataFrames:

d = {}
for i, j in comb:
    d['{}x{}'.format(i, j)] = df1[(df1['ANZSCO4_CODE'] == i) & (df1['Cluster'] == j)]

Or use a dictionary comprehension:

d = {'{}x{}'.format(i, j): df1[(df1['ANZSCO4_CODE'] == i) & (df1['Cluster'] == j)] 
      for i, j in comb}

print (d['1110xS'])
   ANZSCO4_CODE Skill_name Cluster
0          1110   computer       S
2          1110        SAS       S

EDIT:

If you need all combinations that actually occur in the data for the given columns, use groupby:

d = {'{}x{}x{}'.format(i,j,k): df2 
      for (i,j, k), df2 in df1.groupby(['ANZSCO4_CODE','Cluster','date'])}
print (d)
{'1110xCx1':    ANZSCO4_CODE     Skill_name Cluster  date
1          1110  communication       C     1, '1110xSx1':    ANZSCO4_CODE Skill_name Cluster  date
0          1110   computer       S     1, '1110xSx2':    ANZSCO4_CODE Skill_name Cluster  date
2          1110        SAS       S     2, '1312xCx1':    ANZSCO4_CODE Skill_name Cluster  date
6          1312        SAS       C     1, '1312xCx2':    ANZSCO4_CODE Skill_name Cluster  date
4          1312        SAS       C     2, '1312xSx1':    ANZSCO4_CODE  Skill_name Cluster  date
3          1312  IT support       S     1
5          1312  IT support       S     1}

print (d.keys())
dict_keys(['1110xCx1', '1110xSx1', '1110xSx2', '1312xCx1', '1312xCx2', '1312xSx1'])
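
A small variation, offered as a sketch rather than a recommendation: the group keys that groupby yields are already tuples, so they can be used as dictionary keys directly instead of being formatted into strings. The sample data is rebuilt from the question, and the name groups is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    'ANZSCO4_CODE': [1110, 1110, 1110, 1312, 1312, 1312, 1312],
    'Skill_name': ['computer', 'communication', 'SAS', 'IT support',
                   'SAS', 'IT support', 'SAS'],
    'Cluster': ['S', 'C', 'S', 'S', 'C', 'S', 'C'],
    'date': [1, 1, 2, 1, 2, 1, 1],
})

# Iterating over a groupby object yields (key, sub-DataFrame) pairs;
# with three grouping columns each key is a 3-tuple.
groups = {key: frame
          for key, frame in df1.groupby(['ANZSCO4_CODE', 'Cluster', 'date'])}

# Look up one combination by its tuple key.
print(groups[(1312, 'S', 1)])

# Combinations that never occur in the data are simply absent.
print((1110, 'C', 2) in groups)
```

Tuple keys keep the original types of the values, which can be convenient when the keys are later used for filtering or sorting.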

A different approach, if you need to process each group, is to use GroupBy.apply:

def func(x):
    print (x)
    # some code to process each group
    return x

df2 = df1.groupby(['ANZSCO4_CODE','Cluster','date']).apply(func)
print (df2)

Output of the print calls inside func:

   ANZSCO4_CODE     Skill_name Cluster  date
1          1110  communication       C     1
   ANZSCO4_CODE     Skill_name Cluster  date
1          1110  communication       C     1
   ANZSCO4_CODE Skill_name Cluster  date
0          1110   computer       S     1
   ANZSCO4_CODE Skill_name Cluster  date
2          1110        SAS       S     2
   ANZSCO4_CODE Skill_name Cluster  date
6          1312        SAS       C     1
   ANZSCO4_CODE Skill_name Cluster  date
4          1312        SAS       C     2
   ANZSCO4_CODE  Skill_name Cluster  date
3          1312  IT support       S     1
5          1312  IT support       S     1
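
To illustrate what a per-group function might return, here is a hedged sketch that selects a single column before calling apply and collects the distinct skill names of each group. The helper unique_skills is hypothetical and stands in for whatever processing is actually needed. The sample data is rebuilt from the question.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    'ANZSCO4_CODE': [1110, 1110, 1110, 1312, 1312, 1312, 1312],
    'Skill_name': ['computer', 'communication', 'SAS', 'IT support',
                   'SAS', 'IT support', 'SAS'],
    'Cluster': ['S', 'C', 'S', 'S', 'C', 'S', 'C'],
    'date': [1, 1, 2, 1, 2, 1, 1],
})


def unique_skills(skills):
    """Hypothetical per-group step: sorted distinct skill names."""
    return sorted(set(skills))


# Selecting 'Skill_name' first means the function receives one Series
# per group; the result is a Series indexed by the three group keys.
result = (df1.groupby(['ANZSCO4_CODE', 'Cluster', 'date'])['Skill_name']
             .apply(unique_skills))

print(result)
```

Each row of result corresponds to one combination of code, cluster and date that occurs in the data.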

6 Comments

Hi, I'm getting the following error: TypeError: 'list' object is not callable when calling comb = list(product(L, s_type))
Try naming your list something other than list.
@Ian_De_Oliveira - The problem is that a variable named list was defined earlier. The solution is to restart your IDE or run import builtins followed by list = builtins.list.
@Ian_De_Oliveira - And this is exactly why you should not use list as a variable name ;)
@jezrael and @Ashish Acharya, thanks for both responses, and thanks for advising me not to use list. @jezrael, in a hypothetical scenario, if I add a date variable, would I be able to do combinations with 3 constraints?