
I have a df that looks like this:

import pandas as pd

df = pd.DataFrame({
    'job_title':['Senior Data Scientist', 'Junior Data Analyst', 'Data Engineer Intern', 'Data Engieneer/Architect', 'Data Analyst/Visualisation'],
    'location':['Berlin', 'Frankfurt', 'Munich','Frankfurt', 'Munich'],
    'job_desc':['something something Python, R, Tableau something', 'something R and SQL',
                 'something Power Bi, Excel and Python','something Power Bi, Excel and Python somthing', 'Power BI and R something']})
        

My objective is to plot the skills that appear in the job descriptions (the job_desc column), grouped by the job title in job_title. What's important is that the job titles in job_title first need to be filtered, somehow, into the three roles I mention below.

My idea was to do the following:

  1. create sub data frames according to the job title for Data Scientist, Data Analyst and Data Engineer
  2. create new dfs from those sub data frames that count the skills in job_desc
  3. plot the skills in a bar plot with three sub bar plots, one per role

To do this I have done the following:

1.)

# creating sub datasets according to the three roles above to look further into the different skillset

# data analyst dataset
dfa = df[df['job_title'].str.contains('Data Ana')]

# data scientist dataset
dfs = df[df['job_title'].str.contains('Data Sci')]

# data engineer dataset
dfe = df[df['job_title'].str.contains('Data Eng')]

2.) Here I created a loop and stored the results in a nested dictionary. At first I tried to store the data from the loop directly in new data frames, but I read here that it is better to use dictionaries.

# looping through each sub dataset to get the skill count
# (named frames rather than list, so the built-in list is not shadowed)
frames = [dfa, dfs, dfe]

# creating an empty dictionary to store the new information in
dict_of_df = {}

for i, frame in enumerate(frames):

    # counting the skills in each df of the list
    python = frame.job_desc.str.count('Python').sum()
    R = frame.job_desc.str.count('R ').sum()  # only matches 'R' followed by a space
    tableau = frame.job_desc.str.count('Tableau').sum()
    pbi = frame.job_desc.str.count('Power BI').sum()
    excel = frame.job_desc.str.count('Excel').sum()
    sql = frame.job_desc.str.count('SQL').sum()

    # creating a dictionary with the skills and their counts
    # (counts must be listed in the same order as skills)
    skills = ['python', 'R', 'pbi', 'tableau', 'excel', 'sql']
    counts = [python, R, pbi, tableau, excel, sql]
    dic = {'Skills': skills, 'Counts': counts}

    # storing the information in the dictionary
    dict_of_df['df_{}'.format(i)] = dic

This results in the following output:

dict_of_df = {'df_0': {'Skills': ['python', 'R', 'pbi', 'tableau', 'excel', 'sql'], 'Counts': [0, 2, 1, 0, 0, 1]}, 'df_1': {'Skills': ['python', 'R', 'pbi', 'tableau', 'excel', 'sql'], 'Counts': [1, 0, 0, 1, 0, 0]}, 'df_2': {'Skills': ['python', 'R', 'pbi', 'tableau', 'excel', 'sql'], 'Counts': [2, 0, 0, 0, 2, 0]}}

The dictionary contains the correct information, and my desired output would be three dfs, built from df_0, df_1 and df_2, in this format:

    Skills  Counts
0   python       0
1        R       1
2      pbi       0
3  tableau       0
4    excel       0
5      sql       1

But I am not able to do this. I tried to apply what I found in these posts:

Creating multiple dataframes from a dictionary in a loop

Construct pandas DataFrame from items in nested dictionary

Construct a pandas DataFrame from items in a nested dictionary with lists as inner values

Python Pandas: Convert nested dictionary to dataframe

However, all of the above posts use different dictionary structures, as mine seems to be doubly nested. I also have the impression that my approach may be overcomplicating things.
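For the specific conversion asked about here, each inner dictionary already has a shape that pd.DataFrame accepts (column name mapped to a list of values), so a dict comprehension may be all that is needed. A minimal sketch, assuming dict_of_df has the structure shown above; the counts below are illustrative only, and the name frames_by_key is made up for this example:

```python
import pandas as pd

# Illustrative stand-in with the same structure as dict_of_df in the question.
skills = ['python', 'R', 'pbi', 'tableau', 'excel', 'sql']
dict_of_df = {
    'df_0': {'Skills': skills, 'Counts': [0, 2, 1, 0, 0, 1]},
    'df_1': {'Skills': skills, 'Counts': [1, 0, 0, 1, 0, 0]},
    'df_2': {'Skills': skills, 'Counts': [2, 0, 0, 0, 2, 0]},
}

# Each inner dict maps a column name to a list, which pd.DataFrame accepts directly.
frames_by_key = {name: pd.DataFrame(inner) for name, inner in dict_of_df.items()}

print(frames_by_key['df_0'])
```

Keeping the data frames in a dictionary, rather than in three separate variables, means they can still be looped over later, for example when plotting.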

1 Answer

Don't overcomplicate things; here is a simplified approach:

skills = ['python', 'R', 'pbi', 'tableau', 'excel', 'sql']
pattern = r'(?i)\b(%s)\b' % '|'.join(skills)

s = df.set_index('job_title')['job_desc'].str.extractall(pattern)[0].droplevel(1) # -- step 1
s = pd.crosstab(s.index, s, rownames=['job_title'], colnames=['skills']) # -- step 2

Explained

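Build a regex pattern from the skills, then use extractall to find all matching occurrences in each row of the job description column. The [0] selects the single capture group, and droplevel(1) drops the match-number level that extractall adds to the index.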

# -- step 1

job_title
Senior Data Scientist     Python
Senior Data Scientist          R
Senior Data Scientist    Tableau
Junior Data Analyst            R
Junior Data Analyst          SQL
Data Engineer Intern       Excel
Data Engineer Intern      Python
Name: 0, dtype: object
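To make the pattern in step 1 less opaque, here is a small sketch that takes it apart. It assumes a two-row toy frame (the name toy and its contents are made up for illustration, reusing strings from the question) and uses only the standard re module plus the same pandas calls as the answer:

```python
import re
import pandas as pd

skills = ['python', 'R', 'tableau', 'excel', 'sql']

# '|'.join(skills) gives 'python|R|tableau|excel|sql' (alternation: match any one).
# %s drops that string into the capture group (...), \b on both sides requires
# word boundaries, and (?i) makes the whole pattern case-insensitive.
pattern = r'(?i)\b(%s)\b' % '|'.join(skills)

# Plain-regex view of the same pattern on a single string.
found = re.findall(pattern, 'something Python, R, Tableau something')
print(found)  # ['Python', 'R', 'Tableau']

# extractall returns one row per match, indexed by (original index, match number);
# [0] picks the capture group and droplevel(1) discards the match number.
toy = pd.DataFrame({
    'job_title': ['Senior Data Scientist', 'Junior Data Analyst'],
    'job_desc': ['something Python, R, Tableau something', 'something R and SQL'],
})
matches = toy.set_index('job_title')['job_desc'].str.extractall(pattern)[0].droplevel(1)
print(matches)
```

Note that an entry such as 'pbi' only matches the literal text pbi. To catch "Power BI" in a description, the skills list would need an entry like 'power bi' instead.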

Create a frequency table using crosstab

# -- step 2

skills                 Excel  Python  R  SQL  Tableau
job_title                                            
Data Engineer Intern       1       1  0    0        0
Junior Data Analyst        0       0  1    1        0
Senior Data Scientist      0       1  1    0        1
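The counting step can also be tried in isolation. A small sketch, assuming a hand-built Series that mirrors part of the step 1 output (the names matches and table are made up here):

```python
import pandas as pd

# Stand-in for the step 1 result: one matched skill per row, indexed by job title.
matches = pd.Series(
    ['Python', 'R', 'Tableau', 'R', 'SQL'],
    index=pd.Index(
        ['Senior Data Scientist'] * 3 + ['Junior Data Analyst'] * 2,
        name='job_title',
    ),
)

# crosstab counts how often each (job title, skill) pair occurs;
# pairs that never occur are filled with 0.
table = pd.crosstab(matches.index, matches, rownames=['job_title'], colnames=['skills'])
print(table)
```

If several rows share the same job title, their matches are added together in one row of the table, which is what makes grouping by role possible once the titles are normalised.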

That's it. Depending on how you want to visualize the data, you can use either a bar plot or a heatmap. Personally, I would prefer a heatmap:

import seaborn as sns

sns.heatmap(s, cmap='Blues')
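Since the question asked for a bar plot with one sub plot per role, here is a rough sketch of that alternative. It assumes s is the crosstab from step 2 (rebuilt by hand below from the output shown above) and relies on pandas' DataFrame.plot with subplots=True; the layout and figsize values are arbitrary choices:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the crosstab result s, copied from the step 2 output.
s = pd.DataFrame(
    {'Excel': [1, 0, 0], 'Python': [1, 0, 1], 'R': [0, 1, 1],
     'SQL': [0, 1, 0], 'Tableau': [0, 0, 1]},
    index=pd.Index(
        ['Data Engineer Intern', 'Junior Data Analyst', 'Senior Data Scientist'],
        name='job_title',
    ),
)
s.columns.name = 'skills'

# Transpose so skills sit on the x-axis; subplots=True draws one panel per job title.
axes = s.T.plot(kind='bar', subplots=True, layout=(1, 3), sharey=True,
                legend=False, figsize=(12, 3))
plt.tight_layout()
```

Calling plt.show() afterwards displays the figure in an interactive session.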


2 Comments

Thank you very much, @Shubam Sharma. Unfortunately, your code does not work for my df as it stands, because the job titles are all fairly different. I updated the df in the question to make this clearer. To use your code, I first need to replace the strings in the job_title column with either Data Scientist, Data Analyst or Data Engineer. So far I tried the following: df1 = df[np.logical_or(df_c['job_title'].str.contains('Data Ana'), df['job_title'].str.contains('Data Sci'), df['job_title'].str.contains('Data Eng'))], but this did not work.
So I now used: df.job_title = df.job_title.apply(lambda x: 'Data Analyst' if 'Data Ana' in x else x); df.job_title = df.job_title.apply(lambda x: 'Data Scientist' if 'Data Sci' in x else x); df.job_title = df.job_title.apply(lambda x: 'Data Engineer' if 'Data En' in x else x), and then your code, and got the output I wanted! Thanks again. Could you please explain in more detail what r'(?i)\b(%s)\b' % '|'.join(skills) in the pattern expression means and does, and also the step 1 part of your code?
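The title clean-up described in the comment above could also be done in one pass. A minimal sketch, assuming the five job titles from the question; the roles mapping, the mask variable and the role column are hypothetical names introduced only for this example:

```python
import pandas as pd

df = pd.DataFrame({
    'job_title': ['Senior Data Scientist', 'Junior Data Analyst', 'Data Engineer Intern',
                  'Data Engieneer/Architect', 'Data Analyst/Visualisation'],
})

# Hypothetical mapping from a title fragment to the role label it stands for.
roles = {'Data Sci': 'Data Scientist', 'Data Ana': 'Data Analyst', 'Data Eng': 'Data Engineer'}
fragments = '|'.join(roles)  # 'Data Sci|Data Ana|Data Eng'

# np.logical_or takes exactly two inputs (a third positional argument is treated
# as the output array), so a single str.contains with an alternation is used instead.
mask = df['job_title'].str.contains(fragments)
df1 = df[mask].copy()

# Extract whichever fragment matched, then translate it to the role label.
df1['role'] = df1['job_title'].str.extract('(%s)' % fragments, expand=False).map(roles)
print(df1)
```

Using the role column in place of job_title in step 1 should then group the skills by the three roles rather than by the raw titles.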
