I have a dataframe composed of text job descriptions, and 3 empty columns

   index   job_description                 level_1      level_2        level_3
    0      this job requires masters in..    0             0              0
    1      bachelor degree needed for..      0             0              0
    2      ms is preferred or phd..          0             0              0

I'm trying to go through each job description string and count the frequency of each degree level that was mentioned in the job description. A sample output should look like this.

   index   job_description                 level_1      level_2        level_3
    0      this job requires masters in..    0             1              0
    1      bachelor degree needed for..      1             0              0
    2      ms is preferred or phd..          0             1              1

I created the dictionaries to do the comparison as seen below, but I'm somewhat clueless on how I can look for those words in the strings of the dataframe "job description" column and populate the dataframe columns depending on whether the words exist or not.

my_dict_1 = dict.fromkeys(['bachelors', 'bachelor', 'ba', 'science degree',
                           'bs', 'engineering degree'], 1)
my_dict_2 = dict.fromkeys(['masters', 'ms', 'master'], 1)
my_dict_3 = dict.fromkeys(['phd','p.h.d'], 1)

I really appreciate the support on this..

3 Answers


How about something like this?

Since each of your three dictionaries corresponds to a different column you want to create, we can build another dictionary with the soon-to-be column names as keys and the strings to search for at each level as values. (You don't actually need dictionaries for the my_dict_<x> objects; a set would work just as well, since only the keys matter, but it's not a huge deal.)

>>> lookup = {'level_1': my_dict_1, 'level_2': my_dict_2, 'level_3': my_dict_3}
>>> lookup
{'level_1': {'bachelors': 1, 'bachelor': 1, 'ba': 1, 'science degree': 1, 'bs': 1, 'engineering degree': 1}, 'level_2': {'masters': 1, 'ms': 1, 'master': 1}, 'level_3': {'phd': 1, 'p.h.d': 1}}

Then go through each proposed column in the dictionary you just created and assign the new column, checking row by row whether at least one of that level's keywords appears in the job description...

>>> for level, values in lookup.items():
...     df[level] = df['job_description'].apply(lambda x: 1 if any(v in x for v in values) else 0)
... 
>>> df
              job_description  level_1  level_2  level_3
0     masters degree required        0        1        0
1  bachelor's degree required        1        0        0
2    bachelor degree required        1        0        0
3                phd required        0        0        1
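One caveat with the plain `v in x` substring test: it also fires on partial words, e.g. 'ms' inside 'systems' (while 'ba' inside 'bachelor' merely double-counts harmlessly). A minimal sketch of a word-boundary variant using `str.contains`, where the sample rows and the single-level `lookup` are just illustrative:

```python
import re

import pandas as pd

df = pd.DataFrame({"job_description": [
    "masters degree required",
    "systems analyst needed",   # contains "ms" only as a substring
]})

# illustrative single-level lookup, mirroring the level_2 keywords above
lookup = {"level_2": {"masters", "ms", "master"}}

for level, values in lookup.items():
    # \b word boundaries keep "ms" from matching inside "systems"
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in values) + r")\b"
    df[level] = df["job_description"].str.contains(pattern, case=False).astype(int)
```

With the plain substring check, the second row would wrongly get flagged via 'ms'; the anchored pattern leaves it at 0.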

Another solution, using scikit-learn's CountVectorizer class, which counts the frequencies of tokens (words, basically) occurring in strings:

>>> from sklearn.feature_extraction.text import CountVectorizer

Specify a particular vocabulary - forget about all other words that aren't "academic credential" keywords:

>>> vec = CountVectorizer(vocabulary={value for values in lookup.values() for value in values})
>>> vec.vocabulary
{'master', 'p.h.d', 'ba', 'ms', 'engineering degree', 'masters', 'phd', 'bachelor', 'bachelors', 'bs', 'science degree'}

Fit that transformer to the text iterable, df['job_description']:

>>> result = vec.fit_transform(df['job_description'])

Taking a deeper look at the results:

>>> pd.DataFrame(result.toarray(), columns=vec.get_feature_names_out())
   ba  bachelor  bachelors  bs  engineering degree  master  masters  ms  p.h.d  phd  science degree
0   0         0          0   0                   0       0        1   0      0    0               0
1   0         1          0   0                   0       0        0   0      0    0               0
2   0         1          0   0                   0       0        0   0      0    0               0
3   0         0          0   0                   0       0        0   0      0    1               0
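One caveat: with CountVectorizer's default ngram_range=(1, 1), the text is tokenized into single words, so multi-word vocabulary entries like 'science degree' and 'engineering degree' can never be counted (which is why those columns stay zero above), and 'p.h.d' has a similar problem because the default token pattern splits on punctuation. Passing ngram_range=(1, 2) at least covers the two-word phrases. A small sketch with an illustrative subset of the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# a unigram and a bigram from the vocabulary (illustrative subset)
vocab = ["bs", "science degree"]

# ngram_range=(1, 2) makes the vectorizer consider two-word phrases as well,
# so "science degree" can actually be counted; with the default
# ngram_range=(1, 1) that column would always be zero
vec = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2))
counts = vec.fit_transform(["a science degree or a bs is required"])
```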

This last approach might require a bit more work if you want to get back to your level_<x> column structure, but I thought I'd just show it as a different way of thinking about encoding those datapoints.



A slightly different approach is to store keywords and job descriptions as sets, and then compute set intersections. You could generate the intersection matrix compactly by vectorizing set.intersection:

import io

import numpy as np
import pandas as pd

df = pd.read_csv(
    io.StringIO(  # pd.compat.StringIO was removed; use the stdlib io module
        "index  job_description  level_1  level_2  level_3\n"
        "0  this job requires masters in..  0  0  0\n"
        "1  bachelor degree needed for..  0  0  0\n"
        "2  ms is preferred or phd ..  0  0  0"
    ),
    sep=r"  +",
    engine="python",
)

# pd.np was deprecated in pandas 0.25; import numpy directly instead
levels = np.array(
    [
        {"bachelors", "bachelor", "ba", "science degree", "bs", "engineering degree"},
        {"masters", "ms", "master"},
        {"phd", "p.h.d"},
    ]
)

df[["level_1", "level_2", "level_3"]] = (
    np.vectorize(set.intersection)(
        df.job_description.str.split().apply(set).values[:, None], levels
    )
    .astype(bool)
    .astype(int)
)

   index                 job_description  level_1  level_2  level_3
0      0  this job requires masters in..        0        1        0
1      1    bachelor degree needed for..        1        0        0
2      2       ms is preferred or phd ..        0        1        1
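One caveat: the whitespace split only works here because the sample text has a space before the dots ("phd .."); a token like "phd.." would not intersect with the keyword set. Stripping punctuation that clings to each word makes the set approach more robust. A sketch, where the strip character set is an illustrative choice that deliberately leaves the interior dots of "p.h.d" alone:

```python
import pandas as pd

df = pd.DataFrame({"job_description": [
    "ms is preferred or phd..",   # no space before the dots this time
    "bachelor degree needed",
]})
level_3 = {"phd", "p.h.d"}

# strip punctuation clinging to each word ("phd.." -> "phd") while leaving
# interior dots, as in the keyword "p.h.d", untouched
tokens = df["job_description"].str.split().apply(
    lambda words: {w.strip(".,;:!?") for w in words}
)
df["level_3"] = tokens.apply(lambda s: int(bool(s & level_3)))
```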



I think we can do it like this:

# create a level based mapper dict
mapper = {'level_1':['bachelors', 'bachelor', 'ba','science degree','bs','engineering degree'],
          'level_2': ['masters', 'ms', 'master'],
          'level_3': ['phd','p.h.d']}

# convert the lists to sets
mapper = {k: set(v) for k, v in mapper.items()}

# remove dots from description (regex=False so '.' is treated as a literal
# dot, not as the regex "match anything" wildcard)
df['description'] = df['description'].str.replace('.', '', regex=False)

# check if any word of description is available in the mapper dict
df['flag'] = df['description'].str.split(' ').apply(set).apply(lambda x: [k for k,v in mapper.items() if any([y for y in x if y in v])])

# convert the list into new rows; the 'level_2' dropped here is just the
# name reset_index() gives the stacked index level, not our level_2 flag
df1 = df.set_index(['index','description'])['flag'].apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
df1.rename(columns={0:'flag'}, inplace=True)

# add a flag column , this value will be use as filler
df1['val'] = 1

# convert the data into wide format
df1 = df1.set_index(['index','description','flag'])['val'].unstack(fill_value=0).reset_index()
df1.columns.name = None

print(df1)

   index                   description  level_1  level_2  level_3
0      0  this job requires masters in        0        1        0
1      1    bachelor degree needed for        1        0        0
2      2        ms is preferred or phd        0        1        1
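As a side note, the stack/unstack dance in the middle can be shortened with explode plus pd.crosstab. A sketch on hand-made sample data mirroring the 'flag' list column built above:

```python
import pandas as pd

# hand-made sample mirroring the intermediate 'flag' column
df = pd.DataFrame({
    "index": [0, 1],
    "description": ["this job requires masters in", "phd needed"],
    "flag": [["level_2"], ["level_3"]],
})

# explode turns each list element into its own row, and crosstab
# pivots the flags straight into wide format with 0/1 fills
long = df.explode("flag")
wide = pd.crosstab([long["index"], long["description"]], long["flag"])
wide = wide.reset_index()
wide.columns.name = None
```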

