Codes to apply an extension of the method in Webb (2020) to extract Tasks from text describing an occupation or a technology. A task is defined as a (verb,noun) pair where the noun is the direct object, or a conjunct of the direct object, of the verb.
- Step 1: Using SpaCy's DependencyParser and the
en_core_web_trf
model, add dependency and parts-of-speech tags to all tokens in a text. - Step 2: For all words identified as verbs by the POS Tagger, identify the first direct object that lies on the verb's subtree defined by the .head relationship.
- Step 3: Use SpaCy's Lemmatizer to lemmatize verbs.
- Step 4: Use NLTK's WordNet Corpus to translate nouns to a common level of generality (WordNet Level 4).
Once verb-noun tasks have been extracted from the O*NET Data, we can do cool stuff with the vector representation of occupations in the verb-noun space. Here's a natural one: How "similar" are different occupations?
To answer this, here's one approach based on the idea of TF-IDF Measures. Let
-
Pick an appropriate scale from O*NET for each raw task associated with each occupation. There are two scales, importance and relevance. Let
$w(o,\hat{t})$ be the scale value of raw task$\hat{t}$ for occupation$o$ . If$w(o,\hat{t})>0$ I will use the notation$\hat{t}\in o$ . -
After the decomposition of raw task
$\hat{t}$ into verb-noun tasks, I will use the notation$t\in \hat{t}$ if verb-noun task$t$ was extracted from raw task$\hat{t}$ . The notation$t\in o$ will indicate that$t\in\hat{t}$ for at least one$\hat{t}\in o$ . -
Construct a scale value for each verb-noun task by taking the mean scale value across all raw tasks within an occupation mentioning that verb-noun task. That is, for verb-noun task
$t$ ,
where
- Construct the Weighted Term Frequency of task
$t$ for occupation$o$ using the formula
- Construct the Inverse Document Frequency of task
$t$ using the formula
- Construct the TF-IDF Scores
$$w(o,t) = W_{TF}(o,t)\times W_{IDF}(t)$$ .
The TF-IDF Scores give us a vector representation of each occupation in terms of the space of extracted tasks. We can now use these scores to compute the similarity between two occupations. I compute two different measures:
- The Cosine Similarity, defined by
$$CosSim(o,o^{prime}) = \frac{w(o)\cdot w(o^{\prime})}{\vert w(o)\vert\times\vert w(o^{\prime})\vert}$$
where
- The Fraction Similarity, defined by
$$FracSim(o,o^{prime}) = \frac{\mathbf{1}(w(o)>0)\cdot\mathbf{1}(w(o^{\prime})>0)}{T}$$