GitHub - JuWu-19/Simple-NLP-Workflow

Automatic Field of Work Categorization Using NLP

This project applies Natural Language Processing (NLP) techniques to automate the categorization of the "Field of Work" column from questionnaire data stored in an Excel worksheet. The solution is flexible and can be adapted to analyze other columns with minimal modifications. Missing or unavailable entries are filtered out before classification, and Chinese text is translated into English so that mixed-language responses can be matched against the same categories.

Features

Automatic Categorization:
- Uses an efficient small-scale embedding model (all-MiniLM-L6-v2) to iteratively refine hierarchical category matching.
- Handles up to four hierarchical levels of categories, with each category having at least three child sub-categories to balance comprehensiveness and depth.
- Processes mixed Chinese-English entries using the translation model Helsinki-NLP/opus-mt-zh-en.
Preprocessing:
- Filters and processes missing or invalid data entries such as N/A, NA, or empty cells, ensuring robust performance.
Data Structure:
- The project defines standard industrial sector categories with up to four levels, providing a hierarchical framework for precise categorization.
Reproducibility:
- Modular code design for easy adaptation and scalability to handle different datasets or NLP tasks.
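
The preprocessing features above can be illustrated with a minimal sketch. It is not the repository's code: the names `is_missing`, `needs_translation`, and `route` are hypothetical, and both the case-insensitive matching of N/A and NA and the use of the CJK Unified Ideographs range as a test for Chinese text are assumptions made for illustration.

```python
import math
import re

# Placeholder strings treated as missing. "N/A" and "NA" come from the README;
# matching them case-insensitively is an assumption of this sketch.
MISSING_TOKENS = {"", "n/a", "na"}

# CJK Unified Ideographs block, used here as a rough test for Chinese text.
CHINESE_CHARS = re.compile(r"[\u4e00-\u9fff]")


def is_missing(value):
    """Return True for None, NaN, blank cells, and N/A-style placeholders."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in MISSING_TOKENS


def needs_translation(text):
    """Return True if the text contains at least one Chinese character."""
    return bool(CHINESE_CHARS.search(text))


def route(entries):
    """Label each raw entry as 'missing', 'translate', or 'ready'."""
    labels = []
    for entry in entries:
        if is_missing(entry):
            labels.append("missing")
        elif needs_translation(str(entry)):
            labels.append("translate")
        else:
            labels.append("ready")
    return labels
```

Entries labeled 'translate' would be passed to the translation model first, and entries labeled 'missing' would be skipped by the classifier.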

How It Works

Input Data:
- Reads the Excel worksheet with questionnaire responses.
- Encodes all the categories and sub-categories of all hierarchical levels to create embeddings for effective matching and refinement.
Translation:
- Uses the Helsinki-NLP/opus-mt-zh-en model to translate Chinese entries into English.
Categorization:
- Employs the all-MiniLM-L6-v2 model to map each entry to a hierarchical category by refining matches across levels.
Handling Missing Data:
- Filters invalid entries (e.g., N/A, NA, empty cells) to avoid classification errors.
Output:
- Generates an updated Excel worksheet (df_excel_output.xlsx) with additional columns:
  - Unified_Status
  - Unified_Field_of_Work
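
The level-by-level refinement in the Categorization step can be sketched as a greedy descent through a category tree. This is an illustration under stated assumptions, not the repository's implementation: `CATEGORY_TREE` is a made-up two-level tree, and `toy_embed` is a bag-of-words stand-in so that the example runs without downloading a model. In the project itself the embeddings come from all-MiniLM-L6-v2.

```python
import math
import re
from collections import Counter

# Illustrative tree only; these are not the project's category definitions.
CATEGORY_TREE = {
    "engineering": {
        "software engineering": {},
        "civil engineering": {},
        "mechanical engineering": {},
    },
    "finance": {
        "banking": {},
        "insurance": {},
        "investment management": {},
    },
    "medicine": {
        "clinical practice": {},
        "pharmaceutical research": {},
        "public health": {},
    },
}


def toy_embed(text):
    """Bag-of-words stand-in for a sentence embedding (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(entry, tree, min_score=0.0):
    """Walk the tree from the top, keeping the best-matching child at each level.

    Stops early when no child scores above min_score, so the returned path
    may be shorter than the tree depth (or empty).
    """
    query = toy_embed(entry)
    path = []
    node = tree
    while node:
        scored = [(cosine(query, toy_embed(name)), name) for name in node]
        score, best = max(scored)
        if score <= min_score:
            break
        path.append(best)
        node = node[best]
    return path


print(classify("software engineering at a startup", CATEGORY_TREE))
```

To use a real embedding model, both `toy_embed` and `cosine` would need to be replaced, for example with `SentenceTransformer("all-MiniLM-L6-v2").encode` and `sentence_transformers.util.cos_sim` from the sentence-transformers package. Category embeddings can then be computed once up front, as described under Input Data, instead of on every call.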

Future Directions

Remote Server Computation:
- Explore running computation-intensive NLP tasks on remote servers like Google Colab or AWS.
Adaptive Categories:
- Conduct global semantic identification to dynamically establish hierarchical categories that better capture real-world complexities when scaling data.
- Because real-world industrial sectors are interdisciplinary, tree-like hierarchical categories can be ambiguous. For instance, 'device design and manufacturing to support drug research' cannot simply be placed under a single level 1 category such as 'education & research', 'engineering', or 'medicine': whichever one is chosen, none of its subcategories fits well enough for further refinement. Semantic label-based networks or descriptive methods could be used instead. An 'entity-function-context' ontology could evaluate categories at different levels in parallel, rather than tracing the tree hierarchy to identify relevant sub-category nodes.
Enhanced Models:
- Use local open-source or commercial large language models (LLMs), or access advanced APIs, to improve category representation and matching accuracy.
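
One possible reading of the parallel 'entity-function-context' idea under Adaptive Categories is to score an entry against several independent label sets at once instead of descending a single tree. The sketch below is speculative: the facet labels, the `score_facets` function, and the 0.2 threshold are all hypothetical, and a bag-of-words similarity stands in for a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical facet labels, chosen only to illustrate the idea.
FACETS = {
    "entity": ["device", "software", "building"],
    "function": ["design and manufacturing", "teaching", "sales"],
    "context": ["drug research", "consumer retail", "construction"],
}


def embed(text):
    """Bag-of-words stand-in for a sentence embedding (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_facets(entry, facets, threshold=0.2):
    """Score the entry against every facet independently.

    Returns, per facet, all labels at or above the threshold, best first.
    No facet constrains another, so an interdisciplinary entry can pick up
    labels that would sit in different branches of a tree.
    """
    query = embed(entry)
    result = {}
    for facet, labels in facets.items():
        scored = [(cosine(query, embed(label)), label) for label in labels]
        kept = sorted((s for s in scored if s[0] >= threshold), reverse=True)
        result[facet] = [label for _, label in kept]
    return result


print(score_facets("device design and manufacturing to support drug research", FACETS))
```

Because each facet is evaluated on its own, the example entry is described by three labels together rather than being forced under one level 1 category.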

License

Code: Open License
Example Data: The questionnaire data provided in this repository (6th HKU+ Alumni Talk(1-242).xlsx) is not licensed for distribution. It is included solely for demonstrative purposes.

Dependencies

To install all required dependencies, run:

pip install -r requirements.txt

Outlook

Future improvements aim to optimize computational efficiency and improve the accuracy of hierarchical categorizations, aligning the approach more closely with real-world complexities and large-scale data needs.

For more information or to report issues, contact the repository owner. Contact details are available on Ju Wu's website.
