This project uses a data extraction, cleaning, and analysis pipeline to examine and visualize gender representation in the State Department directories in 1965, 1982, and 2022. I am most interested in the percentage of female officers over time.
While this project only examines the State Department at three points in time, it can easily be scaled to analyze the State Department at a yearly granularity from 1965 to 2022.
My data source is the State Department Key Officers of Foreign Service Posts series of documents, published from 1965 to 2022. The documents list the assigned officers and their positions for each US embassy.
The data to be extracted generally takes the form of 'RANK(:) First M. Last Name'. Two excerpts of the data from 1965 and 2022 are displayed below. Find these PDFs in the inst folder.
In code.ipynb I design a data pipeline that completes the following tasks:
- Imports the directory PDF and converts it into text using the `PyMuPDF` PDF manipulation package.
- Filters out information that does not include officer ranks or names, such as country names and page numbers, using text analysis tools such as `regex`.
- Extracts officer ranks and names, then uses the `nameparser` package to extract first names.
- Uses the `gender_guesser` first-name gender classification package to classify officer names.
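The filtering and extraction steps can be sketched with the standard library alone. This is a minimal illustration, not the code in code.ipynb: the pattern `OFFICER_LINE`, the helper `extract_officers`, and the sample lines are all hypothetical, and the naive first-name split merely stands in for the `nameparser` step.

```python
import re

# Hypothetical pattern for lines shaped like 'RANK(:) First M. Last Name'.
OFFICER_LINE = re.compile(
    r"^(?P<rank>[A-Z][A-Z/]+)"        # upper-case rank code, e.g. AMB
    r"\s*:?\s+"                       # optional colon, then whitespace
    r"(?P<name>[A-Z][a-z.'\-]+"       # first name or initial
    r"(?:\s+[A-Z][A-Za-z.'\-]+)+)$"   # one or more further name parts
)


def extract_officers(lines):
    """Return (rank, full_name, first_name) for lines that look like officer entries."""
    officers = []
    for raw in lines:
        match = OFFICER_LINE.match(raw.strip())
        if match is None:
            continue  # skip country names, page numbers, and other non-officer lines
        full_name = match.group("name")
        # Naive split on whitespace; an assumption standing in for nameparser.
        first_name = full_name.split()[0]
        officers.append((match.group("rank"), full_name, first_name))
    return officers


# Made-up sample lines, not taken from the actual directories.
sample = [
    "EXAMPLELAND",
    "AMB: Jane Q. Doe",
    "DCM John A. Smith",
    "Page 12",
]
print(extract_officers(sample))
```

Requiring lower-case letters in the first name token keeps all-caps country names such as "UNITED ARAB EMIRATES" from being read as a rank followed by a name. Real directory pages will likely need further rules, for example for vacant posts or multi-line entries.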
The percentage of female officers at the State Department has increased by more than 30 percentage points, from only 3% in 1965 to nearly 35% in 2022. While this is still well below women's share of the global population, which sits at nearly 50%, it demonstrates a significant improvement in gender equality in the State Department.
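As a rough sketch of how such a share could be computed from classifier output, the snippet below counts the labels that `gender_guesser` returns (`male`, `female`, `mostly_male`, `mostly_female`, `andy`, `unknown`). The function names and sample labels are hypothetical. Grouping the `mostly_*` labels with their gender and excluding `andy` and `unknown` from the denominator are assumptions, not necessarily what the notebook does.

```python
from collections import Counter

# Assumed grouping of gender_guesser labels; 'andy' and 'unknown' are left out.
FEMALE_LABELS = {"female", "mostly_female"}
MALE_LABELS = {"male", "mostly_male"}


def female_share(labels):
    """Percentage of female officers among names that received a gendered label."""
    counts = Counter(labels)
    female = sum(counts[label] for label in FEMALE_LABELS)
    male = sum(counts[label] for label in MALE_LABELS)
    classified = female + male
    if classified == 0:
        return 0.0  # nothing could be classified
    return 100.0 * female / classified


def percentage_point_change(start_pct, end_pct):
    """Difference between two percentages, expressed in percentage points."""
    return end_pct - start_pct


# Made-up labels for illustration only.
labels = ["male", "female", "mostly_male", "unknown", "male"]
print(female_share(labels))                # 25.0
print(percentage_point_change(3.0, 35.0))  # 32.0
```

The second call uses the rounded figures quoted above (3% and nearly 35%), so the exact change in the data may differ slightly.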
Required Python packages include:
- pandas
- re (part of the Python standard library)
- pymupdf
- nameparser
- gender_guesser
- matplotlib
- seaborn
This project is a replication of fp21's DOS diversity project, which examines the gender and racial diversity of State Department officers between 1965 and 2022.

