
Create a DBLP network with SAME_AS edges as training data for our ER model #9

Merged
merged 48 commits into from
Feb 6, 2023


rjurney
Contributor

@rjurney rjurney commented Aug 24, 2022

Fixes #10

We need a dataset to train and test the GAN ER model being created in #3. See the current README.md for a summary (quoted below). The data download and the parsing of the XML into type-specific JSON Lines files are now scripted. The next step is to use pandas and networkx to build a network that combines the DBLP types into a graph, then add SAME_AS and NOT_SAME_AS edges using the labels outlined below.
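As a rough sketch of that next step, the snippet below builds a small graph from author and paper records and then adds labeled edges between author nodes. The column names (`author_id`, `paper_id`, `author_ids`, `left`, `right`, `same`), the `AUTHORED` edge type, the `build_graph` helper, and the toy rows are all assumptions made for illustration, not the schema this PR uses. An undirected `nx.Graph` is also just one possible choice.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy records; the real pipeline reads type-specific JSON Lines files.
authors = pd.DataFrame(
    [
        {"author_id": "a1", "name": "Wei Wang"},
        {"author_id": "a2", "name": "W. Wang"},
        {"author_id": "a3", "name": "Wei Wang"},
    ]
)
papers = pd.DataFrame(
    [
        {"paper_id": "p1", "title": "Graph Mining", "author_ids": ["a1"]},
        {"paper_id": "p2", "title": "Entity Resolution", "author_ids": ["a2", "a3"]},
    ]
)
# Hypothetical label table: pairs of author records plus a same/different flag.
labels = pd.DataFrame(
    [
        {"left": "a1", "right": "a2", "same": True},
        {"left": "a1", "right": "a3", "same": False},
    ]
)


def build_graph(authors, papers, labels):
    """Combine entity tables into one graph, then add the ER label edges."""
    graph = nx.Graph()
    # One node per author record.
    for rec in authors.to_dict("records"):
        graph.add_node(rec["author_id"], entity_type="author", name=rec["name"])
    # One node per paper, linked to each of its author records.
    for rec in papers.to_dict("records"):
        graph.add_node(rec["paper_id"], entity_type="paper", title=rec["title"])
        for author_id in rec["author_ids"]:
            graph.add_edge(author_id, rec["paper_id"], edge_type="AUTHORED")
    # Labeled pairs become SAME_AS or NOT_SAME_AS edges between author nodes.
    for rec in labels.to_dict("records"):
        edge_type = "SAME_AS" if rec["same"] else "NOT_SAME_AS"
        graph.add_edge(rec["left"], rec["right"], edge_type=edge_type)
    return graph


graph = build_graph(authors, papers, labels)
```

Storing the label as an `edge_type` attribute keeps positive and negative pairs in the same graph, so a training loop could filter edges by type when it samples examples.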

DBLP Training Data

DBLP is a database of scholarly research in computer science.

The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.

Note that there are additional labels available as XML that we haven't parsed yet at:

Collecting and Preparing the Training Data

The DBLP XML and the 50K ER labels are downloaded, parsed, and transformed into a graph by graphlet.dblp.__main__, which is run with:

python -m graphlet.dblp
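To give a feel for the parse step, here is a minimal sketch of turning DBLP-style XML into one JSON Lines payload per record type. It is not the code in graphlet.dblp: the `records_by_type` and `to_json_lines` helpers and the sample XML are hypothetical, and it loads the whole document into memory, whereas the real dump is large enough that a streaming parser is probably the better fit.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny stand-in for the DBLP dump, invented for illustration.
SAMPLE_XML = """<dblp>
  <article key="journals/x/A1"><author>Ada Lovelace</author><title>Notes</title><year>1843</year></article>
  <inproceedings key="conf/y/B2"><author>Alan Turing</author><author>Ada Lovelace</author><title>Machines</title></inproceedings>
</dblp>"""


def records_by_type(xml_text):
    """Group the children of the root element by tag, one dict per record."""
    grouped = defaultdict(list)
    root = ET.fromstring(xml_text)
    for element in root:
        record = {"key": element.get("key")}
        # Fields such as <author> can repeat, so every field is stored as a list.
        for field in element:
            record.setdefault(field.tag, []).append(field.text)
        grouped[element.tag].append(record)
    return grouped


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


grouped = records_by_type(SAMPLE_XML)
jsonl = {tag: to_json_lines(records) for tag, records in grouped.items()}
```

Each value in `jsonl` could then be written to its own file, for example one for `article` records and one for `inproceedings` records.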

@rjurney rjurney added the "entity resolution" (The process of working out whether multiple records are referencing the same real-world thing) and "datasets" (Things involving datasets) labels Aug 24, 2022
@rjurney rjurney requested a review from tanmoyio August 24, 2022 19:30
rjurney added 26 commits August 24, 2022 18:16
…erence between his datasets and the main DBLP one
…alls get_good_entity_df and unit test called test_entity_schema
… added test get_test_name_with_bad_entity_df which is lazy and tests a variety of error.
@rjurney rjurney merged commit ab37cd4 into main Feb 6, 2023
Development

Successfully merging this pull request may close these issues.

Create a DBLP labeled training network with SAME_AS edges for training our entity resolution model