Create a DBLP network with SAME_AS edges as training data for our ER model #9
Merged
…erence between his datasets and the main DBLP one
…alls get_good_entity_df and unit test called test_entity_schema
… added test get_test_name_with_bad_entity_df which is lazy and tests a variety of errors.
…me unfixed issues with Pandera remain :)
…g it in, still have to make it work.
Labels
datasets
Things involving datasets
entity resolution
The process of working out whether multiple records are referencing the same real-world thing
Fixes #10
We need a dataset to train and test the GAN ER model being created in #3. See the current README.md for a summary (quoted below). Now that the data download is scripted and the XML is parsed into type-specific JSON Lines files, the next step is to use pandas and networkx to build a network that combines the DBLP types into a graph, and then to add SAME_AS and NOT_SAME_AS edges using the labels outlined below.
DBLP Training Data
DBLP is a database of scholarly research in computer science.
The datasets we use are the actual DBLP data and a set of labels for entity resolution of authors.
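As a rough illustration of how the DBLP data and the ER labels could be combined, here is a minimal sketch using pandas and networkx. It is not the code in this PR: the column names (entity_id, author_id, paper_id, left, right, same), the AUTHORED edge type, the build_graph helper and the tiny in-memory tables are all hypothetical stand-ins for the parsed JSON Lines files and the label set.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-ins for rows parsed from the type-specific JSON Lines files.
authors = pd.DataFrame(
    {"entity_id": ["a1", "a2", "a3"], "name": ["J. Smith", "John Smith", "Jane Smith"]}
)
papers = pd.DataFrame({"entity_id": ["p1", "p2"], "title": ["Paper One", "Paper Two"]})
authored = pd.DataFrame({"author_id": ["a1", "a2", "a3"], "paper_id": ["p1", "p2", "p2"]})

# Hypothetical ER labels: one row per author pair, True when both refer to the same person.
labels = pd.DataFrame({"left": ["a1", "a1"], "right": ["a2", "a3"], "same": [True, False]})


def build_graph(authors, papers, authored, labels):
    """Combine the entity tables into one graph, then add the label edges."""
    graph = nx.MultiDiGraph()

    # One node per entity, tagged with its DBLP type.
    for row in authors.to_dict("records"):
        graph.add_node(row["entity_id"], entity_type="author", name=row["name"])
    for row in papers.to_dict("records"):
        graph.add_node(row["entity_id"], entity_type="paper", title=row["title"])

    # Structural edges that come from the DBLP records themselves.
    for row in authored.to_dict("records"):
        graph.add_edge(row["author_id"], row["paper_id"], edge_type="AUTHORED")

    # Label edges used as training signal for entity resolution.
    for row in labels.to_dict("records"):
        edge_type = "SAME_AS" if row["same"] else "NOT_SAME_AS"
        graph.add_edge(row["left"], row["right"], edge_type=edge_type)

    return graph


graph = build_graph(authors, papers, authored, labels)
edge_types = sorted(data["edge_type"] for _, _, data in graph.edges(data=True))
```

A multigraph is used here so that a label edge and a structural edge between the same pair of nodes can coexist. Whether the real pipeline needs that is an open design choice.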
Note that there are additional labels available as XML that we haven't parsed yet at:
Collecting and Preparing the Training Data
The DBLP XML and the 50K ER labels are downloaded, parsed and transformed into a graph by graphlet.dblp.__main__, which is run via: