Skip to content

levin-noro/TGFD-discovery

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 

Repository files navigation

TGFD Discovery

1. Overview

Temporal Graph Functional Dependencies (TGFDs) are a recently defined class of data quality rules for enforcing consistency over evolving graphs. This project is an automated solution for discovering TGFDs in large-scale graphs.

2. Datasets

Our system has been tested on the following large-scale graph datasets. We provide a sources for obtaining each dataset. We also provide instructions on how to prepare the datasets for use with our system.

2.1. DBpedia

Source

Dataset preparation

Download as many snapshots from the source as you need and place the snapshots in a new folder on your local machine. Each snapshot file must be placed inside its own subfolder, and the name of the subfolder must be a timestamp of the format YYYYMMDD. Each snapshot file must be in .ttl format and use the .ttl file extension.

2.2. IMDB

Sources

Dataset preparation

Download as many snapshots from the source as you need from a source and place the snapshots in a new folder on your local machine. Each snapshot file must use the naming format imdb-YYYYMMDD.nt. Each snapshot file must be in .nt format and use the .nt file extension.

2.3. Synthetic

Synthetic datasets can be generated using gMark.

Dataset preparation

Once you have successfully generated snapshots using gMark, place the snapshots in a new folder on your local machine. Each snapshot file must use the naming format graphYYYYMMDD.txt.

3. Getting Started

3.1. Prerequisites

  • Java 15
  • Maven

3.2. Instructions

  1. Download/clone this repository.
  2. Using a Java IDE such as IntelliJ, open the VF2SubIso folder as a new project.
  3. Build a jar that uses src/main/java/TgfdDiscovery/TgfdDiscovery.java as its main class. For more information on building jars using IntelliJ, read the IntelliJ documentation.
  4. Execute the jar using the command: java <optional_java_args> -cp path/to/jar TgfdDiscovery.TgfdDiscovery <tgfd_discovery_args>

3.2.1. Arguments for <optional_java_args>

Argument Description
-Xmx<integer>g If you encounter an OutOfMemory error, increase the amount of memory available to java by specifying this argument. For example, -Xmx128g allocates 128 gigabytes of memory for java.

3.2.2. Required arguments for <tgfd_discovery_args>

Argument Description
-loader [dbpedia|imdb|synthetic] Specify one of three loaders (dbpedia, imdb, synthetic) for parsing the dataset.
-path path/to/jar The path to the folder containing the dataset files.
-t <integer> Number of snapshots to use.
-k <integer> Discover graph patterns with up to k edges.
-theta <percent> Specify a support threshold between 0.0 and 1.0.
-a <integer> Number of frequent attributes that will be considered during dependency generation.
-f <integer> Number of frequent edges that will be considered during pattern generation.

3.2.3. Optional arguments for <tgfd_discovery_args>

Argument Description
-interestLabels <comma_seperated_values> Specify an additional list of labels to include in the sets of frequent attributes and edges.
-maxLit <integer> Discover dependencies with up to n literals during dependency generation, where n is the specified integer.
-changefile [all|opt] Build graphs using change-files instead of snapshots. Specify all to consider all changes in a change-file. Specify opt to only consider relevant changes in a change-file. Must be used with -changefilePath.
-incremental Use incremental matching to avoid recomputing matches between snapshots. Must be used with -changefilePath. Works best when the number of changes between snapshots is small.
-changefilePath /path/to/changefiles Path to a folder that contains all change-files. Refer to Section 3.2.5 for instructions on how to generate change-files for a dataset.
-skipK1 Skips discovery of TGFDs for graph patterns of size k = 1.
-dontStore Does not store any snapshots or change-files in memory. Snapshots and change-files will be read from memory as needed. This option reduces memory usage at the expense of increased runtime.
-simplifySuperVertex <integer> Dissolves all vertices in each snapshot that have an in-degree greater than the specified integer.
-k0 Discover TGFDs for graph patterns of size k = 0.

3.2.4. (For developers) Optional arguments for <tgfd_discovery_args>

Argument Description
-noMinimalityPruning Disable the pruning of redundant dependencies during dependency generation.
-noSupportPruning Disable the pruning of low-support graph patterns during pattern generation.
-uninteresting Disable restriction that forces every vertex in a pattern to participate in a dependency. Must be used with -maxLit.
-K Print to file the runtime of each level i in the TGFD discovery process, where 0 <= i <= k.
-validation Run experiment without localized subgraph isomorphism. This is very slow. Only use for validation testing.
-slow Disable pattern matching optimizations for localized subgraph isomorphism.

3.2.5. How to generate change-files

  1. Build a jar that uses src/test/java/testDiffExtractor.java as its main class.
  2. Execute the jar using the command java <optional_java_args> -cp path/to/jar testDiffExtractor <required_args>, where <optional_java_args> are defined in Section 3.2.1 and <required_args> consists of the following three arguments:
    • -path path/to/dataset
    • -loader [dbpedia|imdb|synthetic]. Choose one of three values: dbpedia, imdb, synthetic.
    • -percent <percent>. Specify a value between 0.0 and 1.0.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages