TGFD Discovery

1. Overview

Temporal Graph Functional Dependencies (TGFDs) are a recently defined class of data quality rules for enforcing consistency over evolving graphs. This project is an automated solution for discovering TGFDs in large-scale graphs.

2. Datasets

Our system has been tested on the following large-scale graph datasets. For each dataset, we list its sources and explain how to prepare it for use with our system.

2.1. DBpedia

Source

https://databus.dbpedia.org/dbpedia/collections/latest-core

Dataset preparation

Download as many snapshots from the source as you need and place the snapshots in a new folder on your local machine. Each snapshot file must be placed inside its own subfolder, and the name of the subfolder must be a timestamp of the format YYYYMMDD. Each snapshot file must be in .ttl format and use the .ttl file extension.
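
The layout rules above can be checked before starting a run. The following is a minimal sketch and not part of this repository: the helper name `check_dbpedia_layout` is hypothetical, and it assumes that "its own subfolder" means every entry of the dataset folder is a directory named with eight digits that holds at least one `.ttl` file.

```python
import re
from pathlib import Path

# Subfolder names must be timestamps of the form YYYYMMDD (eight digits).
TIMESTAMP = re.compile(r"[0-9]{8}")

def check_dbpedia_layout(root):
    """Return a list of problems found under the dataset folder `root` (hypothetical helper)."""
    problems = []
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_dir():
            problems.append(f"{entry.name}: snapshot files must sit inside a subfolder")
        elif not TIMESTAMP.fullmatch(entry.name):
            problems.append(f"{entry.name}: subfolder name is not YYYYMMDD")
        elif not any(f.is_file() and f.suffix == ".ttl" for f in entry.iterdir()):
            problems.append(f"{entry.name}: no .ttl snapshot file found")
    return problems
```

An empty list means the folder matches the layout described above under these assumptions.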

2.2. IMDB

Sources

https://www.imdb.com/interfaces/
ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/

Dataset preparation

Download as many snapshots as you need from one of the sources and place the snapshots in a new folder on your local machine. Each snapshot file must use the naming format imdb-YYYYMMDD.nt. Each snapshot file must be in .nt format and use the .nt file extension.

2.3. Synthetic

Synthetic datasets can be generated using gMark.

Dataset preparation

Once you have successfully generated snapshots using gMark, place the snapshots in a new folder on your local machine. Each snapshot file must use the naming format graphYYYYMMDD.txt.
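
The file-naming formats in Sections 2.2 and 2.3 can be checked with a short script. This is an illustrative sketch only: the function `snapshot_date` is hypothetical, and it assumes that the YYYYMMDD part of each name is a valid calendar date.

```python
import re
from datetime import datetime

# Naming formats stated in Sections 2.2 (IMDB) and 2.3 (Synthetic).
PATTERNS = {
    "imdb": re.compile(r"imdb-([0-9]{8})\.nt"),
    "synthetic": re.compile(r"graph([0-9]{8})\.txt"),
}

def snapshot_date(loader, filename):
    """Return the date encoded in `filename`, or None if the name does not fit the format."""
    match = PATTERNS[loader].fullmatch(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
```

For example, `snapshot_date("imdb", name)` can be applied to every file in the dataset folder to spot misnamed snapshots before loading them.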

3. Getting Started

3.1. Prerequisites

Java 15
Maven

3.2. Instructions

1. Download or clone this repository.
2. Using a Java IDE such as IntelliJ, open the `VF2SubIso` folder as a new project.
3. Build a jar that uses `src/main/java/TgfdDiscovery/TgfdDiscovery.java` as its main class. For more information on building jars using IntelliJ, read the IntelliJ documentation.
4. Execute the jar using the command: `java <optional_java_args> -cp path/to/jar TgfdDiscovery.TgfdDiscovery <tgfd_discovery_args>`

3.2.1. Arguments for `<optional_java_args>`

| Argument | Description |
| --- | --- |
| `-Xmx<integer>g` | If you encounter an `OutOfMemoryError`, increase the amount of memory available to Java by specifying this argument. For example, `-Xmx128g` sets the maximum heap size to 128 gigabytes. |

3.2.2. Required arguments for `<tgfd_discovery_args>`

| Argument | Description |
| --- | --- |
| `-loader [dbpedia\|imdb\|synthetic]` | Specify one of three loaders (dbpedia, imdb, synthetic) for parsing the dataset. |
| `-path path/to/dataset` | The path to the folder containing the dataset files. |
| `-t <integer>` | Number of snapshots to use. |
| `-k <integer>` | Discover graph patterns with up to k edges. |
| `-theta <percent>` | Specify a support threshold between 0.0 and 1.0. |
| `-a <integer>` | Number of frequent attributes that will be considered during dependency generation. |
| `-f <integer>` | Number of frequent edges that will be considered during pattern generation. |
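
The required arguments can be assembled programmatically, which helps when scripting many runs. The sketch below is not part of this repository: `build_command` and its parameter names are hypothetical, and only the flags, the main class name, and the value ranges come from this README.

```python
def build_command(jar, loader, dataset_path, t, k, theta, a, f, max_heap_gb=None):
    """Assemble the argument list for one TgfdDiscovery run (required arguments only)."""
    if loader not in ("dbpedia", "imdb", "synthetic"):
        raise ValueError(f"unknown loader: {loader}")
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie between 0.0 and 1.0")
    cmd = ["java"]
    if max_heap_gb is not None:
        cmd.append(f"-Xmx{max_heap_gb}g")  # optional JVM heap size, Section 3.2.1
    cmd += ["-cp", jar, "TgfdDiscovery.TgfdDiscovery",
            "-loader", loader, "-path", dataset_path,
            "-t", str(t), "-k", str(k), "-theta", str(theta),
            "-a", str(a), "-f", str(f)]
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` would launch the run. The jar path and all numeric values are up to the caller.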

3.2.3. Optional arguments for `<tgfd_discovery_args>`

| Argument | Description |
| --- | --- |
| `-interestLabels <comma_separated_values>` | Specify an additional list of labels to include in the sets of frequent attributes and edges. |
| `-maxLit <integer>` | Discover dependencies with up to n literals during dependency generation, where n is the specified integer. |
| `-changefile [all\|opt]` | Build graphs using change-files instead of snapshots. Specify `all` to consider all changes in a change-file. Specify `opt` to only consider relevant changes in a change-file. Must be used with `-changefilePath`. |
| `-incremental` | Use incremental matching to avoid recomputing matches between snapshots. Must be used with `-changefilePath`. Works best when the number of changes between snapshots is small. |
| `-changefilePath /path/to/changefiles` | Path to a folder that contains all change-files. Refer to Section 3.2.5 for instructions on how to generate change-files for a dataset. |
| `-skipK1` | Skips discovery of TGFDs for graph patterns of size k = 1. |
| `-dontStore` | Does not store any snapshots or change-files in memory. Snapshots and change-files will be read from disk as needed. This option reduces memory usage at the expense of increased runtime. |
| `-simplifySuperVertex <integer>` | Dissolves all vertices in each snapshot that have an in-degree greater than the specified integer. |
| `-k0` | Discover TGFDs for graph patterns of size k = 0. |
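
Some options only work together with another option. A wrapper script could check this before launching a long run. The following is a minimal sketch under that assumption: the name `missing_companions` is hypothetical, and the check mirrors only the "Must be used with" notes in Sections 3.2.3 and 3.2.4, not the tool's own argument parser.

```python
# Options that this README says must be used together with another option.
REQUIRES = {
    "-changefile": "-changefilePath",
    "-incremental": "-changefilePath",
    "-uninteresting": "-maxLit",  # developer option, Section 3.2.4
}

def missing_companions(args):
    """Return (option, required_option) pairs that the argument list `args` violates."""
    present = set(args)
    return [(opt, needed) for opt, needed in REQUIRES.items()
            if opt in present and needed not in present]
```

An empty result means none of the listed pairings is violated. It does not validate the option values themselves.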

3.2.4. (For developers) Optional arguments for `<tgfd_discovery_args>`

| Argument | Description |
| --- | --- |
| `-noMinimalityPruning` | Disable the pruning of redundant dependencies during dependency generation. |
| `-noSupportPruning` | Disable the pruning of low-support graph patterns during pattern generation. |
| `-uninteresting` | Disable the restriction that forces every vertex in a pattern to participate in a dependency. Must be used with `-maxLit`. |
| `-K` | Write to a file the runtime of each level i in the TGFD discovery process, where 0 <= i <= k. |
| `-validation` | Run the experiment without localized subgraph isomorphism. This is very slow. Only use for validation testing. |
| `-slow` | Disable pattern matching optimizations for localized subgraph isomorphism. |

3.2.5. How to generate change-files

1. Build a jar that uses `src/test/java/testDiffExtractor.java` as its main class.
2. Execute the jar using the command `java <optional_java_args> -cp path/to/jar testDiffExtractor <required_args>`, where `<optional_java_args>` are defined in Section 3.2.1 and `<required_args>` consists of the following three arguments:
   - `-path path/to/dataset`
   - `-loader [dbpedia|imdb|synthetic]`. Choose one of three values: dbpedia, imdb, synthetic.
   - `-percent <percent>`. Specify a value between 0.0 and 1.0.
