- Please clone this repo to your local machine.
- Download Wikipedia's database from here.
- Import this database into your MySQL Server.
- Edit `DiscrMetaPath/src/main/java/edu/nd/dsg/util/ConnectionPool.java` and change `URL`, `USER`, and `PASS` to your own values.
- Unzip `DiscrMetaPath/data.tar.gz` to `DiscrMetaPath/`. After this you should have all the data under `DiscrMetaPath/data`.
- Build the project with `make wikibuild`. The jar file will be generated under `DiscrMetaPath/target/`.
- Run the generated jar file with `java -jar JAR_FILE_YOU_GENERATED`.
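
The unzip step above can be sketched as a small shell helper. This is only an illustration: `extract_data` and its `repo_dir` argument are hypothetical names, and the sketch assumes the archive contains a top-level `data` directory, as the step above implies.

```shell
# Sketch only: extract_data is a hypothetical helper, not part of this repo.
# Argument 1 is the path to your local clone.
extract_data() {
    repo_dir="$1"
    # Unpack the archive into DiscrMetaPath/ so its contents land next to it.
    tar -xzf "$repo_dir/DiscrMetaPath/data.tar.gz" -C "$repo_dir/DiscrMetaPath/" || return 1
    # After extraction the data is expected under DiscrMetaPath/data.
    if [ -d "$repo_dir/DiscrMetaPath/data" ]; then
        echo "data ready: $repo_dir/DiscrMetaPath/data"
    else
        echo "expected $repo_dir/DiscrMetaPath/data after extraction" >&2
        return 1
    fi
}
```

Calling `extract_data /path/to/your/clone` before `make wikibuild` gives an early check that the data is in place.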
The command line arguments are:
Usage
Generate paths: -GEN [-NoSQL cache types first to speedup] [-all get all paths instead of pathLength == 2] [-p build patent]
Translate paths: -TRANS [-a output all paths] [-nd do not get most discri/similar paths] [-oNum get NUM paths between discri&similar paths] [-p build patent]
Generate Term frequency: -TERM [-BuildWikiTF generate term frequency] [-BuildPatentTF generate term frequency] [-BuildWikiDF generate document frequency] [-BuildPatentDF generate document frequency]
Generate Cos distance frequency(sequential): -COS [-p build patent]
Generate BM25 score: -BM [-ACC accumulative (x,y),(x+y,z),...] [-NODE sequential (x,y),(y,z),...] [-p build patent]
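
To try flag combinations from the usage text without retyping the jar path, a thin wrapper can help. This is a sketch under assumptions: `run_tool`, `JAR`, and `DRY_RUN` are hypothetical names introduced here, and how the tool treats combined flags is not confirmed beyond the usage text above.

```shell
# JAR is a placeholder: point it at the jar produced under DiscrMetaPath/target/.
JAR="${JAR:-JAR_FILE_YOU_GENERATED}"

# Hypothetical wrapper: prints the command when DRY_RUN=1, otherwise runs it.
run_tool() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "java -jar $JAR $*"
    else
        java -jar "$JAR" "$@"
    fi
}
```

For example, `DRY_RUN=1 run_tool -GEN -all` prints the command for generating all paths, and `run_tool -BM -ACC` would run the accumulative BM25 scoring.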
If you are only interested in our results, you can get the data from the `result` folder. The data format for each file is:
- For CrowdFlower result files:
  - `_unit_id`
  - `_golden`
  - `_canary`
  - `_unit_state`
  - `_trusted_judgments`
  - `_last_judgment_at`
  - `choose_path`: the path chosen by a human
  - `choose_path:confidence`
  - `end`: the end article
  - `path_1`: a path between start and end, generated by our algorithm
  - `path_2`
  - `path_3`
  - `path_4`
  - `path_5`
  - `start`: the start article
- For other csv files:
  - `groupId`: each unique groupId represents a CrowdFlower task
  - `pathId`: equivalent to CrowdFlower's `path_*`
  - `nodeId`: score of the node at position `i` in `path_*`
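
As a rough illustration of working with the non-CrowdFlower csv files, the helper below pulls out every row belonging to one task. It is a sketch with assumptions: `rows_for_group` is a hypothetical name, `groupId` is assumed to be the first column, and the files are assumed to be plain comma-separated text without quoted commas.

```shell
# Hypothetical helper: print every row whose first column (groupId) matches.
# Argument 1 is the csv file, argument 2 is the groupId to select.
rows_for_group() {
    csv_file="$1"
    group_id="$2"
    awk -F',' -v gid="$group_id" '$1 == gid' "$csv_file"
}
```

For instance, `rows_for_group some_result.csv 1` would list the scored paths for the task with groupId 1, where `some_result.csv` stands in for one of the files in the result folder.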