Learn_CMDL is a Java implementation of a score-based learning algorithm for Bayesian networks. It implements a newly proposed scoring function for Bayesian networks called complete minimum description length (CMDL). The program receives a data set of multivariate categorical observations and outputs the best-scoring structure found by the greedy hill climbing (GHC) algorithm.
The program comes packaged as an executable JAR file that already includes the required external libraries, and can be downloaded here. The source code can be downloaded here.
To visualize the output graph, which is written in DOT format, download Graphviz.
The program takes as input a .csv file in which:
- the first row of each column corresponds to the name of an attribute;
- the other rows correspond to observations of that attribute.
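The input layout described above can be illustrated with a minimal Java sketch. This is not the program's own parser: the class name `CsvColumns` and the method `parse` are hypothetical, and the sketch assumes plain comma separation with no quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Learn_CMDL: shows how the CSV layout maps to columns.
final class CsvColumns {

    // First line: attribute names. Every later line: one observation per attribute.
    // Assumption: fields are separated by plain commas and are never quoted.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        if (lines.isEmpty()) {
            return columns;
        }
        String[] names = lines.get(0).split(",", -1);
        for (String name : names) {
            columns.put(name.trim(), new ArrayList<>());
        }
        for (int row = 1; row < lines.size(); row++) {
            String line = lines.get(row);
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = line.split(",", -1);
            if (values.length != names.length) {
                throw new IllegalArgumentException(
                        "Row " + row + " has " + values.length
                        + " values, expected " + names.length);
            }
            for (int col = 0; col < names.length; col++) {
                columns.get(names[col].trim()).add(values[col].trim());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("A,B,C", "0,1,1", "1,0,1");
        System.out.println(parse(lines)); // prints {A=[0, 1], B=[1, 0], C=[1, 1]}
    }
}
```

Each attribute name maps to the list of its observed categorical values, in row order.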
The program is run by executing the .jar file:
$ java -jar Learn_CMDL.jar
The command-line options are the following:
--inputFile <file>        Input CSV file to be used for network learning.
--scoringFunction <arg>   Scoring function to be used: CMDL, MDL, LL or K2. CMDL is used by default.
--numRestarts <int>       Number of random restarts for the greedy hill climber (GHC).
--outputFile <file>       Writes output to <file>.
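The idea behind random restarts can be sketched in Java as follows. This is a generic illustration only, not the code used by Learn_CMDL: the class `RestartHillClimber`, its methods, and the bit-vector search space are all assumptions made for the example. In the actual program the states would be network structures scored by CMDL, MDL, LL or K2.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Hypothetical illustration of greedy hill climbing with random restarts.
// States are bit vectors; a neighbor differs from the current state in one bit.
final class RestartHillClimber {

    // Climbs from 'start' by repeatedly taking the single best-improving bit flip.
    // Stops at a local maximum, i.e. when no flip strictly improves the score.
    static boolean[] climb(boolean[] start, ToDoubleFunction<boolean[]> score) {
        boolean[] current = start.clone();
        double currentScore = score.applyAsDouble(current);
        while (true) {
            int bestFlip = -1;
            double bestScore = currentScore;
            for (int i = 0; i < current.length; i++) {
                current[i] = !current[i];            // try the neighbor
                double s = score.applyAsDouble(current);
                current[i] = !current[i];            // undo the flip
                if (s > bestScore) {
                    bestScore = s;
                    bestFlip = i;
                }
            }
            if (bestFlip < 0) {
                return current;                      // local maximum reached
            }
            current[bestFlip] = !current[bestFlip];
            currentScore = bestScore;
        }
    }

    // Runs one climb from the all-false state plus 'numRestarts' climbs from
    // random states, and keeps the best local maximum found.
    static boolean[] search(int n, int numRestarts, long seed,
                            ToDoubleFunction<boolean[]> score) {
        Random rng = new Random(seed);
        boolean[] best = climb(new boolean[n], score);
        double bestScore = score.applyAsDouble(best);
        for (int r = 0; r < numRestarts; r++) {
            boolean[] start = new boolean[n];
            for (int i = 0; i < n; i++) {
                start[i] = rng.nextBoolean();
            }
            boolean[] candidate = climb(start, score);
            double s = score.applyAsDouble(candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy score: the number of bits set to true.
        ToDoubleFunction<boolean[]> countTrue = bits -> {
            double count = 0;
            for (boolean b : bits) {
                if (b) count++;
            }
            return count;
        };
        boolean[] best = search(5, 10, 42L, countTrue);
        System.out.println(Arrays.toString(best)); // prints [true, true, true, true, true]
    }
}
```

Each climb ends in a local maximum of the score, so more restarts raise the chance of reaching a better one at the cost of more running time.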
Consider the benchmark LED data set led_500.csv, which has 500 instances, and the following options:
- CMDL as scoring function;
- 1000 random restarts for GHC;
- output_led.dot as the output file.
The command to learn the network is:
$ java -jar Learn_CMDL.jar --inputFile led_500.csv --scoringFunction CMDL --numRestarts 1000 --outputFile output_led.dot
It outputs the following structure: