Michael Scherk, Boyuan Chen
Duke University
SymMatika discovers symbolic expressions from data, which manifest as either explicit mappings like y=f(x₁, x₂, ...)
or implicit physical invariants. We provide our full codebase here for running-with-code use (including an executable terminal option to run directly from the terminal without any coding) and a user-friendly GUI. Both installations and use instructions are available on this page.
.
├── CMakeLists.txt
├── README.md
├── cmake
│ ├── dependencies.cmake
│ └── optimizations.cmake
├── evolution
│ ├── genetic_algorithm.cpp
│ ├── genetic_algorithm.h
│ ├── genetic_operations.cpp
│ ├── genetic_operations.h
│ ├── motif_finder.cpp
│ └── motif_finder.h
├── figures
│ ├── ModelOverview.jpg
│ ├── SymMatikaEndScreen.png
│ ├── SymMatikaOpenScreen.png
│ └── SymMatikaTerminalGuide.gif
├── fitness_core
│ ├── differentiation.cpp
│ ├── differentiation.h
│ ├── evaluate_tree.cpp
│ ├── evaluate_tree.h
│ ├── fitness.cpp
│ └── fitness.h
├── initialize_model.cpp
├── initialize_model.h
├── main.cpp
├── model.cpp
├── model.h
└── tree_construction
├── build_candidates.cpp
├── build_candidates.h
├── generate_population.cpp
└── generate_population.h
Some basic pre-requisites include: a C++23-compatible compiler and a MacOS or Windows machine (Homebrew
recommended for MacOS; vcpkg
recommended for Windows). If you haven't yet, install Homebrew or vcpkg. We use the C/C++ Homebrew compilers for MacOS or Chocolatey compilers for Windows. Then install the packages below.
# For MacOS Homebrew
brew install cmake symengine llvm gmp mpfr libmpc flint libomp eigen ghostscript # ghostscript only needed for GUI latex rendering
# For Windows vcpkg
.\vcpkg install symengine gmp mpfr libmpc flint eigen3 ghostscript # ghostscript only needed for GUI latex rendering
choco install llvm
Once the previous steps are completed, clone and build.
git clone https://github.com/generalroboticslab/SymMatika
mkdir build
cd build
cmake .. -DCMAKE_CXX_COMPILER=$(brew --prefix llvm)/bin/clang++
cmake --build .
For setting up SymMatika to run locally, first open an IDE with C++23 support (SymMatika was developed with CLion, so we strongly recommend you choose this IDE) and point the IDE to Homebrew C/C++ compilers (/opt/homebrew/opt/llvm/bin/clang
and /opt/homebrew/opt/llvm/bin/clang++
are compiler paths) or Chocolatey C/C++ compilers (C:\Program Files\LLVM\bin\clang-cl.exe
).
- for CLion, do: CLion → Settings → Toolchains, and then update the C/C++ Compiler fields
- for VSCode, ensure your
c_cpp_properties.json
file looks like:
// for MacOS
{
"configurations": [
{
"name": "Mac",
"includePath": [
"${workspaceFolder}/**",
"/opt/homebrew/opt/symengine",
"/opt/homebrew/opt/eigen/include/eigen3",
"/opt/homebrew/opt/libomp/include",
"/opt/homebrew/include"
],
"defines": [],
"compilerPath": "/opt/homebrew/opt/llvm/bin/clang",
"cStandard": "c17",
"cppStandard": "c++23",
"intelliSenseMode": "macos-clang-arm64"
}
],
"version": 4
}
// for Windows
{
"version": 4,
"configurations": [
{
"name": "Win-Clang",
"includePath": [
"${workspaceFolder}/**",
"C:/vcpkg/installed/x64-windows/include",
"C:/vcpkg/installed/x64-windows/include/symengine",
"C:/vcpkg/installed/x64-windows/include/eigen3",
"C:/vcpkg/installed/x64-windows/include/libomp"
],
"defines": [
"_WIN32",
"_OPENMP"
],
"compilerPath": "C:/Program Files/LLVM/bin/clang-cl.exe",
"cStandard": "c17",
"cppStandard": "c++23",
"intelliSenseMode": "windows-clang-x64"
}
]
}
- open the SymMatika directory you previously installed in installation
We outline the following instructions for valid dataset files:
- files must be
.txt
or.csv
- when searching for explicit mappings (i.e. some f s.t. f(x_1, ... , x_n) = y for input variables x_1, ... , x_n and target variable y), the last column must be the target variable and the previous columns are inputs
- when searching for implicit relations (i.e. no known target variable, common in discovering physical invariants), the first column is the trial number (i.e. 1 or 2), the second column is time, and the subsequent columns correspond to variables
To ensure fast runtimes, we recommend using datasets with < 10,000 samples, as runtimes scale massively with large datasets. This is not a requirement, but merely a suggestion. For discovering implicit relations, we can only support discovery in physical systems of up to four variables.
We provide three methods of using SymMatika, one with code and two alternative methods:
To run the model with code, navigate to the main.cpp
file. Once there, you can initialize a DataSet
object by calling the following:
#include <iostream>
#include "model.h"
int main() {
// Explicit dataset
DataSet data = DataSet(
"/Users/generaroboticslab/Desktop/SymMatika/example_data/sine_data.csv", // path to CSV file
true, // true = explicit mapping (y = f(x))
{"x", "y"} // input: x, target: y
);
// Run symbolic regression model
Model symMatika = Model(
data,
10000, // population size
3, // max tree depth
{true, true, true, true, true, true} // enable all operations
);
symMatika.run();
return 0;
}
💡 Tip: For easy file access, place the
.txt
/.csv
file incmake-build-debug
and just pass the filename
DataSet data = DataSet(fileName, systemType, variableList);
filename
= string name of path to data file, place incmake-build-debug
folder for easiest accesssystemType
= True is explicit and False is implicitvariableList
= string vector of input variable names, and a target variable as the last element in the vector if explicitsystemType
Model symMatika = Model(data, popSize, maxDepth, allowedOps);
popSize
= initial starting populations for each island population (recommended 10,000)maxDepth
= maximum depth of binary expression trees; in range of 1-5 in ascending candidate expression complexityallowedOps
= vector of boolean values for each binary and unary operator, True for enable and False for disable (by default, all operators are enabled)
💡 Tip: To ensure fast runtimes, we recommend using datasets with < 10,000 samples, as runtimes scale massively with large datasets. This is not a requirement, but merely a suggestion. For discovering implicit relations, we can only support discovery in physical systems of up to four variables.
Instead of manually coding, we offer an alternative solution for users to easily run SymMatika in the terminal. For running through terminal, navigate to main.cpp
in the SymMatika project and uncomment the function call to easy_run()
:
int main() {
/* INSTRUCTIONS for INSTALLATION are found in the GitHub
* documentation
* - to easily run the model in the terminal, uncomment
* the function call to 'easy_run()'
*
* Also, make sure to have the data files in the CMAKE-BUILD-DEBUG
* folder for easiest use. If this folder isn't present, simply
* reload the CMake project by: File -> Reload CMake Project
*/
// easy_run();
return 0;
}
Then run the program, and follow the instructions in the terminal. SymMatika outputs a LaTeX table of top-solutions and their corresponding predictive errors. We provide a short demonstration below:
We offer a user-friendly GUI for SymMatika. Qt Creator must be used for local use, which can be installed for free at this website. To download, navigate to the Releases tab on the SymMatika GitHub home page. Click the SymMatika GUI Download release and download the .zip
file. Once downloaded, unzip it and then open Qt Creator. Before you open and run the GUI, ensure that you reload cmake-build-debug
for the SymMatika_Engine
subdirectory in the downloaded .zip
folder; this can be done by opening your IDE and pressing Reload Cmake Project
. Click Open Project in the Welcome tab then navigate to and open the CMakeLists.txt
in the root directory. The program will take some time (~5 minutes) to build, since it has external C++ dependencies for tasks such as LaTeX rendering. Once the project is built, click run and open the MainWindow.
Once open, begin by clicking the Data header and upload the data file (recall the file instructions). Then, click the System Type header and check explicit mappings or implicit relations. Go to Variables and add the input variables and optionally the target variable. Then, select binary and unary operators and customize the complexity of generated candidate expressions.
After the model runs, press the "Copy table results" button to copy a LaTeX table of top-solutions and their corresponding predictive errors.
This work is supported by ARO under award W911NF2410405, DARPA FoundSci program under award HR00112490372, DARPA TIAMAT program under award HR00112490419, and by gift support from BMW and OpenAI.
If you found our paper or codebase to be helpful, please consider citing our work:
@misc{scherk2025symmatikastructureawaresymbolicdiscovery,
title={SymMatika: Structure-Aware Symbolic Discovery},
author={Michael Scherk and Boyuan Chen},
year={2025},
eprint={2507.03110},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.03110},
}