Tests for the Capela distributed programming environment.
You'll need a Jepsen environment to run these tests.
You'll also need a Capela tarball, which should be named something like
whatever-<version>.tar.gz. That tarball should contain (either at the top
level, or in a single directory):
    bin/
      uvmc       The UVM compiler
      uvm_repl   The UVM server
    sandbox/     A working directory with `.py` files.
To run a single test of Capela, try
    lein run test --tarball capela-1.0.tar.gz
Depending on your environment, you may need to specify a username and what nodes you'd like to run with:
    lein run test --tarball capela-1.0.tar.gz --username admin --nodes n1,n2,n3
To inject process kills and also single-bit errors in disk files, try
    lein run test --tarball capela-1.0.tar.gz --nemesis kill,bitflip-file-chunks
There are lots of parameters to select workloads, faults, timing, concurrency,
request rate, transaction structure, and more. Use lein run test --help to
see a list of all the options. To run a suite of tests with various
combinations of those choices, use test-all:
    lein run test-all --tarball capela.tar.gz --time-limit 300 --concurrency 2n --rate 20
Passing a --workload, --nemesis, or --lazyfs to test-all will run just
combinations with that particular workload, nemesis, etc.
Running a single test produces a directory in store/<name>/<timestamp>/. The
currently running (or most recently run) test is in store/current, and the
most recently completed test is in store/latest. You can slice and dice the
test.jepsen files at the REPL.
Running lein run serve will launch a web server displaying all the tests in
the store/ directory.
Workloads live in src/jepsen/capela/workload/, and their names are set in
src/jepsen/capela/cli.clj. The workloads are:
wr: Transactions over write-read registers. Stores a map of integer keys in a
dictionary in a single partition. Performs transactions which can read and
write values of specific keys, and checks for various isolation anomalies using
Elle.
append: Like wr, but values are lists of unique integers, and transactions
append to those lists, rather than overwriting their value. This is a good deal
more precise than wr, but (surprisingly!) it fails to catch some bugs, like
Lost Update, that wr finds.
multi-wr: Like wr, but shards values across multiple partitions.
multi-append: Like append, but shards values across multiple partitions.
ad_hoc: This runs a series of hand-coded Python programs against the query
endpoint, and compares their results to what normal Python returns.
gen_py: An experimental test which generates (very simple) Python programs,
submits them to Capela, and checks them against a local Python interpreter.
side_effects: A sketch of a non-functional test for side effects. Capela's
side effects system was not ready during our collaboration, but this might be a
useful foundation for later.
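
To illustrate why append is more precise than wr: every read of a list reveals the full history of that key, so all reads of one key must be prefixes of the longest one. The sketch below is a simplified illustration of that single idea, not Elle's actual algorithm; the function name and input shape are assumptions made for this example.

```python
def longest_version(reads):
    """Given every list observed for one key, return the longest one,
    or raise if some read is not a prefix of it (or elements repeat)."""
    longest = max(reads, key=len, default=[])
    if len(set(longest)) != len(longest):
        raise ValueError(f"duplicate element in {longest}")
    for read in reads:
        # A compatible read must match the start of the longest read.
        if longest[:len(read)] != read:
            raise ValueError(f"read {read} is not a prefix of {longest}")
    return longest


# Compatible reads yield a version order for the key.
order = longest_version([[1], [1, 2], [1, 2, 3]])
```

With overwriting registers, a read shows only the latest value, so no such order can be recovered from a single read.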
Nemeses are controlled by jepsen.capela.cli, with support from Jepsen's
jepsen.nemesis.combined and this test's jepsen.capela.nemesis. Provided
nemeses are:
kill: Kills Capela processes and restarts them. With --lazyfs, also drops
un-fsynced writes.
pause: Pauses processes and resumes them using SIGSTOP and SIGCONT.
partition: Partitions the network in various topologies, using iptables.
packet: Introduces a small amount of latency into network packets.
clock: Adjusts node clocks, either in a big jump, or strobing rapidly between
two values.
bitflip-file-chunks: Introduces single-bit errors into .sst and .blob
files in Capela's data directory.
snapshot-file-chunks: Takes snapshots of, and later restores, chunks of
.sst and .blob files.
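
For intuition about what a single-bit error in a disk file means, here is a minimal Python sketch that flips one randomly chosen bit in a file. It is not how Jepsen implements bitflip-file-chunks; the function name and its parameters are hypothetical, chosen only for this illustration.

```python
import os
import random


def flip_random_bit(path, rng=random):
    """Flip one randomly chosen bit in the file at `path`.

    Returns (byte_offset, bit_index), or None if the file is empty.
    """
    size = os.path.getsize(path)
    if size == 0:
        return None
    offset = rng.randrange(size)
    bit = rng.randrange(8)
    with open(path, "r+b") as f:
        f.seek(offset)
        byte = f.read(1)[0]
        f.seek(offset)
        # XOR with a one-bit mask inverts exactly that bit.
        f.write(bytes([byte ^ (1 << bit)]))
    return offset, bit
```

Passing a seeded `random.Random` instance as `rng` makes the corruption reproducible.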
The Leiningen project file, which pulls in dependencies and controls what gets
run when you say lein run, is project.clj. The source code for the test
harness lives in src; one file per namespace. The .py programs that we
upload to each Capela node at the start of the test are in resources/.
jepsen.capela.cli is the top-level entry point; it parses CLI options, builds a test map, and asks Jepsen to run one or more tests.
jepsen.capela.core provides common fundamentals--mainly port numbers.
jepsen.capela.db handles installing Capela and its prerequisites, killing and pausing nodes, and downloading log files. It also includes a watchdog that restarts Capela when it crashes.
jepsen.capela.client makes HTTP calls to Capela's API, and offers some basic
error handling.
jepsen.capela.nemesis defines fault injection packages. It glues together the
standard packages in jepsen.nemesis.combined with some custom packages, like
file corruption.
jepsen.capela.repl is the namespace you're dropped into for lein repl. It pulls in some namespaces that are handy for working with tests.
jepsen.capela.workload contains the various workloads the test can run.
wr and append tests are great at finding anomalies, but there are lots of
ways to encode them that may discover different parts of Capela's internals.
For example, we might store one value per partition, and create partitions
dynamically throughout the test. We could use alternative data structures, like
an array of values, rather than a map of integer keys to values. We could use
tuples instead of lists.
We could add tests for sets--either using Jepsen's set tests, or by using subset relations to infer version orders, and feeding those to Elle. Values could be stored in Python Sets and Dictionaries.
There are some hints that Capela's partitions might disappear after creation,
returning None from calls to select(). We should write a test which creates
partitions dynamically and reads some or all partitions back throughout the
test, and pass that to Jepsen's set-full checker to make sure partitions are
always available after creation.
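
The core of that check can be sketched in a few lines of Python. This is a simplified stand-in for the idea, not Jepsen's set-full checker: it assumes each read attempts to list all partitions, and the tuple shapes for creates and reads are assumptions made for this example.

```python
def missing_partitions(creates, reads):
    """Find partitions that a read should have seen but did not.

    creates: list of (partition_id, ack_time) for acknowledged creations.
    reads:   list of (invoke_time, observed_ids) for successful reads
             that tried to list every partition.

    A partition whose creation was acknowledged before a read began must
    appear in that read. Returns a list of (partition_id, invoke_time).
    """
    violations = []
    for invoke_time, observed in reads:
        seen = set(observed)
        for pid, ack_time in creates:
            if ack_time < invoke_time and pid not in seen:
                violations.append((pid, invoke_time))
    return violations


creates = [("p1", 1), ("p2", 5)]
reads = [(2, ["p1"]), (6, ["p2"])]  # the second read lost p1
violations = missing_partitions(creates, reads)
```

A read that overlaps a creation in time is allowed to miss it, which is why only creations acknowledged strictly before the read's invocation are checked.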
Since Capela allows us to write our own custom data structures, we can do some
neat tricks. For example, we could make up an arbitrary datatype T, like a
directed graph where nodes are Capela partitions, and each has outbound edges,
and operations can mutate, read, traverse the graph, etc. Then augment T with
an append-only list of integers L; call the product [T, L] U. Now
generate random operations on T, and number each operation uniquely. Submit
those operations to some instance of U, such that each operation is applied
to its T, and its unique ID is appended to its L log.
From the log, we can reconstruct the exact sequence of operations that Capela
thinks occurred. Compare that to what Jepsen thinks--make sure that every
acknowledged op is in the log, that every op in the log is either :ok or
:info, and so on. Next, build the realtime order over operations from
Jepsen's history, and ensure it's consistent with the log order; this detects
realtime ordering violations. Finally, use the log order to replay the same
operations against a reference implementation of T, and compare the results
at each step to what Capela returned.
In short, the list-append tests convince us that Capela can reliably append things to a list. We then use the correctness of list-append in Capela as a fulcrum to test arbitrary datatypes, while keeping verification time linear (OK, fine, N log N) in the number of operations. The point is that checking is not NP-hard, as it would normally be!
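
A rough Python sketch of the log checks described above follows. It assumes a hypothetical representation where each operation is a dict with "type", "invoke", "complete", "call", and "result" keys, and where the reference model is a function from (state, call) to (new state, result). None of these names come from the test harness.

```python
import math


def check_log(ops, log):
    """ops: {op_id: {"type": "ok"|"info"|"fail", "invoke": t, "complete": t}}
    log: op ids in the order the system says it applied them."""
    errors = []
    in_log = set(log)
    if len(in_log) != len(log):
        errors.append("duplicate ids in log")
    for op_id, op in ops.items():
        if op["type"] == "ok" and op_id not in in_log:
            errors.append(f"acknowledged op {op_id} missing from log")
    for op_id in log:
        if op_id not in ops or ops[op_id]["type"] not in ("ok", "info"):
            errors.append(f"op {op_id} in log but neither ok nor info")
    # Realtime order: walk the log backwards, tracking the earliest
    # completion time among ops that appear later in the log.
    earliest_later = math.inf
    for op_id in reversed(log):
        op = ops.get(op_id)
        if op is None:
            continue
        if earliest_later < op["invoke"]:
            errors.append(f"op {op_id} precedes an op that finished before it began")
        # Indeterminate (info) ops may complete at any later time.
        done = op["complete"] if op["type"] == "ok" else math.inf
        earliest_later = min(earliest_later, done)
    return errors


def replay(ops, log, apply, state):
    """Replay the log against a reference model; return mismatching op ids."""
    mismatches = []
    for op_id in log:
        op = ops[op_id]
        state, expected = apply(state, op["call"])
        if op["type"] == "ok" and op["result"] != expected:
            mismatches.append(op_id)
    return mismatches


# Toy reference model for T: a counter whose "add" returns the new total.
def counter(state, call):
    new = state + call[1]
    return new, new


ops = {
    1: {"type": "ok", "invoke": 0, "complete": 1, "call": ("add", 2), "result": 2},
    2: {"type": "ok", "invoke": 2, "complete": 3, "call": ("add", 3), "result": 5},
}
good = check_log(ops, [1, 2])
bad = check_log(ops, [2, 1])
mismatches = replay(ops, [1, 2], counter, 0)
```

Each check is a single pass over the operations or the log, which is where the roughly linear cost comes from.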
We also have some fairly sophisticated queue analysis code in Jepsen already, intended for Kafka-style systems. We could implement the basic Kafka-style API inside of Capela: append something to a totally ordered log and get an offset for it, subscribe or assign yourself to a list, and poll elements from the log. This is fairly straightforward to model, and gives us some nice visualizations showing whether elements are lost or reordered. I suspect it's mostly redundant with list-append, but we might see interesting things from the subscription-management side of things.
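
As a rough idea of how small that API is, here is an in-memory Python model of append, assign, and poll. It is only a single-process sketch with made-up class and method names; it does not use any Capela API, and a real version would live in Capela partitions.

```python
class LogModel:
    """A totally ordered log with per-consumer read positions."""

    def __init__(self):
        self.entries = []
        self.positions = {}  # consumer name -> next offset to read

    def append(self, value):
        """Append a value and return the offset it was stored at."""
        self.entries.append(value)
        return len(self.entries) - 1

    def assign(self, consumer, offset=0):
        """Start (or reset) a consumer at the given offset."""
        self.positions[consumer] = offset

    def poll(self, consumer, max_items=10):
        """Return up to max_items (offset, value) pairs and advance."""
        start = self.positions[consumer]
        batch = list(enumerate(self.entries[start:start + max_items], start))
        self.positions[consumer] = start + len(batch)
        return batch


log = LogModel()
first = log.append("a")
second = log.append("b")
log.assign("c1")
batch = log.poll("c1")
```

Because polls return offsets alongside values, a checker can spot lost or reordered elements by comparing polled offsets against the offsets returned by append.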
Copyright © Jepsen, LLC
This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.
This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.