Commit 9c35f0e

Fix README shenanigans
1 parent 7214019 commit 9c35f0e

1 file changed

+186
-0
lines changed

stringleton/README.md

@@ -0,0 +1,186 @@
# Stringleton

Extremely efficient string interning solution for Rust crates.

*String interning:* The technique of representing all strings which are equal by
a pointer or ID that is unique to the *contents* of those strings, such that an
O(n) string equality check becomes an O(1) pointer equality check.

Interned strings in Stringleton are called "symbols", in the tradition of Ruby.
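
To make the definition concrete, here is a minimal sketch of a generic interner built only on the standard library. It is not how Stringleton is implemented: the names `Interned` and `intern` are hypothetical, the registry is a plain mutex-guarded `HashSet`, and the handle is a two-word `&'static str` rather than a single pointer. It only illustrates how equal contents end up behind one pointer, so that equality never has to read the bytes.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

/// Hypothetical interned handle (illustration only, not Stringleton's `Symbol`).
#[derive(Clone, Copy, Debug)]
pub struct Interned(&'static str);

impl Interned {
    pub fn as_str(self) -> &'static str {
        self.0
    }
}

impl PartialEq for Interned {
    fn eq(&self, other: &Self) -> bool {
        // Compares address and length only; the string bytes are never read.
        std::ptr::eq(self.0, other.0)
    }
}

impl Eq for Interned {}

/// Returns the canonical handle for `s`, allocating only the first time
/// a given string is seen.
pub fn intern(s: &str) -> Interned {
    static REGISTRY: OnceLock<Mutex<HashSet<&'static str>>> = OnceLock::new();
    let mut set = REGISTRY
        .get_or_init(|| Mutex::new(HashSet::new()))
        .lock()
        .unwrap();
    if let Some(&existing) = set.get(s) {
        return Interned(existing);
    }
    // Leaked on purpose: interned strings live for the rest of the program.
    let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
    set.insert(leaked);
    Interned(leaked)
}
```

Interning `"hello"` twice, even from two different allocations, yields handles that compare equal by pointer, while the O(n) content comparison happens once inside the registry lookup.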

## Distinguishing characteristics

- Ultra fast: Getting the string representation of a `Symbol` is a lock-free
  memory load. No reference counting or atomics involved.
- Symbol literals (`sym!(...)`) are "free" at the call-site. Multiple
  invocations with the same string value are eagerly reconciled on program
  startup using linker tricks.
- Symbols are tiny. Just a single pointer - 8 bytes on 64-bit platforms.
- Symbols are trivially copyable - no reference counting.
- No size limit - symbol strings can be arbitrarily long (i.e., this is not a
  "small string optimization" implementation).
- Debugger friendly: If your debugger is able to display a plain Rust `&str`, it
  is capable of displaying `Symbol`.
- Dynamic library support: Symbols can be passed across dynamic linking
  boundaries (terms and conditions apply - see the documentation of
  `stringleton-dylib`).
- `no_std` support: `std` synchronization primitives used in the symbol registry
  can be replaced with `once_cell` and `spin`. *See below for caveats.*
- `serde` support - symbols are serialized/deserialized as strings.
- Fast bulk-insertion of symbols at runtime.
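
The "single pointer" and "trivially copyable" points can be illustrated with plain Rust. The sketch below assumes one possible layout, a thin reference to a `&'static str` stored elsewhere. The type `ThinHandle` and the static `HELLO` are hypothetical and are not taken from Stringleton's source. It shows that such a handle is one machine word, whereas a `&str` itself is two (pointer plus length).

```rust
use std::mem::size_of;

/// Hypothetical thin handle: a reference to a `&'static str` that lives
/// somewhere else (for example in a registry). Illustration only.
#[derive(Clone, Copy)]
pub struct ThinHandle(&'static &'static str);

impl ThinHandle {
    /// One pointer dereference yields the full string slice.
    pub fn as_str(self) -> &'static str {
        *self.0
    }
}

/// Stand-in for a registry entry holding the actual string slice.
static HELLO: &str = "hello";

pub fn hello() -> ThinHandle {
    ThinHandle(&HELLO)
}

/// Size of the handle in bytes: one pointer.
pub fn handle_size() -> usize {
    size_of::<ThinHandle>()
}
```

Because the handle is a non-null reference, `Option<ThinHandle>` is also one word wide in this sketch.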

## Good use cases

- You have lots of little strings that you need to frequently copy and compare.
- Your strings come from trusted sources.
- You want good debugger support for your symbols.

## Bad use cases

- You have an unbounded number of distinct strings, or strings coming from
  untrusted sources. Since symbols are never garbage-collected, this is a source
  of memory leaks, which is a denial-of-service hazard.
- You need a bit-stable representation of symbols that does not change between
  runs.
- Consider if `smol_str` or `cowstr` is a better fit for such use cases.
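
If untrusted input has to be turned into symbols anyway, one way to keep the set of interned strings bounded is to accept only strings from a fixed vocabulary. The sketch below is an assumption about how an application might do that. It uses no Stringleton API, and `KNOWN_COMMANDS` and `canonical_command` are hypothetical names.

```rust
/// Hypothetical fixed vocabulary of strings the application accepts.
const KNOWN_COMMANDS: &[&str] = &["get", "put", "delete"];

/// Maps untrusted input to the matching `'static` string from the
/// vocabulary, or rejects it. Nothing is allocated or leaked for
/// unknown input.
pub fn canonical_command(input: &str) -> Option<&'static str> {
    KNOWN_COMMANDS.iter().copied().find(|known| *known == input)
}
```

Only the `Some` results would then be interned, so the number of distinct symbols cannot exceed the size of the vocabulary.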

## Usage

Add `stringleton` as a dependency of your project, and then you can do:

```rust,ignore
use stringleton::{sym, Symbol};

// Enable the `sym!()` macro in the current crate. This should go at the crate root.
stringleton::enable!();

fn main() {
    // Symbols can be written as identifiers or as string literals.
    let foo: Symbol = sym!(foo);
    let foo2 = sym!(foo);
    let bar = sym!(bar);
    let message = sym!("Hello, World!");
    let message2 = sym!("Hello, World!");

    // Equal contents give equal symbols, backed by the same string data.
    assert_eq!(foo, foo2);
    assert_eq!(bar.as_str(), "bar");
    assert_eq!(message, message2);
    assert_eq!(message.as_str().as_ptr(), message2.as_str().as_ptr());
}
```

## Crate features

- **std** *(enabled by default)*: Use synchronization primitives from the
  standard library. Implies `alloc`. When disabled, `critical-section` and
  `spin` must both be enabled *(see below for caveats)*.
- **alloc** *(enabled by default)*: Support creating symbols from `String`.
- **serde**: Implements `serde::Serialize` and `serde::Deserialize` for symbols,
  which will be serialized/deserialized as plain strings.
- **debug-assertions**: Enables expensive debugging checks at runtime - mostly
  useful to diagnose problems in complicated linker scenarios.
- **critical-section**: When `std` is not enabled, this enables `once_cell` as a
  dependency with the `critical-section` feature enabled. Only relevant in
  `no_std` environments. *[See `critical-section` for more
  details.](https://docs.rs/critical-section/latest/critical_section/)*
- **spin**: When `std` is not enabled, this enables `spin` as a dependency,
  which is used to obtain global read/write locks on the symbol registry. Only
  relevant in `no_std` environments (and is a pessimization in other
  environments).

## Efficiency

Stringleton tries to be as efficient as possible, but it may make different
tradeoffs than other string interning libraries. In particular, Stringleton is
optimized towards making the use of the `sym!(...)` macro practically free.

Consider this function:

```rust,ignore
fn get_symbol() -> Symbol {
    sym!("Hello, World!")
}
```

This compiles into a single load instruction. Using `cargo disasm` on x86-64
(Linux):

```asm
get_symbol:
  8bf0    mov rax, qword ptr [rip + 0x52471]
  8bf7    ret
```

This is "as fast as it gets", but the price is that all symbols in the program
are deduplicated when the program starts. Any theoretically faster solution
would need fairly deep cooperation from the compiler aimed at this specific use
case.

Also, symbol literals are *always* a memory load. The compiler cannot perform
optimizations based on the contents of symbols, because it doesn't know how they
will be reconciled until link time. For example, while `sym!(a) != sym!(a)` is
always false, the compiler cannot eliminate code paths relying on that.
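
The startup deduplication described above can be pictured with a small simulation. This is a sketch under stated assumptions, not Stringleton's mechanism. The real crate collects call sites through linker tricks, while here they are just an explicit slice. `CallSite`, `reconcile`, and `fresh` are hypothetical names. Each slot starts out pointing at its own copy of the string, and one pass redirects every slot to a canonical copy.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the static slot behind one `sym!(...)` call site.
pub struct CallSite {
    pub text: &'static str,
}

/// Rewrites every slot so that equal contents share one canonical pointer.
/// In this simulation it must run once before any slot is read.
pub fn reconcile(sites: &mut [CallSite]) {
    let mut canonical: HashMap<&'static str, &'static str> = HashMap::new();
    for site in sites.iter_mut() {
        // The first occurrence of a string becomes canonical;
        // later occurrences are redirected to it.
        site.text = *canonical.entry(site.text).or_insert(site.text);
    }
}

/// Demo helper: a fresh heap allocation, so that equal strings are
/// guaranteed to start out at different addresses.
pub fn fresh(s: &str) -> &'static str {
    Box::leak(s.to_owned().into_boxed_str())
}
```

After `reconcile` has run, reading a slot is a plain load of an already-canonical pointer, which is the property the `sym!(...)` call site relies on. It also shows why the compiler cannot fold comparisons at compile time: the final pointer values only exist after this pass.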

## Dynamic libraries

Stringleton relies on magical linker tricks (supported by `linkme` and `ctor`)
to minimize the cost of the `sym!(...)` macro at runtime. These tricks are
broadly compatible with dynamic libraries, but there are a few caveats:

1. When a Rust `dylib` crate appears in the dependency graph, and it has
   `stringleton` as a dependency, things should "just work", due to Rust's
   [linkage rules](https://doc.rust-lang.org/reference/linkage.html).
2. When a Rust `cdylib` crate appears in the dependency graph, Cargo seems to be
   a little less clever, and the `cdylib` dependency may need to use the
   `stringleton-dylib` crate instead. Due to Rust's linkage rules, this will
   cause the "host" crate to also link dynamically with Stringleton, and
   everything will continue to work.
3. When a library is loaded dynamically at runtime, and it does not appear in
   the dependency graph, the "host" crate must be prevented from linking
   statically to `stringleton`, because it would either cause duplicate symbol
   definitions, or worse, the host and client binaries would disagree about
   which `Registry` to use. To avoid this, the *host* binary can use
   `stringleton-dylib` explicitly instead of `stringleton`, which forces dynamic
   linkage of the symbol registry.
4. Dynamically *unloading* libraries is extremely risky (`dlclose()` and
   similar). Unloading a library that has any calls to the `sym!(..)` or
   `static_sym!(..)` macros is instant UB. Such a library can in principle use
   `Symbol::new()`, but probably not `Symbol::new_static()`.

To summarize:

1. When no dynamic libraries are present in the project, it is always best to
   use `stringleton` directly.
2. When only normal Rust dynamic libraries (`crate-type = ["dylib"]`) are
   present, it is also fine to use `stringleton` directly - Cargo and rustc will
   figure out how to link things correctly.
3. `cdylib` dependencies should use `stringleton-dylib`. The host can use
   `stringleton`.
4. When loading dynamic libraries at runtime, both sides should use
   `stringleton-dylib` instead of `stringleton`.
5. Do not unload dynamic libraries at runtime unless you are really, really sure
   what you are doing.

## `no_std` caveats

Stringleton works in `no_std` environments, but it does fundamentally require
two things:

1. Allocator support, in order to maintain the global symbol registry. The
   registry is a `hashbrown` hash map.
2. Some synchronization primitives to control access to the global symbol
   registry when new symbols are created.

The latter can be supported by the `spin` and `critical-section` features:

- `spin` replaces `std::sync::RwLock`, and is almost always a worse choice when
  `std` is available.
- `critical-section` replaces `std::sync::OnceLock` with
  [`once_cell::sync::OnceCell`](https://docs.rs/once_cell/latest/once_cell/sync/struct.OnceCell.html),
  and enables the `critical-section` feature of `once_cell`. Using
  `critical-section` requires additional work, because you must manually link in
  a crate that provides the relevant synchronization primitive for the target
  platform.

Do not use these features unless you are familiar with the tradeoffs.

## Name

The name is a portmanteau of "string" and "singleton".
