implement server spool to disk & replay #203

Open
tobert opened this issue May 10, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@tobert
Collaborator

tobert commented May 10, 2023

Via some discussion in #183: It might be useful to have otel-cli be able to write spans to disk, and then replay them. This can be handy in scenarios when the network/collector aren't available when the traces are generated, but will be later.

This also creates a path for containers with no networking: they write to a bind-mounted path, then an external otel-cli picks the spans up and sends them along.

spool=$(mktemp -d)

# write incoming spans to the $spool directory (mktemp -d already returns a full path)
otel-cli server spool --directory "$spool" --protocol http/protobuf --endpoint localhost:4318

# or, use otel-cli as usual but point endpoint to disk? this would be nice for debugging...
otel-cli exec --endpoint "file://$spool" -- sleep 1

# read all the spans in this directory and send them upstream
otel-cli server replay --directory "$spool" --endpoint https://my-collector:9999

Seems like plain flock() should be sufficient for locking. I think I would only support protobuf to avoid type erasure. I started looking at how much work it is, and it's not too bad, and not super invasive. I did the first bits to expose the raw protobuf spans and events through CliEvent here: https://github.com/equinix-labs/otel-cli/tree/spool-spans-to-disk
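
As a rough illustration of the flock() idea, here is a minimal shell sketch that serializes access to a spool directory using flock(1) from util-linux. The helper name `with_spool_lock` and the `.lock` file name are made up for this example and are not part of otel-cli.

```shell
# Hypothetical helper, not part of otel-cli: run a command while holding
# an exclusive lock on <spool_dir>/.lock. Assumes flock(1) from util-linux.
# The function body is a subshell, so the lock is released when it exits.
with_spool_lock() (
  spool_dir=$1
  shift
  # Open the lock file on fd 9 for the lifetime of this subshell.
  exec 9>"${spool_dir}/.lock"
  # Block until the exclusive lock is granted, then run the command.
  flock -x 9 || exit 1
  "$@"
)
```

A writer could then wrap each write, for example `with_spool_lock "$spool" cp span.pb "$spool/"`, and a replayer could wrap its directory scan the same way.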

@garthk did I understand your use case correctly?

@tobert tobert added the enhancement New feature or request label May 10, 2023
tobert pushed a commit that referenced this issue May 10, 2023
I probably should remove CliEvent entirely but that's a LOT of work so
not now. This change makes the raw span & events available downstream
without breaking things for now, to make the spool idea in #203 easier.

Also use a more descriptive name for the tracepb import.
@garthk

garthk commented May 11, 2023

My use case would be covered by otel-cli writing to files in a directory if it can't deliver to the Collector as configured, and then the Collector reading from those files. I'd have my own format in mind (long story), so I'd want to bundle the CLI and Collector with matched exporters and receivers.

To get events in rough order, I'd name the files so that their lexical order roughly matches the time order. A KSUID would do, or zero-padded microseconds since epoch followed by a delimiter and the writer's PID to avoid collisions.
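
A minimal sketch of the second naming option, assuming a 20-digit pad width and `-` as the delimiter (both choices, and the function names, are hypothetical). Zero-padding is what makes lexical order agree with numeric order: without it, `10-1` would sort before `9-2`.

```shell
# Hypothetical helper: build a spool file name from microseconds since the
# epoch ($1) and a writer PID ($2). The timestamp is zero-padded to a fixed
# width so that sorting names lexically also sorts them by time.
spool_name() {
  printf '%020d-%d\n' "$1" "$2"
}

# Current time in microseconds. %N is a GNU date extension, so this
# assumes GNU coreutils; other platforms need a different clock source.
now_micros() {
  date +%s%6N
}

# Usage (name for a file written by the current shell):
#   spool_name "$(now_micros)" "$$"
```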

@tobert
Collaborator Author

tobert commented May 11, 2023

[c781adf](/equinix-labs/otel-cli/commit/c781adfbe8b98594ea4122d24f0046ce02623952)

Write-to-file on failure would be a later feature, I think. First would be adding the spooler and replayer, which solve some of those problems and some I've dealt with IRL. Later, maybe I can add alternate endpoints for failover. I'm reluctant to do this, though, because otel-cli explicitly relies on the collector for most fault management, and doing so removes a lot of complexity from the tool.

The disk format I have in mind is files named ${trace_id}-${span_id} with -event-${ts} added for events. The data is just the otel protobuf serialized in binary to preserve all types and info. The server json functionality already does something like this and it seems to work ok.
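
To make the naming scheme concrete, here is a small sketch. The function names are hypothetical, and it assumes IDs are passed as lowercase hex strings (32 characters for a trace ID, 16 for a span ID); the unit of `ts` is left open, as it is above.

```shell
# Hypothetical helpers mirroring the ${trace_id}-${span_id} naming scheme.
# span_file <trace_id> <span_id>
span_file() {
  printf '%s-%s\n' "$1" "$2"
}

# event_file <trace_id> <span_id> <ts>
# Events reuse the span's name with "-event-<ts>" appended.
event_file() {
  printf '%s-event-%s\n' "$(span_file "$1" "$2")" "$3"
}

# Classify a spool file as "event" or "span" from its name alone,
# which is what a replayer would need when walking the directory.
spool_kind() {
  case "$1" in
    *-event-*) echo event ;;
    *) echo span ;;
  esac
}
```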

Why does ordering matter? As I understand things, I can send unordered spans up to most observability tools and they'll use the internal timestamps to do ordering. The collector doesn't care, Honeycomb doesn't. Ordering is really hard to get just right so I'd rather avoid it altogether if I can. Using trace & span id is simple and light.

I'll probably build this even if it's not exactly what you need right now, because it seems useful. I still intend to move away from using the collector's exporters because they have too many side effects (e.g. they read environment variables and I can't block that) and features otel-cli doesn't need.

@tobert
Collaborator Author

tobert commented May 12, 2023

Back when I was doing tracing work on Tinkerbell, a hard problem was networking during OS installation across root pivots, chroots, containers, and so on. I think the spool idea really shines here.

cc @nshalman

So what if:

# in early boot... and as bind or volume mounts into containers & chroots...
export otel_spool_path="/tmp/otel"
export OTEL_EXPORTER_OTLP_ENDPOINT="file://${otel_spool_path}"
mkdir -p "${otel_spool_path}"

# you could use a tmpfs esp if crossing user boundaries
mount -t tmpfs -o size=100M,mode=1777 none "${otel_spool_path}"
# bind mounts work great across chroots like the example below
mount -o bind "${otel_spool_path}" "/mnt${otel_spool_path}"
# docker volumes make it so you can trace containers with no networking
# (docker options such as --env must come before the image name)
docker run -ti --volume "${otel_spool_path}:${otel_spool_path}" \
  --env OTEL_EXPORTER_OTLP_ENDPOINT="${OTEL_EXPORTER_OTLP_ENDPOINT}" \
  image:tag \
  otel-cli exec --name "look at me, I'm a container!" \
  /installer.sh

# anywhere in the code, without worrying one bit about networking:
otel-cli exec --name "chroot into mountpoint" \
  /bin/env OTEL_EXPORTER_OTLP_ENDPOINT="${OTEL_EXPORTER_OTLP_ENDPOINT}" \
  /sbin/chroot /mnt \
    /bin/env OTEL_EXPORTER_OTLP_ENDPOINT="${OTEL_EXPORTER_OTLP_ENDPOINT}" \
    /bin/otel-cli exec \
      --endpoint "${OTEL_EXPORTER_OTLP_ENDPOINT}" \
      --name "run the installer inside the chroot" \
      /installer.sh install

This creates a bunch of protobuf files. So later, when networking is on you can do:

# read all the files on disk, send them to the upstream OTLP endpoint, delete the file after
otel-cli replay \
  --delete \
  --spool-path "${otel_spool_path}" \
  --endpoint grpc://otel-collector.mydomain.com

note: env endpoint duplication can go away soon when I finish some other work on envvars and exporters
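
The replay-then-delete behaviour could look roughly like the loop below. This is a sketch, not otel-cli's implementation: `replay_spool` and `send_otlp_http` are hypothetical names, and the curl sender assumes each file already holds a complete OTLP/HTTP export request, which may not match a per-span on-disk format.

```shell
# Hypothetical sketch of a replay loop.
# Usage: replay_spool <spool_dir> <send_cmd> [args...]
# <send_cmd> is invoked with the file path appended as its last argument
# and must exit 0 only when the upstream accepted the data.
replay_spool() (
  spool_dir=$1
  shift
  failed=0
  # The glob skips dotfiles, so a lock file such as .lock is left alone.
  for f in "${spool_dir}"/*; do
    [ -f "$f" ] || continue        # empty directory or non-regular entry
    if "$@" "$f"; then
      rm -f -- "$f"                # delete only after a successful send
    else
      failed=$((failed + 1))       # keep the file for the next attempt
    fi
  done
  # Succeed only if every file was sent.
  [ "$failed" -eq 0 ]
)

# Example sender: POST one file to an OTLP/HTTP traces endpoint.
# send_otlp_http <base_url> <file>
send_otlp_http() {
  curl --fail --silent --show-error \
    --header 'Content-Type: application/x-protobuf' \
    --data-binary "@$2" "$1/v1/traces"
}

# Usage:
#   replay_spool "${otel_spool_path}" send_otlp_http https://my-collector:4318
```

Deleting only after the sender reports success means a failed replay can simply be re-run later, at the cost of possible duplicates if a send succeeded but the delete did not.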

@tobert
Collaborator Author

tobert commented Jun 14, 2023

While working on #205 I noticed that the specs have a section on exporting to files: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/file-exporter.md

@howardjohn

Along with the file exporter is the file receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filereceiver.

I haven't tried it myself, but my understanding is this would allow exporting to a file, then later reading it with the collector and exporting to wherever the collector supports.

@garthk

garthk commented Jul 30, 2023

Catching up late… the dev file exporter and receiver could be convenient for this use case of otel-cli queueing its own records until the Collector is [back] up. We'd benefit from adding wildcard matching to the receiver so the Collector could pick up its own records.

@tobert
Collaborator Author

tobert commented Jul 30, 2023 via email
