Improve Neva performance using goperf.dev as a reference #1067

@emil14

Go Optimization Guide Map for Neva

This issue uses goperf.dev as an external reference and maps its performance themes to the Neva codebase.

The goal is not to blindly optimize everything. The goal is to identify the places where the guide's ideas are the best fit for Neva's compiler, runtime, CLI, and standard library, and to track related work in one place.

Status as of March 22, 2026

This issue should be treated as an umbrella performance map, not as a flat todo list.

Some of the highest-priority ideas here already have concrete in-flight implementation work:

  • #1004 Redesign runtime Msg
    • Directly targets the runtime.Msg boxing / message-overhead track.
    • Replaces the old interface-based runtime message representation with a tagged-union design.
    • Adds focused runtime microbenchmarks for message operations.
  • #1023 Add runtime benchmark baseline before Msg redesign
    • Adds the benchmark baseline needed to compare #1004 and future runtime changes.
    • Expands benchmarking beyond the old single benchmarks/message_passing case into a broader e2e suite under benchmarks/runtime_bench/**.
  • #996 Proposal: Optimize Message representation via native Go types
    • Still relevant as a likely follow-up direction after #1004, especially for native composite storage / fast paths.
  • #1030
    • Related correctness work for JSON formatting in the same serialization area touched by #1004.
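
For readers unfamiliar with the term, the sketch below illustrates what a tagged-union message can look like in Go. It is not the design from #1004: the type `Msg`, the `Kind` constants, and all constructors and accessors are hypothetical names chosen for illustration. The point is only that a single concrete struct can carry scalars through a channel without wrapping each one in an interface value.

```go
package main

import "fmt"

// Kind tags which payload field of Msg is meaningful.
// Every name in this sketch is hypothetical and not taken from the Neva runtime.
type Kind uint8

const (
	KindInt Kind = iota
	KindBool
	KindString
)

// Msg is a tagged union: one concrete struct instead of an interface value.
type Msg struct {
	kind Kind
	num  int64  // payload for KindInt, and 0 or 1 for KindBool
	str  string // payload for KindString
}

func IntMsg(v int64) Msg     { return Msg{kind: KindInt, num: v} }
func StringMsg(v string) Msg { return Msg{kind: KindString, str: v} }

func BoolMsg(v bool) Msg {
	if v {
		return Msg{kind: KindBool, num: 1}
	}
	return Msg{kind: KindBool}
}

func (m Msg) Kind() Kind  { return m.kind }
func (m Msg) Int() int64  { return m.num }
func (m Msg) Bool() bool  { return m.num != 0 }
func (m Msg) Str() string { return m.str }

func main() {
	// Messages are sent by value; the channel element type is the struct itself.
	ch := make(chan Msg, 3)
	ch <- IntMsg(42)
	ch <- BoolMsg(true)
	ch <- StringMsg("hi")
	close(ch)

	for m := range ch {
		switch m.Kind() {
		case KindInt:
			fmt.Println("int:", m.Int())
		case KindBool:
			fmt.Println("bool:", m.Bool())
		case KindString:
			fmt.Println("string:", m.Str())
		}
	}
}
```

The trade-off in a layout like this is a larger fixed-size struct per message in exchange for fewer heap allocations; whether that pays off is exactly what the benchmarks are meant to show.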

Important status note: as of March 22, 2026, neither #1004 nor #1023 is merged into main, so this issue should distinguish between:

  • work already prototyped in open PRs
  • work still missing from the repository default branch

Reading the map

  • Fit = how relevant the pattern is to Neva today.
  • Priority = suggested order for investigation.
  • Candidate = a concrete improvement idea for this repository.
  • Status = whether the work is already in main, being prototyped in an open PR, or still backlog.

1. Common Go performance patterns mapped to Neva

| goperf topic | Neva fit | Priority | Status | Candidate for Neva |
| --- | --- | --- | --- | --- |
| Avoiding Interface Boxing | Very high | P0 | In progress via #1004 | Rework runtime.Msg hot paths, then remeasure and decide whether Neva still needs specialized scalar fast paths, slimmer stream-item representations, or native-composite follow-ups from #996. |
| Memory Efficiency and Go's Garbage Collector | Very high | P0 | Partially covered by #1004 and #1023 | Use the new microbench/e2e baselines to review per-message wrappers, repeated StructMsg construction/copying, and StructMsg.Get scans before introducing pools or unsafe tricks. |
| Stack Allocations and Escape Analysis | Very high | P0 | Not yet done | Run -gcflags=all=-m=2 over runtime/backend packages on top of the benchmark baseline from #1023 and the runtime changes from #1004. |
| Memory Preallocation | High | P0 | Backlog | Extend existing preallocation discipline to runtime helpers such as stream_zip_many, array-port helpers, and suspicious len(chans)^2 capacity calculations. |
| Zero-Copy Techniques | High | P1 | Backlog | Audit stdlib/runtime I/O. http_get still eagerly reads the full response body, and transport semantics around bytes vs string should stay explicit. |
| Goroutine Worker Pools | High | P1 | Backlog | Review fan-out/fan-in helpers that spawn goroutines per slot or per iteration (ReceiveAll, SendAll, struct_builder, stream_zip_many, runtime call orchestration). Neva's model is intentionally parallel, so the goal is bounded concurrency for high-cardinality helpers, not “less concurrency everywhere”. |
| Batching Operations | High | P1 | Backlog | Add batched variants or internal batching only where benchmarks show per-message overhead dominates useful work. |
| Immutable Data Sharing | High | P1 | Backlog | Document and enforce where slices/maps inside messages are treated as immutable payloads versus defensively copied payloads. This is especially relevant for transport-oriented payloads and composite messages. |
| Atomic Operations and Synchronization Primitives | Medium | P2 | Backlog | Measure whether the global send-order counter becomes a contention point under fan-out heavy benchmarks and whether ordering can be made optional on paths that never call Select. |
| Efficient Context Management | Medium | P2 | Backlog | The runtime creates/cancels contexts in Call, and stdlib network code still does not thread request-scoped context into http_get. |
| Efficient Buffering | Medium | P2 | Backlog | Most compiler codegen already uses bytes.Buffer / strings.Builder. Higher-value follow-up is in stdlib I/O and transport-facing components. |
| Struct Field Alignment | Medium | P2 | Backlog | Run a focused field-alignment pass on runtime/compiler structs and compare against the existing betteralign workflow from Makefile. |
| Object Pooling | Medium | P2 | Backlog | Benchmark whether runtime hot paths allocate enough temporary slices/messages to justify scratch-buffer pooling. Avoid applying pools to compiler codegen before measurement. |
| Lazy Initialization | Medium | P3 | Backlog | Cache/lazily initialize cold shared state only if startup profiling shows it matters. |
| Leveraging Compiler Optimization Flags | Already covered well | P3 | Mostly in main | Keep current release defaults; possible follow-up is a dedicated debug profile with -gcflags="all=-N -l". |
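
As a concrete picture of the Memory Preallocation row, here is a minimal sketch of a fan-in helper that sizes its result from the number of input channels. `collectOne` is a hypothetical function written for this example; it is not stream_zip_many or any existing Neva helper, and it assumes every channel eventually yields a value.

```go
package main

import "fmt"

// collectOne receives one value from every channel, in slot order.
// Hypothetical helper: the result is sized from len(chans), the only bound
// the loop needs, so append never has to grow the backing array.
func collectOne(chans []chan int) []int {
	out := make([]int, 0, len(chans))
	for _, ch := range chans {
		out = append(out, <-ch)
	}
	return out
}

func main() {
	chans := make([]chan int, 4)
	for i := range chans {
		chans[i] = make(chan int, 1)
		chans[i] <- i * 10
	}

	got := collectOne(chans)
	fmt.Println(got, "len:", len(got), "cap:", cap(got))
}
```

A capacity derived from len(chans) squared would still be correct, just wasteful, which is why such calculations are worth a second look rather than an automatic fix.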

2. Networking and diagnostics topics: which ones matter to Neva

Most of goperf's networking section targets long-lived Go services. Neva is primarily a compiler/runtime toolchain, so these topics are secondary, but not irrelevant.

| goperf topic | Relevance to Neva | Status / candidate |
| --- | --- | --- |
| Benchmarking and Load Testing for Networked Go Apps | High as methodology, even outside networking | #1023 is the first concrete step here. Follow-up: land it, document how results are tracked, and pair it with lower-level runtime benchmarks from #1004. |
| Practical Example: Profiling Networked Go Applications with pprof | High as diagnostics reference | Add a short perf-playbook doc for Neva: go test -bench, -benchmem, CPU profiles, alloc profiles, and escape-analysis reports on runtime/compiler packages. |
| Efficient Use of net/http, net.Conn, and UDP | Medium | Refactor http_get to use a reusable http.Client, explicit timeouts, request context propagation, and possibly streaming variants. |
| Managing 10K+ Concurrent Connections | Low today | Keep as background reading unless Neva grows daemonized services, language servers, or remote execution endpoints inside this repo. |
| Scheduler / epoll / TLS / DNS / connection lifecycle pages | Low today | Background reading only for future infrastructure work. |
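
To make the http_get candidate more tangible, the sketch below shows a context-aware GET that goes through one shared http.Client with an explicit timeout. It is an illustration only: `httpGet`, `sharedClient`, `demo`, and the timeout values are assumptions made for this example and do not describe the current stdlib implementation. The demo runs against a local httptest server so that it needs no external network.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// sharedClient is created once and reused so that connections can be pooled.
// The 10-second timeout is an arbitrary placeholder value.
var sharedClient = &http.Client{Timeout: 10 * time.Second}

// httpGet is a hypothetical helper, not Neva's http_get. It ties the request
// to ctx, so cancelling ctx aborts the request.
func httpGet(ctx context.Context, url string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := sharedClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	// Returns bytes, not a string, to keep the transport semantics explicit.
	return io.ReadAll(resp.Body)
}

// demo starts a local test server and fetches from it.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	body, err := httpGet(ctx, srv.URL)
	return string(body), err
}

func main() {
	body, err := demo()
	fmt.Println(body, err)
}
```

This version still reads the whole body with io.ReadAll. A streaming variant would hand resp.Body to the caller instead and make the caller responsible for closing it.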

3. What Neva already does well

Several goperf ideas are already visible either in main or in active related PRs:

  • The compiler already preallocates maps/slices in analysis and code generation paths where sizes are known ahead of time.
  • main already has a benchmark entry point in benchmarks/message_passing, and #1023 expands that into a broader runtime benchmark suite.
  • Release builds already use size-oriented linker flags and reproducibility-friendly build flags.
  • Runtime code already uses strings.Builder, bytes.Buffer, and atomics in a few targeted places instead of overusing them everywhere.

That means the highest-value work is not “introduce basic Go performance hygiene”. The highest-value work is to put numbers around the runtime's message-passing cost model and then use those numbers to drive the next runtime changes.

4. Recommended sequence for Neva

P0 — measure and reduce message overhead

  1. Rebase and land the benchmark/message work already in flight:
    • #1023 for e2e runtime benchmark baselines.
    • #1004 for runtime Msg redesign and lower-level runtime benchmarks.
  2. Run go test -bench=. -benchmem on the touched runtime packages and benchmark suites.
  3. Run escape-analysis reports (-gcflags=all=-m=2) for runtime/backend packages.
  4. Use the results to decide whether boxing, copying, or goroutine churn is the dominant cost.
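
Step 2 normally runs against `_test.go` files through `go test -bench=. -benchmem`. For a quick standalone experiment, the sketch below uses `testing.Benchmark` from a regular program to compare a channel of interface values with a channel of a concrete type. The benchmark functions and the `run` helper are hypothetical and deliberately tiny; they stand in for the real suites from #1023 and #1004, which this example does not reproduce.

```go
package main

import (
	"fmt"
	"testing"
)

// sinkInt keeps the accumulated result alive so the loops are not optimized away.
var sinkInt int64

// benchChanAny sends int64 values through a channel of interface values,
// which requires converting each value to an interface (boxing).
func benchChanAny(b *testing.B) {
	ch := make(chan any, 1)
	b.ReportAllocs()
	var sum int64
	for i := 0; i < b.N; i++ {
		ch <- int64(i) + 1<<20
		sum += (<-ch).(int64)
	}
	sinkInt = sum
}

// benchChanInt sends the same values through a channel of a concrete type.
func benchChanInt(b *testing.B) {
	ch := make(chan int64, 1)
	b.ReportAllocs()
	var sum int64
	for i := 0; i < b.N; i++ {
		ch <- int64(i) + 1<<20
		sum += <-ch
	}
	sinkInt = sum
}

// run executes both benchmarks and returns their results.
func run() (boxed, plain testing.BenchmarkResult) {
	return testing.Benchmark(benchChanAny), testing.Benchmark(benchChanInt)
}

func main() {
	boxed, plain := run()
	// String reports iterations and ns/op; MemString reports B/op and allocs/op.
	fmt.Println("chan any:  ", boxed.String(), boxed.MemString())
	fmt.Println("chan int64:", plain.String(), plain.MemString())
}
```

The numbers depend on the Go version and machine, so none are quoted here. The useful signal is the difference in allocs/op between the two lines, read together with the escape-analysis output from step 3.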

P1 — fix hot runtime patterns

  1. Audit goroutine-per-slot helpers and compare them with bounded worker patterns.
  2. Audit byte/string conversions and whole-buffer I/O APIs.
  3. Introduce batching only where benchmarks show that message granularity dominates useful work.
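
For comparison in the first item, a bounded worker pattern can look like the sketch below: a fixed number of goroutines pull slot indexes from a channel instead of one goroutine being spawned per slot. `mapBounded` and its parameters are hypothetical and exist only for this illustration; the helper does not model ReceiveAll, SendAll, or any other Neva function.

```go
package main

import (
	"fmt"
	"sync"
)

// mapBounded applies f to every element of in using at most `workers`
// goroutines. Hypothetical helper for illustration only.
func mapBounded(in []int, workers int, f func(int) int) []int {
	if workers < 1 {
		workers = 1 // guard: with zero workers nobody would drain idx
	}
	out := make([]int, len(in))
	idx := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range idx {
				out[i] = f(in[i]) // each index is written by exactly one worker
			}
		}()
	}

	for i := range in {
		idx <- i
	}
	close(idx)
	wg.Wait()
	return out
}

func main() {
	square := func(v int) int { return v * v }
	fmt.Println(mapBounded([]int{1, 2, 3, 4, 5}, 2, square))
}
```

One caveat worth checking during the audit: bounding is only safe when the per-slot work cannot block on its peers. If each slot performs a blocking send and the receiver drains slots in its own order, limiting the number of in-flight goroutines could deadlock where goroutine-per-slot would not.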

P2 — smaller structural wins

  1. Field-alignment pass on selected structs.
  2. Ordering/atomic contention review for the global runtime counter.
  3. Context propagation and timeout cleanup for stdlib HTTP.
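
As background for the field-alignment item, this small sketch shows how field order alone changes struct size. The `padded` and `compact` types are made up for the example and are not Neva structs; a tool such as betteralign automates the same reordering.

```go
package main

import (
	"fmt"
	"unsafe"
)

// padded interleaves small and large fields, so the compiler inserts
// padding before b and after c to satisfy alignment.
type padded struct {
	a bool
	b int64
	c bool
}

// compact holds the same fields with the largest first, which lets the
// two bools share the tail of the struct.
type compact struct {
	b int64
	a bool
	c bool
}

// sizes reports the in-memory size of both layouts on the current platform.
func sizes() (paddedSize, compactSize uintptr) {
	return unsafe.Sizeof(padded{}), unsafe.Sizeof(compact{})
}

func main() {
	p, c := sizes()
	fmt.Printf("padded=%d bytes, compact=%d bytes\n", p, c)
}
```

On common 64-bit targets such as amd64 and arm64 the two layouts occupy 24 and 16 bytes respectively. The saving only matters for structs that are allocated or copied in large numbers, which is why the item is limited to selected structs.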

P3 — polish and tooling

  1. Add a documented perf-playbook / debug profile.
  2. Cache/lazily initialize cold shared state only if startup profiling shows it matters.
  3. Keep this issue updated as concrete benchmark and runtime work lands.

5. Suggested first issues / PRs

  1. Rebase and land the benchmark/message work already in flight (#1023 and #1004).
  2. Add a perf-playbook doc (bench, benchmem, pprof, escape analysis) that explains how to use the new benchmarks and how to compare results before/after runtime changes.
  3. Run escape-analysis and allocation review on top of #1004 / #1023.
  4. Review goroutine-per-slot helpers for bounded-concurrency alternatives.
  5. Refactor http_get to use context-aware requests and a reusable client, while keeping transport semantics explicit (bytes vs string) and aligning with the current stdlib direction.

Source reference

Primary source: goperf.dev

Key reference pages used while preparing this map:
