Skip to content

Apache Arrow Substrait deserialization reaches Unreachable() / std::abort() on a SimpleExtensionDeclaration with an unset mapping_type oneof #50231

@OwenSanzas

Description

@OwenSanzas

Summary

A 2-byte Substrait protobuf (12 00) fed to the public
arrow::engine::DeserializePlan API aborts the process. A
SimpleExtensionDeclaration whose mapping_type oneof is left unset falls
through to default: Unreachable() in GetExtensionSetFromMessage, which is an
unconditional std::abort(). The same root cause is reachable through
arrow::engine::DeserializeExpressions (via the ExtendedExpression entry). Any
service that deserializes an untrusted Substrait plan/expression can be crashed
(denial of service) by a trivially malformed input.

Tested at pinned commit 16fe34250a2ef261790b9cc414fdf0831669cf9f
(25.0.0-SNAPSHOT).

Root Cause

GetExtensionSetFromMessage<T> switches on the
SimpleExtensionDeclaration.mapping_type_case() oneof selector to pick which
extension kind the declaration carries. It handles the three defined arms
(kExtensionTypeVariation, kExtensionType, kExtensionFunction) and then
falls into a default: arm that calls arrow::Unreachable(). But the oneof also
has the unset state MAPPING_TYPE_NOT_SET (the protobuf default when a producer
simply omits the field), which is attacker-reachable and lands on exactly that
default: arm. Unreachable() is a hard std::abort() — the DCHECK is a
no-op under NDEBUG (release builds), so the only effect on untrusted input is a
process abort instead of a non-OK Status.

Vulnerable code (cpp/src/arrow/engine/substrait/util_internal.h:65-86):

  std::unordered_map<uint32_t, Id> type_ids, function_ids;
  for (const auto& ext : message.extensions()) {
    switch (ext.mapping_type_case()) {
      case substrait::extensions::SimpleExtensionDeclaration::kExtensionTypeVariation: {
        return Status::NotImplemented("Type Variations are not yet implemented");
      }
      case substrait::extensions::SimpleExtensionDeclaration::kExtensionType: {
        const auto& type = ext.extension_type();
        std::string_view uri = uris[type.extension_uri_reference()];
        type_ids[type.type_anchor()] = Id{uri, type.name()};
        break;
      }
      case substrait::extensions::SimpleExtensionDeclaration::kExtensionFunction: {
        const auto& fn = ext.extension_function();
        std::string_view uri = uris[fn.extension_uri_reference()];
        function_ids[fn.function_anchor()] = Id{uri, fn.name()};
        break;
      }
      default:
        Unreachable();   // <-- MAPPING_TYPE_NOT_SET reaches here -> std::abort()
    }
  }

Unreachable() is an unconditional abort (cpp/src/arrow/util/unreachable.cc:26-29):

[[noreturn]] void Unreachable(const char* message) {
  DCHECK(false) << message;   // no-op under NDEBUG
  std::abort();
}

Two public entries reach the same site:

  • DeserializePlan (serde.cc:217) -> GetExtensionSetFromPlan
    (plan_internal.cc:53) -> GetExtensionSetFromMessage<substrait::Plan>.
  • DeserializeExpressions (serde.cc:251) -> FromProto
    (extended_expression_internal.cc:163) -> GetExtensionSetFromExtendedExpression
    (extended_expression_internal.cc:38) ->
    GetExtensionSetFromMessage<substrait::ExtendedExpression>.

PoC

2 bytes — a Plan with one empty SimpleExtensionDeclaration (field 2,
length-delimited, zero length). The empty declaration leaves mapping_type
unset, hitting default: Unreachable().

12 00
# generate_poc.py — re-create the crash input from bytes
poc = bytes([0x12, 0x00])
open("poc.bin", "wb").write(poc)

Crash input size: 2 bytes (poc/poc.bin, md5 5a68de997d60afa9083b17fe00f7cdf2).
The same root cause is also reachable via DeserializeExpressions with an
ExtendedExpression carrying an empty SimpleExtensionDeclaration.

Reproduction

Build Arrow C++ from source with -DARROW_SUBSTRAIT=ON and AddressSanitizer, then deserialize the
attached 2-byte Substrait plan through the public API:

#include <arrow/engine/substrait/serde.h>
// buf = {0x12, 0x00};
auto result = arrow::engine::DeserializePlan(*buf);   // std::abort() here

A SimpleExtensionDeclaration with an unset mapping_type oneof (the protobuf default) falls through
switch (ext.mapping_type_case()) { ... default: Unreachable(); }, and arrow::Unreachable() calls
std::abort() (silent SIGABRT, exit 134):

GetExtensionSetFromMessage<...>  util_internal.h
DeserializePlan                  serde.cc

The default: Unreachable() arm is still present in current master
(cpp/src/arrow/engine/substrait/util_internal.h:82). PoC: 2 bytes 12 00.

Suggested Fix

Replace the default: Unreachable() arm with a returned non-OK Status so the
unset/unknown oneof is rejected gracefully instead of aborting:

--- a/cpp/src/arrow/engine/substrait/util_internal.h
+++ b/cpp/src/arrow/engine/substrait/util_internal.h
@@
-      default:
-        Unreachable();
+      default:
+        return Status::Invalid(
+            "Invalid Substrait SimpleExtensionDeclaration: mapping_type is unset "
+            "or unknown (case ",
+            static_cast<int>(ext.mapping_type_case()), ")");
     }

GetExtensionSetFromMessage already returns Result<ExtensionSet>, so the
Status::Invalid propagates cleanly back through DeserializePlan /
DeserializeExpressions as a non-OK result.

PoC bytes (self-contained)

The trigger input is 2 bytes (poc/poc.bin).
Recreate it exactly with:

base64 -d > poc.bin <<'B64'
EgA=
B64

Hex: 1200

Credit

Aisle Research (Ze Sheng (O2Lab & TAMU), Dmitrijs Trizna, Luigino Camastra, Guido Vranken).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions