add static size information for hvcat and hvncat #58422

Open · wants to merge 3 commits into master

Conversation

johnnychen94
Member

@johnnychen94 johnnychen94 commented May 15, 2025

When creating an array using scalar numbers, the array initialization method is currently the fastest way. That is,

A = Matrix{Float64}(undef, 2, 2)
A[1, 1] = x
A[2, 1] = y
A[1, 2] = x
A[2, 2] = y

is faster than the concatenation method [x x; y y]. This PR aims to close that performance gap.

Optimization overview:

  • Static shape information is now preserved during syntax parsing, which improves performance when concatenating scalar values into 2D or higher-dimensional arrays.
  • Manual loop unrolling is applied for small matrices to further reduce the overhead.
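
To illustrate the first point, here is a toy version of the idea (a simplified sketch; the actual implementation in this PR is more general): once the row layout is encoded in a Val type, a generated function can emit the fully unrolled store sequence at compile time.

```julia
# Toy sketch only: `hvcat_toy` is illustrative and not the PR's actual API.
# The row layout (e.g. (2, 2) for [x x; y y]) is carried in the type domain,
# so the loops below run at code-generation time and the emitted method body
# is a straight-line sequence of stores.
@generated function hvcat_toy(::Val{rows}, xs::T...) where {rows, T<:Number}
    nrow = length(rows)
    ncol = rows[1]
    all(==(ncol), rows) || error("all rows must have equal length")
    stmts = Expr[:(A = Matrix{T}(undef, $nrow, $ncol))]
    k = 0
    for i in 1:nrow, j in 1:ncol   # scalar arguments arrive in row-major order
        k += 1
        push!(stmts, :(A[$i, $j] = xs[$k]))
    end
    push!(stmts, :(return A))
    return Expr(:block, stmts...)
end
```

For example, hvcat_toy(Val((2, 2)), 1.0, 2.0, 3.0, 4.0) builds the same matrix as [1.0 2.0; 3.0 4.0] with no runtime shape bookkeeping.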

Key results:

  • For small matrices, the performance of array concatenation using [] is now comparable to that of direct matrix initialization. This resolves a previously significant performance gap.
  • For large matrices, concatenation shows substantial performance improvements. However, it still lags behind direct initialization in terms of speed.

Future work:

  • Noticeable performance differences remain for large matrices.

Benchmark performance

The very first version of this change was implemented against the 1.10.9 release branch, so I've also included the 1.10-dev numbers.
Notice the performance differences between the initialization method and the concatenation method.

[x x; y y] — 2D Matrix Creation

Same Data Type

using BenchmarkTools

f(x, y) = [x x; y y]
@btime f(x, y) setup=(x = rand(); y = rand())

2×2 Matrix

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 18.021 | 37.258 |
| 1.10.9 | 18.634 | 263.918 |
| 1.10-dev | 19.660 | 19.749 |
| 1.12.0-beta3 | 7.690 | 31.913 |
| 1.13-dev | 8.644 | 8.793 |

6×6 Matrix

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 30.865 | 144.490 |
| 1.10.9 | 32.856 | 2198 |
| 1.10-dev | 32.278 | 31.363 |
| 1.12.0-beta3 | 19.315 | 129.225 |
| 1.13-dev | 18.553 | 984.615 |

Mixed Data Types

f(x, y) = [x x; y y]
@btime f(x, y) setup=(x = rand(); y = rand(1:10))

2×2 Matrix

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 17.794 | 72.243 |
| 1.10.9 | 18.587 | 844.879 |
| 1.10-dev | 19.467 | 19.394 |
| 1.12.0-beta3 | 8.066 | 72.766 |
| 1.13-dev | 8.812 | 8.554 |

6×6 Matrix

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 35.994 | 11284 |
| 1.10.9 | 36.748 | 23835 |
| 1.10-dev | 32.352 | 646.955 |
| 1.12.0-beta3 | 27.590 | 9873 |
| 1.13-dev | 29.428 | 7662 |

[x x;;; y y] — Higher-Dimensional Array Creation

Same Data Type

g(x, y) = [x x;;; y y]
@btime g(x, y) setup=(x = rand(); y = rand())

1×2×2 Array

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 19.754 | 112.885 |
| 1.10.9 | 20.175 | 108.488 |
| 1.10-dev | 21.437 | 19.700 |
| 1.12.0-beta3 | 10.646 | 183.057 |
| 1.13-dev | 9.732 | 10.775 |

3×3×3 Array

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 43.327 | 445.046 |
| 1.10.9 | 44.832 | 473.566 |
| 1.10-dev | 44.633 | 48.476 |
| 1.12.0-beta3 | 19.072 | 496.601 |
| 1.13-dev | 17.061 | 38.536 |

Mixed Data Types

g(x, y) = [x x;;; y y]
@btime g(x, y) setup=(x = rand(); y = rand(1:10))

1×2×2 Array

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 19.192 | 147.851 |
| 1.10.9 | 20.126 | 147.297 |
| 1.10-dev | 20.719 | 19.509 |
| 1.12.0-beta3 | 24.837 | 270.547 |
| 1.13-dev | 19.595 | 10.131 |

3×3×3 Array

| Julia version | Initialization (ns) | Concatenation (ns) |
| --- | --- | --- |
| 1.9.3 | 44.267 | 753.900 |
| 1.10.9 | 43.800 | 750.264 |
| 1.10-dev | 43.620 | 363.485 |
| 1.12.0-beta3 | 47.225 | 837.426 |
| 1.13-dev | 42.404 | 402.965 |
Benchmark test script:
using BenchmarkTools

function hvcat_large_baseline(x, y)
    A = Array{Float64, 2}(undef, 6, 6)
    for i in 1:6
        for j in 1:6
            A[i, j] = i == j ? x : y
        end
    end
    return A
end
hvcat_large(x, y) = [
    x y y y y y
    y x y y y y
    y y x y y y
    y y y x y y
    y y y y x y
    y y y y y x
]

function hvcat_small_baseline(x, y)
    A = Array{Float64, 2}(undef, 2, 2)
    for i in 1:2
        for j in 1:2
            A[i, j] = i == j ? x : y
        end
    end
    return A
end
hvcat_small(x, y) = [
    x y
    y x
]

function hvncat_large_baseline(x, y)
    A = Array{Float64, 3}(undef, 3, 3, 3)
    for i in 1:3
        for j in 1:3
            for k in 1:3
                A[i, j, k] = i == j == k ? x : y
            end
        end
    end
    return A
end
hvncat_large(x, y) = [
    x y y; y y y; y y y;;;
    y y y; y x y; y y y;;;
    y y y; y y y; y y x;;;
]

function hvncat_small_baseline(x, y)
    A = Array{Float64, 3}(undef, 2, 2, 2)
    for i in 1:2
        for j in 1:2
            for k in 1:2
                A[i, j, k] = i == j == k ? x : y
            end
        end
    end
    return A
end
hvncat_small(x, y) = [
    x y; y y;;;
    y y; y x;;;
]

@assert hvcat_large_baseline(1.0, 2.0) == hvcat_large(1.0, 2.0)
@assert hvcat_small_baseline(1.0, 2.0) == hvcat_small(1.0, 2.0)
@assert hvncat_large_baseline(1.0, 2.0) == hvncat_large(1.0, 2.0)
@assert hvncat_small_baseline(1.0, 2.0) == hvncat_small(1.0, 2.0)

@assert hvcat_large_baseline(1.0, 2) == hvcat_large(1.0, 2)
@assert hvcat_small_baseline(1.0, 2) == hvcat_small(1.0, 2)
@assert hvncat_large_baseline(1.0, 2) == hvncat_large(1.0, 2)
@assert hvncat_small_baseline(1.0, 2) == hvncat_small(1.0, 2)

@info "hvcat, large, same type, array initialization"
@btime hvcat_large_baseline(x, y) setup=(x = rand(); y = rand())
@info "hvcat, large, same type, concatenation"
@btime hvcat_large(x, y) setup=(x = rand(); y = rand())
@info "hvcat, large, different type, array initialization"
@btime hvcat_large_baseline(x, y) setup=(x = rand(); y = rand(1:10))
@info "hvcat, large, different type, concatenation"
@btime hvcat_large(x, y) setup=(x = rand(); y = rand(1:10))

@info "hvcat, small, same type, array initialization"
@btime hvcat_small_baseline(x, y) setup=(x = rand(); y = rand())
@info "hvcat, small, same type, concatenation"
@btime hvcat_small(x, y) setup=(x = rand(); y = rand())
@info "hvcat, small, different type, array initialization"
@btime hvcat_small_baseline(x, y) setup=(x = rand(); y = rand(1:10))
@info "hvcat, small, different type, concatenation"
@btime hvcat_small(x, y) setup=(x = rand(); y = rand(1:10))

@info "hvncat, large, same type, array initialization"
@btime hvncat_large_baseline(x, y) setup=(x = rand(); y = rand())
@info "hvncat, large, same type, concatenation"
@btime hvncat_large(x, y) setup=(x = rand(); y = rand())
@info "hvncat, large, different type, array initialization"
@btime hvncat_large_baseline(x, y) setup=(x = rand(); y = rand(1:10))
@info "hvncat, large, different type, concatenation"
@btime hvncat_large(x, y) setup=(x = rand(); y = rand(1:10))

@info "hvncat, small, same type, array initialization"
@btime hvncat_small_baseline(x, y) setup=(x = rand(); y = rand())
@info "hvncat, small, same type, concatenation"
@btime hvncat_small(x, y) setup=(x = rand(); y = rand())
@info "hvncat, small, different type, array initialization"
@btime hvncat_small_baseline(x, y) setup=(x = rand(); y = rand(1:10))
@info "hvncat, small, different type, concatenation"
@btime hvncat_small(x, y) setup=(x = rand(); y = rand(1:10))

fixes #58397

@johnnychen94 johnnychen94 marked this pull request as ready for review May 16, 2025 10:02
@johnnychen94
Member Author

@nanosoldier runtests()

@johnnychen94
Member Author

johnnychen94 commented May 16, 2025

I'm still not very happy with the performance for large matrix creation when the element types are different:

using BenchmarkTools

f(x, y) = [
    x y y y y y
    y x y y y y
    y y x y y y
    y y y x y y
    y y y y x y
    y y y y y x
]

@btime f(x, y) setup=(x = rand(); y = rand(1:10))

It takes 7 μs on my machine (and I'm expecting ~70ns)

But this looks more like a Julia compiler issue related to tuple splatting (xs...), so I've decided to leave it out of scope for this PR.
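
In the meantime, a user-level workaround is to promote the scalars to a common type before they reach the vararg entry point, so the compiler only ever sees the homogeneous fast path (a sketch; promote_then_hvcat is a hypothetical helper, not part of this PR):

```julia
# Hypothetical helper: promote first, then concatenate, so the vararg call
# receives a homogeneous tuple and dispatches to the fast T<:Number method.
promote_then_hvcat(rows::Tuple{Vararg{Int}}, xs::Number...) = hvcat(rows, promote(xs...)...)
```

With x::Float64 and y::Int, promote_then_hvcat((2, 2), x, x, y, y) then produces a Matrix{Float64} without ever entering the heterogeneous splatting path.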

@nanosoldier
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

1 package crashed only on the current version.

  • A segmentation fault happened: 1 package

4 packages crashed on the previous version too.

✖ Packages that failed

532 packages failed only on the current version.

  • Package fails to precompile: 439 packages
  • Illegal method overwrites during precompilation: 7 packages
  • Package has test failures: 4 packages
  • Package tests unexpectedly errored: 11 packages
  • Networking-related issues were detected: 36 packages
  • Tests became inactive: 1 package
  • Test duration exceeded the time limit: 34 packages

1459 packages failed on the previous version too.

✔ Packages that passed tests

17 packages passed tests only on the current version.

  • Other: 17 packages

4910 packages passed tests on the previous version too.

~ Packages that at least loaded

3 packages successfully loaded only on the current version.

  • Other: 3 packages

2715 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

1 package was skipped only on the current version.

  • Package could not be installed: 1 package

922 packages were skipped on the previous version too.

This commit removes the generated function's dependency on IdSet,
because IdSet is not yet defined during the bootstrap stages. By doing
this, we can avoid world-age issues and use @generated functions
in abstractarray.jl.
This significantly improves the performance of multi-dimensional
array creation from scalar numbers via
[a b; c d], [a b;; c d] and the typed forms T[a b; c d], T[a b;; c d].

For small numeric array creation (length <= 16), manual loop unrolling
is used to further minimize the overhead; it now has zero
overhead and is as fast as the array initialization method.
Comment on lines +2313 to +2315
@inline function hvcat_static(::Val{rows}, x::T, xs::Vararg{T}) where {rows, T<:Number}
    typed_hvcat_static(T, Val{rows}(), x, xs...)
end
Member Author

@johnnychen94 johnnychen94 commented May 18, 2025

The only regression happens here:

using BenchmarkTools

f(x, y) = [
    x y y y y y
    y x y y y y
    y y x y y y
    y y y x y y
    y y y y x y
    y y y y y x
]

@btime f(x, y) setup=(x = rand(); y = rand())

It takes 984.615 ns on my local laptop. In comparison, it takes only 129.225 ns on Julia 1.12.

But this can be fixed if we duplicate the generated function definition:

@generated function hvcat_static(::Val{rows}, xs::T...) where {T<:Number, rows}
    # just copied everything from the typed_hvcat_static generated function's implementation
end

By applying this fix, we get 24.995 ns.

The only problems with this fix are:

  1. This introduces an ambiguity issue caused by xs::T.... That forces us to write hvcat_static(::Val{rows}, x::T, xs::Vararg{T}) rather than hvcat_static(::Val{rows}, xs::Vararg{T}).
  2. This duplicates the code, and I have almost no idea why this duplication works.
  3. Doing this doesn't fix the performance for mixed element types, i.e., @btime f(x, y) setup=(x = rand(); y = rand(1:10)) (#58422 (comment))

Thus, I think the "fix" isn't elegant enough, so I did not commit it for this review version. If we find it necessary, I can add it later.

@johnnychen94
Member Author

@nanosoldier runtests(["MethodInspector", "PiecewiseIncreasingRanges", "WeaklyHard", "Skipper", "ConvexHulls2d", "ITerm2Images", "LevelSetMethods", "Reactive", "BlockMatching", "RegionTrees", "GridInterpolations", "Measurements", "DataFlowTasks", "HybridArrays", "QuadPol", "Uncertain", "OccurrencesInterface", "Rotations", "Interpolations", "AtmosphericProfilesLibrary", "Lowess", "Curves", "EQDSKReader", "NumericalIntegration", "AbstractImageReconstruction", "PeriodicGraphs", "SkyCoords", "PhasedArray", "XTermColors", "HydrophoneCalibrations", "DiskArrayTools", "Insolation", "NamedTrajectories", "ExampleJuggler", "IsopycnalSurfaces", "ManifoldDiff", "ABCDMatrixOptics", "GeoInterfaceMakie", "TimeseriesFeatures", "GeoTIFF", "OceanGrids", "CharacteristicInvFourier", "TuePlots", "ImageBinarization", "ChebyshevFiltering", "LazIO", "DensityRatioEstimation", "FaradayInternationalReferenceIonosphere", "DomainIntegrals", "ImageTransformations", "ImageMagick", "Sinograms", "ImageContrastAdjustment", "QuantumPropagators", "Manopt", "VisionHeatmaps", "SMLMMetrics", "SatelliteToolboxTransformations", "GeostatInversion", "AlgebraPDF", "SatelliteToolboxGeomagneticField", "AtomicData", "Transparency", "SatelliteToolboxGravityModels", "Ludwig", "RobotZoo", "Phylopic", "Epicrop", "ModelParameters", "PhysicalParticles", "SeparableFunctions", "SmoothedParticles", "Reproject", "GasDispersion", "GridUtilities", "PhysicalTrees", "ImageSmooth", "GraphRecipes", "DataInterpolations", "LimberJack", "MultiCDF", "CTFlows", "ShallowWaters", "OpenQuantumBase", "CartesianGrids", "MicroscopePSFs", "KomaMRIBase", "SatelliteToolbox", "Vlasiator", "ExtendableGrids", "KernelDensity", "PseudoPotentialIO", "CherenkovMediumBase", "Tonari", "NonParametricNORTA", "Jolab", "Coulter", "FreezeCurves", "BasicTextRender", "SimulatedAnnealingABC", "F16Model", "BivariateCopulas", "BifrostTools", "MonteCarloMarkovKernels", "ProbabilityBoundsAnalysis", "QPGreen", "HCIToolbox", "KomaMRIFiles", "WaterModels", 
"SyntheticEddyMethod", "ClimaAnalysis", "BHAtp", "GlobalSensitivityAnalysis", "Transits", "SimplexGridFactory", "OptimizationProblems", "Muspel", "CoherentTransformations", "ParetoSmooth", "ImageEdgeDetection", "AlphaStableDistributions", "EasyFit", "GeoParams", "NaiveBayes", "AdvancedMH", "FinancialMonteCarlo", "TimeseriesSurrogates", "SMLMSim", "FillOutliers", "Augmentor", "SixelTerm", "SimulationLogs", "FourierTools", "DataAugmentation", "EFIT", "DiffEqBase", "Maxnet", "TransferFunctions", "MLJNaiveBayesInterface", "SkyImages", "SliceSampling", "GridVisualize", "Altro", "StochasticGroundMotionSimulation", "GenomicOffsets", "SimpleDiffEq", "EchelleInstruments", "ContinuousWavelets", "TestImages", "Gadfly", "SatelliteAnalysis", "StateSpaceDynamics", "KomaMRICore", "Algames", "SodShockTube", "AeroFuse", "SeisReconstruction", "KernelDensityEstimatePlotting", "TaylorIntegration", "FourierFlows", "SeisProcessing", "Jadex", "PossibilisticArithmetic", "HCGeoTherm", "Isoplot", "ProfileLikelihood", "ImageQualityIndexes", "SteadyStateDiffEq", "SIMIlluminationPatterns", "Clapeyron", "ConstrainedSystems", "CoordGridTransforms", "SMLMFrameConnection", "CTDirect", "DiffEqPhysics", "PhysicalMeshes", "PowerWaterModels", "Pigeons", "MedImages", "QuantumCollocation", "RigorousCoupledWaveAnalysis", "ParticleInCell", "TinyGibbs", "MPIMeasurements", "PsychometricsBazaarBase", "BoltzmannMachinesPlots", "CausalityTools", "Empirikos", "PsychometricsBazzarBase", "TransitionsInTimeseries", "FMIZoo", "GeophysicalFlows", "BioLab", "KDEstimation", "PsychomotorVigilanceTask", "TimeseriesTools", "PhysicalFFT", "ThinFilmsTools", "Batsrus", "SOLPS2imas", "MRISimulation", "BOSS", "MCMCChains", "PointCloudRasterizers", "NestedSamplers", "DeconvOptim", "Associations", "MCMCTempering", "JustRelax", "BayesQR", "KomaMRIPlots", "FittedItemBanks", "QRDecoders", "GrayCoding", "PlasmaEquilibriumToolkit", "MIRT", "PhysicalFDM", "ProjectManagement", "OmniParserIconDetectors", "SetIntersectionProjection", 
"WorldOceanAtlasTools", "Sunny", "MCMCChainsStorage", "OpenQuantumTools", "DeepFry", "PiecewiseDeterministicMarkovProcesses", "DifferentialEvolutionMCMC", "MembraneEOS", "SpmGrids", "AstroPropagators", "OrdinaryDiffEqBDF", "SPHtoGrid", "OpticalFibers", "Crispulator", "ProteinEnsembles", "DynACof", "InstrumentOperator", "MPISphericalHarmonics", "YAXArrays", "SpecialExponentialFamilies", "DifferentiableCollisions", "HarmonicPowerModels", "QuantitativeSusceptibilityMappingTGV", "MeshCatMechanisms", "PSSFSS", "ROMEO", "TinnitusReconstructor", "DrillHoles", "DiffEqUncertainty", "ComputerAdaptiveTesting", "JuliaBUGS", "BioFindr", "SimSpin", "DistributedStwdLDA", "VIDA", "ClimateBase", "NonstationaryProcessesBase", "Photometry", "NonstationaryProcesses", "PotentialFlow", "BLAKJac", "GlobalSensitivity", "Jags", "GridPotentialFlow", "ComputationalHeatTransfer", "Extremes", "VortexStepMethod", "BEASTDataPrep", "BAT", "ImmersedLayers", "NetworkHawkesProcesses", "PawsomeTracker", "EnsembleKalmanProcesses", "SorptionModels", "Mimi", "MendelPlots", "MimiFAIRv1_6_2", "IDFCurves", "Mimi_NAS_pH", "MimiMooreEtAlAgricultureImpacts", "EFTfitter", "MimiSNEASY", "ViscousFlow", "RIrtWrappers", "CameraCalibrations", "UncertainData", "MimiPAGE2020", "MatterPower", "ReefModEngine", "MimiSSPs", "UVITTools", "DynamicMovementPrimitives", "DelayDiffEq", "AlgamesDriving", "ApproxMasterEqs", "MimiRFFSPs", "SensitivityRankCondition", "SciMLExpectations", "FNCFunctions", "StellaratorOptimizationMetrics", "ILMPostProcessing", "RandomFeatures", "GCIdentifier", "DeviceLayout", "GradientRobustMultiPhysics", "BloqadeDormandPrince", "Ditherings", "Regions", "Eikonal", "Images", "NBodySimulator", "RvSpectML", "MimiCIAM", "HoloProcessing", "ODEHybrid", "VMEC", "GigaScatter", "PlotPlants", "ImageTracking", "ColorSchemeTools", "ComplexPhasePortrait", "ArgoData", "GeneralizedSDistributions", "ScatteringOptics", "ImageFeatures", "EarthDataLab", "ObjectDetector", "SpaSM", "ImageUtils", "EMpht", 
"AbstractPlotting", "NEOs", "BLASBenchmarksCPU", "StatisticalRethinking", "AvrutinSearch", "ExtendableFEM", "Unfold", "SafetySignalDetection", "UnfoldStats", "ElectroPhysiology", "Trebuchet", "Fronts", "LongwaveModePropagator", "RayTraceHeatTransfer", "DynamicalSystems", "UnderwaterAcoustics", "Luna", "VirtualAcousticOcean", "AuditoryStimuli", "ClimateTools", "OceanRobots", "OtsuThresholding", "Tapestree", "AcousticsToolbox", "PixelArt", "Taproots", "StarFormationHistories", "AIBECS", "WebToys", "MultiScaleArrays", "MintsMakieRecipes", "GpABC", "DiffusionGarnet", "SimpleDistributionPowerFlow", "ImageColorThresholderApp", "Cropbox", "MakieRichText", "SwarmMakie", "CatmullClark", "InteractiveViz", "TernaryDiagrams", "ClimatePlots", "SeisMakie", "GeometryPrimitives", "Escher", "ActionModels", "StatsPlots", "SimpleCrop", "KomaMRI", "Foresight", "CirculatorySystemModels", "Biplots", "LWFBrook90", "MakieDraw", "LeafGasExchange", "MaxwellBase", "PlotShapefiles", "ITensorNumericalAnalysis", "ProbabilisticEchoInversion", "Sherlock", "CellMLToolkit", "RPRMakie", "StochasticDelayDiffEq", "GasChem", "CornerPlotting", "RFFMakie", "HierarchicalGaussianFiltering", "GraphMakie", "Makie", "ItemResponsePlots", "MaxwellFDFD", "UnfoldBIDS", "AbstractGPsMakie", "FisherPlot", "BloqadeWaveforms", "FastGaussianProcesses", "PlantViz", "GeophysicalModelGenerator", "PairPlots", "Garlic", "MakiePublication", "OpenQuantumSystems", "NestedGraphMakie", "FerriteViz", "InPartSMakie", "OptimizedEinsum", "VlasiatorMakie", "PlantGraphs", "BloqadeKrylov", "PeriodicMatrices", "MakieTeX", "GeoMakie", "PlantGeomTurtle", "ElectrochemicalKinetics", "MaxwellGuidedMode", "SpectralStatistics", "DataDrivenSparse", "DataDrivenDMD", "LowRankIntegrators", "ClapeyronHANNA", "CitableImage", "MapMakie", "AeroBeams", "FrequencySweep", "InformationInequalities", "CitablePhysicalText", "EigenShow", "TestParticleMakie", "VectorModesolver", "RealTimeAudioDiffEq", "AcousticRayTracers", "TopoPlots", "InverseStatMech", 
"Drifters", "MetaAnalysis", "ThesisArt", "HarmonicPowerFlow", "IndividualDisplacements", "SurfaceCoverage", "RankCompV3", "VisClaw", "BatchReactor", "SignedDistanceField", "Supernovae", "MTKHelpers", "IsoME", "AoGExtensions", "HighFrequencyCovariance", "MethodOfLines", "Chron", "VirtualPlantLab", "SSMPlots", "Lighthouse", "JointSurvivalModels", "ModiaLang", "SubsidenceChron", "SignalTablesInterface_CairoMakie", "Repotomata", "WellPlates", "TrajectoryGamesExamples", "MeshViz", "SymbolicInference", "TrajectoryGamesBase", "Survey", "CalibrateEmulateSample", "EditorsRepo", "AgentsPlots", "ModiaPlot_CairoMakie", "ClimaCoreMakie", "ObservableCortex", "Tortuosity", "MCHammer", "GenomicDiversity", "ModiaPlot_GLMakie", "MRIgeneralizedBloch", "CollectiveSpins", "Jchemo", "ReactionDiffusion", "MCPhylo", "PhysicsInformedRegression", "SignalTablesInterface_WGLMakie", "RainMaker", "ControlBarrierFunctions", "MESTI", "GridapMakie", "UnfoldDecode", "SimulatedNeuralMoments", "MimiBRICK", "IMAS", "FusionSyntheticDiagnostics", "GeoIO", "SPEDAS", "DataDrivenEnzymeRateEqs", "HeartRateVariability", "SphericalClusterMass", "Tyler", "ITensorMakie", "SOLPS2ctrl", "MimiGIVE", "StatisticalRethinkingPlots", "CropRootBox", "SimulationBasedCalibration", "TensegrityEquilibria", "TidierPlots", "InferenceReport", "Coalescent", "QuantumDynamics", "Biofilm", "MRINavigator", "MathepiaModels", "BaryPlots", "HmtGutenberg", "BloqadeODE", "AlgorithmicCompetition", "UnfoldMakie", "BloqadeGates", "CarnotCycles", "MiseEnPage", "BloqadeNoisy", "Tidier", "Controlz", "MagNav"])

@nanosoldier
Collaborator

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

✖ Packages that failed

9 packages failed only on the current version.

  • Package fails to precompile: 1 package
  • Package has test failures: 1 package
  • Test duration exceeded the time limit: 7 packages

56 packages failed on the previous version too.

✔ Packages that passed tests

6 packages passed tests only on the current version.

  • Other: 6 packages

273 packages passed tests on the previous version too.

~ Packages that at least loaded

2 packages successfully loaded only on the current version.

  • Other: 2 packages

185 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

1 package was skipped on the previous version too.

@johnnychen94
Member Author

PkgEval and CI tests look good; I'll keep this PR as-is until feedback comes in.

@mbauman mbauman removed their request for review May 20, 2025 18:06
@mbauman
Member

mbauman commented May 20, 2025

This kinda scares me with respect to TTFX and over specialization and static compilability — especially for such a fundamental syntax. I don't think I can meaningfully review it beyond spreading this FUD; maybe someone closer to static compilation could evaluate its impacts?

@johnnychen94
Member Author

johnnychen94 commented May 21, 2025

TTFX is a good point. Here's a small experiment I've done for 2×2 hvcat:

@generated function many_cats_in_one_function(::Val{N}) where N
    exprs = Expr[]
    push!(exprs, :(x = [1 2; 3 4]))
    for i in 1:N
        push!(exprs, :(x = [(x[1, 1] + 1) (x[2, 2] + 1); (x[1, 2] + 1) (x[2, 1] + 1)]))
    end
    return Expr(:block, exprs...)
end

function many_cats_recursive(n)
    if n == 0
        return [1 2; 3 4]
    else
        x = many_cats_recursive(n - 1)
        return [(x[1, 1] + 1) (x[2, 2] + 1); (x[1, 2] + 1) (x[2, 1] + 1)]
    end
end

@info "Testing many_cats_recursive"
@time "1" @eval many_cats_recursive(1) # only the first run is the real TTFX
@time "5" @eval many_cats_recursive(5) # all subsequent calls are real runtime
@time "10" @eval many_cats_recursive(10)
@time "20" @eval many_cats_recursive(20)
@time "50" @eval many_cats_recursive(50)
@time "100" @eval many_cats_recursive(100)
@time "200" @eval many_cats_recursive(200)
@time "1000" @eval many_cats_recursive(1000)

@info "Testing many_cats_in_one_function"
@time "1" @eval many_cats_in_one_function(Val{1}())
@time "5" @eval many_cats_in_one_function(Val{5}())
@time "10" @eval many_cats_in_one_function(Val{10}())
@time "20" @eval many_cats_in_one_function(Val{20}())
@time "50" @eval many_cats_in_one_function(Val{50}())
@time "100" @eval many_cats_in_one_function(Val{100}())
@time "200" @eval many_cats_in_one_function(Val{200}())
@time "1000" @eval many_cats_in_one_function(Val{1000}())

Based on the data, we get:

  • Only when there are over 50 [x x; y y] assignments in a single function body do we see a TTFX difference, and that is a relatively rare and extreme use case.
  • When there are only a few [x x; y y] assignments, the TTFX differences are almost negligible.
Times are in seconds, as reported by @time.

| # of Cats | 1.13-dev Recursive | 1.13-dev One Function | 1.12.0 Recursive | 1.12.0 One Function |
| --- | --- | --- | --- | --- |
| 1 | 0.011366 | 0.018347 | 0.024991 | 0.020136 |
| 2 | 0.000125 | 0.011275 | 0.000126 | 0.010211 |
| 3 | 0.000128 | 0.097539 | 0.000271 | 0.014777 |
| 4 | 0.000133 | 0.015556 | 0.000132 | 0.016847 |
| 5 | 0.000165 | 0.019281 | 0.000125 | 0.018541 |
| 10 | 0.000128 | 0.031956 | 0.000136 | 0.028788 |
| 20 | 0.000114 | 0.065612 | 0.000160 | 0.054726 |
| 50 | 0.000117 | 0.155401 | 0.000128 | 0.149340 |
| 100 | 0.000120 | 0.406060 | 0.000128 | 0.289541 |
| 200 | 0.000126 | 0.992100 | 0.000132 | 0.710165 |
| 1000 | 0.000160 | 17.622069 | 0.000191 | 4.349432 |

The above experiment is about the 2×2 hvcat case, where the manual unrolling is applied.

The following is about the 6×6 hvcat case, where we call the generic hvcat_fill! function.
It suggests that TTFX is actually getting better. (The slowdown for the recursive version is because of #58422 (comment).)

| # of Cats | 1.13-dev Recursive | 1.13-dev One Function | 1.12.0 Recursive | 1.12.0 One Function |
| --- | --- | --- | --- | --- |
| 1 | 0.060155 | 0.025877 | 0.156673 | 0.030071 |
| 2 | 0.000133 | 0.109858 | 0.000129 | 0.040648 |
| 3 | 0.000134 | 0.038697 | 0.000121 | 0.052605 |
| 4 | 0.000220 | 0.049555 | 0.000127 | 0.068721 |
| 5 | 0.000144 | 0.057182 | 0.000130 | 0.094925 |
| 10 | 0.000146 | 0.097877 | 0.000130 | 0.195860 |
| 20 | 0.000139 | 0.220829 | 0.000129 | 0.302984 |
| 50 | 0.000170 | 0.538179 | 0.000137 | 0.731944 |
| 100 | 0.000212 | 1.062964 | 0.000133 | 1.599463 |
| 200 | 0.000270 | 2.127295 | 0.000170 | 3.146528 |
| 1000 | 0.000966 | 14.939002 | 0.000476 | 21.660305 |

@@ -2276,11 +2276,13 @@
(if (any (lambda (row) (any vararg? row)) rows)
`(call ,@hvcat_rows ,@(map (lambda (x) `(tuple ,@x)) rows))
`(call ,@hvcat
(tuple ,@(map length rows))
(new (curly (top Val) (call (core tuple) ,@(map length rows))))
Member Author

I see an inconsistency when checking the result of @which:

julia> @which [2 2; 3 3]
hvcat(rows::Tuple{Vararg{Int64}}, xs::T...) where T<:Number
     @ Base abstractarray.jl:2203

julia> Meta.@lower [2 2; 3 3]
:($(Expr(:thunk, CodeInfo(
1 ─ %1 =   builtin Core.tuple(2, 2)
│   %2 =   builtin Core.apply_type(Base.Val, %1)
│   %3 = %new(%2)
│   %4 =   dynamic Base.hvcat_static(%3, 2, 2, 3, 3)
└──      return %4
))))

Is it because I need to update JuliaSyntax as well?

@adienes
Member

adienes commented May 21, 2025

Is this related to / does this address #43844?

@mcabbott
Contributor

Can you explain a bit where this matters? Do you have some examples of performance-sensitive code which allocates small matrices from numbers?

In my head at least, hvcat / hvncat are great ways to make small matrices in examples / tests. And for REPL / Test, we should care a lot about TTFX and little about throughput.

That said I am surprised how widely the benchmarks above vary between versions.

@oscardssmith
Member

One place where this would have been nice is SciML/OrdinaryDiffEq.jl#2712. Constructing a matrix of literal values seems like the sort of thing that should be basically free, so it's somewhat unfortunate when it's expensive.

@johnnychen94
Member Author

Can you explain a bit where this matters?

The following is a typical pattern when one wants to solve a problem iteratively (example generated from ChatGPT); timings first, code below:

| version | time with static arrays (s) | time with normal arrays (s) |
| --- | --- | --- |
| 1.9 | 2.42 | 21.19 |
| 1.10 | 2.68 | 43.82 |
| 1.12 | 2.35 | 27.49 |
| 1.13-dev | 2.53 | 16.2 |

We identified issue #58397 when switching from Julia 1.9 to Julia 1.10 for our internal project, written by people who only know MATLAB and barely know Julia.
In their world, there is only one array type.

In our case, we fixed it by teaching them how to use StaticArrays, but it's often the case that only power users of Julia know about StaticArrays and when to use them...

And I won't be surprised if there are cases in the wild where array sizes are not deterministic like this one.

using LinearAlgebra

function lorenz_matrix(u)
    a, b = u[1,1], u[1,2]
    c, d = u[2,1], u[2,2]

    da = 10.0*(b - a)
    db = a*(28.0 - d) - b - 0.1*b*d
    dc = c*(b - c) - 0.1*c^2
    dd = a*c - (8/3)*d - 0.1*d^2

    return [da db; dc dd]
end

function euler_integrate(f, u0, tspan; dt=0.01, max_dt=0.1, min_dt=1e-6)
    t0, tf = tspan
    t = t0
    u = copy(u0)

    history = [(t, copy(u))]

    while t < tf
        # Compute derivative
        du = f(u)

        # Adaptive time stepping
        scale = norm(du) + 1e-10  # Prevent division by zero
        curr_dt = min(max_dt, max(min_dt, 0.1/scale))
        curr_dt = min(curr_dt, tf - t)  # Don't overshoot final time

        # Euler step
        u += curr_dt * du
        t += curr_dt

        push!(history, (t, copy(u)))
    end

    return history
end

# Initial condition
u0 = [1.0  0.01;
      0.01 0.0]

# Integrate
tspan = (0.0, 50.0)
history = @time euler_integrate(lorenz_matrix, u0, tspan)

@johnnychen94
Member Author

is this related to / does this address #43844 ?

@adienes Kind of, except there isn't much room for optimizing the 1D cases [a, b, c] or [a; b; c]: the performance of the array initialization method and the concatenation method is almost identical there.

This PR intends to fix the performance gap for the 2D and higher-dimensional cases, and it applies to Number element types only.
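
For reference, the 1D form already lowers to Base.vect, which is roughly equivalent to the following simplified sketch; there is no shape bookkeeping left to optimize away:

```julia
# A simplified sketch of what Base.vect does for [a, b, c]:
# one type promotion, one allocation, one linear fill.
function vect_sketch(xs...)
    T = promote_type(map(typeof, xs)...)
    v = Vector{T}(undef, length(xs))
    for (i, x) in enumerate(xs)
        v[i] = x
    end
    return v
end
```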

@nsajko nsajko added the arrays label May 27, 2025
@BioTurboNick
Contributor

I don't have a strong opinion either way, but two questions that may guide decisions:

How does the time compare with GC? If you're at the point where small arrays are a major bottleneck, is eliminating the repeated array allocation usually better than the gain from this?

What are the trade-offs vs. a macro that would just expand the concat syntax into an elementwise fill?
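
For concreteness, such a macro could look roughly like this (a hypothetical sketch; @unroll_hvcat and its promotion strategy are illustrative, not an existing API):

```julia
# Hypothetical macro: expand a literal [a b; c d] expression into explicit
# elementwise stores at macro-expansion time. Not part of this PR.
macro unroll_hvcat(ex)
    (ex isa Expr && ex.head === :vcat) || error("expected a literal [a b; c d] expression")
    rows = ex.args
    all(r -> r isa Expr && r.head === :row, rows) || error("expected a 2D literal")
    nrow = length(rows)
    ncol = length(rows[1].args)
    all(r -> length(r.args) == ncol, rows) || error("all rows must have the same length")
    tmps = [gensym(:x) for _ in 1:nrow * ncol]   # evaluate each element exactly once
    stmts = Expr[]
    k = 0
    for i in 1:nrow, j in 1:ncol                 # elements are written in row-major order
        k += 1
        push!(stmts, :($(tmps[k]) = $(esc(rows[i].args[j]))))
    end
    push!(stmts, :(A = Matrix{promote_type(map(typeof, ($(tmps...),))...)}(undef, $nrow, $ncol)))
    k = 0
    for i in 1:nrow, j in 1:ncol
        k += 1
        push!(stmts, :(A[$i, $j] = $(tmps[k])))
    end
    push!(stmts, :A)
    return Expr(:block, stmts...)
end
```

@unroll_hvcat [x y; y x] then expands to explicit stores with no hvcat call at all; the trade-off is that it only works on literal expressions and must be applied manually at every call site.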

@johnnychen94
Member Author

johnnychen94 commented May 27, 2025

How does the time compare with GC? If you're at the point where small arrays are a major bottleneck, is eliminating the repeated array allocation usually better than the gain from this?

What are the trade-offs vs. a macro that would just expand the concat syntax into an elementwise fill?

@BioTurboNick I'm very much aware that there are many workarounds to this performance issue: reusing the allocated memory, rewriting manually or via macros, or adopting StaticArrays. They are all doable in a real project.

The real reason I chose to provide a fix here at the language level is that I find it extremely awkward to explain this as a performance tip.
Imagine that you're writing a performance tip about this:

## Use the []-syntax sugar only for convenience

Although relatively cheap, creating arrays using the `[]` syntactic sugar introduces some extra overhead.
If performance is critical, it is recommended to create the array using the initialization pattern:

```julia
A = Array{Float64, 2}(undef, 2, 2)
A[1,1] = ...
...
```

or use `@unroll [x x; y y]`.

For a numerical computing language, this performance tip reads more like an excuse for something we should have fixed but didn't.

So if I were to sort the priorities, ensuring that our users have reasonable and consistent performance expectations of the []-syntax is more important than worrying about latency.
Not to mention that this is not a marginal difference, but a 10x one.

@BioTurboNick
Contributor

My question comes from my expectation that ever allocating an array on the heap in Julia is expensive and should be avoided in any performance-critical code.

@johnnychen94
Member Author

johnnychen94 commented May 27, 2025

After a quick chat with @timholy on Slack, let me summarize the PR:
this is a smart fix for the []-syntax performance gotcha, but the over-specialization might get out of control.
And since this is the kind of change whose impact is hard to predict concretely, more benchmarks are needed.

Here's what I plan to do next:

  • Do a more thorough benchmark to see if there are TTFX regressions in a real-world project.
  • Try to reduce the number of specializations: the specialization on rows is necessary if we want to reach a better performance. But the specializations on T<:Number eltype aren't.
  • Try to fix the regression discussed in #58422 (comment).
  • This at least breaks Zygote AD; a PR to Zygote/ChainRules is needed.
MWE that breaks Zygote AD
using Zygote

f(x, y) = [
    (x + y) (x - y);
    (x * y) (x / y)
]
g(x, y) = hvcat((2, 2), x + y, x - y, x * y, x / y)

@assert f(0.5, 1.5) == g(0.5, 1.5)


df(x, y) = gradient((_x, _y)->sum(f(_x, _y)), x, y)
dg(x, y) = gradient((_x, _y)->sum(g(_x, _y)), x, y)

df(0.5, 1.5) # errors with mutation
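
The forward mapping is just a row-major reshuffle of the scalar arguments into a column-major array, so the missing rule mostly needs the inverse mapping on the cotangent. A self-contained sketch of the 2D pullback (the real rule would live in ChainRules.jl, be attached via ChainRulesCore.rrule, and cover the n-dimensional case):

```julia
# For y = hvcat(rows, xs...), argument xs[k] lands at the k-th row-major
# position of y, so the pullback reads the cotangent ȳ back in row-major
# order. 2D-only sketch; the name is illustrative.
hvcat_pullback_2d(ȳ::AbstractMatrix) = Tuple(vec(permutedims(ȳ)))
```

Wrapped in an rrule for the new entry point, this would restore Zygote differentiability for the [x x; y y] form without mutation.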

@johnnychen94
Member Author

johnnychen94 commented May 27, 2025

My question comes from my expectation that ever allocating an array on the heap in Julia is expensive and should be avoided in any performance-critical code.

@BioTurboNick This is very true, and we can even replace Array with tuples (which is what StaticArrays does under the hood).
However, having multiple workarounds doesn't necessarily mean the issue doesn't exist or that people wouldn't benefit from a fix.
There are a large number of naive Julia users who use arrays without knowing any of the memory details behind them.
Every MATLAB user is a potential Julia user :P

We just need to implement the fix with minimal impact, in a less controversial way.


Successfully merging this pull request may close these issues.

the performance of array construction from numbers: SparseArrays piracy and hvcat overhead
8 participants