Hi,
I am still getting used to the library, but I was able to isolate an unexpected performance hit. I want to update just a subregion of a pre-allocated 1D tensor. Is there a better pattern to achieve the same result?
#include <chrono>
#include <cstddef>
#include <iostream>
#include <xtensor/xrandom.hpp>
#include <xtensor/xtensor.hpp>
#include <xtensor/xview.hpp>

double mean_milliseconds_from_total(std::chrono::nanoseconds total,
                                    size_t num_repeats) {
    std::chrono::duration<double, std::milli> total_ms = total;
    return total_ms.count() / (double)num_repeats;
}

int main() {
    size_t num_repeats = 100;
    xt::xtensor<double, 1> a = xt::random::rand<double>({10000000});
    xt::xtensor<double, 1> b = xt::random::rand<double>({10000000});
    xt::xtensor<double, 1> c = xt::zeros<double>({10000000});

    // case 1: full tensor
    auto started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        c = a + b;
    auto finished = std::chrono::high_resolution_clock::now();
    std::cout << "elapsed time: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 2: view of tensor with xt::all()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::view(c, xt::all()) = xt::view(a + b, xt::all());
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "elapsed time: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    // case 3: view of tensor with xt::range()
    started = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < num_repeats; ++i)
        xt::view(c, xt::range(0, c.size())) =
            xt::view(a + b, xt::range(0, c.size()));
    finished = std::chrono::high_resolution_clock::now();
    std::cout << "elapsed time: "
              << mean_milliseconds_from_total(finished - started, num_repeats)
              << "ms" << std::endl;

    return 0;
}

Result:
elapsed time: 8.00238ms
elapsed time: 31.0913ms
elapsed time: 30.9484ms
I understand that introducing views may add some overhead, but for essentially the same task (same memory layout, contiguous memory, same range, step size of one), a slowdown of nearly 4x seems large. Is this expected behavior, or am I doing something wrong?
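To make the intended operation concrete, here is a minimal sketch of the subregion update written as a plain loop over contiguous storage. It uses std::vector instead of xtensor, and the helper name add_subregion and its parameters are hypothetical, chosen only for illustration. I have not benchmarked it; it only shows the element-wise work I would expect a contiguous view assignment to reduce to.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of xtensor): writes a[i] + b[i] into c[i]
// for every i in [first, last) and leaves the rest of c untouched.
// Assumes first <= last and last <= size of a, b, and c.
void add_subregion(const std::vector<double>& a,
                   const std::vector<double>& b,
                   std::vector<double>& c,
                   std::size_t first,
                   std::size_t last) {
    const auto lo = static_cast<std::ptrdiff_t>(first);
    const auto hi = static_cast<std::ptrdiff_t>(last);
    // Element-wise sum over the contiguous subrange, stored in place in c.
    std::transform(a.begin() + lo, a.begin() + hi,
                   b.begin() + lo,
                   c.begin() + lo,
                   std::plus<double>());
}
```

Calling add_subregion(a, b, c, 0, c.size()) would cover the same range as cases 2 and 3 above, without creating a temporary for a + b.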
Thanks.
Versions:
- xtl v0.7.5
- xtensor v0.24.7
- Apple clang version 14.0.3 (clang-1403.0.22.14.1)