Calling parallelproj forward Segmentation fault (core dumped) for LAFOV #1567

Open
christanmod opened this issue Feb 21, 2025 · 2 comments

@christanmod

Hi Kris and all,

I'm encountering "INFO: Calling parallelproj forward" followed by "Segmentation fault (core dumped)" when I run a forward projection for a large scanner.
I am using the CPU version of parallelproj 1.10.1 with STIR 6.2.
The error occurred several minutes after the parallelproj data-structures were created. According to the resource monitor, the task used 274 GB of RAM before it crashed; in total we have 1.5 TB of RAM.
Please see:
1. The gdb output from Debug_release mode:
INFO: Calling parallelproj forward
Thread 1 "forward_project" received signal SIGSEGV, Segmentation fault.
0x00005555557afb3b in stir::ForwardProjectorByBinParallelproj::set_input(stir::DiscretisedDensity<3, float> const&) ()
(gdb) backtrace
#0 0x00005555557afb3b in stir::ForwardProjectorByBinParallelproj::set_input(stir::DiscretisedDensity<3, float> const&) ()
#1 0x00005555556b93d4 in stir::ForwardProjectorByBin::forward_project(stir::ProjData&, stir::DiscretisedDensity<3, float> const&, int, int, bool) ()
#2 0x00005555555af71d in main ()
2. The attached valgrind output (very long):
valgrind_output.log

Please let me know if this information is enough; if not, I can insert cout statements to see what happens inside.
Thank you!

@KrisThielemans
Collaborator

Sadly, there is no line number information in any of your traces, which suggests that you should recompile STIR in RelWithDebInfo as opposed to Release.

Nevertheless, the interesting bit in the valgrind log seems to be

**1164644** new/new[] failed and should throw an exception, but Valgrind
**1164644**    cannot throw exceptions and so is aborting instead.  Sorry.
==1164644==    at 0x484852C: ??? (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1164644==    by 0x4849085: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1164644==    by 0x368D78: stir::detail::ParallelprojHelper::ParallelprojHelper(stir::ProjDataInfo const&, stir::DiscretisedDensity<3, float> const&) (in /home/li/devel/STIR/install/bin/forward_project)
==1164644==    by 0x364068: stir::ForwardProjectorByBinParallelproj::set_up(std::shared_ptr<stir::ProjDataInfo const> const&, std::shared_ptr<stir::DiscretisedDensity<3, float> const> const&) (in /home/li/devel/STIR/install/bin/forward_project)
==1164644==    by 0x1635E2: main (in /home/li/devel/STIR/install/bin/forward_project)

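Confusingly, the gdb and valgrind outputs point to failures in different places.

In ParallelprojHelper, there are few allocations. The only ones that I can see are in STIR/src/recon_buildblock/Parallelproj_projector/ParallelprojHelper.cxx (lines 44 to 46 in a7238f8):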

detail::ParallelprojHelper::ParallelprojHelper(const ProjDataInfo& p_info, const DiscretisedDensity<3, float>& density)
: xstart(p_info.size_all() * 3),
xend(p_info.size_all() * 3)

Of course, these are huge in this case: it implies that we need 2 std::vector<float>, each of size 3 * total_of_elements_in_the_Quadra_sinogram. It's not so surprising to me that this fails, and it "properly" throws an exception (except that valgrind can't handle that).
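To put a number on the size of these two vectors, here is a minimal sketch of the arithmetic. The function names are hypothetical and not part of STIR; num_bins stands in for whatever p_info.size_all() returns, and the sketch assumes a 4-byte float.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

// Bytes needed for the two end-point arrays (xstart and xend),
// each holding 3 floats per sinogram element.
// num_bins is a stand-in for p_info.size_all() (hypothetical helper, not STIR API).
inline std::uint64_t endpoint_bytes(std::uint64_t num_bins)
{
  const std::uint64_t per_bin = 2u * 3u * sizeof(float); // 24 bytes per element
  // Guard against wrap-around of the 64-bit product.
  assert(num_bins <= std::numeric_limits<std::uint64_t>::max() / per_bin);
  return num_bins * per_bin;
}

// Convert a byte count to GiB for easier comparison with a resource monitor.
inline double to_GiB(std::uint64_t bytes)
{
  return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

As a purely illustrative figure (not the actual Quadra count), 10^10 sinogram elements would need 240 GB, about 223.5 GiB, for these two vectors alone.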

For your gdb case, it looks like it did successfully allocate the ParallelprojHelper data structures, but then segfaults in ForwardProjectorByBinParallelproj::set_input. That's of course a large function, so it's hard to know where it segfaults at the moment. Even the non-CUDA version does need another "sinogram-sized allocation" due to a "transpose" between the parallelproj TOF dimension (fastest) and the STIR one (slowest). This extra allocation might be avoidable. But of course, a failed new should have thrown an exception rather than caused a segfault.
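For readers unfamiliar with the "transpose" mentioned above, the following is a minimal sketch of copying between a TOF-fastest layout and a TOF-slowest layout. It is not the STIR implementation: the function name and signature are made up for illustration, and it uses std::size_t for all index arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copy from a TOF-fastest layout, indexed [lor][tof],
// into a TOF-slowest layout, indexed [tof][lor].
// Hypothetical helper for illustration only; it needs a second buffer of
// the same size as the input, i.e. another "sinogram-sized allocation".
inline void tof_transpose_sketch(std::vector<float>& tof_slowest,
                                 const std::vector<float>& tof_fastest,
                                 std::size_t num_lors,
                                 std::size_t num_tof_bins)
{
  assert(tof_fastest.size() == num_lors * num_tof_bins);
  tof_slowest.resize(tof_fastest.size()); // throws std::bad_alloc on failure
  for (std::size_t tof_idx = 0; tof_idx < num_tof_bins; ++tof_idx)
    for (std::size_t lor_idx = 0; lor_idx < num_lors; ++lor_idx)
      tof_slowest[tof_idx * num_lors + lor_idx] = tof_fastest[lor_idx * num_tof_bins + tof_idx];
}
```

With 2 LORs and 3 TOF bins, the input {0, 1, 2, 3, 4, 5} becomes {0, 3, 1, 4, 2, 5}.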

Summary: please recompile such that we get line information. Sorry.

@KrisThielemans KrisThielemans changed the title Calling parallelproj forward Segmentation fault (core dumped) Calling parallelproj forward Segmentation fault (core dumped) for LAFOV Feb 21, 2025
@christanmod
Author

christanmod commented Apr 3, 2025

Hi Kris,

Sorry, I didn't get the notification of your reply.
I recompiled with RelWithDebInfo; please see the new gdb output below.

WARNING: Expected the number of views (50) to be related to the number of detectors per ring (798), but this is not the case. Continuing anyway (but without adjusting the azimuthal angle offset).

INFO: Tbin -16: -354.252 - -332.782 mm (-2363.31 - -2220.08 ps) = 21.4698

INFO: Tbin -15: -332.782 - -311.312 mm (-2220.08 - -2076.85 ps) = 21.4698

INFO: Tbin -14: -311.312 - -289.843 mm (-2076.85 - -1933.62 ps) = 21.4698

INFO: Tbin -13: -289.843 - -268.373 mm (-1933.62 - -1790.39 ps) = 21.4698

INFO: Tbin -12: -268.373 - -246.903 mm (-1790.39 - -1647.16 ps) = 21.4698

INFO: Tbin -11: -246.903 - -225.433 mm (-1647.16 - -1503.93 ps) = 21.4698

INFO: Tbin -10: -225.433 - -203.963 mm (-1503.93 - -1360.7 ps) = 21.4698

INFO: Tbin -9: -203.963 - -182.493 mm (-1360.7 - -1217.47 ps) = 21.4698

INFO: Tbin -8: -182.493 - -161.024 mm (-1217.47 - -1074.23 ps) = 21.4698

INFO: Tbin -7: -161.024 - -139.554 mm (-1074.23 - -931.003 ps) = 21.4698

INFO: Tbin -6: -139.554 - -118.084 mm (-931.003 - -787.772 ps) = 21.4698

INFO: Tbin -5: -118.084 - -96.6142 mm (-787.772 - -644.54 ps) = 21.4698

INFO: Tbin -4: -96.6142 - -75.1444 mm (-644.54 - -501.309 ps) = 21.4698

INFO: Tbin -3: -75.1444 - -53.6745 mm (-501.309 - -358.078 ps) = 21.4698

INFO: Tbin -2: -53.6745 - -32.2047 mm (-358.078 - -214.847 ps) = 21.4698

INFO: Tbin -1: -32.2047 - -10.7349 mm (-214.847 - -71.6156 ps) = 21.4698

INFO: Tbin 0: -10.7349 - 10.7349 mm (-71.6156 - 71.6156 ps) = 21.4698

INFO: Tbin 1: 10.7349 - 32.2047 mm (71.6156 - 214.847 ps) = 21.4698

INFO: Tbin 2: 32.2047 - 53.6745 mm (214.847 - 358.078 ps) = 21.4698

INFO: Tbin 3: 53.6745 - 75.1444 mm (358.078 - 501.309 ps) = 21.4698

INFO: Tbin 4: 75.1444 - 96.6142 mm (501.309 - 644.54 ps) = 21.4698

INFO: Tbin 5: 96.6142 - 118.084 mm (644.54 - 787.772 ps) = 21.4698

INFO: Tbin 6: 118.084 - 139.554 mm (787.772 - 931.003 ps) = 21.4698

INFO: Tbin 7: 139.554 - 161.024 mm (931.003 - 1074.23 ps) = 21.4698

INFO: Tbin 8: 161.024 - 182.493 mm (1074.23 - 1217.47 ps) = 21.4698

INFO: Tbin 9: 182.493 - 203.963 mm (1217.47 - 1360.7 ps) = 21.4698

INFO: Tbin 10: 203.963 - 225.433 mm (1360.7 - 1503.93 ps) = 21.4698

INFO: Tbin 11: 225.433 - 246.903 mm (1503.93 - 1647.16 ps) = 21.4698

INFO: Tbin 12: 246.903 - 268.373 mm (1647.16 - 1790.39 ps) = 21.4698

INFO: Tbin 13: 268.373 - 289.843 mm (1790.39 - 1933.62 ps) = 21.4698

INFO: Tbin 14: 289.843 - 311.312 mm (1933.62 - 2076.85 ps) = 21.4698

INFO: Tbin 15: 311.312 - 332.782 mm (2076.85 - 2220.08 ps) = 21.4698

INFO: Tbin 16: 332.782 - 354.252 mm (2220.08 - 2363.31 ps) = 21.4698

INFO: Creating parallelproj data-structures
[Thread 0x7fffe3acf640 (LWP 24362) exited]
[Thread 0x7fffe4ad1640 (LWP 24360) exited]
[Thread 0x7fffe52d2640 (LWP 24359) exited]
[Thread 0x7fffe82d8640 (LWP 24353) exited]
[Thread 0x7fffe5ad3640 (LWP 24358) exited]
[Thread 0x7fffe9adb640 (LWP 24350) exited]
[Thread 0x7fffebadf640 (LWP 24346) exited]
[Thread 0x7fffe32ce640 (LWP 24363) exited]
[Thread 0x7fffe42d0640 (LWP 24361) exited]
[Thread 0x7fffe62d4640 (LWP 24357) exited]
[Thread 0x7fffe6ad5640 (LWP 24356) exited]
[Thread 0x7fffe72d6640 (LWP 24355) exited]
[Thread 0x7fffe7ad7640 (LWP 24354) exited]
[Thread 0x7fffe92da640 (LWP 24351) exited]
[Thread 0x7fffea2dc640 (LWP 24349) exited]
[Thread 0x7fffeaadd640 (LWP 24348) exited]
[Thread 0x7fffeb2de640 (LWP 24347) exited]
[Thread 0x7fffec2e0640 (LWP 24345) exited]
[Thread 0x7fffecae1640 (LWP 24344) exited]
[Thread 0x7fffe8ad9640 (LWP 24352) exited]

INFO: done
[New Thread 0x7fffe8ad9640 (LWP 24410)]
[New Thread 0x7fffeb2de640 (LWP 24411)]
[New Thread 0x7fffe62d4640 (LWP 24412)]
[New Thread 0x7fffe7ad7640 (LWP 24413)]
[New Thread 0x7fffecae1640 (LWP 24414)]
[New Thread 0x7fffec2e0640 (LWP 24415)]
[New Thread 0x7fffebadf640 (LWP 24416)]
[New Thread 0x7fffeaadd640 (LWP 24417)]
[New Thread 0x7fffea2dc640 (LWP 24418)]
[New Thread 0x7fffe9adb640 (LWP 24419)]
[New Thread 0x7fffe92da640 (LWP 24420)]
[New Thread 0x7fffe82d8640 (LWP 24421)]
[New Thread 0x7fffe72d6640 (LWP 24422)]
[New Thread 0x7fffe6ad5640 (LWP 24423)]
[New Thread 0x7fffe5ad3640 (LWP 24424)]
[New Thread 0x7fffe52d2640 (LWP 24425)]
[New Thread 0x7fffe4ad1640 (LWP 24426)]
[New Thread 0x7fffe42d0640 (LWP 24427)]
[New Thread 0x7fffe3acf640 (LWP 24428)]
[New Thread 0x7fffe32ce640 (LWP 24429)]

INFO: Calling parallelproj forward

Thread 1 "forward_project" received signal SIGSEGV, Segmentation fault.
0x00005555557664fb in stir::TOF_transpose (offset=0, num_lors_per_chunk=<optimized out>, _helper=..., mem_for_PP=..., STIR_mem=0x7ffc92e8c010) at /home/li/devel/STIR/STIR/src/recon_buildblock/Parallelproj_projector/ForwardProjectorByBinParallelproj.cxx:138
138 STIR_mem[offset + tof_idx * _helper->num_lors + lor_idx] = mem_for_PP[lor_idx * num_tof_bins + tof_idx];

Please let me know if more information is needed.
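One thing that may be worth checking for the indexing expression on line 138 is how large the flat index can get. The sketch below computes the largest index in 64-bit arithmetic and tests whether it would still fit in a signed 32-bit integer. Whether any variable on that line is actually 32-bit in STIR is not established in this thread, so this is only a diagnostic aid; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest flat index touched by an expression of the form
//   STIR_mem[offset + tof_idx * num_lors + lor_idx]
// for tof_idx in [0, num_tof_bins) and lor_idx in [0, num_lors),
// computed in 64-bit arithmetic. Hypothetical helper, not STIR API.
inline std::uint64_t max_flat_index(std::uint64_t offset,
                                    std::uint64_t num_lors,
                                    std::uint64_t num_tof_bins)
{
  assert(num_lors > 0 && num_tof_bins > 0);
  return offset + (num_tof_bins - 1) * num_lors + (num_lors - 1);
}

// True if the index can be represented by a signed 32-bit integer.
inline bool fits_in_int32(std::uint64_t index)
{
  return index <= static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}
```

The log above lists 33 TOF bins (Tbin -16 to 16). With an assumed, purely illustrative 100,000,000 LORs in a chunk, the largest index would be 3,299,999,999, which does not fit in a signed 32-bit integer.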
