
Conversation

@scottwittenburg

Implement an aggregation method inspired by the paper "Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems?" (link).

Currently only works with DataSizeBased aggregation.

Does not yet do any actual re-routing; it only re-implements writing using a new multithreaded messaging scheme based on the paper above.

Adds some missing functionality to adiosComm (a Probe API and support for MPI_ANY_SOURCE).

Break out the bits of InitTransports that generate the filename
and open the file, and put them into a new function that can be
called either with or without an MPI communicator.

This supports the rerouting aggregator, currently under development,
by allowing rerouted ranks to change their target subfile near the
end of the writing process, after opening a different file at the
beginning.
Just trying to get a threaded communication scheme going, without
the actual rerouting, to start (just have every group write to the
expected file).

Add some API to adiosComm to support Probe/Iprobe and make
sure it works by using it in the test.
The cause of the race condition was that rank 0, possibly due to its
extra pre-write duties, would arrive late and send its WRITE_SUBMISSION
msg after the SC had already finished writing and exited the comm
loop.

Since all ranks generate the complete partitioning information, we
can make it available so that SCs can pre-populate their writer
queues from it and do away with the WRITE_SUBMISSION message
altogether. This reduces the total number of messages needed and
avoids the race condition that caused intermittent hangs of the
test.
* Before leaving the comm loop, the SC must update its m_DataPos from the
variable that has been tracking it for its group; otherwise it shares
the wrong file positions with the other ranks. This was causing all
tests with more than one timestep to fail.

* When the SC comm thread first starts up, it needs to set the current
file position variable from m_DataPos, which is updated correctly
in append mode. Not doing so was causing tests that append to fail.
Avoid the situation with non-blocking sends where the buffer containing
the data to be sent is destroyed before the send completes.
- One certain issue was storing globalState using subcoordinator rank
IDs rather than the indices of the groups they coordinate. Fix that by
adding a mapping from rank ID to global state index.
- When rerouting happens on time step zero, opening without append
could result in a zero block appearing where the rerouted rank wrote.
Fix that by allowing forceAppend to be specified when rerouted ranks
open their files.
This could happen when the global coordinator re-routed another rank
to its own subfile. Because the update of m_DataPos from the local
tracking variable, currentFilePos, happened upon receipt of
the GROUP_CLOSE message, and the global coordinator does not send
itself that message, it wasn't capturing the actual file offset to
be shared with the other ranks later in EndStep().

This moves the variable update to the end of the thread method,
just before returning, so that all subcoordinators do it, regardless
of whether they are also the global coordinator.