-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Load data from multiple files in parallel? #49
Comments
From my experience, the parallel reading on single node doesn't give much advantages. GPFS can transfer a single file with the full single node bandwidth. I think it is also true for a local disk. Probably it may work on PNFS, I didn't try. |
I have to disagree on the GPFS reading performance. One of the reasons for coming up with #32 is the clear benefit of doing even trivial computation per train, i.e. basically just I/O, on a single node. It doesn't scale arbitrarily well, but definitely beyond a few processes. It may be the result of parallel network requests or some other hidden GPFS behaviour distribution the load. Note that these numbers come from loading smallish to medium-sized chunks scattered over several files. From your description it sounds like several indexing steps do exactly this. Maybe it makes sense to expose the API proposed in #32 only internally for situations just like this? |
Looking at the plot in #32, I see that the time for 25 processes is about 10 times less than for 1 process. Since there is no way to read data faster than the maximum network bandwidth, should I conclude that the Infiniband is able to transfer data in one stream using only less than 1/10 of the bandwidth? It's hard to believe. I did the test as I was suggesting. Results are at the end of post. The test was run on node exfl038 with bandwidth 56 Gb/s (4X FDR), total reading data size is 8Gb. In one stream, data is read at a speed of 25.9-29.1 Gb/s (in different test runs) and reaches an optimum of 43-51.6 Gb/s on 64 streams. Thus, maximum possible acceleration is about 1.75 times. I’m not sure that the complication of EXtra-data is worth this acceleration. Perhaps if the implementation is extremely trivial. The scaling of processing may explain the speedup in your tests. Processing scales much better even it is neglectable. I use calculation of average and variance across frames in my tests. In my tests, it scales about 1.7 times each time when doubling the processes up to 32. And since reading takes only 20% on one process, the total time scales well too. There is only one reason for almost ideal scaling of file reading. If you use shared node or connection and have to compete for network bandwidth. A way to start arms race. ;-) You can find my script in
1st run
2nd run
3rd run
|
If we read small data from many files then the performance is far from network bandwidth. Tests done on node max-exfl191. In tests, I read trainId and pulseId arrays (about 780KB, together) from every AGIPD file in 64 processes. Perhaps negotiations longer than actual transfer, but it works. It is strange, actually. In the previous test, I read 4 MB in one reading operation. And that was one chunk in hdf. Indexes are splitted on chunks by 128 bytes. It means thousands reading operation per file. Perhaps this is the answer. If it is the case we need think about data chunking. You can find my script in
|
The latest benchmark (reading small data) is similar to what @tmichela found in #30 - a substantial speedup is possible with more processes, even if it's not necessarily linear. This is a massive improvement for I agree about the pathological chunking - train IDs in raw data are written with one number per HDF5 chunk (although a cluster of 32 chunks seems to get written together). In the case I looked at, fixing this could make it about 10x faster to get the metadata from a non-cached file on GPFS. We raised it with ITDM, but apparently it's not trivial to make the file writing code handle it nicely. [Aside: this almost makes me envy facilities which record to a custom format and then convert the data to HDF5 files afterwards. Writing directly to HDF5 means the file format is a compromise between what makes sense for writing and what makes sense for reading, and of course all the details are determined by people who care a lot about writing it.] Anyway, if the speedup for reading 'real' data isn't so big, maybe this idea isn't such a good one. Or maybe it makes sense only for index-type information rather than big data. Or perhaps we should extend our cache to cover that. |
Why is a proper chunking difficult for writing? Just define chunks in proper size and than write data as it convenient. hdf will care about chunking, will not? |
I don't want to say too much about a system I'm not familiar with, but I think that if they just change the chunk size setting, you end up with 0s at the end of the data. This was the case in old data, and EXtra-data handles it, but it's something they want to avoid. This is not something that HDF5 forces on us, but the net result of various internal API layers involved in writing the files - something grows the files by one chunk at a time, and doesn't truncate them if it stops mid-chunk. |
Actually, I though that you even will not see those 0s at the end. The dataset should contain proper shape excluding meaningless 0s. And additional 0s just take a space. Half-filled chunks in last file in sequence is reasonable price for normal reading performance as for me. |
That's how HDF5 should work, but the internal XFEL APIs that write the file don't tell HDF5 the proper shape. They make the dataset a multiple of the chunk size, so the meaningless 0s are exposed to people reading the files. That got solved by setting the chunk size to 1. I'm not saying this is a good reason. I'm frustrated by it, and I hope it gets fixed soon. I'm just highlighting that we already asked for this to be fixed, and ran into this problem when we did. |
I understand. I just would like to know the current situation and arguments. I'm asking you because you know more than me. Thanks for explanation. |
This topic just came up again. Just loading JUNGFRAU data and doing nothing with it, we can get something like a 6x speedup by splitting it up and loading parts in parallel. Below are results from a modified version of Yohei's test script, loading from I'm wondering about adding a parameter to
Timing scriptimport glob, os, time
import multiprocessing as mp
from extra_data import open_run, RunDirectory
jf_mod_src = 'FXE_XAD_JF1M/DET/JNGFR02:daqOutput'
JungFrau = 'JNGFR02'
expdir = '/gpfs/exfel/exp/FXE/202102/p002536'
runnumber = 'r0072'
run = RunDirectory(expdir+'/proc/'+runnumber)
def sumRuns(run_part):
run_part[jf_mod_src,'data.adc'].ndarray()
if __name__ =='__main__':
print ('##### READ ALL THE DATA (NDARRAY)#####')
start = time.time()
A = run[jf_mod_src,'data.adc'].ndarray()
print (' >>> FINISHED: TIME:'+'{:4.2f}'.format(time.time()-start)+' #####\n')
for N in [4, 8, 12, 16, 20, 24, 48]:
print ('##### READ ALL THE DATA (MULTIPROCESSING:'+str(N)+ ' CORES)#####')
sel = run.select([(jf_mod_src, '*')])
A = sel.split_trains(N)
start = time.time()
with mp.Pool(N) as pool:
Aout = pool.map(sumRuns, A)
print (' >>> FINISHED: TIME:'+'{:4.2f}'.format(time.time()-start)+' #####\n') |
Since #30, EXtra-data can open multiple files in parallel when initially opening a run. But if we then need to pull data from many files - sequence files or detector modules - we still handle these one after the other. I'm noticing this in particular with detector data access - just constructing an object like
DSSC1M
needs to open all the files to get the number of entries recorded for each train, so it can identify gaps in the data. Maybe we could parallelise loading this.Creating the 'virtual overview file' we discussed before would improve this case (because the indexes would be copied into the overview file). But it wouldn't help (much) with loading real data (e.g. pulse IDs): HDF5 would do more of the work, but it would still need to open & read from the separate files.
The text was updated successfully, but these errors were encountered: