Commit
Minor README updates.
terrynsun committed Sep 8, 2015
1 parent 40dd226 commit ea79aaa
Showing 1 changed file (README.md) with 16 additions and 6 deletions.
@@ -14,18 +14,28 @@ Terry Sun; Arch Linux, Intel i5-4670, GTX 750

![](images/nbody_perf_plot.png)

The graph shows time taken (in ms) to update one frame at block sizes from 16
to 1024 in steps of 8, for various values of N (planets in the system).

I measured performance by disabling visualization and using `CudaEvent`s to time
the kernel invocations (measuring the time elapsed for both `kernUpdateVelPos`
and `kernUpdateAcc`). The recorded value is an average over 100 frames.

Code for performance measuring can be found on the `performance` branch.
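The shape of that measurement loop can be sketched in plain Python (a sketch only: the actual code on the `performance` branch uses CUDA events, and `update_frame` here is a hypothetical stand-in for the two kernels):

```python
import time

def update_frame(n):
    # Hypothetical stand-in for one simulation step
    # (kernUpdateAcc followed by kernUpdateVelPos in the real code).
    s = 0.0
    for i in range(n):
        s += i * 0.5
    return s

def time_per_frame(n, frames=100):
    """Average wall-clock time (ms) per frame over `frames` updates,
    mirroring the average-over-100-frames measurement above."""
    start = time.perf_counter()
    for _ in range(frames):
        update_frame(n)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms / frames
```

Averaging over many frames smooths out per-launch jitter, which matters when a single kernel invocation takes well under a millisecond.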

Changing the number of planets, as expected, increases the time elapsed for the
kernels, due to a for-loop in the acceleration calculation (whose cost grows
linearly with the total number of planets in the system). More interestingly, it
also changes the way that performance reacts to block size (see N=4096 in the
above plot). The difference in performance as block size changes is much greater
at larger N, and also exhibits different behaviors.
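The per-body loop described above is the standard all-pairs gravity update, which makes each frame O(N^2) overall. A brute-force sketch (not the repository's kernel code; `G` and the softening term `eps` are illustrative choices):

```python
import math

def accelerations(positions, masses, G=6.674e-11, eps=1e-9):
    """Brute-force all-pairs gravity in 2D: the inner loop over every
    other body is the O(N) for-loop per body, O(N^2) per frame total."""
    n = len(positions)
    acc = [(0.0, 0.0)] * n
    for i in range(n):
        ax = ay = 0.0
        xi, yi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            r2 = dx * dx + dy * dy + eps  # softening avoids divide-by-zero
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            ax += G * masses[j] * dx * inv_r3
            ay += G * masses[j] * dy * inv_r3
        acc[i] = (ax, ay)
    return acc
```

On the GPU each body gets its own thread, so the O(N) inner loop runs in parallel across bodies, but total work per frame still grows with N.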

At certain block sizes, the time per frame sharply decreases, such as at N=4096
with block sizes of 1024, 512, 256, and 128. These are points where each block is
fully saturated (i.e., no unneeded threads are launched).
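The saturation condition can be checked arithmetically: with ceil(N / blockSize) blocks launched, the number of idle threads is blocks * blockSize - N, which is zero exactly when the block size divides N (a quick check, not the repository's launch code):

```python
def idle_threads(n, block_size):
    """Threads launched but unused; zero exactly when block_size divides n."""
    blocks = (n + block_size - 1) // block_size  # ceil(n / block_size)
    return blocks * block_size - n

# For N=4096, the block sizes named above leave every block fully used:
full = [bs for bs in (128, 256, 512, 1024) if idle_threads(4096, bs) == 0]
# full == [128, 256, 512, 1024]
```

At a block size just above one of these divisors, an extra mostly-empty block is launched, which is consistent with the sharp jumps in the plot.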

I have no explanation for the spikes peaking around N=4096, block size ≈ 800,
or N=3072, block size ≈ 600.

# Part2: An Even More Basic Matrix Library

@@ -41,4 +51,4 @@ host), which are also linear time operations.
However, matrix multiplication is an O(n^{1.5}) operation on a CPU (for n total
matrix elements) and becomes an O(n) operation on a GPU (O(3n) after taking the
2x memory copy into account). So I would expect multiplication to exhibit much
better performance on the GPU (except on very small matrices).
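Reading n as the total number of matrix elements makes the CPU exponent concrete: an s-by-s matrix has n = s^2 entries, and naive multiplication performs s^3 = n^{1.5} multiply-adds (a quick arithmetic check, not code from the repository):

```python
def naive_matmul_ops(s):
    """Multiply-add count for naive s-by-s matrix multiplication."""
    return s ** 3

s = 64
n = s * s                                     # total matrix elements
assert naive_matmul_ops(s) == int(n ** 1.5)   # s^3 == n^{1.5}
```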
