diff --git a/README.md b/README.md
index 98dd9a8..1d9d7c8 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,36 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Austin Eng
+* Tested on: Windows 10, i7-4770K @ 3.50GHz 16GB, GTX 780 3072MB (Personal Computer)
 
-### (TODO: Your README)
+# Flocking
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](images/boids.gif)
+
+Above is a simulation of 50,000 boids on the GPU, using a coherent uniform grid to accelerate the neighbor search. Without visualization this runs at 850+ fps. Half a million agents can be simulated at 60+ fps.
+
+# Performance Analysis
+
+## Varying Boid Count
+**Tests were done using a block size of 128. A CUDA event timer was used to measure the number of milliseconds elapsed computing each frame. The average time over 1000 frames was recorded.** Command line arguments were added in the [profiling branch](https://github.com/austinEng/Project1-CUDA-Flocking/tree/profiling) to make it easier to run the program and vary the boid count.
+
+![](images/particleCount_vs_msframe.png)
+
+We can see that using a uniform grid greatly improves performance, and that adding coherence improves it further. This makes sense: the uniform grid significantly reduces the number of boids each boid must be checked against, and coherence makes memory access much faster. It is interesting to note, however, that the curves for all three methods appear to be piecewise: each grows at one rate (note the log scale) and then suddenly jumps and grows at a different rate. Below is a graph at finer resolution exploring this phenomenon more deeply.
+
+![](images/particleCount_vs_msframe_fine.png)
+
+Note the large jumps at about (5300, 5500), (16400, 16500), (31200, 31300), and (43600, 43700). I suspect this happens because, at certain boid counts, the problem size no longer maps well onto the underlying architecture and memory access becomes less efficient.
+
+
+## Varying Block Size
+**Tests were done using 5000 boids. A CUDA event timer was used to measure the number of milliseconds elapsed computing each frame. The average time over 1000 frames was recorded.**
+
+![](images/blockSize_vs_relative_msframe.png)
+
+Note that the above graph shows the **relative** number of milliseconds per frame, that is, the ratio of each elapsed time to the lowest elapsed time for that method. This normalizes the results so that we can better compare how each method is affected by block size. All three methods show significant slowdowns at block sizes of 32 or less. This is likely because at 32 or fewer threads per block, each block holds only a single warp (group of 32 threads). Smaller blocks mean we need many more blocks, and with every warp in its own block we lose the benefits of shared memory within a block while paying the scheduling and resource cost of all those extra blocks.
+
+It's worth noting that the coherent uniform grid saw these performance hits beginning at a block size of 32, while the other methods were only impacted at 16. This may be because the coherent grid benefits more from shared memory access.
+
+All methods also show slightly decreased performance as the block size grows from 128 toward the maximum of 1024. Each multiprocessor on the GPU I tested on can have at most 2048 active threads, so larger blocks mean fewer resident blocks: at a block size of 1024, only two blocks fit per multiprocessor. Threads in a block are organized in warps (groups of 32 threads), and only a limited number of warps can issue instructions in a given cycle. It therefore seems that in trying to let more threads share the same memory by increasing the block size, we've made the execution of the threads slightly more serialized.
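+
+For reference, here is a minimal sketch of how such a per-frame measurement can be taken with CUDA events (the step function shown is whichever method is being measured):
+
+```cpp
+cudaEvent_t start, stop;
+cudaEventCreate(&start);
+cudaEventCreate(&stop);
+
+cudaEventRecord(start);
+Boids::stepSimulationCoherentGrid(DT);   // one simulation step
+cudaEventRecord(stop);
+
+cudaEventSynchronize(stop);              // block until the step finishes
+float ms = 0.f;
+cudaEventElapsedTime(&ms, start, stop);  // milliseconds for this frame
+```
+
+Each reported value is this measurement averaged over 1000 frames.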
diff --git a/images/blockSize_vs_relative_msframe.png b/images/blockSize_vs_relative_msframe.png
new file mode 100644
index 0000000..178d62c
Binary files /dev/null and b/images/blockSize_vs_relative_msframe.png differ
diff --git a/images/boids.gif b/images/boids.gif
new file mode 100644
index 0000000..83a41c5
Binary files /dev/null and b/images/boids.gif differ
diff --git a/images/particleCount_vs_msframe.png b/images/particleCount_vs_msframe.png
new file mode 100644
index 0000000..ca0b1e3
Binary files /dev/null and b/images/particleCount_vs_msframe.png differ
diff --git a/images/particleCount_vs_msframe_fine.png b/images/particleCount_vs_msframe_fine.png
new file mode 100644
index 0000000..0d4b28a
Binary files /dev/null and b/images/particleCount_vs_msframe_fine.png differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..f22098a 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_35
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index 30356b9..2790dd0 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -85,6 +85,7 @@ int *dev_gridCellEndIndices; // to this cell?
 // TODO-2.3 - consider what additional buffers you might need to reshuffle
 // the position and velocity data to be coherent within cells.
+glm::vec3 *dev_pos2; // ping-pong buffer for positions reshuffled into cell-coherent order
 
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
@@ -162,6 +163,7 @@ void Boids::initSimulation(int N) {
   gridSideCount = 2 * halfSideCount;
 
   gridCellCount = gridSideCount * gridSideCount * gridSideCount;
+  printf("%i CELLS\n", gridCellCount);
   gridInverseCellWidth = 1.0f / gridCellWidth;
   float halfGridWidth = gridCellWidth * halfSideCount;
   gridMinimum.x -= halfGridWidth;
@@ -169,6 +171,21 @@ void Boids::initSimulation(int N) {
   gridMinimum.z -= halfGridWidth;
 
   // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!");
+
   cudaThreadSynchronize();
 }
 
@@ -230,10 +247,33 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * in the `pos` and `vel` arrays.
 */
__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+  glm::vec3 posSelf = pos[iSelf];
+
+  glm::vec3 cmass(0.f, 0.f, 0.f);      // perceived center of mass (rule 1)
+  int cmassCount = 0;
+  glm::vec3 collisionV(0.f, 0.f, 0.f); // separation impulse (rule 2)
+  glm::vec3 cvel(0.f, 0.f, 0.f);       // perceived average velocity (rule 3)
+  int cvelCount = 0;
+
+  for (int i = 0; i < N; ++i) {
+    if (i == iSelf) {
+      continue;
+    }
+    float dist = glm::distance(pos[i], posSelf);
+    // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+    if (dist < rule1Distance) {
+      cmass += pos[i];
+      cmassCount++;
+    }
+    // Rule 2: boids try to stay a distance d away from each other
+    if (dist < rule2Distance) {
+      collisionV -= pos[i] - posSelf;
+    }
+    // Rule 3: boids try to match the speed of surrounding boids
+    if (dist < rule3Distance) {
+      cvel += vel[i];
+      cvelCount++;
+    }
+  }
+  cmass = cmassCount > 0 ? (cmass / (float)cmassCount) - posSelf : cmass;
+  cvel = cvelCount > 0 ? cvel / (float)cvelCount : cvel;
+  return cmass * rule1Scale + collisionV * rule2Scale + cvel * rule3Scale;
 }
 
 /**
@@ -242,9 +282,19 @@ __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *po
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
   // Compute a new velocity based on pos and vel1
+  glm::vec3 vel = vel1[index] + computeVelocityChange(N, index, pos, vel1);
+  // Clamp the speed (without normalizing a possibly zero-length vector)
+  float speed = glm::length(vel);
+  if (speed > maxSpeed) {
+    vel = vel * (maxSpeed / speed);
+  }
+  // Record the new velocity into vel2. Question: why NOT vel1?
+  // (we don't want to mutate velocities that other threads are still reading)
+  vel2[index] = vel;
 }
 
 /**
@@ -285,10 +335,23 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
 __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
-  // TODO-2.1
-  // - Label each boid with the index of its grid cell.
-  // - Set up a parallel array of integer indices as pointers to the actual
-  //   boid data in pos and vel1/vel2
+  // TODO-2.1
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
+  // - Label each boid with the index of its grid cell.
+  glm::ivec3 index3D = (pos[index] - gridMin) * inverseCellWidth;
+  gridIndices[index] = gridIndex3Dto1D(index3D.x, index3D.y, index3D.z, gridResolution);
+
+  // - Set up a parallel array of integer indices as pointers to the actual
+  //   boid data in pos and vel1/vel2
+  indices[index] = index;
 }
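+
+// For intuition, assuming gridIndex3Dto1D uses the x-major layout
+// x + y * res + z * res * res (as in the starter code above): with
+// gridResolution = 10, cell (3, 4, 5) maps to 3 + 40 + 500 = 543, and
+// cells differing only in x are adjacent in the 1D layout -- worth keeping
+// in mind when choosing a traversal order for the coherent search below.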
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
 //          does not enclose any boids
@@ -303,9 +366,51 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   int *gridCellStartIndices, int *gridCellEndIndices) {
   // TODO-2.1
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
   // Identify the start point of each cell in the gridIndices array.
   // This is basically a parallel unrolling of a loop that goes
   // "this index doesn't match the one before it, must be a new cell!"
+  int gIdx = particleGridIndices[index];
+  if (index == 0) {
+    // the first sorted boid always begins its cell
+    gridCellStartIndices[gIdx] = 0;
+  } else {
+    int gIdxPrev = particleGridIndices[index - 1];
+    if (gIdxPrev != gIdx) {
+      gridCellEndIndices[gIdxPrev] = gridCellStartIndices[gIdx] = index;
+    }
+  }
+  if (index == N - 1) {
+    // the last sorted boid always ends its cell (end indices are exclusive)
+    gridCellEndIndices[gIdx] = N;
+  }
+}
+
+__device__ void getGridIndexAndQuadrantMask(glm::vec3 pos, glm::vec3 gridMin,
+  float inverseCellWidth, glm::ivec3 &gIdx, unsigned char &mask) {
+  glm::vec3 partialIndex3D = (pos - gridMin) * inverseCellWidth;
+  gIdx = partialIndex3D;
+  partialIndex3D = partialIndex3D - glm::floor(partialIndex3D);
+
+  // set one bit per axis when the boid sits in the upper half of its cell
+  // (bitwise |, not logical ||, so each comparison keeps its own bit)
+  mask = ((partialIndex3D.x >= 0.5f) << 0) |
+         ((partialIndex3D.y >= 0.5f) << 1) |
+         ((partialIndex3D.z >= 0.5f) << 2);
+}
+
+__device__ int getNeighborCells(int neighborCells[8], const glm::ivec3 &gIdx, const unsigned char &mask, int gridResolution) {
+  int nCells = 0;
+  for (int k = -1; k <= 1; ++k) {
+    for (int j = -1; j <= 1; ++j) {
+      for (int i = -1; i <= 1; ++i) {
+        // use the quadrant mask to skip offsets pointing away from the boid
+        if (!(
+          (i == -1 && (mask & 0x1)) || (i == 1 && !(mask & 0x1)) ||
+          (j == -1 && (mask & 0x2)) || (j == 1 && !(mask & 0x2)) ||
+          (k == -1 && (mask & 0x4)) || (k == 1 && !(mask & 0x4))
+          )) {
+          glm::ivec3 idx = gIdx + glm::ivec3(i, j, k);
+          // clip cells that fall outside the grid
+          if (!glm::any(glm::lessThan(idx, glm::ivec3(0, 0, 0))) &&
+              !glm::any(glm::greaterThanEqual(idx, glm::ivec3(gridResolution, gridResolution, gridResolution)))) {
+            neighborCells[nCells++] = gridIndex3Dto1D(idx.x, idx.y, idx.z, gridResolution);
+          }
+        }
+      }
+    }
+  }
+  return nCells;
+}
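+
+// Worked example: with cell width = 2x the max rule distance (how the grid
+// is sized in initSimulation), a boid's neighborhood can only spill into the
+// half of the surrounding cells it is closest to. A boid at fractional cell
+// position (0.7, 0.2, 0.6) gets mask = 1 | 0 | 4 = 5, so getNeighborCells
+// keeps i in {0, 1}, j in {-1, 0}, k in {0, 1}: at most 8 of the 27
+// surrounding cells, fewer at the grid boundary.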
 
__global__ void kernUpdateVelNeighborSearchScattered(
  int N, int gridResolution, glm::vec3 gridMin,
  float inverseCellWidth, float cellWidth,
  int *gridCellStartIndices, int *gridCellEndIndices,
  int *particleArrayIndices,
  glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
  // the number of boids that need to be checked.
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
  // - Identify the grid cell that this particle is in
+  glm::ivec3 index3D;
+  unsigned char mask;
+  getGridIndexAndQuadrantMask(pos[index], gridMin, inverseCellWidth, index3D, mask);
+
  // - Identify which cells may contain neighbors. This isn't always 8.
+  int neighborCells[8];
+  int nCells = getNeighborCells(neighborCells, index3D, mask, gridResolution);
+
+  glm::vec3 posSelf = pos[index];
+  glm::vec3 cmass(0.f, 0.f, 0.f);
+  int cmassCount = 0;
+  glm::vec3 collisionV(0.f, 0.f, 0.f);
+  glm::vec3 cvel(0.f, 0.f, 0.f);
+  int cvelCount = 0;
+
+  for (int c = 0; c < nCells; ++c) {
+    // - For each cell, read the start/end indices in the boid pointer array.
+    int n = neighborCells[c];
+    int start = gridCellStartIndices[n];
+    int end = gridCellEndIndices[n];
+    if (start < 0) {
+      continue; // this cell encloses no boids
+    }
+
+    for (int b = start; b < end; ++b) {
+      // - Access each boid in the cell and compute velocity change from
+      //   the boids rules, if this boid is within the neighborhood distance.
+      int particleIndex = particleArrayIndices[b];
+      if (particleIndex == index) continue;
+
+      float dist = glm::distance(pos[particleIndex], posSelf);
+      // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+      if (dist < rule1Distance) {
+        cmass += pos[particleIndex];
+        cmassCount++;
+      }
+      // Rule 2: boids try to stay a distance d away from each other
+      if (dist < rule2Distance) {
+        collisionV -= pos[particleIndex] - posSelf;
+      }
+      // Rule 3: boids try to match the speed of surrounding boids
+      if (dist < rule3Distance) {
+        cvel += vel1[particleIndex];
+        cvelCount++;
+      }
+    }
+  }
+
+  cmass = cmassCount > 0 ? (cmass / (float)cmassCount) - posSelf : cmass;
+  cvel = cvelCount > 0 ? cvel / (float)cvelCount : cvel;
+
+  glm::vec3 vel = vel1[index] + cmass * rule1Scale + collisionV * rule2Scale + cvel * rule3Scale;
+  // - Clamp the speed before putting the new velocity in vel2
+  float speed = glm::length(vel);
+  if (speed > maxSpeed) {
+    vel = vel * (maxSpeed / speed);
+  }
+  vel2[index] = vel;
+}
+
+__global__ void kernShuffleParticleBuffers(int N,
+  glm::vec3* srcPos, glm::vec3* srcVel,
+  int* tgtIndices, glm::vec3* tgtPos, glm::vec3* tgtVel) {
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+  tgtPos[index] = srcPos[tgtIndices[index]];
+  tgtVel[index] = srcVel[tgtIndices[index]];
+}
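+
+// Note that kernShuffleParticleBuffers is a gather: tgt[i] = src[indices[i]].
+// E.g. tgtIndices = [2, 0, 1] with srcPos = [A, B, C] yields tgtPos = [C, A, B],
+// so boid data ends up contiguous in sorted-cell order and the coherent
+// search can read pos/vel directly instead of going through particleArrayIndices.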
 
__global__ void kernUpdateVelNeighborSearchCoherent(
@@ -333,55 +501,164 @@
  // except with one less level of indirection.
  // This should expect gridCellStartIndices and gridCellEndIndices to refer
  // directly to pos and vel1.
+  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+  if (index >= N) {
+    return;
+  }
+
  // - Identify the grid cell that this particle is in
+  glm::ivec3 index3D;
+  unsigned char mask;
+  getGridIndexAndQuadrantMask(pos[index], gridMin, inverseCellWidth, index3D, mask);
+
  // - Identify which cells may contain neighbors. This isn't always 8.
+  int neighborCells[8];
+  int nCells = getNeighborCells(neighborCells, index3D, mask, gridResolution);
+
+  glm::vec3 posSelf = pos[index];
+  glm::vec3 cmass(0.f, 0.f, 0.f);
+  int cmassCount = 0;
+  glm::vec3 collisionV(0.f, 0.f, 0.f);
+  glm::vec3 cvel(0.f, 0.f, 0.f);
+  int cvelCount = 0;
+
  // - For each cell, read the start/end indices in the boid pointer array.
  //   DIFFERENCE: For best results, consider what order the cells should be
  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
+  //   (getNeighborCells emits cells with x varying fastest, matching the
+  //   x-major 1D layout, so consecutive cells tend to be adjacent in memory.)
+  for (int c = 0; c < nCells; ++c) {
+    int n = neighborCells[c];
+    int start = gridCellStartIndices[n];
+    int end = gridCellEndIndices[n];
+    if (start < 0) {
+      continue; // this cell encloses no boids
+    }
+
+    for (int particleIndex = start; particleIndex < end; ++particleIndex) {
+      // - Access each boid in the cell and compute velocity change from
+      //   the boids rules, if this boid is within the neighborhood distance.
+      if (particleIndex == index) continue;
+
+      float dist = glm::distance(pos[particleIndex], posSelf);
+      // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+      if (dist < rule1Distance) {
+        cmass += pos[particleIndex];
+        cmassCount++;
+      }
+      // Rule 2: boids try to stay a distance d away from each other
+      if (dist < rule2Distance) {
+        collisionV -= pos[particleIndex] - posSelf;
+      }
+      // Rule 3: boids try to match the speed of surrounding boids
+      if (dist < rule3Distance) {
+        cvel += vel1[particleIndex];
+        cvelCount++;
+      }
+    }
+  }
+
+  cmass = cmassCount > 0 ? (cmass / (float)cmassCount) - posSelf : cmass;
+  cvel = cvelCount > 0 ? cvel / (float)cvelCount : cvel;
+
+  glm::vec3 vel = vel1[index] + cmass * rule1Scale + collisionV * rule2Scale + cvel * rule3Scale;
+  // - Clamp the speed before putting the new velocity in vel2
+  float speed = glm::length(vel);
+  if (speed > maxSpeed) {
+    vel = vel * (maxSpeed / speed);
+  }
+  vel2[index] = vel;
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
+  kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_pos, dev_vel1, dev_vel2);
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
  // TODO-1.2 ping-pong the velocity buffers
+  std::swap(dev_vel1, dev_vel2);
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
  // TODO-2.1
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
  // Uniform Grid Neighbor search using Thrust sort.
  // In Parallel:
  // - label each particle with its array index as well as its grid index.
  //   Use 2x width grids.
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices,
+    dev_particleGridIndices);
+
  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
  //   are welcome to do a performance comparison.
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
  // - Naively unroll the loop for finding the start and end indices of each
  //   cell's data pointers in the array of boid indices
+  //   (reset both buffers to -1 first so that untouched cells read as empty)
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+
  // - Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+    dev_gridCellEndIndices, dev_particleArrayIndices, dev_pos,
+    dev_vel1, dev_vel2);
+
  // - Update positions
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos, dev_vel2);
+
  // - Ping-pong buffers as needed
+  std::swap(dev_vel1, dev_vel2);
 }
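+
+// For intuition, thrust::sort_by_key sorts the keys and applies the same
+// permutation to the values. E.g. keys [2, 0, 1] with values [0, 1, 2] become
+// keys [0, 1, 2] with values [1, 2, 0]. After sorting, boids in the same grid
+// cell sit in one contiguous run of the index arrays, which is what lets
+// kernIdentifyCellStartEnd mark each cell with a single comparison per boid.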
 
 void Boids::stepSimulationCoherentGrid(float dt) {
  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+  dim3 cellBlocksPerGrid((gridCellCount + blockSize - 1) / blockSize);
+
  // In Parallel:
  // - Label each particle with its array index as well as its grid index.
  //   Use 2x width grids
+  kernComputeIndices<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, dev_pos, dev_particleArrayIndices,
+    dev_particleGridIndices);
+
  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
  //   are welcome to do a performance comparison.
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
  // - Naively unroll the loop for finding the start and end indices of each
  //   cell's data pointers in the array of boid indices
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellStartIndices, -1);
+  kernResetIntBuffer<<<cellBlocksPerGrid, blockSize>>>(gridCellCount, dev_gridCellEndIndices, -1);
+  kernIdentifyCellStartEnd<<<fullBlocksPerGrid, blockSize>>>(numObjects, dev_particleGridIndices,
+    dev_gridCellStartIndices, dev_gridCellEndIndices);
+
  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
  //   the particle data in the simulation array.
  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
+  kernShuffleParticleBuffers<<<fullBlocksPerGrid, blockSize>>>(
+    numObjects, dev_pos, dev_vel1,
+    dev_particleArrayIndices, dev_pos2, dev_vel2);
+
  // - Perform velocity updates using neighbor search
+  //   (reads the shuffled dev_pos2/dev_vel2, writes new velocities to dev_vel1)
+  kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(numObjects, gridSideCount, gridMinimum,
+    gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices,
+    dev_gridCellEndIndices, dev_pos2, dev_vel2, dev_vel1);
+
  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  kernUpdatePos<<<fullBlocksPerGrid, blockSize>>>(numObjects, dt, dev_pos2, dev_vel1);
+
+  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE:
+  //   the new velocities already landed in dev_vel1, so only the position
+  //   buffers need to be swapped.
+  std::swap(dev_pos, dev_pos2);
 }
 
 void Boids::endSimulation() {
@@ -389,7 +666,8 @@ void Boids::endSimulation() {
  cudaFree(dev_vel2);
  cudaFree(dev_pos);
 
  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_pos2);
 }
 
 void Boids::unitTest() {
diff --git a/src/main.cpp b/src/main.cpp
index e416836..da9c0af 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 50000;
 const float DT = 0.2f;
 
 /**