- Ayush Sharma
- 2019101004
The Floyd–Warshall algorithm finds shortest paths in a directed weighted graph with positive or negative edge weights (but no negative cycles). A single execution of the algorithm finds the lengths of the shortest paths between all pairs of vertices.
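For reference, the textbook form of the algorithm is the classic triple loop over a distance matrix (a minimal sketch; `N`, `INF` and `floyd_baseline` are illustrative names, not the submitted code):

```c
#define N 2229                 /* number of vertices, e.g. the largest test case */
#define INF 1000000000         /* "no edge"; INF + INF still fits in a 32-bit int */

/* Classic O(V^3) Floyd-Warshall: after iteration k, dist[i][j] holds the
 * shortest i -> j distance using only intermediate vertices 0..k. */
void floyd_baseline(int dist[N][N])
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (dist[i][j] > dist[i][k] + dist[k][j])
                    dist[i][j] = dist[i][k] + dist[k][j];
}
```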
We were given 10 test cases, the largest of which has 2229 nodes and 155317 edges. The graphs may contain self-loops and multiple edges between the same pair of nodes. Although I profiled all of them, I will list timings and the cache profile for the largest test case only.
Time taken with the trivial algorithm, i.e. with no optimisation: ~63 seconds.
My machine has 8 cores in total.
Final total time:
- On Abacus: ~19.832843s (core-independent, as no multiprogramming was involved)
- On my machine: ~15s
NOTE: Refer to https://github.com/ayushsharma-crypto/Optimizing-Techniques for a detailed explanation.
- Converted the 2-D array storing the adjacency matrix into a 1-D array. For accessing `A[i][j]` we then use pointer notation (`*(A + i*V + j)`), keeping in mind that 2-D matrices are generally stored in row-major format in main memory (see the sketch after this list).
- Pointer access to memory: restricted pointer access instead of array look-ups.
- Pre-increment over post-increment: pre-increment is faster than post-increment because post-increment keeps a copy of the previous (existing) value and then adds 1 to it, while pre-increment simply adds 1 without keeping a copy.
- Used the `register` keyword to reduce fetch time for the `i`, `j`, `k` iterators, which were used most often.
- Used temporary variables to cut down memory look-ups through pointers, i.e. `register int * kmj = (matrix + k*V + j);` and `register int * imk = (matrix + i*V + k);`.
- Tweaked the value of `INFINITY` to 1000000000, so that the sum of two `INFINITY` values still fits in a 32-bit `int` without overflowing.
- Removed spurious `if` conditions.
- Used more general memory-lookup pointer variables, i.e. replaced `imk` & `kmj` with `im` and `km`. These new variables have a higher read-to-write ratio than `imk` & `kmj`.
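As a minimal sketch of the 2-D to 1-D conversion from the first point (illustrative helper names, not from the submitted code):

```c
#include <stdlib.h>

/* Row-major mapping: element (i, j) of a V x V matrix lives at offset i*V + j,
 * so A[i][j] becomes *(A + i*V + j) on a single contiguous block. */
static inline int * cell(int * A, int V, int i, int j)
{
    return A + i*V + j;
}

int * alloc_matrix(int V)
{
    /* one contiguous allocation instead of V separate rows: better locality */
    return malloc((size_t)V * V * sizeof(int));
}
```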
The resulting optimised kernel (the inner loop is unrolled 16x, with a remainder loop for the leftover columns):

```c
void floyd_v2(int * matrix, int V)
{
    register int i, j, k, kmj, imk, v;
    register int * km;
    register int * im;
    v = V;
    for (k = 0; k < v; ++k)
    {
        km = (matrix + k*v);          /* start of row k */
        for (i = 0; i < v; ++i)
        {
            im = (matrix + i*v);      /* start of row i */
            for (j = 0; j + 15 < v; j += 16)
            {
                imk = (*(im + k));
                kmj = (*(km + j+0));
                if ( (*(im + j+0)) > ( kmj + imk ) ) (*(im + j+0)) = ( kmj + imk );
                /* ... offsets j+1 through j+14 handled identically ... */
                kmj = (*(km + j+15));
                if ( (*(im + j+15)) > ( kmj + imk ) ) (*(im + j+15)) = ( kmj + imk );
            }
            while (j < v)             /* remainder: last (v mod 16) columns */
            {
                kmj = (*(km + j));
                imk = (*(im + k));
                if ( (*(im + j)) > ( kmj + imk ) ) (*(im + j)) = ( kmj + imk );
                j++;
            }
        }
    }
}
```
Run time was reduced to ~19.8s on Abacus.
- On Abacus with 4 cores: ~8.310297s (on average)
- On Abacus with 5 cores: ~6.212121s (on average)
- On my machine: ~5.6966s (on average)
The OpenMP API uses the fork-join model of parallel execution. Multiple threads perform tasks defined implicitly or explicitly by OpenMP directives. All OpenMP applications begin as a single thread of execution, called the initial thread. The initial thread executes sequentially until it encounters a parallel construct. At that point, it creates a team consisting of itself and zero or more additional threads and becomes the master thread of the new team. Each thread executes the code inside the parallel region, and their execution may be differentiated according to additional directives provided by the programmer. At the end of the parallel region, all threads are synchronized.
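As a minimal standalone illustration of this fork-join behaviour (a sketch, separate from the submitted code):

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("initial thread executing sequentially\n");

    /* fork: the initial thread becomes the master of a new team */
    #pragma omp parallel
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   /* join: implicit barrier; all threads synchronize here */

    printf("single thread again after the join\n");
    return 0;
}
```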
The parallelised kernel:

```c
void floyd_v1(int * matrix, int V)
{
    register int i, j, k;
    for (k = 0; k < V; ++k)
    {
        register int* km = (matrix + k*V);       /* row k, read by every thread */
        /* split the rows of the i-loop across the thread team */
        #pragma omp parallel for private(j) shared(matrix, V)
        for (i = 0; i < V; ++i)
        {
            register int* im = (matrix + i*V);   /* row i, local to this iteration */
            for (j = 0; j < V; ++j)
            {
                if ( (*(im + j)) > ( (*(im + k)) + (*(km + j)) ) )
                    (*(im + j)) = ( (*(im + k)) + (*(km + j)) );
            }
        }
    }
}
```
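For reference, a version like this would typically be compiled with OpenMP enabled, e.g. `gcc -O3 -fopenmp floyd.c`, and the team size controlled through the `OMP_NUM_THREADS` environment variable; the exact flags used on Abacus are not recorded here, so treat these as illustrative.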
On Abacus, the best running time we achieved with our final code was ~5s on 5 cores.
Profiling on Abacus: