Feature/oro 0 amdadvtech merge (#43)
* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 radix sort (#19)

* [ORO-0] Working 8 bit radix sort.

* [ORO-0] Some optimization.

* Create LICENSE

* Update README.md (#15)

* Feature/oro 0 raw get set (#19)

* [ORO-0] Rename setter and getter.

* [ORO-0] Fix when there is a dll but no device.

* [ORO-0] Deletion function.

* [ORO-0] Multi processor count.

* [ORO-0] Extended the sort to more than 8 bits. Implemented tests.

* [ORO-0] Moved temp buffer allocation out from the sort().

* [ORO-0] README. References.

* [ORO-0] Debug flag.

* Refactor the code to add the basic constructs to support selecting different scan algorithms.
Add different implementations of the scan algorithm: CPU, single WG, and all WGs.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* Optimization: Implement the single-pass kernel for GPU parallel scan.
Fix a GPU memory bug.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 kernel cache (#4)

* [ORO-0] Cache kernel.

* [ORO-0] Support newer HIP builds on windows (#22)

* [ORO-0] Unit test. (#23)

* Fix LDS scan bug.
The previous implementation produced incorrect results when the wavefront (warp) size was not equal to the workgroup (block) size.
Because not all work items run simultaneously, the previous algorithm fails for input arrays larger than the wavefront size:
it performs the scan in place on the input array, so the results of one wavefront (warp) can be overwritten by work items (threads) in another wavefront (warp).

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the LDS scan algorithm. (#6)

* Optimize the LDS scan algorithm.
This version does not require a temp buffer and supports an LDS input size of up to 2 times the workgroup size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Support an input array in LDS that is 2 times the WG size.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Feature/oro 0 clean up (#7)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* Feature/oro 0 clean up (#10)

* Squashed commit of the following:

commit 3f32bea2244653d59efb3c3eaa9433018dde5835
Author: takahiroharada <[email protected]>
Date:   Wed Apr 13 10:48:35 2022 -0700

    [ORO-0] Fix nvrtc.

* [ORO-0] Clean up.

* [ORO-0] SortKernel1. Less complex. (#8)

SortKernel (occupancy: 8)
- vgpr: 128
- lds: 6704
SortKernel1 (occupancy: 9)
- vgpr: 106
- lds 7720

* [ORO-0] Kernel execution time check.

* Fix the memory access pattern and change it to coalesced memory access. (#11)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Single kernel sort for small keys. (#12)

* Optimize the Count kernel for less LDS usage to achieve full occupancy (#13)

* Optimize the Count kernel to use less LDS so that it can achieve full occupancy.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Remove __threadfence_block()

Removes the boundary check in the inner loop.
The upper bound is set only once before going into the loop.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Introduce DRIVER and RTC APIs

* Disable enum-variant

* Improve paths

* Add fields

* Update Vulkan test

* Define CUDA in terms of DRIVER and RTC

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

* Merging another merge (#18)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Calculate the number of WGs based on LDS and max-thread-per-WGP. (#15)

* Calculate the number of WGs based on LDS and max-thread-per-WGP.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add a workaround for CUDA.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize the sort kernel: single-pass 8bit sort & parallel scan in 4bit sort. (#14)

* Fix a minor issue in CountKernel to make it more robust.

Implement a single-pass 8-bit local sort.

Implement a single-pass 8-bit local sort with shared bins.


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix nItemsPerWI and enable the version with shared LDS.


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Print driver version.

* [ORO-0] Repro case.

* Fix SORT_WG_SIZE.
Fix stable sort order.



Signed-off-by: Chih-Chen Kao <[email protected]>

* Optimize sort kernel to remove inner boundary check.
Adjust nItemsPerWI.

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: takahiroharada <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Implement key-value pair sorting (#17)

* Add gitignore to the repository


Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix missing CUDA properties. (#16)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Add basic structure for key-value pair sorting.
Fix an error in single pass sort


Signed-off-by: Chih-Chen Kao <[email protected]>

* Add Value data in the test and sort it according to keys.

Signed-off-by: Chih-Chen Kao <[email protected]>

* Support Key only sorting.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Make single pass kernel non compile time switch.

* Support both Key-Only & Key-Value pair sort kernels


Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Test change.

* [ORO-0] A bug.

* [ORO-0] NVIDIA occupancy computation fix. Test change. Tweak params to use single pass sort as much as possible.

Co-authored-by: Takahiro Harada <[email protected]>
Co-authored-by: takahiroharada <[email protected]>

* [ORO-0] Revert demo code.

* Fix missing CUDA properties.  (#26)

* Update Orochi.cpp

* [ORO-0] Clean up.

* [ORO-0] OroUtils. (#27)

* [ORO-0] OroUtils.

* [ORO-0] Linux build fix.

* [ORO-0] Forgot to add.

* [ORO-0] Linux build fix.

* [ORO-0] Clean up.

Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>

* Add kernel path and include dir to the functions. (#20)

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] BakeKernel. (#21)

* [ORO-0] BakeKernel.

* Update tools/genArgs.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/stringify.py

commented code removal

* Update tools/genArgs.py

dead code removal

* Update tools/stringify.py

dead code removal

* fix include

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix script

Signed-off-by: Chih-Chen Kao <[email protected]>

* fix

Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Fix Orochi CUDA API (#23)

Fix Orochi CUDA API 

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Linux build fix. (#22)

* [ORO-0] Linux build fix.

* Fix Orochi CUDA API


Signed-off-by: Chih-Chen Kao <[email protected]>

Co-authored-by: Chih-Chen Kao <[email protected]>

* Quick fix for old linux gcc which does not support std::exclusive_scan (#24)

Quick fix for old linux gcc which does not support std::exclusive_scan

Signed-off-by: Chih-Chen Kao <[email protected]>

* Fix the kernel cache bug. (#25)

Fix the kernel cache bug.

The function should not return previously created oroFunctions based solely on their names, because they might be invalid.

Signed-off-by: Chih-Chen Kao <[email protected]>

* [ORO-0] Remove static variables. (#26)

* [ORO-0] Remove static variables.

* [ORO-0] Applied the suggestions.

* [ORO-0] Linux regression fix.

* Fix OrochiUtils::getFunctionFromString API (#27)

Signed-off-by: Chih-Chen Kao <[email protected]>

* Adding missing assert (#28)

* Adding missing assert

* Adding more asserts

* Feature/oro 0 gpuopen merge (#31)

* Fix oroGetDeviceProperties in cuda path.

* Fix linux crash (#29)

* [ORO-0] Added missing file.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Fix hipGetErrorString (#32)

* [ORO-0] Fix  hipGetErrorString

It was incorrectly importing this API. Import the correct API in hipew.

* [ORO-0] Remove printf from kernelExec and skip compilation of vulkan test on Linux (#31)

* [ORO-0] Skip compilation of vulkan test on Linux

* [ORO-0] Update kernelExec unit test - remove printf

* [ORO-0] Remove cout

* [ORO-0] Add Orochi error codes mapped to HIP/CUDA (#33)

* Add missing path on Apple config. (#34)

* [ORO-0] Adding hiprtc+comgr dlls to workaround the regression in 22.7.1 driver (#38)

* [ORO-0] Adding hiprtc to work around the regression in the 22.7.1 driver released on 7/26/2022.

* [ORO-0] Created win64 subdir.

* [ORO-0] Add hiprtc.dll and comgr dll

Co-authored-by: takahiroharada <[email protected]>

* fix footnote markdown format (#39)

* Fix orochi utils issue in unit tests

Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>

Signed-off-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Chih-Chen Kao <[email protected]>
Co-authored-by: Aaryaman Vasishta <[email protected]>
Co-authored-by: Mehmet Oguz Derin <[email protected]>
Co-authored-by: Daniel Meister <[email protected]>
Co-authored-by: NevesLucas <[email protected]>
Co-authored-by: PixelClear <[email protected]>
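
The LDS scan fix listed in the commit message turns on a hazard that can be reproduced on the CPU. The sketch below is not the Orochi kernel. It simulates an inclusive Hillis-Steele scan in which work items execute in groups of `waveSize` (a hypothetical stand-in for a wavefront), once updating the array in place and once writing into a second buffer. The function names and the double-buffering remedy are assumptions for illustration; the commit message does not say which fix was used, and a later entry mentions a version that needs no temp buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive Hillis-Steele scan, updated in place. Work items run in groups of
// waveSize; inside a group all reads happen before any write (lockstep), but
// a later group sees whatever earlier groups have already written.
std::vector<int> scanInPlace(std::vector<int> data, std::size_t waveSize)
{
    const std::size_t n = data.size();
    for (std::size_t offset = 1; offset < n; offset *= 2)
    {
        for (std::size_t begin = 0; begin < n; begin += waveSize)
        {
            const std::size_t end = std::min(begin + waveSize, n);
            std::vector<int> regs(end - begin); // per-work-item registers
            for (std::size_t i = begin; i < end; i++)
                regs[i - begin] = data[i] + (i >= offset ? data[i - offset] : 0);
            for (std::size_t i = begin; i < end; i++)
                data[i] = regs[i - begin];
        }
    }
    return data;
}

// Same scan, but every step reads from `data` and writes to `next`, so no
// group can observe a partially updated array. The swap plays the role of a
// workgroup barrier between steps.
std::vector<int> scanDoubleBuffered(std::vector<int> data, std::size_t waveSize)
{
    const std::size_t n = data.size();
    std::vector<int> next(n);
    for (std::size_t offset = 1; offset < n; offset *= 2)
    {
        for (std::size_t begin = 0; begin < n; begin += waveSize)
        {
            const std::size_t end = std::min(begin + waveSize, n);
            for (std::size_t i = begin; i < end; i++)
                next[i] = data[i] + (i >= offset ? data[i - offset] : 0);
        }
        data.swap(next);
    }
    return data;
}
```

With eight ones as input, `scanInPlace` matches the expected prefix sums only when `waveSize` covers the whole array, while `scanDoubleBuffered` is correct for any group size.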
7 people authored Aug 18, 2022
1 parent 03c4676 commit d78fb81
Showing 8 changed files with 357 additions and 18 deletions.
50 changes: 38 additions & 12 deletions Orochi/OrochiUtils.cpp
@@ -155,7 +155,7 @@ struct OrochiUtilsImpl
return false;
}

-static void getCacheFileName( oroDevice device, const char* moduleName, const char* functionName, const char* options, std::string& binFileName )
+static void getCacheFileName( oroDevice device, const char* moduleName, const char* functionName, const char* options, std::string& binFileName, const std::string& cacheDirectory )
{
auto hashBin = []( const char* s, const size_t size )
{
@@ -220,7 +220,7 @@ struct OrochiUtilsImpl
using namespace std::string_literals;

deviceName = deviceName.substr( 0, deviceName.find( ":" ) );
-binFileName = OrochiUtils::s_cacheDirectory + "/"s + moduleHash + "-"s + optionHash + ".v."s + deviceName + "."s + driverVersion + "_"s + std::to_string( 8 * sizeof( void* ) ) + ".bin"s;
+binFileName = cacheDirectory + "/"s + moduleHash + "-"s + optionHash + ".v."s + deviceName + "."s + driverVersion + "_"s + std::to_string( 8 * sizeof( void* ) ) + ".bin"s;
}
static
bool isFileUpToDate( const char* binaryFileName, const char* srcFileName )
@@ -381,27 +381,47 @@ struct OrochiUtilsImpl
}
};

-char* OrochiUtils::s_cacheDirectory = "./cache/";
-std::map<std::string, oroFunction> OrochiUtils::s_kernelMap;
+OrochiUtils::OrochiUtils()
+{
+m_cacheDirectory = "./cache/";
+}
+
+OrochiUtils::~OrochiUtils()
+{
+}

oroFunction OrochiUtils::getFunctionFromFile( oroDevice device, const char* path, const char* funcName, std::vector<const char*>* optsIn )
{
-std::string cacheName = OrochiUtilsImpl::getCacheName( path, funcName );
-if( s_kernelMap.find( cacheName.c_str() ) != s_kernelMap.end() )
+const std::string cacheName = OrochiUtilsImpl::getCacheName( path, funcName );
+if( m_kernelMap.find( cacheName.c_str() ) != m_kernelMap.end() )
{
-return s_kernelMap[ cacheName ];
+return m_kernelMap[ cacheName ];
}

std::string source;
if( !OrochiUtilsImpl::readSourceCode( path, source, 0 ) )
return 0;

oroFunction f = getFunction( device, source.c_str(), path, funcName, optsIn );
-s_kernelMap[cacheName] = f;
+m_kernelMap[cacheName] = f;
return f;
}

-oroFunction OrochiUtils::getFunction( oroDevice device, const char* code, const char* path, const char* funcName, std::vector<const char*>* optsIn )
+oroFunction OrochiUtils::getFunctionFromString( oroDevice device, const char* source, const char* path, const char* funcName, std::vector<const char*>* optsIn,
+int numHeaders, const char** headers, const char** includeNames )
+{
+const std::string cacheName = OrochiUtilsImpl::getCacheName( path, funcName );
+if( m_kernelMap.find( cacheName.c_str() ) != m_kernelMap.end() )
+{
+return m_kernelMap[cacheName];
+}
+oroFunction f = getFunction( device, source, path, funcName, optsIn, numHeaders, headers, includeNames );
+m_kernelMap[cacheName] = f;
+return f;
+}
+
+oroFunction OrochiUtils::getFunction( oroDevice device, const char* code, const char* path, const char* funcName, std::vector<const char*>* optsIn,
+int numHeaders, const char** headers, const char** includeNames )
{
std::vector<const char*> opts;
opts.push_back( "-std=c++17" );
@@ -422,7 +442,7 @@ oroFunction OrochiUtils::getFunction( oroDevice device, const char* code, const
std::string o;
for(int i=0; i<opts.size(); i++)
o.append( opts[i] );
-OrochiUtilsImpl::getCacheFileName( device, path, funcName, o.c_str(), cacheFile );
+OrochiUtilsImpl::getCacheFileName( device, path, funcName, o.c_str(), cacheFile, m_cacheDirectory );
}
if( OrochiUtilsImpl::isFileUpToDate( cacheFile.c_str(), path ) )
{
@@ -433,7 +453,8 @@ oroFunction OrochiUtils::getFunction( oroDevice device, const char* code, const
{
orortcProgram prog;
orortcResult e;
-e = orortcCreateProgram( &prog, code, path, 0, 0, 0 );
+e = orortcCreateProgram( &prog, code, path, numHeaders, headers, includeNames );
+OROASSERT( e == ORORTC_SUCCESS, 0 );

e = orortcCompileProgram( prog, opts.size(), opts.data() );
if( e != ORORTC_SUCCESS )
@@ -449,18 +470,23 @@ oroFunction OrochiUtils::getFunction( oroDevice device, const char* code, const
}
size_t codeSize;
e = orortcGetCodeSize( prog, &codeSize );
+OROASSERT( e == ORORTC_SUCCESS, 0 );

codec.resize( codeSize );
e = orortcGetCode( prog, codec.data() );
+OROASSERT( e == ORORTC_SUCCESS, 0 );
e = orortcDestroyProgram( &prog );
+OROASSERT( e == ORORTC_SUCCESS, 0 );

//store cache
-OrochiUtilsImpl::createDirectory( s_cacheDirectory );
+OrochiUtilsImpl::createDirectory( m_cacheDirectory.c_str() );
OrochiUtilsImpl::cacheBinaryToFile( codec, cacheFile );
}
oroModule module;
oroError ee = oroModuleLoadData( &module, codec.data() );
+OROASSERT( ee == oroSuccess, 0 );
ee = oroModuleGetFunction( &function, module, funcName );
+OROASSERT( ee == oroSuccess, 0 );

return function;
}
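
The cache file name assembled in `getCacheFileName` combines a module hash, an option hash, the device name, the driver version, and the pointer width. The following is a minimal standalone sketch of that naming scheme, assuming a 64-bit FNV-1a hash as a stand-in: the actual hash used by `hashBin` is collapsed in this diff and is not reproduced here. `hashToHex` and `makeCacheFileName` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in hash (64-bit FNV-1a) rendered as 16 hex digits.
// The hash used in the commit is not visible in the diff.
static std::string hashToHex(const std::string& s)
{
    std::uint64_t h = 14695981039346656037ull; // FNV offset basis
    for (unsigned char c : s)
    {
        h ^= c;
        h *= 1099511628211ull; // FNV prime
    }
    static const char* digits = "0123456789abcdef";
    std::string out(16, '0');
    for (int i = 15; i >= 0; --i)
    {
        out[i] = digits[h & 0xf];
        h >>= 4;
    }
    return out;
}

// Mirrors the concatenation order shown in the diff:
// <dir>/<moduleHash>-<optionHash>.v.<device>.<driver>_<pointer bits>.bin
std::string makeCacheFileName(const std::string& cacheDirectory,
                              const std::string& moduleName,
                              const std::string& options,
                              std::string deviceName,
                              const std::string& driverVersion)
{
    // As in the diff, anything after the first ':' in the device name is dropped.
    deviceName = deviceName.substr(0, deviceName.find(':'));
    return cacheDirectory + "/" + hashToHex(moduleName) + "-" + hashToHex(options) +
           ".v." + deviceName + "." + driverVersion + "_" +
           std::to_string(8 * sizeof(void*)) + ".bin";
}
```

Because the compile options feed into the name, changing a flag yields a different cache file rather than reusing a stale binary. Passing the directory as a parameter, as the commit does, keeps the helper independent of any class state.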
16 changes: 11 additions & 5 deletions Orochi/OrochiUtils.h
@@ -1,7 +1,7 @@
#pragma once
#include <Orochi/Orochi.h>
#include <vector>
-#include <map>
+#include <unordered_map>
#include <string>

#if defined(_WIN32)
@@ -18,8 +18,14 @@ class OrochiUtils
int x, y, z, w;
};

-static oroFunction getFunctionFromFile( oroDevice device, const char* path, const char* funcName, std::vector<const char*>* opts );
-static oroFunction getFunction( oroDevice device, const char* code, const char* path, const char* funcName, std::vector<const char*>* opts );
+OrochiUtils();
+~OrochiUtils();
+
+oroFunction getFunctionFromFile( oroDevice device, const char* path, const char* funcName, std::vector<const char*>* opts );
+oroFunction getFunctionFromString( oroDevice device, const char* source, const char* path, const char* funcName, std::vector<const char*>* opts,
+int numHeaders, const char** headers, const char** includeNames );
+oroFunction getFunction( oroDevice device, const char* code, const char* path, const char* funcName, std::vector<const char*>* opts,
+int numHeaders = 0, const char** headers = 0, const char** includeNames = 0 );

static void launch1D( oroFunction func, int nx, const void** args, int wgSize = 64, unsigned int sharedMemBytes = 0 );

@@ -64,6 +70,6 @@ class OrochiUtils
}

public:
-static char* s_cacheDirectory;
-static std::map<std::string, oroFunction> s_kernelMap;
+std::string m_cacheDirectory;
+std::unordered_map<std::string, oroFunction> m_kernelMap;
};
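
The header change above moves the kernel map from a static member to a per-instance member. The sketch below illustrates the effect of that scoping with a hypothetical `KernelCache` class, using `int` as a stand-in for `oroFunction` and a callback in place of real compilation. It is an illustration of the memoization pattern, not the Orochi implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

using KernelHandle = int; // hypothetical stand-in for oroFunction

class KernelCache
{
  public:
    // Returns the cached handle for `name`, or calls `compile` once and stores its result.
    KernelHandle getOrCompile(const std::string& name, const std::function<KernelHandle()>& compile)
    {
        auto it = m_kernels.find(name);
        if (it != m_kernels.end())
            return it->second;
        const KernelHandle h = compile();
        m_kernels.emplace(name, h);
        return h;
    }

    std::size_t size() const { return m_kernels.size(); }

  private:
    // Per-instance map: entries live and die with the owning object,
    // so a second instance never sees handles created by the first.
    std::unordered_map<std::string, KernelHandle> m_kernels;
};
```

Two `KernelCache` objects asked for the same name each trigger their own compilation, whereas a static map would hand the second caller a handle created under the first.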
111 changes: 111 additions & 0 deletions Test/Stopwatch.h
@@ -0,0 +1,111 @@
/*
AMD copyrights (Copyright (c) 2011 Advanced Micro Devices, Inc. All rights reserved)
*/
#pragma once

#if defined(__WINDOWS__)
#define NOMINMAX
#include <windows.h>

#define TIME_TYPE LARGE_INTEGER
#define QUERY_FREQ(f) QueryPerformanceFrequency(&f)
#define RECORD(t) QueryPerformanceCounter(&t)
#define GET_TIME(t) (t).QuadPart*1000.0
#define GET_FREQ(f) (f).QuadPart
#else
#include <sys/time.h>

#define TIME_TYPE timeval
#define QUERY_FREQ(f) f.tv_sec = 1
#define RECORD(t) gettimeofday(&t, 0)
#define GET_TIME(t) ((t).tv_sec*1000.0+(t).tv_usec/1000.0)
#define GET_FREQ(f) 1.0
#endif


class Stopwatch
{
public:
__inline
Stopwatch();
__inline
void init();
__inline
void start();
__inline
void split();
__inline
float getCurrent();
__inline
void stop();
__inline
float getMs();
__inline
void getMs( float* times, int capacity );

private:
enum
{
CAPACITY = 12,
};
int m_idx;

TIME_TYPE m_frequency;
TIME_TYPE m_t[CAPACITY];
};

__inline
Stopwatch::Stopwatch()
{
// QueryPerformanceFrequency( &m_frequency );
QUERY_FREQ( m_frequency );
}

__inline
void Stopwatch::start()
{
m_idx = 0;
// QueryPerformanceCounter(&m_t[m_idx++]);
RECORD( m_t[m_idx++] );
}

__inline
void Stopwatch::split()
{
// QueryPerformanceCounter(&m_t[m_idx++]);
RECORD( m_t[m_idx++] );
}

__inline
float Stopwatch::getCurrent()
{
TIME_TYPE t;
RECORD( t );
return (float)( GET_TIME(t) - GET_TIME(m_t[0]) )/GET_FREQ(m_frequency);
}

__inline
void Stopwatch::stop()
{
split();
}

__inline
float Stopwatch::getMs()
{
// return (float)(1000*(m_t[1].QuadPart - m_t[0].QuadPart))/m_frequency.QuadPart;
return (float)( GET_TIME( m_t[1] ) - GET_TIME( m_t[0] ) )/GET_FREQ( m_frequency );
}

__inline
void Stopwatch::getMs(float* times, int capacity)
{
for(int i=0; i<capacity; i++) times[i] = 0.f;

for(int i=0; i<std::min(capacity, m_idx-1); i++)
{
// times[i] = (float)(1000*(m_t[i+1].QuadPart - m_t[i].QuadPart))/m_frequency.QuadPart;
times[i] = (float)( GET_TIME( m_t[i+1] ) - GET_TIME( m_t[i] ) )/GET_FREQ( m_frequency );
}
}
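
The `Stopwatch` class above selects its timer through platform macros and calls `std::min`, which is declared in `<algorithm>`. For comparison, here is a minimal sketch of the same start/split/stop interface built on `std::chrono::steady_clock`, which needs no platform branches. `ChronoStopwatch` is a hypothetical name and is not part of the commit; unlike the original it stores records in a growable vector rather than a fixed array of 12.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

class ChronoStopwatch
{
  public:
    // Discards earlier records and stores the starting time point.
    void start()
    {
        m_t.clear();
        m_t.push_back(Clock::now());
    }
    // Stores an intermediate time point.
    void split() { m_t.push_back(Clock::now()); }
    void stop() { split(); }

    // Milliseconds between the first two records, or 0 if fewer than two exist.
    float getMs() const { return m_t.size() < 2 ? 0.f : toMs(m_t[1] - m_t[0]); }

    // Fills times[i] with the i-th interval and zeroes the remaining slots.
    // `capacity` is assumed to be non-negative.
    void getMs(float* times, int capacity) const
    {
        std::fill(times, times + capacity, 0.f);
        const int n = std::min(capacity, static_cast<int>(m_t.size()) - 1);
        for (int i = 0; i < n; i++)
            times[i] = toMs(m_t[i + 1] - m_t[i]);
    }

  private:
    using Clock = std::chrono::steady_clock;
    static float toMs(Clock::duration d)
    {
        return std::chrono::duration<float, std::milli>(d).count();
    }
    std::vector<Clock::time_point> m_t;
};
```

`steady_clock` is monotonic, so intervals never come out negative even if the system clock is adjusted during a measurement.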

6 changes: 5 additions & 1 deletion UnitTest/main.cpp
@@ -2,6 +2,9 @@
#include <Orochi/Orochi.h>
#include <Orochi/OrochiUtils.h>

+#if defined( OROASSERT )
+#undef OROASSERT
+#endif
#define OROASSERT( x ) ASSERT_TRUE( x )
#define OROCHECK( x ) { oroError e = x; OROASSERT( e == ORO_SUCCESS ); }

@@ -50,11 +53,12 @@ TEST_F( OroTestBase, deviceprops )

TEST_F( OroTestBase, kernelExec )
{
+OrochiUtils o;
int a_host = -1;
int* a_device = nullptr;
OROCHECK( oroMalloc( (oroDeviceptr*)&a_device, sizeof( int ) ) );
OROCHECK( oroMemset( (oroDeviceptr)a_device, 0, sizeof( int ) ) );
-oroFunction kernel = OrochiUtils::getFunctionFromFile( m_device, "../UnitTest/testKernel.h", "testKernel", 0 );
+oroFunction kernel = o.getFunctionFromFile( m_device, "../UnitTest/testKernel.h", "testKernel", 0 );
const void* args[] = { &a_device };
OrochiUtils::launch1D( kernel, 64, args, 64 );
OrochiUtils::waitForCompletion();
6 changes: 6 additions & 0 deletions tools/bakeKernel.bat
@@ -0,0 +1,6 @@
echo // automatically generated, don't edit > ParallelPrimitives/cache/Kernels.h
echo // automatically generated, don't edit > ParallelPrimitives/cache/KernelArgs.h
python tools/stringify.py ./ParallelPrimitives/RadixSortKernels.h >> ParallelPrimitives/cache/Kernels.h
python tools/genArgs.py ./ParallelPrimitives/RadixSortKernels.h >> ParallelPrimitives/cache/KernelArgs.h

python tools/stringify.py ./ParallelPrimitives/RadixSortConfigs.h >> ParallelPrimitives/cache/Kernels.h
7 changes: 7 additions & 0 deletions tools/bakeKernel.sh
@@ -0,0 +1,7 @@
# mkdir hiprt/cache/
echo "// automatically generated, don't edit" > ParallelPrimitives/cache/Kernels.h
echo "// automatically generated, don't edit" > ParallelPrimitives/cache/KernelArgs.h
python tools/stringify.py ./ParallelPrimitives/RadixSortKernels.h >> ParallelPrimitives/cache/Kernels.h
python tools/genArgs.py ./ParallelPrimitives/RadixSortKernels.h >> ParallelPrimitives/cache/KernelArgs.h

python tools/stringify.py ./ParallelPrimitives/RadixSortConfigs.h >> ParallelPrimitives/cache/Kernels.h
56 changes: 56 additions & 0 deletions tools/genArgs.py
@@ -0,0 +1,56 @@
#!/usr/bin/env python
from __future__ import print_function

import sys
import os

def genArgs( fileName, api, includes ):
    with open(fileName) as f:
        iName = os.path.basename( fileName ).split('.')[0]

        print( '#if !defined(ORO_PP_LOAD_FROM_STRING)' )
        print( ' static const char** '+iName+'Args = 0;' )
        print( '#else' )
        print( ' static const char* '+iName+'Args[] = {' )
        includes += iName +'Includes[] = {'
        for line in f.readlines():
            a = line.strip('\r\n')
            if a.find('#include') == -1:
                continue
            if a.find('#include') != -1 and a.find('inl.' + api) != -1:
                continue
            if (api == 'cl' or api == 'metal') and a.find('.cu') != -1:
                continue
            if (a.find('"') != -1 and a.find('#include') != -1):
                continue

            filename = os.path.basename(a.split('<')[1].split('>')[0])
            includes += '"' + a.split('<')[1].split('>')[0] + '",'
            name = filename.split('.' + api)[0]
            name = name.split('.h')[0]
            name = api + '_'+name
            print ( name + ',' )
        print( api + '_'+iName+'};' )
        print( '#endif' )
    return includes

argvs = sys.argv

files = []
if len(argvs) >= 2:
    files.append( argvs[1] )

print( '#pragma once' )


api = 'hip'

# Visit each file
print( 'namespace ' + api + ' {')

includes = 'static const char* '
for s in files:
    includes = genArgs(s, api, includes)
    includes += '};'
print( includes )
print( '}\t//namespace ' + api)
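
The filtering rule inside `genArgs` (skip lines without `#include`, skip `inl.<api>` and quoted includes, then derive a symbol name from the angle-bracket path) can be checked in isolation. The following C++ sketch restates that per-line rule for the `hip` case; `includeToArgName` is a hypothetical helper, the `cl`/`metal` branch is omitted, and it returns an empty string where the script would `continue`. Unlike the script, it also returns an empty string for an `#include` line without angle brackets instead of failing.

```cpp
#include <cassert>
#include <string>

// Maps one source line to the generated argument name, or "" if the line is skipped.
std::string includeToArgName(const std::string& line, const std::string& api)
{
    const std::string::size_type npos = std::string::npos;
    if (line.find("#include") == npos)
        return ""; // not an include line
    if (line.find("inl." + api) != npos)
        return ""; // api-specific inline file
    if (line.find('"') != npos)
        return ""; // quoted includes are ignored
    const std::string::size_type lt = line.find('<');
    if (lt == npos)
        return "";
    const std::string::size_type gt = line.find('>', lt);
    if (gt == npos)
        return "";
    const std::string path = line.substr(lt + 1, gt - lt - 1);
    const std::string::size_type slash = path.find_last_of('/');
    std::string name = (slash == npos) ? path : path.substr(slash + 1); // basename
    name = name.substr(0, name.find("." + api)); // drop ".<api>" and what follows
    name = name.substr(0, name.find(".h"));      // drop ".h" and what follows
    return api + "_" + name;
}
```

For a hypothetical input line `#include <ParallelPrimitives/RadixSortConfigs.h>` and `api` set to `hip`, the helper yields `hip_RadixSortConfigs`, matching the prefix-plus-basename form the script prints.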