Old Design Issues
Implementations for various memory models are tightly coupled, complicating maintenance and making changes difficult
Some memory models cannot be used in the same build (e.g. copy hiding and single memory space models cannot be used together)
Build time configuration of CHAI is complicated and varies significantly by platform
Inconsistent support for features across memory models (e.g. callbacks; resize invalidates prior copies in some cases but not others)
New Design Goals
Modularize implementations for different memory models so that maintenance, understanding, and change are easier
Allow all supported memory models to be used in the same build
Simplify build time configuration of CHAI (should be able to remove variables like CHAI_DISABLE_RM, CHAI_ENABLE_UM, CHAI_ENABLE_PINNED, CHAI_THIN_GPU_ALLOCATE)
Maintain or improve performance. In particular, avoid inheritance in device and other performance critical code
Leave the design space open for shared_ptr-like semantics and container semantics
Leave the design space open for user customization
Make it harder to use ManagedArray incorrectly (e.g. calling allocate twice could easily leak, so combine allocate and reallocate into a single resize function; another example would be preventing calls to resize or free on a slice)
Requirements (Ramblings?)
Avoid inheritance in device and other performance critical code
ManagedArray (or whatever classes are copied onto the device) should not inherit from a base class (with the possible exceptions of an empty base class for tagging, or CRTP). Device compilers often struggle to implement virtual functions correctly and optimize them.
ManagedArray could contain a pointer to a base class or a type erased class that involves inheritance. That member would only be used in host code (e.g. to update coherence in the host copy constructor).
Maintain/improve performance
Limit inheritance as mentioned above.
Use camp::Resource instead of ExecutionSpace so that synchronization could be limited to streams instead of the whole device.
Consider whether device memory allocations could also occur in a stream
ManagedArray::operator[] should be inlined, non-virtual, and a single array access
Reduce the size of ManagedArray (currently has 8 members, and I think it could be reduced to 3). This would help performance as ManagedArray is designed to be passed by copy everywhere.
Continue to optimize in the case where ManagedArray is read only.
API on device
Limited to ManagedArray::size(), ManagedArray::operator[], and the copy constructor.
Discrete and unified memory implementations should be separate, but have the same interface.
Option 1: Static polymorphism (i.e. CRTP)
Option 2: Template parameter to ManagedArray (ManagedArray<ElementType, ArrayManagerType>)
Option 3: Runtime polymorphism (ArrayManager is a base class, UnifiedArrayManager and DiscreteArrayManager would inherit from it). ManagedArray would contain a pointer to ArrayManager.
Option 4: Type erasure (ArrayManager is a type-erased container for any type that provides the necessary interface). ManagedArray would contain an ArrayManager object
Option 5: Just allow separate types and rely on testing to make sure they provide the minimal interface (though extra features could slip in on one type and prevent it from being interchangeable)
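As a concrete illustration of Option 2, here is a minimal host-only sketch. The names (HostOnlyManager) and members are hypothetical, not CHAI's actual API; the point is only that any manager providing the agreed-upon interface works as a template argument with no virtual dispatch.

```cpp
#include <cstddef>
#include <vector>

// Option 2 sketch: the manager is a template parameter. Any type that
// provides size()/data() compiles; dispatch is static, not virtual.
template <typename ElementType, typename ArrayManagerType>
class ManagedArray {
public:
  explicit ManagedArray(std::size_t n) : m_manager(n) {}
  std::size_t size() const { return m_manager.size(); }
  ElementType* data() { return m_manager.data(); }
private:
  ArrayManagerType m_manager;
};

// Hypothetical stand-in for a manager. A real DiscreteArrayManager would
// move data between memory spaces here; this one just owns host storage.
class HostOnlyManager {
public:
  explicit HostOnlyManager(std::size_t n) : m_storage(n) {}
  std::size_t size() const { return m_storage.size(); }
  int* data() { return m_storage.data(); }
private:
  std::vector<int> m_storage;
};
```

Usage would look like `ManagedArray<int, HostOnlyManager> array(4);` — the manager type is visible in the array's type, which is exactly the trade-off discussed below.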
Design Questions
Should different underlying array implementations result in different types being exposed to the user?
Having separate types corresponds to options 1 and 2 above.
Examples: UnifiedArray vs DiscreteArray or ManagedArray<int, UnifiedArrayManager> vs ManagedArray<int, DiscreteArrayManager>
Note that regardless of the implementation, it should have the same interface so that it can be used in the same way (should act the same with RAJA loops, for example).
Pros:
Avoids virtual function calls (though it may not avoid indirection).
In some cases, we care about the type of array (really the type of underlying memory). For example, with MPI calls we might want to force a host or device pointer based on whether we want to use GPU aware MPI or not. Or for performance, we might want to use a pinned array (such as for a scan where we need to pick off the last element on the host). There are probably other ways to do this besides forcing CHAI to have separate types. Depending on the type of array, we might be able to optimize some algorithms (though this is more speculative).
Cons:
Could potentially increase the use of templates. For example, a non-template function that previously took a ManagedArray might now have to be templated on the ArrayManager type. To avoid making it a template function, applications would have to pick a single ArrayManager type per build and use a type alias to hide that type.
Might increase build time.
To get around some of the limitations, there could be a partial specialization of ManagedArray (e.g. ManagedArray<int, AnyManagerType>) that uses some form of type erasure. But I'm not sure I see the advantage of this over just having a single type.
This also raises the question of whether different memory types should result in different array types (i.e. should there be a CudaHostPinnedArray, CudaUnifiedArray, HipMallocArray, CudaMallocArray, HostMallocArray, etc.?). From the current standpoint of CHAI, there are really two paradigms: either you need to synchronize and copy for coherence, or only synchronize for coherence.
Exposing a single type as an interface corresponds to options 3 and 4 above.
Pros:
Simple from the user perspective (single interface, the only thing that varies is how it is constructed)
Matches the current usage of ManagedArray
Applications can easily switch the memory model used based on runtime options (better support for fat binaries)
No template explosion
Faster to compile
Cons:
Simple implementation uses inheritance with virtual functions for updating coherence, which might hurt performance (but the impact has not been measured against the indirection already being used)
Complicated implementation uses type erasure (which either uses virtual functions or trades space for direct dispatch implementations).
Should there be separate types for a ManagedArray and a slice of a ManagedArray?
Currently, a slice of a ManagedArray produces another ManagedArray. As a consequence, users can still call allocate, reallocate, and free on a slice. That means we have to do extra checking in these routines and decide how to handle cases where we are dealing with a slice. Free can easily be a no-op, but what to do in the case of allocate/reallocate? Throw an exception? Silently fail? Write an error message and continue? Also, 3 of the 8 data members in ManagedArray are used to support slices. This increased size could be hurting performance for all ManagedArrays when slices are really a small percentage of the total uses of ManagedArray.
Having separate types for ManagedArray and say ManagedArrayView could fix these problems. It would be a compile time error instead of a runtime error to call free/allocate/reallocate on a slice, for example. The increased safety is a solid upside. The main downside is that this would be a breaking change.
The tricky part is how to reduce code duplication since most of the implementation for ManagedArray and ManagedArrayView would be the same, modulo a couple of functions. Perhaps ManagedArray could be templated on a HandleType that contains a pointer to the ArrayManager. One could envision PointerHandle, ViewHandle, LegacyPointerHandle (that provides CHAI's current behavior where slices are the same type), and even future types such as SharedPointerHandle, UniquePointerHandle, or ContainerHandle.
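The HandleType idea can be sketched with toy types (all names here — ToyManager, PointerHandle, ViewHandle — are hypothetical illustrations, not proposed API). Because C++ instantiates template member functions lazily, an array whose handle lacks free() turns a misuse into a compile-time error rather than a runtime check:

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for the ArrayManager that a handle points to.
struct ToyManager { std::vector<int> storage; };

// Owning handle: provides free(), so Array<PointerHandle>::free() compiles.
class PointerHandle {
public:
  explicit PointerHandle(ToyManager* m) : m_manager(m) {}
  ToyManager* get() const { return m_manager; }
  void free() { delete m_manager; m_manager = nullptr; }
private:
  ToyManager* m_manager;
};

// Non-owning view handle: deliberately has no free() member.
class ViewHandle {
public:
  explicit ViewHandle(ToyManager* m) : m_manager(m) {}
  ToyManager* get() const { return m_manager; }
private:
  ToyManager* m_manager;
};

template <typename HandleType>
class Array {
public:
  explicit Array(HandleType handle) : m_handle(handle) {}
  std::size_t size() const { return m_handle.get()->storage.size(); }
  // Instantiated lazily: only compiles for handles that provide free().
  void free() { m_handle.free(); }
private:
  HandleType m_handle;
};
```

With this shape, `Array<ViewHandle> view(...); view.free();` fails to compile, while the shared implementation (size, element access, etc.) is written once.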
Should the execution context policy be part of the ManagedArray API?
CHAI's main goal is implicit memory coherence. Using CHAI with RAJA provides this implicit coherence for a lot of application code. However, there are cases where applications need to extract a raw pointer (for example a pointer passed to a host only library function). In this case, there are two possible (and not necessarily mutually exclusive) options.
Option 1: Keep the execution context implicit from the perspective of ManagedArray. This means the application needs to take care of setting the execution context. The major downside is that this is a breaking change. But it is a breaking change that would allow for certain use cases (such as OpenMP with nowait). Another downside is a little bit of boilerplate code (can be limited to introduction of a new scope and a single line making use of RAII). The upside is that this approach would allow easier change of the execution context to use camp::Resource later. Another upside is this explicit approach may actually help reduce uses of a pointer that might be invalidated later. There is a possibility that this design could allow for a lightweight and hopefully faster HostArrayManager that does not have to check the execution context at all for host only builds.
{
ExecutionContextGuard guard(chai::CPU);
T* data = managedArray.data();
someHostLibraryFunction(data);
}
Option 2: Make the execution context a part of the ManagedArray API. This increases coupling and makes future changes harder (such as moving to camp::Resource). To decrease coupling, ManagedArray could be templated on an ExecutionContextPolicy. This option may be more convenient for users (avoids the boilerplate), but may result in less disciplined use of the extracted pointer, which could be invalidated by other uses.
T* data = managedArray.data(chai::CPU);
Option 3: Expose both options. The coupling to the execution context may make change harder, but the users get the convenience they may want. There may also be ambiguity, though, about whether calling data() on the host means data(chai::CPU) or data(chai::ExecutionContextManager::getExecutionContext()).
Things to consider:
Near-term change from chai::ExecutionSpace to camp::Resource
Breaking current code (ManagedArray::data() always assumes the host execution space). Also note that the current behavior prevents us from using OpenMP with the nowait policy.
Possible future needs for thread safety
This example showed the data method, but this discussion also applies to construction (I think it would make sense to allocate on a particular GPU stream, for example) and getting/setting a single element in the array. Should the execution context be passed explicitly to these methods, or should these methods grab the execution context from the execution context manager? Or should there be a default execution context set somewhere?
Proposed Design
Architectural Elements
ManagedArray
Instead of containing multiple memory management implementations and delegating some implementations to the old ArrayManager, some of which are chosen at configuration time and others at run time, ManagedArray will delegate the memory management to its own instance of an ArrayManager (not a singleton). ManagedArray will contain an active pointer, a size, and a handle to an ArrayManager. This will significantly decrease the size of ManagedArray, which is advantageous because it is designed to be passed around by copy.
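The proposed three-member layout might look like the following minimal sketch (member names hypothetical). Only a pointer to the manager is stored, so the object copied into kernels stays small:

```cpp
#include <cstddef>

class ArrayManager;  // host-side coherence manager; only a pointer is
                     // stored here, so a forward declaration suffices

// Sketch of the slimmed-down layout: three members instead of eight.
template <typename ElementType>
class ManagedArray {
private:
  ElementType* m_active_pointer = nullptr;  // dereferenced on host and device
  std::size_t m_size = 0;                   // read on host and device
  ArrayManager* m_manager = nullptr;        // used in host code only
};
```

On a typical 64-bit platform this is 24 bytes, versus the larger footprint of the current eight-member layout.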
ArrayManager
In this design, ArrayManager is an abstract base class rather than a concrete type. It takes on the role of PointerRecord and parts of the role of the old ArrayManager class (basically everything except managing the execution context). If users desire customization, they can derive from this class.
UnifiedArrayManager
This class replaces the "thin" implementation of ManagedArray. It is designed to be used in the case where there is a unified memory space (same address can be used across execution contexts) but synchronization is still required for coherence. Interacts with ExecutionContextManager to get the current execution context and synchronize execution contexts when needed.
DiscreteArrayManager
This class replaces the "main" implementation of ManagedArray. It is designed to be used in the case where there are discrete memory spaces, and copying between them is required for coherence. Interacts with ExecutionContextManager to get the current execution context and informs ExecutionContextManager if synchronization has taken place (such as during a cudaMemcpy from device to host).
ExecutionContext
The old ExecutionSpace enum grew in a confusing way to include both execution spaces and memory types. Also, the term "space" has become a little limiting. For performance reasons, it is important to be able to not only distinguish whether we are executing on a GPU, but which GPU stream is being used. That would allow CHAI to synchronize at the stream level rather than the whole device, which is a very expensive operation. ExecutionContext would work with camp::Resource objects, which would provide that more fine-grained control. It could also be made to work with the current ExecutionSpace enum in the short-term, choosing a default camp resource under the hood.
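A stream-aware context can be sketched as a small value type (names hypothetical; a real implementation would wrap a camp::Resource rather than a raw integer):

```cpp
#include <cstdint>

// Sketch: a context that distinguishes streams, not just devices, so CHAI
// can synchronize a single stream instead of the whole device.
enum class Device { Host, Gpu };

struct ExecutionContext {
  Device device = Device::Host;
  std::uint64_t stream = 0;  // stand-in for a stream/queue handle

  bool operator==(const ExecutionContext& other) const {
    return device == other.device && stream == other.stream;
  }
  bool operator!=(const ExecutionContext& other) const {
    return !(*this == other);
  }
};
```

Two GPU contexts with different stream values compare unequal, which is what allows synchronization to be scoped to the stream that actually needs it.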
ExecutionContextManager
This class takes on some of the role of ArrayManager, namely managing the current execution context. It also keeps track of which execution contexts are unsynchronized. Concrete ArrayManager implementations communicate synchronization that has occurred or synchronization that is needed to ExecutionContextManager, which then performs the synchronization only if it is needed. This is much more performant than having every ArrayManager instance call cudaDeviceSynchronize(), for example (e.g. if five ManagedArrays are copied into a RAJA sequential loop, we only want to synchronize the device once, not five times).
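The deferred-synchronization bookkeeping can be sketched as follows (names and the Context enum are hypothetical; the counter stands in for an actual cudaStreamSynchronize/cudaDeviceSynchronize call):

```cpp
#include <set>

enum class Context { CPU, GPU };

// Sketch: managers report that a context is dirty; the manager performs
// the actual synchronization at most once per dirty context.
class ExecutionContextManager {
public:
  void markUnsynchronized(Context c) { m_dirty.insert(c); }

  // Returns true if a synchronize was actually performed.
  bool synchronizeIfNeeded(Context c) {
    if (m_dirty.erase(c) > 0) {
      ++m_sync_count;  // stand-in for the real synchronization call
      return true;
    }
    return false;  // already coherent; nothing to do
  }

  int syncCount() const { return m_sync_count; }

private:
  std::set<Context> m_dirty;
  int m_sync_count = 0;
};
```

In the five-ManagedArray example from the text, all five copy constructors call synchronizeIfNeeded, but only the first triggers a synchronization.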
ExecutionContextGuard
This class provides a convenient way of setting the execution context on construction, and restoring the previous execution context on destruction. It is a similar idea to std::lock_guard and would be used the same way, though it does not provide thread safety. Another type of guard could potentially be added to provide thread safety if needed. This class would not be used by the RAJA plugin, but could be helpful in application code when raw data needs to be extracted from a ManagedArray (such as passing a raw pointer to a function).
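A minimal sketch of the guard, assuming a global current-context variable stands in for ExecutionContextManager's state (the Context enum and g_current_context are illustrative only):

```cpp
enum class Context { CPU, GPU };

Context g_current_context = Context::GPU;  // stand-in for manager state

// Sketch: sets the context on construction, restores it on destruction,
// mirroring the shape of std::lock_guard (but providing no thread safety).
class ExecutionContextGuard {
public:
  explicit ExecutionContextGuard(Context context)
      : m_previous(g_current_context) {
    g_current_context = context;  // set on construction
  }

  ~ExecutionContextGuard() { g_current_context = m_previous; }  // restore

  ExecutionContextGuard(const ExecutionContextGuard&) = delete;
  ExecutionContextGuard& operator=(const ExecutionContextGuard&) = delete;

private:
  Context m_previous;
};
```

This supports exactly the usage shown earlier: open a scope, construct the guard with chai::CPU, extract the raw pointer, and the previous context is restored automatically when the scope closes.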
API (WIP)
SharedPointerHandle
template <typename T>
class SharedPointerHandle {
public:
  SharedPointerHandle();
  SharedPointerHandle(T* pointer);
  CHAI_HOST_DEVICE SharedPointerHandle(const SharedPointerHandle& other); // Does reference counting
  ~SharedPointerHandle(); // Decrements the count; frees when it reaches zero
private:
  T* m_pointer;
  int* m_count; // Shared count; must be a pointer so all copies see the same count
};
ArrayManager
/*!
* \class ArrayManager
*
* \brief Controls the coherence of an array.
*/
template <typename ElementType>
class ArrayManager {
public:
/*!
* \brief Virtual destructor.
*/
virtual ~ArrayManager() = default;
/*!
* \brief Creates a clone of this ArrayManager.
*
* \return A new ArrayManager object that is a clone of this instance.
*/
virtual ArrayManager* clone() const = 0;
/*!
* \brief Resizes the array to the specified new size.
*
* \param newSize The new size to resize the array to.
*/
virtual void resize(std::size_t newSize) = 0;
/*!
* \brief Returns the size of the contained array.
*
* \return The size of the contained array.
*/
virtual std::size_t size() const = 0;
/*!
* \brief Updates the data to be coherent in the current execution context.
*
 * \return A pointer to an array that is coherent in the current execution context.
*/
virtual ElementType* data() = 0;
/*!
* \brief Returns the value at index i.
*
* Note: Use this function sparingly as it may be slow.
*
* \param i The index of the element to get.
* \return The value at index i.
*/
virtual ElementType get(std::size_t i) const = 0;
/*!
* \brief Sets the value at index i to the specified value.
*
* Note: Use this function sparingly as it may be slow.
*
* \param i The index of the element to set.
* \param value The value to set at index i.
*/
virtual void set(std::size_t i, ElementType value) = 0;
};
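To check that the interface above is implementable, here is a hypothetical host-only derived class (HostArrayManager is not part of the proposal; it is a sketch in which no coherence work is needed, so every operation maps directly onto a std::vector). The interface is restated so the sketch is self-contained:

```cpp
#include <cstddef>
#include <vector>

// The abstract interface from above, restated for a self-contained example.
template <typename ElementType>
class ArrayManager {
public:
  virtual ~ArrayManager() = default;
  virtual ArrayManager* clone() const = 0;
  virtual void resize(std::size_t newSize) = 0;
  virtual std::size_t size() const = 0;
  virtual ElementType* data() = 0;
  virtual ElementType get(std::size_t i) const = 0;
  virtual void set(std::size_t i, ElementType value) = 0;
};

// Hypothetical host-only implementation: no synchronization or copying is
// needed for coherence, so the vector is the single source of truth.
template <typename ElementType>
class HostArrayManager : public ArrayManager<ElementType> {
public:
  explicit HostArrayManager(std::size_t n = 0) : m_storage(n) {}
  HostArrayManager* clone() const override {
    return new HostArrayManager(*this);  // covariant return is allowed
  }
  void resize(std::size_t newSize) override { m_storage.resize(newSize); }
  std::size_t size() const override { return m_storage.size(); }
  ElementType* data() override { return m_storage.data(); }
  ElementType get(std::size_t i) const override { return m_storage[i]; }
  void set(std::size_t i, ElementType value) override { m_storage[i] = value; }
private:
  std::vector<ElementType> m_storage;
};
```

A manager like this is also the kind of lightweight HostArrayManager mentioned earlier that could skip execution-context checks entirely in host-only builds.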