Answer:
CUDA (Compute Unified Device Architecture) Memory
Apart from the device DRAM, CUDA supports several additional types of memory that can be used to increase the CGMA (compute to global memory access) ratio of a kernel. Accessing DRAM is slow and expensive, so to compensate, a CUDA GPU provides several low-capacity, high-bandwidth memories, both on-chip and off-chip. If some data is used frequently, CUDA caches it in one of these faster memories so that the processor does not need to go back to DRAM every time. The following figure illustrates the memory architecture supported by CUDA, as typically found on Nvidia cards.
Device code:
Thread: This is just an execution of the kernel with a given index. Each thread uses its index to access elements in an array, so that the collection of all threads cooperatively processes the entire data set (see the kernel sketch after this list).
Block: This is a group of threads. There's not much you can say about the order in which the threads within a block execute; they could run concurrently or serially, and in no particular order. You can coordinate the threads somewhat using the __syncthreads() function, which makes a thread stop at a certain point in the kernel until all the other threads in its block reach the same point.
Grid: This is a group of blocks. There’s no synchronization at all between the blocks.
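To make the thread/block/grid indexing concrete, here is a minimal vector-add kernel, written as a sketch; the name addKernel and the 256-thread block size are illustrative choices, not part of the original answer.

__global__ void addKernel(const float *a, const float *b, float *c, int n)
{
    /* Global index = block position times block width plus the
       thread's position within its block. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                /* guard threads past the end of the data */
        c[i] = a[i] + b[i];
}

A launch such as addKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n) creates a grid with enough 256-thread blocks to cover all n elements.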
Host code: This runs on the CPU and, among other things, transfers data to and from the per-grid global and constant memories.
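A hedged sketch of that host-side flow, reusing the addKernel above (error checking omitted for brevity; the sizes are illustrative):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    /* Host allocations and initialization. */
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    /* Device allocations live in global memory. */
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    /* Host -> device transfers into global memory. */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch enough 256-thread blocks to cover all n elements. */
    addKernel<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    /* Device -> host transfer of the result. */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}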
Global Memory:
This is a high-capacity, high-latency memory (the slowest in the figure), residing in the device DRAM. To increase the arithmetic intensity of our kernel, we want to eliminate as many accesses to global memory as possible. One thing to note about global memory is that there is no limitation on which threads may access it: all threads of any block can access it, with none of the restrictions that apply to shared memory or registers.
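One simple illustration of cutting redundant global-memory traffic, sketched under the assumption that a thread reuses the same input value several times: read the value into a register (an ordinary local variable) once, then reuse it. The kernel name polyKernel is hypothetical.

__global__ void polyKernel(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* One global-memory read; v then lives in a register. */
        float v = x[i];
        /* The repeated uses of v below cost register accesses, not
           DRAM accesses, raising the kernel's arithmetic intensity. */
        y[i] = 3.0f * v * v + 2.0f * v + 1.0f;
    }
}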
Constant Memory:
The constant memory can be written to and read by the host. It is used for storing data that will not change over the course of kernel execution. It supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the same location. There is a total of 64 KB of constant memory on a CUDA-capable device.
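A minimal sketch of the typical usage pattern (the names coeffs and scaleKernel are illustrative): the host writes the data with cudaMemcpyToSymbol before the launch, and every thread then reads the same constant location.

/* Read-only on the device; counts against the 64 KB constant budget. */
__constant__ float coeffs[16];

__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    /* All threads read coeffs[0] simultaneously, which is the broadcast
       pattern that constant memory serves fastest. */
    if (i < n)
        out[i] = coeffs[0] * in[i];
}

/* Host side, before the launch:
   float h_coeffs[16] = { ... };
   cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs)); */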
Shared Memory:
All threads of a block can access its shared memory, so shared memory can be used for inter-thread communication. Each block has its own shared memory. Like registers, shared memory is on-chip, but the two differ significantly in functionality and access cost.
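As a hedged sketch of inter-thread communication through shared memory: each block stages its slice of the input in a shared array, waits at __syncthreads(), and then every thread reads an element written by a different thread (here, a block-local reversal). BLOCK and reverseKernel are illustrative names.

#define BLOCK 256

__global__ void reverseKernel(const float *in, float *out, int n)
{
    /* One tile per block, visible to all of that block's threads. */
    __shared__ float tile[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];

    /* Wait until every thread in the block has written its element. */
    __syncthreads();

    /* Each thread now safely reads a value another thread wrote. */
    int j = blockDim.x - 1 - threadIdx.x;
    int src = blockIdx.x * blockDim.x + j;
    if (i < n && src < n)
        out[i] = tile[j];
}

This assumes the kernel is launched with BLOCK threads per block, so the shared tile is exactly one block wide.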