CUDA: How should I handle cases where the number of threads cannot be represented as dimGrid*dimBlock?
Suppose my input has seven data points, some calculation is performed on each, and the result is written back to an output array of size 7. If I declare a block dimension of 4, I need a grid size of 2, which means the kernel runs with 8 threads; the last thread gets an invalid thread id (using pt_id = blockIdx.x * blockDim.x + threadIdx.x) and fails with an invalid memory access, since I use the thread id to index into my arrays. I could add code to my kernel that compares the thread id against a max_thread_id parameter and does nothing if thread_id > max_thread_id, but I wonder whether there is a better way to handle input arrays whose size is not a multiple of the block dimension. A rough sketch of that guard approach is shown below.
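For concreteness, here is a minimal sketch of the guard approach described above; the kernel name, the squaring operation, and the parameter names are illustrative assumptions, not taken from the question:

__global__ void square_kernel(const float *in, float *out, int max_thread_id)
{
    int pt_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (pt_id >= max_thread_id)   // threads past the end of the data do nothing
        return;
    out[pt_id] = in[pt_id] * in[pt_id];
}

// Host side: 7 elements, block size 4 -> grid size 2 -> 8 threads launched.
// square_kernel<<<2, 4>>>(d_in, d_out, 7);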
This is a very common situation whenever the input size is not a multiple of the block dimension. The solution I use most often is the following: suppose the size of your input data is N and you want to configure your launch with a block size equal to BLOCK_SIZE. In that case, your launch configuration can look like this:
kernel_function<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(...);
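This rounds the grid size up, so at least N threads are launched. With the numbers from the question, N = 7 and BLOCK_SIZE = 4, the expression gives (7 + 4 - 1) / 4 = 2 blocks, i.e. 8 threads, one more than there are data points.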
Each thread in the kernel code then determines whether it has work to do, something like this:
int id = blockIdx.x * blockDim.x + threadIdx.x;
if (id < N) { /* do the work */ } else { return; }
If the amount of work depends on the input size (N), then you have to pass this value as a parameter to the kernel function as well. In addition, it is quite common to define N and BLOCK_SIZE as a macro or template parameter.
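Putting the pieces together, here is a self-contained sketch under the assumptions above; defining BLOCK_SIZE as a macro, passing N as a kernel parameter, and the element-wise scaling are illustrative choices, not part of the answer:

#define BLOCK_SIZE 256

__global__ void kernel_function(const float *in, float *out, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        out[id] = 2.0f * in[id];   // example work: scale each element
    }
}

// Host side: round the grid size up so every element gets a thread.
// int grid_size = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
// kernel_function<<<grid_size, BLOCK_SIZE>>>(d_in, d_out, N);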
Finally, if the size of your input array is small, as in your example, the GPU will be underutilized and the parallelism will not noticeably improve the performance of your algorithm.