added launch bounds attributes for all CUDA kernels (cherry picked from commit d2251687)