non_blocking in PyTorch

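Consider a question: what does non_blocking do in x = x.cuda(non_blocking=True), and when should it be used?

First, to be clear, this option exists to speed up program execution through a setting in the code.

Second, non_blocking=True is used together with pin_memory=True: the host-to-GPU copy can only be asynchronous when the source tensor lives in pinned (page-locked) host memory.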
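As a minimal sketch of how the two settings fit together outside a DataLoader, the snippet below pins a tensor by hand before requesting an asynchronous copy. The helper name to_device_async is hypothetical, and the code falls back to the CPU when no GPU is available (an assumption made only so the snippet runs anywhere). Note that x.pin_memory() itself makes a copy on the host, so in real training it is usually better to let the DataLoader do the pinning with pin_memory=True.

```python
import torch


def to_device_async(x, device):
    """Hypothetical helper: move a CPU tensor to `device`, asynchronously if possible."""
    if device.type == "cuda":
        # The copy can only be asynchronous from pinned (page-locked) host memory.
        x = x.pin_memory()
        return x.to(device, non_blocking=True)
    # No GPU: nothing to overlap, so this is an ordinary (no-op) move.
    return x.to(device)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1024, 1024)
y = to_device_async(x, device)

# Independent CPU work could run here while the copy is in flight.

# GPU work on y is queued on the same stream, so it runs after the copy.
total = (y * 2).sum()
if device.type == "cuda":
    # Wait for all queued GPU work before reading results or timing on the host.
    torch.cuda.synchronize()
```

Operations queued on the same CUDA stream run in order, which is why the multiplication above sees the fully copied tensor without any explicit wait.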

Suppose our code runs in this order:

# 1. to cuda
x = x.cuda(non_blocking=True)
# 2. Perform some CPU operations
...
# 3. Perform GPU operations using x

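Because the copy launched in step 1 is asynchronous, it does not block step 2; that is, steps 1 and 2 can run at the same time. Step 3 depends on x, which must first be copied to the GPU, so it cannot run until step 1 has finished.

So steps 1 and 2 can overlap, while step 3 can only run afterwards. The duration of step 2 is therefore the most time we can save by setting non_blocking=True. Without non_blocking=True, the main thread waits for the data transfer to finish before it moves on to step 2.

In short, after x = x.cuda(non_blocking=True), if the very next operation depends on the data, there is no speed-up.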
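The reasoning above can be written down as a small timing model. This is a simplified sketch, not a measurement: it assumes the copy and the CPU work overlap perfectly, that step 3 starts only once both have finished, and all names (step_time, time_saved, the millisecond arguments) are hypothetical.

```python
def step_time(copy_ms, cpu_ms, gpu_ms, non_blocking):
    """Modelled wall-clock time of steps 1-3, in milliseconds."""
    if non_blocking:
        # Steps 1 and 2 overlap; step 3 starts once both are done.
        return max(copy_ms, cpu_ms) + gpu_ms
    # Blocking copy: the three steps run strictly one after another.
    return copy_ms + cpu_ms + gpu_ms


def time_saved(copy_ms, cpu_ms, gpu_ms=0):
    """Time saved by non_blocking=True under this model."""
    blocking = step_time(copy_ms, cpu_ms, gpu_ms, non_blocking=False)
    overlapped = step_time(copy_ms, cpu_ms, gpu_ms, non_blocking=True)
    return blocking - overlapped


print(time_saved(copy_ms=20, cpu_ms=100))  # 20: the whole copy is hidden
print(time_saved(copy_ms=20, cpu_ms=5))    # 5: capped by the length of step 2
print(time_saved(copy_ms=20, cpu_ms=0))    # 0: next step needs x immediately
```

Under this model the saving equals min(copy_ms, cpu_ms), so it can never exceed the duration of step 2, and it drops to zero when there is no independent CPU work between the copy and its first use.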

Also note that the DataLoader should use multiprocessing (by setting num_workers).

A simple benchmark:

import torchvision, torch, time
import numpy as np

pin_memory = True
batch_size = 1024 # bigger memory transfers to make their cost more noticeable
n_workers = 6 # parallel workers to free up the main thread and reduce data decoding overhead
train_dataset = torchvision.datasets.CIFAR10(
    root='cifar10_pytorch',
    download=True,
    transform=torchvision.transforms.ToTensor()
)   
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    pin_memory=pin_memory,
    num_workers=n_workers
)   
print('pin_memory:', pin_memory)
times = []
n_runs = 10

def work():
    # emulates the CPU work done
    time.sleep(0.1)

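for i in range(n_runs):
    st = time.time()
    for bx, by in train_dataloader:
        # the copies are asynchronous only when the batches come from pinned memory
        bx, by = bx.cuda(non_blocking=pin_memory), by.cuda(non_blocking=pin_memory)
        work()
    # wait for any copies still in flight so they are included in the measurement
    torch.cuda.synchronize()
    times.append(time.time() - st)
print('average time:', np.mean(times))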

