Common PyTorch Errors
pytorch
Word count: 588 | Reading time ≈ 2 min

1. Distributed

One pitfall with distributed training is that memory usage should be spread evenly across the GPUs, but sometimes GPU 0 ends up holding noticeably more memory than the others. This pitfall is discussed on Zhihu (link).

Distributed training should allocate memory evenly across the cards (left figure), but sometimes a different pattern appears (right figure).

This is caused by how the checkpoint is loaded. When you load the model with the statement below, torch.load puts the loaded tensors on GPU 0 by default, so all four processes each grab a chunk of memory on GPU 0.

checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["state_dict"])

The fix is to map the loaded data to the CPU:

checkpoint = torch.load("checkpoint.pth", map_location=torch.device('cpu'))
model.load_state_dict(checkpoint["state_dict"])
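
Alternatively, in a multi-GPU setup each process can map the checkpoint straight onto its own device instead of GPU 0. A minimal sketch, assuming the model is already built and the local rank comes from the LOCAL_RANK environment variable set by the launcher (e.g. torchrun):

import os
import torch

# Map the checkpoint onto this process's own GPU rather than GPU 0.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
checkpoint = torch.load("checkpoint.pth", map_location=f"cuda:{local_rank}")
model.load_state_dict(checkpoint["state_dict"])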

2. Computation graph

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

Here are two possible causes, illustrated in the snippet below:

import torch.nn as nn

class Model(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * 4)
        self.fc2 = nn.Linear(dim * 4, dim)
        self.fc3 = nn.Linear(dim, dim)    # error 1: self.fc3 is defined but never used in forward

    def forward(self, x):
        y1 = self.fc1(x)
        y2 = self.fc2(y1)
        return y1, y2

...
optimizer.zero_grad()
y1, y2 = model(input)
loss = lossl1(y1, x)                      # error 2: y2 does not participate in the loss
loss.backward()
optimizer.step()

If the extra variable (y2 here) is only something you observe and is not supposed to affect the result, you can pass find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel and the error goes away; if it should affect the result, then the forward function is written incorrectly and you need to fix it based on the error. A minimal sketch of the first fix is shown below.
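
This sketch shows passing the flag when wrapping the model, assuming the process group has already been initialized and local_rank identifies this process's GPU (both are assumptions, not part of the original snippet):

from torch.nn.parallel import DistributedDataParallel as DDP

# find_unused_parameters=True tells DDP to tolerate parameters (e.g. self.fc3)
# that receive no gradient in a given forward pass.
model = Model(dim).to(local_rank)
model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)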

September 09, 2024