Common PyTorch Errors
pytorch
Word count: 588 | Reading time ≈ 2 min

1. Distributed

One pitfall with distributed training is that memory usage should be spread evenly across the GPUs, but sometimes GPU 0 ends up holding noticeably more memory than the others. This pitfall is discussed on Zhihu (link).

Distributed training should allocate memory evenly across the cards (left figure), but sometimes a different pattern appears (right figure).

This is caused by how the checkpoint is loaded. When you load the model with the statement below, torch.load puts the loaded tensors on GPU 0 by default, so all four processes each grab a chunk of memory on GPU 0.

checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["state_dict"])

The fix is to map the loaded data to the CPU:

checkpoint = torch.load("checkpoint.pth", map_location=torch.device('cpu'))
model.load_state_dict(checkpoint["state_dict"])
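
Alternatively, in a multi-GPU setup each process can map the checkpoint straight onto its own device instead of GPU 0. A minimal sketch, assuming the model is already built and the local rank comes from the LOCAL_RANK environment variable set by the launcher (e.g. torchrun):

import os
import torch

# Map the checkpoint onto this process's own GPU rather than GPU 0.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
checkpoint = torch.load("checkpoint.pth", map_location=f"cuda:{local_rank}")
model.load_state_dict(checkpoint["state_dict"])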

2. Computation graph

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel; (2) making sure all forward function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn’t able to locate the output tensors in the return value of your module’s forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

Here are two possible causes, illustrated in the snippet below:

import torch.nn as nn

class Model(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim * 4)
        self.fc2 = nn.Linear(dim * 4, dim)
        self.fc3 = nn.Linear(dim, dim)    # error 1: self.fc3 is defined but never used in forward

    def forward(self, x):
        y1 = self.fc1(x)
        y2 = self.fc2(y1)
        return y1, y2

...
optimizer.zero_grad()
y1, y2 = model(input)
loss = lossl1(y1, x)                      # error 2: y2 does not participate in the loss
loss.backward()
optimizer.step()

If the extra variable (y2 here) is only something you observe and is not supposed to affect the result, you can pass find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel and the error goes away; if it should affect the result, then the forward function is written incorrectly and you need to fix it based on the error. A minimal sketch of the first fix is shown below.
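
This sketch shows passing the flag when wrapping the model, assuming the process group has already been initialized and local_rank identifies this process's GPU (both are assumptions, not part of the original snippet):

from torch.nn.parallel import DistributedDataParallel as DDP

# find_unused_parameters=True tells DDP to tolerate parameters (e.g. self.fc3)
# that receive no gradient in a given forward pass.
model = Model(dim).to(local_rank)
model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)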

September 09, 2024