[pytorch] An error and the torch.autograd mechanism behind it

TL;DR

PyTorch ships with a built-in engine called autograd, which implicitly performs all gradient computation.

When backward() is called at the root of the computation graph, autograd walks the DAG (directed acyclic graph), applies the chain rule, computes the gradient of every leaf node, and accumulates it into that leaf's attribute named grad.
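A minimal sketch of this (the values are illustrative, unrelated to the code later in this post): two leaf tensors, one backward() from the root, and a second forward + backward to show that gradients accumulate into .grad rather than being overwritten.

import torch

a = torch.tensor([2., 3.], requires_grad=True)   # leaf tensors
b = torch.tensor([6., 4.], requires_grad=True)
Q = 3 * a**3 - b**2                               # root of the DAG

# Q is not a scalar, so pass the gradient of Q w.r.t. itself explicitly
Q.backward(gradient=torch.ones_like(Q))
print(a.grad)   # dQ/da = 9*a**2  -> tensor([36., 81.])
print(b.grad)   # dQ/db = -2*b    -> tensor([-12., -8.])

# .grad accumulates: a fresh forward pass builds a new DAG, and its backward
# adds to the values already stored on the leaves
Q = 3 * a**3 - b**2
Q.backward(gradient=torch.ones_like(Q))
print(a.grad)   # tensor([72., 162.])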

DAGs are dynamic in PyTorch. An important thing to note is that the graph is recreated from scratch; after each .backward() call, autograd starts populating a new graph.

To save memory, once backward() has been run from the root of a DAG, the current DAG is freed, and autograd rebuilds a new one from scratch on the next forward pass.

If you try to backward() from the root of the same DAG twice, PyTorch will remind you to add retain_graph=True.
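That reminder is easy to reproduce on a toy graph (a sketch, not the code from this post): the first backward() frees the saved intermediates, and the second one fails unless the graph was retained.

import torch

x = torch.ones(3, requires_grad=True)
y = (x * x).sum()               # MulBackward0 saves x as an intermediate

y.backward()                    # first backward: fine; saved intermediates are freed afterwards
try:
    y.backward()                # second backward over the same, already freed graph
except RuntimeError as e:
    print(e)                    # "Trying to backward through the graph a second time ..."

z = (x * x).sum()
z.backward(retain_graph=True)   # keep the graph (and its buffers) alive ...
z.backward()                    # ... so a second backward is allowed, at a memory cost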

Intro

I inherited some PyTorch code, and one snippet used Variable(), which I had never seen before. A quick search showed it is an old construct deprecated back in PyTorch 0.4: it declared a variable whose gradient should be computed, and that functionality has long since been merged into Tensor. I don't have much of a code-cleanliness obsession, but this kind of relic in inherited code still bothered me, so I swapped it out in passing. Here is the before/after comparison:

# before the change
style = Variable(style.repeat(BATCH_SIZE, 1, 1, 1)).cuda()
vgg = utils_patch.Vgg16().cuda()
style_features = vgg(style)
style_gram = [utils_patch.gram(fmap) for fmap in style_features]
image_transformer = UNet(3, 3, bilinear=True).to(device)
perturbation = Variable(torch.rand(1, 128, 16, 16)).to(device)
params = list(image_transformer.parameters()) + list(perturbation)
optimizer = torch.optim.Adam(params, 1e-3)
for p in params:
    p.requires_grad_(True)

# 更换后
style = style.repeat(BATCH_SIZE, 1, 1, 1).to(device).require_grad_(True) # it's here !!!!
vgg = utils_patch.Vgg16().buffer()
style_features = vgg(style)
style_gram = [utils_patch.gram(fmap) for fmap in style_features]
image_transformer = UNet(3, 3, bilinear=True).to(device)
    perturbation = torch.rand(1, 128, 16, 16).to(device)
    params = list(image_transformer.parameters()) + list(perturbation)
    optimizer = torch.optim.Adam(params, 1e-3)
    for p in params:
        p.requires_grad_(True)

This produced an error:

trying to backward through the graph a second time

Cue the cold sweat. Time to fire up Stack Overflow, where I found this explanation:
To reduce memory usage, during the .backward() call, all the intermediary results are deleted when they are not needed anymore. Hence if you try to call .backward() again, the intermediary results don’t exist and the backward pass cannot be performed (and you get the error you see).
The answer's author suggests passing retain_graph=True to backward(), which keeps the DAG after every backward pass. However, that noticeably increases GPU memory usage during training, more than my 3090 can handle.
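The "intermediary results are deleted" part can even be observed directly, assuming a recent PyTorch build that exposes the internal _saved_* attributes on grad_fn nodes; a small sketch:

import torch

x = torch.rand(3, requires_grad=True)
y = x * x
print(y.grad_fn._saved_self)        # the intermediate (x) is still held by MulBackward0

y.sum().backward()                  # retain_graph defaults to False -> saved tensors are freed
try:
    print(y.grad_fn._saved_self)    # accessing a freed saved tensor
except RuntimeError as e:
    print(e)                        # "... saved tensors after they have already been freed"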

Reluctantly, I opened pytorch.org and went through the getting-started material again to understand how the DAG works. After a round of deduction, I found the culprit is style: this variable is not optimized and its gradient is never needed, but when I replaced the Variable with a plain Tensor I also casually added requires_grad_. Because style now requires grad, style_features and style_gram, being operations on style, are captured by autograd and added to the computation graph, even though they are built only once, before the training loop. After the first backward(), the intermediate results of those operations are discarded along with the rest of the DAG, yet since style is a leaf tensor that requires grad, every later backward() still has to walk through that already-freed subgraph, so PyTorch concludes the user is backwarding through the same DAG a second time and raises the error.

The above is just me guessing, but I think it makes some sense, and at any rate I convinced myself (I take no responsibility for the correctness of any of this 😋)
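If that guess is right, the cure is not retain_graph=True but simply keeping style (and everything derived from it) out of the graph. Below is a self-contained sketch of both the failure pattern and the fix; the names are hypothetical stand-ins for style/style_gram, not the original training script.

import torch

w = torch.zeros(3, requires_grad=True)        # the tensor actually being optimized
opt = torch.optim.Adam([w], 1e-3)

# Failure pattern: a "constant" target that accidentally requires grad.
# Its subgraph is built once, before the loop, but every iteration's loss
# graph connects to it, so the second backward walks through freed buffers.
style = torch.rand(3).requires_grad_(True)
target = style * style                        # stands in for style_gram

for step in range(2):
    opt.zero_grad()
    loss = ((w - target) ** 2).sum()
    try:
        loss.backward()                       # step 0 works, step 1 raises the error above
    except RuntimeError as e:
        print(f"step {step}: {e}")
        break
    opt.step()

# Fix: build the constant target outside the graph.
with torch.no_grad():                         # or: target = (style * style).detach()
    style = torch.rand(3)
    target = style * style

for step in range(2):
    opt.zero_grad()
    loss = ((w - target) ** 2).sum()
    loss.backward()                           # fine on every iteration
    opt.step()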
