探究实验

0

batch_size是不是越大越好？

Answer: 不是使用不同的batch_size，在测试集上的准确率分别为： | batch_size | accuracy | | :--------: | :------: | | 16 | 65% | | 32 | 63% | | 64 | 56% | | 128 | 47% |

这说明了batch_size并不是越大越好。

可能的原因是： - 大的batch会导致模型更容易收敛到局部最优解，而不是全局最优解；小的batch实际起了退火的作用 - 大的batch会导致模型更新次数减少，收敛速度变慢

另外，大的batch也可能导致显存容量不足，无法训练

1

在训练集那里的transform试一下RandomHorizontalFlip，效果会更好吗？

Answer: 我们对数据集做了RandomHorizontalFlip的Augmentation，改动后在测试集上的准确率为56%，与原来相比没有明显变化其Training Loss在最后一个epoch平均为1.226，相较原来的版本中的1.209高了一点，这是在预期之中的，因为增加Augmentation之后过拟合的程度降低了，Training Loss会提高实际上，由于这个Augmentation是比较弱的，所以效果并不明显

2

换一个optimizer, 使效果更好一些

Answer: 我们可以使用Adam来大幅优化训练效率

self.optimizer = torch.optim.Adam(self.model.parameters(), lr=Config.LEARNING_RATE)

用其训练后，收敛速度大幅提高，且最终精度也大幅提高：

对比它与原来的SGD在每个epoch的表现如下： | epoch | SGD | Adam | | :---: | :---: | :---: | | 1 | 16% | 48% | | 2 | 29% | 54% | | 3 | 36% | 57% | | 4 | 40% | 60% | | 5 | 43% | 62% | | 6 | 46% | 62% | | 7 | 48% | 63% | | 8 | 50% | 64% | | 9 | 51% | 63% | | 10 | 53% | 65% | | 11 | 55% | 64% | | 12 | 56% | 65% |

这样的结果在预期之中，毕竟Adam已经成为广大深度学习任务的标配优化器了，它在二阶上对动量进行修正，使得在loss的landscape较为极端的情况下，仍能保持较好的收敛性。

3

保持epoch数不变，加一个scheduler，是否能让效果更好一些

Answer: 我们使用最简单的StepLR来进行学习率的调整，我们设置的参数令其每过5个epoch学习率减半，初始学习率设为0.002，分别对SGD与Adam测试如下： | Group | SGD-Accuracy | SGD-Loss | Adam-Accuracy | Adam-Loss | | :----------: | :----------: | :------: | :-----------: | :-------: | | No-Scheduler | 56% | 1.211 | 65% | 0.779 | | Step-LR | 60% | 1.068 | 65% | 0.668 |

可以看到，使用StepLR后，SGD的效果有了一定的提升，而Adam的效果没有明显变化，但两个优化器训出来的Loss都有了明显的下降，这说明了学习率的调整是有效的，的确和调小学习率能促进收敛的预期一致，而Adam的效果没有明显变化可能是因为过拟合了。

4

根据Net() 生成 Net1(), 加入三个batch_normalization层，显示测试结果

Answer: 在两个卷积层和第一个线性层后加入BN层后，在测试集上的准确率为65%，较原来提升了9%，可见BatchNorm对效果的提升极为显著。 BatchNorm能提升效果主要是因为： - 可以保证各层数据特征分布的稳定性，使得模型更容易收敛，不容易梯度消失或爆炸 - 可以保证隐含层输出集中在一般激活函数的主要非线性区

具体代码如下：

class BnModel(nn.Module): # Add Batch Normalization (Net1)
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.bn1 = nn.BatchNorm2d(6)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.bn1(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.bn2(x)
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

5

根据Net() 生成Net2(), 使用Kaiming初始化卷积与全连接层，显示测试结果

Answer: 加入Kaiming初始化后，在测试集上的准确率为57%，较原来提升了1%。

实际上，nn.Linear及nn.Conv2d的父类nn._ConvNd的初始化默认就使用了Kaiming初始化，它们初始化时都调用了以下函数：

def reset_parameters(self) -> None:
    # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
    # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
    # https://github.com/pytorch/pytorch/issues/57109
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
        init.uniform_(self.bias, -bound, bound)

此处的提升可能是因为Kaiming初始化的参数不一样。

该网络具体代码如下：

class KaimingInitModel(nn.Module): # Using Kaiming Initialization (Net2)
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

        torch.nn.init.kaiming_uniform_(self.conv1.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.conv2.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc1.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc2.weight, a=0, mode='fan_in', nonlinearity='relu')
        torch.nn.init.kaiming_uniform_(self.fc3.weight, a=0, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

6

根据Net()生成Net3(),将Net()中的通道数加到原来的2倍，显示测试结果

Answer: 通道数翻倍后，在测试集上的准确率为60%，较原来提升了4%，可见提升通道数有一定效果。这主要是因为通道数增加后，卷积核的种类增加了，能提取的特征种类也增加了。

具体代码如下：

class DoubleChannelModel(nn.Module): # Doubling the Channel (Net3)
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 12, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(12, 32, 5)

        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

7

在不改变Net()的基础结构（卷积层数、全连接层数不变）和训练epoch数的前提下，你能得到最好的结果是多少？

Answer: 我们进行了如下改进： - 使用Adam优化器，增加了0.0001的WEIGHT_DECAY，使用CosineAnnealingLR余弦退火学习率调整器，初始学习率设为0.002 - 同上述BN网络加入三个BatchNorm层，并加入了三个p=0.1的dropout层，这三个层正好加在BatchNorm层之后 - 前两层卷积层通道数分别改为256与256 - 采用了大量数据增强，具体如下（其中AutoAugment()来自论文'AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty - https://arxiv.org/abs/1912.02781）：

def get_transform(self):
    res = []
    res.append(transforms.RandomHorizontalFlip(p=0.5))
    res.extend([transforms.Pad(2, padding_mode='constant'),
                    transforms.RandomCrop([32,32])])
    res.append(transforms.RandomApply([AutoAugment()], p=0.6))
    res.append(transforms.ToTensor())
    res += [transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
    return transforms.Compose(res)

最终在测试集上的准确率为83%，较原来提升了27%。

模型具体代码如下：

class BetterBaselineModel(nn.Module): # Better Baseline Model (Net4)
    def __init__(self):
        super().__init__()
        channel1 = 256
        channel2 = 256
        self.conv1 = nn.Conv2d(3, channel1, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.bn1 = nn.BatchNorm2d(channel1)
        self.dropout1 = nn.Dropout(0.1)
        self.conv2 = nn.Conv2d(channel1, channel2, 5)
        self.bn2 = nn.BatchNorm2d(channel2)
        self.dropout2 = nn.Dropout(0.1)

        self.fc1 = nn.Linear(channel2 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)
        self.dropout3 = nn.Dropout(0.1)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.bn1(x)
        x = self.dropout1(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.bn2(x)
        x = self.dropout2(x)
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.bn3(x)
        x = self.dropout3(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

8

使用ResNet18(),显示测试结果

Answer: 使用resnet-18，上一问其它参数不变，在测试集上的准确率为83%，与上一问的结果相同。其在较少参数下能达到这样的效果，说明了其优秀的性能，这可能是因为： - 其残差连接能够有效地缓解梯度消失问题，方便层数堆叠 - 让网络学习残差比学习原始特征更容易，免去了对额外的恒等映射的行为的学习

具体代码如下：

class ResNet(nn.Module): # Resnet18
    def __init__(self):
        super().__init__()
        self.model = torchvision.models.resnet18(pretrained=False)
        if Config.PRETRAINED:
            self.model.load_state_dict(torch.load(Config.RESNET_PRETRAINED_PATH))
            for param in self.model.parameters():
                param.requires_grad = False
        self.model.fc = nn.Linear(512, 10)
        self.model.fc.requires_grad_(True)

    def forward(self, x):
        x = self.model(x)
        return x