week 24.10 - 30.10
Stanford course (CS231n):
INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC
where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:
INPUT -> FC, implements a linear classifier. Here N = M = K = 0.
INPUT -> CONV -> RELU -> FC
INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here we see that there is a single CONV layer between every POOL layer.
INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC. Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
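As a rough illustration of one of these patterns, here is a minimal sketch of INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. PyTorch, the CIFAR-10-sized 32x32x3 input, and the channel counts are my own assumptions, not something specified in the course notes:

```python
import torch
import torch.nn as nn

# INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC
# Input: 32x32 spatial, 3 channels (CIFAR-10-sized); channel counts are arbitrary choices.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),   # -> 32x32 spatial, depth 32
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # -> 16x16 spatial, depth 32
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),   # -> 16x16 spatial, depth 64
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                    # -> 8x8 spatial, depth 64
    nn.Flatten(),                                              # -> 8*8*64 = 4096 features
    nn.Linear(8 * 8 * 64, 256),
    nn.ReLU(),
    nn.Linear(256, 10),                                        # -> 10 class scores
)

print(model(torch.zeros(1, 3, 32, 32)).shape)  # torch.Size([1, 10])
```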
Stacking CONV layers with tiny filters, as opposed to having one CONV layer with big filters, allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.
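To make the parameter argument concrete: three stacked 3x3 CONV layers see the same 7x7 effective region of the input as a single 7x7 CONV layer, but use 27C^2 instead of 49C^2 weights (biases ignored). A quick check, where the channel count C=64 is just an illustrative assumption:

```python
# Parameter count comparison (biases ignored), assuming every volume has C channels.
C = 64                                   # example channel count (arbitrary choice)
three_3x3 = 3 * (C * 3 * 3 * C)          # 3 * 9*C^2 = 27*C^2 -> 110,592 for C=64
one_7x7   = C * 7 * 7 * C                # 49*C^2             -> 200,704 for C=64
print(three_3x3, one_7x7)
```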
- The input layer (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), 224 (e.g. common ImageNet ConvNets), 384, and 512.
- The conv layers should:
  - use small filters (e.g. 3x3 or at most 5x5)
  - use a stride of S=1
  - pad the input volume with zeros in such a way that the conv layer does not alter the spatial dimensions of the input. That is, when F=3, using P=1 will retain the original size of the input. When F=5, P=2. For a general F, P=(F−1)/2 preserves the input size.
Example: CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and POOL layers that perform 2x2 max pooling with stride 2 (and no padding).
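As a quick sanity check (a small sketch; the helper name is my own), the standard output-size formula (W − F + 2P)/S + 1 confirms that these settings preserve or halve the spatial size:

```python
def conv_out_size(W, F, S, P):
    """Spatial output size of a CONV/POOL layer: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

# 3x3 CONV, stride 1, pad 1 -> spatial size preserved (P = (F - 1) / 2)
print(conv_out_size(224, F=3, S=1, P=1))   # 224
# 2x2 max pool, stride 2, no padding -> spatial size halved
print(conv_out_size(224, F=2, S=2, P=0))   # 112
```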
memory: height * width * depth (size of the activation volume)
weights: (convSize * convSize * prevLayerDepth) * currentLayerDepth
INPUT: [224x224x3] memory: 224*224*3 = 150K weights: 0
CONV3-64: [224x224x64] memory: 224*224*64 = 3.2M weights: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64 = 3.2M weights: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64 = 800K weights: 0
CONV3-128: [112x112x128] memory: 112*112*128 = 1.6M weights: (3*3*64)*128 = 73,728
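A small sketch that just recomputes the memory (activation count) and weight numbers in the rows above, using the formulas from this section:

```python
# Recompute memory (activations per image) and weights for the layers above.
# Format: (name, H, W, D, filter_size, prev_depth); biases ignored.
layers = [
    ("INPUT",     224, 224,   3, None, None),
    ("CONV3-64",  224, 224,  64, 3,    3),
    ("CONV3-64",  224, 224,  64, 3,   64),
    ("POOL2",     112, 112,  64, None, None),
    ("CONV3-128", 112, 112, 128, 3,   64),
]

for name, H, W, D, F, prev_D in layers:
    memory = H * W * D                                   # activations held per image
    weights = 0 if F is None else (F * F * prev_D) * D   # conv filter parameters
    print(f"{name:10s} memory: {memory:>9,}  weights: {weights:>7,}")
```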