-
I have the same question. In transfer learning we are supposed to pass images at the size the model was trained on, but even after passing a different image size I am able to get good results. Then what is the advantage of training a model at a specific size?
-
@ghnreigns With the exception of the vision transformer, all models here are fully convolutional except for the final fully connected classifier layer. They also all have an adaptive average pooling layer (as you noticed). The adaptive pooling layer pools the final feature map, regardless of its size, down to 1x1. The combination of fully convolutional layers + an adaptive global pool before the classifier gives the classifier layer a consistently sized input.

The size of the final feature map is related to the size of the input image. Each network has a certain number of strided layers (strided convs or pooling layers) that determine the reduction in feature map size from the original input image size. All models here (with the exception of a few of the SelecSLS variants, which are stride 64) are stride 32, meaning there is approximately a 1/32 reduction (5 strided layers across the network) from the input size (there is some variation in the specifics due to the different padding used). The final feature map will be a tensor of size (batch_size, num_features, FH, FW) where FH and FW are the feature map height and width; FH and FW will be <= 1/32 * input H, W. After the average pool, the resulting tensor will be (batch_size, num_features, 1, 1), then flattened if the FC is a Linear layer, otherwise left as 1x1 if it's a 1x1 Conv layer. The final pooled features are an average across the spatial dims (FH, FW) of the final feature map for each feature channel.

There is also another hiccup you may come across in TensorFlow or Keras with respect to 'SAME' padding. The implementation of TF SAME padding requires knowledge of the input shape at each layer where padding is applied. Depending on how that is done, pre-calculated for a specific size or computed on the fly for each different input size encountered (with some caching), some knowledge of the spatial dim size is needed for each layer. By default, PyTorch padding is a simple symmetric application of padding that does not depend on the input size; one usually pre-calculates the padding that approximately keeps the input size == output size (with application of stride) based on the kernel size, stride, and dilation values. I've hacked around that to support TF-style SAME padding dynamically, with a small perf hit. I only use that for compatibility with TensorFlow/Keras trained weights, as I don't see it as being desirable otherwise.

You can definitely increase or decrease the resolution of the input and still leverage pretrained weights, but the further you get from the training resolution, the less 'useful' those pretrained weights will be. Maybe only some of the layers near the stem of the model will be useful, and the rest may (or may not) be better than random init. The amount of augmentation or the size (assuming more variation) of the original dataset for the original pretrained weights also impacts how good they are when you use the model directly at a different size or for fine-tuning at different sizes.
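Here is a minimal sketch of that fully-conv + adaptive-pool pattern (a toy three-layer network, not one of the actual timm architectures; the channel counts and class count are arbitrary). The feature map entering the pool changes with the input resolution, but the classifier always sees the same number of features:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical toy model for illustration only
    def __init__(self, num_classes=10):
        super().__init__()
        # Fully convolutional "body": three stride-2 convs -> overall stride 8.
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.global_pool = nn.AdaptiveAvgPool2d(1)  # (B, C, FH, FW) -> (B, C, 1, 1)
        self.fc = nn.Linear(128, num_classes)       # in_features never depends on FH, FW

    def forward(self, x):
        x = self.body(x)             # spatial size here depends on the input size
        x = self.global_pool(x)      # always (B, 128, 1, 1)
        x = torch.flatten(x, 1)      # (B, 128)
        return self.fc(x)

model = TinyNet()
for size in (224, 260, 512):
    x = torch.randn(2, 3, size, size)
    feats = model.body(x)
    print(size, tuple(feats.shape))  # feature map spatial dims scale with input size
    print(size, model(x).shape)      # classifier output is torch.Size([2, 10]) every time
```

The same pattern is why the models here keep working at different resolutions: the pooled vector feeding the classifier always has num_features channels, regardless of FH and FW.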
-
I am a beginner in deep learning with some foundation in mathematics, so anything that does not make sense to me requires my immediate attention to try and resolve. Here is one such thing:
I started off with Keras, so I vividly remember the documentation once said that if you are using a certain VGG, you need the `input_shape` to be exactly the same as the resolution that VGG was trained on previously (224x224x3). I later read the source code and understood that this may not be the case if you set `include_top=False`.

Now I am a PyTorch lover, and especially a lover of Roff's geffnet/timm/effdet modules. Since his packages are so popular on Kaggle, we have a lot of notebooks (including mine 👍) that use his packages when we do transfer learning. What I do not understand is why there are seemingly no repercussions when we resize the images to sizes such as 128, 256, 384, 512, 784, etc. These images are NOT at the native input shape of the pretrained model.
Why do images of varying sizes work well even though the model was not trained at those resolutions? For example, I used EfficientNet-B2 (native resolution 260x260) to train on images of size 512x512 and had no problem getting to the top of the leaderboard. This is just one example.
So you may ask, what is your confusion? Well, according to my understanding, all convolutional layers are size invariant, so the CNN layers of the pretrained model do not pose any problem for our input size. But what about the dense layers? Their input shape is not size invariant.
I do not know how the code handles this such that it can accept an arbitrary input size. There must be some magic inside the source code, or it could be something simple that my incompetent brain cannot handle. I did look through the source code, and I suspect that `self.global_pool = nn.AdaptiveAvgPool2d(1)` may have aided this, but I am not really sure how. If anyone can help me with this, it will be a great favor, and I can continue documenting more knowledge and giving back to society in the form of articles on Kaggle.
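For what it's worth, here is a small sketch of the contrast in question (toy layers, not the actual Keras or timm implementations; the channel and class counts are made up). A Flatten + Linear head bakes the spatial size into the Linear layer's `in_features`, which is why a fixed input resolution is needed in that case, while `AdaptiveAvgPool2d(1)` removes the spatial dependence entirely:

```python
import torch
import torch.nn as nn

feats = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)  # stand-in "backbone"

# Head A: flatten + Linear. The in_features (112*112*64) is baked in for a
# 224x224 input, so any other resolution breaks the matrix multiply.
head_fixed = nn.Sequential(nn.Flatten(), nn.Linear(112 * 112 * 64, 10))

# Head B: adaptive average pool + Linear. The pool squeezes any (FH, FW)
# down to 1x1, so in_features is just the channel count (64).
head_adaptive = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

for size in (224, 512):
    x = feats(torch.randn(1, 3, size, size))
    print(size, head_adaptive(x).shape)        # (1, 10) for both sizes
    try:
        print(size, head_fixed(x).shape)       # only works at 224
    except RuntimeError as e:
        print(size, "fixed head failed:", type(e).__name__)
```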