Just took another look at https://arxiv.org/pdf/1605.07146v1.pdf
To summarize:
• widening consistently improves performance across residual networks of different depth;
• increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed;
Actually, the first conclusion contradicts the second in a way, but that's not the point.
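For context, "widening" in that paper just means multiplying the number of channels in each residual group by a factor k while keeping the depth fixed. A minimal sketch of the channel layout (the function name and parametrization here are my own illustration, not code from the paper):

```python
# Sketch of how a widening factor k scales a WRN-style channel layout.
# The base widths (16, 32, 64) follow the CIFAR ResNet convention used in the paper.
def wrn_channel_plan(depth: int, k: int):
    assert (depth - 4) % 6 == 0, "CIFAR-style WRN depth has the form 6n + 4"
    n = (depth - 4) // 6                    # residual blocks per group
    widths = [16, 16 * k, 32 * k, 64 * k]   # stem + three residual groups
    return n, widths

# e.g. WRN-28-10: 4 blocks per group, channel widths [16, 160, 320, 640]
print(wrn_channel_plan(28, 10))
```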
You may really want to try running tests on MNIST with absolutely no augmentation, since it's a task that is actually easy to overfit on. The lesson I learnt from MNIST was that there is an optimal width & depth, and it is actually rather low for such a task. Also, the standard block/activation scheduling (standard "preact") may not always be optimal, and groups (like in https://arxiv.org/pdf/1605.06489v1.pdf ) are hugely beneficial, at least up to a certain number of them.
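To make the "preact + groups" point concrete, here is a rough PyTorch sketch of a pre-activation residual block whose 3x3 convolutions are grouped. The block layout and the default groups=4 are assumptions for illustration, not the exact architecture from either linked paper:

```python
import torch
import torch.nn as nn

class PreactGroupedBlock(nn.Module):
    """Pre-activation residual block (BN -> ReLU -> conv) with grouped 3x3 convs.
    The choice of groups=4 is purely illustrative; channels must be divisible by it."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1,
                               groups=groups, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1,
                               groups=groups, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out  # identity shortcut

# quick shape check
x = torch.randn(2, 64, 28, 28)
print(PreactGroupedBlock(64)(x).shape)  # torch.Size([2, 64, 28, 28])
```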
I was able to achieve a 0.25% peak error rate pretty easily, and my best architecture reached the same peak while also holding 0.26% error over many epochs, which was rather hard to get here: at this level of precision the across-epoch fluctuations are relatively large. This was without any parameter smoothing, such as a moving average of the weights.
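The parameter smoothing mentioned above would amount to keeping an exponential moving average of the weights and evaluating the averaged copy. A minimal sketch, with the decay value and helper name being my own choices (note it only averages parameters, not BatchNorm running statistics):

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay: float = 0.999):
    """Blend the live weights into the EMA copy after each optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)

# usage sketch:
# ema_model = copy.deepcopy(model)
# for batch in loader:
#     ...train step...
#     update_ema(ema_model, model)
# evaluate ema_model instead of model to damp across-epoch fluctuations
```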