527 pages, Hardcover
Published December 5, 2023
no-one really understands deep learning at the time of writing... Modern deep networks learn piecewise linear functions with more regions than there are atoms in the universe and can be trained with fewer data examples than model parameters. It is neither obvious that we should be able to fit these functions reliably nor that they should generalize well to new data... It is probably hard to imagine equations with these properties, and the reader should endeavor to suspend disbelief for now.
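To make the "piecewise linear" claim concrete, here is a minimal NumPy sketch (my own illustration, not code from the book): a one-hidden-layer ReLU network in one dimension is a piecewise linear function, and its linear regions can be counted by tracking where the pattern of active ReLUs changes. Stacking layers multiplies these regions, which is where the "more regions than atoms in the universe" figure for modern networks comes from.

```python
# Minimal sketch: count the linear regions of a tiny 1D ReLU network by
# detecting changes in the ReLU on/off pattern along a dense input grid.
import numpy as np

rng = np.random.default_rng(0)
D_hidden = 50                             # hidden units (illustrative choice)
W1 = rng.normal(size=(D_hidden, 1))       # first-layer weights
b1 = rng.normal(size=D_hidden)            # first-layer biases
w2 = rng.normal(size=D_hidden)            # second-layer weights

x = np.linspace(-3, 3, 100_000)[:, None]  # dense 1D input grid
pre = x @ W1.T + b1                       # pre-activations, shape (N, D_hidden)
patterns = pre > 0                        # on/off pattern of each ReLU per input
y = np.maximum(pre, 0) @ w2               # network output: piecewise linear in x

# Each change in the on/off pattern marks a boundary between linear regions.
changes = np.any(patterns[1:] != patterns[:-1], axis=1).sum()
print(f"linear regions found on the grid: {changes + 1}")
```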
These theoretical results are intriguing but usually make unrealistic assumptions about the network structure... Overparameterization seems to be important, but theory cannot yet explain empirical fitting performance.
However, sharpness is not a good criterion to predict generalization between datasets; when the labels in the CIFAR dataset are randomized (making generalization impossible), there is no commensurate decrease in the flatness of the minimum.
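For intuition, a common and much-simplified proxy for sharpness (my own sketch, not the exact measure used in the studies the book cites) is the average rise in training loss under small random weight perturbations of fixed norm: flat minima barely move, sharp minima rise steeply.

```python
# Hedged sketch of a sharpness proxy: average loss increase over random
# perturbations of the weights with a fixed norm.
import numpy as np

def sharpness_proxy(loss_fn, weights, radius=0.05, n_samples=20, seed=0):
    """Average increase in loss_fn over random perturbations of norm `radius`."""
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    increases = []
    for _ in range(n_samples):
        delta = rng.normal(size=weights.shape)
        delta *= radius / np.linalg.norm(delta)   # rescale to fixed norm
        increases.append(loss_fn(weights + delta) - base)
    return float(np.mean(increases))

# Toy usage with quadratic "loss surfaces"; a real study would evaluate the
# training loss of a fitted network, with true vs. randomized labels.
sharp = sharpness_proxy(lambda w: np.sum(4.0 * w**2), np.zeros(10))
flat = sharpness_proxy(lambda w: np.sum(0.1 * w**2), np.zeros(10))
print(f"sharp quadratic: {sharp:.4f}, flat quadratic: {flat:.4f}")
```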
Current evidence suggests that overparameterization is needed for generalization — at least for the size and complexity of datasets that are currently used. There are no demonstrations of state-of-the-art performance on complex datasets where there are significantly fewer parameters than training examples. Attempts to reduce model size by pruning or distilling trained networks have not changed this picture.
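As a back-of-the-envelope illustration of that gap (approximate, publicly reported figures; my own numbers, not the book's), parameters routinely outnumber training examples by a wide margin:

```python
# Rough parameter-to-example ratios for some familiar settings (approximate).
settings = {
    # name: (approx. parameters, approx. training examples)
    "ResNet-50 on ImageNet-1k": (25.6e6, 1.28e6),
    "ViT-L/16 on ImageNet-1k":  (307e6, 1.28e6),
    "small MLP on MNIST":       (1.0e6, 60e3),   # hypothetical small model
}
for name, (params, examples) in settings.items():
    print(f"{name}: ~{params / examples:.0f} parameters per training example")
```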
Moreover, recent theory shows that there is a trade-off between the model’s Lipschitz constant and overparameterization; Bubeck & Sellke (2021) proved that in D dimensions, smooth interpolation requires D times more parameters than mere interpolation. They argue that current models for large datasets (e.g., ImageNet) aren’t overparameterized enough; increasing model capacity further may be key to improving performance...
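A rough paraphrase of that result (my summary of Bubeck & Sellke's "law of robustness", not an equation quoted from the book): to fit $n$ training points in $D$ dimensions with a model of $p$ parameters, any interpolating function $f$ must have Lipschitz constant roughly

$$\operatorname{Lip}(f) \;\gtrsim\; \sqrt{\frac{nD}{p}},$$

so driving the Lipschitz constant down to a constant (smooth interpolation) needs $p$ on the order of $nD$, about $D$ times the $p \approx n$ required for bare interpolation.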
there have been efforts to use shallower networks. Zagoruyko & Komodakis (2016) constructed shallower but wider residual neural networks and achieved similar performance to ResNet. More recently, Goyal et al. (2021) constructed a network that used parallel convolutional channels and achieved performance similar to deeper networks with only 12 layers... Nonetheless, the balance of evidence suggests that depth is critical; even the shallowest networks with good image classification performance require >10 layers. However, there is no definitive explanation for why. Three possible explanations are that (i) deep networks can represent more complex functions than shallow ones, (ii) deep networks are easier to train, and (iii) deep networks impose better inductive biases.
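To show the parallel-channel idea schematically (my own hedged PyTorch sketch, not the actual architecture from Goyal et al., 2021): several convolutional branches run side by side and are fused by summation, so width and parallelism substitute for depth.

```python
# Schematic parallel-branch block: branches are fused by summation, so the
# total depth of the network does not grow with the number of branches.
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Two parallel paths over the same input: a 3x3 conv and a 1x1 conv.
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv3(x) + self.conv1(x)))

x = torch.randn(1, 64, 32, 32)
print(ParallelBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```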
We do not currently have any prescriptive theory that will allow us to predict the circumstances in which training and generalization will succeed or fail. We do not know the limits of learning in deep networks or whether much more efficient models are possible. We do not know if there are parameters that would generalize better within the same model.