This paper is available on arXiv under CC 4.0 license.
Authors:
(1) Andrey Zhmoginov, Google Research & {azhmogin,sandler,mxv}@google.com;
(2) Mark Sandler, Google Research & {azhmogin,sandler,mxv}@google.com;
(3) Max Vladymyrov, Google Research & {azhmogin,sandler,mxv}@google.com.
Table of Links
- Abstract and Introduction
- Problem Setup and Related Work
- HyperTransformer
- Experiments
- Conclusion and References
- A Example of a Self-Attention Mechanism For Supervised Learning
- B Model Parameters
- C Additional Supervised Experiments
- D Dependence On Parameters and Ablation Studies
- E Attention Maps of Learned Transformer Models
- F Visualization of The Generated CNN Weights
- G Additional Tables and Figures
F VISUALIZATION OF THE GENERATED CNN WEIGHTS.
Figures 9 and 10 show examples of the CNN kernels generated by a single-head, 1-layer transformer for a simple 2-layer CNN model with 9 × 9 stride-4 kernels. The two figures correspond to different approaches to re-assembling the weights from the generated slices: “output” allocation and “spatial” allocation (see Section 3.1 in the main text for more information). Notice that “spatial” weight allocation produces more homogeneous kernels for the first layer than “output” allocation does. In both figures we show how the final kernels differ across three variants: a model with both layers generated, a model with one layer generated and one trained, and a model with both layers trained.
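To make the two re-assembly schemes concrete, the following is a minimal sketch and not the authors' implementation. It assumes that “output” allocation means one generated slice of shape (k, k, c_in) per output channel, and that “spatial” allocation means one generated slice of shape (c_in, c_out) per spatial position of the k × k kernel; Section 3.1 remains the authoritative definition. The function names, the (k, k, c_in, c_out) weight layout, the channel counts, and the random placeholder slices are all hypothetical.

```python
import numpy as np

def assemble_output(slices):
    """Assumed "output" allocation: one (k, k, c_in) slice per output channel."""
    return np.stack(slices, axis=-1)  # -> (k, k, c_in, c_out)

def assemble_spatial(slices, k):
    """Assumed "spatial" allocation: one (c_in, c_out) slice per kernel position."""
    stacked = np.stack(slices, axis=0)  # -> (k * k, c_in, c_out)
    return stacked.reshape(k, k, *stacked.shape[1:])  # -> (k, k, c_in, c_out)

# Hypothetical sizes: 9 x 9 kernel as in the text; channel counts are made up.
k, c_in, c_out = 9, 3, 8
rng = np.random.default_rng(0)

# Random placeholders standing in for slices produced by the transformer.
out_slices = [rng.normal(size=(k, k, c_in)) for _ in range(c_out)]
sp_slices = [rng.normal(size=(c_in, c_out)) for _ in range(k * k)]

w_out = assemble_output(out_slices)
w_sp = assemble_spatial(sp_slices, k)
```

Under these assumptions both schemes yield a tensor of the same shape; they differ only in which axis the independently generated slices are laid out along.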
Trained layers are fixed at inference time across all episodes, whereas the generated layers vary from episode to episode, albeit not significantly. Figures 11 and 12 show the generated kernels for two different episodes and, on the right, the difference between them. The generated convolutional kernels appear to change within 10–15% from episode to episode.
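One way such an episode-to-episode change could be quantified is sketched below. The text does not state which metric underlies the 10–15% figure, so the relative Frobenius-norm difference used here is an assumption, as are the function name `relative_change`, the tensor shape, and the synthetic kernels (a random tensor plus 10% noise) that stand in for weights generated in two episodes.

```python
import numpy as np

def relative_change(w_a, w_b):
    """Norm of the difference divided by the mean norm of the two kernels.

    With no `ord` or `axis`, np.linalg.norm flattens the tensor and
    returns its 2-norm (the Frobenius norm over all entries).
    """
    diff = np.linalg.norm(w_a - w_b)
    scale = 0.5 * (np.linalg.norm(w_a) + np.linalg.norm(w_b))
    return diff / scale

# Synthetic placeholders, not real generated weights.
rng = np.random.default_rng(0)
w_episode_1 = rng.normal(size=(9, 9, 3, 8))
w_episode_2 = w_episode_1 + 0.1 * rng.normal(size=w_episode_1.shape)

change = relative_change(w_episode_1, w_episode_2)
print(f"relative change: {change:.1%}")
```

Other choices, such as a per-kernel or per-pixel relative difference, would be equally reasonable and could give somewhat different numbers.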
