Meta-learn warp layers that modify the loss surface to be smoother
Novelty: the non-linear warp layers do not depend on the optimization trajectory or the initialization, only on the task-parameter updates at the current point of the search space
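A minimal sketch of the idea under simplifying assumptions (not the paper's exact method; the layer names, toy task, and update schedule are illustrative): task layers are interleaved with warp layers, the inner loop updates only the task parameters, and the warp parameters receive meta-gradients collected at each point of the inner trajectory, so no backpropagation through the whole trajectory is needed.

```python
import torch
import torch.nn as nn

class WarpedMLP(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.task1 = nn.Linear(dim, dim)
        self.warp1 = nn.Linear(dim, dim)   # non-linear warp layer (meta-learned)
        self.task2 = nn.Linear(dim, 1)

    def forward(self, x):
        h = torch.relu(self.task1(x))
        h = torch.tanh(self.warp1(h))      # warps the representation, and thus the loss surface seen by the task layers
        return self.task2(h)

def params_with_prefix(model, prefix):
    return [p for n, p in model.named_parameters() if n.startswith(prefix)]

model = WarpedMLP()
inner_opt = torch.optim.SGD(params_with_prefix(model, "task"), lr=1e-2)
meta_opt = torch.optim.Adam(params_with_prefix(model, "warp"), lr=1e-3)

def sample_task(dim=16, n=64):
    # toy regression task standing in for a sampled meta-training task
    w = torch.randn(dim, 1)
    x = torch.randn(n, dim)
    return x, x @ w

for meta_step in range(100):
    x, y = sample_task()
    meta_opt.zero_grad()
    for inner_step in range(5):
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()        # fills .grad for both task and warp parameters
        inner_opt.step()       # inner update changes only the task parameters
        inner_opt.zero_grad()  # warp gradients keep accumulating across inner steps
    meta_opt.step()            # outer update of the warp layers from the accumulated meta-gradients
```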
Meta-Learning
Learning some parameters of the optimization procedure, such as the initialization
Optimize the final model accuracy (in practice, a differentiable final loss) as a function of the initialization
When the inner training uses gradient descent, this objective is differentiable with respect to the initialization
Backpropagate through all training steps back into the initialization
Vanishing / exploding gradients when the loss surface is very flat or steep
Costly—usually scales to a handful of training steps only
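A minimal sketch of backpropagating through the inner training steps into the initialization (MAML-style, on a toy least-squares task; all names are illustrative). The create_graph=True flag is what makes the chain of inner updates differentiable, and is also why the cost grows with the number of inner steps.

```python
import torch

def inner_loop(init_w, x, y, steps=3, lr=0.1):
    # run a few differentiable SGD steps starting from the initialization
    w = init_w
    for _ in range(steps):
        loss = ((x @ w - y) ** 2).mean()
        (grad,) = torch.autograd.grad(loss, w, create_graph=True)  # keep the graph
        w = w - lr * grad                    # out-of-place update keeps w differentiable
    return ((x @ w - y) ** 2).mean()         # final loss as a function of init_w

init_w = torch.zeros(8, 1, requires_grad=True)
meta_opt = torch.optim.Adam([init_w], lr=1e-2)

for meta_step in range(200):
    x = torch.randn(32, 8)
    y = x @ torch.randn(8, 1)                # toy task
    meta_loss = inner_loop(init_w, x, y)
    meta_opt.zero_grad()
    meta_loss.backward()                     # gradient flows back through all inner steps
    meta_opt.step()
```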
Policy gradient methods: optimize policy parameters to maximize the expected reward
Variance reduction using a baseline: separates the quality of the action from the quality of the state
A canonical choice of baseline function is the value function
The surrogate objective is a simplification of reward maximization used by modern policy gradient algorithms (the new policy's probability is divided by the old policy's, giving an importance ratio; Schulman et al., 2015)
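A minimal sketch of the (unclipped) surrogate loss under these assumptions: the ratio is computed from log-probabilities, and the advantage estimate already has the value-function baseline subtracted.

```python
import torch

def surrogate_loss(logp_new, logp_old, advantage):
    # importance ratio pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(logp_new - logp_old.detach())
    # negated so that minimizing this loss maximizes the surrogate objective
    return -(ratio * advantage.detach()).mean()
```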
Measure of gradient variance: mean pairwise correlation (similarity) between gradient samples
Visualization of optimization landscapes with different numbers of samples per estimate
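A sketch of one way to compute the measure above, reading the "correlation" as cosine similarity between independently drawn gradient estimates (an assumption; the exact measure may differ).

```python
import torch

def mean_pairwise_cosine(grads):
    # grads: list of flattened gradient samples (1-D tensors of equal length)
    g = torch.stack([v / v.norm() for v in grads])   # unit-normalize each sample
    sim = g @ g.t()                                  # all pairwise cosine similarities
    n = sim.size(0)
    off_diag = sim.sum() - sim.diagonal().sum()      # drop the self-similarities
    return off_diag / (n * (n - 1))                  # mean over ordered pairs
```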
Finds a representation such that $p(h)$ factorizes as $\prod p(h_d)$, where $h_d$ are independent latent variables (arguably a “good” representation)
Prior distribution $p(h_d)$ is Gaussian or logistic
Training: maximize the likelihood of the data using a “change of variables” formula
$f(x)$ has to be invertible; this is achieved by splitting $x$ into $x_1$ and $x_2$ and mapping them to $x_1$ and $x_2 + m(x_1)$, respectively (an additive coupling layer)
Sampling images: sample from $p(h)$ and use the inverse of $f(x)$
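A minimal sketch of such an additive coupling layer (names are illustrative). The transform is invertible by construction and its Jacobian has unit determinant, so in the change-of-variables formula this layer contributes no log-det term; stacking layers that alternate which half passes through unchanged lets every dimension be transformed.

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        half = dim // 2
        # m can be an arbitrary network; it never needs to be inverted
        self.m = nn.Sequential(nn.Linear(half, hidden), nn.ReLU(), nn.Linear(hidden, half))

    def forward(self, x):                      # x -> h
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x1, x2 + self.m(x1)], dim=-1)

    def inverse(self, h):                      # h -> x, exact inversion
        h1, h2 = h.chunk(2, dim=-1)
        return torch.cat([h1, h2 - self.m(h1)], dim=-1)

# Training maximizes log p(x) = log p(h) + log|det df/dx|, and the second term is 0 here.
# Sampling draws h ~ p(h) and applies inverse() through the stack of coupling layers.
```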
Fixed-point representation for the weights and activations, with a different bitwidth for each layer
The quantizers of the DNN's weights and activations are learned using gradient-based methods
Which parameters (bitwidth, step size, minimum value, maximum value) should be used to parameterize uniform and power-of-two quantizers?
The gradients with respect to the quantizer parameters are bounded and decoupled when the quantizer is parameterized by step size and maximum value, or by minimum value and maximum value
How to learn the parameters?
A penalty term is added to the loss to enforce size constraints for the weights and activations
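A minimal sketch of a uniform quantizer parameterized by step size and maximum value, using a straight-through estimator for the rounding; the parameter names and the size-penalty comment are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def round_ste(x):
    # straight-through estimator: round in the forward pass, identity gradient in the backward pass
    return x + (torch.round(x) - x).detach()

class UniformQuantizer(nn.Module):
    def __init__(self, step_init=0.05, max_init=1.0):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(step_init))   # learned step size
        self.max = nn.Parameter(torch.tensor(max_init))     # learned maximum value

    def forward(self, x):
        x = torch.maximum(torch.minimum(x, self.max), -self.max)  # clip; gradients reach self.max
        return round_ste(x / self.step) * self.step               # quantize; gradients reach self.step

# The implied bitwidth is roughly log2(2 * max / step + 1), so a differentiable
# model-size penalty (e.g., sum over layers of parameter count times bitwidth)
# can be added to the task loss to enforce a size budget.
```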
Many language tasks can be formulated as maximizing an objective function that is a lower bound on mutual information between different parts of the text sequence
BERT (masked LM): word and corrupted word context, or sentence and following sentence
Skip-gram (word2vec): word and word context
InfoWord (proposed by the authors): sentence and n-gram, both encoded using a Transformer
Mutual information: the amount by which the uncertainty about $X$ is reduced by knowing the value of $Z$
Maximizing mutual information directly is infeasible
The mutual information between $X$ and $Z$ can be expressed as the KL divergence between the joint probability distribution and the product of the marginal distributions: $I(X;Z) = D_{KL}(P_{XZ} \parallel P_X P_Z)$
$E_P[f(x)] - \log E_Q[e^{f(x)}]$, where $f(x)$ is any real-valued function for which the expectations are finite, is always less than or equal to the KL divergence between $P$ and $Q$
Donsker-Varadhan representation for KL divergence: supremum of this lower bound is equal to the KL divergence
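Written out, with $f$ ranging over real-valued functions for which the expectations are finite:

$$D_{KL}(P \parallel Q) = \sup_{f} \left( E_P[f(x)] - \log E_Q\!\left[e^{f(x)}\right] \right)$$

Taking $P = P_{XZ}$ and $Q = P_X P_Z$ gives, for any such $f$,

$$I(X;Z) \ge E_{P_{XZ}}[f(x,z)] - \log E_{P_X P_Z}\!\left[e^{f(x,z)}\right]$$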
In theory, any function can be approximated by a neural network ⇒ we can train a neural network $f(x)$ to maximize the lower bound
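A minimal MINE-style sketch under these assumptions: joint samples come from paired (x, z) batches, marginal samples are approximated by shuffling z within the batch, and the critic architecture is arbitrary.

```python
import math
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, dim_x, dim_z, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_x + dim_z, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def dv_lower_bound(f, x, z):
    joint = f(x, z).mean()                                  # E_P[f] over joint samples
    z_shuffled = z[torch.randperm(z.size(0))]               # approximate samples from P_X * P_Z
    marginal = torch.logsumexp(f(x, z_shuffled), dim=0) - math.log(z.size(0))  # log E_Q[e^f]
    return joint - marginal                                  # lower bound on I(X; Z)

# Training step: maximize the bound by minimizing its negation, e.g.
#   critic = Critic(dim_x=32, dim_z=32)
#   opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
#   loss = -dv_lower_bound(critic, x_batch, z_batch); loss.backward(); opt.step()
```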