Jekyll2021-12-03T10:47:56+01:00https://senarvi.github.io/feed.xmlMostly Machine LearningStuff that I work on.Seppo Enarviseppo2021@marjaniemi.comKalman filter equations and extended Kalman filter2021-12-02T00:00:00+01:002021-12-02T00:00:00+01:00https://senarvi.github.io/kalman-filter<h2 id="kalman-filter">Kalman filter</h2> <p>Kalman filter estimates the state of some quantities by combining two kinds of information: a dynamic model that describes our idea of how the world behaves, and measurements that provide noisy estimates of the state. The principle is to iterate the following steps:</p> <ul> <li>Take measurement <span>$z[n]$</span>.</li> <li>Estimate the current state <span>$x[n,n]$</span> based on the measurement and the values predicted by the dynamic model.</li> <li>Predict the next state <span>$x[n+1,n]$</span> based on the dynamic model equations.</li> </ul> <h3 id="measurements">Measurements</h3> <p>Each measurement <span>$z[n]$</span> has some uncertainty. Kalman filter assumes that the measurement error is distributed normally (Gaussian distribution). We need to know the error variance <span>$r$</span> from the specifications of the sensor or through calibration. Usually we measure multiple quantities, and it’s convenient to store them in a vector. Then the measurement error is specified as a covariance matrix <span>$R$</span>.</p> <p>In some applications we want to first transform the measurements into appropriate outputs. For example, we may want to combine the measurements from multiple sensors that measure the same physical quantity. 
In the standard Kalman filter, each output is a linear combination of the measurements, meaning that the transform can be represented as a matrix <span>$H$</span> that we call the observation matrix.</p> <h3 id="state-update-equation">State update equation</h3> <p>The state update equation describes how to combine the measurements and the estimate that is obtained using our dynamic model to produce a better state estimate.</p> <p>The current state could be estimated as the average of all the previous measurements, but we don’t want to store all the previous measurements. It’s possible to derive the following recursive equation:</p> <div>$$x[n,n] = \frac{1}{n} \sum z = x[n,n−1] + \frac{1}{n} (z[n] − x[n,n−1])$$</div> <p>This is a weighted sum of the measurement and <span>$x[n,n−1]$</span>, which is our prediction of what the measurement would be. Here the measurement is weighted by <span>$1/n$</span>. We make this value a time-dependent variable and call it the Kalman gain <span>$K[n]$</span>. The state update equation interpolates between the predicted value and the measurement.</p> <div>$$x[n,n] = x[n,n−1] + K[n] (z[n] − H x[n,n−1])$$</div> <p>The Kalman gain is computed from the uncertainties of the prediction and the measurements. A low gain gives more weight to the prediction, smoothing out the measurement noise, while a high gain gives more weight to the measurement.</p> <h3 id="state-extrapolation-equation">State extrapolation equation</h3> <p>The state extrapolation equation predicts the next state according to our dynamic model. For example, a one-dimensional accelerating movement can be modeled using three variables:</p> <div>\begin{align} x[n+1,n] &amp;= x[n,n] + \Delta t x_v[n,n] + \frac{\Delta t^2}{2} x_a[n,n] \\ x_v[n+1,n] &amp;= x_v[n,n] + \Delta t x_a[n,n] \\ x_a[n+1,n] &amp;= x_a[n,n] \end{align}</div> <p>It’s convenient to place the variables in a vector. In the standard Kalman filter, the dynamic model has to be linear, meaning that it’s possible to represent the state extrapolation equation as matrix multiplication.
One-dimensional accelerating movement can be described by the following state transition matrix:</p> <div>$$F = \begin{bmatrix} 1 &amp; \Delta t &amp; \frac{\Delta t^2}{2} \\ 0 &amp; 1 &amp; \Delta t \\ 0 &amp; 0 &amp; 1 \end{bmatrix}$$</div> <p>We’ll make the state extrapolation equation more generic by including a control input <span>$u[n]$</span>. This vector may contain other information that we don’t predict using the dynamic model, for example the steering of a vehicle.</p> <div>$$x[n+1,n] = F x[n,n] + B u[n]$$</div> <h3 id="kalman-gain">Kalman gain</h3> <p>Our predictions have uncertainty too, since our dynamic model is just an approximation. The Kalman filter provides an estimate of the prediction error, <span>$p$</span> (or generally, the covariance matrix <span>$P$</span>), which depends on the time step and is updated using the covariance equations. It can be initialized to the variance of the initial estimate of <span>$x$</span>.</p> <p>The Kalman gain is large (giving more weight to the measurement) when the prediction error is large or the measurement error is small.</p> <div>$$K[n] = p[n,n-1] / (p[n,n-1] + r)$$</div> <p>In matrix form, when including the observation matrix, the equation becomes more complex:</p> <div>$$K[n] = P[n,n-1] H^T (H P[n,n-1] H^T + R)^{-1}$$</div> <h3 id="covariance-extrapolation-equation">Covariance extrapolation equation</h3> <p>In the same way that we extrapolate the state using the dynamic model, we extrapolate the prediction error using our uncertainty in the dynamic model. This uncertainty can be described, for example, using the process noise variance <span>$q$</span>:</p> <div>$$p[n+1,n] = p[n,n] + q$$</div> <p>Set the process noise variance to a low value, e.g. 0.0001, if the model is accurate. If the model is not accurate (for example, modeling an accelerating motion with a constant velocity), too low a value will cause a lag error.
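As a concrete sketch, the scalar equations can be combined into a minimal one-dimensional filter loop. This is not code from the post: the numbers are made up and the dynamic model is simply a constant quantity observed through noisy measurements. The scalar covariance update p[n,n] = (1 − K[n]) p[n,n−1] is used as well.

```python
import random

def kalman_1d(measurements, x0, p0, r, q):
    """Scalar Kalman filter for a quantity that is assumed constant.

    x0, p0 -- initial state estimate and its variance
    r      -- measurement noise variance
    q      -- process noise variance
    """
    x, p = x0, p0              # prediction for the first step
    estimates = []
    for z in measurements:
        k = p / (p + r)        # Kalman gain
        x = x + k * (z - x)    # state update
        p = (1 - k) * p        # covariance update
        # The constant model predicts the same state for the next step;
        # only the uncertainty grows by the process noise.
        p = p + q              # covariance extrapolation
        estimates.append(x)
    return estimates

random.seed(0)
true_value = 10.0
zs = [true_value + random.gauss(0.0, 1.0) for _ in range(200)]
xs = kalman_1d(zs, x0=0.0, p0=100.0, r=1.0, q=0.0001)
```

Because the model is accurate and q is small, the gain decays over time and the estimate settles close to the true value.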
With multiple variables, we describe the process noise as its covariance matrix <span>$Q$</span>.</p> <div>$$P[n+1,n] = F P[n,n] F^T + Q$$</div> <h3 id="covariance-update-equation">Covariance update equation</h3> <p>The prediction uncertainty is updated based on the Kalman gain. The larger the Kalman gain, the smaller the uncertainty of the updated estimate.</p> <div>$$p[n,n] = (1 − K[n]) p[n,n−1]$$</div> <p>The matrix form also takes the observation matrix into account.</p> <div>$$P[n,n] = (I − K[n] H) P[n,n-1]$$</div> <h2 id="extended-kalman-filter">Extended Kalman filter</h2> <p>The extended Kalman filter is an extension of this concept to nonlinear dynamic models and output functions.</p> <ul> <li>Instead of transforming the state using the matrix <span>$F$</span>, use a possibly non-linear state transition function: <span>$x[n+1,n] = f(x[n,n], u[n])$</span></li> <li>Instead of transforming the measurements using the matrix <span>$H$</span>, use a possibly non-linear output function: <span>$z = h(x)$</span></li> </ul> <p>The idea is to linearize the dynamic model and output function around each estimate.
This means finding the partial derivatives of the functions numerically at the estimate, and using the Kalman filter equations as if the functions were linear.</p> <ul> <li>Replace <span>$H$</span> in the covariance update and Kalman gain equations with the first derivative, or the Jacobian, of <span>$h$</span>.</li> <li>Replace <span>$F$</span> in the covariance extrapolation equation with the first derivative, or the Jacobian, of <span>$f$</span>.</li> </ul>Seppo Enarviseppo2021@marjaniemi.comKalman filterUnderstanding convergence of SGD2019-11-15T00:00:00+01:002019-11-15T00:00:00+01:00https://senarvi.github.io/understanding-convergence-of-sgd<h2 id="batch-size-learning-rate-weight-averaging-and-solutions-that-generalize-better">Batch size, learning rate, weight averaging, and solutions that generalize better</h2> <h3 id="references">References</h3> <p>This article discusses several papers that I recently found which analyze stochastic gradient descent optimization, make interesting observations about its convergence, and help in understanding the significance of batch size and learning rate:</p> <ul> <li><a href="https://arxiv.org/abs/1812.06162">McCandlish et al.</a> 2018. An Empirical Model of Large-Batch Training.</li> <li><a href="https://arxiv.org/abs/1609.04836">Keskar et al.</a> 2017. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.</li> <li><a href="https://arxiv.org/abs/1710.06451">Smith and Le</a> 2018. A Bayesian Perspective on Generalization and Stochastic Gradient Descent.</li> <li><a href="https://arxiv.org/abs/1711.00489">Smith et al.</a> 2018. Don’t Decay the Learning Rate, Increase the Batch Size.</li> <li><a href="https://arxiv.org/abs/1803.05407">Izmailov et al.</a> 2018. Averaging Weights Leads to Wider Optima and Better Generalization.</li> </ul> <h3 id="training-batch-size">Training batch size</h3> <p>Stochastic gradient descent (SGD) is basically gradient descent using noisy gradients of the training loss.
The batch size determines the quality of the gradient estimates. Because the gradient is calculated as an average over the examples in a mini-batch, its variance scales inversely with the batch size. The more examples that are being used to estimate the gradient, the more accurate the estimates are.</p> <p>Given that the different examples in a mini-batch can be processed in parallel in a GPU, it seems to be a good idea to use as large a batch size as possible. The assumption is that this makes the training converge faster. The Transformer model in particular requires a large enough batch size to converge at all. In theory we could always increase the batch size by using more GPUs. However, when the gradient estimates get closer to the true gradient, at some point increasing the batch size will barely improve the model update anymore. The increased communication between the GPUs would also reduce the possible gains in terms of training speed.</p> <p>How should we then choose the batch size in order to train as efficiently as possible? The next section tries to quantify the noise introduced to gradient estimates by using a specific batch size, and its impact on the progress that a training step can make.</p> <h3 id="optimal-learning-rate-and-saturation-of-gradient-estimates">Optimal learning rate and saturation of gradient estimates</h3> <p>At each step SGD moves the model parameters <span>$\theta$</span> in the direction of the negative gradient by an amount specified by the learning rate or step size <span>$\epsilon$</span>. The optimal step size would of course be one that minimizes the new loss <span>$L(\theta - \epsilon G)$</span>. <a href="https://arxiv.org/abs/1812.06162">McCandlish et al.</a> show how we could estimate the optimal step size if we had access to the true gradient <span>$G$</span> and the true Hessian <span>$H$</span>. They approximate the new loss using the second-order Taylor expansion.
The second-order Taylor expansion of function <span>$f(x)$</span> with a real argument can be written</p> <div>$$f(a + x) \approx f(a) + x f'(a) + \frac{1}{2} x^2 f''(a).$$</div> <p>In the same way we can approximate the new loss. We’ll just write it in matrix form:</p> <div>$$L(\theta - \epsilon G) \approx L(\theta) - \epsilon G^T G + \frac{1}{2} \epsilon^2 G^T H G$$</div> <p>This function can be minimized by setting the derivative to zero. This gives the optimal step size for the true gradient:</p> <div>$$\epsilon_{max} = \frac{\left|G\right|^2}{G^T H G}.$$</div> <p>Using this to estimate the learning rate at each step would be very costly, since it would require the computation of the Hessian matrix. In fact, this starts to look a lot like second-order optimization, which is not used in deep learning applications because the computation of the Hessian is too expensive. However, now we’re just trying to understand what happens to the optimal learning rate and batch size under noise.</p> <p>When using SGD with batch size <span>$B$</span>, the gradient estimates are noisy and we need to consider the expectation of <span>$L(\theta - \epsilon G)$</span>. <a href="https://arxiv.org/abs/1812.06162">McCandlish et al.</a> also derive the optimal step size in this case, which can be expressed:</p> <div>$$\epsilon_{opt}(B) = \frac{\epsilon_{max}}{1 + B_{noise} / B}$$</div> <p>They call the quantity <span>$B_{noise}$</span> the <em>gradient noise scale</em>. The noise scale depends on the true gradient and Hessian, and the variance of the gradient.</p> <p>Some observations they make about the noise scale:</p> <ul> <li>One would expect it to be larger for difficult tasks, where examples are less correlated.</li> <li>It doesn’t depend on the size of the data set. 
Model size shouldn’t have much effect either.</li> <li>During training the magnitude of the gradient decreases, so the noise scale grows.</li> </ul> <p>There is a similar relation between the best possible improvement in loss with the true gradient, <span>$\Delta L_{max}$</span>, and the best possible improvement under noise, <span>$\Delta L_{opt}(B)$</span>:</p> <div>$$\Delta L_{opt}(B) = \frac{\Delta L_{max}}{1 + B_{noise} / B}$$</div> <p>Some interesting points that are made about selecting batch size and learning rate:</p> <ul> <li>As larger batch sizes give better gradient estimates, a larger learning rate can be used.</li> <li>There’s a good chance the training will diverge when the step size is larger than twice <span>$\epsilon_{opt}$</span>.</li> <li>When the batch size is a lot smaller than <span>$B_{noise}$</span>, increasing the batch size linearly increases the progress in loss.</li> <li>When the batch size is a lot larger than <span>$B_{noise}$</span>, increasing the batch size has hardly any effect on the progress.</li> </ul> <h3 id="flat-minima-generalize-better">Flat minima generalize better</h3> <p>Intuitively it’s easy to understand that larger batches lead to faster convergence, but after a certain point growing the batch size doesn’t really help anymore. However, this doesn’t explain the well-known fact that too large a batch size can even hurt the model performance. <a href="https://arxiv.org/abs/1609.04836">Keskar et al.</a> observed that large batches yield a similar training loss, but generalize worse than small-batch training.
They found two reasons for the worse generalization performance:</p> <ul> <li>Large-batch training tends to converge to a minimum close to the initial parameter values, instead of exploring all of the parameter space.</li> <li>Large-batch training tends to converge to sharper minima, while small-batch training converges to flatter minima.</li> </ul> <p>They illustrate the generalization capability of flat and sharp minima using a loss function <span>$f(x)$</span> with only a single parameter <span>$x$</span>:</p> <figure> <img src="/assets/images/flat-vs-sharp-minimum.png" /> <figcaption> Generalization capability of a flat and a sharp minimum. <a href="https://arxiv.org/abs/1609.04836">Keskar et al.</a> </figcaption> </figure> <p>Both minima reach the same loss value, but the flat minimum is less sensitive to perturbations in the parameter space. They provide experimental evidence that large-batch training is more likely to converge to sharp minima and to minima close to the starting point. They argue that the inherent noise in small-batch training helps to push the parameters out of a sharp basin.</p> <h3 id="useful-random-fluctuations">Useful random fluctuations</h3> <p><a href="https://arxiv.org/abs/1812.06162">McCandlish et al.</a> looked at how noise degrades gradient estimates and found a point of diminished returns for batch size. In Appendix C they make an empirical observation that the noise scale primarily depends on the learning rate and batch size, and under some assumptions is approximately proportional to <span>$B / \epsilon$</span>. Their definition of the noise scale is independent of the training set size.</p> <p><a href="https://arxiv.org/abs/1710.06451">Smith and Le</a> observed that some amount of noise is helpful, so there is an optimal value for batch size, when other hyperparameters are kept fixed. They try to assess the noise level in SGD by interpreting it as integration of a stochastic differential equation.
With true gradient <span>$d C / d \omega$</span> the gradient descent update can be written</p> <div>$$\Delta \omega = - \frac{\epsilon}{N} \frac{d C}{d \omega},$$</div> <p>where normalization by the training set size <span>$N$</span> comes from the fact that we define the cost as the sum of the costs from individual training examples, but in practice we want to take the average. When estimating the gradient from a mini-batch, an error term must be added. They write this as a stochastic differential equation</p> <div>$$\frac{d \omega}{d t} = - \frac{d C}{d \omega} + \eta(t),$$</div> <p>where <span>$\eta(t)$</span> represents noise. Integrating this equation over the span of <span>$\epsilon / N$</span> represents one SGD update.</p> <p>They calculate the variance in the update in two ways and equate them: from the discrete equation, assuming Gaussian gradient noise, and from the differential equation, by integrating <span>$\eta(t)$</span>. This gives a formula for a “scaling factor” in the variance that they call the noise scale, <span>$g \approx \epsilon N / B$</span>. Assuming that there is an optimal scale of random fluctuations, <span>$g$</span> should be kept fixed. This implies that the optimal batch size is proportional to <span>$\epsilon N$</span>. Similarly, when increasing batch size, learning rate should be increased proportionally.</p> <p>So essentially a small batch size and a high learning rate serve the same purpose—increase the fluctuations that are helpful for learning. In this sense decaying learning rate during training is very similar to simulated annealing. A larger learning rate will explore a larger area of the parameter space, while decaying it will allow training to converge to a minimum. 
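Keeping g fixed gives a simple linear scaling rule for the learning rate. A small sketch (the function and the numbers are mine, not from the papers):

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Keep the noise scale g ~ (lr * N) / B constant for a fixed training
    set size N by scaling the learning rate linearly with the batch size."""
    return base_lr * new_batch_size / base_batch_size

# If lr = 0.1 was tuned at batch size 256, batch size 1024 suggests lr = 0.4.
lr_1024 = scaled_learning_rate(0.1, 256, 1024)
```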
In another paper <a href="https://arxiv.org/abs/1711.00489">Smith et al.</a> suggest increasing the batch size instead of annealing the learning rate, which makes sense if there’s more GPU memory available than what the optimal batch size can initially utilize.</p> <h3 id="averaging-model-parameters">Averaging model parameters</h3> <p>A very interesting approach for finding flat minima was recently proposed by <a href="https://arxiv.org/abs/1803.05407">Izmailov et al.</a>. Instead of continuously decaying the learning rate, it’s possible to find several different models by using a cyclical learning rate schedule. This means simply that the learning rate is repeatedly decayed to zero, and then raised again to a higher value. The model parameters are saved after each decay cycle. They observed that the model parameters traverse around the minimum, but never quite reach the optimal point.</p> <p>This suggests that an improved model can be obtained by averaging the values of each parameter over the intermediate models. The figure below shows three intermediate models (<span>$W_1, W_2, W_3$</span>) and an average model (<span>$W_{SWA}$</span>) in the parameter space.</p> <figure> <img src="/assets/images/stochastic-weight-averaging.png" /> <figcaption> Three models obtained by training using a cyclical learning rate, and an average model. <a href="https://arxiv.org/abs/1803.05407">Izmailov et al.</a> </figcaption> </figure> <p><a href="https://arxiv.org/abs/1803.05407">Izmailov et al.</a> call this method stochastic weight averaging (SWA), and observe that the solutions found by SWA are broader in the sense that even if the training loss might be slightly higher, the model is not as sensitive to perturbations of the parameters.
This improves the generalization of the model, with no extra cost, except for storing the intermediate models.</p>Seppo Enarviseppo2021@marjaniemi.comBatch size, learning rate, weight averaging, and solutions that generalize betterREINFORCE2019-04-06T00:00:00+02:002019-04-06T00:00:00+02:00https://senarvi.github.io/reinforce<h2 id="improving-sequence-to-sequence-models-with-reinforcement-learning">Improving sequence-to-sequence models with reinforcement learning</h2> <h3 id="learning-to-predict-word-sequences">Learning to predict word sequences</h3> <p>Language models and sequence-to-sequence models that generate text typically have an output layer that produces a logit for each word in the vocabulary. The logits are normalized using softmax, which gives a probability distribution over the vocabulary. The model is optimized by minimizing cross entropy, which measures how well our model distribution <span>$p_{\theta}(w_t \mid w_1 \ldots w_{t-1})$</span> fits the empirical distribution in the training data:</p> <div>$$H(w_1 \ldots w_T,p_{\theta}) = -\frac{1}{T} \sum_t \log(p_{\theta}(w_t \mid w_1 \ldots w_{t-1}))$$</div> <p>Usually in language modeling and sequence generation tasks, this objective is used during training, with <span>$w_1 \ldots w_{t-1}$</span> representing the ground-truth output sequence. It is fast to compute and works well for language modeling, where we have a huge corpus of sentences from the output distribution. It is not that good a measure of model performance in most sequence-to-sequence tasks, however, where there can be many different outputs that are correct for a given input, but we observe only one example in the training data. For the same reason, machine translation models are not evaluated by the probability they give to the reference sequence. Instead, metrics such as BLEU and ROUGE are usually used, which compare n-gram statistics of the most likely word sequence generated by the model to those of the reference sequence.
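For concreteness, the cross-entropy objective above can be computed directly from the probabilities that the model assigns to the reference words. A toy sketch with made-up probabilities:

```python
import math

def sequence_cross_entropy(reference_probs):
    """Cross entropy of one reference sequence: -(1/T) * sum(log p_t), where
    p_t is the probability the model assigned to the reference word at t."""
    return -sum(math.log(p) for p in reference_probs) / len(reference_probs)

# The model assigns these probabilities to a four-word reference sequence.
h = sequence_cross_entropy([0.5, 0.25, 0.5, 0.25])
```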
Clearly using cross entropy for training is less than optimal, when we evaluate the model using another metric at test time.</p> <p>There’s also another problem in using cross entropy for training models that are intended for generating word sequences. During inference the model generates a sequence from the model distribution, which encompasses all possible word sequences. But during training the reference sequence (offset by one word) is fed into the model, and the model computes just the next word probabilities. The reference sequence deviates at each time step more and more from what the model would generate. This second problem was named <strong>exposure bias</strong> by <a href="http://arxiv.org/abs/1511.06732">Ranzato et al</a>.</p> <h3 id="formulation-as-a-decision-making-problem">Formulation as a decision making problem</h3> <p>Metrics such as BLEU and ROUGE are not differentiable, so we cannot just compute one of them on generated word sequences and use that as the training objective. It is possible, however, to approach a sequence-to-sequence task using reinforcement learning, using the metric to reward the network based on sequences it would generate.</p> <p>The idea is to formulate the problem as a decision making problem in the following way. An <strong>agent</strong> observes the <strong>state</strong> of the environment, which includes the word sequences and other input features. Based on the current state, the agent repeatedly takes an <strong>action</strong> generating the next word in the output sequence. The model is seen as a <strong>policy</strong> <span>$p_\theta$</span>, which dictates the next action.</p> <p>The REINFORCE method is episodic. One episode ends when the agent generates the end-of-sequence token at time <span>$T$</span>. Generally speaking, the agent receives a <strong>reward</strong> <span>$r_t$</span> after performing an action at time <span>$t$</span>. 
The <strong>return</strong>, or <strong>cumulative reward</strong>, from time <span>$t$</span> onwards, is the sum of the rewards:</p> <div>$$G_t = \sum_{i=t}^T r_i$$</div> <p>The <strong>value</strong> of a state is the expected cumulative reward obtained by following policy <span>$p_\theta$</span>. Usually, when the task is to generate word sequences, we can only observe the cumulative reward <span>$G_1 = R(W)$</span>, for example the ROUGE score, after generating the entire sequence <span>$W$</span>.</p> <h3 id="reinforce-objective-and-its-gradient">REINFORCE objective and its gradient</h3> <p>REINFORCE is a policy-gradient method, solving the problem using stochastic gradient descent. This is possible when the parameters of the policy, <span>$\theta$</span>, are continuous. The objective function is the value at the beginning of the sequence:</p> <div>$$J(\theta) = E[G_1] = \sum_W p_{\theta}(W) R(W)$$</div> <p>The summation over word sequences makes direct computation of the objective, as well as the gradient, infeasible, but they can be approximated by sampling. The objective could be approximated by sampling a sequence and computing the cumulative reward. However, for training a model we actually don’t need to approximate the objective function but its gradient. Stochastic gradient descent only requires that the expectation of the sampled gradients is proportional to the actual gradient (section 13.3 in <a href="http://incompleteideas.net/book/RLbook2018.pdf">Sutton and Barto</a>).
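This condition can be checked numerically on a toy problem. Below, the "sequences" are single words drawn from a three-word softmax policy, so the true gradient of the objective can be computed by finite differences and compared with the exact expectation of the sampled quantity R(W) ∇ log p(W). All numbers are made up:

```python
import math

def softmax(logits):
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

theta = [0.2, -0.5, 0.1]   # logits of a toy one-step, 3-word policy
rewards = [1.0, 0.0, 2.0]  # reward R(W) of each one-word sequence

def objective(logits):
    # J = sum over sequences of p(W) * R(W), i.e. the expected reward.
    return sum(p * r for p, r in zip(softmax(logits), rewards))

# True gradient of J by finite differences.
eps = 1e-6
true_grad = []
for k in range(3):
    bumped = list(theta)
    bumped[k] += eps
    true_grad.append((objective(bumped) - objective(theta)) / eps)

# Exact expectation of the single-sample quantity R(W) * grad log p(W),
# using grad_k log p_w = delta(w, k) - p_k for a softmax policy.
probs = softmax(theta)
expected_estimate = [
    sum(probs[w] * rewards[w] * ((w == k) - probs[k]) for w in range(3))
    for k in range(3)
]
```

The two gradients agree, which is exactly the property that makes single-sample updates usable.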
Let’s start by writing the gradient as an expectation over word sequences:</p> <div>\begin{align} \nabla_{\theta} J(\theta) &amp;= \sum_W \nabla_{\theta} p_{\theta}(W) R(W) \\ &amp;= \sum_W p_{\theta}(W) \frac{\nabla_{\theta} p_{\theta}(W)}{p_{\theta}(W)} R(W) \\ &amp;= E_W R(W) \nabla_{\theta} \log p_{\theta}(W) \end{align}</div> <p>where we have used <span>$\nabla \log x = \frac{\nabla x}{x}$</span>.</p> <p>This brings us to the REINFORCE algorithm, which is essentially an approximation of the gradient using a single sample <span>$W$</span>:</p> <div>$$\nabla_{\theta} J(\theta) \approx R(W) \nabla_{\theta} \log p_{\theta}(W)$$</div> <p>This quantity can be used as a sample of the gradient, since its expectation is equal to the gradient of the objective function. Implementation is quite easy with a library that supports automatic differentiation. One can simply take the gradient of <span>$R(W) \log p_{\theta}(W)$</span> instead of the gradient of the actual objective.</p> <p>Writing a differentiation operator for backpropagation is not too difficult either. Let’s say the input to the softmax at time <span>$t$</span> is <span>$o_t$</span>. There is a simple expression for the partial derivatives of <a href="https://deepnotes.io/softmax-crossentropy">cross entropy over softmax output</a>, assuming the reference output is a one-hot vector. We use <span>$1(w_t)$</span> to denote a one-hot vector where the value corresponding to the word <span>$w_t$</span> is one and other values are zero.
Then the following gives an expression for the gradient with regard to the softmax input:</p> <div>\begin{align} \nabla_{o_t} J(\theta) &amp;\approx R(W) \nabla_{o_t} \log p_{\theta}(W) \\ &amp;= R(W) \nabla_{o_t} \sum_i \log p_{\theta}(w_i \mid w_1 \ldots w_{i-1}) \\ &amp;= R(W) \nabla_{o_t} \log p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \\ &amp;= R(W) (1(w_t) - p_{\theta}(w_t \mid w_1 \ldots w_{t-1})) \end{align}</div> <h3 id="reinforce-with-baseline">REINFORCE with baseline</h3> <p>While in theory it is enough that the expectation of the gradient sample is proportional to the actual gradient, having the training converge in a reasonable time is another matter entirely. A good estimate of the gradient generally has a low variance (variance measures how spread out the estimates are around the mean), meaning that the parameter updates have a low variance as well. The parameter update in REINFORCE is based on a single random output sequence sampled from the action space. It’s easy to reason that the longer the output sequences are, the less likely it is to obtain a sequence that results in an accurate estimate.
Actually, the variance of the gradient estimate grows cubically with the sequence length (section 3 in <a href="https://doi.org/10.1016/j.neunet.2008.02.003">Peters and Schaal</a>).</p> <p>We start by rewriting the loss function, taking into account that the cumulative reward is accumulated from rewards <span>$r_t$</span> from individual time steps:</p> <div>\begin{align} \nabla_{\theta} J(\theta) &amp;= E_W R(W) \nabla_{\theta} \log p_{\theta}(W) \\ &amp;= E_W \sum_t r_t \nabla_{\theta} \log p_{\theta}(W) \\ &amp;= E_W \sum_t r_t \nabla_{\theta} \log p_{\theta}(w_1 \ldots w_t) \\ &amp;= E_W \sum_t G_t \nabla_{\theta} \log p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \\ &amp;= \sum_t E_{w_t} G_t \nabla_{\theta} \log p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \end{align}</div> <p><a href="https://arxiv.org/abs/1505.00521">Zaremba and Sutskever</a> show in Appendix A that the third equation above holds because actions cannot influence past rewards. The fourth equation was obtained by reordering the sums:</p> <div>\begin{align} &amp;\sum_{t=1}^T r_t \sum_{i=1}^t \nabla_{\theta} \log p_{\theta}(w_i \mid w_1 \ldots w_{i-1}) \\ = &amp;\sum_{t=1}^T \sum_{i=t}^T r_i \nabla_{\theta} \log p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \end{align}</div> <p>In some states all actions have a higher value than in other states. With regard to the gradient, it makes no difference if the value of all actions in a particular state is changed by the same amount. In other words, we can subtract a quantity <span>$b_t$</span> from the reward or cumulative reward of all the possible words <span>$w_t$</span> that follow a certain partial output sequence <span>$w_1 \ldots w_{t-1}$</span>, without changing the gradient:</p> <div>$$\nabla_{\theta} J(\theta) = \sum_t E_{w_t} (G_t - b_t) \nabla_{\theta} \log p_{\theta}(w_t \mid w_1 \ldots w_{t-1})$$</div> <p>The function <span>$b_t$</span> is called a <strong>baseline</strong>.
It can be an arbitrary function of the state, as long as it doesn’t depend on the next action (i.e. it is constant with regard to <span>$w_t$</span>). This can be shown formally by taking <span>$b_t$</span> outside of the expectation. It is then multiplied by the following term, which means that the subtracted quantity is zero (<a href="https://arxiv.org/abs/1612.00563">Rennie et al</a>):</p> <div>\begin{align} &amp;E_{w_t} \nabla_{\theta} \log p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \\ = &amp;\sum_{w_t} p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \nabla_{\theta} \log p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \\ = &amp;\sum_{w_t} \nabla_{\theta} p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \\ = &amp;\nabla_{\theta} \sum_{w_t} p_{\theta}(w_t \mid w_1 \ldots w_{t-1}) \\ = &amp;\nabla_{\theta} 1 \\ = &amp;0 \end{align}</div> <p>where we have used <span>$\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$</span>.</p> <p>The variance of the gradient estimates can be reduced by using a baseline that is higher for states that generally receive higher rewards. How to come up with such a baseline is not trivial. Some proposed approaches are listed below.</p> <ul> <li><a href="https://arxiv.org/abs/1705.04304">Paulus et al</a>: The baseline is the reward observed for a sequence that is generated by greedy decoding.</li> <li><a href="https://arxiv.org/abs/1612.00563">Rennie et al</a>: After generating the sequence until time step <span>$t$</span> by sampling, the rest of the sequence is generated by greedy decoding. The reward observed for the combined sequence is used as the baseline at time step <span>$t$</span>.</li> <li><a href="https://arxiv.org/abs/1805.09461">Keneshloo et al</a>: More than one sampled sequence is used for estimating the gradient.
The average reward of the sampled sequences is used as the baseline.</li> <li><a href="https://arxiv.org/abs/1505.00521">Zaremba and Sutskever</a>: An LSTM that runs over the same input as the model is used to predict <span>$G_t$</span> at time step <span>$t$</span>.</li> <li><a href="http://arxiv.org/abs/1511.06732">Ranzato et al</a>: A linear regressor that takes as input the hidden states of the model is used to predict <span>$r_t$</span> at time step <span>$t$</span>.</li> </ul>Seppo Enarviseppo2021@marjaniemi.comImproving sequence-to-sequence models with reinforcement learningTensor2Tensor presentation slides2018-04-20T00:00:00+02:002018-04-20T00:00:00+02:00https://senarvi.github.io/tensor2tensor-slides## Gentle Introduction to Tensor2Tensor * Models * Modalities * Estimators * Datasets * Problems * Metrics --- ## Introduction * Training models is a very complex process, and to a large extent similar in different projects. * Layers, optimizers, checkpointing. * TensorFlow offers many constructs on top of the basic functionality that make modeling easier. * tf.contrib module for volatile or experimental code. * Tensor2Tensor is a toolkit for sequence-to-sequence modeling. * Very flexible: machine translation, end-to-end ASR, image classification, language modeling, abstractive summarization. * Should be easy to get started—can even download the data automatically. * Maintained by the Google Brain team. * Frequent contributions. * Bugs and incomplete documentation. --- <!-- .slide: data-transition="fade-in none-out" --> ## Tensor2Tensor Models * Consist of three parts: input modality, model body, and output modality. * Modalities abstract away the type of source and target data (words, audio, images). * Input modality converts input into feature vectors, e.g. word embeddings. * Output modality converts activations into outputs, e.g. word IDs. * Output modality also implements the loss function.
* The model body defines the architecture for low-dimensional input/output, i.e. everything between the modalities. --- <!-- .slide: data-transition="none-in fade-out" --> ## Tensor2Tensor Models ![Input, body, and output of an encoder-decoder model](../assets/images/tensorflow-slides/tensor2tensor-modalities.png) <!-- .element: class="plain" --> --- ## Model Body * Base class for the model body is T2TModel. * The toolkit recognizes models that are registered using @registry.register_model decorator (can be selected using --model flag). * Subclasses, e.g. LSTMSeq2SeqAttention, override body(). * model_fn() stacks the body function between given modalities. * During training returns the loss. * During decoding returns the output distribution for the last time step. --- ## Modalities * Base class for modalities is Modality. * Subclasses implement some of bottom(), top(), loss(). * Input modality provides the features, e.g. word embeddings, through bottom() method. * Output modality provides the logits through top() method. * During training loss() is used. --- ## Data Parallelism * T2TModel supports data parallelism. * The data that is moved between the model parts is represented as a list of tensors, one for each shard. * expert_utils.Parallelism is a helper class that calls a function on each element of a list and collects the results into a list. --- ## Estimator * Estimator class is a TensorFlow abstraction that combines a model with instructions how to use it. * Takes care of the training loop, variable initialization, checkpointing / restoring training, saving summaries for TensorBoard. * Three modes: * TRAIN estimates model parameters. * PREDICT predicts the output distribution using a trained model. * EVAL computes the loss and possible other evaluation metrics. * The user provides an input function input_fn, which the estimator uses to read the input data. 
When called, an input function returns two values: * features, a mapping from feature name to an array of values, and * labels, an array of labels. --- ## Estimator Example * TensorFlow provides Estimator subclasses that implement some common models. * There are functions for creating input_fn from e.g. NumPy and pandas tables. * feature_columns defines what features to use and how to handle e.g. categorical data. * model_dir specifies a directory where checkpoints are created. * A checkpoint contains everything needed to continue training or evaluate the model. python feature_cols = [tf.feature_column.numeric_column("x", shape=[28, 28])] dnn = tf.estimator.DNNClassifier([32, 64, 32], feature_cols, model_dir, n_classes) dnn.train(input_fn=train_input_fn, steps=2000) accuracy = dnn.evaluate(input_fn=eval_input_fn)["accuracy"]  --- ## Dataset * tf.data.Dataset class represents a sequence of elements (for example input-output sentence pairs). * The data can be for example in tensors, text files, or binary files. * A new Dataset can be created by applying a transformation to another Dataset. * shuffle(buffer_size) randomizes the order of the elements. * repeat() iterates the dataset repeatedly. * batch(batch_size) always takes the next batch_size elements and creates a tensor with one more dimension. * group_by_window(key_func, reduce_func, window_size) creates batches of window_size elements with matching key. * map(map_func) performs the function map_func on the elements. --- ## Dataset Iterators * tf.data.Iterator is an interface for reading the sequence one element at a time. * Can be created from a Dataset using make_one_shot_iterator(). * Resembles the Python iterator interface, but behaves differently, which can be confusing. * get_next() is called only once—it returns a Tensor that represents the next element. 
* The value of the Tensor changes on each evaluation of the graph. --- ## Input Function Example <pre class="stretch"><code data-trim> images = tf.constant(["train/img1.png", "train/img2.png", "train/img3.png", "train/img4.png", "train/img5.png", "train/img6.png"]) labels = tf.constant([0, 0, 0, 1, 1, 1]) def preprocess(image_path, label): image_data = tf.read_file(image_path) image = tf.image.decode_image(image_data, channels=3) return image, label def train_input_fn(): dataset = tf.data.Dataset.from_tensor_slices((images, labels)) dataset = dataset.map(preprocess) dataset = dataset.shuffle(6) dataset = dataset.repeat() dataset = dataset.batch(2) iter = dataset.make_one_shot_iterator() return iter.get_next()‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍ </code></pre> --- <!-- .slide: data-transition="fade-in none-out" --> ## Bucketing * It would be efficient to have as little padding in mini-batches as possible. * Tensor2Tensor uses group_by_window() to create mini-batches of sequences that are similar in length.  
def get_bucket_id(example): seq_length = get_length(example) cond = tf.logical_and( tf.less_equal(buckets_min, seq_length), tf.less(seq_length, buckets_max)) return tf.reduce_min(tf.where(cond)) def create_batch(bucket_id, dataset): padded_shapes = dict( [(name, [None] * len(shape)) for name, shape in dataset.output_shapes.items()]) return dataset.padded_batch(batch_size, padded_shapes) dataset = dataset.apply(group_by_window(get_bucket_id, create_batch, batch_size))  --- <!-- .slide: data-transition="none-in fade-out" --> ## Bucketing ![Illustration of batching using group_by_window() dataset transformation](../assets/images/tensorflow-slides/group-by-window.png) <!-- .element: class="plain" --> --- ## TFRecord * Before training a model with Tensor2Tensor, the dataset is converted into TFRecord format (using t2t-datagen). * The data is already in raw binary format and training will be faster. * Words are converted to integer IDs, meaning that the TFRecord files are created using a specific vocabulary. * The data can be read and passed to an Estimator using TFRecordDataset. --- <!-- .slide: data-transition="fade-in none-out" --> ## Custom Estimators * Tensor2Tensor uses the base Estimator class, which takes model_fn, hyperparameters, and configuration (e.g. model directory) in constructor. * model_fn specifies how to compute model output, given the input and the mode. * Such a function is given by T2TModel as the estimator_model_fn() method. * Returns a slightly different graph depending on the mode. --- <!-- .slide: data-transition="none-in fade-out" --> ## Custom Estimators * model_fn takes the following arguments: * features: feature tensors returned by the input_fn * labels: label tensor returned by the input_fn * mode: TRAIN, PREDICT, or EVAL * params: model hyperparameters * Tensor2Tensor adds hparams and decode_hparams. 
* model_fn returns an EstimatorSpec that describes the output of the model using the input tensors: * In PREDICT mode that includes the predicted probabilities. * In EVAL mode that includes the loss and possibly a dictionary of operations that are used to compute other evaluation metrics. * In TRAIN mode that includes the loss and train_op that defines the optimization step. --- ## Model Function Example <pre class="stretch"><code data-trim> def model_fn(features, labels, mode, params): hidden_activations = tf.layers.dense(features["inputs"], 256) logits = tf.layers.dense(hidden_activations, params["num_classes"]) predicted_classes = tf.argmax(logits, 1) # For PREDICT, the predicted classes and probabilities are needed. if mode == tf.estimator.ModeKeys.PREDICT: predictions = {"class_ids": predicted_classes[:, tf.newaxis], "probabilities": tf.nn.softmax(logits), "logits": logits} return EstimatorSpec(mode, predictions=predictions) # For TRAIN and EVAL, compute the loss. loss = sparse_softmax_cross_entropy(labels, logits) # For EVAL, compute the evaluation metrics. if mode == tf.estimator.ModeKeys.EVAL: accuracy = tf.metrics.accuracy(labels, predicted_classes) metrics = {"accuracy": accuracy} return EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics) # For TRAIN, return also the optimizer. optimizer = tf.train.AdagradOptimizer(learning_rate=0.1) train_op = optimizer.minimize(loss) return EstimatorSpec(mode, loss=loss, train_op=train_op)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍ </code></pre> --- ## Estimator – Training * train_and_evaluate() performs training, periodically evaluating the model and possibly executing other *hooks*. 
python train_spec = tf.estimator.TrainSpec(train_input_fn, max_steps, hooks) eval_spec = tf.estimator.EvalSpec(eval_input_fn) tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)  * Recently changed; Tensor2Tensor uses the deprecated tf.contrib.learn.Experiment class. --- ## Estimator – Hooks * Hooks are attached to a certain point of the execution. * Can be used to initialize, save, and monitor things. * Derived from SessionRunHook, which calls a method when a session is created or run() is called. * Tensor2Tensor implements MetricsBasedHook, which quits training after loss has stopped decreasing. * A dataset can be parameterized using a tf.placeholder() and initialized at the beginning of a session. --- ## Problem * Problem is a Tensor2Tensor concept that defines data generation and the evaluation metrics that are computed during training. * These cannot be changed using hyperparameters, but many other aspects can. * Subclasses implement generate_data() method that generates train and dev sets. * Normally downloads the data from the Internet. * Text2TextProblem typically creates a vocabulary and encodes text. * SubwordTextEncoder encodes text using a subword vocabulary. --- ## Problem – Metrics * The metrics computed at each validation are defined at eval_metrics(). * Can be monitored using TensorBoard. * The metric that is used for early stopping is specified using the --eval_early_stopping_metric flag. * If not set, the loss function is used. * Otherwise has to be one of those returned by eval_metrics(). --- ## Problem – Applying * Problem classes are registered using @registry.register_problem decorator and selected using the --problem command line flag. * The data processing is done by t2t-datagen tool, which creates TFRecord files. * Training is done by t2t-trainer, which operates on the binary files. --- ## Hyperparameters * Hyperparameters are used by the problem (e.g. vocabulary size, sampling rate) and the model (e.g. 
number of layers, layer size). * Typically a set of default hyperparameters are defined for each model in a function that returns a tf.contrib.training.HParams object. * Registered using @registry.register_hparams decorator and selected using the --hparams_set flag. * The default values can be overridden using configuration flags. --- ## Code Organization * data_generators: Problem and subclasses * data_generators/text_problems.py: Text2TextProblem * layers/common_hparams.py: A common set of hyperparameters * layers/modalities.py: Modality subclasses * models: T2TModel subclasses * models/lstm.py: LSTM models and their default hyperparameter sets * utils/beam_search.py: Beam search decoder * utils/data_reader.py: Bucketing functions * utils/flags.py: Configuration flags (other than hyperparameters) * utils/metrics.py: Evaluation metrics * utils/modality.py: Base Modality class * utils/rouge.py: Functions for ROUGE score computation * utils/t2t_model.py: Base T2TModel class * utils/trainer_lib.py: Functions for creating estimator, experiment, hooksSeppo Enarviseppo2021@marjaniemi.com## Gentle Introduction to Tensor2TensorTensorFlow presentation slides2018-04-13T00:00:00+02:002018-04-13T00:00:00+02:00https://senarvi.github.io/tensorflow-slides## Gentle Introduction to TensorFlow * Sessions * Variables * Broadcasting * Optimization * Devices * Recurrency * Debugging * TensorBoard --- ## Introduction * Numerous machine learning toolkits utilize the same principle—the user describes a neural network using a graph consisting of operations (forward pass) and the toolkit performs automatic differentiation (backward pass). * Theano, Torch, Keras, Caffe, CNTK, MXNet * Development is by far most active around TensorFlow. * Tensor2Tensor for sequence-to-sequence modeling. * C++ API can be used to execute graphs built using the Python API in a production environment. --- ## Basic Concepts * tf.Graph class represents a computation graph. 
* By default all operations and variables are placed in the default graph. * tf.Tensor represents the output of an operation. * tf.Variable represents persistent state. * tf.Session is a class for performing computation on a graph. * By default, the default graph is used. * Maintains the state of the variables between run() calls. python a = tf.constant(1.0) b = tf.constant(2.0) c = a + b sess = tf.Session() print(sess.run(c)) # 3.0  --- ## Variables * Before using variables, they have to be initialized. * init_op sets them to the initial value given in the constructor. * Alternatively their state can be read from a *checkpoint* file. python a = tf.Variable(0.0) b = tf.constant(1.0) assign_op = tf.assign(a, a + b) init_op = tf.global_variables_initializer() sess = tf.Session() sess.run(init_op) print(sess.run(assign_op)) # 1.0 print(sess.run(assign_op)) # 2.0  --- <!-- .slide: data-transition="fade-in none-out" --> ## Model Parameters * Usually model parameters are created using tf.get_variable(name, shape). * "Traditional" way of constructing a layer is by creating a class. python class MyLayer(base.Layer): def __init__(self, units_in, units_out, name): self.kernel = tf.get_variable(name + "/kernel", shape=[units_in, units_out]) self.bias = tf.get_variable(name + "/bias", shape=[units_out]) def __call__(self, input): return tf.matmul(input, self.kernel) + self.bias  --- <!-- .slide: data-transition="none-in none-out" --> ## Model Parameters * Most of the time another convention is used in TensorFlow. * Trainable variables are collected automatically. * with tf.variable_scope() is convenient for creating name hierarchies. * Makes debugging a lot easier. 
python def my_layer(input, units): kernel = tf.get_variable("kernel", shape=[input.get_shape()[-1], units]) bias = tf.get_variable("bias", shape=[units]) return tf.matmul(input, kernel) + bias with tf.variable_scope(name): output = my_layer(input, units)  --- <!-- .slide: data-transition="none-in fade-out" --> ## Model Parameters * What if we want to apply the same layer to different inputs (without creating new weights)? python with tf.variable_scope(name): output1 = my_layer(input1, units) tf.get_variable_scope().reuse_variables() output2 = my_layer(input2, units)  --- ## Tensor Shape * tensor.get_shape() returns the static shape, known at compile time. * tf.shape(tensor) returns a Tensor that represents the dynamic shape. python batch_size = 16 length = tf.placeholder(tf.int32) inputs = tf.zeros((batch_size, length)) static_shape = inputs.get_shape() # TensorShape; (16, ?) dynamic_shape = tf.shape(inputs) # Tensor; [16, length]  --- <!-- .slide: data-transition="fade-in none-out" --> ## Broadcasting ![Addition of a matrix and a vector](../assets/images/tensorflow-slides/matrix-vector-addition.png) <!-- .element: class="plain" --> * Convert tensors to the same shape to make operations compatible. * Elementwise binary operations follow NumPy broadcasting. * The shape of the low-rank tensor must match the trailing dimensions of the high-rank tensor. --- <!-- .slide: data-transition="none-in fade-out" --> ## Broadcasting * Matrix multiplication tf.matmul() does not broadcast. The tensors have to be converted to 2D first. 
python x = tf.random_normal([i, j, k]) y = tf.random_normal([k, l]) x = tf.reshape(x, [-1, k]) z = tf.matmul(x, y) z = tf.reshape(z, [i, j, l])  * Note that if you want a linear layer / multiplication by a weight matrix, it's easier to use tf.layers.dense(x, size, use_bias=False). --- ## Optimization * Common optimization methods are implemented in various subclasses of tf.train.Optimizer. * AdadeltaOptimizer, AdagradOptimizer, MomentumOptimizer, AdamOptimizer * By default all *trainable* variables are optimized. * minimize(cost) returns an operation that performs one training step. python optimizer = GradientDescentOptimizer(learning_rate=0.1) optimizer_op = optimizer.minimize(cost) for i in range(training_steps): sess.run(optimizer_op)  --- ## Devices * An operation can run on a CPU ("/cpu:0") or a GPU ("/device:GPU:0"). * By default, the first GPU device is selected, if the operation has a GPU implementation (and you have a GPU). * with tf.device('/device:GPU:N') assigns variables and operations to GPU N. * Setting configuration variable log_device_placement=True causes the device mapping to be printed. * By default, TensorFlow maps all memory of all GPUs. * Set CUDA_VISIBLE_DEVICES=N unless job scheduler is allocating a GPU for your process. --- <!-- .slide: data-transition="fade-in none-out" --> ## Recurrency * RNNCell is an abstract class that defines a recurrent layer as a function of activations and state. * Input, output, and state are 2-dimensional: [sequences, units] ![Block diagram of an RNNCell](../assets/images/tensorflow-slides/rnncell.png) <!-- .element: class="plain" --> --- <!-- .slide: data-transition="none-in fade-out" --> ## Recurrency * There are several different RNN cells already implemented in tf.contrib.rnn: LSTMCell, GRUCell, ... * To insert a recurrent layer into a graph, use tf.nn.dynamic_rnn(cell, inputs). 
* Inputs and outputs are 3-dimensional: [sequences, time, units] * Initial state can be given as input and final state is received as output (2-dimensional). python cell = tf.nn.rnn_cell.LSTMCell(num_units) outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, initial_state)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍  --- <!-- .slide: data-transition="fade-in none-out" --> ## RNNCell Wrappers * Wrap some functionality over another RNN, e.g. LSTMCell. * MultiRNNCell(cells) * Stacks a list of RNNCells on top of each other. * State is a tuple. ![Block diagram of MultiRNNCell](../assets/images/tensorflow-slides/multirnncell.png) <!-- .element: class="plain" --> --- <!-- .slide: data-transition="none-in fade-out" --> ## RNNCell Wrappers * DropoutWrapper(cell) * Adds dropout. * AttentionWrapper(cell, attention_mechanism) * Implements attention over the decoder RNN cell. * Assumes we have the outputs of the encoder (memory). --- ## AttentionWrapper ![Block diagram of AttentionWrapper](../assets/images/tensorflow-slides/attentionwrapper.png) <!-- .element: class="plain" --> python decoder_cell = tf.nn.rnn_cell.LSTMCell(num_units) attention_mechanism = BahdanauAttention(num_units, encoder_outputs) attention_cell = AttentionWrapper(decoder_cell, attention_mechanism)‍‍‍‍‍‍ decoder_outputs, _ = tf.nn.dynamic_rnn(attention_cell, decoder_inputs, final_encoder_state)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍  --- <!-- .slide: data-transition="fade-in none-out" --> ## Debugging * Assertion and print statements that need the actual value of the tensor should be added to the graph. * Problem: Assertion operation is not needed for computing the output. 
* The solution in TensorFlow: python assert_op = tf.Assert(tf.reduce_all(x > 0), [x]) x = tf.with_dependencies([assert_op], x)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍ with tf.control_dependencies([tf.assert_equal(x, y)]): x = tf.identity(x)‍‍  * Printing is an identity operation that can be placed in the graph: python x = tf.Print(x, [tf.argmax(x)], 'argmax(x) = ')‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍  --- ## TensorFlow Debugger * tfdbg can be used to debug problems that occur in the middle of training, e.g. inf values. * Run *n* steps or until some tensor values match a given filter. * Print values of tensors and trace them to Python code. * Display information about the computation graph. python from tensorflow.python.debug import LocalCLIDebugWrapperSession sess = LocalCLIDebugWrapperSession(sess)‍‍‍  python from tensorflow.python.debug import LocalCLIDebugHook ex = experiment.Experiment(..., train_monitors=[LocalCLIDebugHook()])‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍  --- ## Python Debuggers * There are many debuggers for Python: pdb, ipdb, pudb, ... * Enable by placing e.g. import ipdb; ipdb.set_trace() where you want to debug. ![Screenshot of ipdb console](../assets/images/tensorflow-slides/ipdb.png) --- ## Custom Operation * py_func can convert a Python function into a TensorFlow operation. * Should take NumPy arrays as input and return NumPy arrays as outputs. * Print values, draw a plot using matplotlib, use Python debugger. python def debug_fn(x): print "x.shape={}".format(x.shape) if (x > 100).any(): import ipdb; ipdb.set_trace() return x output = tf.py_func(debug_fn, [input], tf.float32)  --- ## TensorBoard * The value of some variables can be monitored very easily during training. 
* In the code you only need to create a summary operation, for example: python tf.summary.scalar("mean_score", tf.reduce_mean(score))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍  * Normally you would also need to collect the summaries and write to a log file using tf.summary.merge_all() and tf.summary.FileWriter, but Tensor2Tensor does this automatically. * TensorBoard is a web interface for monitoring training. * Start the service using tensorboard --logdir <model directory> --port <port> and point your web browser to the address that gets printed. --- ## TensorBoard – Scalars * The scalars page displays a graphical history of the scalar summary variables. ![Scalar graphs on TensorBoard](../assets/images/tensorflow-slides/tensorboard-scalars.png) --- ## TensorBoard – Histogram * A histogram displays the distribution of a set of values. * The histogram is saved for the training step at specific intervals. python tf.summary.histogram("attention_peak", tf.argmax(alignments, 2))  ![Screenshot of scalar graphs on TensorBoard](../assets/images/tensorflow-slides/tensorboard-histogram.png) --- ## TensorBoard – Graph * The computation graph is also saved in text form in the model directory. * It can be inspected on the graphs page, showing the hierarchy in the variable and operation names. 
![Screenshot of a histogram on TensorBoard](../assets/images/tensorflow-slides/tensorboard-graph.png)Seppo Enarviseppo2021@marjaniemi.com## Gentle Introduction to TensorFlowFinnish and Estonian LM training data2017-05-05T00:00:00+02:002017-05-05T00:00:00+02:00https://senarvi.github.io/finnish-and-estonian-lm-training-data<h2 id="conversational-finnish-and-estonian-data-sets">Conversational Finnish and Estonian data sets</h2> <p>Here are a couple of data sets that I have collected from the web for training language models:</p> <ul> <li><a href="https://dl.dropboxusercontent.com/s/gckrscbkuilfvhm/73M-conversational-finnish.txt.gz">Conversational Finnish (73 million words, 197 MB)</a></li> <li><a href="https://dl.dropboxusercontent.com/s/kodlx8maserea3x/80M-conversational-estonian.txt.gz">Conversational Estonian (80 million words, 204 MB)</a></li> </ul> <p>The data, originally 2.7 billion Finnish words and 340 million Estonian words, have been collected by <a href="/crawling-conversation-sites/">crawling conversation sites</a>. The text has been normalized and filtered to match transcribed conversations, duplicate lines have been removed, and the sentences have been shuffled. The filtering is described in Kurimo et al. 
(2016), <a href="/publications/lre2016.pdf">Modeling under-resourced languages for speech recognition</a>.</p>Seppo Enarviseppo2021@marjaniemi.comConversational Finnish and Estonian data setsCrawling conversation sites for language modeling data2017-05-04T00:00:00+02:002017-05-04T00:00:00+02:00https://senarvi.github.io/crawling-conversation-sites<h2 id="how-to-use-scrapy-to-obtain-vast-amounts-of-data-for-language-modeling">How to use Scrapy to obtain vast amounts of data for language modeling</h2> <h3 id="creating-web-spiders-with-scrapy">Creating web spiders with Scrapy</h3> <p>In the past, a common way to obtain <a href="https://ssli.ee.washington.edu/tial/projects/ears/WebData/web_data_collection.html">language model training data from the web</a> has involved generating Google queries using in-domain data. When trying to collect Finnish conversations, I found this method very inefficient. It might be that something has changed in how Google handles queries, or Google finds less conversational Finnish data, or that my set of in-domain n-grams was too small. In any case, I found crawling large conversation sites using Python and <a href="https://scrapy.org/">Scrapy</a> to be way more efficient. Using Scrapy can initially seem like a lot of work, but usually the same spider can be adapted to different sites with only small changes. A single conversation site contains millions or billions of words of conversations.</p> <p>Creating a new project has been made easy. After <a href="https://doc.scrapy.org/en/latest/intro/install.html">installing Scrapy</a>, create the directory structure using:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scrapy startproject mybot </code></pre></div></div> <p>Before writing the actual spider, let’s look at a few details that need to be filled in the files that the command created. 
<code class="language-plaintext highlighter-rouge">mybot/items.py</code> defines a class for storing a single data item, in this case a message extracted from a conversation site. I also extract a unique ID for each message. It can be a unique URL or some attribute in the HTML code that the site uses. The file looks like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scrapy.item</span> <span class="kn">import</span> <span class="n">Item</span><span class="p">,</span> <span class="n">Field</span> <span class="k">class</span> <span class="nc">MessageItem</span><span class="p">(</span><span class="n">Item</span><span class="p">):</span> <span class="nb">id</span> <span class="o">=</span> <span class="n">Field</span><span class="p">()</span> <span class="n">text</span> <span class="o">=</span> <span class="n">Field</span><span class="p">()</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">mybot/pipelines.py</code> defines components of a pipeline that is used to process all extracted data items. Pipeline components are classes that implement <code class="language-plaintext highlighter-rouge">process_item()</code> method. The method is called on every extracted item by the Scrapy framework. 
I have used two components—one that filters duplicate items, and one that writes the item to disk:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scrapy.exceptions</span> <span class="kn">import</span> <span class="n">DropItem</span> <span class="k">class</span> <span class="nc">DuplicatesPipeline</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">seen_ids</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span> <span class="k">def</span> <span class="nf">process_item</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">item</span><span class="p">,</span> <span class="n">spider</span><span class="p">):</span> <span class="k">if</span> <span class="n">item</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">seen_ids</span><span class="p">:</span> <span class="k">raise</span> <span class="n">DropItem</span><span class="p">(</span><span class="s">"Duplicate item: "</span> <span class="o">+</span> <span class="nb">str</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="s">'id'</span><span class="p">]))</span> <span class="k">else</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">seen_ids</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="s">'id'</span><span class="p">])</span> <span class="k">return</span> <span class="n">item</span> <span 
class="k">class</span> <span class="nc">WriterPipeline</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'messages.txt'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">)</span> <span class="k">def</span> <span class="nf">process_item</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">item</span><span class="p">,</span> <span class="n">spider</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'###### '</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="s">'id'</span><span class="p">]))</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">item</span><span class="p">[</span><span class="s">'text'</span><span class="p">])</span> <span class="bp">self</span><span class="p">.</span><span class="n">output_file</span><span class="p">.</span><span 
class="n">write</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span> <span class="k">return</span> <span class="n">item</span> </code></pre></div></div> <p>The spider can be configured in <code class="language-plaintext highlighter-rouge">mybot/settings.py</code>, which also defines what pipeline components will be used to process the extracted items:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">BOT_NAME</span> <span class="o">=</span> <span class="s">'mybot'</span> <span class="n">DOWNLOAD_DELAY</span> <span class="o">=</span> <span class="mf">0.5</span> <span class="n">SPIDER_MODULES</span> <span class="o">=</span> <span class="p">[</span><span class="s">'mybot.spiders'</span><span class="p">]</span> <span class="n">NEWSPIDER_MODULE</span> <span class="o">=</span> <span class="s">'mybot.spiders'</span> <span class="n">ITEM_PIPELINES</span> <span class="o">=</span> <span class="p">[</span> <span class="s">'mybot.pipelines.DuplicatesPipeline'</span><span class="p">,</span> <span class="s">'mybot.pipelines.WriterPipeline'</span> <span class="p">]</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">DOWNLOAD_DELAY</code> specifies a time in seconds to wait before downloading consecutive pages from the same site. This is important because downloading too fast may cause your bot to overload the web server, prevent the site from being used, or make the site block requests from your bot.</p> <h3 id="crawling-static-web-pages">Crawling static web pages</h3> <p>The actual spider class that extracts the data items from HTML pages you’ll have to implement in <code class="language-plaintext highlighter-rouge">mybot/spiders/mysite_spider.py</code>. Crawling a site that uses static HTML is straightforward. 
Every site structure is a bit different, though, so you’ll have to customize the spider for each site. You need to specify the rules for following links to different conversations, and implement a function that parses the messages from a conversation page. This is how the file usually starts:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scrapy.spiders</span> <span class="kn">import</span> <span class="n">CrawlSpider</span><span class="p">,</span> <span class="n">Rule</span> <span class="kn">from</span> <span class="nn">scrapy.linkextractors.sgml</span> <span class="kn">import</span> <span class="n">SgmlLinkExtractor</span> <span class="kn">from</span> <span class="nn">scrapy.selector</span> <span class="kn">import</span> <span class="n">Selector</span> <span class="kn">from</span> <span class="nn">mybot.items</span> <span class="kn">import</span> <span class="n">MessageItem</span> <span class="k">class</span> <span class="nc">MysiteSpider</span><span class="p">(</span><span class="n">CrawlSpider</span><span class="p">):</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"mysite-spider"</span> <span class="n">allowed_domains</span> <span class="o">=</span> <span class="p">[</span><span class="s">"mysite.com"</span><span class="p">]</span> <span class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span><span class="s">"http://mysite.com/viewforum.php"</span><span class="p">]</span> <span class="n">extractor</span> <span class="o">=</span> <span class="n">SgmlLinkExtractor</span><span class="p">(</span><span class="n">allow</span><span class="o">=</span><span class="p">(</span><span class="s">'view(topic|forum)\.php'</span><span class="p">))</span> <span class="n">rules</span> <span class="o">=</span> <span class="p">(</span> <span class="n">Rule</span><span class="p">(</span><span class="n">extractor</span><span 
class="p">,</span> <span class="n">callback</span><span class="o">=</span><span class="s">'parse_item'</span><span class="p">,</span> <span class="n">follow</span><span class="o">=</span><span class="bp">True</span><span class="p">),</span> <span class="p">)</span> <span class="k">def</span> <span class="nf">parse_item</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span> <span class="c1"># Parse messages from the response. </span></code></pre></div></div> <p>A <code class="language-plaintext highlighter-rouge">CrawlSpider</code> automates the process of following links on the loaded web pages. <code class="language-plaintext highlighter-rouge">start_urls</code> defines one or more top-level URLs, where to start crawling. <code class="language-plaintext highlighter-rouge">allowed_domains</code> limits the spider to links that point to this domain name. You probably want to stay inside the same domain and not to follow external links. <code class="language-plaintext highlighter-rouge">extractor</code> is an object that extracts the links that the spider should follow from a web page. You should provide a regular expression in the <code class="language-plaintext highlighter-rouge">allow</code> parameter that matches only those pages that contain conversations or links to conversation threads.</p> <p>Implementation of the <code class="language-plaintext highlighter-rouge">parse_item</code> function depends on the HTML structure of the site. It should iterate through all the messages in the HTML code, and yield a new item on each message. Iterating through the messages is facilitated by <a href="https://doc.scrapy.org/en/latest/topics/selectors.html">XPath selectors</a>, a mechanism for identifying HTML elements. 
At its simplest, the function could look something like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">parse_item</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span> <span class="n">selector</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">response</span><span class="p">)</span> <span class="k">for</span> <span class="n">message</span> <span class="ow">in</span> <span class="n">selector</span><span class="p">.</span><span class="n">xpath</span><span class="p">(</span><span class="s">'//div[@class="message"]'</span><span class="p">):</span> <span class="n">ids</span> <span class="o">=</span> <span class="n">message</span><span class="p">.</span><span class="n">xpath</span><span class="p">(</span><span class="s">'@id'</span><span class="p">).</span><span class="n">extract</span><span class="p">()</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">ids</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span> <span class="k">continue</span> <span class="nb">id</span> <span class="o">=</span> <span class="n">ids</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">text</span> <span class="o">=</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">message</span><span class="p">.</span><span class="n">xpath</span><span class="p">(</span><span class="s">'text()'</span><span class="p">).</span><span class="n">extract</span><span class="p">())</span> <span class="k">if</span> <span class="ow">not</span> <span class="n">text</span><span class="p">:</span> <span 
class="k">continue</span> <span class="n">item</span> <span class="o">=</span> <span class="n">MessageItem</span><span class="p">()</span> <span class="n">item</span><span class="p">[</span><span class="s">'id'</span><span class="p">]</span> <span class="o">=</span> <span class="nb">id</span> <span class="n">item</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span> <span class="o">=</span> <span class="n">text</span> <span class="k">yield</span> <span class="n">item</span> </code></pre></div></div> <p>The XPath <code class="language-plaintext highlighter-rouge">//div[@class="message"]</code> is used to select all <code class="language-plaintext highlighter-rouge">&lt;div&gt;</code> elements from the document with the class attribute set to “message”. A full introduction to XPath selectors is out of the scope of this post, but there are a few things that can be noted from the code. The <code class="language-plaintext highlighter-rouge">xpath()</code> method always returns a list, because there can be multiple elements that match the search. A useful feature is that the returned objects are selectors themselves that can be used to select nested objects. Without the leading <code class="language-plaintext highlighter-rouge">//</code>, an XPath selects child elements. In this case the <code class="language-plaintext highlighter-rouge">&lt;div&gt;</code> selector is used to select its id attribute and the text inside the element. The <code class="language-plaintext highlighter-rouge">extract()</code> method is used to convert a selector to a text string.</p> <p>So how do you find out how to identify the elements that you want to select? You could of course look at the HTML source code. An easier way, and the only way with dynamic web pages, is to use a browser that supports inspecting the HTML document. For example, in Google Chrome you can right-click an element and select “Inspect”. 
A panel showing the document structure appears below the web page. Inspecting a phpBB message looks something like this:</p> <p><img src="https://senarvi.github.io/assets/images/inspect-html-screenshot.png" alt="Inspecting an HTML element in Google Chrome" /></p> <p>In the above example, text inside the content div could be selected for example with the XPath <code class="language-plaintext highlighter-rouge">//div[@class="postbody"]/div[@class="content"]/text()</code>. An easy way to test it is to load the page in Scrapy shell, e.g.:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scrapy shell http://mysite.com/viewtopic.php?f<span class="o">=</span>123&amp;t<span class="o">=</span>456 </code></pre></div></div> <p>A Python interpreter opens with the <code class="language-plaintext highlighter-rouge">response</code> variable already set. Now you can select the text elements:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">response</span><span class="p">.</span><span class="n">xpath</span><span class="p">(</span><span class="s">'//div[@class="postbody"]/div[@class="content"]/text()'</span><span class="p">)</span> </code></pre></div></div> <p>Notice that this returns a list of selectors. Use the <code class="language-plaintext highlighter-rouge">extract()</code> method to get the text data.</p> <h3 id="crawling-dynamic-web-pages">Crawling dynamic web pages</h3> <p>It gets a bit more involved if the site creates pages dynamically, on the client side. Then the client needs to load the page and execute any JavaScript code, which is clearly too much functionality to be implemented in Scrapy. The <a href="http://www.seleniumhq.org/projects/webdriver/">Selenium WebDriver</a> library enables a script to drive a web browser, primarily intended for testing web applications. It can be used to load a web page in a browser, and read the generated HTML document. 
The document can then be parsed using Scrapy.</p> <p>Now that Scrapy is not able to read the links from the HTML pages, a <code class="language-plaintext highlighter-rouge">CrawlSpider</code> cannot be used to automatically crawl through the site. The solution is to derive from the simpler base class <code class="language-plaintext highlighter-rouge">Spider</code>, and manually extract and follow links. The base class reads just the start page and calls <code class="language-plaintext highlighter-rouge">parse()</code> method. The response that is passed to <code class="language-plaintext highlighter-rouge">parse()</code> is just the static page, so the method needs to reload the URL using Selenium. This is how I have structured the class:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scrapy.spiders</span> <span class="kn">import</span> <span class="n">Spider</span> <span class="kn">from</span> <span class="nn">scrapy.selector</span> <span class="kn">import</span> <span class="n">Selector</span> <span class="kn">from</span> <span class="nn">scrapy.http</span> <span class="kn">import</span> <span class="n">Request</span><span class="p">,</span> <span class="n">TextResponse</span> <span class="kn">from</span> <span class="nn">selenium</span> <span class="kn">import</span> <span class="n">webdriver</span> <span class="kn">from</span> <span class="nn">mybot.items</span> <span class="kn">import</span> <span class="n">MessageItem</span> <span class="kn">import</span> <span class="nn">time</span> <span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">deque</span> <span class="k">class</span> <span class="nc">MysiteSpider</span><span class="p">(</span><span class="n">Spider</span><span class="p">):</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"mysite-spider"</span> <span 
class="n">start_urls</span> <span class="o">=</span> <span class="p">[</span><span class="s">"http://mysite.com/forum"</span><span class="p">]</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">Spider</span><span class="p">.</span><span class="n">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="bp">self</span><span class="p">.</span><span class="n">verificationErrors</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># You can use a specific Firefox profile by </span> <span class="c1"># creating a selenium.webdriver.FirefoxProfile </span> <span class="c1"># and passing it to the webdriver constructor. </span> <span class="bp">self</span><span class="p">.</span><span class="n">selenium</span> <span class="o">=</span> <span class="n">webdriver</span><span class="p">.</span><span class="n">Firefox</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">seen_urls</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">url_queue</span> <span class="o">=</span> <span class="n">deque</span><span class="p">()</span> <span class="k">def</span> <span class="nf">__del__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">selenium</span><span class="p">.</span><span class="n">quit</span><span class="p">()</span> <span class="k">print</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">verificationErrors</span><span class="p">)</span> <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span 
class="p">):</span> <span class="c1"># response is the start page as seen by Scrapy. </span> <span class="c1"># Call parse_url() to reload it using Selenium. </span> <span class="k">def</span> <span class="nf">parse_url</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">page_url</span><span class="p">):</span> <span class="c1"># Load a URL using Selenium. </span></code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">seen_urls</code> is maintained to avoid processing the same URL multiple times. URLs to be visited are kept in <code class="language-plaintext highlighter-rouge">url_queue</code>. <code class="language-plaintext highlighter-rouge">parse()</code> reads just the URL from the response, and calls <code class="language-plaintext highlighter-rouge">parse_url()</code> to load the document in a browser and process it. It will also call <code class="language-plaintext highlighter-rouge">parse_url()</code> for any URLs that are added to <code class="language-plaintext highlighter-rouge">url_queue</code> while processing a page.</p> <p><code class="language-plaintext highlighter-rouge">parse_url()</code> may have to wait a few seconds for the document to load. Then it creates a new response and XPath selector for parsing the document. It has to identify the links that should be followed, either using an XPath, or based on the URL. If the URL is found in <code class="language-plaintext highlighter-rouge">seen_urls</code>, it has been queued already. Links to be processed can be added to <code class="language-plaintext highlighter-rouge">url_queue</code>. 
It’s also possible to yield requests that Scrapy will process using a different handler.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">response</span><span class="p">):</span> <span class="n">requests</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">parse_url</span><span class="p">(</span><span class="n">response</span><span class="p">.</span><span class="n">url</span><span class="p">)]</span> <span class="k">while</span> <span class="bp">self</span><span class="p">.</span><span class="n">url_queue</span><span class="p">:</span> <span class="n">url</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">url_queue</span><span class="p">.</span><span class="n">popleft</span><span class="p">()</span> <span class="n">requests</span><span class="p">.</span><span class="n">extend</span><span class="p">([</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">parse_url</span><span class="p">(</span><span class="n">url</span><span class="p">)])</span> <span class="k">for</span> <span class="n">request</span> <span class="ow">in</span> <span class="n">requests</span><span class="p">:</span> <span class="k">yield</span> <span class="n">request</span> <span class="k">def</span> <span class="nf">parse_url</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">page_url</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">selenium</span><span 
class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">page_url</span><span class="p">)</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">selenium</span><span class="p">:</span> <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span> <span class="n">page_source</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">selenium</span><span class="p">.</span><span class="n">page_source</span> <span class="n">response</span> <span class="o">=</span> <span class="n">TextResponse</span><span class="p">(</span><span class="n">url</span><span class="o">=</span><span class="n">page_url</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="n">page_source</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">)</span> <span class="n">selector</span> <span class="o">=</span> <span class="n">Selector</span><span class="p">(</span><span class="n">response</span><span class="p">)</span> <span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">selector</span><span class="p">.</span><span class="n">xpath</span><span class="p">(</span><span class="s">'//a'</span><span class="p">):</span> <span class="n">urls</span> <span class="o">=</span> <span class="n">link</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="s">'@href'</span><span class="p">).</span><span class="n">extract</span><span class="p">()</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">urls</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">1</span><span class="p">:</span> <span class="k">continue</span> <span class="n">url</span> 
<span class="o">=</span> <span class="n">urls</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">if</span> <span class="n">url</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">seen_urls</span><span class="p">:</span> <span class="k">continue</span> <span class="bp">self</span><span class="p">.</span><span class="n">seen_urls</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="k">if</span> <span class="n">is_forum_url</span><span class="p">(</span><span class="n">url</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">url_queue</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">url</span><span class="p">)</span> <span class="c1"># It's possible to yield a request that will </span> <span class="c1"># be handled by a custom callback function. 
</span> <span class="k">if</span> <span class="n">is_special_url</span><span class="p">(</span><span class="n">url</span><span class="p">):</span> <span class="k">yield</span> <span class="n">Request</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">callback</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">parse_special</span><span class="p">)</span> </code></pre></div></div>Seppo Enarviseppo2021@marjaniemi.comHow to use Scrapy to obtain vast amounts of data for language modelingKaldi lattices2016-12-15T00:00:00+01:002016-12-15T00:00:00+01:00https://senarvi.github.io/kaldi-lattices<h2 id="introduction-to-working-with-kaldi-lattices-and-differences-to-slf">Introduction to working with Kaldi lattices, and differences to SLF</h2> <h3 id="differences-between-slf-and-kaldi-lattices">Differences between SLF and Kaldi lattices</h3> <p>During speech recognition, the decoder explores a search space that in principle contains every possible sentence. It is often useful to save a subset of this search space as a directed graph, where the nodes or edges correspond to word hypotheses. The scores computed using acoustic and language models are saved in the graph so that it can be efficiently rescored and decoded using different, perhaps computationally more demanding, language models.</p> <p>HTK recognizer defined the Standard Lattice Format (SLF), which has been widely adopted. It is basically a text file that lists the graph nodes and edges. The nodes may contain time stamps so that the words can be aligned with audio. Usually the word identities, and language model and acoustic scores (log-likelihoods) are stored in the edges.</p> <p>Kaldi represents many components of the speech recognition system as finite-state transducers (FSTs), including hidden Markov models, language model, and pronunciation dictionary. 
FSTs are finite-state machines that produce output when they transition from one state to the next. This is nothing but a certain kind of directed graph, so lattices in Kaldi are naturally also represented as an FST. In this case the output is the word identities.</p> <p>The FSTs in Kaldi are weighted, meaning that each arc is associated with a cost of taking that transition. In lattice FSTs the weights are defined by the acoustic and language model scores. The numbers are a bit different from the scores that one finds in SLF lattices:</p> <ol> <li>Kaldi lattices use costs, i.e. negative log-likelihoods, instead of log-likelihoods.</li> <li>Kaldi scripts may “push” weights towards the beginning of the graph, so that for example language model probabilities cannot be interpreted as individual word probabilities, but only the sum along an entire path through the lattice is meaningful.</li> <li>When a neural network acoustic model is used, the probabilities are not correctly normalized. The neural network predicts posterior probabilities, which will be divided by the prior probabilities of the HMM states to obtain pseudo-likelihoods (likelihoods scaled by a constant scaling factor). It is possible for the pseudo-probabilities to be higher than one.</li> <li>Language model probability is incorporated in <em>graph cost</em>, which also includes pronunciation, transition, and silence probabilities.</li> </ol> <p>The last item requires some consideration when rescoring language model probabilities. In order to preserve the rest of the graph cost, one could subtract the old language model cost from the graph cost, and add the remaining value to the acoustic cost (which will not be modified during rescoring). However, if the lexicon doesn’t include pronunciation probabilities, and silence probabilities are not estimated, this might not be worth the trouble. 
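To make the arithmetic concrete, here is a small Python sketch of the graph-cost bookkeeping described above (the function name and the cost values are illustrative assumptions; in practice the rescoring is done with Kaldi tools operating on the lattice weights, not with Python like this):

```python
import math

def rescore_costs(graph_cost, acoustic_cost, old_lm_cost, new_lm_cost):
    """Replace the language model contribution inside the graph cost.

    The residual (pronunciation, transition, and silence costs) is moved
    to the acoustic cost, which rescoring leaves untouched. The new graph
    cost is then just the new language model cost.
    """
    residual = graph_cost - old_lm_cost
    return new_lm_cost, acoustic_cost + residual

# Made-up example: the old LM assigned probability 0.001 to a word,
# the new LM assigns 0.01. Costs are negative log-probabilities.
old_lm_cost = -math.log(0.001)
new_lm_cost = -math.log(0.01)
new_graph, new_acoustic = rescore_costs(10.5, 14.7, old_lm_cost, new_lm_cost)
```

If pronunciation and silence probabilities are missing, the residual consists mostly of transition probabilities, which is why skipping this step often makes little difference.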
The default silence probability of <span>$0.5$</span> can be compensated by a word insertion penalty of <span>$\log(2)$</span>, and the transition probabilities don’t make much difference.</p> <h3 id="a-look-inside-kaldi-lattices">A look inside Kaldi lattices</h3> <p>There are actually two alternative FST representations for lattices in Kaldi, implemented by the Lattice and CompactLattice classes. They differ in how they store the <em>transition IDs</em>, which identify the HMM states. Lattice stores the transition IDs on its input labels, while CompactLattice includes them in the weights. (The concept of weight in Kaldi is pretty liberal in what it may contain. Multiplying the weights of a CompactLattice will concatenate the transition ID sequences.)</p> <p>Kaldi stores lattices in its general-purpose archive format, which can be either binary or text. Usually the lattices are saved in binary CompactLattice form. The <code class="language-plaintext highlighter-rouge">lattice-copy</code> command can be used for converting between binary and text formats, as well as between Lattice and CompactLattice. One file may contain multiple lattices. The decode scripts in the example recipes save the recognition lattices from job N into <code class="language-plaintext highlighter-rouge">lat.N.gz</code>.</p> <p>In order to see what a lattice file contains, one can simply convert it to text form using <code class="language-plaintext highlighter-rouge">lattice-copy</code>. 
Words are stored in lattices as integer IDs, so the output makes more sense if the IDs (third field) are converted to words using the mapping found from <code class="language-plaintext highlighter-rouge">words.txt</code> in the lang directory:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lattice-copy <span class="s2">"ark:zcat DECODE-DIR/lat.N.gz |"</span> ark,t:- | utils/int2sym.pl <span class="nt">-f</span> 3 LANG-DIR/words.txt </code></pre></div></div> <p>The firts line of the output contains the utterance ID. The following lines each contain start and end state, word, and a comma-separated list of weights. If the lattice is in CompactLattice format, the weights include acoustic cost, graph cost, and transition IDs:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1-1417560 0 1 tästä 14.6843,-224.556,3_12_18_17_17_17_17_17_17_17_17_17_17_. 0 7275 täst 16.5506,-256.026,3_12_18_17_17_17_17_17_17_17_17_17_.. 1 2 on 2.84427,-10.4875, 2 3 hyvä 3.4889,-118.49,22729_22738_22884_3502_3718_3788_17302_... 3 4 jatkaa 11.5558,-299.218,25331_3552_3636_3635_3738_3737_11434_. </code></pre></div></div> <p>In the above example, the first transition from state 0 to state 1 is associated with language model cost 14.6843. The graph cost -224.556 is negative, which is possible because a neural network acoustic model has been used to create the lattice. The transition IDs start with a long sequence of silence states. There are no transition IDs associated with the word <code class="language-plaintext highlighter-rouge">on</code>. 
This is because generally the weights are not synchronized with each other and the word identities—it is only meaningful to look at a whole path from the initial node to the final node.</p> <p>Often it is useful to align the transition IDs with word boundaries, so that one arc corresponds to one word, by calling <code class="language-plaintext highlighter-rouge">lattice-align-words</code> first:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lattice-align-words LANG-DIR/phones/word_boundary.int <span class="se">\</span> MODEL-DIR/final.mdl <span class="se">\</span> <span class="s2">"ark:zcat DECODE-DIR/lat.N.gz |"</span> ark,t:- | utils/int2sym.pl <span class="nt">-f</span> 3 LANG-DIR/words.txt </code></pre></div></div> <p>Now the output starts with two transitions that produce the <code class="language-plaintext highlighter-rouge">&lt;eps&gt;</code> token. It is a special token meaning that there is no word on this arc. The arcs correspond to the silence at the beginning of the utterance. The transition IDs on the other arcs correspond to the words:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1-1417560 0 4 &lt;eps&gt; 14.6843,-224.556,3_12_18_17_17_17_17_17_17_17_17_17_17_. 0 1 &lt;eps&gt; 16.5506,-256.026,3_12_18_17_17_17_17_17_17_17_17_17_17_. 1 2 täst 0,0,22026_22025_22025_22025_22025_22152_22151_22314_22313 2 3 on 8.15453,-140.39,17450_17449_17606_17714_16186_16226_16462 3 7 hyvä 11.6047,-299.218,7496_7596_7760_25852_25906_25905_25972_. 4 5 tästä 6.33317,-128.978,22026_22025_22025_22025_22025_22152_... 5 6 on 0,0,17302_17538_17656_16186_16226_16462 6 7 hyvä 11.5558,-299.218,7496_7596_7760_25852_25906_25972_25226_. 
7 8 jatkaa 0,0,11434_11433_11433_11482_11534_11533_11533_1910_1909 </code></pre></div></div> <h3 id="decoding-best-path-and-n-best-lists">Decoding best path and N-best lists</h3> <p><code class="language-plaintext highlighter-rouge">lattice-best-path</code> is a simple utility to decode the best path of lattices and output word IDs. A word symbol table can be provided for mapping the IDs to words, but that affects the debug output only. In order to write text transcripts, one can output in text format and convert the word IDs to words using <code class="language-plaintext highlighter-rouge">int2sym.pl</code> (excluding the first field, which is the utterance ID):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lattice-best-path <span class="nt">--lm-scale</span><span class="o">=</span>12 <span class="se">\</span> <span class="nt">--word-symbol-table</span><span class="o">=</span>LANG-DIR/words.txt <span class="se">\</span> <span class="s2">"ark:zcat DECODE-DIR/lat.N.gz |"</span> ark,t:- | utils/int2sym.pl <span class="nt">-f</span> 2- LANG-DIR/words.txt <span class="o">&gt;</span>transcript.ref </code></pre></div></div> <p>N-best lists in Kaldi are represented as lattices with n distinct (linear) paths. A lattice with n best paths can be decoded using <code class="language-plaintext highlighter-rouge">lattice-nbest</code>, and the single best path can be decoded using <code class="language-plaintext highlighter-rouge">lattice-1best</code>. 
Transition IDs, language model costs, acoustic costs, and transcripts can be extracted from linear FSTs using the <code class="language-plaintext highlighter-rouge">nbest-to-linear</code> utility:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lattice-1best <span class="nt">--lm-scale</span><span class="o">=</span>12 <span class="se">\</span> <span class="s2">"ark:zcat DECODE-DIR/lat.N.gz |"</span> ark:- | nbest-to-linear ark:- <span class="se">\</span> ark,t:transition-ids.txt <span class="se">\</span> ark,t:- <span class="se">\</span> ark,t:lm-costs.txt <span class="se">\</span> ark,t:acoustic-costs.txt | utils/int2sym.pl <span class="nt">-f</span> 2- LANG-DIR/words.txt <span class="o">&gt;</span>transcript.ref </code></pre></div></div> <p>A linear lattice can be converted into a time-marked CTM transcript using <code class="language-plaintext highlighter-rouge">nbest-to-ctm</code>, which is useful if you need to know when each word was spoken. First the lattice must be aligned with word boundaries. 
Again the word IDs (the fifth field) need to be converted to words:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lattice-1best <span class="nt">--lm-scale</span><span class="o">=</span>12 <span class="se">\</span> <span class="s2">"ark:zcat DECODE-DIR/lat.N.gz |"</span> ark:- | lattice-align-words LANG-DIR/phones/word_boundary.int <span class="se">\</span> MODEL-DIR/final.mdl <span class="se">\</span> ark:- ark:- | nbest-to-ctm ark:- - | utils/int2sym.pl <span class="nt">-f</span> 5 LANG-DIR/words.txt <span class="o">&gt;</span>transcript.ctm </code></pre></div></div> <p>The CTM file will contain five fields on each line: utterance ID, audio channel, begin time in seconds, duration in seconds, and the word.</p> <h3 id="pruning-and-evaluating-lattices">Pruning and evaluating lattices</h3> <p>Some operations, for example rescoring with a neural network language model, can take a lot of time if the lattices are large. <code class="language-plaintext highlighter-rouge">lattice-prune</code> can be used to prune paths that are not close enough to the best path, in the same way as <code class="language-plaintext highlighter-rouge">lattice-tool -posterior-prune</code> does (from SRILM toolkit). The threshold for pruning is given in the <code class="language-plaintext highlighter-rouge">--beam</code> argument:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>lattice-prune <span class="nt">--inv-acoustic-scale</span><span class="o">=</span>12 <span class="nt">--beam</span><span class="o">=</span>5 <span class="se">\</span> <span class="s2">"ark:zcat DECODE-DIR/lat.N.gz |"</span> <span class="se">\</span> <span class="s2">"ark:| gzip &gt;PRUNED-DIR/lat.N.gz"</span> </code></pre></div></div> <p>When optimizing the lattice size, it is helpful to know the oracle word error rate. 
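The oracle word error rate deserves a brief definition: for a single hypothesis, the error count is the Levenshtein (edit) distance between the hypothesis and reference word sequences, and the lattice oracle is the minimum of this count over all paths through the lattice. A small self-contained edit-distance sketch in plain Python (for illustration only; this is not part of Kaldi):

```python
def edit_distance(ref, hyp):
    """Number of substitutions, insertions, and deletions needed to
    turn the hypothesis word list into the reference word list."""
    # Dynamic programming over prefix lengths (one row at a time).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

# Two errors: one substitution (tästä -> täst) and one deletion (hyvä).
errors = edit_distance('tästä on hyvä jatkaa'.split(), 'täst on jatkaa'.split())
```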
<code class="language-plaintext highlighter-rouge">lattice-oracle</code> computes the minimum error rate that can be obtained from the lattice, similarly to what can be done using the SRILM command <code class="language-plaintext highlighter-rouge">lattice-tool -ref-file</code>. The command takes as input the lattice file and the reference transcript. Transcripts are expected to be in the form of an utterance ID followed by word IDs, so <code class="language-plaintext highlighter-rouge">utils/sym2int.pl</code> should be used to map the words to word IDs. Any out-of-vocabulary words in the reference transcripts need to be mapped to the <code class="language-plaintext highlighter-rouge">[oov]</code> tag first:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>utils/sym2int.pl <span class="nt">--map-oov</span> <span class="o">[</span>oov] <span class="nt">-f</span> 2- LANG-DIR/words.txt <span class="se">\</span> &lt;transcript.ref | lattice-oracle <span class="nt">--word-symbol-table</span><span class="o">=</span>LANG-DIR/words.txt <span class="se">\</span> <span class="s2">"ark:zcat PRUNED-DIR/lat.N.gz |"</span> <span class="se">\</span> ark:- <span class="se">\</span> ark,t:oracle-transcript.int </code></pre></div></div> <p>The program also displays the oracle word sequence, and writes the word IDs to the file given in the third positional argument.</p> <h3 id="converting-kaldi-lattices-to-slf">Converting Kaldi lattices to SLF</h3> <p>Kaldi lattices can be converted to SLF for processing with external tools. 
There are two scripts in the <code class="language-plaintext highlighter-rouge">egs/wsj/s5/utils</code> directory that are designed for that: <code class="language-plaintext highlighter-rouge">convert_slf.pl</code> converts a single lattice, and <code class="language-plaintext highlighter-rouge">convert_slf_parallel.sh</code> supports converting a batch of lattices in a compute cluster.</p> <p>The command for submitting batch jobs is given to <code class="language-plaintext highlighter-rouge">convert_slf_parallel.sh</code> in the <code class="language-plaintext highlighter-rouge">--cmd</code> argument as usual. The script also takes the path to the data directory (just for checking that the number of lattices matches the number of utterances), the lang directory that contains the necessary word tables, and the decode directory where the lattices are:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>utils/convert_slf_parallel.sh <span class="nt">--cmd</span> BATCH-CMD <span class="se">\</span> DATA-DIR <span class="se">\</span> LANG-DIR <span class="se">\</span> DECODE-DIR </code></pre></div></div> <p>Note that the language model scores in the created SLF lattices are actually graph scores, as explained above.</p>Seppo Enarviseppo2021@marjaniemi.comIntroduction to working with Kaldi lattices, and differences to SLFAvoiding problems in Theano computation graphs2016-11-20T00:00:00+01:002016-11-20T00:00:00+01:00https://senarvi.github.io/avoiding-problems-in-theano-graphs<h2 id="notes-on-writing-testing-and-debugging-theano-computation-graphs">Notes on writing, testing, and debugging Theano computation graphs</h2> <h3 id="numpy-as-a-reference">NumPy as a reference</h3> <p>The Theano interface is made as similar to NumPy as possible. Theano code often closely resembles NumPy code, but the interface is limited and some differences are necessary because of how Theano works. 
If you’re not confident that the matrix operations do what you intended, you can easily test them by running the same operations in an interactive Python session using NumPy.</p> <p>While Theano documentation is not perfect, it often helps to look at the corresponding NumPy documentation. If you’re new to both Theano and NumPy, you should at least familiarize yourself with <a href="http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html">broadcasting</a>, and <a href="https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html">slicing and indexing</a>, which are explained more thoroughly in NumPy documentation.</p> <p>A couple of notable deviations from NumPy are worth mentioning, though. Theano uses integers to represent booleans. Thus you cannot index a matrix with a boolean matrix. You’ll have to convert the boolean matrix to a matrix of indices with <code class="language-plaintext highlighter-rouge">nonzero()</code>. And since Theano never modifies a tensor, to create a tensor where a subset of the matrix elements has been updated, you need to call <code class="language-plaintext highlighter-rouge">set_subtensor()</code>. 
The following example sets the data elements to zero where the mask is nonzero:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span><span class="p">,</span> <span class="n">function</span> <span class="n">data</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'data'</span><span class="p">)</span> <span class="n">mask</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'mask'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int8'</span><span class="p">)</span> <span class="n">indices</span> <span class="o">=</span> <span class="n">mask</span><span class="p">.</span><span class="n">nonzero</span><span class="p">()</span> <span class="n">output</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">set_subtensor</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">indices</span><span class="p">],</span> <span class="mi">0</span><span class="p">)</span> <span class="n">f</span> <span class="o">=</span> <span class="n">function</span><span class="p">([</span><span class="n">data</span><span class="p">,</span> <span class="n">mask</span><span class="p">],</span> <span class="n">output</span><span class="p">)</span> <span class="n">toy_data</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">9</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span 
class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">toy_mask</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int8'</span><span class="p">)</span> <span class="n">numpy</span><span class="p">.</span><span class="n">fill_diagonal</span><span class="p">(</span><span class="n">toy_mask</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">print</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">toy_data</span><span class="p">,</span> <span class="n">toy_mask</span><span class="p">))</span> </code></pre></div></div> <p>The example produces a matrix whose main diagonal has been zeroed out:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[[</span> <span class="mf">0.</span> <span class="mf">1.</span> <span class="mf">2.</span><span class="p">]</span> <span class="p">[</span> <span class="mf">3.</span> <span class="mf">0.</span> <span class="mf">5.</span><span class="p">]</span> <span class="p">[</span> <span class="mf">6.</span> <span class="mf">7.</span> <span class="mf">0.</span><span class="p">]]</span> </code></pre></div></div> <p>In NumPy you could simply modify the array in place, using <code class="language-plaintext highlighter-rouge">data[mask == 1] = 0</code>.</p> <h3 id="test-values">Test values</h3> <p>The biggest difference when writing a function with Theano versus NumPy is obviously that when expressing a mathematical operation using Theano, the Python code doesn’t process the actual data, but symbolic variables that will be used to construct the computation graph. 
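</p> <p>Since the Python code only manipulates symbolic variables, it often pays to prototype the same operation on concrete arrays first, as suggested in the previous section. Below is a minimal NumPy sketch of the masking example above, using the same toy data; nothing in it is Theano-specific:</p>

```python
import numpy

# Same toy data as in the Theano example above.
data = numpy.arange(9).reshape(3, 3).astype(float)
mask = numpy.zeros((3, 3), dtype='int8')
numpy.fill_diagonal(mask, 1)

# In NumPy the array can simply be modified in place.
data[mask == 1] = 0
print(data)  # the diagonal is zeroed: [[0. 1. 2.], [3. 0. 5.], [6. 7. 0.]]
```

<p>If this produces the expected matrix, the symbolic version built with <code class="language-plaintext highlighter-rouge">set_subtensor()</code> should behave the same way on actual data.</p> <p>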
The concept is easy to understand, but also easy to forget when you’re writing a Theano function, and it makes debugging somewhat harder.</p> <p>The fact that the actual data is not known when building the computation graph makes it difficult for Theano to produce understandable error messages. A solution to this is to always set a test value when creating a symbolic variable. This allows Theano to produce an error message at the exact location where you are adding an invalid operation to the graph.</p> <p>The test value is a NumPy matrix. Usually random numbers work. The important thing is that the shape and data type (and maybe the range of values) correspond to the data used in the actual application. Typical errors are caused by a mismatch in the dimensionality or shape of the arguments of an operation. The error messages can still be quite difficult to interpret. The nice thing is that you can also print the computed test values (and their <code class="language-plaintext highlighter-rouge">dtype</code>, <code class="language-plaintext highlighter-rouge">shape</code>, etc. attributes).</p> <p>Evaluation of the expressions using test values is enabled by setting the <code class="language-plaintext highlighter-rouge">compute_test_value</code> configuration attribute. Naturally, executing the graph using the test values introduces computational overhead, so you probably don’t want to keep this enabled except when you need to debug an error in the graph. 
The example below will almost certainly print 9, which has been computed using the test value provided for the input data, even though the actual data is not yet available:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span><span class="p">,</span> <span class="n">config</span> <span class="n">config</span><span class="p">.</span><span class="n">compute_test_value</span> <span class="o">=</span> <span class="s">'warn'</span> <span class="n">data</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'data'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int64'</span><span class="p">)</span> <span class="n">data</span><span class="p">.</span><span class="n">tag</span><span class="p">.</span><span class="n">test_value</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="mi">100</span><span class="p">))</span> <span class="n">maximum</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="nb">max</span><span class="p">()</span> <span class="k">print</span><span class="p">(</span><span class="n">maximum</span><span class="p">.</span><span class="n">tag</span><span class="p">.</span><span class="n">test_value</span><span class="p">)</span> </code></pre></div></div> <h3 id="printing">Printing</h3> 
<p>Printing the actual value of a tensor during computation is possible using a <code class="language-plaintext highlighter-rouge">theano.printing.Print</code> operation. The constructor takes an optional message argument. The data that is given to the created operation object as an argument will be printed during execution of the graph, and also passed on as the output of the operation. In order to print something, the computation graph has to use the value returned by the print operation. Printing e.g. the shape of a matrix would be difficult, but luckily the <code class="language-plaintext highlighter-rouge">Print</code> constructor takes another parameter <code class="language-plaintext highlighter-rouge">attrs</code> for the purpose of printing certain attributes instead of the value of a tensor.</p> <p>If you encounter an error while compiling the function, this doesn’t help. In that case you can print only the test value. But the print operation can be used to print values computed from the actual inputs, if necessary. 
This example prints <code class="language-plaintext highlighter-rouge">identity shape = (3, 3)</code> during the execution of the graph:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span><span class="p">,</span> <span class="n">printing</span><span class="p">,</span> <span class="n">function</span> <span class="n">data</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'data'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int64'</span><span class="p">)</span> <span class="n">identity</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">identity_like</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="n">print_op</span> <span class="o">=</span> <span class="n">printing</span><span class="p">.</span><span class="n">Print</span><span class="p">(</span><span class="s">"identity"</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">[</span><span class="s">'shape'</span><span class="p">])</span> <span class="n">identity</span> <span class="o">=</span> <span class="n">print_op</span><span class="p">(</span><span class="n">identity</span><span class="p">)</span> <span class="n">f</span> <span class="o">=</span> <span class="n">function</span><span class="p">([</span><span class="n">data</span><span class="p">],</span> <span class="n">identity</span><span class="p">.</span><span class="nb">sum</span><span class="p">())</span> <span class="n">toy_data</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span 
class="n">arange</span><span class="p">(</span><span class="mi">9</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">f</span><span class="p">(</span><span class="n">toy_data</span><span class="p">)</span> </code></pre></div></div> <h3 id="assertions">Assertions</h3> <p>Often one would like to make sure, for example, that the result of an operation has the correct shape. There is an assertion operation that works in the same way as the print operation. A <code class="language-plaintext highlighter-rouge">theano.tensor.opt.Assert</code> object is added somewhere in the graph. The constructor takes an optional message argument. The first argument is the data that will be passed on as the output of the operation, and the second argument is the assertion. Note that the assertion has to be a tensor, so for comparison you’ll have to use <code class="language-plaintext highlighter-rouge">theano.tensor.eq()</code>, <code class="language-plaintext highlighter-rouge">theano.tensor.neq()</code>, etc. 
The example below verifies that the number of output and input dimensions are equal:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span><span class="p">,</span> <span class="n">function</span> <span class="n">data</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">matrix</span><span class="p">(</span><span class="s">'data'</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="s">'int64'</span><span class="p">)</span> <span class="n">identity</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">identity_like</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="n">assert_op</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">opt</span><span class="p">.</span><span class="n">Assert</span><span class="p">(</span><span class="s">"Shape mismatch!"</span><span class="p">)</span> <span class="n">output</span> <span class="o">=</span> <span class="n">assert_op</span><span class="p">(</span><span class="n">identity</span><span class="p">,</span> <span class="n">tensor</span><span class="p">.</span><span class="n">eq</span><span class="p">(</span><span class="n">identity</span><span class="p">.</span><span class="n">ndim</span><span class="p">,</span> <span class="n">data</span><span class="p">.</span><span class="n">ndim</span><span class="p">))</span> <span class="n">f</span> <span class="o">=</span> <span class="n">function</span><span class="p">([</span><span class="n">data</span><span class="p">],</span> <span class="n">output</span><span class="p">)</span> <span class="n">toy_data</span> <span class="o">=</span> 
<span class="n">numpy</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">9</span><span class="p">).</span><span class="n">reshape</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">f</span><span class="p">(</span><span class="n">toy_data</span><span class="p">)</span> </code></pre></div></div> <p>Assertions are not very convenient to use either, and in case an assertion fails, the printed message usually gives you less information than if you simply let the computation continue until the next error.</p> <h3 id="unit-tests">Unit tests</h3> <p>Testing the correctness of some higher level functions that use neural networks is difficult because of the nondeterministic nature of neural networks. I have separated the network structure from the classes that create and use the actual Theano functions. If I want to write a unit test for a function that performs some operation on neural network output, I replace the neural network with a simple dummy network.</p> <p>So let’s say I have a class <code class="language-plaintext highlighter-rouge">Processor</code> that uses neural network output to perform some task. 
The function <code class="language-plaintext highlighter-rouge">process_file()</code> reads input from one file and writes output to another file.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">theano</span> <span class="kn">import</span> <span class="n">tensor</span><span class="p">,</span> <span class="n">function</span> <span class="k">class</span> <span class="nc">NeuralNetwork</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="nb">input</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">scalar</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">complex_theano_operation</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="nb">input</span><span class="p">)</span> <span class="k">class</span> <span class="nc">Processor</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">network</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_output</span> <span class="o">=</span> <span class="n">function</span><span class="p">([</span><span class="n">network</span><span class="p">.</span><span class="nb">input</span><span class="p">],</span> <span class="n">network</span><span class="p">.</span><span class="n">output</span><span class="p">)</span> <span class="k">def</span> <span class="nf">process_file</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_file</span><span 
class="p">,</span> <span class="n">output_file</span><span class="p">):</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">input_file</span><span class="p">:</span> <span class="n">output_file</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">compute_output</span><span class="p">(</span><span class="nb">float</span><span class="p">(</span><span class="n">line</span><span class="p">)))</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> </code></pre></div></div> <p>When writing unit tests for Processor, I would create a dummy neural network that produces simple deterministic output, then pass that dummy network to Processor before testing its functions. The trivial example below tries to illustrate this approach:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">DummyNetwork</span><span class="p">:</span> <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="bp">self</span><span class="p">.</span><span class="nb">input</span> <span class="o">=</span> <span class="n">tensor</span><span class="p">.</span><span class="n">scalar</span><span class="p">()</span> <span class="bp">self</span><span class="p">.</span><span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="nb">input</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">class</span> <span class="nc">ProcessorTest</span><span class="p">(</span><span class="n">unittest</span><span class="p">.</span><span class="n">TestCase</span><span class="p">):</span> <span 
class="k">def</span> <span class="nf">test_process</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span> <span class="n">network</span> <span class="o">=</span> <span class="n">DummyNetwork</span><span class="p">()</span> <span class="n">processor</span> <span class="o">=</span> <span class="n">Processor</span><span class="p">(</span><span class="n">network</span><span class="p">)</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'input.txt'</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">input_file</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="s">'output.txt'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">output_file</span><span class="p">:</span> <span class="n">processor</span><span class="p">.</span><span class="n">process_file</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="n">output_file</span><span class="p">)</span> <span class="c1"># Assert that each line in the output file equals the </span> <span class="c1"># corresponding line in the input file plus one. </span></code></pre></div></div> <h3 id="performance-issues-in-computation-graph">Performance issues in computation graph</h3> <p>Performance problems can be very challenging to track down. Looking at the computation graph is necessary to know what Theano is actually doing under the hood. Analyzing the graph is easier if you first try to simplify the computation, as long as the performance issue doesn’t disappear.</p> <p>Print the computation graph using <code class="language-plaintext highlighter-rouge">theano.printing.debugprint()</code>. 
You can print the graph at any point when you’re constructing it, but only the final graph compiled using <code class="language-plaintext highlighter-rouge">theano.function()</code> shows the actual operations and memory transfers that will take place. You can display the compiled graph of function <code class="language-plaintext highlighter-rouge">f</code> using <code class="language-plaintext highlighter-rouge">theano.printing.debugprint(f)</code>. If you have Graphviz and pydot installed, you can even print a pretty image using <code class="language-plaintext highlighter-rouge">theano.printing.pydotprint(f, outfile="graph.png")</code>.</p> <p>One thing that can be immediately noted on the graph is the <code class="language-plaintext highlighter-rouge">HostFromGpu</code> and <code class="language-plaintext highlighter-rouge">GpuFromHost</code> operations. These are the expensive memory transfers between the host computer and the GPU memory. You can also notice from the names of the operations of the compiled graph, whether they run on GPU or not—GPU operations have the Gpu prefix. 
Ideally your shared variables are stored on the GPU and you have only one <code class="language-plaintext highlighter-rouge">HostFromGpu</code> operation in the end, as in the graph below:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HostFromGpu [id A] '' 136 |GpuElemwise{Composite{((-i0) / i1)}}[(0, 0)] [id B] '' 133 |GpuCAReduce{add}{1,1} [id C] '' 128 | |GpuElemwise{Composite{((log((i0 + (i1 / i2))) + i3) * i4)}} | |CudaNdarrayConstant{[[ 9.99999997e-07]]} [id E] | |GpuElemwise{true_div,no_inplace} [id F] '' 119 | | |GpuElemwise{Exp}[(0, 0)] [id G] '' 118 | | | |GpuReshape{2} [id H] '' 116 | | | |GpuElemwise{Add}[(0, 1)] [id I] '' 114 | | | | |GpuReshape{3} [id J] '' 112 | | | | | |GpuCAReduce{add}{0,1} [id K] '' 110 | | | | | | |GpuReshape{2} [id L] '' 108 | | | | | | |GpuElemwise{mul,no_inplace} [id M] '' 66 | | | | | | | |GpuDimShuffle{0,1,x,2} [id N] '' 58 | | | | | | | | |GpuReshape{3} [id O] '' 55 | | | | | | | | |GpuAdvancedSubtensor1 [id P] '' 35 | | | | | | | | | |layers/projection_layer/W [id Q] </code></pre></div></div> <p>Some operations force memory to be transferred back and forth. If you’re still using the old GPU backend (<code class="language-plaintext highlighter-rouge">device=gpu</code>), chances are that the reason is that the GPU operations are implemented for float32 only. Make sure that you set the flags <code class="language-plaintext highlighter-rouge">floatX=float32</code> and that your shared variables are float32. All the floating point constants should be cast to <code class="language-plaintext highlighter-rouge">numpy.float32</code> as well. 
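</p> <p>The way float64 creeps in can be demonstrated with plain NumPy, whose result types Theano roughly follows for array operands. This sketch only illustrates the silent upcasting itself, not Theano:</p>

```python
import numpy

weights = numpy.zeros(3, dtype=numpy.float32)
constant = numpy.array([0.1])            # dtype defaults to float64

# Mixing the two silently upcasts the result to float64.
print((weights + constant).dtype)        # float64

# Casting the constant keeps everything in float32.
print((weights + constant.astype(numpy.float32)).dtype)  # float32
```

<p>In a Theano graph with the old backend, a float64 intermediate like this is exactly what keeps the operation off the GPU.</p> <p>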
Another example is multinomial sampling—uniform sampling is performed on GPU, but <code class="language-plaintext highlighter-rouge">MultinomialFromUniform</code> forces a transfer to host memory:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MultinomialFromUniform{int64} [id CH] '' |HostFromGpu [id CI] '' | |GpuReshape{2} [id CJ] '' | ... |HostFromGpu [id CT] '' | |GPU_mrg_uniform{CudaNdarrayType(float32, vector),inplace}.1 | |&lt;CudaNdarrayType(float32, vector)&gt; [id CV] | |MakeVector{dtype='int64'} [id CW] '' </code></pre></div></div> <h3 id="profiling-performance">Profiling performance</h3> <p>Profiling is important after making changes to a Theano function, to make sure that the compiled code won’t run inefficiently. Profiling can be enabled by setting the flag <code class="language-plaintext highlighter-rouge">profile=True</code>, or for certain functions individually by passing the argument <code class="language-plaintext highlighter-rouge">profile=True</code> to <code class="language-plaintext highlighter-rouge">theano.function()</code>.</p> <p>When profiling is enabled, the function runs very slowly, so if your program calls it repeatedly, you probably want to exit after a few iterations. When the program exits, theano prints several tables. I have found the Apply table to be the most useful. 
It displays the time spent in each node of the computation graph, in descending order:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Apply ------ &lt;% time&gt; &lt;sum %&gt; &lt;apply time&gt; &lt;time per call&gt; &lt;#call&gt; &lt;id&gt; 89.2% 89.2% 383.470s 2.52e-01s 1523 116 input 0: dtype=float32, shape=(10001,), strides=(1,) input 1: dtype=float32, shape=(18000,), strides=(1,) input 2: dtype=int64, shape=(18000,), strides=c output 0: dtype=float32, shape=(10001,), strides=(1,) 3.6% 92.8% 15.575s 1.02e-02s 1523 111 input 0: dtype=float32, shape=(10001,), strides=(1,) input 1: dtype=float32, shape=(720,), strides=(1,) input 2: dtype=int64, shape=(720,), strides=c output 0: dtype=float32, shape=(10001,), strides=(1,) </code></pre></div></div> <p>In the above example, the most expensive operation is node 116, which consumes 89 % of the total processing time, so there’s clearly something wrong with this operation. The ID 116 can be used to locate this node in the computation graph:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>GpuAdvancedIncSubtensor1{inplace,inc} [id CY] '' 116 |GpuAdvancedIncSubtensor1{inplace,inc} [id CZ] '' 111 | |GpuAlloc{memset_0=True} [id DA] '' 22 | | |CudaNdarrayConstant{[ 0.]} [id DB] | | |Shape_i{0} [id DC] '' 13 | | |bias [id BU] </code></pre></div></div> <p>It is essential to name all the shared variables by providing the <code class="language-plaintext highlighter-rouge">name</code> argument to the their constructors. This makes it easier to understand which function calls generated a specific part of the graph. In this case, the graph shows that the bottleneck <code class="language-plaintext highlighter-rouge">GpuAdvancedIncSubtensor1</code> operates on the <code class="language-plaintext highlighter-rouge">bias</code> variable and we can find the code that produced this operation. 
It looks like this:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">value</span> <span class="o">=</span> <span class="n">numpy</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">size</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="n">theano</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">floatX</span><span class="p">)</span> <span class="n">bias</span> <span class="o">=</span> <span class="n">theano</span><span class="p">.</span><span class="n">shared</span><span class="p">(</span><span class="n">value</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="s">'bias'</span><span class="p">)</span> <span class="p">...</span> <span class="n">bias</span> <span class="o">=</span> <span class="n">bias</span><span class="p">[</span><span class="n">targets</span><span class="p">]</span> <span class="n">bias</span> <span class="o">=</span> <span class="n">bias</span><span class="p">.</span><span class="n">reshape</span><span class="p">([</span><span class="n">minibatch_size</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">])</span> </code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">GpuAdvancedIncSubtensor1</code> is responsible for updating the specific elements of the bias vector (the elements indexed by <code class="language-plaintext highlighter-rouge">targets</code>), when the bias parameter is updated. So how can the performance be improved? It can be difficult to know what’s wrong, especially while Theano is still under quite heavy development and some things may be broken. 
If you have a working version without the performance problem, the best bet might be to make small changes to the code to see what causes the problem to appear.</p> <p>In this particular case, it turned out that a faster variant of the op, <code class="language-plaintext highlighter-rouge">GpuAdvancedIncSubtensor1_dev20</code>, was implemented only for two-dimensional input, and the performance was radically improved by first converting the bias to 2D:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bias</span> <span class="o">=</span> <span class="n">bias</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="n">bias</span> <span class="o">=</span> <span class="n">bias</span><span class="p">[</span><span class="n">targets</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> </code></pre></div></div> <h3 id="memory-usage">Memory usage</h3> <p>The memory usage of a neural network application can be crucial, as current GPU boards usually contain no more than 12 GB of memory. When Theano runs out of memory, it throws an exception (<code class="language-plaintext highlighter-rouge">GpuArrayException</code> in the new backend) with the message <code class="language-plaintext highlighter-rouge">out of memory</code>. It means that some of the variables (the tensor variables used as function inputs, the shared variables, or the intermediate variables created during the execution of the graph) do not fit in the GPU memory.</p> <p>The size of the shared variables, such as neural network weights, can be easily observed, and it is clear how the layer sizes affect the sizes of the weight matrices. 
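</p> <p>The footprint of each weight matrix can be estimated with simple arithmetic from the layer sizes. A minimal sketch (the helper <code class="language-plaintext highlighter-rouge">weight_matrix_bytes</code> is hypothetical, not part of Theano):</p>

```python
# Hypothetical helper: bytes consumed by the float32 weight matrices
# between consecutive layers whose sizes are given in `layer_sizes`.
def weight_matrix_bytes(layer_sizes, elem_size=4):
    return [n_in * n_out * elem_size
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

# A 200-unit layer followed by a 4000-unit layer:
weight_matrix_bytes([200, 4000])  # [3200000] bytes, i.e. about 3.2 MB
```

<p>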
Weight matrix dimensions are defined by the number of inputs and the number of outputs, so the weight matrices get large when two large layers follow each other (or when the network input or output dimensionality is large and the first or last layer is large).</p> <p>The shared variables and inputs constitute just part of the memory usage, however. Theano also needs to save the intermediate results of the graph nodes, e.g. the outputs of each layer. The size of these outputs depends on the batch size as well as the layer size. When Theano fails to save an intermediate result, it prints a lot of useful information, including the operation that produced the data, the node in the computation graph, and all the variables in memory:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Apply</span> <span class="n">node</span> <span class="n">that</span> <span class="n">caused</span> <span class="n">the</span> <span class="n">error</span><span class="p">:</span> <span class="n">GpuDot22</span><span class="p">(</span><span class="n">GpuReshape</span><span class="p">{</span><span class="mi">2</span><span class="p">}.</span><span class="mi">0</span><span class="p">,</span> <span class="n">layer_1</span><span class="o">/</span><span class="n">W</span><span class="p">)</span> <span class="n">Toposort</span> <span class="n">index</span><span class="p">:</span> <span class="mi">62</span> <span class="n">Inputs</span> <span class="n">types</span><span class="p">:</span> <span class="p">[</span><span class="n">GpuArrayType</span><span class="o">&lt;</span><span class="bp">None</span><span class="o">&gt;</span><span class="p">(</span><span class="n">float32</span><span class="p">),</span> <span class="n">GpuArrayType</span><span class="o">&lt;</span><span class="bp">None</span><span class="o">&gt;</span><span class="p">(</span><span class="n">float32</span><span class="p">)]</span> <span class="n">Inputs</span> <span class="n">shapes</span><span 
class="p">:</span> <span class="p">[(</span><span class="mi">482112</span><span class="p">,</span> <span class="mi">200</span><span class="p">),</span> <span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">4000</span><span class="p">)]</span> <span class="n">Inputs</span> <span class="n">strides</span><span class="p">:</span> <span class="p">[(</span><span class="mi">800</span><span class="p">,</span> <span class="mi">4</span><span class="p">),</span> <span class="p">(</span><span class="mi">16000</span><span class="p">,</span> <span class="mi">4</span><span class="p">)]</span> <span class="n">Inputs</span> <span class="n">values</span><span class="p">:</span> <span class="p">[</span><span class="s">'not shown'</span><span class="p">,</span> <span class="s">'not shown'</span><span class="p">]</span> <span class="n">Inputs</span> <span class="n">type_num</span><span class="p">:</span> <span class="p">[</span><span class="mi">11</span><span class="p">,</span> <span class="mi">11</span><span class="p">]</span> <span class="n">Outputs</span> <span class="n">clients</span><span class="p">:</span> <span class="p">[[</span><span class="n">GpuReshape</span><span class="p">{</span><span class="mi">3</span><span class="p">}(</span><span class="n">GpuDot22</span><span class="p">.</span><span class="mi">0</span><span class="p">,</span> <span class="n">MakeVector</span><span class="p">{</span><span class="n">dtype</span><span class="o">=</span><span class="s">'int64'</span><span class="p">}.</span><span class="mi">0</span><span class="p">)]]</span> <span class="n">Debugprint</span> <span class="n">of</span> <span class="n">the</span> <span class="nb">apply</span> <span class="n">node</span><span class="p">:</span> <span class="n">GpuDot22</span> <span class="p">[</span><span class="nb">id</span> <span class="n">A</span><span class="p">]</span> <span class="o">&lt;</span><span class="n">GpuArrayType</span><span class="o">&lt;</span><span 
class="bp">None</span><span class="o">&gt;</span><span class="p">(</span><span class="n">float32</span><span class="p">)</span><span class="o">&gt;</span> <span class="s">''</span> <span class="o">|</span><span class="n">GpuReshape</span><span class="p">{</span><span class="mi">2</span><span class="p">}</span> <span class="p">[</span><span class="nb">id</span> <span class="n">B</span><span class="p">]</span> <span class="o">&lt;</span><span class="n">GpuArrayType</span><span class="o">&lt;</span><span class="bp">None</span><span class="o">&gt;</span><span class="p">(</span><span class="n">float32</span><span class="p">)</span><span class="o">&gt;</span> <span class="s">''</span> <span class="o">|</span> <span class="o">|</span><span class="n">GpuAdvancedSubtensor1</span> <span class="p">[</span><span class="nb">id</span> <span class="n">C</span><span class="p">]</span> <span class="o">&lt;</span><span class="n">GpuArrayType</span><span class="o">&lt;</span><span class="bp">None</span><span class="o">&gt;</span><span class="p">(</span><span class="n">float32</span><span class="p">)</span><span class="o">&gt;</span> <span class="s">''</span> <span class="o">|</span> <span class="o">|</span> <span class="o">|</span><span class="n">projection_layer</span><span class="o">/</span><span class="n">W</span> <span class="p">[</span><span class="nb">id</span> <span class="n">D</span><span class="p">]</span> <span class="o">&lt;</span><span class="n">GpuArrayType</span><span class="o">&lt;</span><span class="bp">None</span><span class="o">&gt;</span><span class="p">(</span><span class="n">float32</span><span class="p">)</span><span class="o">&gt;</span> <span class="o">|</span> <span class="p">...</span> <span class="o">|</span><span class="n">layer_1</span><span class="o">/</span><span class="n">W</span> <span class="p">[</span><span class="nb">id</span> <span class="n">BA</span><span class="p">]</span> <span class="o">&lt;</span><span class="n">GpuArrayType</span><span 
class="o">&lt;</span><span class="bp">None</span><span class="o">&gt;</span><span class="p">(</span><span class="n">float32</span><span class="p">)</span><span class="o">&gt;</span> <span class="n">Storage</span> <span class="nb">map</span> <span class="n">footprint</span><span class="p">:</span> <span class="o">-</span> <span class="n">GpuReshape</span><span class="p">{</span><span class="mi">2</span><span class="p">}.</span><span class="mi">0</span><span class="p">,</span> <span class="n">Shape</span><span class="p">:</span> <span class="p">(</span><span class="mi">482112</span><span class="p">,</span> <span class="mi">200</span><span class="p">),</span> <span class="n">ElemSize</span><span class="p">:</span> <span class="mi">4</span> <span class="n">Byte</span><span class="p">(</span><span class="n">s</span><span class="p">),</span> <span class="n">TotalSize</span><span class="p">:</span> <span class="mi">385689600</span> <span class="n">Byte</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> <span class="o">-</span> <span class="n">layer_1</span><span class="o">/</span><span class="n">W</span><span class="p">,</span> <span class="n">Shared</span> <span class="n">Input</span><span class="p">,</span> <span class="n">Shape</span><span class="p">:</span> <span class="p">(</span><span class="mi">200</span><span class="p">,</span> <span class="mi">4000</span><span class="p">),</span> <span class="n">ElemSize</span><span class="p">:</span> <span class="mi">4</span> <span class="n">Byte</span><span class="p">(</span><span class="n">s</span><span class="p">),</span> <span class="n">TotalSize</span><span class="p">:</span> <span class="mi">3200000</span> <span class="n">Byte</span><span class="p">(</span><span class="n">s</span><span class="p">)</span> </code></pre></div></div> <p>The above message (slightly edited for clarity) shows that the product of two matrices, layer 1 weight and the output of the projection layer would not fit in the 
GPU memory. The output of the projection layer is 482112✕200, which takes 482112✕200✕4 bytes ≈ 386 MB of memory. The weight matrix is 200✕4000, so the result would require 482112✕4000✕4 bytes ≈ 7714 MB of memory. Either the batch size or the layer size needs to be reduced.</p> <p>If you have multiple GPUs, the new gpuarray backend allows defining the <em>context</em> of shared variables, instructing Theano to place the variable in a specific GPU. This way you can split a large model over multiple GPUs. This also causes the computation to be performed and the intermediate results to be saved in the corresponding GPU, when possible.</p> <p>If your program is working but you want to observe the memory usage, you can enable memory profiling by setting the flags <code class="language-plaintext highlighter-rouge">profile=True,profile_memory=True</code>. Theano will print the peak memory usage of each function, and a list of the largest variables.</p>Seppo Enarviseppo2021@marjaniemi.comNotes on writing, testing, and debugging Theano computation graphsStack trace with GDB2016-07-05T00:00:00+02:002016-07-05T00:00:00+02:00https://senarvi.github.io/stack-trace-with-gdb<h2 id="how-to-find-the-location-where-a-program-has-crashed-from-linux-command-line">How to find the location where a program has crashed from Linux command line</h2> <h3 id="stack-backtrace-from-linux-command-line">Stack backtrace from Linux command line</h3> <p>One of the most useful applications of GDB is to get a stack backtrace from the Linux console when a program crashes, e.g. due to a segmentation fault. 
One would typically start the program in GDB, run it, and use the <code class="language-plaintext highlighter-rouge">backtrace</code> command to print a stack trace.</p> <div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>gdb <span class="nt">-q</span> my-program <span class="go">(gdb) run Starting program: /.../my-program Program received signal SIGSEGV, Segmentation fault. 0x00000000004004fd in fail() () (gdb) backtrace </span><span class="gp">#</span>0 0x00000000004004fd <span class="k">in </span>fail<span class="o">()</span> <span class="o">()</span> <span class="gp">#</span>1 0x0000000000400513 <span class="k">in </span>main <span class="o">()</span> </code></pre></div></div> <p>In order to get as much information out as possible, the program should be compiled with debugging information included in the executable. For example:</p> <div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>g++ <span class="nt">-ggdb</span> my-program.cc <span class="nt">-o</span> my-program <span class="gp">$</span><span class="w"> </span>gdb <span class="nt">-q</span> my-program <span class="go">Reading symbols from my-program...done. (gdb) run Starting program: /.../my-program Program received signal SIGSEGV, Segmentation fault. 0x00000000004004fd in fail () at my-program.cc:3 </span><span class="gp">3 ++*ptr;</span><span class="w"> </span><span class="go">(gdb) backtrace </span><span class="gp">#</span>0 0x00000000004004fd <span class="k">in </span>fail <span class="o">()</span> at my-program.cc:3 <span class="gp">#</span>1 0x0000000000400513 <span class="k">in </span>main <span class="o">()</span> at my-program.cc:7 </code></pre></div></div> <h3 id="gdb-in-batch-mode">GDB in batch mode</h3> <p>The above examples expect that you start an interactive session with GDB. 
Sometimes it’s useful to be able to run a program under GDB non-interactively, for example when running a program in a compute cluster. For such cases the batch mode is useful:</p> <div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>gdb <span class="nt">-q</span> <span class="nt">-batch</span> <span class="nt">-ex</span> run <span class="nt">-ex</span> backtrace my-program <span class="go">Program received signal SIGSEGV, Segmentation fault. 0x00000000004004fd in fail () at my-program.cc:3 </span><span class="gp">3 ++*ptr;</span><span class="w"> </span><span class="gp">#</span>0 0x00000000004004fd <span class="k">in </span>fail <span class="o">()</span> at my-program.cc:3 <span class="gp">#</span>1 0x0000000000400513 <span class="k">in </span>main <span class="o">()</span> at my-program.cc:7 </code></pre></div></div> <p>Often one needs to pass some command line arguments to the program that is debugged. The option <code class="language-plaintext highlighter-rouge">--args</code> tells GDB that the rest of the command line specifies the program to be debugged and the arguments to be passed to the program, i.e.</p> <div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>gdb arguments-to-gdb <span class="nt">--args</span> my-program arguments-to-my-program </code></pre></div></div> <h3 id="multi-threaded-programs">Multi-threaded programs</h3> <p>By default GDB shows stack trace only for the current thread. When debugging a multi-threaded program, you may want to use the command <code class="language-plaintext highlighter-rouge">thread apply all backtrace</code> to display stack trace for all the threads. 
Another useful command is <code class="language-plaintext highlighter-rouge">set print thread-events off</code>, which disables printing a message every time a thread starts or exits. Finally, the command <code class="language-plaintext highlighter-rouge">handle &lt;signal&gt; nostop pass</code> can be used to instruct GDB not to stop on a signal that the program should be allowed to handle.</p> <p>Below is a single command that runs a program under GDB, stops when the program receives a signal other than SIGALRM or SIGCHLD, and prints the stack backtrace of all the threads:</p> <div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>gdb <span class="nt">-q</span> <span class="se">\</span> <span class="nt">-batch</span> <span class="se">\</span> <span class="nt">-ex</span> <span class="s1">'set print thread-events off'</span> <span class="se">\</span> <span class="nt">-ex</span> <span class="s1">'handle SIGALRM nostop pass'</span> <span class="se">\</span> <span class="nt">-ex</span> <span class="s1">'handle SIGCHLD nostop pass'</span> <span class="se">\</span> <span class="nt">-ex</span> <span class="s1">'run'</span> <span class="se">\</span> <span class="nt">-ex</span> <span class="s1">'thread apply all backtrace'</span> <span class="se">\</span> <span class="nt">--args</span> <span class="se">\</span> my-program <span class="se">\</span> arguments-to-my-program </code></pre></div></div>Seppo Enarviseppo2021@marjaniemi.comHow to find the location where a program has crashed from Linux command line