Corrections to “Generalization Bounds via Information Density and Conditional Information Density” [Nov 20 824-839]
An error in the proof of the data-dependent tail bounds on the generalization error presented in Hellström and Durisi (2020) is identified, and a correction is proposed. Furthermore, we note that the absolute continuity requirements in Hellström and Durisi (2020) need to be strengthened to avoid measurability issues.
Guest Editorial for Special Issue on Coded Computing
Computing is the next frontier for information theory. Intellectually, the goal of coded computing has been of interest since the days of von Neumann and Shannon. Von Neumann examined this issue in his 1956 paper “Probabilistic Logics and the Synthesis of Reliable Organisms From Unreliable Components,” which was in turn motivated intellectually by Shannon's 1948 paper and by the problem of understanding the reliability of seemingly noisy biological systems.
Quantization of Distributed Data for Learning
We consider machine learning applications that train a model by leveraging data distributed over a trusted network, where communication constraints can create a performance bottleneck. A number of recent approaches propose to overcome this bottleneck through compression of gradient updates. However, as models become larger, so does the size of the gradient updates.
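For context, the sketch below illustrates the kind of gradient-update compression that the recent approaches mentioned above rely on: an unbiased stochastic uniform quantizer in the spirit of QSGD-style schemes. It is a minimal illustration of that baseline, not the method proposed in the paper; the function name and the `levels` parameter are assumptions made for the example.

```python
import numpy as np

def stochastic_quantize(v, levels=4, rng=None):
    # Unbiased stochastic uniform quantization of a vector onto `levels`
    # magnitude levels (illustrative of QSGD-style gradient compressors).
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v.copy()
    scaled = np.abs(v) / norm * levels                   # magnitudes in [0, levels]
    lower = np.floor(scaled)
    q = lower + (rng.random(v.shape) < scaled - lower)   # randomized rounding
    return np.sign(v) * q * norm / levels                # E[output] = v (unbiased)

g = np.random.default_rng(0).standard_normal(8)   # a "gradient update"
g_hat = stochastic_quantize(g)                     # its compressed version
```

The randomized rounding keeps the quantizer unbiased, which is why such compressors can be plugged into SGD-type analyses without changing the expected update direction.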
Bivariate Polynomial Coding for Efficient Distributed Matrix Multiplication
Coded computing is an effective technique to mitigate “stragglers” in large-scale and distributed matrix multiplication. In particular, univariate polynomial codes have been shown to be effective in straggler mitigation by making the computation time depend only on the fastest workers. However, these schemes completely ignore the work done by the straggling workers, resulting in a waste of computational resources.
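As background for the bivariate construction, here is a minimal sketch of the univariate polynomial-code baseline mentioned above: row blocks of $A$ and column blocks of $B$ are encoded as matrix polynomials, each worker multiplies its two evaluations, and the master interpolates all block products from the results of the $mn$ fastest workers. The splitting parameters $m$, $n$ and the evaluation points are illustrative choices, not the paper's notation.

```python
import numpy as np

def encode(blocks, x, stride=1):
    # Evaluate the matrix polynomial sum_i blocks[i] * x**(i*stride) at point x.
    return sum(blk * (x ** (i * stride)) for i, blk in enumerate(blocks))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
m, n = 2, 2
A_blocks = np.split(A, m, axis=0)            # row blocks A_0, A_1
B_blocks = np.split(B, n, axis=1)            # column blocks B_0, B_1

# Worker t evaluates both encoded polynomials at its point x_t and multiplies
# them; the product is a degree-(m*n - 1) matrix polynomial in x_t.
points = np.arange(1.0, m * n + 1.0)         # any m*n distinct evaluation points
worker_results = [encode(A_blocks, x) @ encode(B_blocks, x, stride=m) for x in points]

# Master: recover the m*n coefficient matrices A_i B_j by interpolation.
V = np.vander(points, m * n, increasing=True)
coeffs = np.linalg.solve(V, np.stack([R.ravel() for R in worker_results]))
C_blocks = coeffs.reshape(m * n, A.shape[0] // m, B.shape[1] // n)
C = np.block([[C_blocks[i + j * m] for j in range(n)] for i in range(m)])
assert np.allclose(C, A @ B)
```

With more than $mn$ workers, any $mn$ finishers suffice, which is exactly the straggler tolerance the abstract refers to; the results of the remaining (straggling) workers are discarded.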
Communication-Efficient and Byzantine-Robust Distributed Learning With Error Feedback
We develop a communication-efficient distributed learning algorithm that is robust against Byzantine worker machines. We propose and analyze a distributed gradient-descent algorithm that performs a simple thresholding based on gradient norms to mitigate Byzantine failures. We show that the (statistical) error rate of our algorithm matches that of Yin et al. (2018), which uses more complicated schemes (coordinate-wise median, trimmed mean).
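A minimal sketch of the norm-based thresholding step described above, assuming a synchronous round in which the master collects one gradient per worker; the trimming fraction `beta` is an illustrative parameter, not the paper's notation.

```python
import numpy as np

def norm_thresholded_mean(gradients, beta):
    # Discard the beta fraction of worker gradients with the largest norms
    # (treated as potentially Byzantine) and average the rest.
    gradients = np.asarray(gradients)                 # shape: (workers, dim)
    norms = np.linalg.norm(gradients, axis=1)
    keep = np.argsort(norms)[: int(np.ceil((1 - beta) * len(gradients)))]
    return gradients[keep].mean(axis=0)

# Toy round: 8 honest workers plus 2 Byzantine workers sending huge gradients.
rng = np.random.default_rng(1)
honest = rng.standard_normal((8, 5))
byzantine = 100.0 * rng.standard_normal((2, 5))
agg = norm_thresholded_mean(np.vstack([honest, byzantine]), beta=0.2)
```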
Coded Sequential Matrix Multiplication for Straggler Mitigation
In this work, we consider a sequence of $J$ matrix multiplication jobs that needs to be distributed by a master across multiple worker nodes. For $i \in \{1,2,\ldots,J\}$, job-$i$ begins in round-$i$ and has to be completed by round-$(i+T)$. In order to provide resiliency against slow workers (stragglers), previous works focus on coding across workers, which is the special case of $T=0$. We propose here two schemes with $T > 0$, which allow for coding across workers as well as across time.
SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization
In this paper, we propose and analyze SQuARM-SGD, a communication-efficient algorithm for decentralized training of large-scale machine learning models over a network. In SQuARM-SGD, each node performs a fixed number of local SGD steps using Nesterov's momentum and then sends sparsified and quantized updates to its neighbors, regulated by a locally computable triggering criterion.
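The following sketch illustrates one node's round under the ingredients listed above (local Nesterov-momentum SGD steps, then a sparsified and quantized update sent only when a locally computable trigger fires). The top-$k$ sparsifier, scaled-sign quantizer, and norm-based trigger used here are illustrative stand-ins, not the exact SQuARM-SGD operators.

```python
import numpy as np

def topk_sparsify(v, k):
    # Keep the k largest-magnitude entries of v, zero out the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def scaled_sign_quantize(v):
    # One shared magnitude plus a sign per entry (a crude quantizer).
    return np.abs(v).mean() * np.sign(v)

def local_round(x, grad_fn, lr=0.05, momentum=0.9, local_steps=4, k=2, threshold=1e-3):
    # One node's round: a few local Nesterov-momentum SGD steps, then a
    # sparsified + quantized model update that is shared with the neighbors
    # only if a locally computable trigger (here: its norm) is large enough.
    x0, v = x.copy(), np.zeros_like(x)
    for _ in range(local_steps):
        v = momentum * v - lr * grad_fn(x + momentum * v)   # Nesterov look-ahead
        x = x + v
    update = scaled_sign_quantize(topk_sparsify(x - x0, k))
    message = update if np.linalg.norm(update) > threshold else None
    return x, message

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_new, msg = local_round(np.array([1.0, -2.0, 0.5]), grad_fn=lambda x: x)
```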
Factored LT and Factored Raptor Codes for Large-Scale Distributed Matrix Multiplication
We propose two coding schemes for distributed matrix multiplication in the presence of stragglers. These coding schemes are adaptations of Luby Transform (LT) codes and Raptor codes to distributed matrix multiplication and are termed Factored LT (FLT) codes and Factored Raptor (FRT) codes. We show that all nodes in the Tanner graph of a randomly sampled code have a tree-like neighborhood with high probability. This ensures that the density evolution analysis gives a reasonable estimate of the average recovery threshold of FLT codes.
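To make the notion of a recovery threshold concrete, the sketch below simulates peeling decoding of a plain LT code with an Ideal Soliton degree distribution and reports the average number of coded results needed to recover all $k$ source blocks. The degree distribution and parameters are illustrative assumptions and do not reproduce the FLT/FRT constructions.

```python
import random

def ideal_soliton(k):
    # Ideal Soliton distribution over degrees {1, ..., k} (illustrative choice).
    return [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]

def lt_recovery_threshold(k, rng):
    # Collect coded symbols (random subsets of source blocks) until peeling
    # decoding recovers every block; return how many symbols were needed.
    degrees, probs = list(range(1, k + 1)), ideal_soliton(k)
    recovered, symbols, collected = set(), [], 0
    while len(recovered) < k:
        d = rng.choices(degrees, weights=probs)[0]
        symbols.append(set(rng.sample(range(k), d)))
        collected += 1
        progress = True
        while progress:                      # peel symbols with one unknown block
            progress = False
            for s in symbols:
                unknown = s - recovered
                if len(unknown) == 1:
                    recovered |= unknown
                    progress = True
    return collected

rng = random.Random(0)
runs = [lt_recovery_threshold(50, rng) for _ in range(20)]
print(sum(runs) / len(runs))   # empirical average recovery threshold
```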
Compressing Gradients by Exploiting Temporal Correlation in Momentum-SGD
An increasing bottleneck in decentralized optimization is communication. Bigger models and growing datasets mean that decentralization of computation is important and that the amount of information exchanged is quickly growing. While compression techniques have been introduced to cope with the latter, none has considered leveraging the temporal correlations that exist in consecutive vector updates. An important example is distributed momentum-SGD where temporal correlation is enhanced by the low-pass-filtering effect of applying momentum.
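The low-pass-filtering effect mentioned above can be made explicit. With momentum parameter $\mu \in [0,1)$ and zero initialization, the momentum buffer is an exponentially weighted (first-order IIR) filter of the gradient sequence,
$$ m_t = \mu\, m_{t-1} + g_t = \sum_{\tau=0}^{t} \mu^{\,t-\tau} g_\tau, $$
so for $\mu$ close to one, consecutive updates $m_t$ and $m_{t-1}$ are strongly correlated and the difference $m_t - m_{t-1}$ is far more compressible than $m_t$ itself. (The symbols $m_t$, $g_t$, and $\mu$ are generic notation for this illustration, not necessarily the paper's.)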