## Tuesday, May 24, 2016

### Thesis: Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks - implementation -

Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks by Philipp Gysel

Convolutional neural networks (CNN) have achieved major breakthroughs in recent years. Their performance in computer vision has matched and in some areas even surpassed human capabilities. Deep neural networks can capture complex non-linear features; however, this ability comes at the cost of high computational and memory requirements. State-of-the-art networks require billions of arithmetic operations and millions of parameters. To enable embedded devices such as smartphones, Google glasses and monitoring cameras with the astonishing power of deep learning, dedicated hardware accelerators can be used to decrease both execution time and power consumption. In applications where fast connection to the cloud is not guaranteed or where privacy is important, computation needs to be done locally. Many hardware accelerators for deep neural networks have been proposed recently. A first important step of accelerator design is hardware-oriented approximation of deep networks, which enables energy-efficient inference. We present Ristretto, a fast and automated framework for CNN approximation. Ristretto simulates the hardware arithmetic of a custom hardware accelerator. The framework reduces the bit-width of network parameters and outputs of resource-intensive layers, which significantly reduces the chip area needed for multiplication units. Alternatively, Ristretto can remove the need for multipliers altogether, resulting in an adder-only arithmetic. The tool fine-tunes trimmed networks to achieve high classification accuracy. Since training of deep neural networks can be time-consuming, Ristretto uses highly optimized routines which run on the GPU. This enables fast compression of any given network. Given a maximum tolerance of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8 bits. The code for Ristretto is available.

The page for the Ristretto project is at: http://lepsucd.com/?page_id=621
From the page:

Ristretto is an automated CNN-approximation tool which condenses 32-bit floating point networks. Ristretto is an extension of Caffe and makes it possible to test, train and fine-tune networks with limited numerical precision.

#### Ristretto in a Minute

• Ristretto Tool: The Ristretto tool performs automatic network quantization and scoring, using different bit-widths for number representation, to find a good balance between compression rate and network accuracy.
• Ristretto Layers: Ristretto reimplements Caffe-layers with quantized numbers.
• Testing and Training: Thanks to Ristretto’s smooth integration into Caffe, network description files can be manually changed to quantize different layers. The bit-width used for different layers, as well as other parameters, can be set in the network’s prototxt file. This allows condensed networks to be tested and trained directly, without any need for recompilation.

#### Approximation Schemes

Ristretto allows for three different quantization strategies to approximate Convolutional Neural Networks:
• Dynamic Fixed Point: A modified fixed-point format with more flexibility (a toy sketch of this scheme follows the list).
• Mini Floating Point: Bit-width reduced floating point numbers.
• Power-of-two parameters: Layers with power-of-two parameters don’t need any multipliers when implemented in hardware.
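
As flagged in the first bullet above, here is a toy NumPy sketch of dynamic fixed point quantization: a per-tensor power-of-two scale plus low bit-width signed integers. It illustrates the general idea only; the function and the scale-selection rule are mine, not Ristretto's implementation.

```python
import numpy as np

def quantize_dynamic_fixed_point(x, bit_width=8):
    """Round an array to 'dynamic fixed point': a power-of-two scale shared by
    the whole tensor (chosen from its largest magnitude) plus low bit-width
    signed integers. A simplified sketch in the spirit of Ristretto's first
    scheme, not the tool's own code."""
    int_bits = int(np.ceil(np.log2(np.abs(x).max() + 1e-12))) + 1  # +1 sign bit
    frac_bits = bit_width - int_bits
    scale = 2.0 ** frac_bits
    qmin, qmax = -2 ** (bit_width - 1), 2 ** (bit_width - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale  # de-quantized values, as used when simulating the hardware

weights = 0.1 * np.random.randn(64, 3, 3, 3)   # toy convolution kernel
w8 = quantize_dynamic_fixed_point(weights, bit_width=8)
print("max abs quantization error:", np.abs(weights - w8).max())
```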

#### Cite us

Our approximation framework was presented in an extended abstract at ICLR’16. Check out our poster. All results can be reproduced with our code on Github. If Ristretto helps your research project, please cite us:
```
@article{gysel2016hardware,
  title={Hardware-oriented Approximation of Convolutional Neural Networks},
  author={Gysel, Philipp and Motamedi, Mohammad and Ghiasi, Soheil},
  journal={arXiv preprint arXiv:1604.03168},
  year={2016}
}
```

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Monday, May 23, 2016

### The Shallow Side of Deep Learning

On his Twitter feed, Atlas mentioned three preprints: the first builds deeper constructions out of kernels (shallow methods). The next one shows how some deep nets can be viewed as ensembles of shallower nets. Finally, the last one makes the case that dropout and stochastic depth amount, in effect, to averaging over shallower networks.

In this paper, we propose a new image representation based on a multilayer kernel machine that performs end-to-end learning. Unlike traditional kernel methods, where the kernel is handcrafted or adapted to data in an unsupervised manner, we learn how to shape the kernel for a supervised prediction problem. We proceed by generalizing convolutional kernel networks, which originally provide unsupervised image representations, and we derive backpropagation rules to optimize model parameters. As a result, we obtain a new type of convolutional neural network with the following properties: (i) at each layer, learning filters is equivalent to optimizing a linear subspace in a reproducing kernel Hilbert space (RKHS), where we project data, (ii) the network may be learned with supervision or without, (iii) the model comes with a natural regularization function (the norm in the RKHS). We show that our method achieves reasonably competitive performance on some standard "deep learning" image classification datasets such as CIFAR-10 and SVHN, and also state-of-the-art results for image super-resolution, demonstrating the applicability of our approach to a large variety of image-related tasks.

Residual Networks are Exponential Ensembles of Relatively Shallow Networks by Andreas Veit, Michael Wilber, Serge Belongie

In this work, we introduce a novel interpretation of residual networks showing they are exponential ensembles. This observation is supported by a large-scale lesion study that demonstrates they behave just like ensembles at test time. Subsequently, we perform an analysis showing these ensembles mostly consist of networks that are each relatively shallow. For example, contrary to our expectations, most of the gradient in a residual network with 110 layers comes from an ensemble of very short networks, i.e., only 10-34 layers deep. This suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble. Ultimately, residual networks do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network - rather, they avoid the problem simply by ensembling many short networks together. This insight reveals that depth is still an open research question and invites the exploration of the related notion of multiplicity.
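
The ensemble interpretation comes from unrolling the residual recursion x_{l+1} = x_l + f_{l+1}(x_l): expanding it gives one path per subset of residual blocks, so an n-block network contains 2^n implicit paths whose lengths are binomially distributed. The toy calculation below adds one illustrative assumption of my own, a geometric attenuation of the gradient magnitude per traversed block (r = 0.2 is arbitrary; the paper measures the actual decay empirically), to show how the gradient mass can concentrate on short paths:

```python
from math import comb

n = 54      # residual blocks in a 110-layer ResNet (roughly 2 layers per block)
r = 0.2     # assumed per-block gradient attenuation -- an illustrative number only

# comb(n, k) paths traverse exactly k residual branches; weight each path's
# contribution by r**k and see where the modelled gradient mass concentrates.
mass = [comb(n, k) * r**k for k in range(n + 1)]
total, running = sum(mass), 0.0
for k, m in enumerate(mass):
    running += m
    if running / total >= 0.9:
        print(f"2^{n} = {2**n:.2e} implicit paths;")
        print(f"90% of the modelled gradient comes from paths of <= {k} blocks "
              f"(~{2*k} layers)")
        break
```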

Swapout: Learning an ensemble of deep architectures by Saurabh Singh, Derek Hoiem, David Forsyth

We describe Swapout, a new stochastic training method that outperforms ResNets of identical network structure, yielding impressive results on CIFAR-10 and CIFAR-100. Swapout samples from a rich set of architectures including dropout, stochastic depth and residual architectures as special cases. When viewed as a regularization method, swapout not only inhibits co-adaptation of units in a layer, similar to dropout, but also across network layers. We conjecture that swapout achieves strong regularization by implicitly tying the parameters across layers. When viewed as an ensemble training method, it samples a much richer set of architectures than existing methods such as dropout or stochastic depth. We propose a parameterization that reveals connections to existing architectures and suggests a much richer set of architectures to be explored. We show that our formulation suggests an efficient training method and validate our conclusions on CIFAR-10 and CIFAR-100, matching state-of-the-art accuracy. Remarkably, our 32 layer wider model performs similarly to a 1001 layer ResNet model.
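
The sampling rule at the heart of Swapout is y = theta1 * x + theta2 * F(x) with independent Bernoulli masks, and fixing the masks recovers the special cases mentioned in the abstract. A minimal NumPy sketch of that rule (my own illustration, not the authors' code):

```python
import numpy as np

def swapout(x, fx, p1=0.5, p2=0.5, rng=np.random):
    """One swapout unit: y = theta1 * x + theta2 * F(x), with independent
    Bernoulli masks drawn per unit. Deterministic masks recover the special
    cases: theta1 = theta2 = 1 -> residual block; theta1 = theta2 (shared
    mask) -> dropout on the block output; theta1 = 1 with a single per-layer
    theta2 -> stochastic depth."""
    theta1 = rng.binomial(1, p1, size=x.shape)
    theta2 = rng.binomial(1, p2, size=x.shape)
    return theta1 * x + theta2 * fx

x = np.random.randn(4, 16)                        # layer input (batch of 4)
fx = np.tanh(x @ np.random.randn(16, 16))         # stand-in for the branch F(x)
y = swapout(x, fx)                                # stochastic output at train time
```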


## Saturday, May 21, 2016

### Saturday Morning Video: Solvable Model of Unsupervised Feature Learning, Lenka Zdeborova @ Simons Institute

Lenka, who will be at our next Paris Machine Learning meetup on June 8th, recently spoke at the Simons Institute's Random Instances and Phase Transitions workshop, part of the trimester-long Counting Complexity and Phase Transitions program. Here is Solvable Model of Unsupervised Feature Learning by Lenka Zdeborova:

We introduce factorization of certain randomly generated matrices as a probabilistic model for unsupervised feature learning. Using the replica method we obtain an exact closed formula for the minimum mean-squared error and sample complexity for reconstruction of the features from data. An alternative way to derive this formula is via state evolution of the associated approximate message passing. Analyzing the fixed points of this formula we identify a gap between computationally tractable and information-theoretically optimal inference, associated with a first order phase transition. The model can serve for benchmarking general-purpose feature learning algorithms, and its analysis provides insight into the theoretical understanding of unsupervised learning from data.


## Friday, May 20, 2016

### ASP Vision: Optically Computing the First Layer of Convolutional Neural Networks using Angle Sensitive Pixels

Yes ! The great convergence in action. Merging sensing with the first layer of a deep learning architecture:

ASP Vision: Optically Computing the First Layer of Convolutional Neural Networks using Angle Sensitive Pixels by Huaijin Chen, Suren Jayasuriya, Jiyue Yang, Judy Stephen, Sriram Sivaramakrishnan, Ashok Veeraraghavan, Alyosha Molnar

Deep learning using convolutional neural networks (CNNs) is quickly becoming the state-of-the-art for challenging computer vision applications. However, deep learning's power consumption and bandwidth requirements currently limit its application in embedded and mobile systems with tight energy budgets. In this paper, we explore the energy savings of optically computing the first layer of CNNs. To do so, we utilize bio-inspired Angle Sensitive Pixels (ASPs), custom CMOS diffractive image sensors which act similar to Gabor filter banks in the V1 layer of the human visual cortex. ASPs replace both image sensing and the first layer of a conventional CNN by directly performing optical edge filtering, saving sensing energy, data bandwidth, and CNN FLOPS to compute. Our experimental results (both on synthetic data and a hardware prototype) for a variety of vision tasks such as digit recognition, object recognition, and face identification demonstrate 97% reduction in image sensor power consumption and 90% reduction in data bandwidth from sensor to CPU, while achieving similar performance compared to traditional deep learning pipelines.
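
A rough way to picture "optically computing the first layer": the learned first convolutional layer is replaced by a fixed bank of oriented, Gabor-like edge filters (which is what the ASP responses resemble), and only the remaining layers are trained. The sketch below builds such a fixed bank with NumPy/SciPy; it only illustrates the idea and is not the ASPs' measured optical response.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor(size=7, wavelength=4.0, theta=0.0, sigma=2.0):
    """Real part of a Gabor filter; stands in for one fixed ASP-like edge response."""
    r = np.arange(size) - size // 2
    X, Y = np.meshgrid(r, r)
    Xr = X * np.cos(theta) + Y * np.sin(theta)          # coordinate along the orientation
    envelope = np.exp(-(X**2 + Y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * Xr / wavelength)

# A fixed, non-learned "first layer": a small bank of oriented Gabor filters.
bank = [gabor(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
image = np.random.rand(28, 28)                              # stand-in input image
features = np.stack([convolve2d(image, g, mode="same") for g in bank])
print(features.shape)   # (4, 28, 28): edge maps fed to the remaining (trained) layers
```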


## Thursday, May 19, 2016

### FISTA and "Maximal Sparsity with Deep Networks?", Learning MMSE Optimal Thresholds for FISTA

Ulugbek just sent me the following:

Hi Igor,

I just saw today’s post on Nuit Blanche on "Maximal Sparsity with Deep Networks?”.

You mentioned both FISTA and AMP, and the number of iterations required for convergence. I generally agree with your observations about studies on FISTA/AMP being hugely insightful. We have been investigating this from our side with Hassan, and I wanted to share a small update (see attached), which will be presented at iTWIST 2016 in Denmark this August. We haven’t uploaded the manuscript to arXiv since iTWIST publishes its proceedings there anyway.

Going from ISTA to FISTA is quite advantageous. We initialize the algorithm with 0 and also set the initial regularizer to 0, i.e., we use the identity operator instead of a shrinkage (which corresponds to a least-squares estimator), and converge much faster towards a regularizer that promotes sparsity.

Thank you for keeping us all informed here,
Kind Regards,
- Ulugbek
Thanks Ulugbek and Hassan ! Here is the paper, Learning MMSE Optimal Thresholds for FISTA:

Fast iterative shrinkage/thresholding algorithm (FISTA) is one of the most commonly used methods for solving linear inverse problems. In this work, we present a scheme that enables learning of optimal thresholding functions for FISTA from a set of training data. In particular, by relating iterations of FISTA to a deep neural network (DNN), we use the error backpropagation algorithm to find thresholding functions that minimize mean squared error (MSE) of the reconstruction for a given statistical distribution of data. Accordingly, the scheme can be used to computationally obtain MSE optimal variant of FISTA for performing statistical estimation.
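
For reference, here is a plain FISTA loop with the thresholding function passed in as an argument. The paper's contribution is to replace the fixed soft threshold below with a parameterized nonlinearity trained by backpropagation to minimize MSE; this generic sketch (not the authors' code) just shows where that learned shrinkage would plug in.

```python
import numpy as np

def soft_threshold(x, lam):
    # Fixed soft threshold used here as a placeholder; the paper *learns*
    # this nonlinearity (a parameterized shrinkage trained by backpropagation).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def fista(A, y, shrink, lam=0.1, n_iter=100):
    """FISTA for min_x 0.5*||y - A x||^2 + the penalty implied by `shrink`."""
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1]); z = x.copy(); t = 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)
        x_new = shrink(z - grad / L, lam / L)     # gradient step + shrinkage
        t_new = (1 + np.sqrt(1 + 4 * t**2)) / 2
        z = x_new + (t - 1) / t_new * (x_new - x) # acceleration (momentum) step
        x, t = x_new, t_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)) / np.sqrt(64)
x_true = np.zeros(128); x_true[rng.choice(128, 8, replace=False)] = rng.standard_normal(8)
y = A @ x_true + 0.01 * rng.standard_normal(64)
x_hat = fista(A, y, soft_threshold)
```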


### Maximal Sparsity with Deep Networks?

Back on the longest night of 2015 (see The Great Convergence in Action: Learning optimal nonlinearities for iterative thresholding algorithms), I asked if Winter was coming to applied mathematics, as one can witness a convergence of people tuning the parameters of iterative solvers with Deep Learning techniques. We have also mentioned this parallel here on Nuit Blanche for a while. Anyway, it is put in better words here:

....Although seemingly unrelated at first glance, the layers of a deep neural network (DNN) can be viewed as iterations of some algorithm that have been unfolded into a network structure (Gregor and LeCun, 2010; Hershey et al., 2014). In particular, iterative thresholding approaches such as those mentioned above typically involve an update rule comprised of a fixed, linear filter followed by a non-linear activation function that promotes sparsity. Consequently, algorithm execution can be interpreted as passing an input through an extremely deep network with constant layer weights (dependent on) at every layer....
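
To make the quoted unfolding concrete: one ISTA iteration for min_x 0.5*||y - Ax||^2 + lambda*||x||_1 is x <- soft(W x + B y) with W = I - A^T A / L and B = A^T / L, so running T iterations is exactly a T-layer network with the same hand-crafted weights at every layer. A minimal sketch (mine, for illustration):

```python
import numpy as np

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista_as_network(A, y, n_layers=20, lam=0.1):
    """Each 'layer' applies the same affine map followed by a shrinkage:
    x <- soft(W x + B y) with W = I - A^T A / L and B = A^T / L, which is
    exactly one ISTA iteration. Untying W, B and the threshold across layers
    and training them on data is what turns this hand-crafted network into a
    learned one (as in Gregor and LeCun's LISTA)."""
    L = np.linalg.norm(A, 2) ** 2
    W = np.eye(A.shape[1]) - A.T @ A / L
    B = A.T / L
    x = np.zeros(A.shape[1])
    for _ in range(n_layers):      # one iteration == one layer with shared weights
        x = soft(W @ x + B @ y, lam / L)
    return x
```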
Here is the paper. My remarks are below:

Maximal Sparsity with Deep Networks? by Bo Xin, Yizhou Wang, Wen Gao, David Wipf

The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal $\ell_0$-norm representations in regimes where existing methods fail. The resulting system is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.
It's fascinating. Three items:
• The paper ought to mention similar efforts such as the one mentioned earlier here.
• While ISTA and related algorithms are interesting, AMP solvers with a similar structure ought to be investigated, as they are expected to require far fewer iterations than traditional FISTA and related algorithms.
• The two earlier papers listed below focused on sparse coding and NMF. It would seem to me that much more insight could be gained from studies of FISTA and related algorithms for compressive sensing, but I can imagine I have a bias on that matter.

Deep Unfolding: Model-Based Inspiration of Novel Deep Architectures by John R. Hershey, Jonathan Le Roux, Felix Weninger

Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, problem domain knowledge can be built into the constraints of the model, typically at the expense of difficulties during inference. In contrast, deterministic deep neural networks are constructed in such a way that inference is straightforward, but their architectures are generic and it is unclear how to incorporate knowledge. This work aims to obtain the advantages of both approaches. To do so, we start with a model-based approach and an associated inference algorithm, and *unfold* the inference iterations as layers in a deep network. Rather than optimizing the original model, we *untie* the model parameters across layers, in order to create a more powerful network. The resulting architecture can be trained discriminatively to perform accurate inference within a fixed network size. We show how this framework allows us to interpret conventional networks as mean-field inference in Markov random fields, and to obtain new architectures by instead using belief propagation as the inference algorithm. We then show its application to a non-negative matrix factorization model that incorporates the problem-domain knowledge that sound sources are additive. Deep unfolding of this model yields a new kind of non-negative deep neural network, that can be trained using a multiplicative backpropagation-style update algorithm. We present speech enhancement experiments showing that our approach is competitive with conventional neural networks despite using far fewer parameters.

Learning Fast Approximations of Sparse Coding by Karol Gregor, Yann LeCun

In Sparse Coding (SC), input vectors are reconstructed using a sparse linear combination of basis vectors. SC has become a popular method for extracting features from data. For a given input, SC minimizes a quadratic reconstruction error with an L1 penalty term on the code. The process is often too slow for applications such as real-time pattern recognition. We proposed two versions of a very fast algorithm that produces approximate estimates of the sparse code that can be used to compute good visual features, or to initialize exact iterative algorithms. The main idea is to train a non-linear, feed-forward predictor with a specific architecture and a fixed depth to produce the best possible approximation of the sparse code. A version of the method, which can be seen as a trainable version of Li and Osher’s coordinate descent method, is shown to produce approximate solutions with 10 times less computation than Li and Osher’s for the same approximation error. Unlike previous proposals for sparse code predictors, the system allows a kind of approximate “explaining away” to take place during inference. The resulting predictor is differentiable and can be included into globally trained recognition systems.


## Wednesday, May 18, 2016

### Performance Trade-Offs in Multi-Processor Approximate Message Passing

Nice ! What if you want to use a fast solver like AMP on a distributed architecture or a sensor network topology? The following paper investigates the trade-offs associated with that.

Performance Trade-Offs in Multi-Processor Approximate Message Passing by Junan Zhu, Ahmad Beirami, Dror Baron

We consider large-scale linear inverse problems in Bayesian settings. Our general approach follows a recent line of work that applies the approximate message passing (AMP) framework in multi-processor (MP) computational systems by storing and processing a subset of rows of the measurement matrix along with corresponding measurements at each MP node. In each MP-AMP iteration, nodes of the MP system and its fusion center exchange lossily compressed messages pertaining to their estimates of the input. There is a trade-off between the physical costs of the reconstruction process including computation time, communication loads, and the reconstruction quality, and it is impossible to simultaneously minimize all the costs. We pose this minimization as a multi-objective optimization problem (MOP), and study the properties of the best trade-offs (Pareto optimality) in this MOP. We prove that the achievable region of this MOP is convex, and conjecture how the combined cost of computation and communication scales with the desired mean squared error. These properties are verified numerically.
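
To get a feel for the setting, here is a much-simplified sketch of AMP with the rows of the measurement matrix partitioned across nodes: each node keeps a local residual and sends its partial correlation to a fusion center, which thresholds and broadcasts the new estimate. The lossy compression of these messages, the very object of the paper's trade-off analysis, is deliberately omitted, and the thresholding rule is a common heuristic rather than the paper's choice:

```python
import numpy as np

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def mp_amp(A_parts, y_parts, n, n_iter=30):
    """AMP with the rows of A split across several nodes. Each node p keeps
    (A_p, y_p) and its local residual z_p, and sends the partial correlation
    A_p^T z_p to a fusion center, which sums them, thresholds, and broadcasts
    the new estimate. The lossy compression of these messages studied in the
    paper is omitted, and the threshold rule is a common heuristic."""
    M = sum(yp.size for yp in y_parts)
    x = np.zeros(n)
    z_parts = [yp.copy() for yp in y_parts]
    for _ in range(n_iter):
        # fusion center: combine the nodes' partial correlations
        r = x + sum(Ap.T @ zp for Ap, zp in zip(A_parts, z_parts))
        tau = np.sqrt(sum(zp @ zp for zp in z_parts) / M)     # noise-level estimate
        x_new = soft(r, tau)
        onsager = np.count_nonzero(x_new) / M                 # mean derivative of soft()
        # nodes: update local residuals, including the Onsager correction
        z_parts = [yp - Ap @ x_new + zp * onsager
                   for Ap, yp, zp in zip(A_parts, y_parts, z_parts)]
        x = x_new
    return x

rng = np.random.default_rng(1)
n, M, P = 200, 100, 4
A = rng.standard_normal((M, n)) / np.sqrt(M)
x_true = np.zeros(n); x_true[rng.choice(n, 10, replace=False)] = 1.0
y = A @ x_true + 0.01 * rng.standard_normal(M)
x_hat = mp_amp(np.array_split(A, P), np.array_split(y, P), n)
```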


### Performance Limits for Noisy Multi-Measurement Vector Problems

Here is some analysis (and attendant phase diagrams) for BP and AMP solvers in the MMV and SMV (CS) cases. As opposed to previous work featuring phase diagrams, this analysis goes into the noisy setting. It's been added to the Advanced Matrix Factorization Jungle page.

Performance Limits for Noisy Multi-Measurement Vector Problems by Junan Zhu, Dror Baron, Florent Krzakala

Compressed sensing (CS) demonstrates that sparse signals can be estimated from under-determined linear systems. Distributed CS (DCS) is an area of CS that further reduces the number of measurements by considering joint sparsity within signal ensembles. DCS with jointly sparse signals has applications in multi-sensor acoustic sensing, magnetic resonance imaging with multiple coils, remote sensing, and array signal processing. Multi-measurement vector (MMV) problems consider the estimation of jointly sparse signals under the DCS framework. Two related MMV settings are studied. In the first setting, each signal vector is measured by a different independent and identically distributed (i.i.d.) measurement matrix, while in the second setting, all signal vectors are measured by the same i.i.d. matrix. Replica analysis is performed for these two MMV settings, and the minimum mean squared error (MMSE), which turns out to be identical for both settings, is obtained as a function of the noise variance and number of measurements. Multiple performance regions for MMV are identified where the MMSE behaves differently as a function of the noise variance and the number of measurements. Belief propagation (BP) is a CS signal estimation framework that often achieves the MMSE asymptotically. A phase transition for BP is identified. This phase transition, verified by numerical results, separates the regions where BP achieves the MMSE and where it is suboptimal. Numerical results also illustrate that more signal vectors in the jointly sparse signal ensemble lead to a better phase transition. To showcase the application of MMV models, the MMSE of complex CS problems with both real and complex measurement matrices is analyzed.
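
The two MMV measurement settings analyzed in the paper are easy to write down. The sketch below simply generates a jointly sparse ensemble and measures it both ways; the dimensions and noise level are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, J, k = 100, 40, 3, 8        # signal length, measurements, vectors, sparsity

# Jointly sparse ensemble: the J signal vectors share one support.
support = rng.choice(n, k, replace=False)
X = np.zeros((n, J)); X[support] = rng.standard_normal((k, J))
noise_std = 0.05

# Setting 1: a different i.i.d. measurement matrix for each signal vector.
Y1 = np.stack([(rng.standard_normal((m, n)) / np.sqrt(m)) @ X[:, j]
               + noise_std * rng.standard_normal(m) for j in range(J)], axis=1)

# Setting 2: the same i.i.d. measurement matrix for every signal vector.
A = rng.standard_normal((m, n)) / np.sqrt(m)
Y2 = A @ X + noise_std * rng.standard_normal((m, J))
```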


## Tuesday, May 17, 2016

### Computing Active Subspaces Efficiently with Gradient Sketching

Computing Active Subspaces Efficiently with Gradient Sketching by Paul G. Constantine, Armin Eftekhari, Michael B. Wakin

Active subspaces are an emerging set of tools for identifying and exploiting the most important directions in the space of a computer simulation's input parameters; these directions depend on the simulation's quantity of interest, which we treat as a function from inputs to outputs. To identify a function's active subspace, one must compute the eigenpairs of a matrix derived from the function's gradient, which presents challenges when the gradient is not available as a subroutine. We numerically study two methods for estimating the necessary eigenpairs using only linear measurements of the function's gradient. In practice, these measurements can be estimated by finite differences using only two function evaluations, regardless of the dimension of the function's input space.
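
The construction behind active subspaces is the matrix C = E[grad f(x) grad f(x)^T], whose leading eigenvectors give the important input directions, and the "two function evaluations" remark refers to approximating a linear measurement v^T grad f(x) by the finite difference (f(x + h v) - f(x)) / h. The sketch below does the naive version, rebuilding full gradients from coordinate-direction measurements before eigendecomposing; the paper's methods instead get by with far fewer random measurements per gradient (sketching). The test function is a toy choice of mine:

```python
import numpy as np

def directional_derivative(f, x, v, h=1e-6):
    """Linear measurement v^T grad f(x) from just two function evaluations."""
    return (f(x + h * v) - f(x)) / h

def active_subspace(f, dim, n_samples=200, seed=0):
    """Estimate the active subspace of f on [-1, 1]^dim by eigendecomposing
    C = E[grad f grad f^T]. Gradients here are rebuilt from `dim` coordinate
    directional derivatives per sample; the paper's point is that far fewer
    *random* linear measurements of each gradient suffice."""
    rng = np.random.default_rng(seed)
    C = np.zeros((dim, dim))
    for _ in range(n_samples):
        x = rng.uniform(-1, 1, dim)
        g = np.array([directional_derivative(f, x, e) for e in np.eye(dim)])
        C += np.outer(g, g) / n_samples
    eigvals, eigvecs = np.linalg.eigh(C)
    return eigvals[::-1], eigvecs[:, ::-1]        # descending order

# Toy quantity of interest that only varies along one direction of a 10-D input.
w = np.arange(1, 11) / 10.0
f = lambda x: np.sin(w @ x)
vals, vecs = active_subspace(f, dim=10)
print(vals[:3])    # a single dominant eigenvalue -> a one-dimensional active subspace
```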
