## Friday, May 26, 2017

### Relaxed Wasserstein with Applications to GANs

If you are figuring what GANs are about, check on these two Highly Technical Reference pages on the subject. In the meantime:

Relaxed Wasserstein with Applications to GANs by Xin Guo, Johnny Hong, Tianyi Lin, Nan Yang
Generative Adversarial Networks (GANs) provide a versatile class of models for generative modeling. To improve the performance of machine learning models, there has recently been interest in designing objective functions based on Wasserstein distance rather than Jensen-Shannon (JS) divergence. In this paper, we propose a novel asymmetric statistical divergence called Relaxed Wasserstein (RW) divergence as a generalization of Wasserstein-L2 distance of order 2. We show that RW is dominated by Total Variation (TV) and Wasserstein-L2 distance, and establish continuity, differentiability, and duality representation of RW divergence. Finally, we provide a nonasymptotic moment estimate and a concentration inequality for RW divergence. Our experiments show that RWGANs with Kullback-Leibler (KL) divergence produce recognizable images with a ReLU Multi-Layer Perceptron (MLP) generator in fewer iterations, compared to Wasserstein-L1 GAN (WGAN).

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Thursday, May 25, 2017

### A Semismooth Newton Method for Fast, Generic Convex Programming - implmentation -

Eric just mentioned it on his twitter feed:

Due to their generality, conic optimization problems, which can represent most convex optimization problems encountered in practice, have been the focus of much recent work, and additionally form the basis of many convex optimization modeling frameworks. In this paper, we introduce Newton-ADMM, a method for fast conic optimization. The basic idea is to view the residuals of consecutive iterates generated by SCS, a state-of-the-art, iterative conic solver, as a fixed point iteration, and then use a nonsmooth Newton method to find a fixed point. We demonstrate theoretically, by extending the theory of semismooth operators, that Newton-ADMM converges rapidly (i.e., quadratically) to a solution; empirically, Newton-ADMM is significantly faster than SCS, on a number of problems. The method also has essentially no tuning parameters, generates certificates of primal or dual infeasibility, when appropriate, and can be specialized to solve specific convex problems.

The Github repository is here: https://github.com/locuslab/newton_admm

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Wednesday, May 24, 2017

### The High-Dimensional Geometry of Binary Neural Networks

Ah ! This is interesting.

Recent research has shown that one can train a neural network with binary weights and activations at train time by augmenting the weights with a high-precision continuous latent variable that accumulates small changes from stochastic gradient descent. However, there is a dearth of theoretical analysis to explain why we can effectively capture the features in our data with binary weights and activations. Our main result is that the neural networks with binary weights and activations trained using the method of Courbariaux, Hubara et al. (2016) work because of the high-dimensional geometry of binary vectors. In particular, the ideal continuous vectors that extract out features in the intermediate representations of these BNNs are well-approximated by binary vectors in the sense that dot products are approximately preserved. Compared to previous research that demonstrated the viability of such BNNs, our work explains why these BNNs work in terms of the HD geometry. Our theory serves as a foundation for understanding not only BNNs but a variety of methods that seek to compress traditional neural networks. Furthermore, a better understanding of multilayer binary neural networks serves as a starting point for generalizing BNNs to other neural network architectures such as recurrent neural networks.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### DEvol - Automated deep neural network design with genetic programming

Joe Davison made available an interesting implementation of an automated tool for deep neural network design using genetic programming (h/t François ). From the page:

## DEvol - Deep Neural Network Evolution

DEvol (DeepEvolution) utilizes genetic programming to automatically architect a deep neural network with optimal hyperparameters for a given dataset using the Keras library. This approach should design an equal or superior model to what a human could design when working under the same constraints as are imposed upon the genetic program (e.g., maximum number of layers, maximum number of convolutional filters per layer, etc.). The current setup is designed for classification problems, though this could be extended to include any other output type as well.
See demo.ipynb for a simple example.

### Evolution

Each model is represented as fixed-width genome encoding information about the network's structure. In the current setup, a model contains a number of convolutional layers, a number of dense layers, and an optimzer. The convolutional layers can be evolved to include varying numbers of feature maps, different activation functions, varying proportions of dropout, and whether to perform batch normalization and/or max pooling. The same options are available for the dense layers with the exception of max pooling. The complexity of these models could easily be extended beyond these capabilities to include any parameters included in Keras, allowing the creation of more complex architectures.
Below is a highly simplified visualization of how genetic crossover might take place between two models.
Genetic crossover and mutation of neural networks

### Results

For demonstration, we ran our program on the MNIST dataset (see demo.ipynb for an example setup) with 20 generations and a population size of 50. We allowed the model up to 6 convolutional layers and 4 dense layers (including the softmax layer). The best accuracy we attained with 10 epochs of training under these constraints was 99.4%, which is higher than we were able to achieve when manually constructing our own models under the same constraints. The graphic below displays the running maximum accuracy for all 1000 nets as they evolve over 20 generations.
Keep in mind that these results are obtained with simple, relatively shallow neural networks with no data augmentation, transfer learning, ensembling, fine-tuning, or other optimization techniques. However, virtually any of these methods could be incorporated into the genetic program.
Running max of MNIST accuracies across 20 generations

### Application

The most significant barrier in using DEvol on a real problem is the complexity of the algorithm. Because training neural networks is often such a computationally expensive process, training hundreds or thousands of different models to evaluate the fitness of each is not always feasible. Below are some approaches to combat this issue:
• Early Stopping - There's no need to train a model for 10 epochs if it stops improving after 3; cut it off early.
• Train on Fewer Epochs - Training in a genetic program serves one purpose: to evaulate a model's fitness in relation to other models. It may not be necessary to train to convergence to make this comparison; you may only need 2 or 3 epochs. However, it is important you exercise caution in decreasing training time because doing so could create evolutionary pressure toward simpler models that converge quickly. This creates a trade-off between training time and accuracy which, depending on the application, may or may not be desirable.
• Parameter Selection - The more robust you allow your models to be, the longer it will take to converge; i.e., don't allow horizontal flipping on a character recognition problem even though the genetic program will eventually learn not to include it. The less space the program has to explore, the faster it will arrive at an optimal solution.
For some problems, it may be ideal to simply plug the data into DEvol and let the program build a complete model for you, but for others, this hands-off approach may not be feasible. In either case, DEvol could give you insights into optimal model design that you may not have considered on your own. For the MNIST digit classification problem, we found that ReLU does far better than a sigmoid function in convolutional layers, but they work about equally well in dense layers. We also found that ADAGRAD was the highest-performing prebuilt optimizer and gained insight on the number of nodes to include in each dense layer.
At worst, DEvol could give you insight into improving your model architecture. At best, it could give you a beautiful, finely-tuned model.

### Wanna Try It?

DEvol is pretty straightforward to use for basic classification problems. See demo.ipynb for an example. There are three basic steps:
1. Prep your dataset. DEvol expects a classification problem with labels that are one-hot encoded as it uses categorical_crossentropy for its loss function. Otherwise, you can prep your data however you'd like. Just pass your input shape into GenomeHandler.
1. Create a GenomeHandler. The GenomeHandler defines the constraints that you apply to your models. Specify the maximum number of convolutional and dense layers, the max dense nodes and feature maps, and the input shape. You can also specify whether you'd like to allow batch_normalization, dropout, and max_pooling, which are included by default. You can also pass in a list of optimizers and activation functions you'd like to allow.
1. Create and run the DEvol. Pass your GenomeHandler to the DEvol constructor, and run it. Here you have a few more options such as the number of generations, the population size, epochs used for fitness evaluation, and an (optional) fitness function which converts a model's accuracy into a fitness score.
See demo.ipynb for a basic example.

## Tuesday, May 23, 2017

### Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk

Credit: NASA, h/t Sarah Horst

Vlad just sent me the following:

Hi Igor,

I'm writing regarding my recent paper with Paul Hand. It's about combining principles of compressed sensing with deep generative priors, which has been previously shown empirically to require 10X fewer measurements than traditional CS in certain scenarios. As deep generative priors (such as those obtained via GANs) get better, this may improve the performance of CS and other inverse problems across the board.

In this paper we prove that the non-convex empirical risk objective for enforcing random deep generative priors subject to compressive random linear observations of the activations of the last layer has no spurious local minima, and for a fixed depth, these guarantees hold at order-optimal sample complexity.

Best,

We examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization. In particular we consider two models, one in which the task is to invert a generative neural network given access to its last layer and another which entails recovering a latent code in the domain of a generative neural network from compressive linear observations of its last layer. We establish that in both cases, in suitable regimes of network layer sizes and a randomness assumption on the network weights, that the non-convex objective function given by empirical risk minimization does not have any spurious stationary points. That is, we establish that with high probability, at any point away from small neighborhoods around two scalar multiples of the desired solution, there is a descent direction. These results constitute the first theoretical guarantees which establish the favorable global geometry of these non-convex optimization problems, and bridge the gap between the empirical success of deep learning and a rigorous understanding of non-linear inverse problems.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

### A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems (and a Highly Technical Reference page on Neural Networks in silicon.)

So last night I was talking to Thomas at the STMicroelectronics Techno day at Opera de Paris. He was featuring a recent architecture they are designing and presented at the last ISSC conference.

A 2.9 TOPS/W DeepConvolutional Neural NetworkSoC in FD-SOI 28nm forIntelligent Embedded Systems by Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinder-pal Singh, Elio Guidetti, Fabio De Ambroggi, Tommaso Majo, Paolo Zambotti, Manuj Ayodhyawasi, Harvinder Singh, Nalin Aggarwal

I also discovered a Highly Technical Reference page on Neural Networks in Silicon by Fengbin Tu. The page is here: https://github.com/fengbintu/Neural-Networks-on-Silicon

The page has been added to the Highly Technical Reference Page.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Monday, May 22, 2017

### Short and Deep: Sketching and Neural Networks

This is a revised version of an earlier preprint.
Short and Deep: Sketching and Neural Networks by Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar
Data-independent methods for dimensionality reduction such as random projections, sketches, and feature hashing have become increasingly popular in recent years. These methods often seek to reduce dimensionality while preserving the hypothesis class, resulting in inherent lower bounds on the size of projected data. For example, preserving linear separability requires Ω(1/γ2 ) dimensions, where γ is the margin, and in the case of polynomial functions, the number of required dimensions has an exponential dependence on the polynomial degree. Despite these limitations, we show that the dimensionality can be reduced further while maintaining performance guarantees, using improper learning with a slightly larger hypothesis class. In particular, we show that any sparse polynomial function of a sparse binary vector can be computed from a compact sketch by a single-layer neural network, where the sketch size has a logarithmic dependence on the polynomial degree. A practical consequence is that networks trained on sketched data are compact, and therefore suitable for settings with memory and power constraints. We empirically show that our approach leads to networks with fewer parameters than related methods such as feature hashing, at equal or better performance.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Friday, May 19, 2017

### DeepArchitect: Automatically Designing and Training Deep Architectures - implementation -

So now that Learning to Learn is becoming more trendy thanks to Sundar Pichai, we now need better methods to look around for models, because who has 800 GPUs [1] ! Here is one with an implementation, woohoo !

In deep learning, performance is strongly affected by the choice of architecture and hyperparameters. While there has been extensive work on automatic hyperparameter optimization for simple spaces, complex spaces such as the space of deep architectures remain largely unexplored. As a result, the choice of architecture is done manually by the human expert through a slow trial and error process guided mainly by intuition. In this paper we describe a framework for automatically designing and training deep models. We propose an extensible and modular language that allows the human expert to compactly represent complex search spaces over architectures and their hyperparameters. The resulting search spaces are tree-structured and therefore easy to traverse. Models can be automatically compiled to computational graphs once values for all hyperparameters have been chosen. We can leverage the structure of the search space to introduce different model search algorithms, such as random search, Monte Carlo tree search (MCTS), and sequential model-based optimization (SMBO). We present experiments comparing the different algorithms on CIFAR-10 and show that MCTS and SMBO outperform random search. In addition, these experiments show that our framework can be used effectively for model discovery, as it is possible to describe expressive search spaces and discover competitive models without much effort from the human expert. Code for our framework and experiments has been made publicly available.

An implementation of DeepArchitect is here: https://github.com/negrinho/deep_architect

[1] ICLR video: Neural Architecture Search with Reinforcement Learning Video starts at 1:08:44 see Learning the structure of learning.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

### Soft Recovery With General Atomic Norms

This paper describes a dual certificate condition on a linear measurement operator A (defined on a Hilbert space H and having finite-dimensional range) which guarantees that an atomic norm minimization, in a certain sense, will be able to approximately recover a structured signal v0∈H from measurements Av0. Put very streamlined, the condition implies that peaks in a sparse decomposition of v0 are close the the support of the atomic decomposition of the solution v∗. The condition applies in a relatively general context - in particular, the space H can be infinite-dimensional. The abstract framework is applied to several concrete examples, one example being super-resolution. In this process, several novel results which are interesting on its own are obtained.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Thursday, May 18, 2017

### Guaranteed recovery of quantum processes from few measurements / Improving compressed sensing with the diamond norm

Jens mentioned this a while back and I am just playing catch-up on this topic before the upcoming massive ArXiv NIPS2017 release. Enjoy !

Guaranteed recovery of quantum processes from few measurements by Martin Kliesch, Richard Kueng, Jens Eisert, David Gross

Quantum process tomography is the task of reconstructing unknown quantum channels from measured data. In this work, we introduce compressed sensing-based methods that facilitate the reconstruction of quantum channels of low Kraus rank. Our main contribution is the analysis of a natural measurement model for this task: We assume that data is obtained by sending pure states into the channel and measuring expectation values on the output. Neither ancilla systems nor coherent operations across multiple channel uses are required. Most previous results on compressed process reconstruction reduce the problem to quantum state tomography on the channel's Choi matrix. While this ansatz yields recovery guarantees from an essentially minimal number of measurements, physical implementations of such schemes would typically involve ancilla systems. A priori, it is unclear whether a measurement model tailored directly to quantum process tomography might require more measurements. We establish that this is not the case. Technically, we prove recovery guarantees for three different reconstruction algorithms. The reconstructions are based on a trace, diamond, and ℓ2-norm minimization, respectively. Our recovery guarantees are uniform in the sense that with one random choice of measurement settings all quantum channels can be recovered equally well. Moreover, stability against arbitrary measurement noise and robustness against violations of the low-rank assumption is guaranteed. Numerical studies demonstrate the feasibility of the approach.

Improving compressed sensing with the diamond norm by Martin Kliesch, Richard Kueng, Jens Eisert, David Gross
In low-rank matrix recovery, one aims to reconstruct a low-rank matrix from a minimal number of linear measurements. Within the paradigm of compressed sensing, this is made computationally efficient by minimizing the nuclear norm as a convex surrogate for rank.
In this work, we identify an improved regularizer based on the so-called diamond norm, a concept imported from quantum information theory. We show that -for a class of matrices saturating a certain norm inequality- the descent cone of the diamond norm is contained in that of the nuclear norm. This suggests superior reconstruction properties for these matrices. We explicitly characterize this set of matrices. Moreover, we demonstrate numerically that the diamond norm indeed outperforms the nuclear norm in a number of relevant applications: These include signal analysis tasks such as blind matrix deconvolution or the retrieval of certain unitary basis changes, as well as the quantum information problem of process tomography with random measurements.
The diamond norm is defined for matrices that can be interpreted as order-4 tensors and it turns out that the above condition depends crucially on that tensorial structure. In this sense, this work touches on an aspect of the notoriously difficult tensor completion problem.

Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.