2017-09
LEARNING FINE-GRAINED IMAGE SIMILARITY WITH DEEP RANKING
- describes an efficient sampling technique based on reservoir sampling for building triplets; requires a relevance function
- multi scale CNN
DEEP METRIC LEARNING USING TRIPLET NETWORK
- learns a semantic embedding; results show better discrimination vs siamese network (contrastive loss function)
- MSE on the softmax output shows improved performance compared with a simple binary softmax (see paper for the definition)
- feed a triplet x, x1, x2, where x1 is the same class as x and x2 is a different class (see the sketch below)
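A minimal numpy sketch of the triplet idea; the margin value and the plain hinge-on-squared-L2 form are illustrative assumptions (the paper itself applies a softmax to the two distances and trains with MSE):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer to the anchor
    than the negative by at least `margin` (squared L2 distances)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

# toy embeddings: x (anchor), x1 (same class), x2 (different class)
rng = np.random.default_rng(0)
x  = rng.normal(size=(4, 64))
x1 = x + 0.1 * rng.normal(size=(4, 64))   # close to the anchor
x2 = rng.normal(size=(4, 64))             # unrelated
print(triplet_loss(x, x1, x2))
```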
DISTILLING THE KNOWLEDGE IN A NEURAL NETWORK
- explores compression technique of ensemble model into a single model
- Distillation
- softmax with temperature: q_i = exp(z_i / T) / Σ_j exp(z_j / T), where the z_i are the logits and T is the temperature
- T is usually 1
- increasing T creates softer probability distribution
- knowledge is transferred by training the smaller/compressed model against the softer targets (i.e. temperature T > 1) produced by the more cumbersome model (see the sketch after this list)
- the small model is also trained with the higher T, but uses T = 1 at prediction time
- transfer training can be improved by also using datasets with the true labels
- demonstrates distillation on the MNIST dataset - transfer works well even when the smaller model is trained with certain digits omitted
- discusses using soft distribution target technique for training specialists on very large datasets
- Google internal JFT data of 100M images
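A numpy sketch of the softened softmax and a distillation objective; T = 4 and alpha = 0.5 are illustrative choices, and the exact weighting of the hard-label term is an assumption rather than the paper's recipe:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with temperature: q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_onehot,
                      T=4.0, alpha=0.5):
    """Weighted sum of (1) cross-entropy against the teacher's softened
    targets at temperature T and (2) cross-entropy against the true
    labels at T = 1. T and alpha here are illustrative values."""
    soft_teacher = softmax_T(teacher_logits, T)
    soft_student = softmax_T(student_logits, T)
    hard_student = softmax_T(student_logits, 1.0)
    soft_ce = -np.sum(soft_teacher * np.log(soft_student + 1e-12), axis=-1)
    hard_ce = -np.sum(true_onehot * np.log(hard_student + 1e-12), axis=-1)
    # the paper notes soft-target gradients scale as 1/T^2, hence the T**2 factor
    return np.mean(alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce)

logits_teacher = np.array([[5.0, 1.0, -2.0]])
logits_student = np.array([[2.0, 0.5, -1.0]])
y = np.array([[1.0, 0.0, 0.0]])
print(distillation_loss(logits_student, logits_teacher, y))
```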
Questions
- teacher-student model, relation to curriculum learning?
- Feels like specialist training is a separate topic from distillation
REGULARIZING NEURAL NETWORKS BY PENALIZING CONFIDENT OUTPUT DISTRIBUTIONS
- uses the entropy of the output distribution as an extra regularization term (penalizing confident, low-entropy predictions) to improve model generalization (sketch below)
- shows improved performance across wide variety of classification tasks
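A numpy sketch of the confidence penalty as a loss term; beta = 0.1 is an illustrative weight, not a value from the paper:

```python
import numpy as np

def confidence_penalty_loss(probs, true_onehot, beta=0.1):
    """Cross-entropy minus beta * entropy of the predicted distribution,
    so confident (low-entropy) outputs are penalized."""
    eps = 1e-12
    ce = -np.sum(true_onehot * np.log(probs + eps), axis=-1)
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    return np.mean(ce - beta * entropy)

probs = np.array([[0.7, 0.2, 0.1]])
y = np.array([[1.0, 0.0, 0.0]])
print(confidence_penalty_loss(probs, y))
```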
Questions
- label smoothing regularization
2017-08
VOCO: TEXT-BASED INSERTION AND REPLACEMENT IN AUDIO NARRATION
keywords: voice conversion, voice synthesis, text-to-speech (t2s)
- uses a short, non-annotated corpus of speech data for text-to-speech synthesis
- builds on CUTE and requires only a small corpus
Questions
LEARNING A PREDICTABLE AND GENERATIVE VECTOR REPRESENTATION FOR OBJECTS
- a generative embedding for 3D objects that also works for 2D images
- autoencoder for the generative part and CNN for predictability
- joint loss from both inputs: image + voxel
- training performed in 3 stages: 1) autoencoder only, 2) CNN with the autoencoder embedding, and 3) joint optimization with a scaled loss function
- experimentally verified the semantic meaning of the 64-D vector by observing the effects of changing only one dimension; compared against PCA to justify the nonlinear representation
LEARNING TO COMPARE IMAGE PATCHES VIA CONVOLUTIONAL NEURAL NETWORKS
keyword: feature extraction
- learn a similarity function entirely from data
- tested 2-channel, siamese, and pseudo-siamese architectures
- applied the technique from [VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION] to break up a large convolutional layer into a stack of smaller conv layers with ReLU in between
- obtained the best results with the 2-channel architecture
Questions
DATA-DRIVEN SYNTHESIS OF SMOKE FLOWS WITH CNN-BASED FEATURE DESCRIPTORS
keyword: low dimension feature descriptor
Questions
DEEP UNFOLDING: MODEL-BASED INSPIRATION OF NOVEL DEEP ARCHITECTURES
- deep unfolding - start with a model-based approach and convert its iterative inference steps into network layers; untie the parameters across layers to obtain a trainable model
Questions
- markov random fields vs CRF
- belief propagation
- variational approximations
AUDIO-DRIVEN FACIAL ANIMATION BY JOINT END-TO-END LEARNING OF POSE AND EMOTION
keywords: e2e, audio, emotion, lip vertex position
- uses a feature vector from a fully connected layer as the emotional representation
- the emotion database is later assigned semantic meaning by playing it back on the target mesh and manually assigning labels
- had to apply specific smoothing and ensembling (same input, results averaged across 2 runs) to obtain temporal stability
- uses PCA to pre-initialize the “bottleneck” layer
- defined a loss function over vertex position, vertex motion, and regularization, to encourage long-term change from emotion and short-term change from audio
Questions
- Source filter model?
- autocorrelation
- deformation transfer
keywords: raw input
- investigated whether a windowed speech waveform (WSW) DNN can be on par with MFCC/MFSC feature-based DNNs
- stacked bottleneck layers showed a marked improvement for WSW and less for MFCC/MFSC - resulting in similar WER
- investigated using trained weight matrix structure as initialization vs random initialization
Questions
SPECTRAL SUBBAND CENTROID FEATURES FOR SPEECH RECOGNITION
- SSC improves baseline recognition performance as supplementary features to cepstral coefficients
- spectral centroid, SSC, etc. are good features to try in addition to MFCC
LCN, CNN, DNN FOR TEXT DEPENDENT SV
- experiment with LCN and CNN as first layer for a DNN text dependent SV system
- shows that computation and parameters can be vastly reduced with CNN/LCN without losing too much performance
- try simpler layers with fewer parameters first
RESNET
keywords: skip layer
- bottleneck design with skip connection
- propagating “raw” information across layers helps with deeper architectures (rough sketch below)
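A rough numpy sketch of the bottleneck-with-skip idea; for brevity the middle 3x3 convolution is replaced by a pointwise transform and batch norm is omitted, so this only illustrates the shape of the computation:

```python
import numpy as np

def pointwise(x, w):
    """1x1 convolution on an NHWC tensor, expressed as a matmul over channels."""
    return x @ w

def bottleneck_block(x, w_reduce, w_mid, w_expand):
    # squeeze channels, transform, expand back (w_mid stands in for the 3x3 conv)
    out = np.maximum(0.0, pointwise(x, w_reduce))
    out = np.maximum(0.0, pointwise(out, w_mid))
    out = pointwise(out, w_expand)
    return np.maximum(0.0, out + x)   # skip connection: add the "raw" input back

x = np.random.randn(1, 8, 8, 64)                    # NHWC feature map
block_out = bottleneck_block(
    x,
    np.random.randn(64, 16) * 0.1,                  # 64 -> 16 channels
    np.random.randn(16, 16) * 0.1,                  # 16 -> 16 (3x3 stand-in)
    np.random.randn(16, 64) * 0.1,                  # 16 -> 64 channels
)
print(block_out.shape)                              # (1, 8, 8, 64)
```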
NETWORK IN A NETWORK
keywords: NIN
keywords: multi-style training, small-footprint models
- trains a frontend system with an EM algorithm on peak power, assuming the signal comes from a mixture of Gaussians
- then applies automatic gain control (AGC) to speech segment
- builds on the DNN keyword system from “Small-footprint keyword spotting using deep neural networks”
ACOUSTIC MODELLING FROM THE SIGNAL DOMAIN USING CNNS
keywords: CNN, raw waveform, statistic extraction layer, Network In Network nonlinearity
- LVCSR
- directly learn the “features” from raw signals
- uses the proposed NIN block to replace RELU and maxpooling
- the proposed DNN is used in combination with the traditional iVector approach
END-TO-END TEXT-DEPENDENT SPEAKER VERIFICATION
keywords: speaker verification, end-to-end training
- training e2e by using accept or reject as the loss function
- LSTM outperforms small footprint DNN but better DNN can improve base performance
- utterance vs frame level modeling - utterance level works better
- uses locally connected layers
Questions
- Is it important for your system to use frame level modeling? Utterance should work?
- What are the locally connected layer advantages?
DEEP NEURAL NETWORK-BASED SPEAKER EMBEDDINGS FOR END-TO-END SPEAKER VERIFICATION
keywords: speaker verification, text-independent
- uses network in a network (NIN) architecture to project an input vector (di) into output vector (do)
- trains directly on distance metric to create neural network that creates more characteristic embedding
- need a large dataset before DNN approach exceeded the ivector baseline
- uses US phone data with 102K speakers
- 2 networks - one for generating the speaker embedding and another for maximizing the log probability of same-speaker pairs
- “curriculum” learning of training on long durations then short durations works better
- Custom objective function with PLDA inspired distance metrics
Questions
- How well do these results generalize across same speaker but different languages?
- DNN + PLDA seems to work better every time?
2017-07
keywords: boosted trees, random trees
- random forests and boosted tree ensembles are the same except in how they are trained
keywords: glove, fasttext, hate speech, GBDT (gradient boosted decision trees)
- network architecture experiment with various embedding
- random embedding
- task specific embedding/feature vector useful as input to downstream system
- (LSTM + Random Embedding + GBDT) is much better than (CNN + Random Embedding + GBDT)
Questions
- what does random embedding mean exactly? Why does it work better?
- How popular is the DNN + GBDT architecture?
A TIME DELAY NEURAL NETWORK ARCHITECTURE FOR EFFICIENT MODELING OF LONG TEMPORAL CONTEXTS
keywords: subsampling, dnn,
- basically just dilated cnn without the convolution - ie fully connected
- [-2,2] means {-2, -1, 0, 1, 2}
- asymmetric input splicing - more frames from the past (up to 16) than from the future (up to 9); splicing even more frames proves detrimental to word recognition accuracy (see the splice sketch below)
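A small numpy sketch of context splicing with and without subsampling; the frame dimensions and the clamping at utterance edges are assumptions for the example:

```python
import numpy as np

def splice(frames, t, offsets):
    """Concatenate feature frames at the given time offsets around t.
    In the TDNN notation, [-2, 2] denotes the full set {-2, -1, 0, 1, 2},
    while the subsampled variant only uses {-2, 2}."""
    T = len(frames)
    idx = [min(max(t + o, 0), T - 1) for o in offsets]  # clamp at edges
    return np.concatenate([frames[i] for i in idx])

frames = np.random.randn(100, 40)                   # 100 frames of 40-dim features
full = splice(frames, t=50, offsets=range(-2, 3))   # {-2, -1, 0, 1, 2}
sub  = splice(frames, t=50, offsets=[-2, 2])        # subsampled
print(full.shape, sub.shape)                        # (200,) (80,)
```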
Questions
- Is TDNN similar to dilation CNN?
- What is the sampling rate
- a form of activation designed for dropout
DEEP SPEAKER FEATURE LEARNING FOR TEXT-INDEPENDENT SPEAKER VERIFICATION
keywords: SR, speaker feature vector
- experiments suggest the speaker feature is a short-time phenomenon; achieves an EER of 7.68% from 300 ms of frames
- uses a 2-layer CNN with maxpooling plus a TDNN with a p-norm activation function
- trained with 5000 speakers
- uses t-SNE to verify that speakers cluster well
Questions
- Is TDNN similar to dilation CNN?
- How will the network perform with ReLU instead?
- Why did keras remove maxout dense layer?
- Is maxout dense basically maxpooling?
A SIMPLE WAY TO INITIALIZE RECURRENT NETWORKS OF RECTIFIED LINEAR UNITS
keywords: identity matrix, rnn, lstm
- using identity matrix initialization for ReLU RNNs works well with little tuning (sketch below)
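A minimal numpy sketch of the identity-matrix initialization for a ReLU RNN; the small Gaussian input weights and the optional scale factor are assumptions for the example:

```python
import numpy as np

def init_irnn(hidden_size, input_size, scale=1.0):
    """IRNN-style initialization: recurrent weights start as a (scaled)
    identity matrix, biases at zero, input weights as small Gaussians."""
    W_hh = scale * np.eye(hidden_size)
    W_xh = 0.001 * np.random.randn(hidden_size, input_size)
    b_h = np.zeros(hidden_size)
    return W_hh, W_xh, b_h

def irnn_step(h, x, W_hh, W_xh, b_h):
    """One ReLU RNN step: h' = relu(W_hh h + W_xh x + b)."""
    return np.maximum(0.0, W_hh @ h + W_xh @ x + b_h)

W_hh, W_xh, b_h = init_irnn(hidden_size=128, input_size=32)
h = np.zeros(128)
print(irnn_step(h, np.random.randn(32), W_hh, W_xh, b_h).shape)   # (128,)
```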
MULTISCALE CONTEXT AGGREGATION BY DILATED CONVOLUTION
keywords: dilated, convolution, segmentation, dense prediction
- aggregates multiscale information without pooling or sampling
- exponential increase of the receptive field with increasing dilation
- F_(i+1) = F_i *_(2^i) k_i, i.e. F_(i+1) is produced by convolving F_i with kernel k_i at dilation 2^i (see the 1-D sketch after this list)
- standard initialization didn't show improvement; uses a form of identity initialization
- applied the context module to a frontend of adapted VGG16 without the last pooling and striding(?) layers
- context module improved semantic segmentation task
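A 1-D numpy sketch of dilated convolution showing how stacking layers with dilations 1, 2, 4 grows the receptive field exponentially (to 2^(i+2) - 1 samples after layer i); the averaging kernel is just for illustration:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution (no padding): kernel taps are spaced
    `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    out_len = len(x) - span + 1
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

x = np.random.randn(64)
k = np.ones(3) / 3.0
# dilations 1, 2, 4 give receptive fields of 3, 7, 15 samples
y = dilated_conv1d(dilated_conv1d(dilated_conv1d(x, k, 1), k, 2), k, 4)
print(len(x), len(y))   # 64 -> 50: each output sees 15 input samples
```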
DENOISING WITH WAVENET
keywords: noncausal dilated convolution, raw signal,
- energy preserving loss based on receptive field rather than a single sample
SQUEEZENET
keywords: deep compression, small model
- efficient model microarchitecture for building small and competitive image recognition models
- less than 5 MB
- uses the Fire module, defined by s1x1, e1x1, e3x3 (the number of 1x1 squeeze filters, 1x1 expand filters, and 3x3 expand filters; sketch below)
- squeeze and expand sub-components
- competitive with 50x fewer parameters than AlexNet
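A numpy sketch of the Fire module's structure; the 3x3 expand convolution is approximated by another pointwise transform, so this only shows how s1x1, e1x1 and e3x3 set the channel counts (the counts below are illustrative):

```python
import numpy as np

def pointwise_relu(x, w):
    """1x1 convolution (as a channel matmul) followed by ReLU."""
    return np.maximum(0.0, x @ w)

def fire_module(x, w_squeeze, w_expand1, w_expand3):
    squeezed = pointwise_relu(x, w_squeeze)          # s1x1 channels
    e1 = pointwise_relu(squeezed, w_expand1)         # e1x1 channels
    e3 = pointwise_relu(squeezed, w_expand3)         # stand-in for the e3x3 3x3 convs
    return np.concatenate([e1, e3], axis=-1)         # e1x1 + e3x3 output channels

x = np.random.randn(1, 16, 16, 96)                   # NHWC input
out = fire_module(
    x,
    np.random.randn(96, 16) * 0.1,                   # s1x1 = 16
    np.random.randn(16, 64) * 0.1,                   # e1x1 = 64
    np.random.randn(16, 64) * 0.1,                   # e3x3 = 64
)
print(out.shape)                                     # (1, 16, 16, 128)
```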
MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS
keywords: dilated convolution, segmentation, dense prediction
- using dilated convolution to segment images
- identity initialization needed to improve accuracy
- combination with CRF-RNN improves accuracy further
- experiments suggest no good reason to downsample and then upsample - adds complexity and hurts accuracy
to explore: structured prediction, CRF-RNN
2017-06
LAYER NORMALIZATION
- like batch normalization, but normalizes over the features of a layer instead of over the batch, which makes it straightforward to apply to RNNs (sketch below)
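A numpy sketch of the normalization itself: per-sample statistics over the feature dimension, with a learned gain and bias; epsilon is an assumed stabilizer:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each sample over its feature dimension (not over the
    batch), then apply a learned per-feature gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gain * (x - mu) / (sigma + eps) + bias

h = np.random.randn(8, 128)                 # batch of 8 hidden states
print(layer_norm(h, np.ones(128), np.zeros(128)).shape)   # (8, 128)
```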
ATTENTION IS ALL YOU NEED
keywords: transformer, self attention, multi-head attention, encoder-decoder
- eschews RNN and CNN and uses positional embedding to provide positional information
- composes feed-forward layers exclusively with attention to learn the representation between input and output
- transformer model architecture - see paper for picture
- multi-head attention: several scaled dot-product attentions run in parallel over learned projections (see the sketch below)
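A numpy sketch of the scaled dot-product attention at the core of the model, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; the toy shapes are arbitrary:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

# multi-head attention runs several of these over learned projections of
# Q, K, V, concatenates the results, and projects back to the model dim.
Q = np.random.randn(10, 64)   # 10 query positions, d_k = 64
K = np.random.randn(12, 64)   # 12 key/value positions
V = np.random.randn(12, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # (10, 64)
```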
Questions
- Is it good for seq2seq tasks?
LABEL SMOOTHING
- a form of regularization that softens the one-hot targets by mixing them with a uniform distribution, discouraging over-confident (low-entropy) predictions (sketch below)
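A numpy sketch of label smoothing as target softening; epsilon = 0.1 is a commonly used value, not something prescribed here:

```python
import numpy as np

def smooth_labels(onehot, epsilon=0.1):
    """Label smoothing: mix the one-hot target with a uniform
    distribution, y_smooth = (1 - epsilon) * y + epsilon / K."""
    K = onehot.shape[-1]
    return (1.0 - epsilon) * onehot + epsilon / K

y = np.eye(5)[2]            # class 2 of 5
print(smooth_labels(y))     # [0.02, 0.02, 0.92, 0.02, 0.02]
```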