2017-09
LEARNING FINE-GRAINED IMAGE SIMILARITY WITH DEEP RANKING
- describes an efficient sampling technique based on reservoir sampling for building triplets; requires a relevance function
- multi scale CNN
DEEP METRIC LEARNING USING TRIPLET NETWORK
- learns a semantic embedding; results show better discrimination vs siamese network (contrastive loss function)
- MSE on the softmax output shows improved performance compared with a simple binary softmax (see paper for the definition)
- feed a triplet x, x1, x2, where x1 is the same class as x and x2 is a different class (see the sketch below)
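A minimal numpy sketch of the triplet idea; the margin value and the plain hinge-on-squared-L2 form are illustrative assumptions (the paper itself applies a softmax to the two distances and trains with MSE):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer to the anchor
    than the negative by at least `margin` (squared L2 distances)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

# toy embeddings: x (anchor), x1 (same class), x2 (different class)
rng = np.random.default_rng(0)
x  = rng.normal(size=(4, 64))
x1 = x + 0.1 * rng.normal(size=(4, 64))   # close to the anchor
x2 = rng.normal(size=(4, 64))             # unrelated
print(triplet_loss(x, x1, x2))
```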
DISTILLING THE KNOWLEDGE IN A NEURAL NETWORK
- explores compression technique of ensemble model into a single model
- Distillation
- softmax with temperature: q_i = exp(z_i / T) / Σ_j exp(z_j / T), where the z_i are the logits and T is the temperature
- T is usually 1
- increasing T creates softer probability distribution
- knowledge is transferred by training the smaller/compressed model against the softer targets (i.e. temperature T > 1) produced by the more cumbersome model (see the sketch after this list)
- the small model is also trained with the higher T, but uses T = 1 at prediction time
- transfer training can be improved by also using datasets with the true labels
- demonstrates distillation on the MNIST dataset - transfer works well even when the smaller model is trained with certain digits omitted
- discusses using soft distribution target technique for training specialists on very large datasets
- Google internal JFT data of 100M images
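A numpy sketch of the softened softmax and a distillation objective; T = 4 and alpha = 0.5 are illustrative choices, and the exact weighting of the hard-label term is an assumption rather than the paper's recipe:

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax with temperature: q_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, true_onehot,
                      T=4.0, alpha=0.5):
    """Weighted sum of (1) cross-entropy against the teacher's softened
    targets at temperature T and (2) cross-entropy against the true
    labels at T = 1. T and alpha here are illustrative values."""
    soft_teacher = softmax_T(teacher_logits, T)
    soft_student = softmax_T(student_logits, T)
    hard_student = softmax_T(student_logits, 1.0)
    soft_ce = -np.sum(soft_teacher * np.log(soft_student + 1e-12), axis=-1)
    hard_ce = -np.sum(true_onehot * np.log(hard_student + 1e-12), axis=-1)
    # the paper notes soft-target gradients scale as 1/T^2, hence the T**2 factor
    return np.mean(alpha * (T ** 2) * soft_ce + (1 - alpha) * hard_ce)

logits_teacher = np.array([[5.0, 1.0, -2.0]])
logits_student = np.array([[2.0, 0.5, -1.0]])
y = np.array([[1.0, 0.0, 0.0]])
print(distillation_loss(logits_student, logits_teacher, y))
```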
Questions
- teacher-student model, relation to curriculum learning?
- Feels like specialist training is a separate topic from distillation
REGULARIZING NEURAL NETWORKS BY PENALIZING CONFIDENT OUTPUT DISTRIBUTIONS
- uses the entropy of the output distribution as an extra regularization term (penalizing confident, low-entropy predictions) to improve model generalization (sketch below)
- shows improved performance across wide variety of classification tasks
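A numpy sketch of the confidence penalty as a loss term; beta = 0.1 is an illustrative weight, not a value from the paper:

```python
import numpy as np

def confidence_penalty_loss(probs, true_onehot, beta=0.1):
    """Cross-entropy minus beta * entropy of the predicted distribution,
    so confident (low-entropy) outputs are penalized."""
    eps = 1e-12
    ce = -np.sum(true_onehot * np.log(probs + eps), axis=-1)
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)
    return np.mean(ce - beta * entropy)

probs = np.array([[0.7, 0.2, 0.1]])
y = np.array([[1.0, 0.0, 0.0]])
print(confidence_penalty_loss(probs, y))
```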
Questions
- label smoothing regularization
2017-08
VOCO: TEXT-BASED INSERTION AND REPLACEMENT IN AUDIO NARRATION
keywords: voice conversion, voice synthesis, text-to-speech (t2s)
- uses a short, non-annotated corpus of speech data for text-to-speech synthesis
- builds on CUTE and requires only a small corpus
Questions
LEARNING A PREDICTABLE AND GENERATIVE VECTOR REPRESENTATION FOR OBJECTS
- a generative embedding for 3D objects that also works for 2D images
- autoencoder for the generative part and CNN for predictability
- joint loss from both inputs: image + voxel
- training performed in 3 stages: 1) autoencoder only, 2) CNN with the autoencoder embedding, and 3) joint optimization with a scaled loss function
- experimentally verified the semantic meaning of the 64-D vector by observing the effects of changing only one dimension; compared against PCA to justify the nonlinear representation
LEARNING TO COMPARE IMAGE PATCHES VIA CONVOLUTIONAL NEURAL NETWORKS
keyword: feature extraction
- learn a similarity function entirely from data
- tested 2-channel, siamese, and pseudo-siamese architectures
- applied the technique from [VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION] to break up a large convolutional layer into a stack of smaller conv layers with ReLU in between
- obtained the best results with the 2-channel architecture
Questions
DATA-DRIVEN SYNTHESIS OF SMOKE FLOWS WITH CNN-BASED FEATURE DESCRIPTORS
keyword: low dimension feature descriptor
Questions
DEEP UNFOLDING: MODEL-BASED INSPIRATION OF NOVEL DEEP ARCHITECTURES
- deep unfolding - start with a model-based approach and convert its iterative inference steps into network layers; untie the parameters across layers to obtain a trainable model
Questions
- markov random fields vs CRF
- belief propagation
- variational approximations
AUDIO-DRIVEN FACIAL ANIMATION BY JOINT END-TO-END LEARNING OF POSE AND EMOTION
keywords: e2e, audio, emotion, lip vertex position
- uses a feature vector from a fully connected layer as the emotional representation
- the emotion database is later assigned semantic meaning by playing it back on the target mesh and manually assigning labels
- had to apply specific smoothing and ensembling (same input, results averaged across 2 runs) to obtain temporal stability
- uses PCA to pre-initialize the “bottleneck” layer
- defined a loss function over vertex position, vertex motion, and regularization, to encourage long-term change from emotion and short-term change from audio
Questions
- Source filter model?
- autocorrelation
- deformation transfer
keywords: raw input
- investigated whether a windowed speech waveform (WSW) DNN can be on par with MFCC/MFSC feature-based DNNs
- stacked bottleneck layers showed a marked improvement for WSW and less for MFCC/MFSC - resulting in similar WER
- investigated using trained weight matrix structure as initialization vs random initialization
Questions
SPECTRAL SUBBAND CENTROID FEATURES FOR SPEECH RECOGNITION
- SSC improves baseline recognition performance as supplementary features to cepstral coefficients
- spectral centroid, SSC, etc. are good features to try in addition to MFCC
LCN, CNN, DNN FOR TEXT DEPENDENT SV
- experiment with LCN and CNN as first layer for a DNN text dependent SV system
- shows that computation and parameters can be vastly reduced with CNN/LCN without losing too much performance
- try simpler layers with fewer parameters first
RESNET
keywords: skip layer
- bottleneck design with skip connection
- propagating “raw” information across layers helps with deeper architectures (rough sketch below)
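A rough numpy sketch of the bottleneck-with-skip idea; for brevity the middle 3x3 convolution is replaced by a pointwise transform and batch norm is omitted, so this only illustrates the shape of the computation:

```python
import numpy as np

def pointwise(x, w):
    """1x1 convolution on an NHWC tensor, expressed as a matmul over channels."""
    return x @ w

def bottleneck_block(x, w_reduce, w_mid, w_expand):
    # squeeze channels, transform, expand back (w_mid stands in for the 3x3 conv)
    out = np.maximum(0.0, pointwise(x, w_reduce))
    out = np.maximum(0.0, pointwise(out, w_mid))
    out = pointwise(out, w_expand)
    return np.maximum(0.0, out + x)   # skip connection: add the "raw" input back

x = np.random.randn(1, 8, 8, 64)                    # NHWC feature map
block_out = bottleneck_block(
    x,
    np.random.randn(64, 16) * 0.1,                  # 64 -> 16 channels
    np.random.randn(16, 16) * 0.1,                  # 16 -> 16 (3x3 stand-in)
    np.random.randn(16, 64) * 0.1,                  # 16 -> 64 channels
)
print(block_out.shape)                              # (1, 8, 8, 64)
```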
NETWORK IN A NETWORK
keywords: NIN
keywords: multi-style training, small-footprint models
- trains a frontend system with an EM algorithm on peak power, assuming the signal comes from a mixture of Gaussians
- then applies automatic gain control (AGC) to speech segment
- builds on the DNN keyword system from “Small-footprint keyword spotting using deep neural networks”
ACOUSTIC MODELLING FROM THE SIGNAL DOMAIN USING CNNS
keywords: CNN, raw waveform, statistic extraction layer, Network In Network nonlinearity
- LVCSR
- directly learn the “features” from raw signals
- uses the proposed NIN block to replace RELU and maxpooling
- the proposed DNN is used in combination with the traditional iVector approach
END-TO-END TEXT-DEPENDENT SPEAKER VERIFICATION
keywords: speaker verification, end-to-end training
- training e2e by using accept or reject as the loss function
- LSTM outperforms small footprint DNN but better DNN can improve base performance
- utterance vs frame level modeling - utterance level works better
- uses locally connected layers
Questions
- Is it important for your system to use frame level modeling? Utterance should work?
- What are the locally connected layer advantages?
DEEP NEURAL NETWORK-BASED SPEAKER EMBEDDINGS FOR END-TO-END SPEAKER VERIFICATION
keywords: speaker verification, text-independent
- uses network in a network (NIN) architecture to project an input vector (di) into output vector (do)
- trains directly on distance metric to create neural network that creates more characteristic embedding
- need a large dataset before DNN approach exceeded the ivector baseline
- uses US phone data with 102K speakers
- 2 networks - one for generating the speaker embedding and another for maximizing the log probability of same-speaker pairs
- “curriculum” learning of training on long durations then short durations works better
- Custom objective function with PLDA inspired distance metrics
Questions
- How well do these results generalize across same speaker but different languages?
- DNN + PLDA seems to work better every time?
2017-07
keywords: boosted trees, random trees
- random forests and boosted tree ensembles are the same except in how they are trained
keywords: glove, fasttext, hate speech, GBDT (gradient boosted decision trees)
- network architecture experiment with various embedding
- random embedding
- task specific embedding/feature vector useful as input to downstream system
- (LSTM + Random Embedding + GBDT) is much better than (CNN + Random Embedding + GBDT)
Questions
- what does random embedding mean exactly? Why does it work better?
- How popular is the DNN + GBDT architecture?
A TIME DELAY NEURAL NETWORK ARCHITECTURE FOR EFFICIENT MODELING OF LONG TEMPORAL CONTEXTS
keywords: subsampling, dnn,
- basically just dilated cnn without the convolution - ie fully connected
- [-2,2] means {-2, -1, 0, 1, 2}
- asymmetric input splicing - more frames from the past (up to 16) than from the future (up to 9); splicing even more frames proves detrimental to word recognition accuracy (see the splice sketch below)
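A small numpy sketch of context splicing with and without subsampling; the frame dimensions and the clamping at utterance edges are assumptions for the example:

```python
import numpy as np

def splice(frames, t, offsets):
    """Concatenate feature frames at the given time offsets around t.
    In the TDNN notation, [-2, 2] denotes the full set {-2, -1, 0, 1, 2},
    while the subsampled variant only uses {-2, 2}."""
    T = len(frames)
    idx = [min(max(t + o, 0), T - 1) for o in offsets]  # clamp at edges
    return np.concatenate([frames[i] for i in idx])

frames = np.random.randn(100, 40)                   # 100 frames of 40-dim features
full = splice(frames, t=50, offsets=range(-2, 3))   # {-2, -1, 0, 1, 2}
sub  = splice(frames, t=50, offsets=[-2, 2])        # subsampled
print(full.shape, sub.shape)                        # (200,) (80,)
```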
Questions
- Is TDNN similar to dilation CNN?
- What is the sampling rate
- a form of activation designed for dropout
DEEP SPEAKER FEATURE LEARNING FOR TEXT-INDEPENDENT SPEAKER VERIFICATION
keywords: SR, speaker feature vector
- experiments suggest the speaker feature is a short-time phenomenon; achieves an EER of 7.68% from 300 ms of frames
- uses a 2-layer CNN with maxpooling plus a TDNN with a p-norm activation function
- trained with 5000 speakers
- uses t-SNE to verify that speakers cluster well
Questions
- Is TDNN similar to dilation CNN?
- How will the network perform with ReLU instead?
- Why did keras remove maxout dense layer?
- Is maxout dense basically maxpooling?
A SIMPLE WAY TO INITIALIZE RECURRENT NETWORKS OF RECTIFIED LINEAR UNITS
keywords: identity matrix, rnn, lstm
- using identity matrix initialization for ReLU RNNs works well with little tuning (sketch below)
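A minimal numpy sketch of the identity-matrix initialization for a ReLU RNN; the small Gaussian input weights and the optional scale factor are assumptions for the example:

```python
import numpy as np

def init_irnn(hidden_size, input_size, scale=1.0):
    """IRNN-style initialization: recurrent weights start as a (scaled)
    identity matrix, biases at zero, input weights as small Gaussians."""
    W_hh = scale * np.eye(hidden_size)
    W_xh = 0.001 * np.random.randn(hidden_size, input_size)
    b_h = np.zeros(hidden_size)
    return W_hh, W_xh, b_h

def irnn_step(h, x, W_hh, W_xh, b_h):
    """One ReLU RNN step: h' = relu(W_hh h + W_xh x + b)."""
    return np.maximum(0.0, W_hh @ h + W_xh @ x + b_h)

W_hh, W_xh, b_h = init_irnn(hidden_size=128, input_size=32)
h = np.zeros(128)
print(irnn_step(h, np.random.randn(32), W_hh, W_xh, b_h).shape)   # (128,)
```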
MULTISCALE CONTEXT AGGREGATION BY DILATED CONVOLUTION
keywords: dilated, convolution, segmentation, dense prediction
- aggregates multiscale information without pooling or sampling
- exponential increase of the receptive field with increasing dilation
- F_(i+1) = F_i *_(2^i) k_i, i.e. F_(i+1) is produced by convolving F_i with kernel k_i at dilation 2^i (see the 1-D sketch after this list)
- standard initialization didn't show improvement; uses a form of identity initialization
- applied the context module to a frontend of adapted VGG16 without the last pooling and striding(?) layers
- context module improved semantic segmentation task
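A 1-D numpy sketch of dilated convolution showing how stacking layers with dilations 1, 2, 4 grows the receptive field exponentially (to 2^(i+2) - 1 samples after layer i); the averaging kernel is just for illustration:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """1-D dilated convolution (no padding): kernel taps are spaced
    `dilation` samples apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1
    out_len = len(x) - span + 1
    return np.array([
        sum(kernel[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

x = np.random.randn(64)
k = np.ones(3) / 3.0
# dilations 1, 2, 4 give receptive fields of 3, 7, 15 samples
y = dilated_conv1d(dilated_conv1d(dilated_conv1d(x, k, 1), k, 2), k, 4)
print(len(x), len(y))   # 64 -> 50: each output sees 15 input samples
```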
DENOISING WITH WAVENET
keywords: noncausal dilated convolution, raw signal,
- energy preserving loss based on receptive field rather than a single sample
SQUEEZENET
keywords: deep compression, small model
- efficient model microarchitecture for building small and competitive image recognition models
- less than 5 MB
- uses the Fire module, defined by s1x1, e1x1, e3x3 (the number of 1x1 squeeze filters, 1x1 expand filters, and 3x3 expand filters; sketch below)
- squeeze and expand sub-components
- competitive with 50x fewer parameters than AlexNet
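A numpy sketch of the Fire module's structure; the 3x3 expand convolution is approximated by another pointwise transform, so this only shows how s1x1, e1x1 and e3x3 set the channel counts (the counts below are illustrative):

```python
import numpy as np

def pointwise_relu(x, w):
    """1x1 convolution (as a channel matmul) followed by ReLU."""
    return np.maximum(0.0, x @ w)

def fire_module(x, w_squeeze, w_expand1, w_expand3):
    squeezed = pointwise_relu(x, w_squeeze)          # s1x1 channels
    e1 = pointwise_relu(squeezed, w_expand1)         # e1x1 channels
    e3 = pointwise_relu(squeezed, w_expand3)         # stand-in for the e3x3 3x3 convs
    return np.concatenate([e1, e3], axis=-1)         # e1x1 + e3x3 output channels

x = np.random.randn(1, 16, 16, 96)                   # NHWC input
out = fire_module(
    x,
    np.random.randn(96, 16) * 0.1,                   # s1x1 = 16
    np.random.randn(16, 64) * 0.1,                   # e1x1 = 64
    np.random.randn(16, 64) * 0.1,                   # e3x3 = 64
)
print(out.shape)                                     # (1, 16, 16, 128)
```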
MULTI-SCALE CONTEXT AGGREGATION BY DILATED CONVOLUTIONS
keywords: dilated convolution, segmentation, dense prediction
- using dilated convolution to segment images
- identity initialization needed to improve accuracy
- combination with CRF-RNN improves accuracy further
- experiments suggest no good reason to downsample and then upsample - adds complexity and hurts accuracy
to explore: structured prediction, CRF-RNN
2017-06
LAYER NORMALIZATION
- like batch normalization, but normalizes over the features of a layer instead of over the batch, which makes it straightforward to apply to RNNs (sketch below)
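A numpy sketch of the normalization itself: per-sample statistics over the feature dimension, with a learned gain and bias; epsilon is an assumed stabilizer:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each sample over its feature dimension (not over the
    batch), then apply a learned per-feature gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gain * (x - mu) / (sigma + eps) + bias

h = np.random.randn(8, 128)                 # batch of 8 hidden states
print(layer_norm(h, np.ones(128), np.zeros(128)).shape)   # (8, 128)
```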
ATTENTION IS ALL YOU NEED
keywords: transformer, self attention, multi-head attention, encoder-decoder
- eschews RNN and CNN and uses positional embedding to provide positional information
- composes feed-forward layers exclusively with attention to learn the representation between input and output
- transformer model architecture - see paper for picture
- multi-head attention: several scaled dot-product attentions run in parallel over learned projections (see the sketch below)
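A numpy sketch of the scaled dot-product attention at the core of the model, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; the toy shapes are arbitrary:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

# multi-head attention runs several of these over learned projections of
# Q, K, V, concatenates the results, and projects back to the model dim.
Q = np.random.randn(10, 64)   # 10 query positions, d_k = 64
K = np.random.randn(12, 64)   # 12 key/value positions
V = np.random.randn(12, 64)
print(scaled_dot_product_attention(Q, K, V).shape)   # (10, 64)
```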
Questions
- Is it good for seq2seq tasks?
LABEL SMOOTHING
- a form of regularization that softens the one-hot targets by mixing them with a uniform distribution, discouraging over-confident (low-entropy) predictions (sketch below)
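A numpy sketch of label smoothing as target softening; epsilon = 0.1 is a commonly used value, not something prescribed here:

```python
import numpy as np

def smooth_labels(onehot, epsilon=0.1):
    """Label smoothing: mix the one-hot target with a uniform
    distribution, y_smooth = (1 - epsilon) * y + epsilon / K."""
    K = onehot.shape[-1]
    return (1.0 - epsilon) * onehot + epsilon / K

y = np.eye(5)[2]            # class 2 of 5
print(smooth_labels(y))     # [0.02, 0.02, 0.92, 0.02, 0.02]
```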