Deep Scene Image Classification with the MFAFVNet

Mandar Dixit


The problem of transferring a deep convolutional network trained for object recognition to the task of scene image classification is considered. An embedded implementation of the recently proposed mixture of factor analyzers Fisher vector (MFA-FV) is proposed. This enables the design of a network architecture, the MFAFVNet, that can be trained in an end-to-end manner. The new architecture involves the design of an MFA-FV layer that implements a statistically correct version of the MFA-FV, through a combination of network computations and regularization. When compared to previous neural implementations of Fisher vectors, the MFAFVNet relies on a more powerful statistical model and a more accurate implementation. When compared to previous non-embedded models, the MFAFVNet relies on a state-of-the-art model, which is now embedded into a CNN. This enables end-to-end training. Experiments show that the MFAFVNet has state-of-the-art performance on scene classification.


Published in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.






Architecture: Architecture of the MFAFV network. The input image first goes through a feature extractor, e.g. VGG or ResNet. ROI pooling is then applied to derive features of different sizes. Finally, an MFA-FV layer implements a trainable MFA operation for scene recognition.


Architecture: Architecture of the MFA-FV layer, which implements the MFA-based Fisher vector inside a neural network.


The key innovation of this paper is to implement a mixture of factor analyzers (MFA) as part of a neural network. The MFA-FV layer shown above is built from a few matrix operations together with some approximations.


  1. Embedding the MFA implementation into a neural network enables end-to-end training across the feature extractor and the MFA. The feature extractor can therefore be optimized during training, unlike prior works that use fixed, pre-extracted features.
  2. The MFA itself can be further improved by the joint training process. The MFA has a large number of trainable parameters, such as those in the covariance matrices, so it can be refined and adapted to the newly generated features.
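
To make the "few matrix operations" behind an MFA more concrete, here is a minimal NumPy sketch of one quantity such a model needs: the posterior probability of each mixture component given a feature vector. It is a sketch of the generic textbook MFA, in which each component has a low-rank-plus-diagonal covariance, and is not the paper's MFA-FV layer. The function name `mfa_responsibilities` and all parameter names are hypothetical, chosen only for illustration.

```python
import numpy as np

def mfa_responsibilities(x, weights, means, loadings, noise_var):
    """Posterior p(k | x) under a generic mixture of factor analyzers.

    Illustrative assumption, not the paper's layer. Shapes:
      x:         (N, D)    feature vectors
      weights:   (K,)      mixture weights, summing to 1
      means:     (K, D)    component means
      loadings:  (K, D, R) factor loading matrices
      noise_var: (K, D)    positive diagonal noise variances
    """
    n, d = x.shape
    k_components = weights.shape[0]
    log_joint = np.empty((n, k_components))
    for k in range(k_components):
        # Covariance of component k: low-rank term plus diagonal noise.
        cov = loadings[k] @ loadings[k].T + np.diag(noise_var[k])
        diff = x - means[k]
        _, logdet = np.linalg.slogdet(cov)
        # Solve cov @ z = diff^T instead of forming the explicit inverse.
        solved = np.linalg.solve(cov, diff.T).T
        maha = np.sum(diff * solved, axis=1)
        log_gauss = -0.5 * (maha + logdet + d * np.log(2.0 * np.pi))
        log_joint[:, k] = np.log(weights[k]) + log_gauss
    # Numerically stable softmax over components.
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

Every step here is a differentiable matrix operation, which is what makes it plausible to place a model of this kind inside a network and train its parameters jointly with the feature extractor. A practical layer would avoid the per-component loop and the dense solve, which are kept here only for readability.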



Results on VGG: The gains brought by MFAFVNet on the MIT Indoor and SUN datasets with VGG as the backbone.


Results on AlexNet: The gains brought by MFAFVNet on the MIT Indoor and SUN datasets with AlexNet as the backbone.


Yunsheng Li

UC San Diego