Research has shown that features extracted from convolutional networks trained on ImageNet outperform conventional hand-crafted descriptors such as SURF, Deformable Part Descriptors (DPDs), Histograms of Oriented Gradients (HOG), and bag of words (BoW). This means that convolutional features can be used wherever these conventional visual representations work, the only drawback being that deeper architectures may take longer to extract the features.
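The recipe implied above is to freeze the pretrained network, treat its activations as fixed feature vectors, and train a simple classifier on top. The following is a minimal sketch of that pipeline: `extract_features` is a hypothetical stand-in (here it just flattens each image), whereas real code would return the penultimate-layer activations of an ImageNet-pretrained model such as a torchvision ResNet; the classifier is a nearest-centroid rule standing in for any conventional classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(images):
    """Hypothetical stand-in for a pretrained CNN feature extractor.
    A real implementation would run a forward pass through an
    ImageNet-pretrained network and return one activation vector
    per image; here we simply flatten each image."""
    return np.stack([img.ravel() for img in images])

# Two toy classes: dark images versus bright images.
dark = [rng.normal(0.0, 0.1, size=(8, 8)) for _ in range(20)]
bright = [rng.normal(1.0, 0.1, size=(8, 8)) for _ in range(20)]
X = extract_features(dark + bright)
y = np.array([0] * 20 + [1] * 20)

# Train a simple classifier (nearest centroid) on the frozen features,
# exactly as one would on top of convolutional activations.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(feats):
    # Squared Euclidean distance to each class centroid.
    d = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

acc = (predict(X) == y).mean()
print(acc)
```

Swapping the stand-in extractor for genuine convolutional features changes nothing downstream, which is precisely why they serve as drop-in replacements for SURF- or HOG-style descriptors.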
When a deep convolutional neural network is trained on ImageNet, visualizing the convolution filters of the first layers (refer to the following illustration) shows that they learn low-level features similar to edge-detection filters, while the filters of the last layers learn high-level features that capture class-specific information. Hence, if we extract the features for ImageNet after the first pooling layer and embed them into a 2D space (using, for example, t-SNE), the visualization...
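The claim that first-layer filters resemble edge detectors can be made concrete with a hand-written kernel. The sketch below, using only NumPy, applies a Sobel-style vertical-edge filter via the same sliding-window operation a convolution layer performs; the image and kernel are illustrative choices, not taken from any trained network.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a
    convolution layer (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Sobel-style vertical-edge kernel -- the kind of low-level filter
# the first convolution layers tend to learn during ImageNet training.
sobel_x = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])

# Synthetic image: dark left half, bright right half (one vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0

response = conv2d(image, sobel_x)
print(response.max())         # strong activation along the edge
print(response[:, :2].max())  # flat regions produce zero response
```

The filter fires only where intensity changes horizontally, which is exactly the behavior observed when first-layer filters of a trained network are visualized.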