Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
This paper asks whether training an action classification network on a
sufficiently large video dataset gives a boost in performance when the network
is applied to a different temporal task or dataset, similar to the boost that
ImageNet pre-training gives image tasks.
To answer this, the paper uses the Kinetics Human Action Video Dataset, which
is two orders of magnitude larger than previous datasets.
Action Classification Architectures
Some of the major differences in current video architectures are whether the
convolutional and layer operators use 2D or 3D kernels; whether the input to
the network is just an RGB video or also includes pre-computed optical flow;
and, in the case of 2D ConvNets, how information is propagated across frames,
which can be done either with temporally recurrent layers such as LSTMs or
with feature aggregation over time.
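As a rough illustration of the "feature aggregation over time" option, the sketch below averages per-frame class scores from a 2D ConvNet into one clip-level prediction. The function name `average_over_time` and the toy scores are hypothetical, and plain averaging is only one possible aggregation, not necessarily the one used by any model in the paper.

```python
def average_over_time(per_frame_scores):
    """Average per-frame class scores into one clip-level score vector.

    per_frame_scores: list of frames, each a list of class scores
    (hypothetical output of a 2D ConvNet applied frame by frame).
    """
    if not per_frame_scores:
        raise ValueError("need at least one frame")
    num_frames = len(per_frame_scores)
    num_classes = len(per_frame_scores[0])
    # Frame order is ignored here: averaging treats the clip as a bag of frames.
    return [
        sum(frame[c] for frame in per_frame_scores) / num_frames
        for c in range(num_classes)
    ]


# Toy example: three frames, two classes.
frame_scores = [
    [0.2, 0.8],
    [0.4, 0.6],
    [0.6, 0.4],
]
clip_scores = average_over_time(frame_scores)
predicted_class = max(range(len(clip_scores)), key=lambda c: clip_scores[c])
```

Because averaging discards frame order, this kind of aggregation cannot tell apart actions that differ only in their temporal direction, which is one motivation for recurrent layers or 3D kernels.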
This paper considers ConvNets with LSTMs on top and two-stream networks
with two different types of stream fusion. It also considers a 3D ConvNet,
C3D, and introduces a new architecture, Two-Stream Inflated 3D ConvNets
(I3D).
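These notes do not spell out what "inflated" means. In the I3D paper, a 2D filter is turned into a 3D one by repeating its weights along the time dimension and dividing by the number of repeats, so that a video made of one repeated frame produces the same response as the original 2D filter on that frame. The sketch below shows this on a single toy kernel; the helper names and the numbers are hypothetical, and it is not the paper's implementation.

```python
def inflate_kernel(kernel_2d, time_depth):
    """Repeat a 2D kernel time_depth times along time, scaled by 1/time_depth."""
    if time_depth < 1:
        raise ValueError("time_depth must be at least 1")
    return [
        [[w / time_depth for w in row] for row in kernel_2d]
        for _ in range(time_depth)
    ]


def response_2d(kernel_2d, patch):
    """Response of a 2D kernel on one image patch of the same size."""
    return sum(
        w * x
        for k_row, p_row in zip(kernel_2d, patch)
        for w, x in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Response of a 3D kernel on a clip (list of patches) of the same size."""
    return sum(response_2d(k, p) for k, p in zip(kernel_3d, clip))


# Toy 2x2 kernel and patch; the clip repeats the same patch four times.
kernel = [[1.0, 0.0], [0.0, -1.0]]
patch = [[3.0, 1.0], [2.0, 5.0]]
inflated = inflate_kernel(kernel, 4)
static_clip = [patch] * 4

image_response = response_2d(kernel, patch)
video_response = response_3d(inflated, static_clip)
```

On the static clip the two responses agree, which is what lets an ImageNet pre-trained 2D model serve as the starting point for the 3D network.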
Many of these models (all but C3D) have an ImageNet pre-trained model as a
sub-component. The paper's experimental strategy assumes a common ImageNet
pre-trained image classification architecture as a backbone.