The Inception micro-architecture was first introduced by Szegedy et al. in their 2014 paper, Going Deeper with Convolutions.
The goal of the inception module is to act as a multi-level feature extractor by computing 1×1, 3×3, and 5×5 convolutions within the same module of the network—the output of these filters is then stacked along the channel dimension before being fed into the next layer in the network. The original incarnation of this architecture was called GoogLeNet, but subsequent manifestations have simply been called Inception vN, where N refers to the version number put out by Google.
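To make the structure concrete, here is a minimal sketch of an Inception-style block in PyTorch. The branch widths (16, 24, and 8 filters) are illustrative choices, not values from the paper; the point is simply that the 1×1, 3×3, and 5×5 branches run in parallel on the same input and their outputs are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Illustrative Inception-style block: parallel 1x1, 3x3, and 5x5
    convolutions whose outputs are stacked channel-wise. Branch widths
    are hypothetical, chosen only for demonstration."""

    def __init__(self, in_channels):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, 16, kernel_size=1)
        # "Same" padding keeps the spatial dimensions identical across
        # branches so the feature maps can be concatenated.
        self.branch3 = nn.Conv2d(in_channels, 24, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, 8, kernel_size=5, padding=2)

    def forward(self, x):
        # Stack along the channel dimension (dim=1 in NCHW layout).
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x)], dim=1
        )

x = torch.randn(1, 3, 32, 32)          # one 3-channel 32x32 input
out = InceptionModule(3)(x)
print(out.shape)                        # channels: 16 + 24 + 8 = 48
```

Because every branch preserves the spatial size, the output keeps the 32×32 resolution while its channel count becomes the sum of the branch widths.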
You might wonder why we apply different types of convolution to the same input. The answer is that a single convolution, no matter how carefully its kernel size is chosen, cannot always extract enough useful features to perform an accurate classification. Features in an image occur at different scales: a small kernel captures fine local detail, while a larger kernel captures broader structure. By computing several kernel sizes in parallel and letting the network learn how much to rely on each, the Inception module avoids having to commit to a single filter size in advance.