Object detection is a different problem to localization as we can have a variable number of objects in the image. Consequently it becomes very tricky to handle variable number of outputs if we consider detection as just a simple regression problem like we did for localization. Therefore we consider detection as a classification problem instead.
One very common approach that has been in use for a long time is to do object detection using sliding windows. The idea is to slide a window of fixed size across the input image. What is inside the window at each location is then sent to a classifier that will tell us if the window contains an object of interest or not.
For this purpose, one can first train a CNN classifier with small closely cropped images - resized to the same size as the window - of objects we want to detect e.g. cars. At test time our fixed size window is moved in a sliding fashion across the whole image that we want to detect...