Let's start by considering the example of training a network for image classification. In this case, our data will be a collection of images, each with an associated label. One way we might store this data is in a hierarchy of folders: for each label, we have a folder containing the images belonging to that label:
- Data
  - Person
    - im1.png
  - Cat
    - im2.png
  - Dog
    - im3.png
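To make the layout concrete, here is a minimal sketch of how such a folder-per-label dataset might be enumerated into (image, label) pairs. The helper name `list_samples` and the use of `.png` files are illustrative choices, not part of any particular library:

```python
import tempfile
from pathlib import Path

def list_samples(root):
    """Yield (image_path, label) pairs from a folder-per-label layout."""
    root = Path(root)
    for label_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for img in sorted(label_dir.glob("*.png")):
            yield img, label_dir.name

# Build the example layout from the text in a temporary directory.
root = Path(tempfile.mkdtemp()) / "Data"
for label, name in [("Person", "im1.png"), ("Cat", "im2.png"), ("Dog", "im3.png")]:
    (root / label).mkdir(parents=True, exist_ok=True)
    (root / label / name).touch()

samples = [(p.name, label) for p, label in list_samples(root)]
print(samples)  # [('im2.png', 'Cat'), ('im3.png', 'Dog'), ('im1.png', 'Person')]
```

Note that every call to `list_samples` has to walk the directory tree, and actually training on the data would additionally open each file individually, which is where the overhead discussed below comes from.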
Although this might seem like a simple way to store our data, it has some major drawbacks once the dataset grows large. One big disadvantage appears when we start loading it.
Opening a file is a time-consuming operation, and having to open millions of files, multiple times each, adds a large overhead to training time. On top of this, because our data is split across so many files, it will not be stored in one contiguous block on disk, so the hard drive has to do even more work locating and accessing all of them.
What is the solution? We put them all into a single...
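The sentence above is cut off, but the idea of packing many small records into one container can be sketched as follows. This is an illustrative format of my own (a JSON index followed by concatenated blobs), not the author's actual solution or any standard library format:

```python
import io
import json
import struct
import tempfile
from pathlib import Path

def pack(records, out_path):
    """Pack (key, bytes) records into one file: [header_len][json_index][blobs]."""
    index = {}
    blobs = io.BytesIO()
    for key, data in records:
        index[key] = (blobs.tell(), len(data))  # offset and size within blob area
        blobs.write(data)
    header = json.dumps(index).encode()
    with open(out_path, "wb") as f:
        f.write(struct.pack("<I", len(header)))  # 4-byte little-endian header length
        f.write(header)
        f.write(blobs.getvalue())

def read_record(path, key):
    """Read one record by seeking to its offset, without scanning the whole file."""
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        index = json.loads(f.read(hlen))
        offset, size = index[key]
        f.seek(4 + hlen + offset)
        return f.read(size)

packed = Path(tempfile.mkdtemp()) / "data.pack"
pack([("Cat/im2.png", b"cat-bytes"), ("Dog/im3.png", b"dog-bytes")], packed)
print(read_record(packed, "Dog/im3.png"))  # b'dog-bytes'
```

With this layout, training only ever opens one file, and reading a sample is a single seek plus a read, which addresses both of the costs described above. Real-world formats built on the same principle include HDF5, TFRecord, and WebDataset tar archives.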