Optimizing data storage and retrieval
When training SOTA DL models, you typically need a large dataset. Storing and retrieving such datasets can be expensive. For instance, the popular computer vision dataset COCO2017 is approximately 30 GB, while the Common Crawl corpus used for NLP tasks spans hundreds of terabytes. Working with data at this scale requires careful consideration of where to store the dataset and how to retrieve it at training or inference time. In this section, we will discuss some of the optimizations you can apply when choosing storage and retrieval strategies.
Choosing a storage solution
When choosing an optimal storage solution, you may consider the following factors, among others:
- The cost of storage and data retrieval
- The latency and throughput requirements for data retrieval
- How the data is partitioned
- How frequently data is refreshed
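Of these factors, retrieval throughput is often the easiest to measure directly. As a minimal sketch, the snippet below benchmarks sequential read throughput for a local file standing in for a dataset shard; the file path, shard size, and chunk size are illustrative choices, not values from any particular dataset, and a real evaluation would run the same measurement against each candidate storage backend (local SSD, network file system, object store, and so on).

```python
import os
import tempfile
import time

def measure_read_throughput(path, chunk_size=1 << 20):
    """Read a file sequentially in chunks and return throughput in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)

# Create a 64 MB dummy file standing in for one dataset shard.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(64 * 1024 * 1024))

print(f"local sequential read: {measure_read_throughput(tmp.name):.0f} MB/s")
os.remove(tmp.name)
```

Comparing this number against the throughput your data loader needs to keep GPUs busy tells you whether a given storage tier is fast enough, or whether you need caching or prefetching in front of it.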
Let’s take a look at the pros and cons of various storage solutions...