When working with unstructured data, two considerations matter most. The first is that there is likely to be a large amount of data (possibly tens of terabytes) to load. This requires an infrastructure that can handle large contiguous writes: the hardware must sustain high-speed storage of hundreds of megabytes of data in bursts.
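As a rough illustration of what "large contiguous writes" means at the application level, the following Python sketch loads a file by issuing a small number of large sequential writes rather than many small ones. The function name `bulk_copy` and the 8 MiB chunk size are assumptions chosen for the example; a suitable chunk size depends on the storage hardware in use.

```python
import os

# Assumed value for illustration only; tune for the actual storage system.
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per write


def bulk_copy(src_path, dst_path, chunk_size=CHUNK_SIZE):
    """Copy src_path to dst_path in large sequential chunks.

    Returns the total number of bytes written.
    """
    total = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of input
                break
            dst.write(chunk)  # appended in order, so writes stay sequential
            total += len(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # ask the OS to push the burst to the device
    return total
```

Writing in order and in large chunks keeps the workload sequential, which is the access pattern that storage built for burst ingestion is meant to serve.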
The second is that reading the data will produce large reads, potentially concurrent, at possibly random locations anywhere in the storage. If the access pattern is not random and the frequently requested items are well known, caching those items to improve performance should be considered. The I/O (or network) channel can also become a bottleneck; the best-performing solution is to support multiple channels.
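The read side can be sketched in the same spirit. The Python example below serves concurrent reads at arbitrary offsets and keeps recently requested blocks in an in-memory cache, so well-known items are not fetched from storage repeatedly. The names `read_block` and `read_many`, the cache size, and the worker count are all assumptions for illustration, and worker threads only stand in for multiple I/O channels; they do not add physical bandwidth.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=1024)  # assumed cache size; holds the most recently used blocks
def read_block(path, offset, length):
    """Read `length` bytes at `offset`; repeated requests are served from cache."""
    with open(path, "rb") as f:  # each call uses its own handle, safe across threads
        f.seek(offset)
        return f.read(length)


def read_many(path, requests, workers=4):
    """Serve (offset, length) requests concurrently, returning results in request order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: read_block(path, r[0], r[1]), requests))
```

For example, `read_many("data.bin", [(0, 4096), (1_000_000, 4096)])` would return the two blocks in request order, where `data.bin` is a hypothetical file. A cache like this helps only when some items are requested repeatedly; for truly random access, the number of channels matters more.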