The datasets of many scientific applications range from several megabytes (MB) to a few gigabytes (GB). For some applications, however, the datasets are huge and may span up to a couple of petabytes (PB). Most of us have an intuitive feel for MB and GB, so let's get a sense of the scale of a petabyte. Suppose we stored one petabyte of data on compact discs (CDs) and stacked the discs on top of one another. The resulting stack would be approximately 1.75 kilometers tall. Thanks to recent advances in networking and distributed computing technologies, a number of applications now process datasets of several petabytes. To process such large-scale datasets efficiently, a number of options are available at every level of the software and hardware stack.
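The CD-stack figure is easy to verify with back-of-the-envelope arithmetic. The sketch below assumes typical CD characteristics not stated in the text (roughly 700 MB of capacity and 1.2 mm of thickness per disc) and the decimal definition of a petabyte:

```python
# Back-of-the-envelope check of the CD-stack analogy for one petabyte.
# Assumed figures (not from the text): a CD holds ~700 MB and is ~1.2 mm thick.

PETABYTE = 10**15          # bytes (decimal definition: 10^15)
CD_CAPACITY = 700 * 10**6  # bytes per CD (assumed)
CD_THICKNESS_MM = 1.2      # millimeters per disc (assumed)

num_cds = PETABYTE / CD_CAPACITY
stack_height_km = num_cds * CD_THICKNESS_MM / 1_000_000  # mm -> km

print(f"{num_cds:,.0f} CDs, stack height ~{stack_height_km:.2f} km")
```

Under these assumptions the stack comes out to roughly 1.4 million discs and about 1.7 km of height, in line with the ~1.75 km figure quoted above.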
Several efficient frameworks exist for processing datasets of all scales. Depending on the infrastructure provided, these frameworks can process small-, medium-, or large-scale data with equal efficiency...