Many data analysis problems follow a processing pattern known as split-apply-combine. In this pattern, data is analyzed in three steps:
A data set is split into smaller pieces
Each of these pieces is operated upon independently
All of the results are combined back together and presented as a single unit
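The three steps above can be sketched in pandas as follows. This is a minimal illustration, not a definitive implementation: the `key` and `value` column names and the sample numbers are made up for the example, and the steps are spelled out one at a time only to make each stage visible.

```python
import pandas as pd

# Hypothetical sample data: a grouping key and a numeric value per row.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "c"],
    "value": [1, 2, 3, 4, 5],
})

# Split: partition the rows into one piece per distinct key.
pieces = {key: group for key, group in df.groupby("key")}

# Apply: sum the values of each piece independently of the others.
partial_sums = {key: int(group["value"].sum()) for key, group in pieces.items()}

# Combine: gather the per-piece results into a single Series.
combined = pd.Series(partial_sums, name="value")

# The same three steps expressed as a single groupby call.
one_liner = df.groupby("key")["value"].sum()

print(combined)
print(one_liner)
```

In practice the single `groupby` call is usually all that is needed; the explicit dictionaries are there only to show where the split, the apply, and the combine happen.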
The following diagram demonstrates a simple split-apply-combine process to sum groups of numbers:
This process is very similar to the concepts in MapReduce. In MapReduce, massive data sets that are too big for a single computer are divided into pieces and dispatched to many systems, spreading the load in manageable portions (split). Each system then analyzes its piece of the data and calculates a result (apply). The results are then collected from each system and used for decision making (combine).
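The MapReduce-style flow described above can be imitated on a single machine to make the three stages concrete. The sketch below is only an analogy and does not use any real MapReduce framework: a thread pool stands in for the separate systems, and the helper names `split_into_pieces` and `apply_piece` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_pieces(data, n_pieces):
    """Divide data into at most n_pieces contiguous chunks."""
    # Ceiling division, with a floor of 1 so the range step is never zero.
    size = max(1, -(-len(data) // n_pieces))
    return [data[i:i + size] for i in range(0, len(data), size)]


def apply_piece(chunk):
    """The work each 'system' performs on its own piece: a simple sum."""
    return sum(chunk)


data = list(range(1, 101))

# Split: divide the data into manageable pieces.
chunks = split_into_pieces(data, 4)

# Apply: each worker (a stand-in for a separate system) processes one piece.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(apply_piece, chunks))

# Combine: collect the partial results into a single answer.
total = sum(partials)

print(partials)
print(total)
```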
Split-apply-combine, as implemented in pandas, differs in the scope of the data and processing. In pandas, all of the data is in...