## Defining a methodology

A data scientist has many options in selecting and implementing a classification or clustering algorithm.

Firstly, a mathematical or statistical model has to be selected to extract knowledge from the raw input data or from the output of an upstream data transformation. The selection of the model is constrained by the following parameters:

- Business requirements, such as accuracy of results or computation time
- Availability of training data, algorithms, and libraries
- Access to a domain or subject matter expert, if needed
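
The constraints above can be treated as a filter over candidate models. The following is a minimal sketch, not a prescribed procedure: the `Candidate` type, the `select_model` function, and the model names are hypothetical, and the accuracy and timing figures are made-up placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of one candidate model."""
    name: str
    accuracy: float      # assumed to be measured on a validation set, in [0, 1]
    seconds: float       # assumed wall-clock computation time
    needs_labels: bool   # True if the model requires labeled training data


def select_model(candidates: List[Candidate],
                 min_accuracy: float,
                 max_seconds: float,
                 labels_available: bool) -> Optional[Candidate]:
    """Return the most accurate candidate that meets every constraint, or None."""
    feasible = [
        c for c in candidates
        if c.accuracy >= min_accuracy
        and c.seconds <= max_seconds
        and (labels_available or not c.needs_labels)
    ]
    return max(feasible, key=lambda c: c.accuracy, default=None)


# Placeholder values for illustration only.
candidates = [
    Candidate("model_a", accuracy=0.92, seconds=120.0, needs_labels=True),
    Candidate("model_b", accuracy=0.88, seconds=15.0, needs_labels=True),
    Candidate("model_c", accuracy=0.80, seconds=5.0, needs_labels=False),
]

best = select_model(candidates, min_accuracy=0.85, max_seconds=60.0,
                    labels_available=True)
```

In this sketch the business requirements act as hard limits and accuracy breaks ties among the feasible models. A real project might instead weigh the criteria against each other or involve a domain expert in the final choice.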

Secondly, the engineer has to select a computational and deployment framework suitable for the amount of data to be processed. The computational context is defined by the following parameters:

- Available resources, such as machines, CPU, memory, or I/O bandwidth
- An implementation strategy, such as iterative versus recursive computation or caching
- Requirements for the responsiveness of the overall process, such as duration of computation or display of intermediate results...
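
To make the implementation-strategy choice concrete, here is a small sketch that contrasts a recursive and an iterative version of the same computation and adds a cached helper. The exponential moving average and the squared distance are illustrative stand-ins chosen for brevity, and the function names are hypothetical.

```python
from functools import lru_cache
from typing import Optional, Sequence, Tuple


def ema_recursive(xs: Sequence[float], alpha: float,
                  t: Optional[int] = None) -> float:
    """Exponential moving average written as a recurrence:
    s_0 = x_0 and s_t = alpha * x_t + (1 - alpha) * s_(t-1).
    Assumes xs is non-empty."""
    if t is None:
        t = len(xs) - 1
    if t == 0:
        return xs[0]
    return alpha * xs[t] + (1 - alpha) * ema_recursive(xs, alpha, t - 1)


def ema_iterative(xs: Sequence[float], alpha: float) -> float:
    """Same recurrence evaluated with a loop, using constant stack space.
    Assumes xs is non-empty."""
    s = xs[0]
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * s
    return s


@lru_cache(maxsize=None)
def squared_distance(p: Tuple[float, ...], q: Tuple[float, ...]) -> float:
    """Cached squared Euclidean distance; arguments must be hashable (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))
```

The recursive form mirrors the mathematical definition, but in Python it is bounded by the interpreter's recursion limit (1000 frames by default), so the iterative form is the safer choice for long series. Caching trades memory for time: repeated calls to `squared_distance` with the same pair of points are served from the cache, which can be inspected with `squared_distance.cache_info()`.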