SciPy is organized as a family of modules. We like to think of each module as covering a different field of mathematics, and as such, each has its own particular techniques and tools.
The names of the modules are mostly self-explanatory. For instance, the field of statistics deals with the collection, organization, analysis, interpretation, and presentation of data. The objects that statisticians study are usually represented as arrays of multiple dimensions, and certain operations on these arrays then offer information about the objects they represent (for example, the mean and standard deviation of a dataset). A well-known set of applications is based upon these operations: confidence intervals for the mean, hypothesis testing, or data mining, for instance. When facing any research problem that needs a tool from this branch of mathematics, we access the corresponding functions from the scipy.stats module.
Let us use some of its functions to solve a simple problem.
The following table shows the IQ test scores of 31 individuals:
114 | 100 | 104 |  89 | 102 |  91 | 114 | 114
103 | 105 | 108 | 130 | 120 | 132 | 111 | 128
118 | 119 |  86 |  72 | 111 | 103 |  74 | 112
107 | 103 |  98 |  96 | 112 | 112 |  93
A stem plot of the distribution of these 31 scores shows that there are no major departures from normality, and thus we assume the distribution of the scores to be close to normal. Estimate the mean IQ score for this population, using a 99 percent confidence interval.
We start by loading the data into memory, as follows:
>>> import numpy
>>> scores = numpy.array([114, 100, 104, 89, 102, 91, 114, 114, 103, 105,
...                       108, 130, 120, 132, 111, 128, 118, 119, 86, 72,
...                       111, 103, 74, 112, 107, 103, 98, 96, 112, 112, 93])
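Before leaning on the normality assumption, we can back the visual check of the stem plot with a quick formal test. The following sketch uses the Shapiro-Wilk test from scipy.stats; this particular test is our own choice for illustration, not part of the original session:

```python
import numpy
from scipy import stats

scores = numpy.array([114, 100, 104, 89, 102, 91, 114, 114, 103, 105,
                      108, 130, 120, 132, 111, 128, 118, 119, 86, 72,
                      111, 103, 74, 112, 107, 103, 98, 96, 112, 112, 93])

# Shapiro-Wilk test: the null hypothesis is that the data were drawn
# from a normal distribution. A large p-value means we fail to reject
# normality, which is consistent with the stem plot.
statistic, pvalue = stats.shapiro(scores)
print(statistic, pvalue)
```

A p-value well above a conventional threshold such as 0.05 would support treating the scores as approximately normal.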
At this point, if we type scores followed by a dot (.) and press the Tab key, the system offers all the methods the data inherits from the NumPy library, as is customary in Python. Technically, we could already compute the required mean, xmean, and the corresponding confidence interval according to the formula xmean ± zcrit * sigma / sqrt(n), where sigma and n are, respectively, the standard deviation and the size of the data, and zcrit is the critical value corresponding to the desired confidence level. In this case, we could look up a table in any statistics book to obtain a crude approximation of its value, zcrit = 2.576. The remaining values may be computed in our session and properly combined, as follows:
>>> xmean = numpy.mean(scores)
>>> sigma = numpy.std(scores)
>>> n = numpy.size(scores)
>>> xmean, xmean - 2.576*sigma/numpy.sqrt(n), \
...        xmean + 2.576*sigma/numpy.sqrt(n)
(105.83870967741936, 99.343223715529746, 112.33419563930897)

We have thus computed the estimated mean IQ score (with value 105.83870967741936) and a 99 percent confidence interval (from about 99.34 to about 112.33). We have done so using purely NumPy-based operations, while following a known formula. But instead of making all these computations by hand and looking up critical values in tables, we could directly ask SciPy for assistance.
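Even the critical value itself need not come from a printed table. As a sketch of this idea (not part of the original session), the percent point function of the standard normal distribution in scipy.stats returns it directly:

```python
from scipy import stats

# For a 99% two-sided interval, we need the 0.995 quantile of the
# standard normal distribution (0.5% probability in each tail).
zcrit = stats.norm.ppf(0.995)
print(zcrit)  # about 2.5758, sharper than the rounded table value 2.576
```

This replaces the crude table lookup with the exact quantile, to full floating-point precision.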
Note how the scipy.stats module needs to be loaded before we use any of its functions, or request any help on them:
>>> from scipy import stats
>>> result = stats.bayes_mvs(scores, alpha=0.99)
The variable result contains the solution of our problem, plus some additional information. Note first that result is a tuple with three entries, holding estimates for the mean, the variance, and the standard deviation; each entry consists of the center of the estimate together with a confidence interval, as the help documentation explains:

>>> help(stats.bayes_mvs)
The solution to our problem is then the first entry of the tuple result. To show the contents of this entry, we request it as usual:
>>> result[0]
(105.83870967741936, (98.789863768428674, 112.88755558641004))
Note how this output gives us the same mean, but a slightly wider confidence interval. This is, of course, more accurate than the one we computed in the previous steps: it is based on the Student's t distribution for this sample, rather than on a normal approximation with a critical value rounded from a table.
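The interval that bayes_mvs reports for the mean can be cross-checked against a classical Student's t interval. The following sketch is our own verification, not part of the original session:

```python
import numpy
from scipy import stats

scores = numpy.array([114, 100, 104, 89, 102, 91, 114, 114, 103, 105,
                      108, 130, 120, 132, 111, 128, 118, 119, 86, 72,
                      111, 103, 74, 112, 107, 103, 98, 96, 112, 112, 93])

n = scores.size
xmean = scores.mean()
# Standard error of the mean, using the sample standard deviation (ddof=1).
se = scores.std(ddof=1) / numpy.sqrt(n)

# 99% two-sided interval from the t distribution with n - 1 degrees of
# freedom, centered at the sample mean.
low, high = stats.t.interval(0.99, n - 1, loc=xmean, scale=se)
print(low, high)  # about 98.79 and 112.89, matching result[0] above
```

The agreement is expected: for the mean, the interval produced by bayes_mvs coincides with the t-based confidence interval.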