pandas is by far the preferred data preprocessing module in Python. The way it handles data is very similar to R. Its data frame not only gives you visually appealing printouts of tables, but also lets you access data in a more intuitive way. If you are not familiar with R, think of working with spreadsheet software such as Microsoft Excel, or with SQL tables, but programmatically. That covers much of what pandas does.
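As a rough illustration of the spreadsheet and SQL analogy, the following minimal sketch builds a small data frame and accesses it by column label and by condition. The column names and values are made up for this example and do not come from the book's datasets.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [34, 27, 45],
    "city": ["Oslo", "Lima", "Pune"],
})

# Printing a data frame gives an aligned, labeled table
print(df)

# Select a column by its label, much like picking a spreadsheet column
ages = df["age"]

# Filter rows with a boolean condition, similar to a SQL WHERE clause
older = df[df["age"] > 30]

# Aggregate a column, similar to a spreadsheet AVERAGE formula
mean_age = df["age"].mean()
print(older)
print(mean_age)
```

Label-based selection and boolean filtering are the two access patterns you are likely to use most often.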
You can download and install pandas from its official site at http://pandas.pydata.org/. A better way is to use pip, or to install a scientific Python distribution such as Anaconda.
Remember how we used numpy.genfromtxt() to read the csv data in Chapter 4, NumPy Core and Libs Submodules? Using pandas to read tables and then passing the preprocessed data to an ndarray is a better workflow for analytics (simply calling np.array(data_frame) converts a data frame into a multidimensional ndarray). In this section, we are going to...
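The read-then-convert workflow described above can be sketched as follows. This is a minimal example under stated assumptions: the CSV content is an in-memory string invented for illustration (so the snippet runs without a file on disk), and the column names x, y, and label are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path to read_csv
csv_text = io.StringIO("x,y,label\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n")

# Read the table into a data frame; the first row is used as column labels
data_frame = pd.read_csv(csv_text)

# Preprocess with pandas, here by keeping only the numeric feature columns
features = data_frame[["x", "y"]]

# Convert the data frame into a two-dimensional ndarray
arr = np.array(features)
print(arr.shape)
print(arr.dtype)
```

Recent pandas versions also provide DataFrame.to_numpy(), which performs the same conversion and is worth considering in place of np.array(data_frame).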