Data Science Algorithms in a Week - Second Edition

By: David Natingga

Overview of this book

Machine learning applications are highly automated and self-modifying, and they continue to improve over time with minimal human intervention as they learn from the data they are trained on. To address the complex nature of various real-world data problems, specialized machine learning algorithms have been developed. Through algorithmic and statistical analysis, these algorithms can be used to gain new knowledge from existing data. Data Science Algorithms in a Week addresses problems related to accurate and efficient data classification and prediction. Over the course of seven days, you will be introduced to seven algorithms, along with exercises that will help you understand different aspects of machine learning. You will see how to pre-cluster your data to optimize and classify it for large datasets. This book also guides you in predicting data based on existing trends in your dataset. This book covers algorithms such as k-nearest neighbors, Naive Bayes, decision trees, random forest, k-means, regression, and time-series analysis. By the end of this book, you will understand how to choose machine learning algorithms for clustering, classification, and regression, and know which is best suited for your problem.


Data science is a discipline at the intersection of machine learning, statistics, and data mining, with the objective of gaining new knowledge from existing data by means of algorithmic and statistical analysis. In this book, you will learn the seven most important ways of analyzing data in data science. Each chapter first explains its algorithm or analysis as a simple concept, supported by a trivial example. Further examples and exercises are used to build and expand your knowledge of a particular type of analysis.

Who this book is for

This book is for aspiring data science professionals who are familiar with Python and have some background in statistics. Developers who are currently implementing one or two data science algorithms and who now want to learn more to expand their skillset will find this book quite useful.

What this book covers

Chapter 1, Classification Using K-Nearest Neighbors, classifies a data item based on the most similar k items.

Chapter 2, Naive Bayes, delves into Bayes' Theorem with a view to computing the probability of a data item belonging to a certain class.

Chapter 3, Decision Trees, organizes your decision criteria into the branches of a tree, and uses a decision tree to classify a data item into one of the classes at the leaf node.

Chapter 4, Random Forests, classifies a data item with an ensemble of decision trees to improve the accuracy of the algorithm by reducing the negative impact of the bias.

Chapter 5, Clustering into K Clusters, divides your data into k clusters to discover the patterns and similarities between the data items and goes into how to exploit these patterns to classify new data.

Chapter 6, Regression, models phenomena in your data by using a function that can predict the values of the unknown data in a simple way.

Chapter 7, Time-Series Analysis, unveils the trends and repeating patterns in time-dependent data to predict the future of the stock market, Bitcoin prices, and other time-dependent events.

Appendix A, Python Reference, is a reference of the basic Python language constructs, commands, and functions used throughout the book.

Appendix B, Statistics, provides a summary of the statistical methods and tools that are useful to a data scientist.

Appendix C, Glossary of Algorithms and Methods in Data Science, provides a glossary of some of the most important and powerful algorithms and methods from the fields of data science and machine learning.
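
To give a flavor of the kind of analysis summarized above for Chapter 1, here is a minimal sketch of k-nearest neighbors classification. It is not the book's own implementation: the function name knn_classify, the toy temperature and wind data, and the choice of Euclidean distance with a simple majority vote are all assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training items.

    `training` is a list of (point, label) pairs (a hypothetical format
    chosen for this sketch, not taken from the book).
    """
    # Order the training items from nearest to farthest from the query.
    by_distance = sorted(training, key=lambda item: euclidean(item[0], query))
    # Tally the labels of the k closest items and pick the most frequent.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (temperature in C, wind speed in km/h) -> label.
training = [
    ((10, 0), "cold"),
    ((12, 5), "cold"),
    ((15, 10), "cold"),
    ((25, 0), "warm"),
    ((28, 5), "warm"),
    ((30, 10), "warm"),
]

print(knn_classify(training, (27, 3), k=3))  # prints "warm"
```

An odd k is used here so that a vote between two classes cannot tie; with more classes or an even k, a tie-breaking rule would be needed.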

To get the most out of this book

To get the most out of this book, you need, first and foremost, an active attitude toward working through the problems: a lot of new content is presented in the exercises at the end of each chapter, in the section entitled Problems. You also need to be able to run Python programs on the operating system of your choice. The author ran the programs on the Linux operating system using the command line.

Download the example code files

You can download the example code files for this book from your account on the Packt website. If you purchased this book elsewhere, you can register on the support page to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register on the Packt website.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub. If the code is updated, the update will appear in the existing GitHub repository.

We also have other code bundles available from our rich catalog of books and videos. Check them out!

Download the color images

We also provide a downloadable PDF file that has color images of the screenshots/diagrams used in this book.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

def dic_key_count(dic, key):
    if key is None:
        return 0
    if dic.get(key, None) is None:
        return 0
    return int(dic[key])

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

def construct_general_tree(verbose, heading, complete_data,
                           enquired_column, m):
    available_columns = []
    for col in range(0, len(heading)):
        if col != enquired_column:

Any command-line input or output is written as follows:

$ python chess.csv

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Select System info from the Administration panel."


Warnings or important notes appear like this.


Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit the Packt website, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit the Packt website.


Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit the Packt website.