Principles of Data Science

Principles of Data Science - Second Edition

By: Sinan Ozdemir, Kakade, Tibaldeschi
Overview of this book

Need to turn programming skills into effective data science skills? This book helps you connect mathematics, programming, and business analysis. You’ll feel confident asking—and answering—complex, sophisticated questions of your data, making abstract and raw statistics into actionable ideas. Going through the data science pipeline, you'll clean and prepare data and learn effective data mining strategies and techniques to gain a comprehensive view of how the data science puzzle fits together. You’ll learn fundamentals of computational mathematics and statistics and pseudo-code used by data scientists and analysts. You’ll learn machine learning, discovering statistical models that help control and navigate even the densest datasets, and learn powerful visualizations that communicate what your data means.
Why Python?

We will use Python for a variety of reasons, listed as follows:

  • Python is an extremely simple language to read and write, even if you've never coded before, which will make the examples easy to understand and to reread later on, even after you have finished this book.
  • It is one of the most common languages, both in production and in the academic setting (one of the fastest growing, as a matter of fact).
  • The language's online community is vast and friendly. This means that a quick search for the solution to a problem should turn up many people who have faced and solved similar (if not exactly the same) situations.
  • Python has prebuilt data science modules that both the novice and the veteran data scientist can utilize.

The last point is probably the biggest reason we will focus on Python. These prebuilt modules are not only powerful, but also easy to pick up. By the end of the first few chapters, you will be very comfortable with these modules. Some of these modules include the following:

  • pandas
  • scikit-learn
  • seaborn
  • numpy/scipy
  • requests (to mine data from the web)
  • BeautifulSoup (for web–HTML parsing)

Python practices

Before we move on, it is important to formalize many of the requisite coding skills in Python.

In Python, we have variables that are placeholders for objects. We will focus on just a few types of basic objects at first, as shown in the following table:

| Object Type | Example |
|---|---|
| int (an integer) | 3, 6, 99, -34, 34, 11111111 |
| float (a decimal) | 3.14159, 2.71, -0.34567 |
| boolean (either True or False) | The statement "Sunday is a weekend" is True; the statement "Friday is a weekend" is False; the statement "pi is exactly the ratio of a circle's circumference to its diameter" is True (crazy, right?) |
| string (text or words made up of characters) | "I love hamburgers" (by the way, who doesn't?); "Matt is awesome"; a tweet is a string |
| list (a collection of objects) | [1, 5.4, True, "apple"] |
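As a minimal sketch of the object types in the table, the built-in type function reports what kind of object a variable holds. The variable names below are hypothetical and chosen only for illustration.

```python
# Hypothetical variable names, chosen for illustration only.
count = 99                        # int
ratio = 3.14159                   # float
is_weekend = True                 # boolean (Python names this type "bool")
message = "I love hamburgers"     # string (Python names this type "str")
mixed = [1, 5.4, True, "apple"]   # list holding several object types

# Print the type name of each object next to its value.
for value in (count, ratio, is_weekend, message, mixed):
    print(type(value).__name__, value)
```

Note that Python abbreviates two of the type names: a boolean is reported as bool and a string as str.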

We will also have to understand some basic logical (comparison) operators. For these operators, keep the Boolean datatype in mind. Every operator will evaluate to either True or False. Let's take a look at the following operators:

| Operator | Example |
|---|---|
| == | Evaluates to True if both sides are equal; otherwise, it evaluates to False: 3 + 4 == 7 (will evaluate to True); 3 - 2 == 7 (will evaluate to False) |
| < (less than) | 3 < 5 (True); 5 < 3 (False) |
| <= (less than or equal to) | 3 <= 3 (True); 5 <= 3 (False) |
| > (greater than) | 3 > 5 (False); 5 > 3 (True) |
| >= (greater than or equal to) | 3 >= 3 (True); 5 >= 7 (False) |
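The comparisons listed above can be checked directly in Python. The following is a small sketch; the dictionary name comparisons is an assumption made for this illustration, and each value is simply the Boolean that Python computes for the expression named in its key.

```python
# Each key is the expression as text; each value is what Python evaluates it to.
comparisons = {
    "3 + 4 == 7": 3 + 4 == 7,
    "3 - 2 == 7": 3 - 2 == 7,
    "3 < 5": 3 < 5,
    "5 < 3": 5 < 3,
    "3 <= 3": 3 <= 3,
    "5 <= 3": 5 <= 3,
    "3 > 5": 3 > 5,
    "5 > 3": 5 > 3,
    "3 >= 3": 3 >= 3,
    "5 >= 7": 5 >= 7,
}

for expression, result in comparisons.items():
    print(expression, "->", result)  # every result is either True or False
```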

When coding in Python, I will use a pound sign (#) to create a "comment," which will not be processed as code, but is merely there to communicate with the reader. Anything to the right of a # sign is a comment on the code being executed.

Example of basic Python

In Python, we use indentation (spaces or tabs) to show that a line of code belongs to the line above it.

Note

A print statement belongs to an if x + y == 15.3: line when it is indented directly beneath that line. This means that the print statement will be executed if, and only if, x + y equals 15.3.
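A minimal sketch of this indentation rule follows. It is a stand-in rather than the book's own listing: it assumes hypothetical integer values and the condition x + y == 15 instead of 15.3, so that floating-point rounding cannot affect the comparison. The variable reached is likewise an illustrative addition.

```python
x = 5    # hypothetical values, not taken from the book
y = 10

reached = False              # records whether the indented lines ran

if x + y == 15:              # the condition to check
    print("True!")           # indented: runs only if the condition is True
    reached = True           # indented: also belongs to the if statement

print("Finished checking")   # not indented: runs regardless of the condition
```

Changing y to any other value means the two indented lines are skipped, while the final print still runs.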

Note that the following list variable, my_list, can hold multiple types of objects. This one has an int, a float, a boolean, and a string (in that order):

my_list = [1, 5.7, True, "apples"]

len(my_list) == 4    # 4 objects in the list
my_list[0] == 1      # the first object
my_list[1] == 5.7    # the second object

In the preceding code, I used the len command to get the length of the list (which was 4). Also, note the zero-indexing of Python. Most computer languages start counting at zero instead of one. So if I want the first element, I call index 0, and if I want the 95th element, I call index 94.
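To make the zero-indexing point concrete, here is a short sketch that assumes a hypothetical list of the integers 1 through 100, so that each element's value matches its position when counting from one.

```python
numbers = list(range(1, 101))      # hypothetical list: the integers 1 through 100

first = numbers[0]                 # index 0 holds the 1st element
ninety_fifth = numbers[94]         # index 94 holds the 95th element
last = numbers[len(numbers) - 1]   # the final index is always the length minus one

print(first, ninety_fifth, last)   # prints: 1 95 100
```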

Example – parsing a single tweet

Here is some more Python code. In this example, I will be parsing a tweet about stock prices (one of the important case studies in this book will be trying to predict market movements based on popular sentiment regarding stocks on social media):

tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"

words_in_tweet = tweet.split(' ')           # list of words in tweet

for word in words_in_tweet:                 # for each word in list
    if "$" in word:                         # if word has a "cashtag"
        print("THIS TWEET IS ABOUT", word)  # alert the user

I will point out a few things about this code snippet line by line, as follows:

  • First, we set a variable to hold some text (known as a string in Python). In this example, the tweet in question is "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL".
  • The words_in_tweet variable tokenizes the tweet (separates it by word). If you were to print this variable, you would see the following:
    ['RT', 
    '@j_o_n_dnger:', 
    '$TWTR', 
    'now', 
    'top', 
    'holding', 
    'for', 
    'Andor,', 
    'unseating',
    '$AAPL']
  • We iterate through this list of words; this is called a for loop. It just means that we go through a list one by one.
  • Here, we have an if statement. For each word in the tweet, it checks whether the word contains the $ character, which marks a stock ticker on Twitter (a cashtag).
  • If the preceding if statement is True (that is, if the word contains a cashtag), print the word and show it to the user.

The output of this code will be as follows:

THIS TWEET IS ABOUT $TWTR
THIS TWEET IS ABOUT $AAPL

We get this output as these are the only words in the tweet that use the cashtag. Whenever I use Python in this book, I will ensure that I am as explicit as possible about what I am doing in each line of code.
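If you want to keep the cashtags for later use rather than only printing them, one possible variation is to collect them in a list. This is a sketch under my own assumptions: the function name find_cashtags is hypothetical and does not come from the book, and it applies the same "$" test as the loop above.

```python
def find_cashtags(tweet):
    """Return the words in a tweet that contain a "$" character."""
    cashtags = []
    for word in tweet.split(' '):    # same tokenization as before
        if "$" in word:              # same cashtag test as before
            cashtags.append(word)    # store the word instead of printing it
    return cashtags


tweet = "RT @j_o_n_dnger: $TWTR now top holding for Andor, unseating $AAPL"
print(find_cashtags(tweet))          # prints: ['$TWTR', '$AAPL']
```

Returning a list makes it easy to count the cashtags or pass them to another step, at the cost of a few more lines than the original loop.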

Domain knowledge

As I mentioned earlier, domain knowledge means knowing the particular topic you are working on. For example, if you are a financial analyst working on stock market data, you have a lot of domain knowledge. If you are a journalist looking at worldwide adoption rates, you might benefit from consulting an expert in the field. This book will attempt to show examples from several problem domains, including medicine, marketing, finance, and even UFO sightings!

Does this mean that if you're not a doctor, you can't work with medical data? Of course not! Great data scientists can apply their skills to any area, even if they aren't fluent in it. Data scientists can adapt to the field and contribute meaningfully when their analysis is complete.

A big part of domain knowledge is presentation. Depending on your audience, how you present your findings can matter greatly. Your results are only as good as your vehicle of communication. You can predict the movement of the market with 99.99% accuracy, but if your program is impossible to execute, your results will go unused. Likewise, if your vehicle is inappropriate for the field, your results will go equally unused.
