Book Image

Hands-On Data Analysis with Pandas - Second Edition

By : Stefanie Molin
Book Image

Hands-On Data Analysis with Pandas - Second Edition

By: Stefanie Molin

Overview of this book

Extracting valuable business insights is no longer a ‘nice-to-have’, but an essential skill for anyone who handles data in their enterprise. Hands-On Data Analysis with Pandas is here to help beginners and those who are migrating their skills into data science get up to speed in no time. This book will show you how to analyze your data, get started with machine learning, and work effectively with the Python libraries often used for data science, such as pandas, NumPy, matplotlib, seaborn, and scikit-learn. Using real-world datasets, you will learn how to use the pandas library to perform data wrangling to reshape, clean, and aggregate your data. Then, you will learn how to conduct exploratory data analysis by calculating summary statistics and visualizing the data to find patterns. In the concluding chapters, you will explore some applications of anomaly detection, regression, clustering, and classification using scikit-learn to make predictions based on past data. This updated edition will equip you with the skills you need to use pandas 1.x to efficiently perform various data manipulation tasks, reliably reproduce analyses, and visualize your data for effective decision making – valuable knowledge that can be applied across multiple domains.
Table of Contents (21 chapters)
1
Section 1: Getting Started with Pandas
4
Section 2: Using Pandas for Data Analysis
9
Section 3: Applications – Real-World Analyses Using Pandas
12
Section 4: Introduction to Machine Learning with Scikit-Learn
16
Section 5: Additional Resources
18
Solutions

Exercises

Complete the following exercises to practice the concepts covered in this chapter:

  1. Run the simulation for December 2018 into new log files without making the user base again. Be sure to run python3 simulate.py -h to review the command-line arguments. Set the seed to 27. This data will be used for the remaining exercises.
  2. Find the number of unique usernames, attempts, successes, and failures, as well as the success/failure rates per IP address, using the data simulated from exercise 1.
  3. Create two subplots with failures versus attempts on the left, and failure rate versus distinct usernames on the right. Draw decision boundaries for the resulting plots. Be sure to color each data point by whether or not it is a hacker IP address.
  4. Build a rule-based criteria using the percentage difference from the median that flags an IP address if the failures and attempts are both five times their respective medians, or if the distinct usernames count is five times its...