Data Analysis with Python

By David Taieb
Overview of this book

Data Analysis with Python offers a modern approach to data analysis so that you can work with the latest and most powerful Python tools, AI techniques, and open source libraries. Industry expert David Taieb shows you how to bridge data science with the power of programming and algorithms in Python. You'll be working with complex algorithms and cutting-edge AI in your data analysis. Learn how to analyze data with hands-on examples using Python-based tools and Jupyter Notebook. You'll find the right balance of theory and practice, with extensive code files that you can integrate right into your own data projects. Explore the power of this approach to data analysis by working with it across key industry case studies. Four fascinating, full projects connect you to the most critical data analysis challenges you're likely to meet today. The first of these is an image recognition application with TensorFlow, embracing the importance of AI in data analysis today. The second industry project analyzes social media trends, exploring big data issues and AI approaches to natural language processing. The third case study is a financial portfolio analysis application that engages you with time series analysis, pivotal to many data science applications today. The fourth industry use case dives into graph algorithms and the power of programming in modern data science. You'll wrap up with a thoughtful look at the future of data science and how it will harness the power of algorithms and artificial intelligence.

Preface

 

"Developers are the most-important, most-valuable constituency in business today, regardless of industry."

 
 -- Stephen O'Grady, author of The New Kingmakers

First, let me thank and congratulate you, the reader, for deciding to invest some of your valuable time in this book. Throughout the chapters to come, I will take you on a journey of discovering, or even re-discovering, data science from the perspective of a developer, and I will develop the theme of this book: data science is a team sport, and if it is to be successful, developers will have to play a bigger role in the near future and collaborate better with data scientists. However, to make data science more inclusive to people of all backgrounds and trades, we first need to democratize it by making data simple and accessible. That is, in essence, what this book is about.

Why am I writing this book?

As I'll explain in more detail in Chapter 1, Programming and Data Science – A New Toolset, I am first and foremost a developer, with over 20 years' experience of building software components of a diverse nature: frontend, backend, middleware, and so on. Reflecting back on this time, I realize how much getting the algorithms right always came first in my mind; data was always somebody else's problem. I rarely had to analyze it or extract insight from it. At best, I was designing the right data structure to load it in a way that would make my algorithm run more efficiently and the code more elegant and reusable.

However, as the artificial intelligence and data science revolution got underway, it became obvious to me that developers like myself needed to get involved, and so 7 years ago, in 2011, I jumped at the opportunity to become the lead architect for the IBM Watson core platform UI & Tooling. Of course, I don't pretend to have become an expert in machine learning or NLP; far from it. Learning through practice is not a substitute for a formal academic background.

However, a big part of what I want to demonstrate in this book is that, with the right tools and approach, someone equipped with the right mathematical foundations (I'm really only talking about high-school-level calculus concepts) can quickly become a good practitioner in the field. A key ingredient of success is to simplify, as much as possible, the different steps of building a data pipeline: from acquiring, loading, and cleaning the data, to visualizing and exploring it, all the way to building and deploying machine learning models.

It was with an eye to furthering this idea of making data simple and accessible to a community beyond data scientists that, 3 years ago, I took on a leading role on the IBM Watson Data Platform team, with the mission of expanding the community of developers working with data, with a special focus on education and activism on their behalf. During that time, as the lead developer advocate, I started to talk openly about the need for developers and data scientists to collaborate better in solving complex data problems.

Note

During discussions at conferences and meetups, I would sometimes get into trouble with data scientists who would get upset because they interpreted my narrative as me saying that data scientists are not good software developers. I want to set the record straight, including with you, the data scientist reader, that this is far from the case.

The majority of data scientists are excellent software developers with a comprehensive knowledge of computer science concepts. However, their main objective is to solve complex data problems, which requires rapid, iterative experimentation to try new things, not writing elegant, reusable components.

But I didn't want to only talk the talk; I also wanted to walk the walk, and so I started the PixieDust open source project as my humble contribution to solving this important problem. As the PixieDust work progressed nicely, the narrative became crisper and easier to understand, with concrete example applications that developers and data scientists alike could get excited about.

When I was presented with the opportunity to write a book about this story, I hesitated for a long time before embarking on the adventure, mainly for two reasons:

  • I have written extensively in blogs, articles, and tutorials about my experience as a data science practitioner with Jupyter Notebooks. I also have extensive experience as a speaker and workshop moderator at a variety of conferences. One good example is the keynote speech I gave at ODSC London in 2017, titled The Future of Data Science: Less Game of Thrones, More Alliances (https://odsc.com/training/portfolio/future-data-science-less-game-thrones-alliances). However, I had never written a book before and had no idea how big a commitment it would be, even though I was warned many times by friends who had authored books before.

  • I wanted this book to be inclusive and target equally the developer, the data scientist, and the line of business user, but I was struggling to find the right content and tone to achieve that goal.

In the end, the decision to embark on this adventure came pretty easily. Having worked on the PixieDust project for 2 years, I felt we had made terrific progress, with very interesting innovations that generated lots of interest in the open-source community, and that writing a book would nicely complement our advocacy work of helping developers get involved in data science.

As a side note, to the reader who is thinking about writing a book and who has similar concerns, I can only advise on the first one with a big "Yes, go for it." For sure, it is a big commitment that requires a substantial amount of sacrifice, but provided that you have a good story to tell with solid content, it is really worth the effort.

Who this book is for

This book will serve the budding data scientist and developer with an interest in developing their skills, as well as anyone wishing to become a professional data scientist. With its introduction to PixieDust from its creator, the book will also be a great desk companion for the already accomplished data scientist.

No matter the individual's level of interest, the clear, easy-to-read text and real-life scenarios will suit anyone with a general interest in the area, since they get to play with Python code running in Jupyter Notebooks.

Only a modicum of HTML and CSS is required to produce a functioning PixieDust dashboard. Fluency in data interpretation and visualization is also helpful, since this book addresses data professionals such as business and general data analysts; the later chapters also have much to offer them.

What this book covers

The book contains two logical parts of roughly equal length. In the first half, I lay down the theme of the book, which is the need to bridge the gap between data science and engineering, including in-depth details about the Jupyter + PixieDust solution I'm proposing. The second half is dedicated to applying what we learned in the first half to four industry use cases.

In Chapter 1, Programming and Data Science – A New Toolset, I attempt to provide a definition of data science through the prism of my own experience of building a data pipeline that performs sentiment analysis on Twitter posts. I defend the idea that it is a team sport and that, most often, silos exist between the data science and engineering teams that cause unnecessary friction, inefficiencies, and, ultimately, a failure to realize its full potential. I also argue the point of view that data science is here to stay and that, eventually, it will become an integral part of what is known today as computer science (I like to think that someday new terms will emerge, such as computer data science, that better capture this duality).

In Chapter 2, Python and Jupyter Notebooks to Power your Data Analysis, I start diving into popular data science tools such as Python and its ecosystem of open-source libraries dedicated to data science, and of course Jupyter Notebooks. I explain why I think Jupyter Notebooks will become the big winner in the next few years. I also introduce the capabilities of the PixieDust open-source library, starting with the simple display() method that lets the user visually explore data in an interactive user interface by building compelling charts. With this API, the user can choose from multiple rendering engines such as Matplotlib, Bokeh, Seaborn, and Mapbox. The display() capability was the only feature in the PixieDust MVP (minimum viable product) but, over time, as I interacted with a lot of data science practitioners, I added new features to what would quickly become the PixieDust toolbox (a minimal usage sketch follows the list below):

  • sampleData(): A simple API for easily loading data into pandas and Apache Spark DataFrames

  • wrangle_data(): A simple API for cleaning and massaging datasets. This capability includes the ability to destructure columns into new columns using regular expressions to extract content from unstructured text. The wrangle_data() API can also make recommendations based on predefined patterns.

  • PackageManager: Lets the user install third-party Apache Spark packages inside a Python Notebook.

  • Scala Bridge: Enables the user to run Scala code inside a Python Notebook. Variables defined on the Python side are accessible in Scala, and vice versa.

  • Spark Job Progress Monitor: Lets you track the status of your Spark Job with a real-time progress bar that displays directly in the output cell of the code being executed.

  • PixieApp: Provides a programming model centered around HTML/CSS that lets developers build sophisticated dashboards to operationalize the analytics built in the Notebook. PixieApps can run directly in the Jupyter Notebook or be deployed as analytic web applications using the PixieGateway microservice. PixieGateway is an open-source companion project to PixieDust.
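
For illustration, here is a minimal sketch, assuming a Jupyter Notebook with PixieDust installed, of the sampleData() and display() entry points mentioned above (the dataset number refers to one of PixieDust's built-in sample datasets; any pandas DataFrame works too):

import pixiedust

# Load one of PixieDust's built-in sample datasets into a pandas DataFrame
df = pixiedust.sampleData(6)

# Open the interactive visualization UI (table view, charts, rendering engines)
display(df)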

The following diagram summarizes the PixieDust development journey, including recent additions such as the PixieGateway and the PixieDebugger, which is the first visual Python debugger for Jupyter Notebooks:

PixieDust journey

One key message to take away from this chapter is that PixieDust is first and foremost an open-source project that lives and breathes through the contributions of the developer community. As is the case for countless open-source projects, we can expect many more breakthrough features to be added to PixieDust over time.

In Chapter 3, Accelerate your Data Analysis with Python Libraries, I take the reader through a deep dive into the PixieApp programming model, illustrating each concept along the way with a sample application that analyzes GitHub data. I start with a high-level description of the anatomy of a PixieApp, including its life cycle and the execution flow built around the concept of routes. I then go over the details of how developers can use regular HTML and CSS snippets to build the UI of the dashboard, seamlessly interacting with the analytics and leveraging the PixieDust display() API to add sophisticated charts.

The PixieApp programming model is the cornerstone of the tooling strategy for bridging the gap between data science and engineering, as it streamlines the process of operationalizing the analytics, thereby increasing collaboration between data scientists and developers and reducing the time-to-market of the application.
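
To give a taste of what this looks like, here is a minimal, self-contained PixieApp sketch (the class name and route state are illustrative; run it in a Jupyter Notebook cell):

from pixiedust.display.app import *

@PixieApp
class HelloPixieApp:
    @route()
    def main_screen(self):
        # Default route: returns the HTML fragment for the initial screen
        return """<button type="submit" pd_options="clicked=true">Click me</button>"""

    @route(clicked="true")
    def clicked_screen(self):
        # Invoked when the button sets the 'clicked' state
        return "<div>Hello from a PixieApp route!</div>"

HelloPixieApp().run()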

In Chapter 4, Publish your Data Analysis to the Web - the PixieApp Tool, I discuss the PixieGateway microservice, which enables developers to publish PixieApps as analytical web applications. I start by showing how to quickly deploy a PixieGateway microservice instance, both locally and on the cloud as a Kubernetes container. I then go over the PixieGateway admin console capabilities, including the various configuration profiles and how to live-monitor the deployed PixieApp instances and the associated backend Python kernels. I also feature the chart-sharing capability of the PixieGateway, which lets the user turn a chart created with the PixieDust display() API into a web page accessible by anyone on the team.
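
As a pointer, a local PixieGateway instance can be started with a couple of commands (the port number below is just an example; see the PixieGateway documentation for the full set of options):

pip install pixiegateway
jupyter pixiegateway --port 8899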

The PixieGateway is a ground-breaking innovation with the potential to seriously speed up the operationalization of analytics, which is sorely needed today to fully capitalize on the promise of data science. It represents an open-source alternative to similar products that already exist on the market, such as Shiny Server from RStudio (https://shiny.rstudio.com/deploy) and Dash from Plotly (https://dash.plot.ly).

In Chapter 5, Python and PixieDust Best Practices and Advanced Concepts, I complete the deep dive into the PixieDust toolbox by going over advanced concepts of the PixieApp programming model:

  • The @captureOutput decorator: By default, PixieApp routes require developers to provide an HTML fragment that will be injected into the application UI. This is a problem when we want to call a third-party Python library that is not aware of the PixieApp architecture and generates its output directly to the Notebook. @captureOutput solves this problem by automatically redirecting the content generated by the third-party Python library and encapsulating it into a proper HTML fragment (a minimal sketch follows this list).

  • Leveraging Python class inheritance for greater modularity and code reuse: Breaks down the PixieApp code into logical classes that can be composed together using the Python class inheritance capability. I also show how to call an external PixieApp using the pd_app custom attribute.

  • PixieDust support for streaming data: Shows how PixieDust display() and PixieApp can also handle streaming data.

  • Implementing dashboard drill-down with PixieApp events: Provides a mechanism for letting PixieApp components publish and subscribe to events generated when the user interacts with the UI (for example, charts and buttons).

  • Building a custom display renderer for the PixieDust display() API: Walks through the code of a simple renderer that extends the PixieDust menus. This renderer displays a custom HTML table showing the selected data.

  • Debugging techniques: Goes over the various debugging techniques that PixieDust offers, including the visual Python debugger called PixieDebugger and the %%PixiedustLog magic for displaying Python logging messages.

  • The ability to run Node.js code: Discusses the pixiedust_node extension, which manages the life cycle of a Node.js process responsible for executing arbitrary Node.js scripts directly from within the Python Notebook.
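
To illustrate the first item, here is a minimal sketch, assuming matplotlib as the third-party library, of a route decorated with @captureOutput (the class name and data are illustrative):

from pixiedust.display.app import *
import matplotlib.pyplot as plt

@PixieApp
class CaptureApp:
    @route()
    @captureOutput
    def main_screen(self):
        # matplotlib writes directly to the cell output; @captureOutput
        # redirects that output into an HTML fragment for the PixieApp UI
        plt.plot([1, 2, 3, 4], [10, 20, 15, 25])
        plt.show()

CaptureApp().run()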

Thanks to the open-source model, with its transparent development process and a growing community of users who provided valuable feedback, we were able to prioritize and implement a lot of these advanced features over time. The key point I'm trying to make is that following an open-source model with an appropriate license (PixieDust uses the Apache 2.0 license, available at https://www.apache.org/licenses/LICENSE-2.0) does work very well. It helped us grow the community of users, which in turn provided us with the feedback needed to prioritize the new features we knew were high value, and which in some instances contributed code in the form of GitHub pull requests.

In Chapter 6, Analytics Study: AI and Image Recognition with TensorFlow, I dive into the first of the four industry use cases. I start with a high-level introduction to machine learning, followed by an introduction to deep learning, a subfield of machine learning, and the TensorFlow framework, which makes it easier to build neural network models. I then proceed to build an image recognition sample application, including the associated PixieApp, in four parts:

  • Part 1: Builds an image recognition TensorFlow model by using the pretrained ImageNet model. Using the TensorFlow for poets tutorial, I show how to build analytics to load and score a neural network model (a minimal scoring sketch follows this list).

  • Part 2: Creates a PixieApp that operationalizes the analytics created in Part 1. This PixieApp scrapes the images from a web page URL provided by the user, scores them against the TensorFlow model, and then graphically shows the results.

  • Part 3: Shows how to integrate the TensorBoard Graph Visualization component directly in the Notebook, providing the ability to debug the neural network model.

  • Part 4: Shows how to retrain the model with custom training data and update the PixieApp to show the results from both models.
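
As a flavor of Part 1, here is a minimal sketch, not the book's code, of scoring an image against a pretrained ImageNet model using the Keras API bundled with TensorFlow (sample.jpg is a placeholder path):

import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Load a network pretrained on ImageNet
model = MobileNetV2(weights="imagenet")

# Load and preprocess a local image (placeholder path)
img = image.load_img("sample.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Score the image and print the top-3 ImageNet labels
print(decode_predictions(model.predict(x), top=3)[0])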

I decided to start the series of sample applications with deep learning image recognition using TensorFlow because it's an important use case that is growing in popularity, and demonstrating how we can build the models and deploy them in an application within the same Notebook makes a powerful statement in support of the theme of bridging the gap between data science and engineering.

In Chapter 7, Analytics Study: NLP and Big Data with Twitter Sentiment Analysis, I talk about doing natural language processing at Twitter scale. In this chapter, I show how to use the IBM Watson Natural Language Understanding cloud-based service to perform sentiment analysis of tweets. This is very important because it reminds the reader that reusing managed hosted services, rather than building the capability in-house, can sometimes be an attractive option.

I start with an introduction to the Apache Spark parallel computing framework, and then move on to building the application in four parts (a minimal streaming sketch follows the list):

  • Part 1: Acquiring the Twitter data with Spark Structured Streaming

  • Part 2: Enriching the data with the sentiment and the most relevant entities extracted from the text

  • Part 3: Operationalizing the analytics by creating a real-time dashboard PixieApp

  • Part 4: An optional part that re-implements the application with Apache Kafka and the IBM Streaming Designer hosted service, to demonstrate how to add greater scalability
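
To give a flavor of Part 1, here is a minimal Structured Streaming sketch that uses a local socket source as a stand-in for the Twitter feed (host and port are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# Read a stream of text lines from a socket (stand-in for the tweet source)
tweets = (spark.readStream.format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load())

# Write each micro-batch to the console as it arrives
query = tweets.writeStream.outputMode("append").format("console").start()
query.awaitTermination()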

I think the reader, especially those who are not familiar with Apache Spark, will enjoy this chapter, as it is a little easier to follow than the previous one. The key takeaway is how to build analytics that scale using Jupyter Notebooks connected to a Spark cluster.

In Chapter 8, Analytics Study: Prediction - Financial Time Series Analysis and Forecasting, I talk about time series analysis, a very important field of data science with lots of practical applications in industry. I start the chapter with a deep dive into the NumPy library, which is foundational to so many other libraries, such as pandas and SciPy. I then proceed with building the sample application, which analyzes a time series composed of historical stock data, in two parts (a minimal sketch follows the list):

  • Part 1: Provides a statistical exploration of the time series including various charts such as autocorrelation function (ACF) and partial autocorrelation function (PACF)

  • Part 2: Builds a predictive model based on the ARIMA algorithms using the statsmodels Python library
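
For a flavor of both parts, here is a minimal sketch using a random walk as a placeholder for the stock prices (the ARIMA order is illustrative, and the import path reflects the statsmodels releases contemporary with the book):

import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARIMA

# Placeholder series: a random walk standing in for daily closing prices
ts = pd.Series(np.random.randn(200).cumsum() + 100,
               index=pd.date_range("2017-01-01", periods=200))

# Part 1: statistical exploration with ACF and PACF charts
plot_acf(ts)
plot_pacf(ts)

# Part 2: fit a predictive ARIMA model (order chosen for illustration)
model = ARIMA(ts, order=(1, 1, 1)).fit(disp=0)
print(model.summary())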

Time series analysis is such an important field of data science, and one I consider underrated. I personally learned a lot while writing this chapter. I certainly hope that the reader will enjoy it as well, and that reading it will spur an interest in learning more about this great topic. If that's the case, I also hope that you'll be convinced to try out Jupyter and PixieDust as you explore time series analysis further.

In Chapter 9, Analytics Study: Graph Algorithms - US Domestic Flight Data Analysis, I complete the series of industry use cases with a study of graphs. I chose a sample application that analyzes flight delays because the data is readily available and it's a good fit for graph algorithms (well, in full disclosure, I may also have chosen it because I had already written a similar application to predict flight delays based on weather data, using Apache Spark MLlib: https://developer.ibm.com/clouddataservices/2016/08/04/predict-flight-delays-with-apache-spark-mllib-flightstats-and-weather-data).

I start with an introduction to graphs and the associated graph algorithms, including several of the most popular ones, such as breadth-first search and depth-first search. I then proceed with an introduction to the networkx Python library that is used to build the sample application (a toy example follows the parts list below).

The application is built in four parts:

  • Part 1: Shows how to load the US domestic flight data into a graph

  • Part 2: Creates the USFlightsAnalysis PixieApp, which lets the user select an origin and a destination airport and then displays a Mapbox map of the shortest path between the two airports, according to a selected centrality measure

  • Part 3: Adds data exploration to the PixieApp, including various statistics for each airline that flies out of the selected origin airport

  • Part 4: Uses the techniques learned in Chapter 8, Analytics Study: Prediction - Financial Time Series Analysis and Forecasting, to build an ARIMA model for predicting flight delays
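
As a toy illustration of the building blocks used here, the following sketch builds a small airport graph with networkx (the airports and edge weights are made up):

import networkx as nx

# Toy graph: nodes are airports, edge weights are hypothetical average delays
G = nx.Graph()
G.add_weighted_edges_from([
    ("JFK", "ORD", 15.0), ("ORD", "DEN", 8.5),
    ("JFK", "ATL", 5.0), ("ATL", "DEN", 12.0),
])

# Shortest weighted path between two airports (Dijkstra)
print(nx.dijkstra_path(G, "JFK", "DEN"))

# A centrality measure, as used in the chapter to rank airports
print(nx.degree_centrality(G))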

Graph theory is another important and growing field of data science, and this chapter nicely rounds out the series, which I hope provides a diverse and representative set of industry use cases. For readers who are particularly interested in using graph algorithms with big data, I recommend looking at Apache Spark GraphX (https://spark.apache.org/graphx), which implements many of the graph algorithms through a very flexible API.

In Chapter 10, The Future of Data Analysis and Where to Develop your Skills, I end the book by giving a brief summary and explaining my take on Drew Conway's Venn diagram. Then I talk about the future of AI and data science and how companies could prepare themselves for the AI and data science revolution. I also list some great references for further learning.

Appendix, PixieApp Quick-Reference, is a developer quick-reference guide that provides a summary of all the PixieApp attributes. It explains the various annotations, custom HTML attributes, and methods with the help of appropriate examples.

But enough of the introduction: let's get started on our journey with the first chapter, Programming and Data Science – A New Toolset.

To get the most out of this book

  • Most of the software needed to follow the examples is open source and therefore free to download. Instructions are provided throughout the book, starting with installing Anaconda, which includes the Jupyter Notebook server (installing PixieDust itself is shown below).

  • In Chapter 7, Analytics Study: NLP and Big Data with Twitter Sentiment Analysis, the sample application requires the use of IBM Watson cloud services, including NLU and Streams Designer. These services come with a free tier plan, which is sufficient to follow along with the example.
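
Once Anaconda is in place, PixieDust itself can be installed from PyPI (see the PixieDust documentation for details):

pip install pixiedust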

Download the example code files

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at http://www.packtpub.com.

  2. Select the SUPPORT tab.

  3. Click on Code Downloads & Errata.

  4. Enter the name of the book in the Search box and follow the on-screen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows

  • Zipeg / iZip / UnRarX for Mac

  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Thoughtful-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: http://www.packtpub.com/sites/default/files/downloads/ThoughtfulDataScience_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: "You can use the {%if ...%}...{%elif ...%}...{%else%}…{%endif%} notation to conditionally output text."

A block of code is set as follows:

import pandas
data_url = "https://data.cityofnewyork.us/api/views/e98g-f8hy/rows.csv?accessType=DOWNLOAD"
building_df = pandas.read_csv(data_url)
building_df

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

import pandas
data_url = "https://data.cityofnewyork.us/api/views/e98g-f8hy/rows.csv?accessType=DOWNLOAD"
building_df = pandas.read_csv(data_url)
building_df

Any command-line input or output is written as follows:

jupyter notebook --generate-config

Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes. For example: "The next step is to create a new route that takes the user value and returns the results. This route will be invoked by the Submit Query button."

Note

Warnings or important notes appear like this.

Note

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email us and mention the book's title in the subject of your message. If you have questions about any aspect of this book, please get in touch the same way.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report it to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packtpub.com.