Data Analysis with Python

By: David Taieb

Overview of this book

Data Analysis with Python offers a modern approach to data analysis so that you can work with the latest and most powerful Python tools, AI techniques, and open source libraries. Industry expert David Taieb shows you how to bridge data science with the power of programming and algorithms in Python. You'll work with complex algorithms and cutting-edge AI in your data analysis, and learn how to analyze data with hands-on examples using Python-based tools and Jupyter Notebook. You'll find the right balance of theory and practice, with extensive code files that you can integrate right into your own data projects. You'll then explore the power of this approach by applying it across key industry case studies. Four full projects connect you to the most critical data analysis challenges you're likely to meet today. The first is an image recognition application with TensorFlow, reflecting the importance of AI in data analysis today. The second analyzes social media trends, exploring big data issues and AI approaches to natural language processing. The third is a financial portfolio analysis application that engages you with time series analysis, which is pivotal to many data science applications. The fourth immerses you in graph algorithms and the power of programming in modern data science. You'll wrap up with a thoughtful look at the future of data science and how it will harness the power of algorithms and artificial intelligence.

Jupyter Notebooks at the center of our strategy

In essence, Notebooks are web documents composed of editable cells that let you run commands interactively against a backend engine. As their name indicates, we can think of them as the digital version of a paper scratch pad used to write notes and results about experiments. The concept is very powerful and simple at the same time: a user enters code in the language of their choice (most implementations of Notebooks support multiple languages, such as Python, Scala, R, and many more), runs the cell, and gets the results interactively in an output area below the cell that becomes part of the document. Results can be of several types, including text, HTML, and images, which is great for graphing data. It's like working with a traditional REPL (short for Read-Eval-Print Loop) program on steroids, since the Notebook can be connected to powerful compute engines, such as Apache Spark (https://spark.apache.org) or Python Dask (https://dask.pydata.org) clusters, allowing you to experiment with big data if needed.

Within Notebooks, any classes, functions, or variables created in a cell are visible in the cells below, enabling you to write complex analytics piece by piece, iteratively testing your hypotheses and fixing problems before moving on to the next phase. In addition, users can also write rich text using the popular Markdown language or mathematical expressions using LaTeX (https://www.latex-project.org/), to describe their experiments for others to read.
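To make the idea of cells building on one another concrete, here is a minimal sketch that imitates it in plain Python. It is not how Jupyter is implemented: the `run_cell` helper and the `session` dictionary are hypothetical names invented for this illustration, and a single dictionary stands in for the backend session. Strictly speaking, what matters is the order in which cells are run, since they all share the same session state.

```python
import contextlib
import io


def run_cell(source, namespace):
    """Execute one 'cell' of source code against a shared namespace and
    return whatever it printed, mimicking a Notebook output area."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, namespace)
    return buffer.getvalue()


# A single namespace plays the role of the Notebook's backend session.
session = {}

# Each string stands for one cell; later cells rely on names defined earlier.
cells = [
    "def mean(values):\n    return sum(values) / len(values)",
    "samples = [2, 4, 6, 8]",
    "print(mean(samples))",
]

outputs = [run_cell(cell, session) for cell in cells]
for number, output in enumerate(outputs, start=1):
    print(f"cell {number} output: {output!r}")
```

The third cell can call `mean` and read `samples` only because the first two cells already ran against the same namespace, which is the property that lets you develop an analysis piece by piece.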

The following figure shows parts of a sample Jupyter Notebook with a Markdown cell explaining what the experiment is about, a code cell written in Python to create 3D plots, and the actual 3D charts results:

Sample Jupyter Notebook

Why are Notebooks so popular?

In the last few years, Notebooks have seen a meteoric growth in popularity as the tool of choice for data science-related activities. There are multiple reasons for this, but I believe the main one is their versatility, which makes them an indispensable tool not just for data scientists but also for most of the personas involved in building data pipelines, including business analysts and developers.

For data scientists, Notebooks are ideal for iterative experimentation because they make it possible to quickly load, explore, and visualize data. Notebooks are also an excellent collaboration tool; they can be exported as JSON files and easily shared across the team, allowing experiments to be identically repeated and debugged when needed. In addition, because Notebooks are also web applications, they can be easily integrated into a multiuser cloud-based environment, providing an even better collaborative experience.
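As an illustration of what such a JSON file contains, the following sketch builds a tiny Notebook document by hand and round-trips it through the standard `json` module. It assumes version 4 of the Jupyter notebook format; the cell contents are invented for the example, and real Notebooks usually carry more metadata (kernel information, outputs, and so on) than shown here.

```python
import json

# Minimal document following the version 4 notebook format: a list of cells
# plus some metadata. The cell contents are invented for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Experiment notes\n", "Load the data and plot it."],
        },
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serializing gives the text that would be stored in an .ipynb file.
serialized = json.dumps(notebook, indent=1)

# Anyone receiving the file can load it back into the same structure.
restored = json.loads(serialized)
cell_types = [cell["cell_type"] for cell in restored["cells"]]
print(cell_types)
```

Because the whole document is plain JSON, it can be stored in version control, attached to an email, or processed by other tools without any special software.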

These environments can also provide on-demand access to large compute resources by connecting the Notebooks with clusters of machines using frameworks such as Apache Spark. Demand for these cloud-based Notebook servers is growing rapidly and, as a result, we're seeing an increasing number of SaaS (short for Software as a Service) solutions, both commercial, such as IBM Data Science Experience (https://datascience.ibm.com) or Databricks (https://databricks.com/try-databricks), and open source, such as JupyterHub (https://jupyterhub.readthedocs.io/en/latest).

For business analysts, Notebooks can be used as presentation tools that, thanks to their Markdown support, in most cases provide enough capabilities to replace traditional PowerPoint presentations. Generated charts and tables can be used directly to communicate the results of complex analytics; there's no need to copy and paste anymore, and changes in the algorithms are automatically reflected in the final presentation. For example, some Notebook implementations, such as Jupyter, provide an automated conversion of the cell layout to a slideshow, making the whole experience even more seamless.

Note

For reference, here are the steps to produce these slides in Jupyter Notebooks:

Using the View | Cell Toolbar | Slideshow menu, first annotate each cell by choosing between Slide, Sub-Slide, Fragment, Skip, or Notes.
Use the nbconvert jupyter command to convert the Notebook into a Reveal.js-powered HTML slideshow:

      jupyter nbconvert <pathtonotebook.ipynb> --to slides

Optionally, you can fire up a web application server to access these slides online:

      jupyter nbconvert <pathtonotebook.ipynb> --to slides --post serve
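The annotation made in the first step is saved in the Notebook file itself, in each cell's metadata under a `slideshow` entry with a `slide_type` value. As a rough sketch, assuming that convention, the same annotation could be applied programmatically to a Notebook already loaded from its JSON file. The `annotate_slides` helper and the two-cell Notebook below are hypothetical and only meant to show the shape of the data.

```python
import copy

# Slide types offered by the Slideshow cell toolbar, as stored in metadata.
SLIDE_TYPES = {"slide", "subslide", "fragment", "skip", "notes"}


def annotate_slides(notebook, slide_types):
    """Return a copy of `notebook` whose cells carry slideshow metadata.

    `slide_types` holds one slide type per cell, in cell order.
    """
    if len(slide_types) != len(notebook["cells"]):
        raise ValueError("expected one slide type per cell")
    annotated = copy.deepcopy(notebook)
    for cell, slide_type in zip(annotated["cells"], slide_types):
        if slide_type not in SLIDE_TYPES:
            raise ValueError(f"unknown slide type: {slide_type}")
        cell.setdefault("metadata", {})["slideshow"] = {"slide_type": slide_type}
    return annotated


# Hypothetical two-cell Notebook, already parsed from its JSON file.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Results"]},
        {"cell_type": "code", "execution_count": None, "metadata": {},
         "outputs": [], "source": ["print(42)"]},
    ],
}

slides = annotate_slides(notebook, ["slide", "fragment"])
for cell in slides["cells"]:
    print(cell["cell_type"], cell["metadata"]["slideshow"]["slide_type"])
```

Working on a copy keeps the original Notebook untouched, so the annotated version can be written to a separate file before running the nbconvert command on it.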

For developers, the situation is much less clear-cut. On the one hand, developers love REPL programming, and Notebooks offer all the advantages of an interactive REPL with the added bonus that they can be connected to a remote backend. By virtue of running in a browser, results can contain graphics and, since they can be saved, all or part of the Notebook can be reused in different scenarios. So, for a developer, provided that your language of choice is available, Notebooks offer a great way to try and test things out, such as fine-tuning an algorithm or integrating a new API. On the other hand, developers have shown little adoption of Notebooks for data science activities that could complement the work being done by data scientists, even though developers are ultimately responsible for operationalizing the analytics into applications that address customer needs.

To improve the software development life cycle and reduce time to value, developers need to start using the same tools, programming languages, and frameworks as data scientists, including Python with its rich ecosystem of libraries, and Notebooks, which have become such an important data science tool. Granted, developers have to meet data scientists in the middle and get up to speed on the theory and concepts behind data science. Based on my experience, I highly recommend using MOOCs (short for Massive Open Online Courses) such as Coursera (https://www.coursera.org) or edX (http://www.edx.org), which provide a wide variety of courses for every level.

However, having used Notebooks quite extensively, I have found that, while very powerful, they are primarily designed for data scientists, leaving developers with a steep learning curve. They also lack the application development capabilities that are so critical for developers. As we've seen in the Sentiment analysis of Twitter Hashtags project, building an application or a dashboard based on the analytics created in a Notebook can be very difficult, requiring an architecture that is hard to implement and that has a heavy footprint on the infrastructure.

It is to address these gaps that I decided to create the PixieDust (https://github.com/ibm-watson-data-lab/pixiedust) library and open source it. As we'll see in the next chapters, the main goal of PixieDust is to lower the cost of entry for new users (whether they are data scientists or developers) by providing simple APIs for loading and visualizing data. PixieDust also provides a developer framework with APIs for easily building applications, tools, and dashboards that can run directly in the Notebook and also be deployed as web applications.
