Data Analysis with Python

By: David Taieb

Overview of this book

Data Analysis with Python offers a modern approach to data analysis so that you can work with the latest and most powerful Python tools, AI techniques, and open source libraries. Industry expert David Taieb shows you how to bridge data science with the power of programming and algorithms in Python. You'll be working with complex algorithms and cutting-edge AI in your data analysis. Learn how to analyze data with hands-on examples using Python-based tools and Jupyter Notebook. You'll find the right balance of theory and practice, with extensive code files that you can integrate right into your own data projects. Explore the power of this approach by working with it across key industry case studies. Four full projects connect you to the most critical data analysis challenges you're likely to meet today. The first is an image recognition application with TensorFlow, reflecting the importance of AI in data analysis today. The second analyzes social media trends, exploring big data issues and AI approaches to natural language processing. The third is a financial portfolio analysis application that engages you with time series analysis, which is pivotal to many data science applications. The fourth takes you into graph algorithms and the power of programming in modern data science. You'll wrap up with a thoughtful look at the future of data science and how it will harness the power of algorithms and artificial intelligence.

Back to our sentiment analysis of Twitter hashtags project


The quick data pipeline prototype we built gave us a good understanding of the data, but then we needed to design a more robust architecture and make our application enterprise ready. Our primary goal was still to gain experience in building data analytics, and not spend too much time on the data engineering part. This is why we tried to leverage open source tools and frameworks as much as possible:

  • Apache Kafka (https://kafka.apache.org): This is a scalable streaming platform for processing the high volume of tweets in a reliable and fault-tolerant way.

  • Apache Spark (https://spark.apache.org): This is an in-memory cluster-computing framework. Spark provides a programming interface that abstracts the complexity of parallel computing.

  • Jupyter Notebooks (http://jupyter.org): These interactive web-based documents (Notebooks) let users remotely connect to a computing environment (Kernel) to create advanced data analytics. Jupyter Kernels support a variety of programming languages (Python, R, Java/Scala, and so on) as well as multiple computing frameworks (Apache Spark, Hadoop, and so on).

For the sentiment analysis part, we decided to replace the code we wrote using the textblob Python library with the Watson Tone Analyzer service (https://www.ibm.com/watson/services/tone-analyzer), which is a cloud-based REST service that provides advanced sentiment analysis, including detection of emotional, language, and social tone. Even though the Tone Analyzer is not open source, a free version that can be used for development and trial is available on IBM Cloud (https://www.ibm.com/cloud).
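Swapping one sentiment engine for another is easier when the rest of the pipeline only depends on a small scoring interface. The following is a minimal sketch of that idea, not the code used in the project: `ToneScorer`, `keyword_scorer`, and `enrich` are hypothetical names, and the keyword-based scorer is a toy stand-in that reproduces neither the textblob API nor the Watson Tone Analyzer API.

```python
from typing import Callable, Dict

# Assumption: a tone scorer is any callable that maps tweet text to named scores.
ToneScorer = Callable[[str], Dict[str, float]]


def keyword_scorer(text: str) -> Dict[str, float]:
    """Toy stand-in scorer (hypothetical): fraction of words in each keyword set."""
    words = text.lower().split()
    joy_words = {"love", "great", "happy"}
    anger_words = {"hate", "awful", "angry"}
    total = max(len(words), 1)  # avoid dividing by zero on empty text
    return {
        "joy": sum(w in joy_words for w in words) / total,
        "anger": sum(w in anger_words for w in words) / total,
    }


def enrich(tweet: dict, scorer: ToneScorer) -> dict:
    """Return a copy of the tweet with the scorer's output attached under 'tones'."""
    return {**tweet, "tones": scorer(tweet["text"])}


tweet = {"id": 1, "text": "I love this great library"}
enriched = enrich(tweet, keyword_scorer)
print(enriched["tones"])
```

Because `enrich` only needs a callable, a function that wraps a remote REST service could be passed in place of `keyword_scorer` without changing the downstream code.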

Our architecture now looks like this:

Twitter sentiment analysis data pipeline architecture

In the preceding diagram, we can break down the workflow into the following steps:

  1. Produce a stream of tweets and publish them into a Kafka topic, which can be thought of as a channel that groups events together. In turn, a receiver component can subscribe to this topic/channel to consume these events.

  2. Enrich the tweets with emotional, language, and social tone scores: use Spark Streaming to subscribe to Kafka topics from component 1 and send the text to the Watson Tone Analyzer service. The resulting tone scores are added to the data for further downstream processing. This component was implemented using Scala and, for convenience, was run using a Jupyter Scala Notebook.

  3. Data analysis and exploration: For this part, we decided to go with a Python Notebook simply because Python offers a more attractive ecosystem of libraries, especially around data visualization.

  4. Publish results back to Kafka.

  5. Implement a real-time dashboard as a Node.js application.
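The flow of events through these steps can be illustrated with a small in-memory sketch. It only mimics the topic-based publish/subscribe idea: `InMemoryBroker`, `fake_tone_score`, and the topic names `tweets` and `enriched-tweets` are hypothetical, and the handlers run synchronously on publish. A real Kafka deployment instead persists events in a log that consumers pull from, and the project's enrichment step ran in Spark Streaming with Scala rather than in plain Python.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[dict], None]


class InMemoryBroker:
    """Toy stand-in for a message broker: topics fan events out to subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # Deliver the event to every handler registered for this topic.
        for handler in list(self._subscribers[topic]):
            handler(event)


def fake_tone_score(text: str) -> Dict[str, int]:
    """Hypothetical placeholder for the call to a tone analysis service."""
    return {"exclamations": text.count("!")}


broker = InMemoryBroker()
dashboard_rows: List[dict] = []  # stands in for the dashboard's data feed


def enrich_and_republish(tweet: dict) -> None:
    # Step 2: add scores to the tweet; step 4: publish the result to a new topic.
    enriched = {**tweet, "tones": fake_tone_score(tweet["text"])}
    broker.publish("enriched-tweets", enriched)


broker.subscribe("tweets", enrich_and_republish)
broker.subscribe("enriched-tweets", dashboard_rows.append)  # step 5 consumer

# Step 1: produce a stream of tweets.
broker.publish("tweets", {"id": 1, "text": "Spark is fast!"})
broker.publish("tweets", {"id": 2, "text": "Kafka scales"})
print(dashboard_rows)
```

The point of the topic indirection is that the producer never references the enrichment step or the dashboard directly, so each component can be replaced or scaled independently.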

With a team of three people, it took us about 8 weeks to get the dashboard working with real-time Twitter sentiment data. There are multiple reasons for this seemingly long time:

  • Some of the frameworks and services, such as Kafka and Spark Streaming, were new to us and we had to learn how to use their APIs.

  • The dashboard frontend was built as a standalone Node.js application using the Mozaïk framework (https://github.com/plouc/mozaik), which made it easy to build powerful live dashboards. However, we found a few limitations with the code, which forced us to dive into the implementation and write patches, hence adding delays to the overall schedule.

The results are shown in the following screenshot:

Twitter sentiment analysis real-time dashboard