Book Image

Mastering Spark for Data Science

By : Andrew Morgan, Antoine Amend, Matthew Hallett, David George
Book Image

Mastering Spark for Data Science

By: Andrew Morgan, Antoine Amend, Matthew Hallett, David George

Overview of this book

Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. In order to operate at this level you need to build data science solutions of substance –solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep dives into using Spark to deliver production-grade data science solutions. This process is demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights.You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib, and more. You will be introduced to advanced techniques and methods that will help you to construct commercial-grade data products. Focusing on a sequence of tutorials that deliver a working news intelligence service, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms so they scale linearly.
Table of Contents (22 chapters)
Mastering Spark for Data Science
About the Authors
About the Reviewer
Customer Feedback

About the Authors

Andrew Morgan is a specialist in data strategy and its execution, and has deep experience in the supporting technologies, system architecture, and data science that bring it to life. With over 20 years of experience in the data industry, he has worked designing systems for some of its most prestigious players and their global clients – often on large, complex and international projects. In 2013, he founded ByteSumo Ltd, a data science and big data engineering consultancy, and he now works with clients in Europe and the USA. Andrew is an active data scientist, and the inventor of the TrendCalculus algorithm. It was developed as part of his ongoing research project investigating long-range predictions based on machine learning the patterns found in drifting cultural, geopolitical and economic trends. He also sits on the Hadoop Summit EU data science selection committee, and has spoken at many conferences on a variety of data topics. He also enjoys participating in the Data Science and Big Data communities where he lives in London.

This book is dedicated to my wife Steffy, to my children Alice and Adele, and to all my friends and colleagues who have been endlessly supportive. It is also dedicated to the memory of one my earliest mentors whom I studied under at the University of Toronto, a Professor Ferenc Csillag. Back in 1994, Ferko inspired me with visions of a future where we could use planet-wide data collection and sophisticated algorithms to monitor and optimize the world around us. It was an idea that changed my life, and his dream of a world saved by Big Data Science, is one I’m still chasing.

Antoine Amend is a data scientist passionate about big data engineering and scalable computing. The book’s theme of torturing astronomical amounts of unstructured data to gain new insights mainly comes from his background in theoretical physics. Graduating in 2008 with a Msc. in Astrophysics, he worked for a large consultancy business in Switzerland before discovering the concept of big data at the early stages of Hadoop. He has embraced big data technologies ever since, and is now working as the Head of Data Science for cyber security at Barclays Bank. By combining a scientific approach with core IT skills, Antoine qualified two years running for the Big Data World Championships finals held in Austin TX. He Placed in the top 12 in both 2014 and 2015 edition (over 2000+ competitors) where he additionally won the Innovation Award using the methodologies and technologies explained in this book.

I would like to thank my wife for standing beside me, she has been my motivation for continuing to improve my knowledge and move my career forward. I thank my wonderful kids for always teaching me how to step back whenever it is necessary to clear my mind and get fresh new ideas.

I would like to extend my thanks to my co-workers, especially Dr. Samuel Assefa, Dr. Eirini Spyropoulou and Will Hardman for their patience listening to my crazy theories, and everyone else I had the pleasure to work with over the past few years. Finally, I want to address a special thanks to all my previous managers and mentors who helped me shape up my career in data and analytics; thanks to Manu, Toby, Gary and Harry.

David George is a distinguished distributed computing expert with 15+ years of data systems experience, mainly with globally recognized IT consultancies and brands. Working with core Hadoop technologies since the early days, he has delivered implementations at the largest scale. David always takes a pragmatic approach to software design and values elegance in simplicity.

Today he continues to work as a lead engineer, designing scalable applications for financial sector customers with some of the toughest requirements. His latest projects focus on the adoption of advanced AI techniques for increasing levels of automation across knowledge-based industries.

For Ellie, Shannon, Pauline and Pumpkin – here’s to the sequel!

Matthew Hallett is a Software Engineer and Computer Scientist with over 15 years of industry experience. He is an expert Object Oriented programmer and systems engineer with extensive knowledge of low level programming paradigms and, for the last 8 years, has developed an expertise in Hadoop and distributed programming within mission critical environments, comprising multi­thousand­node data centres. With consultancy experience in distributed algorithms and the implementation of distributed computing architectures, in a variety of languages, Matthew is currently a Consultant Data Engineer in the Data Science & Engineering team at a top four audit firm.

Lynnie, thanks for your understanding and sacrifices that afforded me the time during late nights, weekends and holidays to write this book. Nugget, you make it all worthwhile.

We would also like to thank Gary Richardson, Dr David Pryce, Dr Helen Ramsden, Dr Sima Reichenbach and Dr Fabio Petroni for their invaluable advice and guidance that has led to the completion of this huge project – without their help and contributions this book may never have been completed!