The world is evolving day by day, from automated call assistance to smart devices taking intelligent decisions, to self-driven decision-making cars to humanoid robots, all driven by processing large amount of data and analyzing it. We are rapidly approaching to the new era of data age. The IDC whitepaper (https://www.seagate.com/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf) on data evolution published in 2017 predicts data volumes to reach 163 zettabytes (1 zettabyte = 1 trillion terabytes) by the year 2025. This will involve digitization of all the analog data that we see between now and then. This flood of data will come from a broad variety of different device types, including IoT devices (sensor data) from industrial plants as well as home devices, smart meters, social media, wearables, mobile phones, and so on.
In our day-to-day life, we have seen ourselves participating in this evolution. For example, I started using a mobile phone in 2000 and, at that time, it had basic functions such as calls, torch, radio, and SMS. My phone could barely generate any data as such. Today, I use a 4G LTE smartphone capable of transmitting GBs of data including my photos, navigation history, and my health parameters from my smartwatch, on different devices over the internet. This data is effectively being utilized to make smart decisions.
Let's look at some real-world examples of big data:
- Companies such as Facebook and Instagram are using face recognition tools to identify photos, classify them, and bring you friend suggestions by comparison
- Companies such as Google and Amazon are looking at human behavior based on navigation patterns and location data, providing automated recommendations for shopping
- Many government organizations are analyzing information from CCTV cameras, social media feeds, network traffic, phone data, and bookings to trace criminals and predict potential threats and terrorist attacks
- Companies are using sentiment analysis from message posts and tweets to improve the quality of their products, as well as brand equities, and have targeted business growth
- Every minute, we send 204 million emails, view 20 million photos on Flickr, perform 2 million searches on Google, and generate 1.8 million likes on Facebook (Source)
With this data growth, the demands to process, store, and analyze data in a faster and scalable manner will arise. So, the question is: are we ready to accommodate these demands? Year after year, computer systems have evolved and so has storage media in terms of capacities; however, the capability to read-write byte data is yet to catch up with these demands. Similarly, data coming from various sources and various forms needs to be correlated together to make meaningful information. For example, with a combination of my mobile phone location information, billing information, and credit card details, someone can derive my interests in food, social status, and financial strength. The good part is that we see a lot of potential of working with big data. Today, companies are barely scratching the surface; however, we are still struggling to deal with storage and processing problems unfortunately.
This chapter is intended to provide the necessary background for you to get started on Apache Hadoop. It will cover the following key topics:
- How it all started
- What Apache Hadoop is and why it is important
- How Apache Hadoop works
- Hadoop 3.0 releases and new features
- Choosing the right Hadoop distribution