Managing and analyzing big data is no small task: the data comes in so many forms that simply collecting and analyzing it has become a major challenge. To tackle the challenges companies and teams face day to day while analyzing big data, developers have created a new set of open-source technologies built around Hadoop.
Since its birth, the Apache Hadoop project at the Apache Software Foundation has grown considerably and added many new members to its family. What began as a single piece of software has evolved into an entire ecosystem.
Spark, Hive, HBase, and Storm are some of the options companies are using. These technologies enable them to deal with massive amounts of data in real time, and they are constantly improving how companies work with big data on a day-to-day basis.
There are many projects in the Hadoop ecosystem; below we take a look at some of the most significant ones.
Apache Hadoop is the flagship technology that became the center of gravity for an entire ecosystem. Developed at Yahoo, it was originally a side project: developers at Yahoo needed a way to store and process the large amounts of data they were gathering from their new search engine. The technology was eventually contributed to the Apache Software Foundation.
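Hadoop's processing model, MapReduce, splits a job into a map step that emits key-value pairs and a reduce step that aggregates them. A minimal single-process sketch of the idea (not Hadoop's actual Java API) is the classic word count:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big challenge", "big ecosystem"]
result = reduce_phase(map_phase(docs))
print(result["big"])  # 3
```

Hadoop runs these two phases in parallel across a cluster, shuffling the map output so that all pairs with the same key reach the same reducer.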
Apache Hive was originally developed at Facebook and later contributed to the Apache Software Foundation. A data warehouse infrastructure built on top of Hadoop, its main job is to provide data summarization, query, and analysis.
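Hive lets analysts express summarization as SQL-like queries (HiveQL) over data stored in Hadoop. The flavor of such a query can be illustrated with Python's built-in sqlite3; the table and data here are invented, and Hive itself executes queries over HDFS rather than a local database:

```python
import sqlite3

# Toy "page_views" table standing in for a dataset stored in Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 10), ("home", 5), ("about", 3)],
)

# A summarization query of the kind HiveQL expresses over cluster data.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('about', 3), ('home', 15)]
```

The appeal is that analysts can write a familiar GROUP BY instead of hand-coding the equivalent MapReduce job.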
Apache HBase was born at a company named Powerset, which was later acquired by Microsoft. Its original goal was to process large amounts of data for natural language processing. At its core it is a non-relational, distributed database based on Google's Bigtable. It joined the Apache family in 2010.
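The Bigtable data model that HBase follows is essentially a sparse map from (row key, column family:qualifier, timestamp) to an uninterpreted value, with multiple timestamped versions per cell. A minimal in-memory sketch of that model (class and method names are ours, not HBase's API):

```python
from collections import defaultdict

class ToyBigtable:
    """Sketch of the Bigtable/HBase data model: a sparse map from
    (row key, column family:qualifier, timestamp) to a value."""

    def __init__(self):
        # row key -> column -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        self._rows[row][column].append((ts, value))

    def get(self, row, column):
        """Return the most recent version of a cell, or None if empty."""
        versions = self._rows[row][column]
        return max(versions)[1] if versions else None

t = ToyBigtable()
t.put("row1", "cf:name", "Hadoop", ts=1)
t.put("row1", "cf:name", "HBase", ts=2)
print(t.get("row1", "cf:name"))  # HBase
```

The real system shards rows across many servers and persists them on HDFS, which is what makes the model scale.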
Apache Spark is the new rising star of the Apache ecosystem. Developed at UC Berkeley, it is a fast alternative to Hadoop's MapReduce engine and, depending on the application, can be up to 100 times faster. Spark's developers support the Apache Software Foundation project and also offer a commercial service known as Spark-as-a-Service.
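Much of Spark's speed comes from its core abstraction: transformations such as map and filter are recorded lazily and only executed when an action is called, and results can be cached in memory instead of written to disk between steps as in MapReduce. A toy sketch of that idea (our own class, not Spark's RDD API):

```python
class ToyRDD:
    """Sketch of Spark's lazy-dataset idea: transformations are recorded,
    executed only when collect() is called, and the result is cached."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []
        self._cache = None

    def map(self, fn):
        return ToyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, fn):
        return ToyRDD(self._data, self._ops + [("filter", fn)])

    def collect(self):
        if self._cache is None:
            items = self._data
            for kind, fn in self._ops:
                if kind == "map":
                    items = [fn(x) for x in items]
                else:
                    items = [x for x in items if fn(x)]
            # Kept in memory, unlike MapReduce's per-step disk writes.
            self._cache = items
        return self._cache

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```

Because nothing runs until collect(), the engine can see the whole chain at once and optimize it, and repeated actions reuse the cached result.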
Apache Kafka originated at LinkedIn, where it was developed as a messaging system for the real-time data generated and processed by the company's career website and platform. It was donated to open source in 2011.
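Kafka's central abstraction is an append-only log: producers append messages, and each consumer reads from its own offset into the log, so many consumers can process the same stream independently. A single-process sketch of that idea (not Kafka's client API; the message strings are invented):

```python
class ToyLog:
    """Sketch of Kafka's core abstraction: an append-only message log
    where each consumer tracks its own read offset."""

    def __init__(self):
        self._messages = []

    def produce(self, message):
        self._messages.append(message)

    def consume(self, offset, max_messages=10):
        """Return up to max_messages starting at offset, plus the new offset."""
        batch = self._messages[offset:offset + max_messages]
        return batch, offset + len(batch)

log = ToyLog()
log.produce("job_view:123")
log.produce("job_view:456")

batch, offset = log.consume(0)
print(batch)   # ['job_view:123', 'job_view:456']
print(offset)  # 2
```

The real system partitions and replicates this log across brokers, which is what lets it handle a large website's event volume.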
Apache Storm is a real-time computation system that makes it easy to process unbounded streams of data reliably. It is sometimes described as an alternative to Spark. The company that originally developed it, BackType, was acquired by Twitter in 2011.
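Storm structures a computation as sources of unbounded streams ("spouts") feeding processing steps ("bolts"). Python generators give a compact single-process sketch of the same shape, since they too can model a stream that never ends:

```python
import itertools

def spout():
    """A source of an unbounded stream (Storm's 'spout').
    Here it just counts forever; in Storm it might read from a queue."""
    for i in itertools.count():
        yield i

def bolt(stream):
    """A processing step (a 'bolt'): keep even numbers and square them."""
    for x in stream:
        if x % 2 == 0:
            yield x * x

# The stream never ends, so we only ever take a finite slice of the output.
first_four = list(itertools.islice(bolt(spout()), 4))
print(first_four)  # [0, 4, 16, 36]
```

Storm additionally runs spouts and bolts in parallel across a cluster and replays tuples on failure, which is where the "reliably" in the description comes from.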
Apache NiFi, short for "NiagaraFiles," was developed by the US National Security Agency (NSA). Its main role is to automate the flow of data between systems, and it is managed through a web-based interface. Reflecting its NSA origins, it supports SSL, SSH, HTTPS, and role-based authentication and authorization.
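A typical NiFi flow routes incoming files to different downstream systems based on their attributes. A tiny sketch of that routing step in plain Python (in real NiFi you configure such processors in the web UI rather than writing code; the filenames here are invented):

```python
def route_on_extension(flowfiles):
    """Route incoming files by extension, NiFi-style: each file ends up
    in exactly one outgoing relationship ('csv' or 'other')."""
    routed = {"csv": [], "other": []}
    for name in flowfiles:
        key = "csv" if name.endswith(".csv") else "other"
        routed[key].append(name)
    return routed

incoming = ["sales.csv", "notes.txt", "users.csv"]
print(route_on_extension(incoming))
# {'csv': ['sales.csv', 'users.csv'], 'other': ['notes.txt']}
```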
Apache Flink is a distributed data analysis engine used to process both batch and streaming data.
Apache Arrow was developed by a company named Dremio, which also contributed to the Apache Drill project; in fact, Arrow is based on code from Apache Drill.
These were some highlights of Apache's Hadoop ecosystem. But this is not all: work is ongoing on many other projects as well. Documentation for all of them is available on the Apache Software Foundation website.