Big Data has caught the attention of companies around the world, and many now rely heavily on it to understand the behavior of their customers. What often gets lost in all this is that Big Data is not about real-time vs. batch processing. Given the number of options it has to offer, though, that common view of it is neither surprising nor controversial.
Still, if one had to place it on a scale, it would be more controversial than surprising. What is controversial is the nature of the infrastructure required in order to get the most out of Big Data. Analytics is very addictive, and it soon turns into a problem if one’s infrastructure is unable to keep up with the requirements of the data.
There is definitely more to the success of Big Data than Spark or Hadoop, and that is infrastructure. The cloud certainly has a big role to play in Big Data analytics, but the much bigger factor is data processing, and that depends entirely on the infrastructure.
That said, the cloud is emerging as an increasingly popular option for testing and developing new analytics applications, and for processing Big Data that is generated outside the walls of the enterprise.
An analytics system has three main components: a single source of truth, streaming data, and task clusters.
Some of the best-known data services, such as AWS, provide multiple ways to store the single source of truth, ranging from S3 storage to databases like DynamoDB or RDS to data warehousing solutions like Redshift.
Data from the single source of truth is often augmented with streaming data such as website clickstreams or financial transactions. For this, AWS offers Kinesis for real-time data processing; other options such as Apache Storm and Spark also exist.
Task clusters are groups of instances running on a distributed network, each dedicated to a specific task such as data visualization.
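To make the idea of augmenting a single source of truth with streaming data more concrete, here is a minimal Python sketch. It uses no AWS service at all: the `customers` dictionary stands in for a batch snapshot, the `clickstream` list stands in for events that would arrive from a stream, and the `augment` function and all field names are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical batch snapshot standing in for the single source of truth.
customers = {
    "c1": {"name": "Ada", "lifetime_orders": 12},
    "c2": {"name": "Grace", "lifetime_orders": 3},
}

# Hypothetical clickstream events standing in for streaming data.
clickstream = [
    {"customer_id": "c1", "page": "/pricing"},
    {"customer_id": "c2", "page": "/home"},
    {"customer_id": "c1", "page": "/checkout"},
    {"customer_id": "c3", "page": "/home"},
]


def augment(source_of_truth, events):
    """Attach a recent click count to each known customer record.

    Returns the enriched records plus a sorted list of customer ids
    that appear in the events but not in the source of truth.
    """
    clicks = Counter(event["customer_id"] for event in events)
    enriched = {
        customer_id: {**record, "recent_clicks": clicks.get(customer_id, 0)}
        for customer_id, record in source_of_truth.items()
    }
    unknown = sorted(set(clicks) - set(source_of_truth))
    return enriched, unknown


enriched, unknown = augment(customers, clickstream)
```

The batch records are left unmodified and the streaming counts are layered on top, which mirrors the point that real-time data adds to the stored data rather than replacing it.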
With these components in place, Big Data is not about real-time vs. batch processing; it is about a broad set of tools that allow you to handle data in many ways.
Real-time data processing is very important, but it is an additive part of the Big Data ecosystem.
Many believe that Big Data is only about large volumes of data and tend to neglect the complexities inherent in the variety and velocity of data. Even the term volume is not as simple as it may sound.
The real challenge in dealing with data is not its absolute scale but its relative scale. Today’s businesses want a platform that lets them move from one scale to another easily and gracefully, because those who go out and buy expensive infrastructure to answer one problem find that, by the time they get around to answering the question, the real question has changed and the business world has moved on.
The cloud is great, but this is not an ultimate “all in on cloud” vs. “all in on premises” situation. Where the bulk of the data is created on premises, analytics will stay on premises; where the data is created through stream processing, the natural starting point is the cloud.