Big Data has captured the attention of companies around the world, and many rely heavily on it to understand the behavior of their customers. Yet one thing is often misunderstood: Big Data is not about real-time versus batch processing. Given the number of options on offer, that common misconception is neither surprising nor controversial.
If anything, the picture tilts toward controversial rather than surprising. What is controversial is the nature of the infrastructure required to get the most out of Big Data. Analytics is addictive, and that addiction quickly becomes a problem when the infrastructure cannot keep up with the demands of the data.
There is more to the success of Big Data than Spark or Hadoop, and that is infrastructure. The cloud certainly has a big role to play in Big Data analytics, but the bigger factor is data processing, and that depends entirely on the infrastructure beneath it.
The cloud is emerging as an increasingly popular option for testing and developing new analytics applications, and for processing big data generated outside the walls of the enterprise.
An analytics system has three main components: a single source of truth, streaming data that augments it, and task clusters.
Well-known data platforms such as AWS provide multiple ways to store the single source of truth, ranging from S3 object storage to databases like DynamoDB or RDS to data warehousing solutions like Redshift.
Data from the single source of truth is often augmented with streaming data such as website clickstreams or financial transactions. Here AWS offers Kinesis for real-time data processing, though other options such as Apache Storm and Spark also exist.
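To make the streaming side concrete, here is a minimal Python sketch of the kind of sliding-window aggregation a clickstream consumer might compute. The class name, the event format, and the timestamps are illustrative inventions, not any Kinesis, Storm, or Spark API:

```python
from collections import deque


class ClickstreamWindow:
    """Toy sliding-window counter over clickstream events.

    Stands in for the aggregation a real-time consumer (e.g. reading
    from a Kinesis shard) might maintain; not a real AWS API.
    """

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, page) pairs, oldest first

    def record(self, page, timestamp):
        """Add one click event and drop events outside the window."""
        self.events.append((timestamp, page))
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def counts(self):
        """Return per-page click counts within the current window."""
        out = {}
        for _, page in self.events:
            out[page] = out.get(page, 0) + 1
        return out
```

A real pipeline would feed `record()` from a stream consumer loop rather than calling it directly; the windowing logic stays the same.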
Task clusters are groups of instances running on a distributed network and dedicated to a specific task, such as data visualization.
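The division of labor inside such a cluster can be sketched in a few lines of Python: shard the data, run the same specific task (here, histogram-style binning that might feed a visualization) on each shard in parallel, and merge the partial results. The function names are illustrative, and a thread pool stands in for separate instances:

```python
from concurrent.futures import ThreadPoolExecutor


def bin_values(values, bin_width=10):
    """One shard's worth of work: bucket numeric values into bins."""
    bins = {}
    for v in values:
        b = (v // bin_width) * bin_width
        bins[b] = bins.get(b, 0) + 1
    return bins


def merge_bins(partials):
    """Combine per-shard bin counts into one histogram."""
    merged = {}
    for part in partials:
        for b, c in part.items():
            merged[b] = merged.get(b, 0) + c
    return merged


def cluster_bin(values, workers=4):
    """Shard the input, bin each shard in parallel, merge the results."""
    shard = max(1, len(values) // workers)
    shards = [values[i:i + shard] for i in range(0, len(values), shard)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        partials = list(ex.map(bin_values, shards))
    return merge_bins(partials)
```

Because binning is associative, the parallel result is identical to binning the whole dataset on one machine; only the wall-clock time changes.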
Taken together, these components show that Big Data is not about real-time versus batch processing; it is about a broad set of tools that let you handle data in many ways.
Real-time data processing is certainly important, but it is an additive part of the Big Data ecosystem.
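That additive relationship is often described as a lambda-style architecture: totals precomputed by a batch job are merged with increments arriving from the stream. A toy sketch, with a hypothetical key/amount event format:

```python
def combined_view(batch_totals, stream_events):
    """Merge precomputed batch totals with live stream increments.

    batch_totals: dict from an overnight batch job, e.g. {"page": views}.
    stream_events: iterable of (key, amount) pairs arriving in real time.
    """
    view = dict(batch_totals)  # start from the batch layer's answer
    for key, amount in stream_events:
        view[key] = view.get(key, 0) + amount  # fold in the speed layer
    return view
```

The batch layer supplies cheap, accurate history; the streaming layer keeps the view current between batch runs. Neither replaces the other.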
Many believe that Big Data is only about large volumes of data and neglect the complexities inherent in its variety and velocity; even the term volume is not as simple as it may sound.
The real challenge in dealing with data is not its absolute scale but its relative scale. Today's businesses want a platform that lets them move from one scale to another easily and gracefully, because those who buy expensive infrastructure to answer one question often find that by the time they get an answer, the real question has changed and the business has moved on.
The cloud is great, but it is not an all-or-nothing "all in on cloud" versus "all on premise" situation. Where the bulk of the data is created on premise, analytics will stay tied to the premise; where the data arrives through stream processing, the natural starting point is the cloud.