Experts in big data analytics spend a lot of time debating which of the two data analytics engines, Hadoop or Spark, performs better in business applications. While Hadoop has been around for a long time, Spark is a new data analytics system released just a couple of months ago. Both systems have been developed by Apache, and both are open source platforms. Each has its own strengths with regard to performance. There are some applications in which Hadoop scores above Spark, but Spark's ease of use and speed of operation put it well ahead of Hadoop.
In the book Hadoop: The Definitive Guide, Tom White quotes Grace Hopper: "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." Hadoop has long been the data analytics system preferred by businesses everywhere. The recent arrival of the Spark engine has, however, given businesses an alternative to Hadoop for data analytics.
There are two basic types of graph engines. The first is the graph database, which provides real-time, traversal-based algorithms over linked-list graphs represented on a single server (vendors include Neo4j, OrientDB, DEX, and InfiniteGraph). When a graph is on the order of 100 billion elements (vertices + edges), a single-server graph database can neither represent nor process it, and the second type, a multi-machine graph engine, is required. Here, instead of focusing on a particular vertex-centric, BSP-based graph-processing package such as Hama or Giraph, the results are presented using Hadoop directly (HDFS + MapReduce). Moreover, instead of being developed in Java, the MapReduce algorithms are written in the R programming language.
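To make the MapReduce approach to graph processing more concrete, here is a minimal sketch of how a simple graph statistic, the out-degree of each vertex, breaks down into map, shuffle, and reduce steps. It is written in plain Python and runs in memory on one machine, so it only imitates the programming model; it does not use Hadoop, HDFS, or R, and the function names (`map_edge`, `shuffle`, `reduce_degree`, `out_degrees`) and the tiny edge list are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_edge(edge):
    """Map phase: for each (source, target) edge, emit (source, 1)."""
    source, _target = edge
    yield (source, 1)


def shuffle(pairs):
    """Shuffle phase: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_degree(vertex, counts):
    """Reduce phase: sum the counts for one vertex to get its out-degree."""
    return (vertex, sum(counts))


def out_degrees(edges):
    """Run the three phases in sequence over an in-memory edge list."""
    mapped = chain.from_iterable(map_edge(edge) for edge in edges)
    grouped = shuffle(mapped)
    return dict(reduce_degree(v, c) for v, c in grouped.items())


# Hypothetical toy graph: each tuple is a directed edge (source, target).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
print(out_degrees(edges))  # {'a': 2, 'b': 1, 'c': 1}
```

On a real cluster the edge list would be stored as files in HDFS and the map and reduce functions would run in parallel on many machines, which is what allows the same logic to scale to graphs far too large for a single server.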
The sudden increase in the volume of data, from the order of gigabytes to zettabytes, has created the need for a more organized file system for storing and processing data. Demand from the data market has brought Hadoop into the limelight, making it one of the biggest players in the industry. The Hadoop Distributed File System (HDFS), Hadoop's best-known file system, and HBase, Hadoop's database, are among the most talked-about and advanced data storage and management systems on the market. HDFS is fault-tolerant by design and supports rapid data transfer between nodes even during system failures. HBase is a non-relational, open source NoSQL (Not Only SQL) database that runs on top of Hadoop.
For anyone just getting into the Big Data world, the terms Big Data and Hadoop seem synonymous. As people learn the ecosystem, its tools, and how they work, they become more aware of what big data actually means and what role Hadoop plays in the big data ecosystem. According to Wikipedia, "Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate". To put it in simple terms, as the size of the data increases, the usual processing methods take too long or prove too costly. Hadoop was created in 2005 by Doug Cutting, who was inspired by Google's white papers on GFS and MapReduce.