So what is Hadoop? The name may be new to your ears, but it has actually been around since 2006, and its 1.0 release dates back to late 2011. Hadoop is a Java-based framework that lets us store Big Data in a distributed environment so we can process it in parallel. There are 2 main components:
The first component is the Hadoop Distributed File System, or HDFS, which lets you store data of various formats across a cluster.
The second component is Yet Another Resource Negotiator. I mean really, YARN really does stand for Yet Another Resource Negotiator. Though it may sound odd, it is what enables parallel processing over the data, so it works hand in hand with HDFS.
What is HDFS
The Hadoop Distributed File System, or HDFS, is the main data storage system used by Hadoop applications. It has a NameNode and DataNode architecture that implements a distributed file system providing high-performance access to data.
HDFS is a key part of many Hadoop technologies, as it provides a reliable way to manage big data and to support related big data analytics applications.
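To make the storage model above a little more concrete: HDFS splits each file into fixed-size blocks and replicates every block across DataNodes. Here is a minimal back-of-the-envelope sketch in Java (the class and method names are mine, but the 128 MB default block size and replication factor of 3 are Hadoop's defaults):

```java
// A sketch of HDFS block arithmetic: files are split into blocks
// (128 MB by default) and each block is replicated (3 copies by default).
public class HdfsBlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default block size: 128 MB
    static final int REPLICATION = 3;                   // default replication factor

    // How many blocks a file of the given size occupies (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Raw storage consumed across the cluster, counting all replicas.
    static long rawStorageBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        // A 1 GB file becomes 8 blocks, each stored 3 times.
        System.out.println("Blocks for a 1 GB file: " + blockCount(oneGb));      // 8
        System.out.println("Raw bytes with replication: " + rawStorageBytes(oneGb)); // 3 GB
    }
}
```

This is why a Hadoop cluster's usable capacity is roughly a third of its raw disk space at the default replication factor.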
As I stated above, YARN stands for "Yet Another Resource Negotiator". YARN was introduced in Hadoop 2.0 to remove the bottleneck on the JobTracker that was present in the earlier version 1.0. YARN has since evolved into what is essentially a large-scale distributed operating system for Big Data processing.
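To give a feel for the parallel-processing idea YARN enables, here is a rough single-JVM analogy using plain Java threads. To be clear, this is not the YARN API; it only illustrates the pattern YARN applies at cluster scale, which is splitting work into independent tasks and running them concurrently:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Single-machine analogy for YARN-style parallelism: each line of input
// is an independent task, and a thread pool plays the role of the cluster.
public class ParallelWordCount {
    public static int countWords(List<String> lines) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // One task per line, executed in parallel.
            List<Callable<Integer>> tasks = lines.stream()
                .map(line -> (Callable<Integer>) () ->
                    line.trim().isEmpty() ? 0 : line.trim().split("\\s+").length)
                .toList();
            int total = 0;
            for (Future<Integer> partial : pool.invokeAll(tasks)) {
                total += partial.get(); // combine partial results
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countWords(List.of("hello big data", "hadoop yarn"))); // prints 5
    }
}
```

In a real cluster, YARN's ResourceManager hands out containers on many machines instead of threads in one JVM, but the split-process-combine shape is the same.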
What is the history of Hadoop
As the WWW (World Wide Web) grew in the late 1990s and early 2000s, search engines and indexes were created to help users locate relevant information inside text-based content. In the early years of the web, search results were curated by humans. But as the web grew from dozens to millions of pages, that was no job for a human; it needed some kind of automation. And thus web crawlers were invented, and many university-led projects and search engine start-ups rocketed (Yahoo, Google, etc.).
In 2006 Doug Cutting joined Yahoo and took the Nutch project, as well as its ideas, with him; it was based on Google's early work on automating distributed data storage and processing. The Nutch project was then divided: the web crawler part remained Nutch, while the distributed computing and processing part eventually became Hadoop (fun fact: it was named after Cutting's son's toy elephant).
Finally, in 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop's framework and ecosystem of technologies are maintained and managed by a non-profit global community of software developers and contributors: the Apache Software Foundation (ASF).
The programming language I specialize in is Java, because I think Java is more universal and, of course, because I simply like it.