| From | Sent On | Attachments |
|---|---|---|
| Amandeep Khurana | Feb 15, 2009 5:32 pm |
| Subject: | Re: HDFS architecture based on GFS? | |
|---|---|---|
| From: | Amandeep Khurana (ama...@gmail.com) | |
| Date: | Feb 15, 2009 5:32:23 pm | |
| List: | org.apache.hadoop.core-user | |
This is good information! Thanks a ton. I'll take all this into account.
Amandeep
Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
On Sun, Feb 15, 2009 at 4:47 PM, Matei Zaharia <mat...@cloudera.com> wrote:
Typically the data flow is like this:1) Client submits a job description to the JobTracker. 2) JobTracker figures out block locations for the input file(s) by talking to HDFS NameNode. 3) JobTracker creates a job description file in HDFS which will be read by the nodes to copy over the job's code etc. 4) JobTracker starts map tasks on the slaves (TaskTrackers) with the appropriate data blocks. 5) After running, maps create intermediate output files on those slaves. These are not in HDFS, they're in some temporary storage used by MapReduce. 6) JobTracker starts reduces on a series of slaves, which copy over the appropriate map outputs, apply the reduce function, and write the outputs to HDFS (one output file per reducer). 7) Some logs for the job may also be put into HDFS by the JobTracker.
However, there is a big caveat, which is that the map and reduce tasks run arbitrary code. It is not unusual to have a map that opens a second HDFS file to read some information (e.g. for doing a join of a small table against a big file). If you use Hadoop Streaming or Pipes to write a job in Python, Ruby, C, etc, then you are launching arbitrary processes which may also access external resources in this manner. Some people also read/write to DBs (e.g. MySQL) from their tasks. A comprehensive security solution would ideally deal with these cases too.
On Sun, Feb 15, 2009 at 3:22 PM, Amandeep Khurana <ama...@gmail.com> wrote:
A quick question here. How does a typical hadoop job work at the system level? What are the various interactions and how does the data flow?
Amandeep
Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
On Sun, Feb 15, 2009 at 3:20 PM, Amandeep Khurana <ama...@gmail.com> wrote:
Thanks Matei. If the basic architecture is similar to the Google stuff, I can safely just work on the project using the information from the papers.
I am aware of the 4487 jira and the current status of the permissions mechanism. I had a look at them earlier.
Cheers Amandeep
Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
On Sun, Feb 15, 2009 at 2:40 PM, Matei Zaharia <mat...@cloudera.com> wrote:
Forgot to add, this JIRA details the latest security features that are being worked on in Hadoop trunk: https://issues.apache.org/jira/browse/HADOOP-4487. This document describes the current status and limitations of the permissions mechanism:
http://hadoop.apache.org/core/docs/current/hdfs_permissions_guide.html.
On Sun, Feb 15, 2009 at 2:35 PM, Matei Zaharia <mat...@cloudera.com> wrote:
I think it's safe to assume that Hadoop works like MapReduce/GFS at the level described in those papers. In particular, in HDFS, there is a master node containing metadata and a number of slave nodes (datanodes) containing blocks, as in GFS. Clients start by talking to the master to list directories, etc. When they want to read a region of some file, they tell the master the filename and offset, and they receive a list of block locations (datanodes). They then contact the individual datanodes to read the blocks. When clients write a file, they first obtain a new block
ID
and
list of nodes to write it to from the master, then contact the
datanodes
to
write it (actually, the datanodes pipeline the write as in GFS) and report when the write is complete. HDFS actually has some security
mechanisms
built
in, authenticating users based on their Unix ID and providing
Unix-like
file
permissions. I don't know much about how these are implemented, but they would be a good place to start looking.
On Sun, Feb 15, 2009 at 1:36 PM, Amandeep Khurana <ama...@gmail.com wrote:
Thanks Matie
I had gone through the architecture document online. I am currently working on a project towards Security in Hadoop. I do know how the data moves around in the GFS but wasnt sure how much of that does HDFS follow and how different it is from GFS. Can you throw some light on that?
Security would also involve the Map Reduce jobs following the same protocols. Thats why the question about how does the Hadoop framework integrate with the HDFS, and how different is it from Map Reduce
and
GFS.
The GFS and Map Reduce papers give a good information on how those systems are designed but there is nothing that concrete for Hadoop that I have been able to find.
Amandeep
Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz
On Sun, Feb 15, 2009 at 12:07 PM, Matei Zaharia < mat...@cloudera.com> wrote:
Hi Amandeep, Hadoop is definitely inspired by MapReduce/GFS and aims to
provide
those
capabilities as an open-source project. HDFS is similar to GFS (large blocks, replication, etc); some notable things missing are read-write support in the middle of a file (unlikely to be provided because
few
Hadoop
applications require it) and multiple appenders (the record append operation). You can read about HDFS architecture at http://hadoop.apache.org/core/docs/current/hdfs_design.html. The MapReduce part of Hadoop interacts with HDFS in the same way that Google's MapReduce interacts with GFS (shipping computation to the data), although Hadoop MapReduce also supports running over other distributed filesystems.
Matei
On Sun, Feb 15, 2009 at 11:57 AM, Amandeep Khurana <
ama...@gmail.com
wrote:
Hi
Is the HDFS architecture completely based on the Google
Filesystem?
If
it
isnt, what are the differences between the two?
Secondly, is the coupling between Hadoop and HDFS same as how
it
is
between
the Google's version of Map Reduce and GFS?
Amandeep
Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz





