RE: Strange Hadoop behavior - Different results on equivalent input (5 messages in org.apache.hadoop.core-user)

From            Sent On
Luca Telloli    Sep 14, 2007 7:16 am
Devaraj Das     Sep 16, 2007 4:48 pm
Luca Telloli    Sep 17, 2007 9:34 am
Ted Dunning     Sep 17, 2007 11:08 am
Luca Telloli    Sep 17, 2007 11:21 am
Subject: RE: Strange Hadoop behavior - Different results on equivalent input
From: Devaraj Das (dd@yahoo-inc.com)
Date: Sep 16, 2007 4:48:14 pm
List: org.apache.hadoop.core-user

Hi Luca,

You really raised my curiosity, so I went and tried it myself. I had a bunch of files adding up to 591 MB in one directory, and an equivalent single file in a different directory in HDFS. I ran two MR jobs with #reducers = 2, and the outputs were exactly the same. The split sizes will not affect the outcome in the wordcount case.

The #maps is a function of the HDFS block size, the #maps the user specified, and the length/number of input files. The RecordReader, org.apache.hadoop.mapred.LineRecordReader, has logic for handling cases where files could be split anywhere (newlines could straddle an HDFS block boundary). If you look at org.apache.hadoop.mapred.FileInputFormat.getSplits, you can see how all this information is used.

Hadoop doesn't honor mapred.map.tasks beyond considering it a hint, but it accepts the user-specified mapred.reduce.tasks and doesn't manipulate that. So you cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.

Thanks, Devaraj.
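P.S. Roughly, that split sizing goes like the sketch below. This is my own simplified paraphrase of the getSplits arithmetic, not the actual source (the class name and the 64 MB block size are just assumptions for illustration), but it shows why mapred.map.tasks only nudges the split size instead of fixing the number of maps:

// A simplified, self-contained paraphrase of the split-size arithmetic in
// org.apache.hadoop.mapred.FileInputFormat.getSplits (illustration only).
public class SplitSizeSketch {

    // numSplitsHint corresponds to mapred.map.tasks -- it is only a hint.
    static long computeSplitSize(long totalSize, int numSplitsHint,
                                 long minSplitSize, long blockSize) {
        long goalSize = totalSize / Math.max(numSplitsHint, 1);
        // The hint can shrink splits toward goalSize, but a split is never
        // smaller than minSplitSize and never larger than one HDFS block.
        return Math.max(minSplitSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // e.g. a 236 MB input, mapred.map.tasks = 10, assumed 64 MB block size
        long splitSize = computeSplitSize(236 * mb, 10, 1, 64 * mb);
        System.out.println("split size ~ " + splitSize / mb + " MB");
        // The number of maps is roughly totalSize / splitSize, rounded up per
        // input file, which is why your logs show 10-11 maps rather than
        // exactly the 10 you asked for.
        System.out.println("approx maps = " + (236 * mb + splitSize - 1) / splitSize);
    }
}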

-----Original Message-----
From: Luca Telloli [mailto:luca@yahoo.it]
Sent: Friday, September 14, 2007 7:17 AM
To: hado@lucene.apache.org
Subject: Strange Hadoop behavior - Different results on equivalent input

Hello everyone, I'm new to Hadoop and to this mailing list so: Hello. =)

I'm experiencing a problem that I can't understand. I'm running the wordcount task (from the examples in the source) on a single Hadoop node configured as a pseudo-distributed environment. My input is a set of documents I scraped from /usr/share/doc.

I have two inputs:
- the first is a set of three files of 189, 45 and 1.9 MB, named input-compact
- the second is the same data concatenated into a single 236 MB file with cat, named input-single
so I'm talking about "equivalent" input.

The logs report 11 map tasks for one job and 10 for the other, both with a total of 2 reduce tasks. I expected the outcome to be the same, but it's not, as you can see from the tail of my outputs:

$ tail /tmp/output-*
==> /tmp/output-compact <==
yet.</td>   164
you         23719
You         4603
your        7097
Zend        111
zero,       101
zero        1637
zero-based  114
zval        140
zval*       191

==> /tmp/output-single <==
Y           289
(Yasuhiro   105
yet.</td>   164
you         23719
You         4622
your        7121
zero,       101
zero        1646
zero-based  114
zval*       191

- Does the way Hadoop splits its input into blocks on HDFS influence the outcome of the computation?

- Even so, how can the results be so different? I mean, the word zval, which has 140 occurrences in the first run, doesn't even appear in the second one!

- Third question: I've noticed that, when files are small, Hadoop tends to create as many maps as there are input files. My initial input was scattered across 13k small files and was not well suited to the task, as I realized quite soon when almost 13k maps ran the same job. At that point I specified a few parameters in my configuration file, like mapred.map.tasks = 10 and mapred.reduce.tasks = 2. I wonder how Hadoop decides on the number of maps; the help says that mapred.map.tasks is a _value per job_, but I wonder whether it is instead some function of <#tasks, #input files> or other parameters.
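For reference, here is roughly what those two settings look like when set programmatically, assuming the org.apache.hadoop.mapred.JobConf API; this is just a sketch of what I put in the configuration file, so I may well be misusing it:

import org.apache.hadoop.mapred.JobConf;

// Sketch: the same two settings from my configuration file, set
// programmatically on a JobConf instead.
public class TaskCountSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setJobName("wordcount");

        conf.setNumMapTasks(10);    // mapred.map.tasks = 10
        conf.setNumReduceTasks(2);  // mapred.reduce.tasks = 2

        // Printed back just to check what the configuration actually records.
        System.out.println("mapred.map.tasks    = " + conf.getNumMapTasks());
        System.out.println("mapred.reduce.tasks = " + conf.getNumReduceTasks());
    }
}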

- Finally, is there a way to completely force these parameters (the numbers of maps and reduces)?

Apologies if any of these questions sound dumb; I'm really new to the software and eager to learn more.

Thanks, Luca