java - Why do Hadoop jobs need so many threads?


My understanding of Hadoop is that parallelism on each worker node is achieved by starting separate JVMs, one per core.

I see that each JVM has dozens of threads, which leads to thousands of threads per node. I can't think of any reason to spawn that many threads. What's going on?

For example, here's a simple Pig script that parses and filters some JSON:

  /* get tweets with GPS */
  REGISTER $JAR;
  json_eb = LOAD '$IN_DIRS' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json: map[]);
  -- parse the json with twitter's library
  parsed0 = FOREACH json_eb GENERATE
      STRSPLIT(json#'id', ':').$2 AS tweetId:chararray,
      STRSPLIT(json#'actor'#'id', ':').$2 AS userId:chararray,
      json#'postedTime' AS postedTime:chararray,
      json#'geo'#'coordinates' AS gps:chararray;
  parsed1 = FILTER parsed0 BY (gps IS NOT NULL);
  STORE parsed1 INTO '$OUT_DIR' USING PigStorage();

When I run this script, mapred starts 33 Java processes on my node (I have 32 cores):

  rfcompton@node19 ~> ps -u mapred | grep -v PID | wc -l
  33

In top, they look like this:

    PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
    484 mapred   39  16 1576m 362m  18m S 130.8  0.3   0:09.48 java
  32427 mapred   34  16 1664m 369m  18m S 122.2  0.3   0:08.67 java
  32694 mapred   36  16 1502m 239m  18m S 115.6  0.2   0:07.94 java
  32218 mapred   33  16 1641m 401m  18m S 114.6  0.3   0:10.29 java
  ...

The JVMs have approximately 40 threads each:

  rfcompton@node19 ~> cat /proc/484/status | grep Threads
  Threads: 43

All together, that's over a thousand mapred threads on the 32-core node:

  rfcompton@node19 ~> ps -u mapred | grep -v PID | awk '{system("cat /proc/" $1 "/status")}' | grep Threads | awk '{sum += $2} END {print sum}'
  1655
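If you want to see what those threads actually are, `jstack <pid>` works from outside the JVM; alternatively, a small diagnostic like this sketch (not from the original post) can be dropped into a task to list the live threads in its JVM:

  import java.util.Map;

  // Diagnostic sketch: dump the name, daemon flag, and state of every
  // live thread in the current JVM. Dropped into a task, it shows what
  // the ~40 threads per child JVM are doing.
  public class ThreadDump {
      public static void main(String[] args) {
          Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
          System.err.println("live threads: " + all.size());
          for (Thread t : all.keySet()) {
              System.err.printf("%-45s daemon=%-5s state=%s%n",
                      t.getName(), t.isDaemon(), t.getState());
          }
      }
  }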

EDIT: From Paul's answer, and after reading the relevant section of "Hadoop: The Definitive Guide", it seems that ~40 threads per JVM is what I should expect. They are there to serve map output to the reducers over HTTP during the shuffle:

"The output file's partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need to be increased for large clusters running large jobs."
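Since the setting is per tasktracker, raising it happens in the tasktracker's mapred-site.xml rather than in the job. A quick way to check the effective value, using the stock Configuration API (my sketch; the config path is an assumption about your install):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;

  // Sketch: print the number of threads a tasktracker will use to serve
  // map output over HTTP. The path below is an assumed install location.
  public class HttpThreadsCheck {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
          int threads = conf.getInt("tasktracker.http.threads", 40); // 40 = documented default
          System.out.println("tasktracker.http.threads = " + threads);
      }
  }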

Answer:

All the concurrent implementations I've seen make liberal use of threads. In effect, the threads overlap the housekeeping done in the management processes with the map and reduce tasks themselves.
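To make that concrete, here's a toy version of the pattern (my sketch, not Hadoop code): a small "copier" pool fetches inputs in the background while one thread consumes them, so network waits overlap with CPU work.

  import java.util.concurrent.*;

  // Toy illustration of the overlap: five background "copier" threads
  // fetch partitions while the main thread merges whatever has arrived.
  public class OverlapSketch {
      public static void main(String[] args) throws Exception {
          ExecutorService copiers = Executors.newFixedThreadPool(5);
          BlockingQueue<String> fetched = new LinkedBlockingQueue<>();

          for (int i = 0; i < 20; i++) {
              final int id = i;
              // stand-in for an HTTP fetch of one map output partition
              copiers.submit(() -> fetched.add("map-output-" + id));
          }
          for (int i = 0; i < 20; i++) {
              // stand-in for merge/reduce work; runs while fetches are in flight
              System.out.println("merging " + fetched.take());
          }
          copiers.shutdown();
      }
  }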

Looking through "Hadoop: The Definitive Guide", the authors mention several processes that are multithreaded. These include:

  1. The reducer has a small pool of "copier" threads to fetch map outputs in parallel.
  2. Mappers can themselves be multithreaded (MultithreadedMapper); see the sketch after this list.
  3. Datanodes have threads for copying data into and out of HDFS.
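For item 2, here's a minimal sketch of wiring up the new-API MultithreadedMapper class (the job name, thread count, and the myMapper parameter are illustrative, not from the original post):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

  // Sketch: run a (thread-safe) mapper on 10 threads inside one task JVM,
  // instead of one map thread per JVM.
  public class MultithreadedJobSetup {
      public static Job configure(Configuration conf, Class<? extends Mapper> myMapper)
              throws Exception {
          Job job = new Job(conf, "multithreaded example"); // old-API-era constructor
          job.setMapperClass(MultithreadedMapper.class);       // the wrapper runs as "the" mapper
          MultithreadedMapper.setMapperClass(job, myMapper);   // the actual map logic
          MultithreadedMapper.setNumberOfThreads(job, 10);     // threads per map task
          return job;
      }
  }

Note that the wrapped map logic has to be thread-safe for this to be correct.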

Depending on how your cluster is configured, you can have DataNodes and TaskTrackers on the same machines, hence that many threads.

I suspect there are significant performance benefits to the heavy use of concurrency, and that is why the implementers went that route.

