Tuesday, May 14, 2013

Input Splits - Anatomy

Input split internal concepts
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Input splits are represented by the Java abstract class InputSplit:

// org.apache.hadoop.mapreduce.InputSplit
import java.io.IOException;

public abstract class InputSplit {
  /**
   * Get the size of the split, so that the input splits can be sorted by size.
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   * @return a new array of the node names.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract String[] getLocations() throws IOException, InterruptedException;
}


Thought: Does an input split contain the input data?
         
Answer: Not exactly. A split doesn’t contain the input data; it is just a reference to the data. The storage locations are used by the MapReduce system to place map tasks as close to the split’s data as possible, and the size is used to order the splits so that the largest are processed first, in an attempt to minimize the job runtime.
               
1. getLength() – returns the length of the split in bytes, which lets the framework sort splits by size and process the larger ones first.

2. getLocations() – returns an array of node names where the split’s data is local, which the framework uses to schedule each map task on (or near) one of those nodes; see the FileSplit sketch just below.
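
To make the “reference, not data” point concrete, here is a minimal sketch using FileSplit, the concrete InputSplit implementation that FileInputFormat produces for file-based input. The file path and node names are hypothetical, and the new (org.apache.hadoop.mapreduce) API is assumed:

import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileSplitDemo {
  public static void main(String[] args) throws Exception {
    // A FileSplit is only a (path, offset, length, hosts) reference;
    // the split's bytes themselves stay in HDFS.
    FileSplit split = new FileSplit(
        new Path("/data/input/logs.txt"),    // hypothetical input file
        0L,                                  // byte offset where the split begins
        64L * 1024 * 1024,                   // length of the split in bytes
        new String[] { "node1", "node2" });  // nodes where those bytes are local

    System.out.println(split.getLength());                      // 67108864
    System.out.println(Arrays.toString(split.getLocations()));  // [node1, node2]
  }
}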

Thought: Can we control the size of an input split?

Answer: Yes, the split size can be controlled with the parameters listed below.

mapred.min.split.size = minimum split size, in bytes
mapred.max.split.size = maximum split size, in bytes
dfs.blocksize         = HDFS block size, in bytes

Important: By default, the split size is the same as the HDFS block size. This is usually the optimal configuration: each map task reads exactly one block, so no map has to pull part of its input over the network.
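
As a sketch of how these bounds can be set from a driver (again assuming the new API; the 32 MB / 128 MB values are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-size-demo");

    // Illustrative bounds: keep every split between 32 MB and 128 MB.
    // These setters write the min/max split size properties listed above.
    FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // FileInputFormat combines the three parameters as:
    //   splitSize = max(minSize, min(maxSize, blockSize))
    // With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) this
    // reduces to the HDFS block size, as noted above.
  }
}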

Anatomy of Splits

Assumptions: the input split size equals the block size (64 MB), and the default record reader is used, with ‘\n’ as the record separator.

Scenario 1: the file’s first line is larger than 64 MB (say, 100 MB).

How it works: if the split’s offset is zero (the start of the file), the map blindly reads from the very beginning of split 1. Otherwise, it scans for the next ‘\n’ and starts reading from the position immediately after it.

Outcome: Map 1 processes the entire 100 MB first line, reading past the end of split 1 into split 2; Map 2 skips to the first ‘\n’ in its split and processes everything after it.
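
This start-position rule can be sketched as a small stand-alone function (this is not Hadoop’s actual LineRecordReader, which applies the same logic to a stream, but the decision is identical):

public class RecordStart {
  // Where should a mapper begin reading inside its split?
  // Offset zero: read from the very beginning of the file.
  // Any other offset: skip to the first '\n' and start just after it.
  static int recordStart(byte[] file, int splitStart) {
    if (splitStart == 0) {
      return 0;
    }
    int pos = splitStart;
    while (pos < file.length && file[pos] != '\n') {
      pos++;                 // scan forward for the next newline
    }
    return pos + 1;          // first byte after the '\n'
  }

  public static void main(String[] args) {
    byte[] file = "aaaa\nbbbb\ncccc\n".getBytes();
    System.out.println(recordStart(file, 0)); // 0  -> reads from the start
    System.out.println(recordStart(file, 6)); // 10 -> skips the rest of "bbbb\n"
  }
}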

Scenario 2: the file’s first line is smaller than 64 MB (say, 50 MB).

How it works: the same rule applies as in Scenario 1.

Outcome: Map 1 reads from the start of the file and continues through the end of whatever line straddles the 64 MB boundary; Map 2 skips to the first ‘\n’ in its split and processes everything after it.

Scenario 3 (interesting and confusing): the file’s first line is exactly 64 MB.

How it works: again, the same rule applies.

Outcome: Map 1 processes the first line (which exactly fills split 1) and also the second line; Map 2 starts reading only after the second ‘\n’.

Important: if Map 1 processed only split 1, Map 2 would never fetch the records between the first ‘\n’ and the second ‘\n’ (i.e., the second line).
Reason: Map 2 looks for the first ‘\n’ in its split and starts from the position immediately after it, so Map 1 must read one line past its split boundary. The simulation below demonstrates this.
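
Here is a minimal stand-alone simulation of this boundary rule (no Hadoop dependency; three toy 6-byte lines stand in for the 64 MB lines of Scenario 3, and a 6-byte split stands in for the 64 MB block). It shows Map 1 reading one record past its split end and Map 2 skipping forward, so no record is lost or read twice:

import java.util.ArrayList;
import java.util.List;

public class SplitBoundaryDemo {
  public static void main(String[] args) {
    byte[] file = "line1\nline2\nline3\n".getBytes();
    int splitSize = 6;       // "line1\n" is exactly 6 bytes, like the 64 MB first line
    int mapNo = 1;
    for (int start = 0; start < file.length; start += splitSize) {
      System.out.println("Map " + mapNo++ + " -> "
          + readRecords(file, start, start + splitSize));
    }
    // Prints:
    //   Map 1 -> [line1, line2]   (reads one record past its split end)
    //   Map 2 -> [line3]          (skipped everything up to the first '\n')
    //   Map 3 -> []
  }

  static List<String> readRecords(byte[] file, int start, int end) {
    List<String> records = new ArrayList<>();
    int pos = start;
    if (start != 0) {
      // Not the first split: discard everything up to and including the
      // first '\n' -- the previous mapper owns that record.
      while (pos < file.length && file[pos++] != '\n') { }
    }
    // Keep reading whole lines while the record starts at or before the
    // split end (mirroring LineRecordReader's "pos <= end" check).
    while (pos <= end && pos < file.length) {
      int lineStart = pos;
      while (pos < file.length && file[pos++] != '\n') { }
      records.add(new String(file, lineStart, pos - lineStart).trim());
    }
    return records;
  }
}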
