Input split internal concepts
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Input splits are represented by the Java abstract class InputSplit, in the org.apache.hadoop.mapreduce package:
public abstract class InputSplit {

  /**
   * Get the size of the split, so that the input splits can be sorted by size.
   * @return the number of bytes in the split
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract long getLength() throws IOException, InterruptedException;

  /**
   * Get the list of nodes by name where the data for the split would be local.
   * The locations do not need to be serialized.
   * @return a new array of the node nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
Thought: Does an input split contain the input data?
Answer: Not exactly.
A split doesn't contain the input data; it is just a reference to the data. The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime.
1. getLength() – returns the length of the input split in bytes. The framework uses it to sort the splits by size so that the largest splits can be processed first.
2. getLocations() – returns an array of storage locations (node names). The framework uses these to schedule each map task on, or close to, a node that holds the split's data. A minimal sketch of a custom split implementing both methods follows this list.
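To make these two methods concrete, here is a minimal sketch of a custom split, assuming we want to describe one whole file stored on a known set of nodes. The class name SingleFileSplit and its fields are illustrative, not part of Hadoop; a real concrete split (such as Hadoop's own FileSplit) also needs to be serializable (for example by implementing Writable) so it can be shipped to the task that processes it.

import java.io.IOException;
import org.apache.hadoop.mapreduce.InputSplit;

// Illustrative only: a split that refers to one whole file stored on known hosts.
public class SingleFileSplit extends InputSplit {

  private final long lengthInBytes;   // size of the data this split refers to
  private final String[] hosts;       // nodes where the data is stored locally

  public SingleFileSplit(long lengthInBytes, String[] hosts) {
    this.lengthInBytes = lengthInBytes;
    this.hosts = hosts;
  }

  // Used by the framework to sort splits so the largest are scheduled first.
  @Override
  public long getLength() throws IOException, InterruptedException {
    return lengthInBytes;
  }

  // Used by the scheduler to place the map task close to the data.
  @Override
  public String[] getLocations() throws IOException, InterruptedException {
    return hosts.clone();
  }
}

Hadoop's built-in FileSplit works along these lines, but additionally carries the file path and the starting offset of the split within the file.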
Thought: Can we set a boundary for the input split size?
Answer: Yes, the split size can be bounded using the parameters listed below.
mapred.min.split.size = minimum split size
mapred.max.split.size = maximum split size
dfs.blocksize         = HDFS block size
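These bounds can also be set programmatically from the job driver. Below is a minimal sketch, assuming the new MapReduce API; note that in Hadoop 2 and later the canonical property names are mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize, with the older mapred.* names kept as deprecated aliases. The class name SplitSizeDriver and the 64 MB / 128 MB values are just examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split size demo");

    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Lower and upper bounds on the split size, in bytes.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB

    // ... set mapper, reducer, output path, etc., then submit the job.
  }
}

Shrinking the maximum below the block size forces more, smaller splits (and therefore more map tasks); raising the minimum above the block size combines data from several blocks into one split, at the cost of data locality.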
Important: By default, the split size is the same as the HDFS block size. This is usually the most efficient setting for a MapReduce program, because one block is the largest amount of input that is guaranteed to be stored entirely on a single node, so each map task can read its whole split locally.
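The reason the default collapses to the block size is the way FileInputFormat combines the three values: split size = max(minSize, min(maxSize, blockSize)). A one-line sketch of that computation (modeled on FileInputFormat's computeSplitSize; with the defaults, effectively a minimum of 1 byte and a maximum of Long.MAX_VALUE, the result is simply the block size):

// Sketch of how FileInputFormat derives the split size from the three settings.
// With default min/max values this collapses to blockSize.
static long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}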
Anatomy of Splits
Assumptions: the input split size equals the block size (64 MB here), and the record reader is the default line-based reader, which uses "\n" as the record separator.
Scenario 1: The first line of the file is larger than 64 MB (say 100 MB), so it runs past the end of the first block.
Concept of working: If the split's offset is zero (the start of the file), the map reads blindly from the very beginning of file split 1. Otherwise, the map scans ahead to the next "\n" and starts reading from the position immediately after it.
In the figure for this scenario, the red-colored part is processed by Map 1 and the violet-colored part by Map 2: Map 1 reads the whole 100 MB first line, continuing past its split boundary, while Map 2 skips everything up to the first "\n" in its split and starts with the record after it.
Scenario 2: The first line of the file is smaller than 64 MB (say 50 MB).
Concept of working: The same rule applies. Map 1 starts at offset zero and reads from the very beginning of its split; Map 2 scans ahead to the next "\n" and starts reading from the position immediately after it.
In the figure for this scenario, the red-colored part is processed by Map 1 and the violet-colored part by Map 2: the line that straddles the 64 MB boundary is finished by Map 1, and Map 2 begins with the first complete record inside its split.
Scenario 3 (interesting and confusing): The first line of the file is exactly 64 MB, so it ends right at the split boundary.
Concept of working: The rule does not change. Map 1 starts at offset zero and reads from the very beginning of its split; Map 2 scans ahead to the next "\n" and starts reading from the position immediately after it.
In the figure for this scenario, the red-colored part is processed by Map 1 and the violet-colored part by Map 2.
Important: If Map 1 processed only split 1 and stopped exactly at the boundary, the record between the first "\n" and the second "\n" (the second line) would never be read by anyone.
Reason: Map 2 looks for the first "\n" inside its split and starts from the very next position after it, so it discards that record; it must therefore be read by Map 1, even though it lies in split 2.
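The behaviour described in all three scenarios comes from the default line record reader. The following is a simplified, self-contained sketch of that logic, not Hadoop's actual LineRecordReader source; the class name, the readLine helper, and the use of RandomAccessFile are assumptions made to keep the example runnable. It shows the two rules: a map whose split does not start at offset zero skips its first (possibly partial) line, and a map keeps reading while the current record starts inside its split, so the last record may run past the split boundary.

import java.io.IOException;
import java.io.RandomAccessFile;

// Simplified sketch of how a line-based record reader decides which lines belong to a split.
public class LineSplitReaderSketch {

  // Reads one line (up to and including '\n') into 'line' and returns the bytes
  // consumed; returns 0 at end of file.
  static long readLine(RandomAccessFile in, StringBuilder line) throws IOException {
    line.setLength(0);
    long bytes = 0;
    int b;
    while ((b = in.read()) != -1) {
      bytes++;
      if (b == '\n') {
        break;
      }
      line.append((char) b);
    }
    return bytes;
  }

  // Processes the records of one split, defined by [start, start + length).
  static void processSplit(String file, long start, long length) throws IOException {
    long end = start + length;
    try (RandomAccessFile in = new RandomAccessFile(file, "r")) {
      long pos = start;
      StringBuilder line = new StringBuilder();

      in.seek(start);
      if (start != 0) {
        // Rule 1: not the first split, so skip the (possibly partial) first line;
        // the previous map is responsible for that record.
        pos += readLine(in, line);
      }

      // Rule 2: keep reading while the current record STARTS inside the split
      // (pos <= end); the last record may extend past 'end', which is how a map
      // reads beyond its split boundary to finish a line.
      while (pos <= end) {
        long bytes = readLine(in, line);
        if (bytes == 0) {
          break;   // end of file
        }
        pos += bytes;
        System.out.println("record: " + line);
      }
    }
  }
}

With these two rules combined, every record is read exactly once: the map that owns the split in which a record starts is the one that emits it.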