Mappers

In most cases, setting both mapreduce.input.fileinputformat.split.maxsize and mapreduce.input.fileinputformat.split.minsize to the same value controls the number of Mappers used when Hive runs a particular query.
Example:
For a text file with a size of 200000 bytes:
- The following configuration triggers two Mappers for the MapReduce job:
set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;
- The following configuration triggers four Mappers for the MapReduce job:
set mapreduce.input.fileinputformat.split.maxsize=50000;
set mapreduce.input.fileinputformat.split.minsize=50000;
By default, Hive assigns several small files, whose file sizes are smaller than mapreduce.input.fileinputformat.split.minsize, to a single Mapper in order to limit the number of Mappers initialized. Hive also considers the data locality of each file's HDFS blocks. If many small files are stored across different HDFS DataNodes, Hive will not combine them into a single Mapper, because they are not stored on the same machine.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; (Default)
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; (Does not combine files)
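As a combined sketch, the following session-level settings aim for roughly two Mappers over the 200000-byte text file above; sample_table is a placeholder for a table backed by that file, and the actual Mapper count can still vary with HDFS block layout and data locality:

-- Keep the default input format, which combines small files into splits
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- With a 200000-byte input, a 100000-byte split size yields roughly two Mappers
set mapreduce.input.fileinputformat.split.maxsize=100000;
set mapreduce.input.fileinputformat.split.minsize=100000;
select count(*) from sample_table;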
Reducers

Determining the number of Reducers often requires more input from the developer than does determining the number of Mappers. Efficient allocation of Reducers depends greatly on the Hive query and the input data set. The size of a query's result set against a large input data set depends greatly on the predicates in the query. If a query heavily filters the input data set with its predicates, few Reducers may be required for processing. Alternatively, if a different query against the same large input data set has fewer filtering predicates, many Reducers may be required.

Hive employs a simple algorithm for determining the number of Reducers to use based on the total size of the input data set. The number of Reducers can also be set explicitly by the developer.

-- In order to change the average load for a reducer (in bytes)
-- CDH uses a default of 67108864 (64 MB)
-- That is, if the total size of the data set is 1 GB, then 16 reducers will be used
set hive.exec.reducers.bytes.per.reducer=
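As a sketch of both approaches (the values are illustrative; depending on the Hive and Hadoop versions, the older property name mapred.reduce.tasks may be needed instead of mapreduce.job.reduces):

-- Let Hive size the reduce phase from the input volume
-- (with the 64 MB default, a 1 GB input yields roughly 16 Reducers)
set hive.exec.reducers.bytes.per.reducer=67108864;
-- Or fix the Reducer count explicitly for subsequent queries
set mapreduce.job.reduces=16;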
It is sometimes the case that certain queries do not use Reducers. INSERT statements, for example, may generate Mapper-only MapReduce jobs. If Reducers are required for performance reasons, or the developer would like to control the number of output files generated, there are workarounds available.
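One common workaround, sketched here with placeholder names (target_table, source_table, key_col), is to add a DISTRIBUTE BY clause to the INSERT; this forces a reduce phase, and the Reducer count then controls how many output files are written:

-- Force a reduce phase on an otherwise Map-only INSERT
set mapreduce.job.reduces=4;
insert overwrite table target_table
select * from source_table
distribute by key_col;
-- Rows are shuffled across 4 Reducers, so roughly 4 output files are produced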