Date Posted:
Product: TIBCO Spotfire®
Problem:
Solving "Small file problem"
Question: How do I specify the number of mappers?
I have hundreds or even thousands of small files generated by MapReduce jobs. I'd like to use these files for further analysis, but there are too many of them. Is there any way to consume all of these small files as if they were a few larger files?
By default, MapReduce spawns as many mappers as there are input splits, and each small file becomes its own split. We'd like to use fewer mappers. Please help!
Solution:
Solution for Pig Operators
- Choose a dataset that contains many "part-" files. My dataset has 112 part- files.
- Calculate the total size of your dataset using this command:
hadoop fs -du -h /path/to/folder
In the following example, my folder uses 1.8 MB, or approximately 1887436.8 bytes:
$ hadoop fs -du -h /path/to/folder
1.8 M  5.4 M  /path/to/folder
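If you prefer the exact byte count for the calculation below (and a quick count of how many part- files you have), the following commands work as well; the paths are placeholders for your own dataset:
$ hadoop fs -du -s /path/to/folder
$ hadoop fs -ls /path/to/folder | grep -c 'part-'
The first prints the total size in bytes (omitting -h); the second counts the part- files in the folder.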
- Calculate the value for this parameter:
pig.maxCombinedSplitSize = total file size in bytes / (desired number of mappers)
For example, to target roughly 38 mappers:
pig.maxCombinedSplitSize = 1887436.8 / 38 ≈ 49669 bytes, which we round down to 49152.
(With a split size of 49152 bytes, 1887436.8 / 49152 ≈ 38.4, so Pig creates 39 combined splits.)
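If you'd rather not do the division by hand, a quick shell one-liner can compute the value; this is just a sketch, assuming the same placeholder path and a desired mapper count of 38:
$ hadoop fs -du -s /path/to/folder | awk -v mappers=38 '{printf "%d\n", $1 / mappers}'
Round the result down to a convenient value (49152 in this example) and use it in the next step.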
- In the data source connection, specify this parameter:
pig.maxCombinedSplitSize = 49152
- Drag the dataset to the canvas.
- Connect it to the Column Filter operator, which is a Pig operator.
- Check the ResourceManager (RM) UI for the Pig job (Column Filter) to determine the number of mappers being used. In this case: 39.
Solution for MapReduce operators
- For MapReduce operators, such as Alpine Forest, you can pass the dataset through a Column Filter as above, selecting "all" columns.
- Notice the MapReduce operator (Alpine Forest) uses 39 mappers. 38 part-* files and one metadata file will be generated.
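To confirm the output file count after the job finishes, you can list the job's output folder; the path below is a placeholder for wherever the operator writes its results:
$ hadoop fs -ls /path/to/output | grep -c 'part-'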
The following operators run as Pig jobs and are therefore affected by pig.maxCombinedSplitSize.
Explore Operators
Bar Chart
Box Plot
Frequency
Histogram
Scatter Plot Matrix
Transform Operators
Aggregation
Column Filter
Join*
Null Value Replacement
Row Filter
Variable
Tools Operators
Pig Execute
(*) Also available with MR implementation.