Test1
This is the page of Test1.
The goal of this test is to see how stable the elapsed times are for a job that has only mappers. We also try to measure the read speed of the mappers when they produce no output, and to say something about what an estimate of the elapsed time depends on.
##Setup
- Input data: 8 GB of data in 8 text files, 1 GB each. The block size is 256 MB. Every row is 16 bytes long, which gives 16777216 rows (records) per 256 MB block. TextInputFormat is used for the mappers.
- Mapper: the mappers don't do anything. No output is produced.
- Reducers: there are no reducers.
- Test and runs: 32 map tasks ran in each mrrun, which is fewer than the available task slots (21*4 = 84). 10 mrruns (run1...run10) were used in the mrtest (test1).
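The setup numbers can be cross-checked with a little arithmetic. The sketch below assumes binary units (1 GB = 2**30 bytes), one map task per block, and that the 16 bytes per row include the line terminator; none of these is stated explicitly on this page. Under those assumptions the 16777216 figure matches the number of rows in one 256 MB block rather than in a whole 1 GB file.

```python
# Cross-check of the setup figures. Binary units and "one map task per
# block" are assumptions made for this sketch.
MB = 2**20
GB = 2**30

num_files = 8
file_size = 1 * GB
block_size = 256 * MB
row_size = 16          # bytes per row, assumed to include the newline

splits_per_file = file_size // block_size   # blocks in one file
map_tasks = num_files * splits_per_file     # one map task per block
rows_per_block = block_size // row_size     # records seen by one map task
rows_per_file = file_size // row_size       # records in one whole file
task_slots = 21 * 4                         # slots as given in the setup

print("map tasks:", map_tasks)
print("rows per block:", rows_per_block)
print("rows per file:", rows_per_file)
print("all tasks fit in one wave:", map_tasks <= task_slots)
```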
##Observations
The picture below shows the elapsed times of the map tasks for all the runs. There are 10 graphs on the picture, one per mrrun, each showing the elapsed times of the map tasks. The horizontal axis is the map task (32 in total) and the vertical axis is the elapsed time in ms. We can see that every mrrun has some peaks, and that the peaks fall on different tasks.
The same effect is visible on the next picture, which shows the total elapsed time of the job for each run.
The following picture shows the average, min and max elapsed times for each mrrun. The black line shows the average +- standard deviation, while the red lines show the min, average and max elapsed times over the 32 tasks of each mrrun. From this picture we can see that for some runs the maximum elapsed time is much higher than for others.
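The per-run summary plotted above could be computed roughly as follows. This is only a sketch: the function name `summarize` and the sample elapsed times are hypothetical, and it uses the population standard deviation, which may differ from what was used for the picture.

```python
from statistics import mean, pstdev

def summarize(elapsed_ms):
    """Return min, average, max and the average +- std band for one mrrun."""
    avg = mean(elapsed_ms)
    std = pstdev(elapsed_ms)  # population std; an assumption of this sketch
    return {
        "min": min(elapsed_ms),
        "avg": avg,
        "max": max(elapsed_ms),
        "std": std,
        "band": (avg - std, avg + std),
    }

# Hypothetical elapsed times in ms (not the measured data of test1).
runs = {
    "run1": [4000, 4200, 4100, 9000],
    "run2": [4000, 4100, 4050, 4150],
}

for name, times in runs.items():
    print(name, summarize(times))
```

A run with one slow task, like the hypothetical `run1`, shows up as a large gap between the average and the max.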
Let's now look at the statistics of the elapsed times by node. The lines show the min, average and max elapsed times for each mrrun, taken over the 32 map tasks. We can see that the elapsed times of the tasks that ran on node06 are much higher.
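Grouping the task times by node is what makes a slow node stand out. A minimal sketch, assuming each task record is a `(node, elapsed_ms)` pair; the function name and the sample values are hypothetical and only the node name node06 comes from this page.

```python
from collections import defaultdict
from statistics import mean

def stats_by_node(tasks):
    """Group (node, elapsed_ms) pairs and return (min, avg, max) per node."""
    by_node = defaultdict(list)
    for node, elapsed_ms in tasks:
        by_node[node].append(elapsed_ms)
    return {node: (min(t), mean(t), max(t)) for node, t in by_node.items()}

# Hypothetical task records (not the measured data of test1).
tasks = [
    ("node01", 4000), ("node01", 4200),
    ("node06", 9000), ("node06", 9500),
]

for node, (lo, avg, hi) in sorted(stats_by_node(tasks).items()):
    print(node, lo, avg, hi)
```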
The following picture shows the min, max and average number of MB processed per second for each mrrun, taken over the 32 map tasks. We can see again that node06 is much slower.
The following picture shows the min, max and average number of records processed per second for each mrrun, taken over the 32 map tasks. We can see again that node06 is much slower.
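Both throughput figures follow from the elapsed time of a task and the amount of data it read. A minimal sketch, assuming each map task reads one full 256 MB block of 16-byte rows and that 1 MB = 2**20 bytes; the 4000 ms elapsed time is a made-up value, not a measurement.

```python
MB = 2**20  # binary megabyte, an assumption of this sketch

def throughput(bytes_read, records_read, elapsed_ms):
    """Return (MB per second, records per second) for one map task."""
    seconds = elapsed_ms / 1000.0
    return bytes_read / MB / seconds, records_read / seconds

block_bytes = 256 * MB
block_records = block_bytes // 16   # 16-byte rows

# Hypothetical task that needed 4000 ms to read its block.
mb_per_s, records_per_s = throughput(block_bytes, block_records, 4000)
print(mb_per_s, records_per_s)
```

Because every record has the same length in this test, the two figures differ only by a constant factor; with variable-length rows they would have to be tracked separately.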
##Conclusions
We can say that the processing speed of the map tasks is stable over the 10 mrruns. We can also see an example of a slow node that slows down the whole job. This is also a good example of a non-homogeneous cluster, where not all the nodes have the same hardware configuration. This test is also good for measuring how fast a node can read data from its input files if TextInputFormat is used. It may be useful to have such statistics for each node and every possible InputFormat. In this example every record had the same length. The elapsed times can depend on the row lengths as well.
So the estimation of elapsed times for such jobs depends on the following:
- Number of input files and size of the files (this is known)
- The InputFormat type (this is known)
- Number of records in each file (this is not known, we can only estimate it from the size of the file)
- Size of records in each file (this can also only be estimated)
- Number of records or bytes that each node processes per unit of time (second) for the given InputFormat (this is also an estimate)
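The dependencies listed above can be combined into a rough estimator. The sketch below is only an illustration under stated assumptions: all map tasks run at the same time in a single wave (as in this test, where there are fewer tasks than slots), each node reads at a fixed per-node rate for the given InputFormat, and the job finishes when its slowest task finishes. The function names and the per-node rates are hypothetical, not measured values.

```python
MB = 2**20  # binary megabyte, an assumption of this sketch

def estimate_records(file_bytes, avg_record_bytes):
    """Estimate the record count of a file from its size."""
    return file_bytes // avg_record_bytes

def estimate_task_seconds(split_bytes, node_mb_per_s):
    """Estimate how long one map task needs to read its split."""
    return split_bytes / (node_mb_per_s * MB)

def estimate_job_seconds(assignments, node_speed):
    """Estimate a map-only job as the time of its slowest task.

    assignments: list of (node, split_bytes) pairs.
    node_speed:  MB per second per node for the InputFormat in use.
    Assumes every task runs concurrently in one wave.
    """
    return max(
        estimate_task_seconds(split_bytes, node_speed[node])
        for node, split_bytes in assignments
    )

# Hypothetical per-node read rates in MB/s.
node_speed = {"node01": 64.0, "node06": 32.0}
assignments = [("node01", 256 * MB), ("node06", 256 * MB)]

print(estimate_records(2**30, 16))
print(estimate_job_seconds(assignments, node_speed))
```

With these made-up rates the slow node alone determines the estimate, which mirrors the effect of node06 seen in the observations.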