Optimal Map tasks for Map Reduce Applications based on Time Complexity ???
Some Analysis
One of the strong points in Hadoop is its ability to efficiently partition the input files using API hooks in the HDFS and calling the Map tasks with each of these partitions. Its useful to understand how to find these task distribution to get the best possible performance achievable. There are few factors to consider when trying to pick the right number of map tasks but the obvious would be to strike a balance between the speedup from the distribution and the overhead of distribution.
Consider the case where time Complexity of the application is O(n) meaning the running time depend entirely on the size of the input, like in the case of WordCount. We can argue that the efficient sizes of the partition that the system could handle would be the deciding factor because the running time is only growing linearly with the input and you have to read the input anyway. So in such cases letting the Hadoop decide on the partition sizes will be the right approach because it will base the partition size based on the optimal block sizes. Why? Following may convince you why.
Lets assume Hadoop distribution overhead is linear with the Map tasks, then Total time would look like the following equation. This may be oversimplified equation because it leaves out the reduction complexity and without any argument assumed the overhead of Hadoop is linear with Map Tasks. But this simplicity would let you have more insight.
Total time = O (Input Size/Map Tasks) + Overhead * Map Tasks
So if the application has linear time complexity O(n) then ;
Total time = (Input Size/Map Tasks)+ Overhead * Map Tasks
Taking the derivative would tell you that Total time is minimum when
Map Tasks = sqrt(Input Size/Overhead)
Since this number is very dependent on the Overhead it make sense to let Hadoop do the partitioning and distribution.
Structured input for Map Reduce
Consider the case where the time complexity of the application is not linear but higher, then the dominant term in the Total time is O (Input Size/Map Tasks). In case the time complexity is O(n^2) =>
Total time = (Input Size/Map Tasks) ^ 2 + Overhead * Map Tasks
Here the first term grows much faster. So minimizing that by increasing the Map tasks is essential. So in such cases you cannot let the Hadoop come up with the number of Map tasks based on the optimal block sizes that it can handle but rather you need to force the number of map tasks upon it.
For example a java ray tracing application which has O(n^2) complexity according to Michael. In this you try to render a scene using ray tracing and this can be easily parallelized because each pixel calculation is independent of the other. When you parallelize it, you will have different segments you want to compute in the input file and you need to force the Map Reduce to calculate those segments in different map tasks. So you need much structured input than that in the word count example because your number of map tasks and your running time depend on it. Letting the partitioning be done to optimize the file sizes in such a compute intensive application would be unwise and in many cases counterproductive. Although it should be noted Hadoop was designed for data intensive parallelism rather than compute intensive parallelism .
Forcing Hadoop to run a map task for each line in input file
Hadoop provides a input formater called NLineInputFormat where each line in the input file would get mapped to a separate map task which allows you to have control over the number of map tasks that you want to force upon the Hadoop framework. Following is the incomplete code to setup the JobConfiguration and the Bold line shows what you need to do to set the NLineInputFormat. So if you know how to optimally map tasks you need to run you can allow those inputs to lines in the input file so Hadoop will create a new map task for each input.
JobConf conf = new JobConf(HadoopRayTracer.class);
conf.setJobName("raytrace");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(NLineInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class)
.......
Some Analysis
One of the strong points in Hadoop is its ability to efficiently partition the input files using API hooks in the HDFS and calling the Map tasks with each of these partitions. Its useful to understand how to find these task distribution to get the best possible performance achievable. There are few factors to consider when trying to pick the right number of map tasks but the obvious would be to strike a balance between the speedup from the distribution and the overhead of distribution.
Consider the case where time Complexity of the application is O(n) meaning the running time depend entirely on the size of the input, like in the case of WordCount. We can argue that the efficient sizes of the partition that the system could handle would be the deciding factor because the running time is only growing linearly with the input and you have to read the input anyway. So in such cases letting the Hadoop decide on the partition sizes will be the right approach because it will base the partition size based on the optimal block sizes. Why? Following may convince you why.
Lets assume Hadoop distribution overhead is linear with the Map tasks, then Total time would look like the following equation. This may be oversimplified equation because it leaves out the reduction complexity and without any argument assumed the overhead of Hadoop is linear with Map Tasks. But this simplicity would let you have more insight.
Total time = O (Input Size/Map Tasks) + Overhead * Map Tasks
So if the application has linear time complexity O(n) then ;
Total time = (Input Size/Map Tasks)+ Overhead * Map Tasks
Taking the derivative would tell you that Total time is minimum when
Map Tasks = sqrt(Input Size/Overhead)
Since this number is very dependent on the Overhead it make sense to let Hadoop do the partitioning and distribution.
Structured input for Map Reduce
Consider the case where the time complexity of the application is not linear but higher, then the dominant term in the Total time is O (Input Size/Map Tasks). In case the time complexity is O(n^2) =>
Total time = (Input Size/Map Tasks) ^ 2 + Overhead * Map Tasks
Here the first term grows much faster. So minimizing that by increasing the Map tasks is essential. So in such cases you cannot let the Hadoop come up with the number of Map tasks based on the optimal block sizes that it can handle but rather you need to force the number of map tasks upon it.
For example a java ray tracing application which has O(n^2) complexity according to Michael. In this you try to render a scene using ray tracing and this can be easily parallelized because each pixel calculation is independent of the other. When you parallelize it, you will have different segments you want to compute in the input file and you need to force the Map Reduce to calculate those segments in different map tasks. So you need much structured input than that in the word count example because your number of map tasks and your running time depend on it. Letting the partitioning be done to optimize the file sizes in such a compute intensive application would be unwise and in many cases counterproductive. Although it should be noted Hadoop was designed for data intensive parallelism rather than compute intensive parallelism .
Forcing Hadoop to run a map task for each line in input file
Hadoop provides a input formater called NLineInputFormat where each line in the input file would get mapped to a separate map task which allows you to have control over the number of map tasks that you want to force upon the Hadoop framework. Following is the incomplete code to setup the JobConfiguration and the Bold line shows what you need to do to set the NLineInputFormat. So if you know how to optimally map tasks you need to run you can allow those inputs to lines in the input file so Hadoop will create a new map task for each input.
JobConf conf = new JobConf(HadoopRayTracer.class);
conf.setJobName("raytrace");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(NLineInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class)
.......
Labels: each line, Hadoop, Map, Map Reduce, Time complexity
21 Comments:
Arthritis Helpfulvic ionic minerals. I perform different routines for different days so I won't get board. Once I get back home, I immediately hop in the shower. I towel dry and apply baby oil and then put on a terry cloth robe for everything to "soak in." I then fix a quick breakfast,
custom stampsoolong tea benefits
One result of this axis orientation is that, on average during the year, the polar regions of Uranus receive a greater energy input from the Sun than its equatorial regions. Nevertheless, Uranus is hotter at its equator than at its poles. The underlying mechanism which causes this is unknown. The reason for Uranus's unusual axial tilt is also not known with certainty, but the usual speculation is that during the formation of the Solar System, an Earth sized protoplanet collided with Uranus, causing the skewed orientation.[52] Uranus's south pole was pointed almost directly at the Sun at the time of Voyager 2's flyby in 1986.
villa rentals
chiropractor Madison Al
Oxygen-air-breathing propulsion
The Advanced Space Transportation Program is developing technologies for air-breathing rocket engines that could help make future space transportation like today’s air travel. In late 1996, the Marshall Center began testing these radical rocket engines. Powered by engines that "breathe" oxygen from the air, the spacecraft would be completely reusable, take off and land at airport runways, and be ready to fly again within days.
تخسيس الوزن
best edinburgh hotels
The section along York Road that is today known as Sparks was previously named Philopolis. (The name "Philopolis" is from the Greek and means "Love Town." Today, a subdivision of Sparks is named "Loveton Farms."). The original town of Sparks, as distinguished from Philopolis was merely a cluster of homes and farms one mile to the east along the NCR tracks and Sparks Road. Philopolis was the site of the Milton Academy, a well known private day and boarding school for boys. Of note is the fact that one of the school's students was John
HID headlights
its good to share
Though ostensibly a "free state", Illinois had slavery. The French owned black slaves as late as the 1820s. Slavery was nominally banned by the Northwest Ordinance, but that was not enforced. When Illinois became a sovereign state in 1818, the Ordinance no longer applied, and there were about 900 slaves there. As the southern part of the state, known as "Egypt"or "Little Egypt",[29][30] was largely settled by migrants
cdl training
Flash cartoon animations
Therefore, the Punjabis, unlike the Bengalis and the Sindhis, were not allowed to use their mother tongue as an official language. The British first introduced Urdu as an official language in Punjab,[34][35] including Lahore, allegedly due to a fear of Punjabi nationalism. Under British rule (1849–1947), colonial architecture in Lahore combined Mughal, Gothic and Victorian styles. Under British rule, Sir Ganga Ram (referred to as the father of modern Lahore) designed and built the General Post Office, Lahore Museum, Aitchison College, Mayo
coupe de cheveux courts
relogios maquina ETA
The government took over the broadcasting facilities, beginning the Indian State Broadcasting Service (ISBS) on 1 April 1930 (on an experimental basis for two years, and permanently in May 1932). On 8 June 1936 the ISBS was renamed All India Radio. On 1 October 1939 the External Service began with a broadcast in Pushtu; it was intended to counter radio propaganda from Germany directed to Afghanistan, Iran and the Arab nations. When India became independent in 1947 the AIR network had only six stations (in Delhi, Bombay, Calcutta,
Stretch Marks Removal
Essential Oil Diffuser Compact Durable Aromatherapy
Following the Indian Rebellion of 1857, the Government of India Act 1858 made changes in the governance of India at three levels: in the imperial government in London, in the central government in Calcutta, and in the provincial governments in the presidencies (and later in the provinces).
cheap motorcycle insurance quotes
Rosarito Condos
The John Company (and later the colonial government) encouraged new railway companies backed by private investors under a scheme that would provide land and guarantee an annual return of up to five percent during the initial years of operation. The companies were to build and operate the lines under a 99 year lease, with the government having the option to buy them earlier.
electronic cigarette
lace wigs factory
During the period 2000–500 BCE, in terms of culture, many regions of the subcontinent transitioned from the Chalcolithic to the Iron Age.[20] The Vedas, the oldest scriptures of Hinduism,[21] were composed during this period,[22] and historians have analysed these to posit a Vedic culture in the Punjab region and the upper Indo-Gangetic Plain.
Holiday apartments in Madeira
Outerwear sales
The earliest literary writings in India, composed between 1400 BCE and 1200 CE, were in the Sanskrit language.[249][250] Prominent works of this Sanskrit literature include epics such as the Mahābhārata and the Ramayana, the dramas of Kālidāsa such as the Abhijñānaśākuntalam (The Recognition of Śakuntalā), and poetry such as the Mahākāvya.[251][252][253] Developed between 600 BCE and 300 CE in South India, the Sangam literature, consisting of 2,381 poems, is regarded as a predecessor of Tamil literature.[254][255][256][257] From the 14th to the 18th centuries, India's literary traditions went through a period of drastic change becau
DKN Vibration
top ten lists
Cuse attended Harvard University (Class of '81) and was recruited at freshmen registration by the freshman crew coach, Ted Washburn, and became part of the rowing team. In his words, he became "a hardcore athlete." Cuse's original plan was to attend medical school but he instead majored in American history
psicologia las rozas
how lose weight fast
was married to the wealthy Isabel of Gloucester, and was given valuable lands in Lancaster and the counties of Cornwall, Derby, Devon, Dorset, Nottingham and Somerset, all with the aim of buying his loyalty to Richard whilst the king was on crusade.[33] Richard retained royal control of key castles in these counties, thereby preventing John from accumulating too much military and political power.[34] In return, John promised not to visit England for the next three years, thereby in theory giving Richard
arizona medical marijuana card
Knife
ohn's relief operation was blocked by Philip's forces, and John turned back to Brittany in an attempt to draw Philip away from eastern Normandy.[72] John successfully devastated much of Brittany, but did not deflect Philip's main thrust into the east of Normandy.[72] Opinions vary amongst historians as to the military skill shown by John during this campaign, with most recent historians arguing that his performance was passable, although not impressive.
cosmetic dentist dublin
made to measure suit
bstantially all of its assets to Rackable Systems, a deal finalized on May 11, 2009, with Rackable assuming the name "Silicon Graphics International". The remains of Silicon Graphilegal recruitment
renta fast los cabos
online investment
בניית אתריםg computer. They used the "PM1" CPU board, which was a variant of the board that was used in Stanford University's SUN workstation and later in the Sun-1 workstation from Sun Microsystems. The graphics system was composed of the GF1 Frame buffer, the UC3 "Update Controller", DC3 "Display Controller", and the BP2 bitplane. The 1000-series machines were designed around the Multibus standard.
(50,350 sq mi).[103] Most of the country consists of lowland terrain,[100] with mountainous terrain north-west of the Tees-Exe line; including the Cumbrian Mountains of the Lake District, the Pennines and limestone hills of the Peak District, Exmoor and Dartmoor. The main rivers and estuaries are the Thames, Severn and the Humber. England's highest mountain is Scafell Pike (978 metres (3,209 ft)) in the Lake District. Its principal rivers are the Severn, Thames, Humber, Tees, Tyne, Tweed, Avon, Exe and Mersey
dentist dublin
rock climbing guide ireland
The self-produced album received mostly favorable reviews by critics. Publications such as PopMatters and Rolling Stone praised its maturity, while PlayLouder and Pitchfork Media criticized its "dad-rock" sound. While some critics praised the direct lyrical approach, others criticized it when compared to previous Wilco albums. The band licensed six songs from the Sky Blue Sky sessions to a Volkswagen advertisement campaign, a move that generated criticism from fans and the media
Travel Photographer
jobs
On 8 March 2009, the Scottish edition of the Sunday Express published a front page article critical of survivors of the 1996 Dunblane massacre, entitled "Anniversary Shame of Dunblane Survivors". The article criticised the by-then 18-year-old survivors for posting "shocking blogs and photographs of themselves on the internet", revealing that they drank alcohol, made rude gestures, and talked about their sex lives. The article provoked several complaints, leading to the printing of a front-page apology a fortnight later,[23] and a
Filmes Gratis
Energy Saving Light Bulbs
According to John T. Koch and others, England in the Late Bronze Age was part of a maritime trading-networked culture called the Atlantic Bronze Age that included the whole of the British Isles and much of what we now regard as France together with the Iberian Peninsula. Celtic languages developed in those areas; Tartessian may have been the earliest written Celtic language.
4Life
copyshop
Post a Comment
<< Home