     Averages & Boxplots


A graphical exploration of the mean, median, mode, quartiles, deciles, and box plots, including the range and interquartile range.

[Directions: Execute the Code section (section 0) first. It produces no output immediately, but its definitions are used throughout the rest of this worksheet.]

  0. Code

>    restart;

>    with(plots): with(stats):

Warning, the name changecoords has been redefined

>    #======================================================

data  := [3,3,4,8,8,8, 10,13,15, 16,16,18,21,23,24]:

two_extremes_data :=[3,3,3,3,3,3,3,3,3,3,25,25,25,25,25,25,25,25,25,25]:

evenly_distributed_data :=
[2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]:

big_data_set :=
[42,42,42,42,42,43,43,43,43,43,43,45,45,45,45,45,45,45,45,
46,46,46,47,47,47,47,48,48,48,48,49,49,49,51,51,51,51,52,52,55,
55,55,58,58,58,59,61,61,62,62,62,63,63,63,64,64,65,65,65,66,68,69,
69,69,70,71,71 ]:

>    #======================================================

h   := 2:        h2  := .8:      h3 := .45:  
CG := 'color = COLOR(RGB,   .05,.4,.05)':
CBr := 'color = COLOR(RGB,  .3,.2,0)':
CGSS := 'color = COLOR(RGB, .05,.4,.05), symbolsize = 15,   symbol=BOX':

CGF  := 'color = COLOR(RGB, .05,.4,.05), style=patchnogrid, filled = true':
CGrF := 'color = green,  style=patchnogrid, filled = true':
CBF  := 'color = black,  style=patchnogrid, filled = true':
CRF  := 'color = red,    style=patchnogrid, filled = true':
CCF  := 'color = coral,  style=patchnogrid, filled = true':

>    #======================================================

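GetAverages := proc(dataset)
   # Stores the summary statistics of dataset in global variables
   # so that GetPlots and the text below can use them.
   global q1,q2,q3,me,mo,mn,mx, dec;
   local i,MM;
   q1 := evalf( describe[quartile[1]](dataset));   # first quartile
   q2 := evalf( describe[median](dataset) );       # median (second quartile)
   q3 := evalf( describe[quartile[3]](dataset));   # third quartile
   me := evalf( describe[mean](dataset));          # mean
   mo := describe[mode](dataset);                  # mode
   MM := describe[range](dataset);                 # range, of the form min..max
   mn := op(1,MM);                                 # minimum
   mx := op(2,MM);                                 # maximum
   # dec0 .. dec10: the minimum, the nine deciles, the maximum
   dec||0 := mn; dec||10:= mx;
   for i from 1 to 9 do
      dec||i := evalf( describe[decile[i]](dataset));  od;
end proc: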

>    #======================================================
GetPlots := proc(dataset)
   global  q1,q2,q3,me,mo,mn,mx,dec,
           MN,MN2,MD,MD2,MO,MO2,DP,LQ,RQ,Ln,LE,RE,LR,LIQR,Tends,PP;
   local   i,h,h2,h3,y;

   #--------- Constants  -----------------------------
   h := 2;        h2 := .8;   h3 := .5;

   #--------- Plots on Original Data -----------------------------
   # base line, left end marker, right end marker for the original data

   Ln   := plot( [[mn,0],[mx+.5,0]], color = blue, thickness = 4):
   LE   := plot( [[mn,-h3],[mn+h3,-h3],[mn+h3,h3], [mn,h3],[mn,-h3]],CGrF):
   RE   := plot( [[mx+.5,-h3],[mx-h3+.5,-h3],[mx-h3+.5,h3],
                  [mx+.5, h3],[mx+.5,   -h3]], CGrF):
   #------------ MEAN ------------------
   MN   := plot( [[me, h2],[me-h2,0],[me+h2,0],[me, h2]],               CBF):
   MN2  := plot( [[me , +h+h2],[me,0]], color = black, linestyle = 2):

   #------------ MEDIAN ------------------
   MD :=  plot( [[q2 ,-h2],[q2-h2,0],[q2+h2,0],[q2,-h2]],
             color = white, style=patchnogrid, filled = true):
   MD2 := plot( [[q2, h],[q2,-h]], color = white, linestyle = 2):

   #------------ MODE ------------------
   MO :=  plot( [[mo ,-h2],[mo-h2,0],[mo+h2,0],[mo,-h2]],
             CBr, style=patchnogrid, filled = true):
   MO2 := plot( [[mo, h],[mo,-h]], CBr, linestyle = 2):

   #------------ BOXES ------------------
   LQ := plot( [[q1,-h],[q2,-h],[q2,h],[q1,h],[q1,-h]],  CRF):
   RQ := plot( [[q2,-h],[q3,-h],[q3,h],[q2,h],[q2,-h]],  CCF):

   #------------ DECILES ------------------
   DP := plot( [seq( [[dec||i ,-h],[dec||i,-1.4*h]],i = 0..10)],
             color = black):

   #------------ RANGES ------------------
   LR := plot( [[mn,5-h2],[mn,5],[mx,5],[mx,5-h2]],
                color = blue, thickness = 1):
   LIQR := plot( [[q1,4-h2],[q1,4],[q3,4],[q3,4-h2]],
                color = red, thickness = 1):
 
   #------------ TEXT ------------------
   Tends  := plots[textplot]( {[mn, -h/2, mn],[mx, -h/2, mx]},
             align={BOTTOM,RIGHT},font=[TIMES,ROMAN,12],color = black):

 
   #------------ Original Distribution Points ------------------
   PP||1 := pointplot( [dataset[1],-3-h3], CGSS ):
   y := h3;

   for i from 2 to nops(dataset) do
       if (dataset[i]=dataset[i-1]) then y := y+h3; else y:= h3; fi;
       PP||i := pointplot( [dataset[i],-3-y ], CGSS ):  
   od:

end proc:


>    #===================================================
BasicDataPlot := proc(dataset)
   display([ Ln,LE,RE,Tends,  
             seq(PP||i, i = 1..nops(dataset))
           ], scaling = constrained, axes = none );
end proc:

>    #===================================================
MeanPlot := proc(dataset)
   display([ MN, MN2, Ln,LE,RE,Tends,  
             seq(PP||i, i = 1..nops(dataset))
           ], scaling = constrained, axes = none );
end proc:

>    #===================================================
MeanModePlot := proc(dataset)
   display([ MN, MN2,MO,MO2,Ln,LE,RE,LR,Tends,  
             seq(PP||i, i = 1..nops(dataset))
           ], scaling = constrained, axes = none );
end proc:

>    #===================================================
BoxPlot := proc(dataset)
   display([ MN,MN2,MD,MD2,LQ,RQ,Ln,LE,RE,LR,LIQR,Tends,  
             seq(PP||i, i = 1..nops(dataset))
           ], scaling = constrained, axes = none );
end proc:

>    #===================================================
DecPlot := proc(dataset)
   display([ MD,MD2,DP,LQ,RQ,Ln,LE,RE,LIQR,Tends,  
             seq(PP||i, i = 1..nops(dataset))
           ], scaling = constrained, axes = none );
end proc:

>    #===================================================
CompleteDescriptionPlot := proc(dataset)
   display([ MN,MN2,MD,MD2,MO,MO2,DP,LQ,RQ,Ln,LE,RE,LR,LIQR,Tends,  
             seq(PP||i, i = 1..nops(dataset))
           ], scaling = constrained, axes = none );
end proc:

>    PricePlot := proc(dataset)
   display([ MN2,MD2,DP,LQ,RQ,Ln,LE,RE,LR,LIQR,Tends,  
             seq(PP||i, i = 1..nops(dataset))
           ], axes = none );
end proc:

  1. Data Distributions


Any collection of data values can be expressed graphically by drawing one cell for each occurrence of a data value at its location on the x-axis, stacking the cells when a value occurs more than once.

>    GetAverages(data):
GetPlots(data):
data;

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]

>    BasicDataPlot(data);

[Maple Plot]


There is a box for each data value. The minimum is 3 and maximum is 24. There are some gaps between the green boxes, where no data exists. And the boxes are stacked up when there are multiple data entries sharing the same value. This is a visual representation of the original data distribution.


These distributions can be quite different. The next distribution is evenly distributed. Each data value has a frequency of 1, and every data value in the range is covered so there are no gaps. This has an artificial look to it, but it's good to consider a wide variety of distributions.

>    GetAverages(evenly_distributed_data):
GetPlots(evenly_distributed_data):
evenly_distributed_data;

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

>    BasicDataPlot(evenly_distributed_data);

[Maple Plot]


This distribution is quite different. It consists of only two distinct values, each repeated ten times.

>    GetAverages(two_extremes_data):
GetPlots(two_extremes_data):
two_extremes_data;

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25]

>    BasicDataPlot(two_extremes_data);

[Maple Plot]


Here is a run-of-the-mill distribution.

>    GetAverages(big_data_set):
GetPlots(big_data_set):
big_data_set;

[42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 45, 45, 45, 45, 45, 45, 45, 45, 46, 46, 46, 47, 47, 47, 47, 48, 48, 48, 48, 49, 49, 49, 51, 51, 51, 51, 52, 52, 55, 55, 55, 58, 58, 58, 59, 61, 61, 62, 62, ...

>    BasicDataPlot(big_data_set);

[Maple Plot]

  2. The Mean


The mean is computed by adding up all the values and dividing by the number of data values.

>    GetAverages(data):
GetPlots(data):
data;

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]

>    (  data[1]  + data[2]  + data[3]  + data[4]  + data[5]  +
   data[6]  + data[7]  + data[8]  + data[9]  + data[10] +
   data[11] + data[12] + data[13] + data[14] + data[15]  )/nops(data);
`the mean ` = evalf(%,15);

38/3

`the mean ` = 12.6666666666667

>    me;

12.66666667
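
The long sum above can be written more compactly. As a small sketch (not part of the original worksheet), the built-in add command runs over every entry of the list, so the same line works for a data set of any length:

>    # assumes data is still the 15-value list defined in the Code section
add(x, x = data) / nops(data);

38/3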

>    MeanPlot(data);

[Maple Plot]


The mean is indicated by the small black triangle facing upward. This is the balancing point for the data.

Let's look at some other distributions.
 

>    GetAverages(evenly_distributed_data):
GetPlots(evenly_distributed_data):

>    MeanPlot(evenly_distributed_data);

[Maple Plot]

>    GetAverages(big_data_set):
GetPlots(big_data_set):

>    MeanPlot(big_data_set);

[Maple Plot]

  3. The Mode


The mode is the most common data value.

>    GetAverages(data):
GetPlots(data):
data;

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]

>    mo;

8

>    MeanModePlot(data);

[Maple Plot]


The mode is indicated by the brown triangle pointing downward. The mean is still the black triangle pointing upward. We can see why the mode is 8: there are three boxes stacked at 8, more than at any other value. Note that the mean and mode are quite different, and pretty much unrelated in general.
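
The stack heights can also be checked by counting. In this sketch, CountOf is a hypothetical helper (it is not defined elsewhere in this worksheet) that counts how often a value occurs in a list; it is applied to the three values that occur more than once:

>    # CountOf(v, L): number of entries of L equal to v (illustrative helper)
CountOf := (v, L) -> nops(select(x -> evalb(x = v), L)):
CountOf(3, data), CountOf(8, data), CountOf(16, data);

2, 3, 2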

Sometimes there is a tie among several values. That is not the case in the next data set, where 45 occurs eight times, more than any other value.
 

>    GetAverages(big_data_set):
GetPlots(big_data_set):
big_data_set;

[42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 45, 45, 45, 45, 45, 45, 45, 45, 46, 46, 46, 47, 47, 47, 47, 48, 48, 48, 48, 49, 49, 49, 51, 51, 51, 51, 52, 52, 55, 55, 55, 58, 58, 58, 59, 61, 61, 62, 62, ...

>    mo;

45

>    MeanModePlot(big_data_set);

[Maple Plot]

  4. The Median


The median, in theory, divides the data into two halves. Half of the data should be below the median, and half above. Of course, there are complications when the median actually falls on a data value, but we won't go into every detail here and now.

>    GetAverages(data):
GetPlots(data):
data;

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]


Luckily this data is already sorted for us. There are 15 data values. The middle value, the 8th of 15,  is the number 13.

>    `the median ` = q2;

`the median ` = 13.
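
As a rough check that does not rely on the stats package, the middle entry of the sorted list can be picked out directly. This sketch assumes an odd number of values; the name s is ours:

>    # sort first so that the sketch also works for unsorted lists
s := sort(data):
s[(nops(s) + 1)/2];

13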


We'll postpone plotting this just for a moment, because medians are a part of the box plot that follows.
 

  5. Quartiles & the Box Plot


Medians divide the data into two halves. Quartiles divide the data into four quarters. Again, there are complications when the quartiles actually fall on data values, but this is the rough idea.

>    GetAverages(data):
GetPlots(data):

>    q1; q2; q3;

7.

13.

16.50000000


Thus, the data should be distributed in this way :

             - 1/4 of the data is less than or equal to Q1
             - 1/2 of the data is less than or equal to Q2 = median
             - 3/4 of the data is less than or equal to Q3
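
For a small data set these fractions hold only approximately. A quick sketch (FracBelow is a hypothetical helper, and the quartile values 7, 13 and 16.5 are copied from the output above) counts the share of values at or below each quartile:

>    # FracBelow(q, L): fraction of the entries of L that are <= q (illustrative helper)
FracBelow := (q, L) -> nops(select(x -> evalb(x <= q), L)) / nops(L):
FracBelow(7, data), FracBelow(13, data), FracBelow(16.5, data);

1/5, 8/15, 11/15

With only 15 values the shares come out near, but not exactly at, 1/4, 1/2 and 3/4.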

This lends itself to what is called a box plot. Such a plot depends on the three quartiles, and the minimum value and maximum value. We draw a box from the first quartile to the second, and from the second to the third. Then we draw "whiskers" (lines) from the minimum to quartile 1, and from quartile 3 to the maximum. A picture will demonstrate the concept.

>    BoxPlot(data);

[Maple Plot]


The green data boxes at the bottom are the original data, just as they were before. Also the black triangle indicates the mean just as before. The red box is the box from Q1 to Q2, and the orange box is the box from Q2 to Q3.  The median is indicated by a white triangle facing downward. As you can see, it is exactly at the place where the red and orange boxes meet.

The blue line above is the range of the data: the difference between the largest and smallest values. The red line is called the interquartile range. It is the distance from Q1 to Q3.

>    `range` = mx - mn;
`interquartile range` = q3 - q1;

range = 21

`interquartile range` = 9.50000000


If we take the ratio of the interquartile range to the range, we get an idea of how closely bunched the middle 50% of the data is compared to the rest.  

>    (q3 - q1)/(mx - mn);

.4523809524

 
Here is another example, with evenly distributed data, which yields symmetric results. Note that the mean and median agree, which is usually not the case for more randomly distributed data.

>    GetAverages(evenly_distributed_data):
GetPlots(evenly_distributed_data):

>    BoxPlot(evenly_distributed_data);

[Maple Plot]
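
The agreement of mean and median can be confirmed by hand. In this sketch the mean is computed with add, and the median is read off as the 13th of the 25 sorted values:

>    add(x, x = evenly_distributed_data) / nops(evenly_distributed_data),
evenly_distributed_data[13];

14, 14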


Here is another example with less orderly data. Note that the red box and the orange box are not the same width.

>    GetAverages(big_data_set):
GetPlots(big_data_set):

>    BoxPlot(big_data_set);

[Maple Plot]


 

  6. Deciles


Deciles are similar to quartiles. Quartiles attempt to break the data set into four quarters with equal numbers of data elements. Deciles attempt to break the data into ten equal-sized subgroups.

>    GetAverages(data):
GetPlots(data):
data;

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]


If we include the minimum value and the maximum value, we get the following values for the deciles. Approximately 10% of the data lies between any two consecutive numbers in the list.

>    for i from 0 to 10 do
      dec||i; od;

3

3.

4.

8.

8.

11.50000000

15.

16.

18.

22.

24

>    DecPlot(data);

[Maple Plot]


The black lines indicate the 9 deciles along with the min and max, creating ten regions which contain roughly the same number of data values. Since this is a small distribution, there are some rough edges.


Let's look at another distribution. This data is evenly distributed, so we would expect the deciles and the quartiles to be evenly spaced.

>    GetAverages(evenly_distributed_data):
GetPlots(evenly_distributed_data):
evenly_distributed_data;

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

>    for i from 0 to 10 do
      dec||i; od;

2

3.500000000

6.

8.500000000

11.

13.50000000

16.

18.50000000

21.

23.50000000

26

>    DecPlot(evenly_distributed_data);

[Maple Plot]


A larger and less ordered distribution is more typical.

>    GetAverages(big_data_set):
GetPlots(big_data_set):
big_data_set;

[42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 45, 45, 45, 45, 45, 45, 45, 45, 46, 46, 46, 47, 47, 47, 47, 48, 48, 48, 48, 49, 49, 49, 51, 51, 51, 51, 52, 52, 55, 55, 55, 58, 58, 58, 59, 61, 61, 62, 62, ...

>    for i from 0 to 10 do
      dec||i; od;

42

43.

45.

46.

47.80000000

50.

55.

60.80000000

63.

66.60000000

71

>    DecPlot(big_data_set);

[Maple Plot]


The fact that the data is more concentrated at the left (the smaller numbers) is evident in several ways. The red box for Q1 to Q2 is smaller than the orange box from Q2 to Q3. In the same way, you'll see the decile lines are closely packed on the left, but more spread out toward the right edge.
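
The spacing of the decile lines can also be read off numerically. This sketch copies the eleven values printed above into a list (the name d is ours) and takes the gaps between neighbours:

>    # d holds dec0 .. dec10 for big_data_set, copied from the output above
d := [42, 43, 45, 46, 47.8, 50, 55, 60.8, 63, 66.6, 71]:
[seq(d[k+1] - d[k], k = 1..10)];

[1, 2, 1, 1.8, 2.2, 5, 5.8, 2.2, 3.6, 4.4]

The first five gaps are all at most 2.2, while four of the last five are 3.6 or more.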
 

  7. Comparing Descriptions


Let's see all of this information in one diagram and compare the different averages.

>    GetAverages(data):
GetPlots(data):
data;

[3, 3, 4, 8, 8, 8, 10, 13, 15, 16, 16, 18, 21, 23, 24]

>    `mean` = me;
`median` = q2;
`mode` = mo;

mean = 12.66666667

median = 13.

mode = 8


The mode is fairly independent since it is simply a measure of the most common data value. The mean and median are more representative of "averages" of the data set. You can see that in this example they are not too far apart.

>    CompleteDescriptionPlot(data);

[Maple Plot]


Here is another example, with the big data set which is concentrated near the smaller numbers.

>    GetAverages(big_data_set):
GetPlots(big_data_set):
big_data_set;

[42, 42, 42, 42, 42, 43, 43, 43, 43, 43, 43, 45, 45, 45, 45, 45, 45, 45, 45, 46, 46, 46, 47, 47, 47, 47, 48, 48, 48, 48, 49, 49, 49, 51, 51, 51, 51, 52, 52, 55, 55, 55, 58, 58, 58, 59, 61, 61, 62, 62, ...

>    `mean` = me;
`median` = q2;
`mode` = mo;

mean = 53.31343284

median = 51.

mode = 45

>    CompleteDescriptionPlot(big_data_set);

[Maple Plot]



Let's compare these two data sets, which are identical except for one value. Which of the mean, median, and mode will be different?

>    data2A :=
[40,40,50,50,50,50,60,60,60,70,70]:

data2B :=
[10,40,50,50,50,50, 60,60,60,70,70]:

>    data2A;
GetAverages(data2A):
GetPlots(data2A):

[40, 40, 50, 50, 50, 50, 60, 60, 60, 70, 70]

>    `mean` = me;
`median` = q2;
`mode` = mo;

mean = 54.54545455

median = 50.

mode = 50

>    data2B;
GetAverages(data2B):
GetPlots(data2B):

[10, 40, 50, 50, 50, 50, 60, 60, 60, 70, 70]

>    `mean` = me;
`median` = q2;
`mode` = mo;

mean = 51.81818182

median = 50.

mode = 50

>    CompleteDescriptionPlot(data2B);

[Maple Plot]
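
The size of the shift in the mean can be worked out directly. As a sketch using the two lists defined above, changing one value from 40 to 10 lowers the sum by 30, so the mean should drop by 30/11:

>    # difference of the two means, kept as an exact fraction
add(x, x = data2A)/nops(data2A) - add(x, x = data2B)/nops(data2B);

30/11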


The mode and median are the same. However, the mean is different. The mean is affected by even a single extreme data value, whereas the median is not. This is why the median is often used for statistics on incomes and housing prices: a small number of extreme values can skew the mean. Here is a little example.

>    `Housing prices in thousands of dollars`;
housing_prices := [ 100, 120, 135, 150, 155, 170, 195,
                    205, 220, 235, 247, 255, 275, 289, 1850
                  ];
GetAverages( housing_prices):
GetPlots(    housing_prices):

`Housing prices in thousands of dollars`

housing_prices := [100, 120, 135, 150, 155, 170, 195, 205, 220, 235, 247, 255, 275, 289, 1850]

>    `mean` = me;
`median` = q2;

mean = 306.7333333

median = 205.


Note that the mean is greater than all but one of the data values!

>    14/15: % = evalf(%,2);

14/15 = .93


Another way of looking at it is that 93% of the data lies below the mean, and only 7% lies above it. That does not make the mean a good representative average for this data set. If someone asked how many members of the community can afford an "average priced" house, and the average used were the mean, then it might be that 93% of the people can NOT afford an average priced house. This is why the median, 205, is much more indicative of an average home price.
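
To see how much of the mean is due to the single expensive home, one can replace it with a more ordinary price. In this sketch the list capped_prices and the replacement value 300 are made up for illustration; they are not part of the original data:

>    # hypothetical data: the 1850 entry replaced by 300
capped_prices := [ 100, 120, 135, 150, 155, 170, 195,
                   205, 220, 235, 247, 255, 275, 289, 300 ]:
evalf( add(x, x = capped_prices)/nops(capped_prices) ), capped_prices[8];

203.4000000, 205

The mean falls from about 306.7 to about 203.4, while the median, the 8th of the 15 sorted values, stays at 205.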

>    PricePlot(housing_prices);

[Maple Plot]


Clearly most of the data is on the far left - as are the 9 deciles, the 3 quartiles, and the median. The mean is sticking out above all of the data values. And the single expensive home is on the far right.


         © 2002 Waterloo Maple Inc & Gregory Moore, all rights reserved.