Abstract and 1. Introduction

2. Experiment Definition

3. Experiment Design and Conduct

3.1 Latin Square Designs

3.2 Subjects, Tasks and Objects

3.3 Conduct

3.4 Measures

4. Data Analysis

4.1 Model Assumptions

4.2 Analysis of Variance (ANOVA)

4.3 Treatment Comparisons

4.4 Effect Size and Power Analysis

5. Experiment Limitations and 5.1 Threats to the Conclusion Validity

5.2 Threats to Internal Validity

5.3 Threats to Construct Validity

5.4 Threats to External Validity

6. Discussion and 6.1 Duration

6.2 Effort

7. Conclusions and Further Work, and References

4.1 Model Assumptions

Before drawing any conclusions, we must assess the following model assumptions:

  1. All observations are independent (independence)

  2. The variance is the same for all observations (homogeneity)

  3. The observations within each treatment group have a normal distribution (normality)

The first assumption is addressed by the randomization principle used in this experimental design: the measures of one sample are not related to those of the other sample. The second and third assumptions are assessed using the estimated residuals [6, 16]. To assess homogeneity of variances, we use a scatter plot of the standardized residuals against the estimated mean values (sometimes called fitted values), together with the Levene test for homogeneity of variances [17]. The third assumption (normality) is evaluated using a normal probability plot and the Kolmogorov-Smirnov test for normality [15, 26].
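As a rough illustration of how fitted values and standardized residuals can be obtained, the sketch below fits an additive (main-effects-only) model by least squares to a small 3x3 Latin square. The data, the factor labels and the helper name `additive_fit` are hypothetical, and the residuals are standardized simply by dividing by the root mean square error; statistical packages may use a slightly different definition.

```python
import numpy as np

def additive_fit(y, factors):
    """Least-squares fit of an additive model (no interactions).

    y       : 1-D sequence of responses
    factors : list of 1-D label sequences (e.g. row, column, treatment)
    Returns (fitted values, standardized residuals).
    """
    y = np.asarray(y, dtype=float)
    columns = [np.ones_like(y)]                  # intercept
    for labels in factors:
        labels = np.asarray(labels)
        for level in np.unique(labels)[1:]:      # first level is the reference
            columns.append((labels == level).astype(float))
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    dof = len(y) - np.linalg.matrix_rank(X)      # residual degrees of freedom
    sigma = np.sqrt(resid @ resid / dof)         # root mean square error
    return fitted, resid / sigma

# Hypothetical 3x3 Latin square (illustrative values, not the paper's data).
rows  = [0, 0, 0, 1, 1, 1, 2, 2, 2]
cols  = [0, 1, 2, 0, 1, 2, 0, 1, 2]
treat = [0, 1, 2, 1, 2, 0, 2, 0, 1]              # each treatment once per row/column
y     = [10, 12, 15, 13, 14, 19, 15, 19, 21]

fitted, std_resid = additive_fit(y, [rows, cols, treat])
```

Plotting `std_resid` against `fitted` gives the kind of residuals-versus-fitted scatter plot described above.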

Taking duration as the measure, Fig. 1 shows a scatter plot of the standardized residuals versus the fitted values. A violation of the homogeneity-of-variance assumption shows up in this plot as vertical spread that differs from one point along the horizontal axis to another. In this case, Fig. 1 shows that the vertical spread differs between the columns of points. Applying the Levene test [17], we get a p-value of 0.0594. Truncating the p-value to two decimals without rounding (0.05) and setting an alpha level of 0.05, we treat this test as significant, so the assumption of homogeneity is considered violated; the untruncated p-value lies slightly above 0.05, so this evidence is borderline.
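A minimal sketch of a Levene test, assuming SciPy is available and using made-up durations for two groups (the values and variable names are illustrative, not the experiment's data). `center="mean"` selects the original Levene statistic; SciPy's default, `center="median"`, is the Brown-Forsythe variant.

```python
from scipy import stats

# Hypothetical durations in minutes (illustrative values only).
group_a = [60, 90, 120, 150, 180, 75, 165, 105, 135, 120]   # widely spread
group_b = [98, 100, 102, 99, 101, 100, 97, 103, 100, 100]   # tightly clustered

# Levene test: null hypothesis is that both groups have equal variances.
levene_stat, levene_p = stats.levene(group_a, group_b, center="mean")

alpha = 0.05
variances_differ = levene_p < alpha   # True means homogeneity is rejected
```

With these illustrative numbers the spread of the first group is far larger than that of the second, so the test rejects homogeneity at the 0.05 level.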

On further analysis, we found that the second program took less time to write than the first. In Fig. 1, the first and second vertical groups of data points correspond to the second program (encoder). Fig. 2 shows the mean duration for writing each program. To satisfy the homogeneity assumption, in future experiments we will select programs of similar complexity.

Turning to the normality assumption, Fig. 3 shows a normal probability plot. If the points (in this case, standardized residuals) fall close to a straight line, the residuals are approximately normal. Points above the line correspond to residuals that are larger than we would expect for normal data; points below the line correspond to residuals that are smaller than we would expect. Applying the Kolmogorov-Smirnov test for normality [15, 26], we get a p-value of 0.8806, so we fail to reject the null hypothesis of normality.
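The normality check can be sketched as follows, assuming SciPy and a made-up set of standardized residuals (illustrative values, not the paper's). `scipy.stats.probplot` returns the coordinates of a normal probability plot, and `scipy.stats.kstest` compares the sample with a standard normal distribution. Because the residuals are standardized from estimated parameters, the resulting p-value is only approximate.

```python
from scipy import stats

# Hypothetical standardized residuals (illustrative values only).
std_resid = [-1.8, -1.2, -0.9, -0.6, -0.4, -0.2, -0.1, 0.0,
             0.1, 0.3, 0.4, 0.6, 0.8, 1.1, 1.3, 1.9]

# Normal probability plot coordinates: theoretical quantiles vs ordered residuals,
# plus the least-squares line through them and its correlation coefficient r.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    std_resid, dist="norm"
)

# Kolmogorov-Smirnov test against N(0, 1); null hypothesis: the data are normal.
ks_stat, ks_p = stats.kstest(std_resid, "norm")

alpha = 0.05
normality_rejected = ks_p < alpha   # False means we fail to reject normality
```

An `r` close to 1 corresponds to points lying near the straight line in the plot, and a large `ks_p` means the test gives no evidence against normality.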

For effort, the assumption checks give results similar to those reported for duration. The Levene test for homogeneity of variances [17] gives a p-value of 0.0241, which is significant at an alpha level of 0.05. This means that the variances are not equal, owing to the differences in duration between programs. The Kolmogorov-Smirnov test for normality [15, 26] gives a p-value of 0.8059, so again we fail to reject the null hypothesis of normality.

Because of the experimental design used, another assumption worth assessing is additivity. Experimental designs that implement blocking assume that there is no interaction between the treatment and the block. In that case, treatment and block effects are said to be additive [16]. We test this assumption using the Tukey test for nonadditivity [27]. Table 5 shows the results of this test for the Latin square design used in the experiment.

At an alpha level of 0.1 (or less), the p-values are not significant. This means that the experiment results satisfy the additivity assumption: there is no evidence of interaction between treatment and blocks.
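For illustration, Tukey's one-degree-of-freedom test for nonadditivity can be sketched for the simpler case of a two-way layout (treatments by blocks) with one observation per cell. This is a simplification: a Latin square has an additional blocking factor, and the table below is made up. The function name `tukey_nonadditivity` is hypothetical, and SciPy is assumed only for the F distribution.

```python
import numpy as np
from scipy import stats

def tukey_nonadditivity(table):
    """Tukey's 1-df test for a treatments x blocks table, one value per cell."""
    y = np.asarray(table, dtype=float)
    a, b = y.shape
    grand = y.mean()
    row_eff = y.mean(axis=1) - grand             # treatment effects
    col_eff = y.mean(axis=0) - grand             # block effects
    resid = y - grand - row_eff[:, None] - col_eff[None, :]
    ss_error = (resid ** 2).sum()                # residual SS of additive model
    # Sum of squares for nonadditivity (1 degree of freedom).
    num = (row_eff[:, None] * col_eff[None, :] * y).sum() ** 2
    den = (row_eff ** 2).sum() * (col_eff ** 2).sum()
    ss_nonadd = num / den
    df_rem = (a - 1) * (b - 1) - 1               # remaining error df
    f_stat = ss_nonadd / ((ss_error - ss_nonadd) / df_rem)
    p_value = stats.f.sf(f_stat, 1, df_rem)
    return f_stat, p_value

# Hypothetical table (rows: treatments, columns: blocks).
table = [[10, 12, 15],
         [13, 14, 19],
         [15, 19, 21]]

f_stat, p_value = tukey_nonadditivity(table)
alpha = 0.1
additivity_rejected = p_value < alpha   # False: no evidence of interaction
```

A non-significant p-value, as in this made-up table, is read the same way as above: the data give no evidence against additive treatment and block effects.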

Authors:

(1) Omar S. Gómez, full time professor of Software Engineering at Mathematics Faculty of the Autonomous University of Yucatan (UADY);

(2) José L. Batún, full time professor of Statistics at Mathematics Faculty of the Autonomous University of Yucatan (UADY);

(3) Raúl A. Aguilar, Faculty of Mathematics, Autonomous University of Yucatan Merida, Yucatan 97119, Mexico.


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.