Statistical Analysis of Data obtained by Simulation Campaigns

May 25, 2023

After running simulation campaigns (as described here and here), it is tempting to guess from the curves that the asymptotic complexity of a given algorithm/topology/daemon combination is logarithmic, linear, quadratic, or something else.

The R script below performs a statistical regression analysis (least squares method) using the lm function:

library(dplyr)
library(ggplot2)
er     <- "er.data" # .data files are generated by scripts in tools/simca/
clique <- "clique.data"
ring   <- "ring.data"

# Read a campaign .data file (7 columns, no header line) and name its columns
load_data <- function(fn){
  data <- read.table(fn)
  names(data) <- c("n", "Daemons", "Algorithms", "complexity_kind", "min", "mean", "max")
  data
}

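# Fit the formula `form` on the campaign data stored in file `topo`, restricted to
# daemon `d`, algorithm `a` and complexity kind `cm`; plot the fit and return it.
do_reg_analysis <- function(form, d, a, cm, topo){
 data <- load_data(topo)
 dataf <- filter(data, Daemons == d, Algorithms == a, complexity_kind == cm)
 fit <- lm(formula = form, data = dataf) # This is where the job is done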
 prediction <- predict(fit, interval = "confidence")
 dataf <- cbind(dataf, prediction)
 p <- ggplot(dataf, aes(x = n, y = mean)) + geom_point() + #stat_smooth(method = lm) +
   ylab(cm)+xlab("Nodes Number")+
   geom_line(aes(y = lwr), color = "red", linetype = "dashed")+
   geom_line(aes(y = fit), color = "blue")+
   geom_line(aes(y = upr), color = "red", linetype = "dashed")+
   ggtitle(paste(a, "/",d,"daemon in", cm,"on", basename(topo),
   "using \"", deparse(form), "\"\nF-value =", sprintf("%.1f",anova(fit)[[4]][1]),
   " ; p-value =" , sprintf("%.3e",anova(fit)[[5]][1])))
 print(p)
 print(anova(fit))
 print(coef(fit))
 fit
}
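The plot title above reads the F-value and p-value from the ANOVA table by position (`[[4]]` and `[[5]]`). As a minimal sketch, assuming made-up data in place of a real campaign file (the `toy` data frame and its values are hypothetical), the same numbers can be read by column name, which is easier to follow:

```r
# Made-up data standing in for one filtered campaign slice (not real results).
set.seed(42)
toy <- data.frame(n = seq(10, 100, by = 10))
toy$mean <- 8 + 0.12 * toy$n + rnorm(nrow(toy), sd = 0.5)

fit <- lm(mean ~ n, data = toy)
tab <- anova(fit)

# Positional access, as in do_reg_analysis:
# column 4 is "F value", column 5 is "Pr(>F)".
f_by_position <- tab[[4]][1]
p_by_position <- tab[[5]][1]

# Equivalent access by column name.
f_by_name <- tab[["F value"]][1]
p_by_name <- tab[["Pr(>F)"]][1]

print(colnames(tab))
```

Note that for a formula with several terms, such as mean~n + I(n^2), row 1 of the table only describes the first term.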

For example, in order to check the hypothesis that,

for the algorithm named “Col-a2” (as it is called in tools/simca/coloring_campaign.ml)
on random topologies (generated via the Erdős-Rényi algorithm)
under the Synchronous daemon

the number of steps is a linear function of the number of nodes (as the curves in the generated pdf hint), one can use the do_reg_analysis function defined above as follows:

do_reg_analysis(mean~n, "Synchronous", "Col-a2", "steps", er)

R then generates a plot, where:

the blue line is the fitted function
the dashed red lines delimit the confidence interval of the fitted mean (95%, the default level of predict)

R also outputs figures that assess the goodness of fit: the larger the F-value and the smaller the p-value, the better. More information on the inferred model is displayed in the R terminal:

> do_reg_analysis(mean~n, "Synchronous", "Col-a2", "steps", er)
Analysis of Variance Table

Response: mean
          Df  Sum Sq Mean Sq F value   Pr(>F)
n          1 1064.81 1064.81  1363.8 3.17e-10 ***
Residuals  8    6.25    0.78
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Intercept)           n
  8.2266667   0.1197535

Call:
lm(formula = form, data = dataf)

Coefficients:
(Intercept)            n
     8.2267       0.1198
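The coefficients printed above define the inferred function: steps ≈ 8.2267 + 0.1198 × n. A small sketch of how they can be used to estimate the number of steps for a given network size (the helper name `predict_steps` is introduced here for illustration only):

```r
# Coefficients copied from the R output above.
coefs <- c("(Intercept)" = 8.2266667, n = 0.1197535)

# Hypothetical helper: evaluate the fitted line at n nodes.
predict_steps <- function(n) {
  unname(coefs["(Intercept)"] + coefs["n"] * n)
}

print(predict_steps(100)) # about 20.2 steps for 100 nodes
```

With the fit object returned by do_reg_analysis at hand, predict(fit, newdata = data.frame(n = 100)) gives the same kind of estimate without copying numbers by hand. Extrapolating far beyond the range of n covered by the campaign is of course risky.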

Sometimes a linear fit is not suitable. For example:

do_reg_analysis(mean~n,"Synchronous","Col-a1","steps",ring)

The F and p values are not too bad, but a better fit can be obtained via a logarithmic regression:

do_reg_analysis(mean~log(n),"Synchronous","Col-a1","steps",ring)
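To compare the two candidate models side by side rather than by eye, one possible sketch is given below. It assumes made-up data that grows logarithmically (the `toy` data frame is hypothetical, not campaign output), and it adds the adjusted R² reported by summary() next to the F-value, which is a choice of this sketch rather than something do_reg_analysis prints:

```r
# Made-up data with logarithmic growth (not real campaign results).
set.seed(1)
toy <- data.frame(n = seq(10, 200, by = 10))
toy$mean <- 3 + 2 * log(toy$n) + rnorm(nrow(toy), sd = 0.1)

fit_lin <- lm(mean ~ n, data = toy)
fit_log <- lm(mean ~ log(n), data = toy)

# One row per model: F-value of the single term and adjusted R-squared.
comparison <- data.frame(
  model   = c("mean ~ n", "mean ~ log(n)"),
  f_value = c(anova(fit_lin)[["F value"]][1], anova(fit_log)[["F value"]][1]),
  adj_r2  = c(summary(fit_lin)$adj.r.squared, summary(fit_log)$adj.r.squared)
)
print(comparison)
```

On such data the logarithmic model gets both the larger F-value and the larger adjusted R², even though the linear model is also highly significant.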

Of course, the fit is more reliable when more data points are available:

ring   <- "~/sasa/tools/simca/results/2021-05-16_15/ring.data" # there are more points in this experiment
do_reg_analysis(mean~log(n),"Synchronous","Col-a1","steps",ring)

Some entry points to statistical analyses using the R lm function:

Analysis of the 81 curves of the coloring campaign

The following calls to the do_reg_analysis R function correspond to the best fit we obtained after trying each of:

a linear fit: mean~n
a logarithmic fit: mean~log(n)
a quadratic fit: mean~n + I(n^2)
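Trying the three formulas above can be automated. The sketch below is one way to do it, not the procedure used for the list that follows: it ranks the models with AIC (a criterion swapped in here because it penalizes the extra term of the quadratic model), and the names `candidates`, `best_fit` and `toy` are hypothetical.

```r
# The three candidate formulas listed above.
candidates <- list(
  linear      = mean ~ n,
  logarithmic = mean ~ log(n),
  quadratic   = mean ~ n + I(n^2)
)

# Fit every candidate on dataf and return the name of the one with the lowest AIC.
best_fit <- function(dataf) {
  fits <- lapply(candidates, function(form) lm(form, data = dataf))
  aics <- sapply(fits, AIC)
  names(aics)[which.min(aics)]
}

# Made-up data with quadratic growth (not real campaign results).
set.seed(7)
toy <- data.frame(n = seq(10, 200, by = 10))
toy$mean <- 5 + 0.01 * toy$n^2 + rnorm(nrow(toy), sd = 2)

print(best_fit(toy))
```

In practice `dataf` would be the filtered data frame built inside do_reg_analysis, and the plots remain worth checking before trusting an automatic choice.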

nb: the further a command is indented to the right, the worse the fit.

 do_reg_analysis(mean~log(n),"Synchronous","Col-a1","rounds",ring)
 do_reg_analysis(mean~log(n),"Synchronous","Col-a2","rounds",ring)
 do_reg_analysis(mean~log(n),"Synchronous","Col-a3","rounds",ring)
 do_reg_analysis(mean~log(n),"Synchronous","Col-a1","rounds",er)
 do_reg_analysis(mean~(n)+ I(n^2),"Synchronous","Col-a2","rounds",er)
 do_reg_analysis(mean~(n)+ I(n^2),"Synchronous","Col-a3","rounds",er)
do_reg_analysis(mean~log(n),"Synchronous","Col-a1","rounds",clique)
do_reg_analysis(mean~(n),"Synchronous","Col-a2","rounds",clique)
do_reg_analysis(mean~(n),"Synchronous","Col-a3","rounds",clique)
do_reg_analysis(mean~log(n),"Synchronous","Col-a1","steps",ring)
do_reg_analysis(mean~log(n),"Synchronous","Col-a2","steps",ring)
do_reg_analysis(mean~log(n),"Synchronous","Col-a3","steps",ring)
 do_reg_analysis(mean~log(n),"Synchronous","Col-a1","steps",er)
 do_reg_analysis(mean~(n),"Synchronous","Col-a2","steps",er)
 do_reg_analysis(mean~(n),"Synchronous","Col-a3","steps",er)
do_reg_analysis(mean~log(n),"Synchronous","Col-a1","steps",clique)
do_reg_analysis(mean~(n),"Synchronous","Col-a2","steps",clique)
do_reg_analysis(mean~(n),"Synchronous","Col-a3","steps",clique)
do_reg_analysis(mean~(n),"Synchronous","Col-a1","moves",ring)
do_reg_analysis(mean~(n),"Synchronous","Col-a2","moves",ring)
do_reg_analysis(mean~(n),"Synchronous","Col-a3","moves",ring)
do_reg_analysis(mean~(n),"Synchronous","Col-a1","moves",er)
do_reg_analysis(mean~n + I(n^2),"Synchronous","Col-a2","moves",er)
do_reg_analysis(mean~n + I(n^2),"Synchronous","Col-a3","moves",er)
do_reg_analysis(mean~(n),"Synchronous","Col-a1","moves",clique)
do_reg_analysis(mean~(n)+ I(n^2),"Synchronous","Col-a2","moves",clique)
do_reg_analysis(mean~(n)+ I(n^2),"Synchronous","Col-a3","moves",clique)
  do_reg_analysis(mean~log(n),"Distributed","Col-a1","rounds",ring)
  do_reg_analysis(mean~log(n),"Distributed","Col-a2","rounds",ring)
  do_reg_analysis(mean~log(n),"Distributed","Col-a3","rounds",ring)
   do_reg_analysis(mean~log(n),"Distributed","Col-a1","rounds",er)
 do_reg_analysis(mean~(n)+ I(n^2),"Distributed","Col-a2","rounds",er)
do_reg_analysis(mean~log(n),"Distributed","Col-a3","rounds",er)
 do_reg_analysis(mean~log(n),"Distributed","Col-a1","rounds",clique)
do_reg_analysis(mean~(n),"Distributed","Col-a2","rounds",clique)
do_reg_analysis(mean~(n),"Distributed","Col-a3","rounds",clique)
do_reg_analysis(mean~log(n),"Distributed","Col-a1","steps",ring)
do_reg_analysis(mean~log(n),"Distributed","Col-a2","steps",ring)
do_reg_analysis(mean~log(n),"Distributed","Col-a3","steps",ring)
do_reg_analysis(mean~log(n),"Distributed","Col-a1","steps",er)
do_reg_analysis(mean~(n),"Distributed","Col-a2","steps",er)
 do_reg_analysis(mean~log(n),"Distributed","Col-a3","steps",er)
do_reg_analysis(mean~log(n),"Distributed","Col-a1","steps",clique)
do_reg_analysis(mean~(n),"Distributed","Col-a2","steps",clique)
do_reg_analysis(mean~(n),"Distributed","Col-a3","steps",clique)
do_reg_analysis(mean~(n),"Distributed","Col-a1","moves",ring)
do_reg_analysis(mean~(n),"Distributed","Col-a2","moves",ring)
do_reg_analysis(mean~(n),"Distributed","Col-a3","moves",ring)
do_reg_analysis(mean~(n),"Distributed","Col-a1","moves",er)
do_reg_analysis(mean~(n)+ I(n^2),"Distributed","Col-a2","moves",er)
do_reg_analysis(mean~(n)+ I(n^2),"Distributed","Col-a3","moves",er)
do_reg_analysis(mean~(n),"Distributed","Col-a1","moves",clique)
do_reg_analysis(mean~(n)+ I(n^2),"Distributed","Col-a2","moves",clique)
do_reg_analysis(mean~(n)+ I(n^2),"Distributed","Col-a3","moves",clique)
  do_reg_analysis(mean~(n)+ I(n^2),"Locally Central","Col-a1","rounds",ring)
  do_reg_analysis(mean~log(n),"Locally Central","Col-a2","rounds",ring)
  do_reg_analysis(mean~log(n),"Locally Central","Col-a3","rounds",ring)
  do_reg_analysis(mean~log(n),"Locally Central","Col-a1","rounds",er)
  do_reg_analysis(mean~log(n),"Locally Central","Col-a2","rounds",er)
 do_reg_analysis(mean~log(n),"Locally Central","Col-a3","rounds",er)
        do_reg_analysis(mean~(n),"Locally Central","Col-a1","rounds",clique)
  do_reg_analysis(mean~log(n),"Locally Central","Col-a2","rounds",clique)
  do_reg_analysis(mean~log(n),"Locally Central","Col-a3","rounds",clique)
do_reg_analysis(mean~log(n),"Locally Central","Col-a1","steps",ring)
do_reg_analysis(mean~log(n),"Locally Central","Col-a2","steps",ring)
do_reg_analysis(mean~log(n),"Locally Central","Col-a3","steps",ring)
do_reg_analysis(mean~(n),"Locally Central","Col-a1","steps",er)
do_reg_analysis(mean~(n),"Locally Central","Col-a2","steps",er)
do_reg_analysis(mean~(n),"Locally Central","Col-a3","steps",er)
do_reg_analysis(mean~(n),"Locally Central","Col-a1","steps",clique)
do_reg_analysis(mean~(n),"Locally Central","Col-a2","steps",clique)
do_reg_analysis(mean~(n),"Locally Central","Col-a3","steps",clique)
do_reg_analysis(mean~(n),"Locally Central","Col-a1","moves",ring)
do_reg_analysis(mean~(n),"Locally Central","Col-a2","moves",ring)
do_reg_analysis(mean~(n),"Locally Central","Col-a3","moves",ring)
do_reg_analysis(mean~(n),"Locally Central","Col-a1","moves",er)
do_reg_analysis(mean~(n),"Locally Central","Col-a2","moves",er)
do_reg_analysis(mean~(n),"Locally Central","Col-a3","moves",er)
do_reg_analysis(mean~(n),"Locally Central","Col-a1","moves",clique)
do_reg_analysis(mean~(n),"Locally Central","Col-a2","moves",clique)
do_reg_analysis(mean~(n),"Locally Central","Col-a3","moves",clique)
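The 81 calls above cover every combination of 3 daemons, 3 algorithms, 3 complexity kinds and 3 topologies. As a sketch, the same combinations can be enumerated with expand.grid, for instance to run a first pass with a single formula before refining each case by hand (the variable names below are illustrative):

```r
daemons    <- c("Synchronous", "Distributed", "Locally Central")
algorithms <- c("Col-a1", "Col-a2", "Col-a3")
kinds      <- c("rounds", "steps", "moves")
topologies <- c("ring.data", "er.data", "clique.data")

# The first argument varies fastest, which reproduces the order of the list above.
grid <- expand.grid(algo = algorithms, topo = topologies, kind = kinds,
                    daemon = daemons, stringsAsFactors = FALSE)
print(nrow(grid))

# With do_reg_analysis defined and the .data files present, a first pass could be:
# for (i in seq_len(nrow(grid))) {
#   do_reg_analysis(mean ~ n, grid$daemon[i], grid$algo[i], grid$kind[i], grid$topo[i])
# }
```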
