Measuring the performance of lustre-mt

This tutorial illustrates the use of:

  • multi-task capabilities of the lv6 compiler.
  • recursive nodes
  • %expand:call% and %task:*% pragmas

The lv6 compiler is able to take advantage of multi-core architectures via the use of %task:*% pragmas at call sites (as explained in lustre-mt).

In order to perform an experiment demonstrating the benefit of this mechanism, let’s write a parametric Lustre program whose parameters make it easy to control the number of tasks and the amount of computation performed in each task.

To that end, here is a Lustre program that generates m (random) matrices of size n × n and computes the sum of their determinants:

∑_{i=0}^{m−1} det(M_i)

Transpose a matrix

To compute the determinant using the Laplace formula, we need to be able to transpose a matrix (to get a sub-matrix where a row and a column are removed). It may not be the most efficient way to compute it, but the goal is to perform heavy computations in tasks anyway.

function transpose<<const N:int; const M:int; type t>>(x:t^M^N) returns (y:t^N^M);
%expand:call%
let
  y =
  -- with N = 0 then []  -- No empty array in Lustre! Sigh.
  with N = 1 then
-- since scal_to_array is expanded, this forces to use the --expand-iterators option
-- which is the only way to expand map (and other iterators)
     map<<scal_to_array<<t>>, M >>(x[0])
  else
     add_vect_to_matrix <<N-1,M,t>>(x[0], transpose<<N-1,M,t>>(x[1..N-1]));
tel
function scal_to_array<<type t>>(x:t) returns (res:t^1);
%expand:call%
let
  res = [x];
tel
function add_vect_to_matrix<<const N:int; const M:int; type t>>(x:t^M; y:t^N^M)
returns (res:t^(N+1)^M);
%expand:call%
let
  res =
  with M = 1 then [ [ x[0]] | y[0] ]
  else            [ [ x[0]] | y[0] ]
                  | add_vect_to_matrix<<N,M-1,t>> (x[1..M-1], y[1..M-1]);
tel
function test_transpose(x:int^4^2) returns (res:int^2^4);
%expand:call%
let
  res = transpose<<2,4,int>>(x);
tel

function transpose_is_involutive(x:bool^4^2) returns (res:bool);
var
  y: bool^2^4;
  z: bool^4^2;
let
  y = transpose<<2,4,bool>>(x);
  z = transpose<<4,2,bool>>(y);
  res = ( x = z );
tel
-- The following works
--   lv6 transpose.lus -n transpose_is_involutive -ec -o proove_me.ec &&
--   lesar proove_me.ec transpose_is_involutive
-- so I guess transpose is correct.
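For readers without lesar at hand, the involution property can also be spot-checked directly in C (an independent sketch with fixed sizes, not the Lustre node):

```c
#include <assert.h>

/* Transpose a 2 x 4 matrix into a 4 x 2 one:
   element (i,j) of the result is element (j,i) of the argument. */
void transpose_2x4(const int x[2][4], int y[4][2]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 2; j++)
            y[i][j] = x[j][i];
}

void transpose_4x2(const int y[4][2], int z[2][4]) {
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            z[i][j] = y[j][i];
}

/* Transposing twice gives back the original matrix. */
int transpose_is_involutive(const int x[2][4]) {
    int y[4][2], z[2][4];
    transpose_2x4(x, y);
    transpose_4x2(y, z);
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            if (z[i][j] != x[i][j]) return 0;
    return 1;
}
```

Of course this only checks one pair of sizes, whereas the lesar proof above covers all boolean 2 × 4 inputs.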

Determinant

Now we can define a node that computes the determinant of square matrices of size n using the Laplace (recursive) formula (of O(n!) complexity):

∀ i ∈ {1, …, n}:  det(A) = ∑_{j=1}^{n} a_{i,j} (−1)^{i+j} det(A_{i,j})

where A_{i,j} is the (n−1) × (n−1) matrix obtained from A by removing row i and column j. This formula can be encoded straightforwardly in Lustre V6 using recursive nodes (again).
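As a concrete instance, expanding along the first row (i = 1) of a 3 × 3 matrix (the same one used as a test case in the OCaml section below):

```latex
\det \begin{pmatrix} -2 & -1 & 2 \\ 2 & 1 & 4 \\ -3 & 3 & -1 \end{pmatrix}
 = -2 \begin{vmatrix} 1 & 4 \\ 3 & -1 \end{vmatrix}
 - (-1) \begin{vmatrix} 2 & 4 \\ -3 & -1 \end{vmatrix}
 + 2 \begin{vmatrix} 2 & 1 \\ -3 & 3 \end{vmatrix}
 = -2(-13) + 1(10) + 2(9) = 54
```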

-- Compute the determinant of a matrix
function determinant<<const N:int>>(m : int^N^N) returns (res : int);
let
  res = with N = 2 then m[0][0] * m[1][1] - m[0][1] * m[1][0]
                   else iter_line<<N,0>>(1,m);

tel
function iter_line<<const N:int; const J:int>>(sign:int; m:int^N^N) returns (s : int);
%expand:call%
let
  s = with J = N then 0
      else sign * m[0][J] * determinant<<N-1>>(sub_matrix<<0,J,N>> (m)) + iter_line<<N,J+1>>(-sign, m);

tel

-- returns m without its i_th line and j_th column
--   using transpose is not the most efficient way to do that,
--   but here we are interested in heavy computations :)
function sub_matrix<<const I:int; const J:int; const N:int>>(m:int^N^N)
returns (res : int^(N-1)^(N-1));
%expand:call%
var
  t1: int ^ N ^ (N-1);
  t2: int ^ (N-1) ^ N;
  t3: int ^ (N-1) ^ (N-1);
let
  t1 = with I=0   then m[1..N-1] else
       with I=N-1 then m[0..N-2] else
                       m[0..I-1] | m[I+1..N-1];
  t2 = transpose<<N-1,N,int>>(t1);
  t3 = with J=0   then t2[1..N-1] else
       with J=N-1 then t2[0..N-2] else
                       t2[0..J-1] | t2[J+1..N-1];
  res = transpose<<N-1,N-1,int>>(t3);
tel

function test_determinant(m:int^3^3) returns (res:int);
let
  res = determinant<<3>>(m);
tel

Random matrices

To generate random matrices, we need a pseudo-random number generator (PRNG):

-- PRNG
-- A linear congruential generator
-- https://en.wikipedia.org/wiki/Linear_congruential_generator
-- X(n+1) = (a X(n) + c) mod m
--zx81: m=2^16+1, a=75, c=74

function lcg<<const M: int; const A: int; const C: int>>(x_n:int) returns (x_np1:int);
let
  x_np1 =  (A * x_n + C) mod M;
tel
node zx81 = lcg<<65537, 75, 74>>

-- a variant, where the memory is handled internally
node lcg2<<const M: int; const A: int; const C: int>>(init:int) returns (x:int);
let
  x = init -> (A * pre x + C) mod M;
tel
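As a sanity check of the recurrence, the zx81 instance can be unrolled in a few lines of C (a sketch; the seed 1 is arbitrary):

```c
#include <stdio.h>
#include <assert.h>

/* X(n+1) = (75 * X(n) + 74) mod 65537: the zx81 instance above */
int zx81_step(int x) {
    return (75 * x + 74) % 65537;
}

/* Print the first n draws from a given seed */
void zx81_unroll(int seed, int n) {
    int x = seed;
    for (int i = 0; i < n; i++) {
        x = zx81_step(x);
        printf("%d ", x);   /* from seed 1: 149 11249 57305 ... */
    }
    printf("\n");
}
```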

We can now generate random 3D matrices using the fill Lustre V6 iterator:

-- generate n x m x p matrices with random numbers
function random_3D<<const N:int; const M:int; const P:int>>(init : int) returns (x : int^N^M^P);
var accout : int;
let
  accout, x = fill<< fill<< fill<<zx81_acc; N>>; M >>; P>>(init);
tel
-- the fill node argument requires an extra output
function zx81_acc(accin : int) returns (accout, val : int);
%expand:call%
let
  accout = val;
  val = zx81(accin);
tel
function test_random_3D(i:int) returns (res:int^3^3^3);
let
  res = random_3D<<3,3,3>>(i);
tel
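The semantics of fill can be mimicked in C: the iterated function threads an accumulator through n calls, producing one array element per call (a sketch; fill_zx81 is an illustrative name, not part of any API):

```c
#include <assert.h>

/* Counterpart of the zx81_acc node: the new accumulator is the drawn value */
int zx81_acc_c(int acc_in, int *val) {
    *val = (75 * acc_in + 74) % 65537;
    return *val;            /* accout = val, as in the Lustre node */
}

/* C analogue of fill<<zx81_acc; n>>: thread the accumulator through
   n calls, collecting one produced value per iteration */
int fill_zx81(int acc0, int n, int out[]) {
    int acc = acc0;
    for (int i = 0; i < n; i++)
        acc = zx81_acc_c(acc, &out[i]);
    return acc;             /* the final accumulator */
}
```

Nesting three such loops, as the three nested fills do in random_3D, yields the random 3D matrix.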

The program

Now we have everything in hand to write the main node of our experiment:

  • generate M random square matrices of size N × N
  • compute the sum of their determinants, in M tasks:
-- Illustrate the use of --2c-multi-task
node multitask_gen<<
        const N:int; -- we work with matrices of size N x N
        const M: int -- controls the tasks number
>> (x:int) returns (res:int);
var
  big_3D_array : int ^ N ^ N ^ M;
let
  big_3D_array = random_3D<<N,N,M>>(x);
  res = sum_determinant<<M,N,M>>(big_3D_array);
tel

To compute the M determinants in M different tasks, and sum them, we can write yet another recursive node:

-- computes the sum of the determinants of t[i] for i=0 to n-1
-- Each determinant will be computed in its own task
node sum_determinant<<const M:int; const N:int; const I:int>>(t:int^N^N^M)
returns (res:int);
%expand:call%
var
  x : int;
let
  x = %MT:task% determinant_not_expanded<<N>>(t[I-1]);
  res = with I = 1 then x
                   else x + sum_determinant<<M,N,I-1>>(t) ;
tel
function determinant_not_expanded<<const N:int>>(m : int^N^N) returns (res : int);
let
  res = determinant<<N>>(m);
tel

Note the use of the %expand:call% pragma in most of the recursive nodes. Indeed, in order to put the computation of each determinant in its own task, we do not want the lv6 compiler to expand the calls to the determinant node (by default, lv6 expands nodes with static arguments). We therefore use the --do-not-expand-static-nodes option.

But if we don’t expand the nodes that manipulate matrices via (statically) recursive nodes, such as transpose, the memory required would be huge for big values of n. This is actually the reason why such parametric nodes are expanded by default by the compiler. Hence the use of %expand:call% pragmas.

Notice that this pragma needs to be placed in the node definition. Iterators (such as the map used in transpose) are predefined, but they can be expanded using the --expand-iterators option.

Executing multitask_gen using several cores

Let’s define a few instances of this generic node:

node multitask = multitask_gen<<4,5>>
node small = multitask_gen<<3,3>>
node big = multitask_gen<<10,20>>

To execute for instance the small node:

lv6  multitask.lus -n small --expand-iterators --do-not-expand-static-nodes --to-c-execute

or, equivalently, using the short versions of these CLI options:

lv6  multitask.lus -n small -ei -desn -2c-exec

In order to take advantage of task parallelism, we need to use the --2c-multi-task option (-2cmt for short):

lv6  multitask.lus -n small -ei -desn -2c-exec  -2cmt

An experiment

Let’s run this program with n = 10 and m = 1 to 49:

node e1 = multitask_gen<<10,1>>
node e2 = multitask_gen<<10,2>>
node e3 = multitask_gen<<10,3>>
node e4 = multitask_gen<<10,4>>
node e5 = multitask_gen<<10,5>>
node e6 = multitask_gen<<10,6>>
node e7 = multitask_gen<<10,7>>
node e8 = multitask_gen<<10,8>>
node e9 = multitask_gen<<10,9>>

node e10 = multitask_gen<<10,10>>
node e11 = multitask_gen<<10,11>>
node e12 = multitask_gen<<10,12>>
node e13 = multitask_gen<<10,13>>
node e14 = multitask_gen<<10,14>>
node e15 = multitask_gen<<10,15>>
node e16 = multitask_gen<<10,16>>
node e17 = multitask_gen<<10,17>>
node e18 = multitask_gen<<10,18>>
node e19 = multitask_gen<<10,19>>

node e20 = multitask_gen<<10,20>>
node e21 = multitask_gen<<10,21>>
node e22 = multitask_gen<<10,22>>
node e23 = multitask_gen<<10,23>>
node e24 = multitask_gen<<10,24>>
node e25 = multitask_gen<<10,25>>
node e26 = multitask_gen<<10,26>>
node e27 = multitask_gen<<10,27>>
node e28 = multitask_gen<<10,28>>
node e29 = multitask_gen<<10,29>>

node e30 = multitask_gen<<10,30>>
node e31 = multitask_gen<<10,31>>
node e32 = multitask_gen<<10,32>>
node e33 = multitask_gen<<10,33>>
node e34 = multitask_gen<<10,34>>
node e35 = multitask_gen<<10,35>>
node e36 = multitask_gen<<10,36>>
node e37 = multitask_gen<<10,37>>
node e38 = multitask_gen<<10,38>>
node e39 = multitask_gen<<10,39>>

node e40 = multitask_gen<<10,40>>
node e41 = multitask_gen<<10,41>>
node e42 = multitask_gen<<10,42>>
node e43 = multitask_gen<<10,43>>
node e44 = multitask_gen<<10,44>>
node e45 = multitask_gen<<10,45>>
node e46 = multitask_gen<<10,46>>
node e47 = multitask_gen<<10,47>>
node e48 = multitask_gen<<10,48>>
node e49 = multitask_gen<<10,49>>

Here is a small script that:

  • compiles a node with and without tasks (i.e., with and without -2cmt)
  • provides (10) inputs (used as PRNG seeds) to the corresponding executables
  • measures the wall-clock execution times and stores them in an org file.
export LC_ALL=C
#set -x -e
PROG=e$1
PROG_MT=e$1_mt
LOG=expe.log.org
ORG=expe.org

# run the program via lv6 -2c
echo "*** $PROG">> $LOG
echo $(uname -a) >> $LOG
d1=`date +%s%N`
# run 10 steps
# nb: the input is used as a seed for the PRNG
INPUTS="1 2 3 4 5 6 7 8 9 0"
TMP=$(echo "$INPUTS q" |./$PROG.exec)
step_nb=`echo $INPUTS | wc -w`
d2=`date +%s%N`
tdfs=$(((d2-d1)/${step_nb}))
tdfss=$((tdfs/1000000000))
tdfsms=$((tdfs/1000000%1000))
echo "=> the program ./$PROG.exec ran for $tdfss.$tdfsms s"
echo "=> the program ./$PROG.exec ran for $tdfss.$tdfsms s">> $LOG

# run the program via lv6 -2cmt
echo "*** $PROG_MT">> $LOG
echo $(uname -a) >> $LOG
d1=`date +%s%N`
TMP=$(echo "$INPUTS q" |./${PROG_MT}.exec)
d2=`date +%s%N`
tdfs_mt=$(((d2-d1)/${step_nb}))
tdfss_mt=$((tdfs_mt/1000000000))
tdfsms_mt=$((tdfs_mt/1000000%1000))
echo "=> the program ./${PROG_MT}.exec ran for $tdfss_mt.$tdfsms_mt s"
echo "=> the program ./${PROG_MT}.exec ran for $tdfss_mt.$tdfsms_mt s">> $LOG
echo "| $@ | $tdfss.$tdfsms | lv6 |">> $ORG
echo "| $@ | $tdfss_mt.$tdfsms_mt | \"lv6 -2cmt\" |">> $ORG
echo "$ORG has been modified"

Here is a Makefile to run this script on the 49 nodes:

TARGET=e1.run e2.run e3.run e4.run e5.run e6.run e7.run e8.run e9.run  \
      e10.run e11.run e12.run e13.run e14.run e15.run e16.run e17.run e18.run e19.run  \
      e20.run e21.run e22.run  e23.run  e24.run e25.run e26.run  e27.run e28.run e29.run \
      e30.run e31.run e32.run  e33.run  e34.run e35.run e36.run  e37.run e38.run e39.run \
      e40.run e41.run e42.run  e43.run  e44.run e45.run e46.run  e47.run e48.run e49.run

all: $(TARGET)

.PRECIOUS: %.exec
LV6=lv6 -2c -cc --expand-iterators --do-not-expand-static-nodes
%.exec:
	$(LV6) multitask.lus -n $* -o $*.exec -dir /tmp/multitask/$*/
	$(LV6) multitask.lus -n $* -o $*_mt.exec -2cmt  -dir /tmp/multitask/$*_mt/

e%.run:e%.exec
	./run-prog.sh $*

clean:
	rm *.exec *.c *.h *.yml *.sh

In order to launch the experiment, we can use a machine with several cores:

ssh malaval.veri # a 24-cores machine
make

In order to visualize the generated data, one can use this small R script:

library(dplyr)
library(ggplot2)

data <- data.frame(val=read.table("expe/expe.org"))
names(data) <- c("|1","n","|2","t","|3","compiler","|4")
data

plot <-
  ggplot(data, aes_string(x="n", y="t", shape="compiler", color="compiler")) +
  geom_line(size=1) + geom_point(size=3) +
  xlab("m") + ylab("Time (in sec)") +
  ggtitle("Execution time on 10x10 Random matrices, on a 24-cores machine") +
  theme_gray(base_size = 23)

pdf("expe/expe-mt.pdf")
print(plot)
dev.off()

that produces this graph:

Figure 1: Visualizing the experimentation results. The number of matrices is in abscissa, and the wall-clock execution time in ordinate.

Another experiment: assessing the (single-task) performance of the Lustre program

I wanted to compare the performance of this Lustre program with equivalent programs in OCaml and C (using a single core). This was not the purpose of the experiment, but the result is interesting.

OCaml version

Here is an OCaml program that mimics the Lustre version as much as possible.

let seed = 0
let x_n = ref seed
let lcg m a c () = x_n := (a * !x_n + c) mod m; !x_n
let zx81 = lcg 65537 75 74;;
let small_numbers = lcg 127 7 1;;
let random_3D n m p =
  let i = Array.init in
  i n (fun _ -> i m (fun _ -> i p (fun _ -> (*zx81*) small_numbers ())))

let transpose : 'a array array -> 'a array array = fun m ->
  let sx = Array.length m.(0) in
  let sy = Array.length m in
  let res = Array.make_matrix sx sy m.(0).(0) in
  for i = 0 to sx-1 do
    for j = 0 to sy-1 do
      res.(i).(j) <- m.(j).(i)
    done;
  done;
  res

let del_line i m =
  let n = Array.length m in
  let res = Array.make_matrix (n-1) n 0 in
  for ii = 0 to n-1 do
    if ii = i then () else
      let ii' = if ii>i then ii-1 else ii in
      res.(ii') <- m.(ii);
  done;
  res
(* a dumb version using transpose to mimic the lustre program *)
let del_line_col i j m = del_line i m |> transpose |> del_line j |> transpose

(* Using Laplace formula on the first column *)
let rec determinant : int array array -> int = fun m ->
  let n = Array.length m in
  assert (n>0);
  assert (n = Array.length m.(0));
  if n = 1 then m.(0).(0) else
  let res = ref 0 in
  let sign = ref 1 in
  for i = 0 to n-1 do
    res := !res + !sign * m.(i).(0) * determinant (del_line_col i 0 m);
    sign := - !sign
  done;
  !res
let m1  = [|[|-2; -1; 2|]; [|2; 1; 4|]; [|-3; 3; -1|]|];;
let m2  = [|[|1;2;3|]; [|4;5;6|]; [|7;8;9|]|];;
let m3  = [|[|5;6|]; [|8;9|]|];;
let _ =
  assert(determinant m1 = 54);
  assert(determinant m2 = 0);
  assert(determinant m3 = -3)

let pm = (* print 3D matrices *)
  let i = Array.iter in
  let nl = print_newline in
  i (fun row -> i (fun col -> i (fun v -> Printf.printf "%d " v) col; nl ()) row; nl ())

let sum_determinant m =
  let n = Array.length m in
  let res = ref 0 in
  for i = 0 to n-1 do
    res := !res + determinant m.(i)
    done;
    !res
let main n m =
  let m = random_3D m n n in
  pm m;
  Printf.printf "The sum of determinants is %d\n" (sum_determinant m);;

main (int_of_string Sys.argv.(1)) (int_of_string Sys.argv.(2))

C version

Well, I have to admit that the C version below was AI-generated from the OCaml one (shame on me). It got it wrong at first, though :)

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

// Pseudo-random number generator (LCG)
int x_n = 0; // Initial seed
int lcg(int m, int a, int c) {
    x_n = (a * x_n + c) % m;
    return x_n;
}

// Generator for small random numbers
int small_numbers() {
    return lcg(127, 7, 1);
}

// Generate a 3D matrix of size x, y, z
int*** random_3D(int x, int y, int z) {
    int*** arr = malloc(x * sizeof(int**));
    for (int i = 0; i < x; i++) {
        arr[i] = malloc(y * sizeof(int*));
        for (int j = 0; j < y; j++) {
            arr[i][j] = malloc(z * sizeof(int));
            for (int k = 0; k < z; k++) {
                arr[i][j][k] = small_numbers();
            }
        }
    }
    return arr;
}

// Transpose a matrix of size n x m_len
int** transpose(int** m, int n, int m_len) {
    int** res = malloc(m_len * sizeof(int*));  // m_len rows, n columns
    for (int i = 0; i < m_len; i++) {
        res[i] = malloc(n * sizeof(int));  // n rows
        for (int j = 0; j < n; j++) {
            res[i][j] = m[j][i];  // Swap rows and columns
        }
    }
    return res;
}

// Delete a row from a matrix of size n x m_len
int** del_line(int i, int** m, int n, int m_len) {
    // New matrix will have (n-1) rows and m_len columns
    int** res = malloc((n - 1) * sizeof(int*));
    for (int ii = 0; ii < n - 1; ii++) {
        res[ii] = malloc(m_len * sizeof(int));  // m_len columns
        if (ii < i) {
            // Copy the row ii as it is
            for (int jj = 0; jj < m_len; jj++) {
                res[ii][jj] = m[ii][jj];
            }
        } else {
            // Copy the row ii+1 (shift rows)
            for (int jj = 0; jj < m_len; jj++) {
                res[ii][jj] = m[ii + 1][jj];
            }
        }
    }
    return res;
}

// Delete a row and a column from a matrix of size n x m_len
int** del_line_col(int i, int j, int** m, int n, int m_len) {
    // 1. Delete the row i from the matrix
    int** temp = del_line(i, m, n, m_len);

    // 2. Transpose the resulting matrix of size (n-1) x m_len
    int** temp_transposed = transpose(temp, n - 1, m_len);

    // 3. Delete the column j in the transposed matrix
    int** result = del_line(j, temp_transposed, m_len, n - 1);

    // Free memory for temporary matrices
    for (int ii = 0; ii < n - 1; ii++) {
        free(temp[ii]);
        free(temp_transposed[ii]);
    }
    free(temp);
    free(temp_transposed);

    return result;
}

// Calculate the determinant using Laplace's expansion for square matrices (n x n)
int determinant(int** m, int n) {
    if (n == 1) {
        return m[0][0];  // Base case for 1x1 matrix
    }

    int res = 0;
    int sign = 1;
    for (int i = 0; i < n; i++) {
        int** submatrix = del_line_col(i, 0, m, n, n);  // Remove row i and column 0
        res += sign * m[i][0] * determinant(submatrix, n - 1);  // Recursive calculation
        sign = -sign;  // Alternate sign for cofactor expansion
        for (int ii = 0; ii < n - 1; ii++) {
            free(submatrix[ii]);  // Free submatrix memory
        }
        free(submatrix);
    }
    return res;
}

// Function to print a matrix of size n x m_len
void print_matrix(int** matrix, int n, int m_len) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < m_len; j++) {
            printf("%d ", matrix[i][j]);
        }
        printf("\n");
    }
}

// Function to calculate the sum of determinants for a 3D matrix
int sum_determinant(int*** m, int x, int y, int z) {
    int res = 0;
    for (int i = 0; i < x; i++) {
        res += determinant(m[i], y);  // Calculate determinant for each 2D slice
    }
    return res;
}

// Main function
int main(int argc, char* argv[]) {
    if (argc < 3) {
        printf("Usage: %s <x> <y>\n", argv[0]);  // Ensure correct input
        return 1;
    }

    int x = atoi(argv[1]);
    int y = atoi(argv[2]);

    // Generate a 3D matrix
    int*** matrix = random_3D(x, y, y);

    // Print the generated matrix
    print_matrix(matrix[0], y, y);

    // Calculate the sum of determinants
    int sum = sum_determinant(matrix, x, y, y);
    printf("The sum of determinants is %d\n", sum);

    // Free the allocated memory for the 3D matrix
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            free(matrix[i][j]);
        }
        free(matrix[i]);
    }
    free(matrix);

    return 0;
}

Result

Let’s compile and run all those programs (using 10 matrices of size 10):

cd files/lustre-mt-expe/

ocamlopt multitask.ml -o multitask-ocaml
time ./multitask-ocaml 10 10

lv6 -2c multitask.lus -n e10 --expand-iterators --do-not-expand-static-nodes --remove-nested-calls
export C_COMPILER="gcc -O2"; sh e10.sh ; mv e10.exec e10-O2.exec
export C_COMPILER="gcc"    ; sh e10.sh

echo " 0 q" | time ./e10.exec
echo " 0 q" | time ./e10-O2.exec

gcc multitask.c -o multitask-c
gcc -O2 multitask.c -o multitask-c-O2
time ./multitask-c 10 10
time ./multitask-c-O2 10 10

\ls -lhS multitask-ocaml multitask-c-O2 multitask-c e10.exec e10-O2.exec

Here are the timing results:

| language  | compilation   | time   |
|-----------+---------------+--------|
| OCaml     | ocamlopt      | 15.1 s |
| C         | gcc           | 12 s   |
| C         | gcc -O2       | 10 s   |
| Lustre V6 | lv6 + gcc     | 6.5 s  |
| Lustre V6 | lv6 + gcc -O2 | 1.2 s  |

The Lustre program is faster because array accesses are resolved statically. Of course, it should be possible to specialize the C program and inline the arrays to get similar (or better) performance. On the downside, the resulting Lustre executable is bigger, and its size grows with the parameters (matrix size and number of matrices):

$ ls -lhS multitask-ocaml multitask-c-O2 multitask-c  e10-O2.exec e10.exec
-rwxrwxr-x 1 jahier jahier 1,8M déc.  10 13:55 multitask-ocaml
-rwxrwxr-x 1 jahier jahier 769K déc.  10 15:18 e10.exec
-rwxrwxr-x 1 jahier jahier 553K déc.  10 15:14 e10-O2.exec
-rwxrwxr-x 1 jahier jahier  17K déc.  10 15:13 multitask-c-O2
-rwxrwxr-x 1 jahier jahier  17K déc.  10 14:56 multitask-c