Measuring the performance of lustre-mt
This tutorial illustrates the use of:
- the multi-task capabilities of the lv6 compiler
- recursive nodes
- the %expand:call% and %task:*% pragmas
The lv6 compiler is able to take advantage of multi-core
architectures via the use of %task:*% pragmas at call sites (as
explained in lustre-mt).
In order to demonstrate the benefit of this mechanism, let's write a parametric Lustre program whose parameters make it easy to control the number of tasks and the amount of computation performed in each task.
To that end, here is a Lustre program that generates m (random)
matrices of size n × n, and computes the sum of their
determinants:
∑_{i=0}^{m-1} det(M_i)
Transpose a matrix
To compute the determinant using the Laplace formula, we need to be able to transpose a matrix (to get a sub-matrix where a line and a column are removed). It may not be the most efficient way to compute it, but the goal is to perform heavy computations in tasks anyway.
function transpose<<const N:int; const M:int; type t>>(x:t^M^N) returns (y:t^N^M);
%expand:call%
let
y =
-- with N = 0 then [] -- No empty array in Lustre! Sigh.
with N = 1 then
-- since scal_to_array is expanded, this forces to use the --expand-iterators option
-- which is the only way to expand map (and other iterators)
map<<scal_to_array<<t>>, M >>(x[0])
else
add_vect_to_matrix <<N-1,M,t>>(x[0], transpose<<N-1,M,t>>(x[1..N-1]));
tel
function scal_to_array<<type t>>(x:t) returns (res:t^1);
%expand:call%
let
res = [x];
tel
function add_vect_to_matrix<<const N:int; const M:int; type t>>(x:t^M; y:t^N^M)
returns (res:t^(N+1)^M);
%expand:call%
let
res =
with M = 1 then [ [ x[0]] | y[0] ]
else [ [ x[0]] | y[0] ]
| add_vect_to_matrix<<N,M-1,t>> (x[1..M-1], y[1..M-1]);
tel
function test_transpose(x:int^4^2) returns (res:int^2^4);
%expand:call%
let
res = transpose<<2,4,int>>(x);
tel
function transpose_is_involutive(x:bool^4^2) returns (res:bool);
var
y: bool^2^4;
z: bool^4^2;
let
y = transpose<<2,4,bool>>(x);
z = transpose<<4,2,bool>>(y);
res = ( x = z );
tel
-- The following works
-- lv6 transpose.lus -n transpose_is_involutive -ec -o proove_me.ec &&
-- lesar proove_me.ec transpose_is_involutive
-- so I guess transpose is correct.
Determinant
Now we can define a node that computes the determinant of square matrices of size n using the Laplace (recursive) formula (of O(n!) complexity):
∀ i ∈ {1, …, n}:  det(A) = ∑_{j=1}^{n} a_{i,j} (−1)^{i+j} det(A_{i,j})
where the (n−1)×(n−1) matrix A_{i,j} is made of the matrix A where line i and column j have been removed. This formula can be encoded in Lustre V6 straightforwardly using recursive nodes (again).
-- Compute the determinant of a matrix
function determinant<<const N:int>>(m : int^N^N) returns (res : int);
let
res = with N = 2 then m[0][0] * m[1][1] - m[0][1] * m[1][0]
else iter_line<<N,0>>(1,m);
tel
function iter_line<<const N:int; const J:int>>(sign:int; m:int^N^N) returns (s : int);
%expand:call%
let
s = with J = N then 0
else sign * m[0][J] * determinant<<N-1>>(sub_matrix<<0,J,N>> (m)) + iter_line<<N,J+1>>(-sign, m);
tel
-- returns m without its i_th line and j_th column
-- using transpose is not the most efficient way to do that,
-- but here we are interested in heavy computations :)
function sub_matrix<<const I:int; const J:int; const N:int>>(m:int^N^N)
returns (res : int^(N-1)^(N-1));
%expand:call%
var
t1: int ^ N ^ (N-1);
t2: int ^ (N-1) ^ N;
t3: int ^ (N-1) ^ (N-1);
let
t1 = with I=0 then m[1..N-1] else
with I=N-1 then m[0..N-2] else
m[0..I-1] | m[I+1..N-1];
t2 = transpose<<N-1,N,int>>(t1);
t3 = with J=0 then t2[1..N-1] else
with J=N-1 then t2[0..N-2] else
t2[0..J-1] | t2[J+1..N-1];
res = transpose<<N-1,N-1,int>>(t3);
tel
function test_determinant(m:int^3^3) returns (res:int);
let
res = determinant<<3>>(m);
tel
Random matrices
To generate random matrices, we need a Pseudo-Random Number Generator (PRNG):
-- PRNG
-- A linear congruential generator
-- https://en.wikipedia.org/wiki/Linear_congruential_generator
-- X(n+1) = (a X(n) + c) mod m
--zx81: m=2^16+1, a=75, c=74
function lcg<<const M: int; const A: int; const C: int>>(x_n:int) returns (x_np1:int);
let
x_np1 = (A * x_n + C) mod M;
tel
node zx81 = lcg<<65537, 75, 74>>
-- a variant, where the memory is handled internally
node lcg2<<const M: int; const A: int; const C: int>>(init:int) returns (x:int);
let
x = init -> (A * pre x + C) mod M;
tel
Hence we can now generate random 3D matrices using the fill Lustre V6 iterator:
-- generate n x m x p matrices with random numbers
function random_3D<<const N:int; const M:int; const P:int>>(init : int) returns (x : int^N^M^P);
var accout : int;
let
accout, x = fill<< fill<< fill<<zx81_acc; N>>; M >>; P>>(init);
tel
-- the fill node argument requires an extra output
function zx81_acc(accin : int) returns (accout, val : int);
%expand:call%
let
accout = val;
val = zx81(accin);
tel
function test_random_3D(i:int) returns (res:int^3^3^3);
let
res = random_3D<<3,3,3>>(i);
tel
The program
Now we have everything in hand to write the main node of our experiment:
- generate M random square matrices of size N
- compute the sum of their determinants, in M tasks:
-- Illustrate the use of --2c-multi-task
node multitask_gen<<
const N:int; -- we work with matrices of size N x N
const M: int -- controls the tasks number
>> (x:int) returns (res:int);
var
big_3D_array : int ^ N ^ N ^ M;
let
big_3D_array = random_3D<<N,N,M>>(x);
res = sum_determinant<<M,N,M>>(big_3D_array);
tel
To compute the M determinants in M different tasks, and sum
them, we can write yet another recursive node:
-- computes the sum of the determinants of t[i] for i=0 to n-1
-- Each determinant will be computed in its own task
node sum_determinant<<const M:int; const N:int; const I:int>>(t:int^N^N^M)
returns (res:int);
%expand:call%
var
x : int;
let
x = %MT:task% determinant_not_expanded<<N>>(t[I-1]);
res = with I = 1 then x
else x + sum_determinant<<M,N,I-1>>(t) ;
tel
function determinant_not_expanded<<const N:int>>(m : int^N^N) returns (res : int);
let
res = determinant<<N>>(m);
tel
Note the use of the %expand:call% pragma in most of the recursive
nodes. Indeed, to be able to put the computation of the determinants
in different tasks, we do not want the lv6 compiler to expand the
call to the determinant node (by default, lv6 expands nodes with
static arguments). We will therefore use the
--do-not-expand-static-nodes option.
But if we don’t expand nodes that manipulate matrices via
(statically) recursive node, such as transpose, the necessary
memory would be huge for big values of n. This is actually the
reason why such parametric nodes are expanded by default by the
compiler. Hence the use of %expand:call% pragmas.
Notice that this pragma needs to be placed in the node
definition. Iterators (such as map that is used in transpose)
are predefined; but they can be expanded using the
--expand-iterators option.
Executing multitask_gen using several cores
Let’s define a few instances of this generic node:
node multitask = multitask_gen<<4,5>>
node small = multitask_gen<<3,3>>
node big = multitask_gen<<10,20>>
To execute for instance the small node:
lv6 multitask.lus -n small --expand-iterators --do-not-expand-static-nodes --to-c-execute
or, equivalently, using CLI option short versions:
lv6 multitask.lus -n small -ei -desn -2c-exec
In order to take advantage of task parallelism, we need to use the
--2c-multi-task option, or -2cmt for short:
lv6 multitask.lus -n small -ei -desn -2c-exec -2cmt
An experiment
Let's run this program with n = 10 and m = 1 to 49:
node e1 = multitask_gen<<10,1>>
node e2 = multitask_gen<<10,2>>
node e3 = multitask_gen<<10,3>>
node e4 = multitask_gen<<10,4>>
node e5 = multitask_gen<<10,5>>
node e6 = multitask_gen<<10,6>>
node e7 = multitask_gen<<10,7>>
node e8 = multitask_gen<<10,8>>
node e9 = multitask_gen<<10,9>>
node e10 = multitask_gen<<10,10>>
node e11 = multitask_gen<<10,11>>
node e12 = multitask_gen<<10,12>>
node e13 = multitask_gen<<10,13>>
node e14 = multitask_gen<<10,14>>
node e15 = multitask_gen<<10,15>>
node e16 = multitask_gen<<10,16>>
node e17 = multitask_gen<<10,17>>
node e18 = multitask_gen<<10,18>>
node e19 = multitask_gen<<10,19>>
node e20 = multitask_gen<<10,20>>
node e21 = multitask_gen<<10,21>>
node e22 = multitask_gen<<10,22>>
node e23 = multitask_gen<<10,23>>
node e24 = multitask_gen<<10,24>>
node e25 = multitask_gen<<10,25>>
node e26 = multitask_gen<<10,26>>
node e27 = multitask_gen<<10,27>>
node e28 = multitask_gen<<10,28>>
node e29 = multitask_gen<<10,29>>
node e30 = multitask_gen<<10,30>>
node e31 = multitask_gen<<10,31>>
node e32 = multitask_gen<<10,32>>
node e33 = multitask_gen<<10,33>>
node e34 = multitask_gen<<10,34>>
node e35 = multitask_gen<<10,35>>
node e36 = multitask_gen<<10,36>>
node e37 = multitask_gen<<10,37>>
node e38 = multitask_gen<<10,38>>
node e39 = multitask_gen<<10,39>>
node e40 = multitask_gen<<10,40>>
node e41 = multitask_gen<<10,41>>
node e42 = multitask_gen<<10,42>>
node e43 = multitask_gen<<10,43>>
node e44 = multitask_gen<<10,44>>
node e45 = multitask_gen<<10,45>>
node e46 = multitask_gen<<10,46>>
node e47 = multitask_gen<<10,47>>
node e48 = multitask_gen<<10,48>>
node e49 = multitask_gen<<10,49>>
Here is a small script that:
- compiles a node with and without tasks (i.e., with and without -2cmt)
- provides 10 inputs (used as PRNG seeds) to the corresponding executables
- measures the wall-clock execution times and stores them in an org file.
export LC_ALL=C
#set -x -e
PROG=e$1
PROG_MT=e$1_mt
LOG=expe.log.org
ORG=expe.org
# run the program via lv6 -2c
echo "*** $PROG">> $LOG
echo $(uname -a) >> $LOG
d1=`date +%s%N`
# run 10 steps
# nb: the input is used as a seed for the PRNGs
INPUTS="1 2 3 4 5 6 7 8 9 0"
TMP=$(echo "$INPUTS q" |./$PROG.exec)
step_nb=`echo $INPUTS | wc -w`
d2=`date +%s%N`
tdfs=$(((d2-d1)/${step_nb}))
tdfss=$((tdfs/1000000000))
tdfsms=$(printf "%03d" $((tdfs/1000000%1000))) # zero-pad the milliseconds
echo "=> the program ./$PROG.exec ran for $tdfss.$tdfsms s"
echo "=> the program ./$PROG.exec ran for $tdfss.$tdfsms s">> $LOG
# run the program via lv6 -2cmt
echo "*** $PROG_MT">> $LOG
echo $(uname -a) >> $LOG
d1=`date +%s%N`
TMP=$(echo "$INPUTS q" |./${PROG_MT}.exec)
d2=`date +%s%N`
tdfs_mt=$(((d2-d1)/${step_nb}))
tdfss_mt=$((tdfs_mt/1000000000))
tdfsms_mt=$(printf "%03d" $((tdfs_mt/1000000%1000))) # zero-pad the milliseconds
echo "=> the program ./${PROG_MT}.exec ran for $tdfss_mt.$tdfsms_mt s"
echo "=> the program ./${PROG_MT}.exec ran for $tdfss_mt.$tdfsms_mt s">> $LOG
echo "| $@ | $tdfss.$tdfsms | lv6 |">> $ORG
echo "| $@ | $tdfss_mt.$tdfsms_mt | \"lv6 -2cmt\" |">> $ORG
echo "$ORG has been modified"
Here is a Makefile to run this script on the 49 nodes:
TARGET=e1.run e2.run e3.run e4.run e5.run e6.run e7.run e8.run e9.run \
e10.run e11.run e12.run e13.run e14.run e15.run e16.run e17.run e18.run e19.run \
e20.run e21.run e22.run e23.run e24.run e25.run e26.run e27.run e28.run e29.run \
e30.run e31.run e32.run e33.run e34.run e35.run e36.run e37.run e38.run e39.run \
e40.run e41.run e42.run e43.run e44.run e45.run e46.run e47.run e48.run e49.run
all: $(TARGET)
.PRECIOUS: %.exec
LV6=lv6 -2c -cc --expand-iterators --do-not-expand-static-nodes
%.exec:
$(LV6) multitask.lus -n $* -o $*.exec -dir /tmp/multitask/$*/
$(LV6) multitask.lus -n $* -o $*_mt.exec -2cmt -dir /tmp/multitask/$*_mt/
e%.run:e%.exec
./run-prog.sh $*
clean:
rm *.exec *.c *.h *.yml *.sh
In order to launch the experiment, we can use a machine with several cores:
ssh malaval.veri # a 24-cores machine
make
In order to visualize the generated data, one can use this small R script:
library(dplyr)
library(ggplot2)
data <- data.frame(val=read.table("expe/expe.org"))
names(data) <- c("|1","n","|2","t","|3","compiler","|4")
data
plot <-
ggplot(data, aes_string(x="n", y="t", shape="compiler", color="compiler")) +
geom_line(size=1) + geom_point(size=3) +
xlab("m") + ylab("Time (in sec)") +
ggtitle("Execution time on 10x10 Random matrices, on a 24-cores machine") +
theme_gray(base_size = 23)
pdf("expe/expe-mt.pdf")
print(plot)
dev.off()
that produces this graph:
Figure 1: Visualizing the experimentation results. The number of matrices is in abscissa, and the wall-clock execution time in ordinate.
Another experiment: assessing the (single-task) performance of the Lustre program
I wanted to compare the performance of this Lustre program with equivalent programs in OCaml and C (using a single core). That was not the purpose of this experiment, but the result is interesting.
OCaml version
Here is an OCaml program that mimics the Lustre version as much as possible.
let seed = 0
let x_n = ref seed
let lcg m a c () = x_n := (a * !x_n + c) mod m; !x_n
let zx81 = lcg 65537 75 74;;
let small_numbers = lcg 127 7 1;;
let random_3D n m p =
let i = Array.init in
i n (fun _ -> i m (fun _ -> i p (fun _ -> (*zx81*) small_numbers ())))
let transpose : 'a array array -> 'a array array = fun m ->
let sx = Array.length m.(0) in
let sy = Array.length m in
let res = Array.make_matrix sx sy m.(0).(0) in
for i = 0 to sx-1 do
for j = 0 to sy-1 do
res.(i).(j) <- m.(j).(i)
done;
done;
res
let del_line i m =
let n = Array.length m in
let res = Array.make_matrix (n-1) n 0 in
for ii = 0 to n-1 do
if ii = i then () else
let ii' = if ii>i then ii-1 else ii in
res.(ii') <- m.(ii);
done;
res
(* a dumb version using transpose to mimick the lustre program *)
let del_line_col i j m = del_line i m |> transpose |> del_line j |> transpose
(* Using Laplace formula on the first column *)
let rec determinant : int array array -> int = fun m ->
let n = Array.length m in
assert (n>0);
assert (n = Array.length m.(0));
if n = 1 then m.(0).(0) else
let res = ref 0 in
let sign = ref 1 in
for i = 0 to n-1 do
res := !res + !sign * m.(i).(0) * determinant (del_line_col i 0 m);
sign := - !sign
done;
!res
let m1 = [|[|-2; -1; 2|]; [|2; 1; 4|]; [|-3; 3; -1|]|];;
let m2 = [|[|1;2;3|]; [|4;5;6|]; [|7;8;9|]|];;
let m3 = [|[|5;6|]; [|8;9|]|];;
let _ =
assert(determinant m1 = 54);
assert(determinant m2 = 0);
assert(determinant m3 = -3)
let pm = (* print 3D matrices *)
let i = Array.iter in
let nl = print_newline in
i (fun row -> i (fun col -> i (fun v -> Printf.printf "%d " v) col; nl ()) row; nl ())
let sum_determinant m =
let n = Array.length m in
let res = ref 0 in
for i = 0 to n-1 do
res := !res + determinant m.(i)
done;
!res
let main n m =
let m = random_3D m n n in
pm m;
Printf.printf "The sum of determinants is %d\n" (sum_determinant m);;
main (int_of_string Sys.argv.(1)) (int_of_string Sys.argv.(2))
C version
Well, I have to admit that the C version below was AI-generated from the OCaml one (shame on me). It got it wrong at first, though :)
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
// Pseudo-random number generator (LCG)
int x_n = 0; // Initial seed
int lcg(int m, int a, int c) {
x_n = (a * x_n + c) % m;
return x_n;
}
// Generator for small random numbers
int small_numbers() {
return lcg(127, 7, 1);
}
// Generate a 3D matrix of size x, y, z
int*** random_3D(int x, int y, int z) {
int*** arr = malloc(x * sizeof(int**));
for (int i = 0; i < x; i++) {
arr[i] = malloc(y * sizeof(int*));
for (int j = 0; j < y; j++) {
arr[i][j] = malloc(z * sizeof(int));
for (int k = 0; k < z; k++) {
arr[i][j][k] = small_numbers();
}
}
}
return arr;
}
// Transpose a matrix of size n x m_len
int** transpose(int** m, int n, int m_len) {
int** res = malloc(m_len * sizeof(int*)); // m_len rows, n columns
for (int i = 0; i < m_len; i++) {
res[i] = malloc(n * sizeof(int)); // n rows
for (int j = 0; j < n; j++) {
res[i][j] = m[j][i]; // Swap rows and columns
}
}
return res;
}
// Delete a row from a matrix of size n x m_len
int** del_line(int i, int** m, int n, int m_len) {
// New matrix will have (n-1) rows and m_len columns
int** res = malloc((n - 1) * sizeof(int*));
for (int ii = 0; ii < n - 1; ii++) {
res[ii] = malloc(m_len * sizeof(int)); // m_len columns
if (ii < i) {
// Copy the row ii as it is
for (int jj = 0; jj < m_len; jj++) {
res[ii][jj] = m[ii][jj];
}
} else {
// Copy the row ii+1 (shift rows)
for (int jj = 0; jj < m_len; jj++) {
res[ii][jj] = m[ii + 1][jj];
}
}
}
return res;
}
// Delete a row and a column from a matrix of size n x m_len
int** del_line_col(int i, int j, int** m, int n, int m_len) {
// 1. Delete the row i from the matrix
int** temp = del_line(i, m, n, m_len);
// 2. Transpose the resulting matrix of size (n-1) x m_len
int** temp_transposed = transpose(temp, n - 1, m_len);
// 3. Delete the column j in the transposed matrix
int** result = del_line(j, temp_transposed, m_len, n - 1);
// Free memory for temporary matrices
for (int ii = 0; ii < n - 1; ii++) {
free(temp[ii]);
free(temp_transposed[ii]);
}
free(temp);
free(temp_transposed);
return result;
}
// Calculate the determinant using Laplace's expansion for square matrices (n x n)
int determinant(int** m, int n) {
if (n == 1) {
return m[0][0]; // Base case for 1x1 matrix
}
int res = 0;
int sign = 1;
for (int i = 0; i < n; i++) {
int** submatrix = del_line_col(i, 0, m, n, n); // Remove row i and column 0
res += sign * m[i][0] * determinant(submatrix, n - 1); // Recursive calculation
sign = -sign; // Alternate sign for cofactor expansion
for (int ii = 0; ii < n - 1; ii++) {
free(submatrix[ii]); // Free submatrix memory
}
free(submatrix);
}
return res;
}
// Function to print a matrix of size n x m_len
void print_matrix(int** matrix, int n, int m_len) {
for (int i = 0; i < n; i++) {
for (int j = 0; j < m_len; j++) {
printf("%d ", matrix[i][j]);
}
printf("\n");
}
}
// Function to calculate the sum of determinants for a 3D matrix
int sum_determinant(int*** m, int x, int y, int z) {
int res = 0;
for (int i = 0; i < x; i++) {
res += determinant(m[i], y); // Calculate determinant for each 2D slice
}
return res;
}
// Main function
int main(int argc, char* argv[]) {
if (argc < 3) {
printf("Usage: %s <x> <y>\n", argv[0]); // Ensure correct input
return 1;
}
int x = atoi(argv[1]);
int y = atoi(argv[2]);
// Generate a 3D matrix
int*** matrix = random_3D(x, y, y);
// Print the generated matrix
print_matrix(matrix[0], y, y);
// Calculate the sum of determinants
int sum = sum_determinant(matrix, x, y, y);
printf("The sum of determinants is %d\n", sum);
// Free the allocated memory for the 3D matrix
for (int i = 0; i < x; i++) {
for (int j = 0; j < y; j++) {
free(matrix[i][j]);
}
free(matrix[i]);
}
free(matrix);
return 0;
}
Result
Let’s compile and run all those programs (using 10 matrices of size 10):
cd files/lustre-mt-expe/
ocamlopt multitask.ml -o multitask-ocaml
time ./multitask-ocaml 10 10
lv6 -2c multitask.lus -n e10 --expand-iterators --do-not-expand-static-nodes --remove-nested-calls
export C_COMPILER="gcc -O2"; sh e10.sh ; mv e10.exec e10-O2.exec
export C_COMPILER="gcc" ; sh e10.sh
echo " 0 q" | time ./e10.exec
echo " 0 q" | time ./e10-O2.exec
gcc multitask.c -o multitask-c
gcc -O2 multitask.c -o multitask-c-O2
time ./multitask-c 10 10
time ./multitask-c-O2 10 10
\ls -lhS multitask-ocaml multitask-c-O2 multitask-c e10.exec e10-O2.exec
Here are the timing results:
| language | compilation | time |
|---|---|---|
| OCaml | ocamlopt | 15.1 s |
| C | gcc | 12 s |
| C | gcc -O2 | 10 s |
| Lustre V6 | lv6 + gcc | 6.5 s |
| Lustre V6 | lv6 + gcc -O2 | 1.2 s |
The Lustre program is faster because array accesses are resolved statically. But of course, it should be possible to specialize the C program and inline arrays to get similar (or better) performance. Moreover, on the downside, the resulting Lustre executable is bigger, and its size grows with the parameters (matrix size and number of matrices):
$ ls -lhS multitask-ocaml multitask-c-O2 multitask-c e10-O2.exec e10.exec
-rwxrwxr-x 1 jahier jahier 1,8M déc. 10 13:55 multitask-ocaml
-rwxrwxr-x 1 jahier jahier 769K déc. 10 15:18 e10.exec
-rwxrwxr-x 1 jahier jahier 553K déc. 10 15:14 e10-O2.exec
-rwxrwxr-x 1 jahier jahier 17K déc. 10 15:13 multitask-c-O2
-rwxrwxr-x 1 jahier jahier 17K déc. 10 14:56 multitask-c