What does elaps’d do for me?
Consider the following C++ code:
```cpp
#include <omp.h>
#define N 1024

// Defined elsewhere
void initialize(double *x, double *y, double *z, int n);
void printResult(double *z);

int main(void) {

  double x[N], y[N], z[N];
  int i;

  // Do initializations
  initialize(x, y, z, N);

  #pragma omp parallel for
  for (i = 0; i < N; i++) {
    z[i] = x[i] + y[i];
  }

  printResult(z);

  return 0;
}
```
Alright, this works and nicely adds two vectors of size N. So far, so good. But performance is everything, right? Assume, for instance, that you want to know how long the loop runs and possibly also how long each thread computes. These are your options:
- Use a profiling tool like gprof or perf. This will work, no question, but the information you get might be too much, and you'll have to dig into its output asking yourself who calls whom, when, and so on.
- Use a timing function, e.g. omp_get_wtime, which you have to call twice and whose return values you then have to handle. This is quite easy, but if you also want detailed per-thread information, you are on your own placing calls and merging outputs.
- Use something like Intel's VTune. Well, this works (if you have a license for it), but it might be overkill.
Another point is that you might want a graphical analysis, which you won't get with gprof or omp_get_wtime.
You: Isn’t there an easy-to-use tool for this purpose?
Yes, there is. And this tool is elaps’d. To make a measurement, we slightly extend the code:
```cpp
#include <omp.h>
#define N 1024

#include <elapsd/elapsd.h> //< Include this

// Defined elsewhere
void initialize(double *x, double *y, double *z, int n);
void printResult(double *z);

int main(void) {

  double x[N], y[N], z[N];
  int i;

  // Do initializations
  initialize(x, y, z, N);

  /* Initialize elaps'd */
  ENHANCE::elapsd e("vector.db", "Vector Experiment");
  e.addKernel(0, "Vector (Multithreaded)");
  e.addDevice(0, "CPU");
  /* fin */

  #pragma omp parallel for
  for (i = 0; i < N; i++) {
    e.startTimer(0, 0); //< Magic happens here
    z[i] = x[i] + y[i];
    e.stopTimer(0, 0);  //< And here
  }

  printResult(z);

  e.commitToDB(); //< Store for analysis

  return 0;
}
```
That's all. Now compile, link against libelapsd.so, and run it. After the run you get a SQLite3 database, vector.db, which you can analyze. The analysis looks like this:
Cool, isn't it? elaps'd automatically recognizes threads when it is used inside a multi-threaded environment (e.g. in the for-loop). Furthermore, the separation between kernels and devices lets you clearly define which function (kernel) runs on which device. Thus, measuring functions on accelerators, e.g. GPGPUs, becomes possible as well, though currently only at a coarse level, i.e. you won't see GPGPU threads in the analysis.
In addition to that, multiple experiments are supported. That means you are able to compare runs of the same program with different arguments. To tell them apart during analysis, elaps'd lets you choose arbitrary names for experiments, kernels, and devices.