It's UWAweek 47

help5507

This forum is provided to promote discussion amongst students enrolled in CITS5507 High Performance Computing.

Please consider offering answers and suggestions to help other students! And if you fix a problem by following a suggestion here, it would be great if other interested students could see a short "Great, fixed it!"  followup message.

How do I ask a good question?
Displaying the 11 articles in this topic


 UWA week 36 (2nd semester, mid-semester break) ↓
11:10am Tue 3rd Sep, ANONYMOUS

Problem Statement

When I run the command on my local machine, I observe a noticeable performance boost as the number of threads increases.

However, on Setonix, regardless of whether I specify 1 or 128 threads, the performance remains consistently slow and unchanged. I checked the thread count in the job output, and it does appear to spawn the correct number of threads, but the overall performance stays the same.

Local output

% ./openmp_basics 1 100000 10000 10 
Using 1 threads..

Time taken: 9.465 seconds.
% ./openmp_basics 8 100000 10000 10 
Using 8 threads..

Time taken: 2.030 seconds.

Setonix output

tail slurm-15112799.out
Using 128 threads..

Time taken: 12.112 seconds.

Here is my batch file:

#!/bin/bash

#SBATCH --account=courses0101
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --mem=200G
#SBATCH --time=00:30:00


export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

cc -fopenmp -o openmp_basics ./openmp_basics.c
srun ./openmp_basics $OMP_NUM_THREADS 100000 10000 10

I have tried many things, still stuck here. Any idea?
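One thing I have not ruled out is whether the srun step inside the batch script actually receives all 128 CPUs. I am not sure whether srun inherits --cpus-per-task from sbatch on this Slurm version, so the following is only a guess:

```shell
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

# Pass the CPU count to the job step explicitly, in case the step
# does not inherit --cpus-per-task from the batch allocation.
srun -c "${SLURM_CPUS_PER_TASK}" ./openmp_basics "$OMP_NUM_THREADS" 100000 10000 10
```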


11:24pm Tue 3rd Sep, ANONYMOUS

Do you mind sharing the code? I don't think people will give helpful answers without knowing the details.


3:31pm Thu 5th Sep, ANONYMOUS

I suspect that Setonix may be executing our code in a sandboxed environment, which might restrict it to a single core and limit memory resources, regardless of how we configure the batch file. Since we are likely classified as less-privileged users, it's possible that the resource allocation policy is imposing these constraints on us.

Could anyone confirm that you are getting true performance from Setonix?

ANONYMOUS wrote:

(quoting the original problem statement, outputs, and batch file in full; see the first post above)


6:32pm Thu 5th Sep, ANONYMOUS

Using --cpus-per-task=128 is the correct way to specify the thread count. Since we do not have control over how Setonix operates, we must work with its scheduling behaviour as best we can.

I still cannot see the C code, so I am not sure how you measured the execution time. However, it is possible that Setonix's scheduling scheme could make wall time an inaccurate measure of execution time.

I suggest trying this for measuring CPU time (note it needs <time.h>, not <unistd.h>):

#include <time.h>

double thread_cpu_time(void) {
    struct timespec t;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}


8:18pm Thu 5th Sep, ANONYMOUS

In addition to my previous response, I realize that it may be impossible to perfectly time a multi-threaded function.

Although clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) excludes scheduling delays, we will always call the function in the master thread. If the master thread finishes before the other threads, the recorded CPU time for the master thread may be shorter than the CPU time consumed by the longest-running thread. Ideally, as shown in the figure, we would want to capture the CPU time of the longest-running thread, which is the child thread in this case, but that cannot be done from the master thread alone.

   Child Start - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Child Complete
  /                                                                                        \
Clock() - Master Start - Master Complete - (Wait for Child, does not consume CPU time) - Clock()
#include <ctime>
#include <iostream>
#include <omp.h>
using namespace std;

double myclock() {
  struct timespec t;
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t);
  return t.tv_sec + t.tv_nsec / 1e9;
}

int main() {
  cout << omp_get_max_threads() << endl;
  double a = myclock();

#pragma omp parallel for
  for (int i = 0; i < 2; ++i) {
    if (omp_get_thread_num() != 0) {
      double c = myclock();
      // simulate some work
      for (unsigned j = 0; j < 1000000000; ++j)
        ;
      double d = myclock();
      cout << omp_get_thread_num() << endl;
      cout << (d - c) << endl;
    }
  }

  double b = myclock();
  cout << (b - a) << endl;
}
$ g++ 1.cpp -fopenmp; ./a.out # run without optimization so the long loop will not be optimized out
256
1
0.653028
0.0163417


10:55pm Thu 5th Sep, ANONYMOUS

Hi, thank you for your response. I've made some updates to my implementation by including your suggested time calculations along with both wtime() and clock() functions. This should help provide a clearer picture of the performance. I would greatly appreciate your feedback on any potential issues or improvements. Thank you for your ongoing support.

Test Script: openmp_101.c

#include<stdio.h>
#include<stdlib.h> // Needed for atoi
#include<time.h>
#include<omp.h>
#include<unistd.h>



int max_threads;

double myclock() {
    struct timespec t;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}



int calculateSums(long loops) {
    
    double start, end;
    clock_t clock_start, clock_end;
    double my_start, my_end;
    start = omp_get_wtime();    
    my_start = myclock();
    clock_start = clock();

    int sum = 0;
    int rand_sum = 0;
    #pragma omp parallel
    {
        // If this is the master thread, print the number of threads used
        if (omp_get_thread_num() == 0) {
            printf("Using %d threads of max available %d threads.\n\n", omp_get_num_threads(), max_threads);
        }
        
        // Use a for loop with a reduction operation to sum the values
        #pragma omp for reduction(+: sum) reduction(+: rand_sum)
        for (int i = 0; i < loops; i++) {
            sum += 1;
            // Generate a random number between 1 and 3
            rand_sum += (rand() % 3) + 1;
        }
    }

    end = omp_get_wtime(); 
    my_end = myclock();
    clock_end = clock();

    // Print the time taken and the sums calculated
    printf("Time taken (wtime): %.3f seconds.\nTime taken (clock_gettime): %.3f seconds.\nTime taken (clock_t): %.3f seconds.\nLoops %ld \nSum: %d \nRand_sum: %d\n", (end - start), (my_end - my_start), ((double) (clock_end - clock_start) / CLOCKS_PER_SEC), loops, sum, rand_sum);


    return rand_sum;
}


int main(int argc, char *argv[])
{
    srand(time(NULL)); // Seed the random number generator
    long loops = 10000000000; // 10 billion
    double start, end;
        
    max_threads =  omp_get_max_threads();
    
    // Check if the number of threads was specified as a command line argument
    if (argc >= 2) {        
        int requested_threads = atoi(argv[1]);
        
        // Set number of threads based on max availability
        if (requested_threads <= max_threads) {
            omp_set_num_threads(requested_threads);
        } else {
            omp_set_num_threads(max_threads);
        }
    }
    
    // Calculate the sum of 1 and the sum of random numbers between 1 and 3
    calculateSums(loops);
    
    return 0;
}



Batch Scripts:

File: openmp_101-a.sh


#!/bin/bash

#SBATCH --account=courses0101
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28
#SBATCH --mem=100G
#SBATCH --time=00:30:00
#SBATCH --qos=high
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]

cc -fopenmp -o openmp_101 ./openmp_101.c
srun ./openmp_101

File: openmp_101-b.sh

#!/bin/bash

#SBATCH --account=courses0101
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=28
#SBATCH --mem=100G
#SBATCH --time=00:30:00
#SBATCH --qos=high
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

cc -fopenmp -o openmp_101 ./openmp_101.c
srun ./openmp_101

On local mac:

% ./openmp_101 
Using 10 threads of max available 10 threads.

Time taken (wtime): 2.172 seconds.
Time taken (clock_gettime): 1.905 seconds.
Time taken (clock_t): 19.139 seconds.
Loops 10000000000 
Sum: 1410065408 
Rand_sum: -1474884569

% ./openmp_101 1
Using 1 threads of max available 10 threads.

Time taken (wtime): 9.230 seconds.
Time taken (clock_gettime): 9.228 seconds.
Time taken (clock_t): 9.228 seconds.
Loops 10000000000 
Sum: 1410065408 
Rand_sum: -1474822776

On Setonix server:

I ran jobs using both batch scripts; openmp_101-b.sh has one additional line, export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}.


> sbatch openmp_101-a.sh 
Submitted batch job 15274010
> sbatch openmp_101-b.sh 
Submitted batch job 15274012


> tail slurm-15274010.out 
Using 1 threads of max available 1 threads.

Time taken (wtime): 10.454 seconds.
Time taken (clock_gettime): 10.453 seconds.
Time taken (clock_t): 10.453 seconds.
Loops 10000000000 
Sum: 1410065408 
Rand_sum: -1474854915

> tail slurm-15274012.out 
Using 56 threads of max available 56 threads.

Time taken (wtime): 12.586 seconds.
Time taken (clock_gettime): 0.226 seconds.
Time taken (clock_t): 12.584 seconds.
Loops 10000000000 
Sum: 1410065408 
Rand_sum: -1474884167

According to all available documentation, wtime() is recommended for measuring the performance of parallel programs using OpenMP. I observe a clear difference when running programs with varying thread counts on my local machine. However, this performance difference is not reflected when running the same programs on Setonix.

ANONYMOUS wrote:

(quoting the previous post on timing multi-threaded functions in full; see above)


11:07pm Thu 5th Sep, ANONYMOUS

Additionally, measuring the longest-running thread won't accurately reflect total time consumed if multiple threads are time-sharing the same CPUs, especially when the number of threads exceeds the number of available CPUs. This is why I believe wtime() remains the most appropriate function for performance measurement in our case.

From my latest observations, it seems plausible that Setonix may be employing time-sharing instead of running threads on multiple CPU cores.

If you test it on your own Setonix account, you might get different results.


5:03pm Fri 6th Sep, ANONYMOUS

First, I would like to point out that your code may have undefined behaviour. int i and long loops have different data types, and loops has the value 10000000000, which is greater than INT_MAX, so the loop should never finish. But since the behaviour is undefined, the compiler can do anything reasonable here. Based on my test, your code runs as if loops = 10000000000 & 0xffffffff = 1410065408.

The above is not directly related to the performance, but I think I should mention that.


5:19pm Fri 6th Sep, ANONYMOUS

Moreover, I have never used sbatch before, because I assumed that running a program directly with srun would give identical, or at least similar, performance.

For example, if I were you I would run openmp_101-b.sh as:

cc -fopenmp -o openmp_101 ./openmp_101.c

export OMP_NUM_THREADS=56 # Anything you want

srun \
    --account=courses0101 \
    --partition=work \
    --nodes=1 \
    --ntasks=1 \
    --ntasks-per-node=1 \
    --cpus-per-task=28 \
    --mem=100G \
    --time=00:30:00 \
    --qos=high \
    --mail-type=END,FAIL \
    [email protected] \
    ./openmp_101

This worked well for all of my programs so I have never used sbatch.

To answer your question, I tried executing your program with sbatch openmp_101-b.sh, and it indeed completed within 15 seconds. However, when I used the commands above to execute the program directly with srun, it took more than 140 seconds to complete. In this case, srun is much slower than sbatch.

I was confused, so I tried executing my project 1 with sbatch, and it took forever to complete the multiplication, whereas it normally takes around two minutes. In this case, sbatch is much slower than srun.

So now I am also confused. I don't know the reason behind this, so hopefully someone can answer it. Before ending my post, I still want to share some fun facts I observed:

If you remove the rand() call and replace it with something like i*i, sbatch becomes much slower than srun. This is funny, as my project did not use rand(), so I guess the performance difference may have something to do with rand(). Moreover, if you do not use rand() in your code and run the program directly with srun under different thread counts, you can notice a difference.

I just don't know why.


5:53pm Fri 6th Sep, ANONYMOUS

A minimal test case to show the performance difference.

openmp_101.c:

#include <inttypes.h>
#include <limits.h>
#include <omp.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
    fflush(stdout);

    double start = omp_get_wtime();

    const uint64_t loops = INT_MAX;
    uint64_t sum = 0, rand_sum = 0;

#pragma omp parallel for reduction(+ : sum) reduction(+ : rand_sum)
    for (uint64_t i = 0; i < loops; i++) {
        sum += 1;
        rand_sum += rand(); // or: rand_sum += (i * i);
    }

    double end = omp_get_wtime();

    // Print the time taken and the sums calculated
    printf("Time taken (wtime): %.3f seconds\n", end - start);
    printf("Loops: %" PRIu64 "\n", loops);
    printf("Sum: %" PRIu64 "\n", sum);
    printf("Rand_sum: %" PRIu64 "\n", rand_sum);
}

openmp-101.sh:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
#SBATCH --qos=high

cc -fopenmp -o openmp_101 ./openmp_101.c
export OMP_NUM_THREADS=56
srun ./openmp_101

Run with sbatch:

sbatch openmp-101.sh

Run with srun:

(
    cc -fopenmp -o openmp_102 ./openmp_101.c
    export OMP_NUM_THREADS=56
    srun --nodes=1 --ntasks=1 --cpus-per-task=28 --qos=high  ./openmp_102
)

Performance will differ, depending on whether there is a rand() or not.

  • If there is rand(): sbatch gets a result quickly, while srun runs forever (I killed the program after waiting long enough).
  • If there is no rand(): sbatch is 10x slower than srun.

Note that rand() has implementation-defined thread-safety (https://en.cppreference.com/w/c/numeric/random/srand), not sure if this has anything to do with the issue.

But indeed sbatch can be slower than srun, and this is a problem.


5:57pm Fri 6th Sep, ANONYMOUS

sorry, use this link: https://en.cppreference.com/w/c/numeric/random/rand

The University of Western Australia

Computer Science and Software Engineering

CRICOS Code: 00126G
Written by [email protected]
Last modified  8:08AM Aug 25 2024