My sparse multiplication algorithm took about 1 second for size 10,000, but beyond that the time grows at what looks like an exponential rate, and it would very likely take about an hour for size 100,000. I'm guessing I might not be using the optimal algorithm; is there any hint?
That's not far off the times I'm getting as well; I'm under the assumption it is supposed to take quite a long time. I could also not be using the optimal algorithm, but regardless I'm assuming it will take a long time.
Likewise - 10,000 took around a second, and for me doubling the size led to a roughly 10x time increase, so mine will take somewhere around 3 hours.
That feels quite long, however; it will make repeated tests and comparisons difficult.
What does your sbatch script contain? My Setonix jobs sit in a queue for a full day before executing, and OpenMP has been doing nothing for me. What values do you have for --nodes and --ntasks, and are you using the --cpus-per-task directive?
Hi Benjamin,
At the moment, I just test my program from the command line and set the total thread count to 1. I will write the sbatch script later to add nodes or cores and get more test results.
Thanks,
Joey
For me I've currently been using:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task={I change this for the thread count}
#SBATCH --partition=work
#SBATCH --account=courses0101
#SBATCH --mem=220G (definitely overkill, but works)
#SBATCH --time=01:30:00
From my understanding, you don't need to change --nodes, since OpenMP is used (in this project) to maximise performance on a single node, and you shouldn't need to change --ntasks either (though I could be wrong about the ntasks).
Isn't that changing the number of "cores"? It seems --cpus-per-task can go up to 128, but I assume we should set it to 28, since the task says it should use 28 cores. Then we can change the number of threads used with omp_set_num_threads while keeping the core count the same?
I'd assume there is no optimal value (U-shaped graph) for --cpus-per-task, but rather that it would keep getting faster the more cores are used.
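In case it helps, here is a minimal C/OpenMP sketch (not the project code; the file name and thread counts are just for illustration) showing that omp_set_num_threads changes the thread count at runtime while the cores SLURM allocated via --cpus-per-task stay the same:

// threads_demo.c - minimal sketch: vary the OpenMP thread count at runtime
// while the SLURM core allocation stays fixed.
// Build (for example): cc -fopenmp threads_demo.c -o threads_demo
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Sweep 1, 2, 4, 8, 16 threads; cap at whatever --cpus-per-task allows. */
    for (int t = 1; t <= 28; t *= 2) {
        omp_set_num_threads(t);
        #pragma omp parallel
        {
            #pragma omp single
            printf("parallel region ran with %d threads\n",
                   omp_get_num_threads());
        }
    }
    return 0;
}

So you could request --cpus-per-task=28 once and sweep the thread count inside the program (or via OMP_NUM_THREADS) for the timing comparisons.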
I believe a proper implementation should be able to perform the multiplication for p = 0.05 and N = 100,000 in under ten minutes. In fact, ten minutes is a generous estimate, and your program will likely complete the task in less time.
Well, actually, it would usually make sense. Suppose we have 1 thread and 1 core. Every time an item in a row or column of the matrix needs to be accessed, that data has to be pulled from RAM or a cache into the CPU, which can take hundreds of CPU cycles. During these cycles, the CPU is typically idle, waiting for the data to arrive.
If there were two threads, however, the second could already have preloaded its data into the cache, so that when thread 1 finishes executing and needs its next piece of data, the CPU can immediately switch to thread 2 and consume its preloaded data while thread 1 waits for more data from RAM.
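As a rough illustration (a toy memory-bound loop, not the assignment code; the sizes are arbitrary), you can sometimes see this latency-hiding effect by timing the same loop with different thread counts:

// latency_demo.c - toy memory-bound reduction to illustrate the idea that
// extra threads can hide some of the time spent waiting on memory.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;                 /* ~800 MB of doubles */
    double *a = malloc((size_t)n * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < n; i++) a[i] = 1.0;

    for (int t = 1; t <= 8; t *= 2) {
        omp_set_num_threads(t);
        double sum = 0.0;
        double start = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += a[i];                       /* mostly waiting on memory */
        printf("%d thread(s): %.3f s (sum = %.0f)\n",
               t, omp_get_wtime() - start, sum);
    }
    free(a);
    return 0;
}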
In the Pawsey documentation they refer to --cpus-per-task as the maximum number of threads in an OpenMP application:
"--cpus-per-task=<c> (or -c <c>): specifies the number of cores assigned to each process (or task related to the --ntasks option). For OpenMP jobs, and multithreaded programs in general, this implies each task may have up to c threads running in parallel. The value for the --cpus-per-task should correspond to the one associated with the OMP_NUM_THREADS variable for OpenMP applications."