Lab 4

Introduction

OptsRus has been contracted to parallelize a customer’s utility function that they use to characterize the quality of random numbers generated by the C library rand_r() function. In particular, you will parallelize it using pthreads. The code makes use of a hash table, which can be non-trivial to parallelize. Be sure to answer all of the numbered questions that are embedded throughout this lab handout on Crowdmark.

Setup

Ensure you're in the ece454 directory within VSCode. Make sure you have the latest skeleton code from us by running: git pull upstream main.

This may create a merge, which you should be able to complete cleanly. If you don't know how to do this, read Pro Git; be sure to read chapters 3.1 and 3.2 fully. This is how software developers coordinate and work together on large projects. For this course, you should always merge to maintain proper history between yourself and the provided repository. You should never rebase in this course, and in general you should never rebase unless you have a good reason to. It will be important to keep your code up-to-date during this lab, as the test cases may change with your help.

Understanding randtrack.cpp

The randtrack program is reasonably straightforward. After instantiating the hash table, it processes several seed streams (streams of random numbers initiated by a certain seed value), in this case four seed streams. It then collects a large number of samples (in this case 10,000,000), but skips samples_to_skip samples between each one, so it actually counts only one sample out of every samples_to_skip samples. Each counted sample is made to fit in the range 0..999,999 by taking a modulo (% 1000000). The program then checks the hash table to see if an element for the sample already exists, creating and inserting one if not, and increments the counter for that element/sample. Finally, once all samples have been tracked, it prints out all the sample values and their frequencies (counts).
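A condensed sketch of that inner loop (identifier names like h, s->count, and the loop bounds are approximations; see randtrack.cpp for the real code):

    // Sketch of the per-stream loop: skip samples_to_skip draws, then
    // count the next one in the hash table.
    for (int i = 0; i < 4; i++) {                    // four seed streams
        unsigned int seed = i;
        for (int j = 0; j < NUM_SAMPLES; j++) {      // counted samples per stream
            for (int k = 0; k < samples_to_skip; k++)
                rand_r(&seed);                       // discard skipped samples
            unsigned key = rand_r(&seed) % 1000000;  // fit into 0..999,999
            sample *s = h.lookup(key);               // already in the table?
            if (!s) {
                s = new sample(key);                 // no: create and insert it
                h.insert(s);
            }
            s->count++;                              // count this sample
        }
    }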

randtrack has the following command-line parameters:

  • num_threads: the number of parallel threads to create (not used in the initial version we give you).
  • samples_to_skip: the number of samples to skip between each counted sample.

You should carefully read and understand randtrack.cpp and how it works. Note that while this is a somewhat "toy" program for the purposes of this assignment, hash tables are a very common and important data structure and this assignment considers parallelization in their presence.

Understanding the Hash Table

The hash table implementation is a one-dimensional array of pointers to lists. Each key maps (hashes) to a certain array location: the key value modulo the hash-array size. A collision (two keys that map to the same hash-array location) is handled by inserting the new element at the head of the list for that location. Note that the hash table and list structures are implemented using the C++ template construct so that they become inlined; this is unfortunately the only way (at least the most straightforward way) it will work with the current version of the transactional memory library. When this template version of the hash table is instantiated, the types of the hash element and key value are specified, in this case class sample and unsigned respectively. You should read and understand hash.h and list.h; other than the template syntax they are fairly straightforward. Note that, for the sake of fair comparisons, please do not modify the size of the hash table array for any experiment.
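For intuition, a lookup in this kind of table boils down to something like the following (a sketch of the idea only, not the exact hash.h code; my_size and entries are illustrative names, written as if inside the hash template):

    // The key selects one list by modulo, then that list's collision
    // chain is searched for a matching element.
    Ele *lookup(Keytype the_key) {
        unsigned index = the_key % my_size;      // hash: key modulo array size
        return entries[index]->lookup(the_key);  // walk that list's chain
    }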

Compiling

In this assignment you will be making and measuring several different versions of randtrack, and you want to be able to build and run any of them at any time. It is up to you whether you accomplish this by having different versions of the randtrack.cpp file, or by using #ifdef's on one copy of the file. Whichever way you choose, be sure to modify meson.build to take care of the details: running meson compile -C build should produce executables build/randtrack_global_lock, build/randtrack_list_lock, build/randtrack_tm, etc., each built as the appropriate version of randtrack (a meson.build sketch is given below). If you use #ifdef's with a single file, note that any methods or data structures added for one lock implementation should only be built for the implementation that uses them. I.e., if you add two methods and an array for the global-lock implementation, those items should be #ifdef'ed out of the TM version if it does not use them.
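One possible arrangement (a sketch, assuming a single randtrack.cpp selected via preprocessor defines; adjust names, flags, and sources to your setup):

    # meson.build: one executable per version, each selected by a define.
    thread_dep = dependency('threads')

    executable('randtrack_global_lock', 'randtrack.cpp',
               cpp_args: ['-DGLOBAL_LOCK'], dependencies: thread_dep)
    executable('randtrack_list_lock', 'randtrack.cpp',
               cpp_args: ['-DLIST_LOCK'], dependencies: thread_dep)
    executable('randtrack_tm', 'randtrack.cpp',
               cpp_args: ['-DUSE_TM', '-fgnu-tm'],
               link_args: ['-fgnu-tm'], dependencies: thread_dep)

Version-specific declarations in randtrack.cpp then stay behind their define, for example:

    #ifdef GLOBAL_LOCK
    pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;  // built only for this version
    #endif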

Q1 (Crowdmark). Why is it important to #ifdef out methods and data structures that aren’t used for different versions of randtrack?

More compilation details for this assignment:

  • You want to compile using g++ version 4.7 or later, which has support for transactional memory.
  • To compile with support for pthreads, you may need to add Meson's thread support (the dependency('threads') shown in the sketch above).
  • To additionally enable support for transactional memory, add the flag -fgnu-tm to the g++ compile (and link) command line.

Debugging

To debug, gdb works fine and supports pthreads (although debugging parallel code is a challenge). Adjust the build file to compile with debug information. However, for your timing runs, you should use -O2 optimization to get the best performance.

Parallelizing with Pthreads

You should start by parallelizing the work using pthreads, following the example code given in class. Create the number of threads specified by the command-line argument num_threads, which you can assume will be set to either 1, 2, or 4. Divide the work among threads as follows: with 1 thread, that thread processes all four seed streams; with 2 threads, each processes two seed streams; and with 4 threads, each thread processes its own seed stream. (A thread-creation sketch follows below.)

It is important to always verify the accuracy of the output by comparing with the output of the original program. If you save the output in a file, e.g., by running build/randtrack 1 50 > rt1.out, you can check the correctness of a new version by doing diff rt1.out rt2.out (which should print no difference). One catch is that, for parallel versions, elements might be inserted into the hash table lists in a different order than in the original (but the counts for each element should be identical). An easy way to account for this difference is to sort the outputs before comparing them, i.e., do sort -n rt1.out > rt1.outs and sort -n rt2.out > rt2.outs, then compare by doing diff rt1.outs rt2.outs, which should show no difference.

Once you have created threads and divided the work among them as above, executing and comparing with the original will give incorrect results, since the accesses to the shared hash table are not yet synchronized. Adding that synchronization, using several different methods, is what you will do in the next sections. The first challenge is to carefully identify which parts of the code need to be synchronized (i.e., executed atomically) versus which are safe to do in parallel. For further details on pthreads, an introductory tutorial is here: http://yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html
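A minimal thread-creation sketch (process_streams and thread_arg are hypothetical names; this is one way to divide the streams, not the required structure):

    #include <pthread.h>

    #define NUM_SEED_STREAMS 4

    // Hypothetical per-thread argument: which seed streams this thread owns.
    struct thread_arg {
        int first_stream;
        int num_streams;
    };

    void *process_streams(void *arg);  // worker: counts its share of the streams

    // In main(): divide the four seed streams evenly among num_threads threads.
    pthread_t tid[NUM_SEED_STREAMS];
    thread_arg args[NUM_SEED_STREAMS];
    int per_thread = NUM_SEED_STREAMS / num_threads;
    for (int t = 0; t < num_threads; t++) {
        args[t].first_stream = t * per_thread;
        args[t].num_streams = per_thread;
        pthread_create(&tid[t], NULL, process_streams, &args[t]);
    }
    for (int t = 0; t < num_threads; t++)
        pthread_join(tid[t], NULL);    // wait for all threads before printing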

Single Global Lock

Create a version of the program called randtrack_global_lock that synchronizes by creating and using a single pthread_mutex around the critical shared accesses in the code. Ensure via testing that it produces the same results as the original version as described above.
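The pattern is roughly the following (a sketch reusing the assumed names from the earlier loop sketch; deciding exactly what must sit inside the critical section is part of the exercise):

    pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

    // Inside the sampling loop: serialize all shared hash-table accesses.
    pthread_mutex_lock(&global_lock);
    sample *s = h.lookup(key);
    if (!s) {
        s = new sample(key);
        h.insert(s);
    }
    s->count++;
    pthread_mutex_unlock(&global_lock);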

List-Level Locks

While having a single global lock is easy to implement, it is often better for performance to have multiple locks that protect smaller data structures, i.e., "fine-grain locks". Create a version of the program called randtrack_list_lock that creates a pthread_mutex for every location in the hash array (e.g., one for each list), and synchronizes using those locks. You will find that this requires a bit of thinking.
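One possible shape (a sketch only; HASH_SIZE stands for the hash-array size, and where this logic can live is exactly what Q2-Q5 below ask you to think about):

    // One mutex per hash-array location, chosen the same way the hash
    // chooses its list: by key modulo the array size. Initialize each
    // lock with pthread_mutex_init() before creating threads.
    pthread_mutex_t list_locks[HASH_SIZE];

    pthread_mutex_lock(&list_locks[key % HASH_SIZE]);
    sample *s = h.lookup(key);
    if (!s) {
        s = new sample(key);
        h.insert(s);
    }
    s->count++;
    pthread_mutex_unlock(&list_locks[key % HASH_SIZE]);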

Q2 (Crowdmark). Can you implement this without modifying the hash class, or without knowing its internal implementation?

Q3 (Crowdmark). Can you properly implement this solely by modifying the hash class methods lookup and insert? Explain.

Q4 (Crowdmark). Can you implement this by adding to the hash class a new function, "lookup and insert if absent"? Explain.

Q5 (Crowdmark). Can you implement it by adding to the hash class new methods "lock list" and "unlock list"? Explain. Implement the simplest solution above that works (or a better one if you can think of one).

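Transactional Memory

Create a version of the program called randtrack_tm that synchronizes the shared accesses using GCC's transactional memory support (enabled with the -fgnu-tm flag described above) instead of locks, and verify its output as before. With TM, the critical section is simply wrapped in a transaction; a minimal sketch, reusing the assumed names from the earlier sketches:

    // The TM runtime executes this block atomically; no explicit locks.
    __transaction_atomic {
        sample *s = h.lookup(key);
        if (!s) {
            s = new sample(key);
            h.insert(s);
        }
        s->count++;
    }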
Q6 (Crowdmark). How difficult was using TM compared to implementing list locking above?

Element-Level Locks

An even finer-grain locking approach than list-level locking is to lock individual sample elements. Think carefully about whether your implementation handles both the case where the key already exists in the hash table and the case where it does not yet exist and needs to be inserted. Call this version randtrack_element_lock.
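One way to structure it is to embed a lock in each element (a sketch; the real sample class is in randtrack.cpp, and the insertion path still needs some coarser protection, which is exactly the case the paragraph above warns about):

    // Each sample carries its own mutex, so threads updating different
    // samples never contend with each other.
    class sample {
      public:
        unsigned my_key;
        unsigned count;
        pthread_mutex_t lock;                    // element-level lock
        sample(unsigned k) : my_key(k), count(0) {
            pthread_mutex_init(&lock, NULL);
        }
    };

    // Incrementing an existing element needs only that element's lock:
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_mutex_unlock(&s->lock);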

Reduction Version

Rather than sharing and synchronizing on the hash table, another solution is to give each thread its own private copy of the hash table: each thread counts its streams into its local hash table, and the counts from the multiple hash tables are combined into one at the end, before printing. Call this version randtrack_reduction.
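Structurally, the reduction version looks something like the following sketch (local_hash, final_hash, and the elements() accessor are hypothetical; hash.h may not provide iteration, in which case you must add it or merge some other way):

    // One private table per thread: counting requires no synchronization.
    hash<sample, unsigned> local_hash[4];        // indexed by thread id

    // Each worker counts its streams into local_hash[tid] exactly as in
    // the original code. After pthread_join, fold the tables together:
    for (int t = 0; t < num_threads; t++) {
        for (sample *s : local_hash[t].elements()) {  // hypothetical accessor
            sample *dst = final_hash.lookup(s->my_key);
            if (!dst) {
                dst = new sample(s->my_key);
                final_hash.insert(dst);
            }
            dst->count += s->count;              // combine the counts
        }
    }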

Q7 (Crowdmark). What are the pros and cons of this approach?

Measuring

For this homework your implementations should be measured from start to finish, using /usr/bin/time. In other words, you will include the entire computation in your measurement, including reading and writing files, initialization, and thread creation or other overheads associated with parallelization. Note that time also exists as a shell built-in command, but we prefer the full path above. Report the "Elapsed" time.
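For example (the target name and arguments are illustrative; with GNU time, -v prints a labeled "Elapsed (wall clock) time" line):

    /usr/bin/time -v build/randtrack_global_lock 4 50 > rt.out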

Getting a Clean Timing Measurement

The code already has a timing function, measured at the appropriate points. Since we are measuring timing for parallel execution using multiple processors, it is important to take our measurements when the machine is not heavily loaded. You can check the load on a machine with the command w. You should be able to get a relatively clean measurement when the load average is less than 1.0. Since each machine has 4 processors, the load average may be as high as 4.0 when all thread slots are busy, or even higher if the machine is overloaded. Once you are confident in the correctness of your code, you probably want to write a script (python or other) to perform all of your runs and collect the data (see the sketch below). For every timing measurement, always do 5 runs and average them (please only report the final average).
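Even something minimal works as a runner (illustrative):

    # Five timed runs of one configuration; average the elapsed times.
    for i in 1 2 3 4 5; do
        /usr/bin/time build/randtrack_list_lock 4 50 > /dev/null
    done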

Experiments to Run

Run the following experiments and report the runtime results in a table. For those that implemented element-lock or reduction versions, report results for them as well.

  1. Measure the original randtrack with samples_to_skip set to 50 (num_threads can be any number; it isn't used).
  2. Measure the global-lock, list-lock, (element-lock), and TM versions of randtrack with 1, 2, and 4 threads, and with samples_to_skip set to 50.

Q8 (Crowdmark). For samples_to_skip set to 50, what is the overhead for each parallelization approach? Report this as the runtime of the parallel version with one thread divided by the runtime of the original single-threaded version.

Q9 (Crowdmark). How does each approach perform as the number of threads increases? If performance gets worse for a certain case, explain why that may have happened.

Q10 (Crowdmark). Repeat the data collection above with samples_to_skip set to 100 and give the table. How does this change impact the results compared with when set to 50? Why?

Q11 (Crowdmark). Which approach should OptsRus ship? Keep in mind that some customers might be using multicores with more than 4 cores, while others might have only one or two cores.

Evaluation

Your grade will be calculated as follows:

  • Global Lock (and basic pthreads): 20 points.
  • List Lock: 25 points.
  • Element Lock: 20 points.
  • Reduction: 15 points.
  • Answers to questions (Crowdmark): 20 points.
  • Total: 100 points.

Submission

Remember to also submit your answers on Crowdmark.

Simply push your code using git push origin main (or simply git push) to submit it. You need to create your own commits to push; you can use as many as you'd like. You'll need to use the git add and git commit commands. You may push as many commits as you want; your latest commit that modifies the lab files counts as your submission. For the submission time, we will only look at the timestamp on our server. We will never use your commit times (or file access times) as proof of submission, only when you push your code to GitHub.