ECE 264 Fall 2012 IPA4 Parallel Computing (Optional)
Under Construction
So far in ECE 264, every program is sequential: one statement follows another. This is called the "sequential execution model." Since early 21st century, processors have multiple cores allows several programs to run simultaneously (also called "in parallel"). All desktop and laptop processors have multiple cores. In fact, Google Nexus 7 Tablet already has 4 cores. This assignment gives you a chance writing a parallel program. In particular, we will use pthread to write the program. Pthread is an extension of C and is supported in many operating systems.
Parallel Computing
Parallel computing is to perform multiple operations at the same time, so the performance and efficiency of the program can be greatly enhanced. The concept is that often times large problem that take enormous amount of time to solve could be divided into smaller problems, and those sub-problems could be solved simultaneously, therefore make the computation more efficient. There are many forms of parallel computing, and one of widely adopted form is to run multiple threads concurrently, termed as multi-threading. A thread is a sequence of program instructions that execute for certain amount of time. Thread is part of a computer program running as a process. A program can spawn several threads, each of which doing part of the work for the program. However, a program might not gain performance provided that its threads are running on the same processor, which in that case the threads are executed sequentially, one after another, and the overall execution time will not change. By assign each threads to a distinct processor, threads can be executed in parallel, that is, threads can do their jobs at the same time. This way the overall execution time will decrease thus the gain of performance.Today's multi-core technology enables this approach and is suitable for multi-threading parallel computation.
PThreads
The interface for multi-threading you are going to implement in this assignment is pthreads. Pthreads is an Application Programming Interface for creating and manipulating threads. It is commonly used under Unix-like operating systems. (wiki) Here we have provided you several small examples illustrating basic use of pthread to create multi-thread programs. Some brief descriptions of the funtions and data types for pthreads are also provided along with the examples. You can also find a detailed tutorial here at this site. Detailed function prototype can be finded at kernel.org. Each function prototype and data structure of pthreads can be found also on the open group.
Example 1: printing hello world with multiple threads
void * printMsg(void * threadId)
{
long tid = (long) threadId;
printf("Thread %ld: Hello World!\n", tid);
return NULL;
}
int main(int argc, char * argv[])
{
pthread_t threads[4]; // create 4 threads
long i;
for (i = 0; i < 4; i++) {
// create each thread with id = i
pthread_create(&threads[i], NULL, printMsg, (void*) i);
}
for (i = 0; i < 4; i++) {
// wait for each thread to finish
pthread_join(threads[i], NULL);
}
printf("Main: Hello World!\n");
return 0;
}
Output: (order may vary)
Thread 1: Hello World!
Thread 0: Hello World!
Thread 3: Hello World!
Thread 2: Hello World!
Main: Hello World!
The first example illustrate how to create thread using pthread functions, and here are protoype for them, as well as pthread data types:
pthread_t : This is the data type for the thread handle, or identifier. Each handle is unique inside a process. Threads are manipulated by their thread handles.
pthread_create ( pthread_t * thread, const pthread_attr_t * attr, void * ( * start_routine) (void *), void * arg )
This function creates a new thread within the calling process. The new thread is manipulated through the thread handle thread, which is passed in to this funtion and initialized after thread creation. The thread is created with attributes specified by attr; if attr is NULL, default attributes is used for the newly created thread. The new thread starts is execution by calling supplied function start_routine, which takes a single argument arg.
Notice here start_routine is a funtion that takes a void * argument and returns void *. and the sole argument to start_routine, arg, is typed void * as well. Since start_routine only takes one argument, if there needs more than one parameter for the thread to work on, they need to be wrapped by othere data structure like a struct or union. Exmaples are provided below to illustrate this use.
Return value: This function returns 0 on success, otherwise an error code is returned.
pthread_join ( pthread_t thread, void **value_ptr )
This function suspends the calling thread, in this case, the thread that executes the main function, and wait for the target thread specified by its handle thread to finish exectution. The return value of the target thread is written to value_ptr, provided the second argument supplied is not NULL. If the main thread need to wait for multiple thread to finish execution, pthread_join needs to be called once per target thread.
Return Value: This function returns 0 on success, otherwise an error code is returned.
void pthread_exit ( void * value_ptr )
This funtion terminates the calling thread and makes the value value_ptr available to the join function. This function is implicitly called when start_routine, the function used to create new threadsm, finishes its exectution and its return alue is used as the exit status of the terminating thread.
Example 2: create 4 threads, each sums up part of a large array of 1000 elements
Some codes are omited for not contributing to pthreads, please refer to the full source code provided here.
//global declarations
// defines the thread data structure for passing parameters
typedef struct thread_data {
int * start; // starting point of current part of start
int count; // number of elements in the given array slice
int tid; // thread id
} Tdata;
.....
//in function void * partsum(void *):
......
// sums up elements
int sum = 0;
int i;
for (i = 0; i < count; i++)
sum += start[i];
printf("Thread %2d: from %4d to %4d, sum = %6d\n", tid, start[0], start[count-1], sum);
return NULL;
......
//in function main:
......
int perThreadWork = 1000 / numThreads; // number of elements each thread gets
int rtcode, arrStart = 0; // offset in the array for each thread
for (i = 0; i < numThreads; i++) {
threadData[i].start = array + arrStart;
if (i == numThreads - 1)
threadData[i].count = 1000 - arrStart; // if last thread, assign rest of the elements to it, making sure no element is left uncounted
else
threadData[i].count = perThreadWork; // if not last thread, distribute average load
threadData[i].tid = i;
rtcode = pthread_create(&threads[i], 0, partsum, &threadData[i]); // create thread for each handle, passing the function and its parameter
if (rtcode != 0) { // check return code of pthread_create
fprintf(stderr, "Error creating threads, return code: %d\n", rtcode);
return -1; }
arrStart += perThreadWork; // shift the offset for next thread
}
int status;
for (i = 0; i < numThreads; i++)
pthread_join(threads[i], (void *) &status); // wait for each thread to finish its work
printf("All threads finished.\n");
......
Output: (order may vary)
Thread 0: from 1 to 250, sum = 31375
Thread 3: from 751 to 1000, sum = 218875
Thread 1: from 251 to 500, sum = 93875
Thread 2: from 501 to 750, sum = 156375
All threads finished.
This example illustrates how to used mulitple parameters for new threads' execution, by wraps up all necessary information inside a data structure struct thread_data. The 1000 elements array is divdie by the number of threads, in this case, 4, and each thread does its work by invoking the start routine partsum. The partial sums from each thread are then printed to stdout, which the output is not neccessarily in order.
Example 2.1: similar to example 2, but the number of threads to create is depend on the first argument of the program, user can give an integer argument between 1 and 32, for number of threads to create. Please check the file here.
Execute with argument : ./ex21 9
Output: (order may vary)
Thread 1: from 112 to 222, sum = 18537
Thread 0: from 1 to 111, sum = 6216
Thread 3: from 334 to 444, sum = 43179
Thread 7: from 778 to 888, sum = 92463
Thread 2: from 223 to 333, sum = 30858
Thread 8: from 889 to 1000, sum = 105784
Thread 6: from 667 to 777, sum = 80142
Thread 5: from 556 to 666, sum = 67821
Thread 4: from 445 to 555, sum = 55500
All threads finished.
Example 3: This time the overall sum of the given array is computed. There is a global variable totalSum that need to be updated by each thread as they finish their part of addition. Since this variable can be modified by several threads, the integrity of this data is crucial to the correctness of the program. Considering the following scenario:
1.) Initially totalSum = 0. Thread 1 finish its work and get its result sum1 = 100. It then adds 100 to totalSum, write it back to memory and totalSum now equals to 100. Thread 2 finish its part and get result sum2 = 200. It updates totalSum, write to memory, make it equals to 300. Now all works are done, and final result is correct.
2.) Starting with totalSum = 0. Suppose thread 1 finish it work, getting sum1 = 100. At the same time, thread 2 finishes and get sum2 = 200. Thread 1 now wants to update totalSum, it sees totalSum = 0 and adds sum1 to it and plan to save to memory. However, just before it writes totalSum back to memory, thread 2 cuts in because it doesn't like to wait. So it adds totalSum with sum2, write it to memory, and now totalSum = 200 . Thread 1 then write its result, 100, to memory, and this overwrites what thread 2 has saved. The final result is now 100, which is incorrect. Now the data is corrupted and the program fails.
3.) Same starting condition, suppose thread 2 finishes and update totalSum with 200, write result to totalSum in memory. Thread 1 now finishes and update totalSum with 100. The final result is 300. Now everything is fine.
Thread Synchronization: With the above three scenarios, you can see with multiple threads running in parallel updating a shared variable can potentially cause corruption of data. Therefore there needs to be some mechanism to guarantee that only one thread can update the shared variable at a time. Such process is termed as Thread Synchronization, referring to the fact that each thread needs to suspend its execution to form an agreement on how to access a shared resource, in order to gurantee integrity of the data. The code that accesses the shared resource that must not be access by two or more threads is called the critical section.
In pthread the synchronization process is done by protecting the critical sections by using mutex objects. Mutex refers to mutual exclusion, which means no two threads can have controll over the object at the same time. A mutex object has two states, locked and unlocked. A thread first has to lock the mutext object in order to access the shared resource protected by the mutex. After it acquires the lock, it can step further to execute the critical section that access shared resources. A mutex object that is locked cannot be locked again, and only the thread acquired the lock can unlock it at any given time. If it is locked, other threads must wait for the owner of the lock to finish. When the thread finishes access to shared resources, it unlocks the mutex object and only by then can the other threads acquire the lock. This way ensures only one thread gets to access the common resource at any given time, therefore the threads are sychronized, and data integrity is guaranteed.
Note that it is not the code that accesses the shared resource gets locked and unlocked. It is the mutex object, and the critical section, code that accesses common resources, is put in between the lock and unlock of mutext statement to be protected for exclusive execution by a thread. The example below illustrates the basic use of mutex to protect shared access to the variable totalSum. A detailed tutorial on how to use mutex with pthread can be found here.
Following are some code segment from the example 3 source code, for full source code please look at here:
/* Example 3 ex3.c
* create threads according to program argument
* number of threads to creates defaults to 4
* each thread sums up part of a given array
* overall sum is computed by adding up each partial sums
* */
// defines the thread data structure for passing parameters
typedef struct thread_data {
int * start; // starting point of current part of start
int count; // number of elements in the given array slice
int tid; // thread id
pthread_mutex_t * mutexSum; // pointer to the actual mutex object for synchronzing the totalSum sum of the array elements
int * totalSum; // pointer to totalSum, the overall sum of all elements in the array;
} Tdata;
// in function partsum
......
int * totalSum = td->totalSum;
......
pthread_mutex_lock(td->mutexSum); // acquires lock
*totalSum += sum; // only one thread gets to update totalSum
pthread_mutex_unlock(td->mutexSum); // unlock so other thread gets access to lock the mutex
......
// in function main
......
int totalSum = 0; // overall sum, calculated by collecting partial sums from each thread
pthread_mutex_t mutexSum; // mutext object for controlling access to shared resources among threads
pthread_mutex_init(&mutexSum, NULL); // initializes mutex object
......
threadData[i].mutexSum = &mutexSum; // all threads use the same mutex object
threadData[i].totalSum = &totalSum; // each thread gets access to update totalSum
rtcode = pthread_create(&threads[i], NULL, partsum, &threadData[i]); // creates thread for each handle, passing the function and its parameter
......
printf("Overall sum = %d\n", totalSum);
pthread_mutex_destroy(&mutexSum); // destroys mutex object, release all resources allocated
......
Output: (order of per thread output may vary)
Thread 2: from 501 to 750, sum = 156375
Thread 0: from 1 to 250, sum = 31375
Thread 3: from 751 to 1000, sum = 218875
Thread 1: from 251 to 500, sum = 93875
Overall sum = 500500
Below are some brief descriptions for the data type of mutext object and funtions prototype form manpulate mutexes, for detailed description of each, please refer to the manual of Open Group.
pthread_mutex_t: This is the mutex type for synchronization. The mutex object declared is uninitialized.
int pthread_mutex_init ( pthread_mutex_t * mutex, const pthread_mutexattr_t * attr )
This function initialize the mutex object mutex, with the attributes specified by attr. If attr is NULL, default mutex attributes are used.
Return Value: 0 on success, otherwise an error code.
int pthread_mutex_destroy ( pthread_mutex_t * mutex )
This function destroy the mutex object referenced by mutex, and then the mutex object becomes uninitialized.
Return Value: 0 on success, error code otherwise.
int pthread_mutex_lock ( pthread_mutex_t * mutex )
This function locks the mutex object pointed by mutex, giving the calling thread access to the code after it.
Retrun Value: 0 if a lock is acquired on the given mutex object, otherwise an error code is returned.
int pthread_mutex_unlock ( pthread_mutex_t *mutex )
This function gives away the lock on the mutex object, and leaves opportunity of other threads to lock it.
Return Value: 0 if sucess, error code otherwise.
The above examples only illustrated basic use of pthread funtions to create multi-threaded parallel computing programs. You might want to look at more detailed tutorial and pthread manual for reference. Here is a detailed tutorial for pthread.
Problem Description (Subset Sum Problem)
Having read through the examples above and can create simple multi-threaded programs, here is the real problem for you to solve, in a multi-threaded way:
Given a nonempty set S, with elements {a1, a2, a3, ...}, and a integer N, is there a nonempty subset of S, such that the sum of all elements in this subset equals to N? (subset sum)
This problem is NP-complete, meaning that currently there is no way to sovle it efficiently. However, since this assignment is targeted for parallel computing, the algorithm is not an important part. You only need to take the simplest approach: enumerates all the subset of S, adds up the elements and compares to N, if there is a match, output "Yes", otherwise, a "No" is given.
For any nonempty set S with size m, there are 2^m total subsets (including the empty one), so the time such program will take is at order of m*2^m, which is a huge number for large input size m. With multi-threading, this time can be reduced since the works can be divided and done in parallel. You task is to write a multi-threaded program using pthread, to sovle this subset problem and find the answer.
Skeleton Files
Some skeleton files are provided to help you start on writing the program. The main function is provided in ipa4_main.c to handle file I/O and invokes the function subsetequal, and it also computes the time used for the funtion. The function subsetequal is to be implemented by you to actually solve the subset problem. Its prototype is provided below. You have to implement the function and not change the function interface.
int subsetequal( int * setS, int size, int numN);
It takes an integer pointer setS, which points to the location of the set array; an integer size, indicating the number of elements in the set; am integer numN, which is the integer N that will be matched against the sum of subsets of setS.
The function returns an integer value, with 1 indicating success, there is a subset whose sum equals to numN, and 0 meaning no such subset exists.
In the main function, subsetequal is invoked in the following way:
result = subsetequal(setS, size, numN);
You are not permitted to modify the main function provided in the skeleton source file.
A makefile is also provide to help you compile the program and test the correctness of the program. Please modify the make file to add your own source and header file for compilation, and you can check out the make targets in the makefile or at the end of this page. Your program should be build and run under Linux, and your program is graded in Linux machines only. Please make sure your program works on the ecegrid machines.
Please check out the hints and details of the Makefile at the end of this page.
Testcases
Test cases are provided along with the skeleton makefile, which you can run the test cases by running "make test" in a shell program. You can aslo download the testcases alone. Please read the README file in the testcases folder for detailed information of the provided testcases.
Sample Inputs and Outputs
Format
There are two input files, provided as arguments as the program. The first file is the set file, which contains a set of integers, one on each line. The second file contains a single integer on one line. The output is either "Yes" or "No", on its own line. The file input and output functionality is provided in the main function of the given skeleton source file. You only need to implement the subsetequal function to find whether there exist a subset of input set S that its sum of elements equals to input integer N. Sample input and output below are served for clarification and as short test cases. More testcases are provided in the zip file along with the skeleton file, and you can also download them seperately here.
Test 1:
setS numN Output
12 45 No
34
-3
66
91
201
-9
Test 2:
setS numN Output
47 134 Yes
72
45
-53
27
-95
35
66
-10
Requirements
-
You need to use the given skeleton files and modify only the Makefile and header file as needed. You have to implement the function subsetequal in a seperated source
file. Please check out on how to modify the Makefile in the Makefile section at the end of this page.
-
The program has to be implemented through pthreads and is using multiple threads to solve the given problem. There is no point for write a sequential program as the sample sequential solutions are provided.
-
Possible memory leaks need to be checked using valgrind and make sure no leak is in your code. Note that the still reachable field in valgrind report does not acutally means memory leaks, only definitely lost, indirectly lost and possible lost are memory leaks. You can run "make leakcheck" to detect memory leaks with the provided skeleton files.
-
Your code needs to follow the coding standards specified here. We will manually check this and you can lose points if standards are not followed. You can find examples of how to follow the coding standards here.
Submission
You should submit your source file that implements the subsetequal function, along with any neccessary files, including your own header file, if you used one, and the provided ipa4.h, ipa4_main.c, and also the modified Makefile. Please make sure that you program can be successfully compiled by run "make" on the ecegrid machines.
Hints
Although the approach to solve the problem in the above is very naive, enumerating all the subset of a given set can be somehow tricky. A sample sequential version of the program is provided to you to let you focus soley on the muti-threading part of the program.
The basic idea for enumerating all subsets of S is as following: using a mask for each element of set S, sets the mask accordingly for each iteration, if the value of a mask is 1, then the corresponding element is in current subset, otherwise that element is not included, after all masks are visited, update the mask for net subset, and repeat previous process, until all subsets have been enumerated.
Mask arrays: The sample sequential version uses a array of inegers, each corresponds to one element of set S. If an ineger in the array is 1, the correspoding element is included. Part of the sample solution illustrating enumeration of subsets using integer array is provided below, and you can check out the complete source code here.
// generate mask in lexicographical order, return 1 if next mask can be generated
int nextmask(int * mask, int size)
{
int i;
for (i = 0; i < size && mask[i]; i++) mask[i] = 0; // clear left most 1s in the mask
if (i < size) {
mask[i] = 1; // set next position to 1
return 1;
}
// return 0 if all masks have been enumerated
return 0;
}
int subsetequal(int * setS, int size, int numN)
{
int * mask = (int*) malloc(sizeof(int) * size); // build and initialize an array indicating whether the indexed element is in a subset
memset(mask, 0, sizeof(int) * size);
int i = 0, flag = 0, sum = 0; // flag is used to control loop structure, if the sum of a subset equals to numN, the loop stops and result is returned; otherwise it continues until all subsets are enumerated
while (!flag && nextmask(mask, size)) {
sum = 0; // always reset the sum for each iteration
for (i = 0; i < size; i++) {
if (mask[i]) {
sum += setS[i]; // element is included in current subset, add it to current subset' sum
}
}
if (sum == numN)
flag = 1;
}
free(mask);
return flag;
}
Bitset: Another way of enumerate all subsets is to use unsigned integer instead of array of integers, since an integer is group of bits with 0s and 1s, and each bit can be used to mask existence of an element in the subset. Here an unsigned integer is used because the sign bit should be used as a mask as well, so it won't waste a bit. However, an unsigned integer is only 32 bits long, and using a single unsigned integer would impose a limit on the range of input set size. The way to avoid that is to use multiple unsigned integers to represent masks with input set size larger than 32, which is essentially a bitset. A second sequential solution for this assigment is provided use multiple unsigned integers as masks. Simliar to the above, only part of the code is listed, and you can checkout the complete source code here .
// decrement the bitset
void decrement(unsigned int * bitset, int bsize)
{
while (--bsize >= 0 && !bitset[bsize]--);
}
// check whether all bits of the bits are zeros, if not, return 1
int hasbits(unsigned int * bitset, int bsize)
{
unsigned int n = 0;
while (bsize--)
n |= bitset[bsize];
return (n > 0);
}
// interface for check if there exists a subset of setS whose sum adds up to numN
// setS -- set of elements, size -- number of elements in the set, numN -- integer N
int subsetequal(int * setS, int size, int numN)
{
int flag = 0; // flag to control the loop structure
int bsize = size / (sizeof(unsigned int) * 8) + 1; // number of unsigned integer needed to represents all the bits of the mask
unsigned int * bitset = (unsigned int *) malloc(sizeof(unsigned int) * bsize); // create bitset using unsigned integers
int tmp = size % (sizeof(unsigned int) * 8); // determine how many left-over bits, those bits won't fill in a whole unsigned integer (on most machines this would be 4 bytes or 32 bits)
bitset[0] = tmp ? (1U << tmp) - 1 : ~0U; // determine whether the left-over is 32 bits
int i;
for (i = 1; i < bsize; i++)
bitset[i] = ~0U; // fill in rest of bit sets with all 1s
int sum;
unsigned int mask;
while (!flag && hasbits(bitset, bsize)) {
sum = 0;
tmp = bsize;
for (i = 0, mask = bitset[--tmp]; mask; i++, mask >>= 1U) { // bits are shifted to right, from the last unsigned ineger of the bitset, and least significant bit is used to determine whether the correspoding element is in the array
if (mask & 1U) {
//printf("%d ", setS[i]);
sum += setS[i]; // adds up the element to current subset's su,
}
if (mask == 0) {
mask = bitset[--tmp]; // if current unsigned integer mask is finished, use next one in bitset
}
}
//printf("sum=%d\n", sum);
if (sum == numN) {
flag = 1; // find a match
}
decrement(bitset, bsize); // decrement the bitset as a whole
}
free(bitset);
return flag;
}
Please download the skeleton files and run the sample sequential solutions with provided test cases: mask array and bit set. You can use the codes in the sample solutions and make neccessary modifications, but you must implement you solution in parallen using pthread.
Makefile
For the makefile provided, the following describes its use:
Usage:
all -- compiles ipa4
clean -- removes all temporary files
debug -- compiles ipa4 for debug
test -- tests the compiled program with provided testcases of set size smaller than 28
testall -- tests the compiled program with all testcases, including input set of large size
leakcheck -- checks for potential memory leaks
You need to add your own header and source file in the corresponding section of the Makefile. The sections are as following, and they should only be modified if needed.
SOURCE :=
HEADER :=
LIBRARY :=
For example, if you implement your parallel subsetequal function in a file named sub.c. You need to edit the "SOURCE :=" line in the Makefile so it looks like this:
SOURCE := sub.c
You can also add more source files, provided that they in the same line as "SOURC := " and each is delimited by whitespace. If you have a header file sub.h, you need to add to the line "HEADER :=" so it becomes:
HEADER := sub.h
If you have used the math funcions such as pow or log, add the math library -lm to the line "LIBRARY :=", so it becomes:
LIBRARY := -lm
Note that the pthread library -lpthread needs NOT to be added since it is already included. More libraries can be appended at the end of the line with each delimited by a space. Also note that the provided source file ipa4_man.c and header file ipa4.h needs NOT to be added as they are included as well.
Debugging
Debugger is a very useful tool to find errors in your code. This is a video showing how to use gdb. Click here for more debugger information.
Version Control
You should use version control for every program. Click here for step-by-step instructions using Eclipse and SVN.
|