Implement a dense matrix multiplication of two N x N matrices (C = A x B, where C is also N x N) and then parallelize it with OpenMP.
1) Compare the performance of the OpenMP version with the serial version and
show how it scales as the number of threads increases.
2) Try different loop orders to see the impact of cache locality on performance (e.g., changing the
order of the outer loop and inner loop when doing the matrix multiplication). What is the
overall best performing version?
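
As a starting point for the loop-order experiment, here is a minimal sketch in C. It assumes row-major storage in flat arrays of doubles; the function names matmul_ijk and matmul_ikj are illustrative and not prescribed by the assignment. The first function is a serial baseline in the textbook i-j-k order. The second uses the i-k-j order, whose innermost loop walks B and C along rows with unit stride, and parallelizes the outer loop with OpenMP.

```c
#include <assert.h>
#include <stddef.h>

/* Serial baseline, i-j-k order: the innermost loop reads B down a column
   (stride n), which tends to be cache-unfriendly for row-major storage. */
void matmul_ijk(size_t n, const double *A, const double *B, double *C)
{
    assert(n > 0 && A && B && C);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

/* i-k-j order with the outer loop parallelized. Each iteration of i writes
   only row i of C, so threads never write to the same element. */
void matmul_ikj(size_t n, const double *A, const double *B, double *C)
{
    assert(n > 0 && A && B && C);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        double *c_row = C + (size_t)i * n;
        for (size_t j = 0; j < n; j++)
            c_row[j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            double a = A[(size_t)i * n + k];      /* reused across the j loop */
            const double *b_row = B + k * n;
            for (size_t j = 0; j < n; j++)
                c_row[j] += a * b_row[j];
        }
    }
}
```

With GCC or Clang the pragma takes effect when compiling with -fopenmp, and the thread count can be varied through the OMP_NUM_THREADS environment variable; without that flag the pragma is ignored and the code runs serially. Which order is fastest depends on the machine and on N, so it should be measured rather than assumed.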
Implement a parallel LU decomposition of a square matrix A (N x N) with OpenMP. Assume that the square matrix
A is generated randomly. Show the performance of the LU decomposition when changing the number
Hint: LU decomposition is one of the methods for solving square systems of linear equations. It
decomposes the matrix A into a product of two matrices: a lower triangular matrix L and an upper
triangular matrix U. The decomposition can be represented as follows:
A = LU
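
One way to realize A = LU is Gaussian elimination in the Doolittle form, where L has a unit diagonal. The sketch below is a hedged illustration, not a required design: it assumes row-major storage, performs no pivoting, and overwrites A in place so that the strict lower triangle holds L and the upper triangle (including the diagonal) holds U. The name lu_inplace is hypothetical. Because there is no pivoting, it reports failure on an exactly zero pivot; a randomly generated matrix may need partial pivoting, or can be made diagonally dominant, for numerical stability.

```c
#include <assert.h>
#include <stddef.h>

/* In-place LU decomposition without pivoting (Doolittle form).
   On success returns 0 and A holds L below the diagonal (unit diagonal
   implied) and U on and above the diagonal. Returns -1 on a zero pivot. */
int lu_inplace(size_t n, double *A)
{
    assert(n > 0 && A);
    for (size_t k = 0; k < n; k++) {
        double pivot = A[k * n + k];
        if (pivot == 0.0)
            return -1;  /* would need row exchanges (pivoting) to continue */

        /* Steps over k are sequential, but for a fixed k the rows below the
           pivot row are updated independently: row k is only read, and each
           row i > k is written by exactly one thread. */
        #pragma omp parallel for schedule(static)
        for (long i = (long)k + 1; i < (long)n; i++) {
            double *row_i = A + (size_t)i * n;
            const double *row_k = A + k * n;
            double m = row_i[k] / pivot;   /* multiplier, an entry of L */
            row_i[k] = m;
            for (size_t j = k + 1; j < n; j++)
                row_i[j] -= m * row_k[j];  /* eliminate; result is part of U */
        }
    }
    return 0;
}
```

For example, for the 2 x 2 matrix with rows (4, 3) and (6, 3), the call leaves rows (4, 3) and (1.5, -1.5) in A, which corresponds to L with rows (1, 0), (1.5, 1) and U with rows (4, 3), (0, -1.5). The amount of parallel work shrinks as k grows, which is worth keeping in mind when interpreting scaling results.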
Each program must work correctly and be well documented. You should hand in:
1. Report file (in PDF format): This is your chance to explain your solution, possible
optimization, any insights gained, problems encountered, etc. The report should include the
performance results for your solution and your analysis of results.
2. Readme: This file should include the instructions to build and run your program.
3. Source Code: You must hand in all your source code.
4. Output file with timings of your performance testing. It should be consistent with your