Loading...
Please wait, while we are loading the content...
Similar Documents
Efficient Parallel Implementation of Classical Gram-Schmidt Orthogonalization Using Matrix Multiplication
| Content Provider | Semantic Scholar |
|---|---|
| Author | Yokozawa, Takuya Takahashi, Daisuke Boku, Taisuke Sato, Mitsuhisa |
| Copyright Year | 2006 |
| Abstract | Extended Abstract The Gram-Schmidt orthogonalization process is one of the fundamental algorithms in linear algebra that implements the QR decomposition of a matrix into the factorization A = QR. Efficient Gram-Schmidt orthogonalization algorithms have been investigated thoroughly [1–3]. Two basic computational variants of the Gram-Schmidt process exist: the classical Gram-Schmidt (CGS) algorithm and the modified Gram-Schmidt (MGS) algorithm [4]. The modified Gram-Schmidt (MGS) algorithm is often selected for practical application because it is much more stable than the CGS algorithm. However, the MGS algorithm cannot be expressed by Level-2 BLAS, and so parallel implementation requires additional communication [5]. On the other hand, the CGS algorithm can be expressed by Level-2 BLAS and is suitable for parallelization. Moreover, the CGS orthogonalization with the DGKS correction [1] is one of the most efficient ways to perform the orthogo-nalization process. We present herein an efficient parallel implementation of the CGS orthogo-nalization using matrix multiplication. The CGS orthogonalization of a matrix can be changed into a matrix multiplication. The CGS orthogonalization can also be extended with matrix multiplication into a recursive formulation. The recursion leads to automatic variable blocking [6]. A recursive CGS algorithm to perform the QR decomposition is shown in Fig. 1. Let the matrix A be denoted as A = (a 1 a 2 · · · a n) and the matrix Q be denoted as Q = (q 1 q 2 · · · q n). Here, NB, S and w are the blocking size, the work matrix and the work vector, respectively. The function recursive CGS(A, Q, 0, n) performs the orthogonalization process of matrix A. We parallelized the recursive CGS algorithm using a column-wise distribution [3]. In order to evaluate the proposed recursive CGS algorithm, we compared its performance to that of the proposed recursive CGS algorithm and a naive implementation of the CGS algorithm using Level-2 BLAS. The CGS orthog-onalization processes were performed on double-precision real data. A 32-node Xeon PC cluster (Irwindale 3 GHz, 12 K uops L1 instruction cache, 16 KB L1 |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://pmaa06.irisa.fr/pres/14-Yokozawa-PMAA06.pdf |
| Alternate Webpage(s) | http://pmaa06.irisa.fr/papers_copy/41.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |