

Presentation based on materials/comments from Jason Cong, Yong Cui, William Hsu, Alfred Louis, Xiuhong Li, Yun Liang, Guojie Luo, Peter Maass, Thomas Page, Linjun Qiao, Frank Natterer, Thomas Schuster, Wentai Zhang.















## 9 of 40 High-level Synthesis C->FPGA • Energy-efficient accelerator-rich architecture

- Optimization under the performance, power, and cost constraints
- Extensive use of accelerators (algorithmic blocks)
- Limited onboard memory with full control including precision
   Extendable
- VivadoHLS of Xilinx
   <u>https://www.xilinx.com/products/design-tools/viva</u>
- How to design algorithms for FPGA

   to enable the advantages of FPGA
   Cong et al, High-Level Synthesis for FPGAs: From Prototyping to Deployment, 2011.



- Background & Motivation

   Energy-efficient computing with FPGA
- Asynchronous parallel iterative algorithms • Communication model
- FPGA implementation

   Techniques for implementation
- Summary

## Iterative methods Many algorithms have the structure

$$\begin{aligned} x(t+1) &= f(x(t)), \quad t = 0, 1, \cdots \\ x(t) &\in \mathbb{R}^n, f : \mathbb{R}^n \to \mathbb{R}^n. \end{aligned}$$

11 of 40

• Component form 
$$x_i(t+1) = f_i(x_1(t), \cdots, x_n(t)), i = 1, \cdots, n.$$

- It can be parallelized by letting each one of n
  - processors update a component  $x_i$  by  $f_i$ ,
  - at each stage, the *i*-th processor
    - has the value of all components of x(t) on which  $f_i$  depends,
    - computes the new value x<sub>i</sub>(t + 1),
      communicates it to other processors in order to start the next
    - iteration.
      Bertsekas and Tsitsiklis, Parallel and distributed computation : numerical method



## Jacobi and Gauss-Seidel Iterations

13 of 40

- Jacobi-type iteration
- all components, simultaneously, updated and made available for next iteration

$$x_i(t+1) = f_i(x_1(t), \cdots, x_n(t))$$

Gauss-Seidel Iteration
 components are updated, one at a time, using the most recently computed values of other components

 $x_i(t+1) = f_i(x_1(t+1), \cdots, x_{i-1}(t+1), x_i(t), \cdots, x_n(t))$ 

Bertsekas and Tsitsiklis, Parallel and distributed computation : numerical methods. 198

updated in a cyclic order from 1 to n (or p for block iterative methods).
One update of all components is a sweep.

- Gauss-Seidel Iteration
- They incorporate the newest available information.
- Hence, they sometimes converge faster than the corresponding Jacobi-type algorithms.
- Gauss-Seidel algorithms can have different updating orders for  $f_i$ .
- In addition to the cyclic order, there are other orders.

Bertsekas and Tsitsiklis, Parallel and distributed computation : numerical methods. 1989. Censor and Zenios, Parallel Optimization: Theory, Algorithms and Applications, 1997.



Bertsekas and Tsitsiklis, Parallel and distributed computation : numerical methods. 1989. Avron, et al, Revisiting asynchronous linear solvers: provable convergence rate through randomization, Journal of ACM, 2015.





























- High-level Synthesis C->FPGA
- Energy-efficient accelerator-rich architecture
  Can be optimized under the performance, power, and cost constraints
- Extensive use of accelerators (algorithmic blocks)
- Limited onboard memory vs others
   Extendable
- VivadoHLS of Xilinx

   https://www.xilinx.com/products/design-tools/vivado
- How to design algorithms for FPGA?
   > Asynchronous iterative parallel algorithms











| Data set | method | mean  | max   | min   | std    |
|----------|--------|-------|-------|-------|--------|
| 215mA    | ours   | 0.991 | 0.996 | 0.942 | 0.0611 |
|          | vendor | 1.000 | 1.000 | 1.000 | 0.0000 |
| 150mA    | ours   | 0.981 | 0.989 | 0.965 | 0.0077 |
|          | vendor | 0.997 | 0.998 | 0.995 | 0.0010 |
| 100mA    | ours   | 0.972 | 0.990 | 0.901 | 0.0258 |
|          | vendor | 0.997 | 0.998 | 0.995 | 0.0011 |
| 50mA     | ours   | 0.976 | 0.989 | 0.920 | 0.0188 |
|          | vendor | 0.982 | 0.998 | 0.995 | 0.0298 |



| 39 of 40                                                                                                                                                                                                                             | I F |  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|--|
| Summary                                                                                                                                                                                                                              |     |  |
| <ul> <li>Mumford-Shah regularization for XCT         <ul> <li>FPGA implementation with a quality comparable with the vendor's result.</li> </ul> </li> </ul>                                                                         |     |  |
| <ul> <li>Asynchronous parallel computation is necessary for<br/>architectures with rich computing units         <ul> <li>multi-core CPU/GPU</li> <li>accelerators rich FPGA</li> <li>multi-node supercomputer</li> </ul> </li> </ul> |     |  |
| <ul> <li>Algorithms with communication model         <ul> <li>Memory access/communication expense should not be<br/>ignored</li> </ul> </li> </ul>                                                                                   |     |  |

Thank you for your attention.

.

Supported by – National Science Foundation of China (61520106004). National Basic Research and Development Program of China (973 Program) (2015CB351803).