Efficient topology optimization using GPU computing with multilevel granularity

Jesús Martínez-Frutos, Pedro J. Martínez-Castejón, David Herrero-Pérez*

Computational Mechanics & Scientific Computing Group, Department of Structures and Construction, Technical University of Cartagena, Campus Muralla del Mar, 30202 Cartagena, Murcia, Spain

Abstract

This paper proposes a well-suited strategy for High Performance Computing (HPC) of density-based topology optimization using Graphics Processing Units (GPUs). Such a strategy takes advantage of Massively Parallel Processing (MPP) architectures to overcome the computationally demanding procedures of density-based topology design, both in terms of memory consumption and processing time. This is done exploiting data locality and minimizing both memory consumption and data transfers. The proposed GPU instance makes use of different granularities for the topology optimization pipeline, which are selected to properly balance the workload between the threads exploiting the parallelization potential of massive parallel architectures. The performance of the fine-grained GPU instance of the solving stage is evaluated using two preconditioning techniques. The proposal is also compared with the classical CPU implementation for diverse topology optimization problems, including stiffness maximization, heat sink design and compliant mechanism design.

Keywords: GPU computing, Topology optimization, Compliance, Compliant mechanism, Heat conduction.

*Corresponding author

Email addresses: jesus.martinez@upct.es (Jesús Martínez-Frutos), pedro.castejon@upct.es (Pedro J. Martínez-Castejón), david.herrero@upct.es (David Herrero-Pérez)
1. Introduction

Topology optimization aims to find the optimal distribution of material within a design domain such that an objective function is minimized under certain constraints [1]. Contrary to size and shape optimization methods, topology optimization permits to obtain a material distribution without assuming any prior structural configuration. This provides engineering designers with a powerful tool to find innovative and high-performance conceptual designs at the early stages of the design process. Not to mention the great impact of the optimization of geometry and topology on the structural performance. This problem has sparked a broad interest since the early work of Bendsøe and Kikuchi [2], giving rise to a multitude of studies in a wide range of physics problems, such as stiffness maximization of structures [3], design of compliant mechanisms [4], maximization of temperature diffusivity [5], and minimization of acoustic emission [6], to name but a few [7].

Shape and topology optimization methods can be broadly classified into three main categories depending on the representation used to describe the shapes they involve: density-based methods, Eulerian methods and Lagrangian methods. The methods included in the first category operate on a fixed grid of finite elements and seek an optimal void/solid material distribution that minimizes an objective function. The homogenization method [2, 8] and the Solid Isotropic Material Penalization (SIMP) method [1, 9] are some examples of the most popular topology optimization approaches included in this category. The second category is composed of methods that use an implicit representation of the structural boundary. Such a boundary can be modified by tracking the motion of a level-set function, as is done in the Level-Set Method (LSM) [10, 11, 12], or by evolving the interfacial dynamics of phase field equations, as occurs in the phase field models [13]. The third category is composed of methods that use an explicit representation of the structural shape by means of a computational mesh or CAD model [14, 15]. This work deals with the efficient computation of density-based topology optimization methods using GPU computing.
Despite the great advances made in theory and practical application of topology optimization in the past decade, the computational requirements still remain as a primary challenge [7]. This is due to some demanding tasks involved in the topology optimization pipeline, such as the solving of large systems of equations, the computation of sensitivities and the filtering strategy. Such tasks may increase meaningfully the computation time of the topology optimization process, which may take hours or even days for relatively large models. High Performance Computing (HPC) is then needed to address the topology optimization process, normally making use of task-level parallel computing to address the computationally intensive tasks [16, 17, 18].

The use of Graphics Processing Units (GPUs) for non-graphics applications is rapidly growing in popularity [19, 20]. This is due to the high computing capacity of these graphics cards for Massively Parallel Processing (MPP) at reasonable cost. GPU computing consists of the use of a GPU together with a CPU to accelerate compute intensive applications. This is not a simple goal since there exist numerous problems that prevent the use of GPU computing for certain scientific applications, such as memory related problems and lack of data-level parallelism. The memory related problems include excessive global memory transactions, non-coalesced global loads and stores that degrade global memory bandwidth, and shared memory accesses inducing bank conflicts, to name but a few. The lack of data-level parallelism prevents the exploitation of Single Instruction Multiple Data (SIMD) parallel computation for which GPU architectures are designed. Therefore, the proper implementation of topology optimization methods using GPUs requires a suitable formulation and selection of techniques allowing making use of the potential acceleration of massive parallel architectures and preventing memory related problems [21], which constraint severely the GPU performance.

GPU computing has been successfully used in diverse engineering and scientific problems requiring numerical analysis. One can mention the acceleration of the solving of parametric integral equations in elasticity [22], system of equations in finite element problems [23] and peridynamic systems in peridy-
namic models of solid mechanics [24, 25, 26]. These graphics devices are also successfully used for real-time simulation and haptic feedback of soft tissue deformations [27, 28], which are especially useful for the development of realistic simulators. Besides, relevant results are obtained for the structural solver of finite element explicit dynamics problems using GPU computing [29]. These graphics cards have also shown promising results in heterogeneous systems addressing large-scale problems in Finite Element Analysis (FEA) [30]. The use of these devices to speedup computationally demanding tasks in the topology optimization pipeline has sparked a broad interest last years, giving rise to several studies.

The early work of Wadbro and Berggren [31] aims to solve large topology optimization problems using a gradient-based optimality criterion method. This early work implements a Preconditioned Conjugate Gradient (PCG) method on GPU to solve high resolution finite element models arising in heat conduction topology optimization problems. The grain size of this GPU instance is at the element level. The lack of native double-precision support for early GPUs limited the GPU instance to single-precision format, which not ensures the convergence of the solver due to round-off errors. A nodal-wise assembly-free GPU implementation for the solver of the SIMP method is proposed in [32]. Applied to the minimization of the structural compliance problem, this GPU instance achieves significant speedups following the strategy of loading three successive 2D slices of the third dimension into shared memory to perform the computations required by the middle slice efficiently. Such a slice-wise and nodal-based strategy is also adopted in [33] achieving speedups of one order of magnitude for the solving of the system of equations of elasticity. GPU computing is also used to increase the tractable computational resolution of topology optimization problems using discrete level-set methods [34] and evolutionary structural optimization methods [35].

GPU computing using the sparse-matrix representation permits to efficiently assembly and solve the system of equations of elasticity [36]. However, a higher performance can be achieved exploiting the grid regularity and performing the
operations “on-the-fly”. The former permits to exploit data locality providing reduced memory accesses and making use of on-chip memory, which is much more efficient than global device memory. The latter avoids storing the matrix of coefficients explicitly in the global device memory, which affects seriously the GPU performance. For these reasons, matrix-free GPU implementations using regular grids show good performance results for the Finite Element Analysis (FEA). The GPU instance of PCG solver using geometric multigrid preconditioning in topology optimization configured to perform a reduced number of FEAs and iterations per FEA, permitted Wu et al. [37] to solve large-scale problems in a short time. This is done configuring the iterative method with low tolerance level along with SIMP method using standard Optimality Criteria (OC) method [1]. The GPU instance is based on the node-wise GPU parallelization proposed by Dick et al. [38], where the grid regularity is exploited to perform coarsening and matrix-vector operations efficiently. Besides, the operations at the finest level are performed “on-the-fly” to increase the GPU performance.

This paper proposes a multi-granular GPU implementation of the different stages involved in density-based topology optimization methods. On the one hand, a fine-grained GPU implementation of matrix-free PCG solver for structural analysis is adopted. The regularity of the grid permits to exploit data locality maximizing the GPU performance for FEA [39]. The granularity of matrix-vector multiplication operations is at the Degree of Freedom (DoF) level, which allows reducing and balancing the workload for all the threads of the MPP architecture [23, 40]. Another key point for increasing the GPU performance is that matrix-vector multiplication operations are launched by three-dimensional kernels using cache data in shared memory for the corresponding three-dimensional blocks. This strategy permits to increase the use of on-chip memory, i.e. the cache data in shared memory, by the threads performing the operations for the corresponding DoF. This achieves a significant improvement for solving the system of equations using GPU computing. The performance of the GPU instance for the solving stage is evaluated in terms of speedup and wall-clock time analyzing two preconditioning techniques; in particular, Jacobi
Figure 1: (a) Thread batching and memory model and (b) memory hierarchy of CUDA.
preconditioner and geometric multigrid preconditioner. On the other hand, the
calculation of sensitivities, filter and density update are also implemented using
GPU computing in order not to limit the theoretical speedup according to
Amdahl’s law [41]. The granularity of such tasks is at the finite element level
due to the nature of the operations do not allow us to reduce the grain size,
which usually improves the GPU performance. The proposed matrix-free GPU
instance is compared to the classical sparse-matrix CPU implementation for
diverse topology optimization problems, including stiffness maximization, heat
sink design and compliant mechanism design. The speedups and relative wall-
clock time in density-based topology optimization pipeline is also studied for
such topology optimization problems.

The paper is organized as follows. Section 2 provides an overview of the
GPU architecture and the CUDA programming model. The bases and the the-
etorical background of density-based topology optimization methods are briefly
reviewed in section 3. The proposed GPU implementation of SIMP method is
presented in section 4. Section 5 is devoted to the numerical experiments and the
performance evaluation of the proposed matrix-free GPU instance with respect
to the classical sparse-matrix CPU implementation. Finally, the conclusion of
the proposed GPU instance is presented in section 6.

2. GPU and CUDA architecture

GPU devices were initially designed to satisfy the market demand of real-
time and realistic 3D visualization. The use of these graphic cards, with mas-
sively parallel architecture, in non-graphics HPC applications is becoming very
popular due to their high computing capacity at a reasonable cost. Currently,
the use of Nvidia devices and its programming model, Compute Unified De-
vice Architecture (CUDA) [42], is the prevailing tendency, which is adopted in
the developments presented in this work. Such a programming model allows
to view the GPU as a compute device able to perform data-parallel computa-
tion (data/SIMD parallelism) using multiple cores. The parallel code (single
instruction) is defined as a C Language Extension function, called kernel, which is executed by a lot of CUDA threads using different data (multiple data). The kernel call, invoked from the host (CPU) to the device (GPU) as shown in Figure 1(a) taken from [43], should specify the number of CUDA threads organized as a grid of thread blocks.

The CUDA threads have only access to the device SGRAM (Synchronous Graphic Random-Access Memory), a type of DRAM (Dynamic Random-Access Memory) with high bandwidth interface for graphics-intensive functions, and to the on-chip SRAM (Static Random-Access Memory) through the memory spaces depicted in Figure 1(a). The blocks are batch of threads able to cooperate by sharing data through shared memory and to synchronize their execution coordinating memory accesses. A key point is that CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). The blocks of the grid, invoked by each kernel, are distributed to SMs depending on their execution capacity, which includes on-chip memory resources. The use of on-chip memory, much faster than SGRAM memory, is of paramount importance to increase significantly the GPU performance.

For that reason, the CUDA memory hierarchy, shown in Figure 1(b), is crucial to optimize memory access and achieve a reasonable performance. We can observe that each SM has the following on-chip memory: one set of registers (R) per processor (C) and a shared memory, a read-only constant cache and a read-only texture cache. These memory resources are shared by all cores (C_i) of such a SM. This fact implies that the amount of blocks that a SM can process at once depends on the number of registers per thread and the shared memory per block required for a given kernel. For this reason, the use of shared memory can show relatively poor performance for computation using large arrays. CUDA cannot schedule more blocks to SMs than the multiprocessors can support in terms of shared memory and register usage, and thus the occupancy (number of active warps) is deteriorated. A key point for the proposed GPU instance is that the constant memory is stored in SGRAM but data are read through each multiprocessor constant cache, which is on-chip memory. Constant memory is
also optimized for broadcast, i.e. when warp of threads read same location, but it is however serialized when warp of threads read in different locations. For these reasons, the proposed GPU instance uses constant memory for storing the common elemental stiffness matrix, whereas the use of shared memory is limited to unknowns and elemental densities. Such a strategy achieves a significant GPU performance for the calculations involved in density-based topology optimization.

The software developments using CUDA consist in the following steps: i) memory allocation and transaction, ii) kernel execution on GPU and iii) copy back the results to the host. The strategies to optimize code in GPU computing can be summarized as follows: i) optimization of parallel execution to achieve maximum use of cores, ii) optimization of memory management to facilitate coalesced memory accesses, iii) optimization of instruction usage to achieve maximum instruction performance, and iv) optimization of communications to achieve minimal synchronization between parallel executions. The different effects of the proposed GPU implementation can be explained using these optimization criteria.

3. Density-based topology optimization

Topology optimization can be defined as a binary programming problem that aims to find the optimal material layout (solid and void) that minimizes an objective function. Such a material layout should satisfy a set of prescribed constraints in the design domain. Density-based methods are the most widely used topology optimization methods due to its conceptual simplicity, which has facilitated its application in industrial software [44]. In these methods, the integer-based topology optimization problem is relaxed to a formulation based on artificial continuous material densities, which permits the use of gradient-based solvers. The topology optimization problem can be stated as
\[
\min_{\rho} \quad f(\rho, \mathbf{u})
\]
\[
\text{s. t. : } \quad K(\rho) \mathbf{u} = \mathbf{f}
\]
\[
V(\rho) \leq V^* \\
0 \leq \rho(x) \leq 1, \quad x \in D
\]

where \( f \) is the objective function, \( \rho \) is the vector of density design variables, \( \mathbf{u} \) is the system response, \( K \) is the global stiffness matrix, \( \mathbf{f} \) is the force vector and \( x \) is the vector of finite elements. The design domain is denoted by \( \mathcal{D} \) and the volume of material \( V(\rho) \) is constrained to be smaller than a prescribed target \( V^* \). The unknown densities, \( \rho(x) \), are used to scale the stiffness of the finite elements of the regular grid. In practice, this parametrization leads to designs with large areas of intermediate densities which, even though being numerical optimal, are impossible to manufacture. This problem is normally addressed using implicit relaxation/penalization techniques, which drive the topology design towards solid/void configurations. The SIMP method \([9, 45]\) makes use of such implicit penalization techniques by a power-law interpolation function between void and solid to determine the stiffness matrix of each element \( K_e \) as follows

\[
K_e = K_{\min} + \rho_e^p (K_0 - K_{\min}),
\]

where \( K_0 \) and \( K_{\min} > 0 \) are the stiffness matrix of solid and void material respectively, and \( p > 1 \) is the penalization power. For problems where the volume constraint is active, Bendsøe and Sigmund \([46]\) prove that the power-law interpolation function is perfectly valid when \( p \) is sufficiently large. In particular, \( p \geq 3 \) is usually required to obtain black-and-white designs.

Although the use of material interpolation schemes enables to obtain almost solid-and-void designs, they destroy the convexity of the optimization problem increasing the risk of ending in local minima. However, it is common to use continuation methods to mitigate the premature convergence to local minima when solving the optimization problem, see e.g. \([47]\). According to \([48]\), continuation
methods take “global” information into account and are more likely to ensure “global” convergence or at least convergence to better designs. Different continuation methods have been proposed based on the idea of gradually change the optimization problem from a convex problem to the original non-convex problem.

The topology optimization problem should also be regularized using additional constraints on the density field to avoid numerical difficulties and modeling problems, such as mesh-dependency of solutions and checker-board patterns [1] respectively. The sensitivity filter [4] is adopted in this work because it has proven to be effective in practice producing mesh-independent solutions. Moreover, the filtering of gradients has a continuum mechanics motivation and may promote convergence of some length scales over others, and thereby speeds up convergence [49]. Furthermore, the sensitivity filter has computational advantages because it is not included in the OC updating scheme loop. One drawback is, however, that there remain discrepancies between the filtered sensitivities and the actual sensitivities, i.e. the modified sensitivities do not completely correspond to the objective function. In theory, this may lead to some divergence problems though proper designs are obtained in practice. The sensitivity filter applies a smoothing filter to the derivatives of the objective function as follows

\[
\frac{\partial f(\rho)}{\partial \rho_e} = \frac{\sum_{i \in NB_e} w(x_i, x_e) \rho_i \frac{\partial f(\rho)}{\partial \rho_i}}{max(\gamma, \rho_e) \sum_{i \in NB_e} w(x_i, x_e)}, \tag{3}
\]

where \(NB_e\) is the neighborhood set of an element \(e\), \(w(x_i, x_e)\) is a weighting function and \(\gamma > 0\) is a small value to prevent the division by zero. The linear weighting function is used in this work, which is defined as

\[
w(x_i, x_e) = \begin{cases} 
R - ||x_i - x_e|| & \text{if } ||x_i - x_e|| \leq R \\
0 & \text{if } ||x_i - x_e|| > R 
\end{cases},
\]
whereas the neighborhood set of an element \( e \) is defined as

\[
NB_e := \{ i \mid \text{dist}(i,e) \leq R \},
\]

(4)

where \( R \) is the filter size and \( \text{dist}(i,e) \) is the Euclidean distance between the center of element \( i \) and the center of element \( e \).

The SIMP method can be applied to diverse physics problems. The numerical experiments of this work address the stiffness maximization of continuum structures, the heat sink design cooled by heat conduction and the compliant mechanism design problems. The objective function for the minimization of structural compliance (maximization of stiffness) and the minimization of thermal compliance (maximization of heat transfer) is given by

\[
f = c = f^T u,
\]

(5)

whereas the objective function for the compliant mechanism design, consisting of the maximization of output displacements, is as follows

\[
f = -u_{out} = -l^T u,
\]

(6)

where \( f \) and \( u \) are the global force/thermal load and displacement/temperature vectors respectively for elasticity/heat transfer problems, and \( l \) is a vector with ones at the Degrees of Freedom (DoF) of the output displacements \( u_{out} \) and zeros in all other positions. Considering the discretized linear state system \( Ku = f \) and using the adjoint state method, the sensitivity of (5) and (6) with respect to the state variables \( \rho \) is as follows

\[
f_{\rho} = \frac{\partial f}{\partial \rho} = -u^*^T \frac{\partial K}{\partial \rho} u = -u^*^T (p\rho^{-1} (K_0 - K_{min})) u,
\]

(7)
where $u^*$ is given by the solution of the adjoint problem as follows

$$Ku^* = \frac{\partial f}{\partial u}.$$  \hspace{1cm} (8)

The right hand side of the adjoint problem is $\partial f / \partial u = f$ for the minimization of both structural and thermal compliance, which means that these problems are self-adjoint and the solution of (8) is $u^* = u$. Conversely, the right hand side of (8) is $\partial f / \partial u = 1$ for the compliant mechanism design, which requires the solving of the adjoint system (8) to obtain $u^*$.

The sensitivities given by (7) permit to update the design variables $\rho$ using sequential convex approximations, such as the Sequential Quadratic Programming (SQP) [50] and the Method of Moving Asymptotes (MMA) [51]. The Optimality Criterion (OC) updating scheme proposed by [52] and modified by [53] is adopted in this work due to its numerical efficiency. The OC updating scheme is as follows

$$\rho_{e_k+1} = \begin{cases} 
\max\{(1 - \zeta)\rho_{e_k}, 0\} & \text{if } \rho_{e_k} B^n_{e_k} \leq \max\{(1 - \zeta)\rho_{e_k}, 0\}, \\
\min\{(1 + \zeta)\rho_{e_k}, 1\} & \text{if } \min\{(1 + \zeta)\rho_{e_k}, 1\} \leq \rho_{e_k} B^n_{e_k}, \\
(\rho_{e_k} B^n_{e_k})^q & \text{otherwise},
\end{cases} \hspace{1cm} (9)$$

where $\zeta$ is a positive step width, $\eta$ is a numerical damping coefficient, $q$ is a penalty factor to further achieve black-and-white topologies (typically $q = 2$) and

$$B_{e_k} = -\frac{\partial f(\rho)}{\partial \rho_e} \left( \lambda \frac{\partial V(\rho)}{\partial \rho_e} \right)^{-1} = 1 \hspace{1cm} (10)$$

is the Karush-Kuhn-Tucker (KKT) optimality condition. The Lagrange multiplier $\lambda$ is found using the bisection method. The algorithm stops when the maxi-
The FEA is the principal bottleneck of the topology optimization pipeline. This stage involves two computational intensive tasks: the assembly of the local element equations into a global system of equations and the solving of such a system. These computational intensive tasks can lead to an unaffordable problem in terms of computation time and memory consumption. This problem is exacerbated when dealing with large-scale models [54] or when the system response needs to be re-evaluated, as occurs in topology optimization. Iterative solvers and assembly-free methods have been extensively used for reducing the memory requirements of FEA at the cost of increasing the processing time of the solve step, which is commonly alleviated using parallel computing.

4. GPU implementation of SIMP method

GPU computing is used to accelerate the computationally intensive tasks involved in the SIMP method. Such tasks are shown in the flowchart depicted in Figure 3; in particular, the Finite Element Analysis (FEA), the calculation of the sensitivities, the filtering strategy and the OC updating scheme. The custom-developed CUDA kernels and the techniques adopted for the efficient implementation of these computationally intensive tasks on GPU architectures are detailed below.

4.1. Finite Element Analysis (FEA)

In this work, the performance of the solving stage is evaluated using a matrix-free PCG method with two different preconditioning techniques: geometric multigrid preconditioner and Jacobi preconditioner. The geometric multigrid methods are based on the smoothing property and the coarse grid principle. The former reduces the high frequency error components whereas the latter approximates the low frequency error components on coarser grids, which are then prolonged to the finer grids. Their major advantage is that they have...
an asymptotically optimal complexity of $O(N)$ and provide mesh-independent convergence and good parallel scalability [55]. However, the performance of these methods deteriorates with increasing contrast in material properties [56]. This is attributed to the coarsening across discontinuities which affects to the coarse grid correction [57]. Nevertheless, the use of geometric multigrid as preconditioning technique shows good convergence rates for topology optimization problems using a sufficiently strong smoothing operator [58].

The use of a regular grid permits to calculate and store the common elemental stiffness matrix at the finest grid $K^0$ only once at the beginning of the optimization, whereas the global matrix $K$ at the finest grid can be calculated “on-the-fly” using the elemental properties $d = E_{\text{void}} + \rho_p(E_{\text{solid}} - E_{\text{void}})$ (for simplicity) for each analysis. This reduces meaningfully the use of device memory and permits to exploit the data locality [39]. Such an approach is enough for the Jacobi preconditioning but the geometric multigrid preconditioner requires the assembled coefficients at the coarser levels, which are computationally intensive to calculate “on-the-fly” and require significant memory resources when they are stored. In particular, a Galerkin-based coarsening is required and the assembled coefficients at the coarser levels need to be stored. This deteriorates the GPU performance due to the global memory accesses through large memory, which does not permit to exploit data locality.
Algorithm 1: DbD PCG algorithm (dbdPCG)

Data: $K_0^p$, $f_0$, $g_0$, $d$, tol, $\kappa_{max}$, $\nu_1$, $\nu_2$, $\nu_3$, $\omega$, $I_c$, $I$, $C$, $D_{F,n}$, $n_x$, $n_y$, $n_z$
Result: $u$

1. $\rho_0 \leftarrow 0$; $\gamma_0 \leftarrow 0$; $k \leftarrow 0$; $f \leftarrow 0$; // initialization
2. $u \leftarrow u_0$; $f \leftarrow f_0$; // Device initialization
3. $r \leftarrow dbdMVP(u, d, K_0^p, f, I_c, I, C, D_{F,n}, n_x, n_y, n_z)$; // $r \equiv K^p u$

4. $\leftarrow threadId + BlockDim \times BlockId$; // CUDA kernel (dbdKer1)
5. if ($u < N_{D_{B,F}}$) then
6.   $f(u) \leftarrow f(u)$
7. end

9. if Multigrid then // Multigrid preconditioning
10.   $x \leftarrow VCycle(K_0^p, r, d, f, n_x, I_c, I, C, \nu_1, \nu_2, n_x, n_y, n_z, D_{F,n})$.
11. else if Jacobi then // Jacobi preconditioning
12.   $x \leftarrow dbdJacP(d, K_0^p)$
13. end
14
15. $u \leftarrow threadId + BlockDim \times BlockId$; // CUDA kernel (dbdKer2)
16. if ($u < N_{D_{B,F}}$) then
17.   $g(u) \leftarrow g(u)$
18.   $z(u) \leftarrow z(u)$
19. end

22. $\rho_0 \leftarrow \rho_0 + \frac{\|g(u)\|^2}{\|f(u)\|^2}$; // Reduction using thrust library
23. $\gamma_0 \leftarrow \gamma_0 + \frac{\|g(u)\|^2}{\|f(u)\|^2}$
24. while ($\sqrt{\gamma_0} > tol \sqrt{\rho_0}$) and ($k < \kappa_{max}$) do
25.   $k \leftarrow k + 1$
26.   $a \leftarrow dbdMVP(p, d, K_0^p, f, I_c, I, C, D_{F,n}, n_x, n_y, n_z)$; // $a \equiv K^p u$
27. $\phi_k \leftarrow 0$

29. $u \leftarrow threadId + BlockDim \times BlockId$; // CUDA kernel (dbdKer3)
30. if ($u < N_{D_{B,F}}$) then
31.   $g(u) \leftarrow a(z(u), f(u))$
32. end
33
34. $\phi_k \leftarrow \phi_k + \frac{\|g(u)\|^2}{\|f(u)\|^2}$; // Reduction using thrust library
35. $\rho_k \leftarrow \rho_k - \rho_k - 1$
36
37. $u \leftarrow threadId + BlockDim \times BlockId$; // CUDA kernel (dbdKer4)
38. if ($u < N_{D_{B,F}}$) then
39.   $u \leftarrow (u) + \phi_k p(u)$
40. end
41. if Multigrid then // Multigrid preconditioning
42.   $x \leftarrow VCycle(K_0^p, r, d, f, n_x, I_c, I, C, \nu_1, \nu_2, n_x, n_y, n_z, D_{F,n})$
43. end
44
46. $u \leftarrow threadId + BlockDim \times BlockId$; // CUDA kernel (dbdKer5)
47. if ($u < N_{D_{B,F}}$) then
48.   $z(u) \leftarrow (u)$
49. end
50. $\phi_k \leftarrow \phi_k + \frac{\|g(u)\|^2}{\|f(u)\|^2}$; // Reduction using thrust library
51. $\rho_k \leftarrow \rho_k - \rho_k - 1$
52
54. $u \leftarrow threadId + BlockDim \times BlockId$; // CUDA kernel (dbdKer6)
55. if ($u < N_{D_{B,F}}$) then
56.   $z(u) \leftarrow (u) + \phi_k p(u)$
57. end
58 end
The GPU instance to calculate and store the assembled matrices of coefficients $C$ at the coarser levels is of paramount importance for an efficient geometric multigrid implementation. The use of a regular grid permits to know a priori that the contributions to the matrix of coefficients are bounded by 8 elements and by 27 nodes per node. This permits to set the maximum size of global stiffness coefficients per node, which is bounded by 27 matrices of dimension $3 \times 3$. Such 27 matrices are related to the contributions of the $3^3$ grid neighborhood of the node. This storage scheme requires the indexes of the adjacent nodes, which are stored on a vector $I$ of integers where -1 means that the node does not exit. The storage of the global stiffness matrix $C$ per node has a similar size than using a sparse-matrix representation but permits to allocate the required memory for the assembly at the beginning, which has significant computational benefits for GPU computing.

The coefficient matrices for the coarser levels are obtained from the finer levels using a Galerkin-based coarsening following $\ell+1 C = R^{\ell+1} \ell C P^{\ell+1}$, where $R$ and $P$ are the restriction and prolongation operators respectively. Following [38], the coarsening operation is computed in a node-by-node matrix-free fashion using a two-step approach. Firstly, a linear combination of the $3^3$ fine grid neighborhood of considered node is performed, corresponding to a linear combination of the rows of $\ell C$. Secondly, these coefficients are interpolated to the coarser grid vertices, corresponding to a linear combination of the columns of $\ell C$. These operations require the vector $I_e$ of indexes of the elements contributing to each node. The GPU instance for the coarsening is performed assigning one CUDA thread to the calculation of each one of the 27 matrices of coefficients per node of the coarser level. This fine granularity provides good performance in the calculation of the matrices of coefficients at the coarser levels. The assembled matrices of coefficients $\ell C$ at the coarser levels $\ell$ are calculated and stored in the device memory. The global matrix $K$ of the finest grid is calculated “on-the-fly” using the elemental matrix $K_0^e$ and the elemental properties $d$ of the topology optimization. This is done for both preconditioners and allows exploiting data locality alleviating mostly of memory related problems in GPU
Figure 2: Cache data in shared memory by 3D block.

architectures.

The pseudocode of the DoF-by-DoF (DbD) PCG or dbdPCG GPU instance using both preconditioners for FEA is shown in Algorithm 1. Such an algorithm assumes that the hierarchical grids are composed of equally sized first-order isoparametric hexahedral elements. For the finest grid, this tessellation provides a set $V$ of $N_{ele}$ elements, a set $N$ of $N_{nod}$ nodes and a set $W$ of $N_{DoF}$ unknowns. The vector $I_c$ indicating the boundary conditions per node is also required to impose the Dirichlet conditions to the corresponding DoFs at the finest level. Thus, the input data of the dbdPCG algorithm for the finest level are the common elemental stiffness matrix $K^e$, the vector of forces $f_0$, an initialization of displacements $u_0$, the vector of elemental properties $d$ of the topology optimization, the vector $I_c$ indicating the boundary conditions per node, the number of divisions of the grid $(n_x, n_y, n_z)$ and the number of DoF per node $DoF_n$ ($DoF_n = 3$ for elasticity and $DoF_n = 1$ for heat conduction problems). Additionally, for the coarser level the vector $I$ of adjacent nodal indexes per node and the assembled matrices of coefficients $C$ per level are also included as input data. Note that the data of the coarser levels is not needed for the Jacobi preconditioner. Finally, the algorithm also requires the tolerance $tol$ and
the maximum number of iterations $k_{\text{max}}$ for the stopping criteria of the iterative method, the number of grid levels $n_\ell$, the number of pre- and post-smoothing steps $\mu_1$ and $\mu_2$, and the damping factor for Jacobi smoothing $\omega$. The GPU instance of PCG requires the matrix-vector product (dbdMVP) for both preconditioners. Besides, it also requires the diagonally preconditioner (dbdJacP) for the Jacobi preconditioning and the Vcycle preconditioner (Vcycle) for the geometric multigrid preconditioning.

<table>
<thead>
<tr>
<th>Algorithm 2: DbD Jacobi preconditioner (dbdJacP)</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Data:</strong> $d$, $K_0$</td>
</tr>
<tr>
<td><strong>Result:</strong> $M$                // Preconditioner</td>
</tr>
</tbody>
</table>

1. $u \leftarrow \text{threadId + BlockDim} \times \text{BlockId}$; // CUDA kernel
2. if ($u < N_{\text{DoF}}$) then
3. $M(u) \leftarrow 0$;
4. $\mathcal{E}(u) \leftarrow$ Determine index of elements containing $u$;
5. $\mathcal{V}(e) \leftarrow$ Determine the unknowns of $e$;
6. $i \leftarrow$ Extract index of $u$ from $\mathcal{V}(e)$;
7. $M(u) \leftarrow M(u) + d^{(e)} K_{bii}$;
8. end
9. $M(u) \leftarrow 1/M(u)$;

Additionally, the GPU instance requires the calculation of diverse vector arithmetic operations which are implemented using custom developed CUDA kernels, labeled with “dbdkerX”, with granularity at the DoF level. Some of these kernels require the synchronization of the threads involved in the computation to add the resulting data of all these threads. This can be done using atomic addition in CUDA, which permits to read, modify, and write a value back to device memory without the interference of any other threads. However, its use should be minimized because the operations are serialized when the same memory address is accessed at the same time, which can deteriorate the performance of the kernel execution. For this reason, the addition is computed as a reduction using the thrust library.
Algorithm 3: Vcycle Preconditioner (Vcycle)

Data: $K^0_l$, r, d, $\ell$, n_x, w, I_c, I, C, $\mu_1, \mu_2$, $\ell_{nx}, \ell_{ny}, \ell_{nz}, \text{DoF}_n$

Result: $\ell_z$  
// Preconditioner

1 $\ell_z \leftarrow 0$;
2 foreach $i = 1 : \mu_1$ do
3 $\ell_s \leftarrow \text{dbdDJS}(\ell_z, r, d, K^0_l, \ell, I_c, I, C, \ell_{nx}, \ell_{ny}, \ell_{nz}, \text{DoF}_n)$;
4 $\ell_z \leftarrow \ell_s$;
5 end
6 $\ell_z \leftarrow \text{dbdMVP}(\ell_z, d, K^0_l, \ell, I_c, I, C, \ell_{nx}, \ell_{ny}, \ell_{nz}, \text{DoF}_n)$;  // $\ell_z = \ell K^0 \ell_z$
7 $u \leftarrow \text{threadId + BlockDim x BlockId};$  // CUDA kernel (dbdV1)
8 if $(u < \ell N_{\text{DoF}})$ then
9 $\ell_v(u) \leftarrow \ell_r(u) - \ell_z(u)$;
10 end
11 $\ell+1 v \leftarrow R \ell (\ell+1 v)$;  // CUDA kernel – Restriction (nbnVRest)
12 if $\ell+1 = n_\ell$ then
13 $\ell+1 w \leftarrow \text{Copy to Host memory}$;
14 $\ell+1 w \leftarrow \text{Solve system } (K^{\ell+1} w) = \ell+1 v$;  // Direct solver
15 $\ell+1 w \leftarrow \text{Copy to Device memory}$;
16 else
17 $\ell+1 w \leftarrow \text{VCycle}(K^0_l, \ell+1 v, d, \ell+1, n_\ell, w, I_c, I, C, \mu_1, \mu_2, \ell+1_{nx}, \ell+1_{ny}, \ell+1_{nz}, \text{DoF}_n)$;
18 end
19 $\ell v \leftarrow P^{\ell+1}_{\ell+1} (\ell+1 w)$;  // CUDA kernel – Prolongation (nbnVProi)
20 $u \leftarrow \text{threadId + BlockDim x BlockId};$  // CUDA kernel (dbdV2)
21 if $(u < \ell N_{\text{DoF}})$ then
22 $\ell_v(u) \leftarrow \ell_v(u) + \ell_w(u)$;
23 end
24 foreach $i = 1 : \mu_2$ do
25 $\ell_s \leftarrow \text{dbdDJS}(\ell_z, r, d, K^0_l, \ell, I_c, I, C, \ell_{nx}, \ell_{ny}, \ell_{nz}, \text{DoF}_n)$;
26 $\ell_z \leftarrow \ell_s$;
27 end

The pseudocode of the Jacobi preconditioner (dbdJacP) kernel is detailed in Algorithm 2. It calculates the Jacobi preconditioner “on-the-fly” using the common stiffness matrix $K^0_l$ and the vector $d$ containing the elemental Young’s modulus/thermal conductivity for elasticity/heat conduction problems. This simple preconditioner is computationally cheap and only requires storing a vector of the dimension of unknowns. The pseudocode of the matrix-vector product (dbdMVP) kernel is shown in Algorithm 5. This algorithm performs the matrix-vector operation “on-the-fly” for the finest level ($\ell = 0$) using the vector $d$ and the common stiffness matrix $K^0_l$. For the coarser levels, it takes the matrix of coefficients contributing to the node to perform the operation. Nevertheless, the grain size of the operations is at the DoF level.

A key point to perform the matrix-vector multiplication operations efficiently
using GPU computing is the maximization of the use of on-chip memory. This is done in the proposed GPU instance by launching a three-dimensional CUDA kernel for such an operation. The displacements $\mathbf{p}$ and elemental properties $\mathbf{d}$ involved in the calculation required by the threads of the block are cached on shared memory. This requires to store some halo values of the unknowns of neighbor nodes. Figure 2 shows the dimension of the thread block in gray color whereas the size of shared memory required by elasticity problems operating at the DoF level should include the halo of neighbor nodes, which is depicted in red color. The unknown displacements are cached on the $x$ dimension, which requires a higher halo than the other dimensions. The block size of three-dimensional kernel can be tuned to maximize the use of shared memory by the threads of each block, which can improve significantly the GPU performance as shown in the numerical experiments.
Algorithm 4: DbD Damped Jacobi Smoother (dbdDJS)

Data: \( \mathbf{z}, \mathbf{r}, \mu, \mathbf{I}_c, \mathbf{I}, n_x, n_y, n_z, \mathbf{DoF}_n \)
Result: \( \mathbf{s} \)

1. \( hp \leftarrow < x, y, z > \quad \text{// CUDA kernel} \)
2. \( hd \leftarrow < 1, 1, 1 > \)

3. \( idx \leftarrow threadIdx.x + blockDim.x \cdot blockIdx.x \)
4. \( idy \leftarrow threadIdx.y + blockDim.y \cdot blockIdx.y \)
5. \( idz \leftarrow threadIdx.z + blockDim.z \cdot blockIdx.z \)

// Copy to shared memory per block
6. \( z_s[blockDim.x + 2 \cdot hp.x][blockDim.y + 2 \cdot hp.y][blockDim.z + 2 \cdot hp.z] \leftarrow \mathbf{z} \)
7. \( d_s[blockDim.x/hp.x - 1 + 2 \cdot hd.x][blockDim.y - 1 + 2 \cdot hd.y][blockDim.z - 1 + 2 \cdot hd.z] \leftarrow \mathbf{d} \)
8. \( \text{syncthreads()}; \)

// Synchronize threads in the block
9. \( u \leftarrow (idx < (n_x + 1)) \land (idy < (n_y + 1)) \land (idz < (n_z + 1)) \)
10. \( n_1 \leftarrow \text{Determine node containing } u; \)
11. \( v \leftarrow \text{Determine the unknowns of } n_1; \)
12. \( i \leftarrow \text{Extract index of } u \text{ from } v; \)
13. \( \text{foreach } k = 0 : 26 \) do
14. \( n_2 \leftarrow \ell I_n^k; \) // Loop 1
15. \( w \leftarrow \text{Determine the unknowns of } n_2; \)
16. \( \text{if } n_2 > -1 \) then
17. \( \text{foreach } e \in \mathcal{E}(n_1) \) do
18. \( \text{if } n_2 \in \mathcal{N}(e) \) then
19. \( A \leftarrow A + d_s(e)(K_0^{e,0})_{v,w}; \)
20. \( \text{end} \)
21. \( \text{end} \)
22. \( \text{end} \)
23. \( A \leftarrow \text{Impose Dirichlet BC from } \mathbf{I}_c; \)
24. \( \text{end} \)
25. \( \text{else} \)
26. \( A \leftarrow C_\mathcal{N}^{n_1}; \) // (3x3) coefficients matrix
27. \( \text{end} \)
28. \( \text{if } n_2 \equiv n_1 \text{ then} \)
29. \( M \leftarrow 1/A_{1,1}; \)
30. \( \text{end} \)
31. \( \text{foreach } j = 0 : 2 \) do
32. \( s^{(n)} \leftarrow s^{(n)} - \omega M A_{1,1} \left\{ e^{(n)} \right\}_j; \) // Loop 2
33. \( \text{end} \)
34. \( \text{end} \)
35. \( s^{(n)} \leftarrow s^{(n)} + \omega M r^{(n)}; \)
36. \( \text{end} \)
37. \( \text{end} \)

The pseudocode of the geometric multigrid preconditioner (V-cycle) kernel is shown in Algorithm 3. Such a preconditioning is carried out by a recursive call to the V-cycle algorithm. The algorithm requires as input data the vector \( \mathbf{d} \), the common elemental stiffness matrix \( K_0^e \), the vector \( \mathbf{I}_c \) of boundary conditions for the finest level (\( \ell = 0 \)) and the assembled matrix of coefficients \( \mathbf{C} \) for the coarser levels. It also needs the vector \( \mathbf{I} \) of neighbor nodal indexes per node, the residual \( \mathbf{r} \) and the parameters \( \omega, \mu_1, \mu_2 \) for all the levels. The algorithm performs a matrix-vector product (dbdMVP) and the multigrid smoother (dbdDJS) for
each level. Besides, diverse vector arithmetic operations are performed using custom developed kernels. To transfer information between two consecutive grids $\ell + 1 \Omega$ and $\ell \Omega$, a nodal-based GPU instance of prolongation operator $P^{\ell+1}_\ell : \ell + 1 \Omega \rightarrow \ell \Omega$ and restriction operator $R^{\ell+1}_\ell : \ell \Omega \rightarrow \ell + 1 \Omega$ are introduced. The geometric relationship between hierarchical grids allows us to avoid storing the prolongation operator $P^{\ell+1}_\ell$ and the restriction operator $R^{\ell+1}_\ell$ and to work with the stencils instead, which are constant or can be computed “on-the-fly” when needed. The number of levels $\ell$ is selected in order to ensure the coarsest level is small enough to be solved using a sparse LU decomposition on CPU. When the number of levels $\ell$ is properly selected, the number of DoFs in the coarsest grid is relatively small and the system of equations can be solved with a direct method on CPU. The pseudocode of the multigrid smoother (dbdDJS) kernel is shown in Algorithm 4. Such a smoother is based on the damped Jacobi method, which uses the inverse of the diagonal of global stiffness matrix with a relaxation parameter $\omega$. 


Algorithm 5: DbD Matrix-Vector Product (dbhMVP)

Data: \( p, d, K_0, \ell, I_c, I, C, n_x, n_y, n_z, \text{DoF}_n \)

Result: \( a \)  

\[ a = Kp \]

1. \( hp < x, y, z > \leftarrow < \text{DoF}_n, 1, 1 >; \quad hd < x, y, z > \leftarrow < 1, 1, 1 >; \]

2. \( \text{idx} \leftarrow \text{threadIdx.x} + \text{blockDim.x} \cdot \text{blockIdx.x} \);  
   \( \text{idy} \leftarrow \text{threadIdx.y} + \text{blockDim.y} \cdot \text{blockIdx.y} \);  
   \( \text{idz} \leftarrow \text{threadIdx.z} + \text{blockDim.z} \cdot \text{blockIdx.z} \);  

   // CUDA kernel

3. \( p_s[\text{blockDim.x}+2 \cdot \text{hp.x}][\text{blockDim.y}+2 \cdot \text{hp.y}][\text{blockDim.z}+2 \cdot \text{hp.z}] \leftarrow p; \)
4. \( d_s[\text{blockDim.x}/\text{hp.x}−1+2 \cdot \text{hd.x}][\text{blockDim.y}/\text{hp.y}−1+2 \cdot \text{hd.y}][\text{blockDim.z}/\text{hp.z}−1+2 \cdot \text{hd.z}] \leftarrow d; \)
5. \( \text{syncthreads();} \)  
   // Synchronize threads in the block

6. \( \text{if} (\text{idx} < (n_x + 1)) \&\& (\text{idy} < (n_y + 1)) \&\& (\text{idz} < (n_z + 1)) \) then
7. \( u \leftarrow (\text{idz} \ast ((n_x + 1) \ast (n_y + 1))) + (\text{idy} \ast (n_x + 1)) + \text{idx} ; \)
8. \( n1 \leftarrow \text{Determine node containing } u; \)
9. \( v \leftarrow \text{Determine the unknowns of } n1; \)
10. \( i \leftarrow \text{Extract index of } u \text{ from } v; \)
11. \( \text{foreach } k = 0 : 26 \) do  
   \( n2 \leftarrow i \text{nn}_k; \)
   \( w \leftarrow \text{Determine the unknowns of } n2; \)
   \( \text{if } n2 > −1 \) then
    \( \text{if } \ell == 0 \) then  
    \( \text{foreach } e \in \mathcal{E}^{(n1)} \) do
     \( \text{if } n2 \in \mathcal{N}^{(e)} \) then
      \( A \leftarrow A + d_s^{(e)}(K_0)_{v,w}; \)
     end
    end
    \( A \leftarrow \text{Impose Dirichlet BC from } I_c; \)
    \( \text{else} \)  
    \( A \leftarrow ^tC^{n1}; \)  
    // (3x3) coefficients matrix
   end
12. \( \text{foreach } j = 0 : 2 \) do  
   \( a^{(w)} \leftarrow a^{(w)} + A_{i,j}(p_s^{(w)}); \)  
   // \( a = Kp \)
end
end
4.2. Calculation of sensitivities

The calculation of sensitivities (7) is decomposed into element-wise operations to exploit the parallelization potential of GPU architectures. The pseudo-code of the Element-by-Element (EbE) custom-developed CUDA kernel (ebeUKU) is detailed in Algorithm 6. The input data of such a CUDA kernel are: the vector $d$, the result of the state $u$ and adjoint state $u^*$ equations, the number $DoF_e$ of DoFs per element, the elemental stiffness matrix $K_0$, the number of divisions of the regular grid ($n_x, n_y, n_z$) and the number of DoF per node $DoF_n$. The common elemental matrix $K_0$ is stored in constant memory. The algorithm is designed as a three-dimensional CUDA kernel using on-chip memory for the vector $d$ and the state $u$ and adjoint state $u^*$ involved in the calculation of the thread block. The sensitivity of each block element is calculated by one thread, for which the unknowns $\mathbf{u}^{(e)}$ attached to the element $e$ are determined making use of the grid regularity. The inner loops operate on each degree of freedom of the element to calculate the sensitivities $f_{\rho}^{(e)}$ according to (7). Finally, the objective function $f$ is computed using reduction operation using the thrust library. The compliant mechanism synthesis also requires the calculation of the objective function ($u_{\text{out}}$) as well as the adjoint state ($u^*$), which are calculated on GPU using the custom-developed CUDA kernel detailed in Algorithm 1.
4.3. Filtering strategy

The filtering strategy aims to prevent numerical artifacts in the optimal solution. The Algorithm 7 shows the pseudo-code of the EbE GPU implementation of the three-dimensional sensitivity filter (ebeFILTER) following (3). The input data of such an algorithm are: the density design variables $\rho$, the vector of sensitivities $f_{\rho}$, the finite element size $(d_x, d_y, d_z)$, the number of divisions of the grid $(n_x, n_y, n_z)$ and the filter radius $R$. Note that the ebeFILTER kernel cannot be implemented using a three-dimensional launching approach. This is due to the radius size of (4) requires a large halo in all the dimensions of the cache data using shared memory, which normally exceeds the limit of such a resource.
or forces to launch very small three-dimensional blocks that do not make use of the parallelization potential of GPU computing. The kernel applies to each element \( e \in \mathcal{E} \) going through each element that falls within the projection of the linear convolution function (3). The regularity of the grid permits to efficiently look for the neighbors of the corresponding element, avoiding the use of a global index table that requires a substantial number of memory accesses. The granularity of the ebeFILTER CUDA kernel is at the element level, assigning one CUDA thread to each element when invoking such a kernel.

**Algorithm 7: EbE sensitivity filter (ebeFILTER)**

**Data:** \( \rho, f^{\rho}, d_x, d_y, d_z, n_x, n_y, n_z, R \)

**Result:** \( \tilde{f}^{\rho} \) // Filtered sensitivities

1. \( s_x \leftarrow \text{floor}(R/d_x); \)
2. \( s_y \leftarrow \text{floor}(R/d_y); \)
3. \( s_z \leftarrow \text{floor}(R/d_z); \)
4. \( \tilde{f}^{\rho} \leftarrow 0; \)
5. \( \text{sum} \leftarrow 0; \)

\( e \leftarrow \text{threadId} + \text{BlockDim} \times \text{BlockId}; \) // CUDA kernel

6. **if** \( (e < N_{ele}) \) **then**
7. \( c_z \leftarrow \text{floor}(e/((n_x + 1)(n_y + 1))); \)
8. \( c_y \leftarrow \text{floor}((e - c_z(n_x + 1)(n_y + 1))/(n_x + 1)); \)
9. \( c_x \leftarrow e - c_z(n_x + 1)(n_y + 1) - c_y(n_x + 1); \)
10. **for** \( r \leftarrow \text{max}(c_x - s_x, 0) \) **to** \( \text{min}(c_x + s_x, n_x) \) **do**
11. **for** \( s \leftarrow \text{max}(c_y - s_y, 0) \) **to** \( \text{min}(c_y + s_y, n_y) \) **do**
12. **for** \( t \leftarrow \text{max}(c_z - s_z, 0) \) **to** \( \text{min}(c_z + s_z, n_z) \) **do**
13. \( i \leftarrow t(n_x + 1)(n_y + 1) + s(n_x + 1) + r; \)
14. \( w \leftarrow R - \sqrt{((c_x - r)^2 + (c_y - s)^2 + (c_z - t)^2)} ; \) // convolution operator
15. \( \text{sum} \leftarrow \text{sum} + \max(0, w); \)
16. \( \tilde{f}^{(e)}_{\rho} \leftarrow \tilde{f}^{(e)}_{\rho} \cdot \max(0, w)\rho^{(e)} \cdot \tilde{f}^{(e)}_{f}; \)
17. **end**
18. **end**
19. **end**
20. \( \tilde{f}^{(e)}_{\rho} \leftarrow \tilde{f}^{(e)}_{\rho} / (\rho^{(e)} \text{sum}) \)
21. **end**
4.4. Optimality Criterion (OC) update scheme

The pseudo-code of the EbE GPU implementation of the density updating strategy (ebeOC) is shown in Algorithm 8. Such an implementation makes use of the bisection method and the optimality criterion according to (9) and (10). The input data of this algorithm are: the density design variables ($\rho$), the interval bounds $[\lambda_l, \lambda_u]$ of the Lagrange multiplier, the numerical damping coefficient ($\eta$), a positive step width ($\zeta$), an intermediate density penalty factor ($q$), the finite element size ($d_x, d_y, d_z$), the objective function sensitivities ($\tilde{f}_p$) and the volume sensitivities ($V_\rho$). The algorithm is composed of two CUDA kernels, highlighted in boxes labeled with “CUDA kernel”, with element level granularity. The first kernel goes through each element of the regular grid to calculate the volume penalized with $\rho$. The total volume is then calculated as the addition of partial volume computed as a reduction using the thrust library. The bisection method divides the interval repeatedly to calculate the midpoint Lagrange multiplier $\lambda_m$, which is then used to calculate the KKT optimality condition. The second kernel updates the element density $\rho$ following (9) and copy back the calculated volume $V_{new}$ to update the Lagrange multiplier, which is used to evaluate the stopping criteria. This procedure is repeated until such a convergence criterion is satisfied. The vector $\rho$ is finally copy back to host memory.
Figure 3: Flowchart of GPU instance of SIMP method.
Algorithm 8: EbE density update (ebeOC)

Data: $\rho, \lambda_1, \lambda_2, \eta, \gamma, \delta, n_{eX}, n_{eY}, n_{eZ}, V_{\rho}$
Result: $\rho$

// updated densities

$V_0 \leftarrow 0$

if $(x < N_{e_0})$ then

$\rho(x) \leftarrow \delta_{x}^2 \delta_{y}^2 \delta_{z}^2 \rho(x)$

end

$V_0 \leftarrow V_0 + \sum_{x=1}^{N_0} \rho(x) / \lambda_1$

while $(\lambda_1 \lambda_2 > 10^{-6})$ do

$\lambda_m \leftarrow (\lambda_1 + \lambda_2) / 2$

$V_{\text{new}} \leftarrow 0$

if $(x < N_{e_0})$ then

$\rho(x) \leftarrow \max(0, \min(1, \min(\rho(x), \rho(x) / \lambda_1, \lambda_2) / \lambda_m))$

$g(x) \leftarrow d_x d_y d_z \rho(x)$

end

$V_{\text{new}} \leftarrow V_{\text{new}} + \sum_{x=1}^{N_{e_0}} \rho(x) / \lambda_1$

if $V_0 \geq V_{\text{new}}$ then

$\lambda_1 \leftarrow \lambda_m$

else

$\lambda_2 \leftarrow \lambda_m$

end

end

Figure 4: Kernel invocation and memory transfer for each iteration of SIMP method.
4.5. GPU implementation and memory management

The GPU instance of density-based topology optimization consists of the custom-developed CUDA kernels for the computationally demanding tasks involved in the algorithm. Figure 3 shows the flowchart of the algorithm and the relevant memory allocation and memory transfer of large vectors during the optimization, whereas Figure 4 details the custom-developed kernels, including memory transfer of scalar values in each iteration of the topology optimization and the invocations from the host. One can observe that the information needed by the custom-developed CUDA kernels is allocated and transferred to the device memory in the initialization of the optimization process, and that the memory transaction between host and device memory of large vectors is reduced to the $\rho$ vector in each iteration of the optimization to evaluate the stopping criteria. This is of paramount importance to obtain reasonable results. One also can observe that the iterations of the optimization process only requires to copy back to host memory some scalar values. This minimization of memory transactions increases notably the GPU performance. The CUDA kernels are invoked from the host assigning the corresponding grid size $GS$ and block size $BS$ to fit the granularity of the custom-developed kernels, which are tuned to empiric values that provide good performance.

The device memory allocated for the custom-developed kernels, obviating the allocation of scalar values, is shown in Figure 5. The dbdPCG kernel using the Jacobi preconditioner requires the storage in the global device memory of vectors $d$, $f$, $u$, $r$, $p$, $a$, $z$, and $I_c$ indicated in the pseudocode of Algorithm 1. When the geometric multigrid preconditioner is used, the vectors $s$, $v$, $^{\ell+1}I_e$, $^{\ell+1}I$ and $^{\ell+1}C$ are also stored in the global device memory for the corresponding $n_l$ levels. The common elemental stiffness matrix $K_0^e$ is stored in constant memory. This permits to save bandwidth because constant memory is cached and consecutive reads of the same address does not incur any additional memory traffic. Besides, one single read from constant memory is broadcast to the threads of a half-warp. Additionally, the vectors $\rho$, $d_\rho$, $u^*$, $f_\rho$, and $\hat{f}_\rho$ required by the kernels ebeUKU, ebeFILTER and ebeOC are also stored in the global
Figure 5: Device memory required by the kernels of the proposed GPU instance.
device memory. In the case of compliance minimization problems, the memory allocation for the adjoint state $u^*$ can be obviated because such problems are self-adjoint.

5. Numerical experiments

The performance of the proposed GPU instance of density-based topology optimization is evaluated using three topology optimization problems in different fields. In particular, the stiffness design of continuum structures, the heat sink design cooled by heat conduction and the compliant mechanism synthesis. The two first two numerical experiments aim to analyze the use of different $DoF_n$, whereas the last one aims to explore the use of the proposal with multi-GPU systems. The solving of the system of equations using GPU computing is evaluated using two preconditioning techniques; in particular, the Jacobi preconditioner and the geometric multigrid preconditioner. Besides, the way to launch the kernel for the matrix-vector multiplication operation is studied to make use of the parallel potential of massive parallel architectures in this demanding operation. In addition, the proposed GPU instance is compared with the classical CPU implementation, in which the global stiffness matrix is assembled and the sparse-matrix representation is used to perform the operations required by the PCG solver. The CPU implementation makes use of only one thread for the comparisons. The computationally demanding tasks involved in the algorithm are evaluated separately using GPUs with different massive parallel capabilities. This aims to evaluate the scalability of the GPU instance with respect to the capabilities of the graphics units.

The three numerical experiments are performed using a computer with an Intel Core i7-5820k 3.33 GHz and 32 GB of RAM memory. Three Nvidia GPUs are installed in the computer to perform the experiments: GF100-100-KD (Quadro 4000), GF100 (Tesla C2070) and GK110b (Tesla K40). The first two graphics cards use the Fermi micro-architecture, whereas the third one makes use of the Kepler micro-architecture. Table 1 summarizes the most relevant specifications
Table 1: GPU specifications for benchmark devices.

<table>
<thead>
<tr>
<th>Nvidia GPU model</th>
<th>CUDA cores</th>
<th>Processor clock (MHz)</th>
<th>Memory clock (MHz)</th>
<th>FMA-DP (GFlops)</th>
</tr>
</thead>
<tbody>
<tr>
<td>GF100-100-KD</td>
<td>256</td>
<td>475</td>
<td>1400</td>
<td>243</td>
</tr>
<tr>
<td>GF100</td>
<td>448</td>
<td>575</td>
<td>1566</td>
<td>515.2</td>
</tr>
<tr>
<td>GK110b</td>
<td>2880</td>
<td>889</td>
<td>3004</td>
<td>1430</td>
</tr>
</tbody>
</table>

of such graphics units for scientific computation purposes; in particular, the number of cores, the processor and memory clocks, and the Double-Precision (DP) Fused Multiply Add (FMA) operations as specified in IEEE 754-2008. The GPU instance is compiled using the NVIDIA CUDA Toolkit 7.5 and the numerical experiments are run on 64 bits Linux OS with the NVIDIA Driver Version 340.76. It is important to remark that the development environment and the graphics driver updates often show significant performance improvements.

A continuation strategy for parameter \( p \) is adopted for all the numerical experiments; in particular, the evolution of the parameter \( p \) in the continuation step, according to [53], is as follows

\[
p_{k+1} = \begin{cases} 
1 & \text{if } k \leq 20 \\
\min(3, \gamma \cdot p_k) & \text{if } k > 20
\end{cases}
\]  

This continuation strategy is represented in Figure 6 for different \( \gamma \) values. By modifying the parameter \( p \), the optimization problem is gradually changed from a convex problem to the original non-convex problem, which is governed by the parameter \( \gamma \). Based on the experience provided by Groenwold and Etman [53], such a \( \gamma \) parameter is set to 1.02 in all the numerical experiments.

The minimum compliance design problem consists of finding the material density distribution that minimizes the deformation of the structure, a tied-arch bridge in our case, under the prescribed loading and boundary conditions. Figure 7(a) shows the box shape design domain and the boundary conditions of the optimization problem. It also shows the non-optimizable region over the top of the bridge deck, which represents the area needed to circulate the vehicles. The bases of bridge abutments and the bottom part of the bridge deck
are simply-supported along the edges located at 60m from left face and right face respectively. A uniformly distributed load is applied to the top of the non-optimizable bridge deck. The solving of this problem makes use of one of the vertical planes of symmetry to only analyze the half of the finite element model. The half design domain is discretized using $184 \times 40 \times 90$ eight-node hexahedral linear brick elements, i.e. 662,400 elements and about 2 million of DoFs. The material parameters are $E_{\text{solid}} = 210$ GPa, $E_{\text{void}} = 2.1 \cdot 10^{-4}$ GPa and $\nu = 0.31$. The maximum residual error is set to $10^{-8}$ for the PCG algorithm and the calculations are performed using double-precision floating-point format. The target volume is the 15% of the volume of the design domain. The parameters of the bisection method used to infer the Lagrange multiplier $\lambda$ are: $\eta = 0.5$, $\zeta = 0.2$, $\lambda_l = 0$, $\lambda_u = 10^9$ and $\lambda_{\text{min}} = 10^{-40}$. The topology optimization parameters are: filter radius $R = 3$ m and factor $q$ as follows

$$q_{k+1} = \begin{cases} 
1 & \text{if } k \leq 15 \\
\min(3, 1.01 \cdot q_k) & \text{if } k > 15
\end{cases}$$  \quad (12)
Figure 7: Tied-arch bridge benchmark: (a) the design domain and boundary conditions, (b) the topology design with threshold $\rho = 0.99$ from isometric view and (c) the Oregon city bridge.
The resulting structural design of the tied-arch bridge is shown in Figure 7(b), where the final design is thresholded at $\rho = 0.99$ and colored by the magnitude of the displacement field. This structural design resembles the topology of this kind of bridges, as shown in the real bridge of Figure 7(c).

The compliant mechanism design problem consists of maximizing the output displacements $u_{out}$ for mechanisms under given forces $f_{in}$ applied to the input actuators. The numerical experiment aims to provide the optimal topology compliant mechanism design for a gripper. The design domain and the boundary conditions for this benchmark are shown in Figure 8(a). This design domain is simply supported at the upper and lower edges of left face. The input actuator and the output port are modeled as springs with different degrees of stiffness, in particular $k_{in}$ and $k_{out}$ respectively. These degrees of stiffness are fixed to $k_{in} = k_{out} = 1kN/mm$ in the springs. The input force $f_{in} = 1kN$ is applied to the center of a given face, for which the top and bottom edges are fixed. The output ports are located in the opposite face as indicated in Figure 8(a). The material considered is nylon with Young’s modulus $E_{\text{solid}} = 3$ GPa and Poisson’s coefficient $\nu = 0.4$. For the void material, a Young’s modulus $E_{\text{void}} = 10^{-3}$ GPa is considered. Only the half of the design domain is analyzed using the horizontal plane of symmetry, which is discretized using $160 \times 40 \times 80$ eight-node hexahedral linear brick elements, i.e. 512,000 elements or 1,604,043 DoFs. The target volume is about 10% of the volume of the design domain. The topology optimization parameters are: filter radius $R = 4 \text{ mm}$ and factor $q$ following (12). The parameters of the bisection method used to infer the Lagrange multiplier $\lambda$ are: $\eta = 0.3$, $\zeta = 0.1$, $\lambda_l = 0$, $\lambda_u = 10^9$ and $\lambda_{\text{min}} = 10^{-40}$. The resulting mechanism design is shown in Figure 8(b), where the final design is thresholded at $\rho = 0.99$ and colored by the magnitude of the displacement field. One can observe that the gripping “jaws” along with the hinges resemble the common topology of grippers.

The heat sink design considering conduction heat transfer, also known as volume-to-point heat conduction problem [59], consists of the minimization of thermal compliance to maximize the heat conduction transfer from the heat sink.
Figure 8: 3D compliant gripper benchmark: (a) the design domain and boundary conditions and (b) the topology design with threshold $\rho = 0.99$ from isometric view.
The volume is subjected to a heat generation rate at every point of the design domain and cooled through a small patch (heat sink) located in the middle of its upper face. The thermal conductivity matrix of each element is considered, according to (2), as the distribution of two material phases comprising a good thermal conductor \(K_0\) and a poor conductor \(K_{\text{min}}\). Both material phases are considered homogeneous and isotropic without temperature effect on their conductivities. Therefore, the topology optimization problem aims to find the optimal distribution of “good” thermal conductor that minimizes the highest temperature under a volume constraint. The design domain and the boundary conditions for this benchmark are shown in Figure 9(a). The thermal conductivities are \(k_{\text{min}} = 0.1 \text{ W/m}^2\text{K}\) and \(k_0 = 100 \text{ W/m}^2\text{K}\). The heat generation rates for the different phases are similar with a magnitude of \(F = 10 \text{ kW/m}^3\). All the boundaries are adiabatic with the exception of the heat sink, in which \(T = 0\). Only one quarter of the design domain is analyzed using the two vertical planes of symmetry, which is discretized using a regular grid of \(128 \times 64 \times 256\) eight-node hexahedral linear brick elements, i.e. 2,097,152 elements or 6,464,835 DoFs. The target volume is about the 30% of the volume of the design domain. The topology optimization parameters are: filter radius \(R = 5\) mm and factor \(q\) following (12). The parameters of the bisection method used to infer the Lagrange multiplier \(\lambda\) are: \(\eta = 0.5, \zeta = 0.2, \lambda_l = 0, \lambda_u = 10^9\) and \(\lambda_{\text{min}} = 10^{-40}\).

The resulting topology design along with intermediate designs to illustrate the evolution of the optimization process are shown in Figure 9. The intermediate and final designs are thresholded at \(\rho = 0.99\) and colored by the magnitude of the temperature field (°C). One can observe how the final design is a “thermal tree” composed of conductivity branches that move the heat away from the heat source.

Figure 10 shows the evolution of the Measure of Non-Discreteness \(M_{\text{nd}}\) [60] for the numerical experiments. This value indicates whether an optimized design has converged to a discrete solution following
Figure 9: Heat sink design benchmark: (a) the design domain and boundary conditions and (b-d) the optimized topology designs with threshold $\rho = 0.99$ at different iterations of the optimization process.
Figure 10: Measure of Non-Discreteness ($M_{nd}$) for the numerical experiments.

$$M_{nd} = \frac{\sum_{e=1}^{N_{ele}} 4 \rho_e (1 - \rho_e)}{N_{ele}} \times 100\%,$$

where $M_{nd} = 100\%$ indicates that the design is totally gray and $M_{nd} = 0\%$ means that the design is fully discrete. One can observe in Figure 10 that the numerical experiments show stable convergence to fully discrete designs. Figure 11 shows the evolution of the objective function for the different topology optimization problems. One can observe that all the topology optimization problems show stable convergence of the objective function. All the numerical experiments are performed using $\gamma = 1.02$ following (11). We can observe in Figure 6 how the parameter $p$ is fixed to $p = 1$ for the 20 initial iterations and then it is increased until $p = 3$ in the iteration 75. The objective function for the different topology optimization problems converges to a minimum with intermediate densities, as shown in Figure 10, in the iteration 20. As the penalty parameter $p$ is increased one can observe that the objective function converges
to a local minima with a lower $M_{sd}$ until a fully discrete design is found.

The efficient calculation of matrix-vector multiplication operations is crucial to obtain a reasonable performance for the solving stage. The performance of this operation using shared memory is evaluated modifying the block size used to launch the $dbdMVP$ kernel. This is done launching one FEA of the tied-arch bridge problem using different three-dimensional block size configurations. Table 2 shows the battery of experiments performed to tune the block size, which is sorted by wall-clock time. As a general rule, the performance is maximized using the maximum amount of on-chip memory and the maximum number of threads. However, the GPU performance of $dbdMVP$ kernel is maximized as increasing the $x$ dimension of block size due to the unknowns are stored in such a dimension.

Table 3 details the thread hierarchy used for invoking the CUDA kernels in the different benchmarks. One can observe that the three-dimensional kernels with granularity at the DoF level use the block size configuration obtained from the experiment shown in Table 2. Three-dimensional kernels with granularity at the element level use symmetric kernels to maximize the use of shared memory. These kernels do not operate over information with a dimension higher than the others. The kernels that are not using on-chip memory are launched as

<table>
<thead>
<tr>
<th>Block size</th>
<th>Grid size</th>
<th>Shared memory (bytes)</th>
<th>Threads per block</th>
<th>Wall-clock time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td>111 x 3 x 3</td>
<td>5 x 14 x 33</td>
<td>28264</td>
<td>999</td>
<td>17.2</td>
</tr>
<tr>
<td>63 x 4 x 4</td>
<td>9 x 11 x 25</td>
<td>24272</td>
<td>1008</td>
<td>18.0</td>
</tr>
<tr>
<td>39 x 5 x 5</td>
<td>15 x 9 x 20</td>
<td>21672</td>
<td>975</td>
<td>18.7</td>
</tr>
<tr>
<td>24 x 7 x 6</td>
<td>24 x 6 x 17</td>
<td>21312</td>
<td>1008</td>
<td>21.6</td>
</tr>
<tr>
<td>21 x 7 x 6</td>
<td>27 x 6 x 17</td>
<td>19136</td>
<td>882</td>
<td>22.1</td>
</tr>
<tr>
<td>21 x 6 x 6</td>
<td>27 x 7 x 17</td>
<td>16960</td>
<td>756</td>
<td>25.0</td>
</tr>
<tr>
<td>18 x 6 x 6</td>
<td>31 x 7 x 17</td>
<td>15032</td>
<td>648</td>
<td>25.0</td>
</tr>
</tbody>
</table>

Table 2: Wall-clock time of $dbdMVP$ configuring different grid and block size for tied-arch bridge experiment.
blocks with only one dimension. The grid size of all CUDA kernels is adjusted to process the corresponding information.

The performance of solving the system of equations is evaluated using the Jacobi and the geometric multigrid preconditioning techniques. Table 4 shows the wall-clock time for the topology optimization stages using diverse graphic cards, including the iterative solver using both preconditioners and the CPU implementation using sparse-matrix operations. One can observe that the best performance considering the wall-clock time is obtained using the geometric multigrid preconditioner. The solving stage of the compliant mechanism design problem (3D gripper) is also evaluated using a multi-GPU system, which exploits the task-level parallelism involved in the computation of the direct and adjoint problems. The multi-GPU system consists of a master-worker configuration [40] installing two Nvidia Tesla K40 (GK110b) on the same host. The workers solve the governing equations of the finite element model using the matrix-free PCG

| Table 3: Thread hierarchy for the benchmarks. The reader is referred to Figure 4 in order to find the block and grid sizes used by each kernel. |
|---|---|---|---|---|---|
| **Tied-arch bridge** | BS<sup>c</sup> | BS<sup>n</sup> | BS<sup>u</sup> | BS<sup>c</sup><sub>3</sub> | BS<sup>n</sup><sub>3</sub> |
| Tessellation | (512,1,1) | (512,1,1) | (512,1,1) | (10,10,10) | (111,3,3) |
| 184 × 40 × 96 | GS<sup>c</sup> | GS<sup>n</sup> | GS<sup>u</sup> | GS<sup>c</sup><sub>3</sub> | GS<sup>n</sup><sub>3</sub> |
| | (1380,1,1) | (1438,1,1) | (4312,1,1) | (5,14,33) | (19,4,10) |
| **3D gripper** | BS<sup>c</sup> | BS<sup>n</sup> | BS<sup>u</sup> | BS<sup>c</sup><sub>3</sub> | BS<sup>n</sup><sub>3</sub> |
| Tessellation | (512,1,1) | (512,1,1) | (512,1,1) | (10,10,10) | (111,3,3) |
| 160 × 40 × 80 | GS<sup>c</sup> | GS<sup>n</sup> | GS<sup>u</sup> | GS<sup>c</sup><sub>3</sub> | GS<sup>n</sup><sub>3</sub> |
| | (800,1,1) | (841,1,1) | (2522,1,1) | (16,4,8) | (5,11,27) |
| **Heat sink** | BS<sup>c</sup> | BS<sup>n</sup> | BS<sup>u</sup> | BS<sup>c</sup><sub>3</sub> | BS<sup>n</sup><sub>3</sub> |
| Tessellation | (512,1,1) | (512,1,1) | (512,1,1) | (10,10,10) | (8,8,8) |
| 128 × 64 × 256 | GS<sup>c</sup> | GS<sup>n</sup> | GS<sup>u</sup> | GS<sup>c</sup><sub>3</sub> | GS<sup>n</sup><sub>3</sub> |
| | (4096,1,1) | (4209,1,1) | (12627,1,1) | (13,7,26) | (17,9,33) |
### Tied-arch bridge

<table>
<thead>
<tr>
<th></th>
<th>dbdPCG (sec)</th>
<th>ebeUKU (sec)</th>
<th>ebeFILTER (sec)</th>
<th>ebeOC (sec)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jacobi</td>
<td>69991.92</td>
<td>5104.26</td>
<td>84.69</td>
<td>316.11</td>
</tr>
<tr>
<td>Multigrid</td>
<td>10769.99</td>
<td>3762.71</td>
<td>9.15</td>
<td>10.39</td>
</tr>
<tr>
<td><strong>CPU (Sparse)</strong></td>
<td>11048.89</td>
<td>7104.26</td>
<td>15.15</td>
<td>13.50</td>
</tr>
<tr>
<td>Quadro 4000</td>
<td>4831.15</td>
<td>1867.22</td>
<td>8.10</td>
<td>4.96</td>
</tr>
<tr>
<td>Tesla C2070</td>
<td>3509.48</td>
<td>1256.81</td>
<td>6.75</td>
<td>2.76</td>
</tr>
<tr>
<td>Tesla K40</td>
<td>25404.44</td>
<td>3212.67</td>
<td>17.78</td>
<td>14.62</td>
</tr>
</tbody>
</table>

### 3D gripper

<table>
<thead>
<tr>
<th></th>
<th>dbdPCG (sec)</th>
<th>ebeUKU (sec)</th>
<th>ebeFILTER (sec)</th>
<th>ebeOC (sec)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jacobi</td>
<td>320039.48</td>
<td>9704.29</td>
<td>146.92</td>
<td>754.21</td>
</tr>
<tr>
<td>Multigrid</td>
<td>51998.92</td>
<td>7009.23</td>
<td>22.76</td>
<td>35.14</td>
</tr>
<tr>
<td><strong>CPU (Sparse)</strong></td>
<td>52998.92</td>
<td>7009.23</td>
<td>22.76</td>
<td>35.14</td>
</tr>
<tr>
<td>Quadro 4000</td>
<td>25404.44</td>
<td>3212.67</td>
<td>17.78</td>
<td>14.62</td>
</tr>
<tr>
<td>Tesla C2070</td>
<td>16528.05</td>
<td>2243.24</td>
<td>15.46</td>
<td>7.69</td>
</tr>
<tr>
<td>Tesla K40</td>
<td>8549.52</td>
<td>1153.67</td>
<td>1.01</td>
<td>1.01</td>
</tr>
</tbody>
</table>

### Heat sink

<table>
<thead>
<tr>
<th></th>
<th>dbdPCG (sec)</th>
<th>ebeUKU (sec)</th>
<th>ebeFILTER (sec)</th>
<th>ebeOC (sec)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jacobi</td>
<td>26396.97</td>
<td>3006.49</td>
<td>58.76</td>
<td>28699.76</td>
</tr>
<tr>
<td>Multigrid</td>
<td>3198.65</td>
<td>4453.50</td>
<td>22.28</td>
<td>1105.31</td>
</tr>
<tr>
<td><strong>CPU (Sparse)</strong></td>
<td>3198.65</td>
<td>4453.50</td>
<td>22.28</td>
<td>1105.31</td>
</tr>
<tr>
<td>Quadro 4000</td>
<td>1938.04</td>
<td>1849.62</td>
<td>20.04</td>
<td>437.73</td>
</tr>
<tr>
<td>Tesla C2070</td>
<td>1172.15</td>
<td>1113.96</td>
<td>17.34</td>
<td>176.85</td>
</tr>
</tbody>
</table>

Table 4: Total wall-clock time for the topology optimization stages.
solver. The results show how the use of a multi-GPU platform permits to scale up the acceleration of the solver stage preserving the speedups obtained for a single analysis.

Figure 12 shows the speedups with respect to the sparse-matrix CPU implementation for the topology optimization stages using diverse graphic cards. These speedups are calculated for different graphics units to evaluate the scalability with respect to GPU capabilities. One can observe that all the computationally intensive tasks are accelerated significantly. Besides, the speedup increases with the massive parallel capabilities of the graphics unit. Speedups between 98x and 162x are obtained, in the 3D gripper and heat sink experiments respectively, for the \textit{ebeFILTER} kernel using the most recent graphics card (Nvidia K40-GK110b) of the devices evaluated. The solving of the system of equations limits the global speedup for the topology optimization; the numerical experiments requiring the linear elastic analysis achieve speedups of 4x and 20x using the geometric multigrid and the Jacobi preconditioning respectively. The speedup of the solving stage for the heat conduction problem is 2.7x and 22.5x using the geometric multigrid and the Jacobi preconditioning respectively. The increment in the speedup using Jacobi preconditioning is attributed to the higher size of the finite element model. Besides, the lower number of non-zero elements in the global thermal conductivity matrix reduces the number of operations and device memory accesses using GPU computing. The number of non-zero elements is 13 per DoF for the thermal experiment, whereas is 40 per DoF for elasticity problems. On the contrary, the decrement of the speedup using the geometric multigrid preconditioner is attributed to the higher size of the stiffness matrices of coefficients, which are stored in global device memory. The memory accesses of these large arrays dominate the potential speedup.

Figure 13 shows the percentage of the total wall-clock time of each kernel in the topology optimization using the Jacobi and the geometric multigrid preconditioning. As expected, the principal bottleneck of the topology optimization algorithm is the solving of the system of equations of the finite element model. The wall-clock time for elasticity problems is around the 98% of the total wall-
Figure 11: Evolution of the objective function for the numerical experiments.

Figure 12: Speedup for the computationally demanding tasks in the numerical experiments.
clock time using the Jacobi preconditioning whereas this percentage is around the 87% using the geometric multigrid preconditioning. This percentage is considerably reduced in the heat conduction problem, where the solving stage using the most modern graphics device (GK110b) is about the 55% of the total wall-clock time using both preconditioners. This is mainly attributed to the fewer number of iterations required by PCG in the heat conduction problem to achieve convergence. As a consequence, the percentage of the total wall-clock time consumed by the filtering strategy and the bisection method become significant in the heat conduction problem, and the use of GPU computing for these stages can notably accelerate the topology optimization process.

6. Conclusion

This paper has investigated about the proper strategy and techniques to achieve efficient calculation and reasonable speedups using GPU computing for the computationally intensive tasks of density-based topology optimization methods. The performance of the solving stage using GPU computing is analyzed using two preconditioning techniques; in particular, Jacobi preconditioner and geometric multigrid preconditioner. Different granularities are used to facilitate the exploitation of massive parallel architectures. The solving of the system of equations is implemented with granularity at the DoF level, whereas element grain size is used to process the calculation of sensitivities, the filtering strategy and the optimality criteria method. A nodal-based strategy is adopted for the coarsening using the geometric multigrid preconditioner and for prolongation and reduction operators. Significant speedup is achieved in the solving stage using constant memory for storing the common elemental matrix $K_0$ and shared memory for storing the unknowns and the vector of elemental properties $d$ of the topology optimization. Besides, the three-dimensional kernel to process the matrix-vector multiplication operations fits well for massive parallel processing showing good performance results.

The numerical results show that the proposed GPU instance achieves signifi-
Figure 13: Percentage of total wall-clock time using (a) Jacobi and (b) geometric multigrid preconditioning.
cant speedups for the computational demanding tasks involved in density-based topology optimization. Nevertheless, we have to remark that such speedups are calculated with respect to the sparse-matrix calculation using one CPU thread, and thus lower speedups are obtained using multiple CPU threads for the calculation. The numerical results also show that the best performance in wall-clock time is obtained using the geometric multigrid preconditioner in the GPU instance of the PCG. Moreover, the experiments show the efficiency of the proposed GPU implementation for solving topology optimization problems in three different fields: maximum stiffness design, heat sink design and compliant mechanism design. Furthermore, the numerical experiments show the scalability of the proposed GPU instance with the resources of the graphics units. This is a promising result to achieve higher speedups on new generations of graphics cards.

7. Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of some of the GPU devices used for this research. Such a work has also been supported by the AEI/FEDER and UE under the contract DPI2016-77538-R and by the “Fundación Séneca – Agencia de Ciencia y Tecnología de la Región de Murcia” under the contract 19274/PI/14.

References


