Efficient topology optimization using GPU computing with multilevel granularity

This paper proposes a strategy well suited to High Performance Computing (HPC) of density-based topology optimization using Graphics Processing Units (GPUs). The strategy takes advantage of Massively Parallel Processing (MPP) architectures to overcome the computationally demanding procedures of density-based topology design, in terms of both memory consumption and processing time. This is achieved by exploiting data locality and minimizing both memory consumption and data transfers. The proposed GPU instance uses different granularities for the stages of the topology optimization pipeline, selected to balance the workload between threads and to exploit the parallelization potential of massively parallel architectures. The performance of the fine-grained GPU instance of the solving stage is evaluated using two preconditioning techniques. The proposal is also compared with the classical CPU implementation for diverse topology optimization problems, including stiffness maximization, heat sink design and compliant mechanism design.


Introduction
Topology optimization aims to find the optimal distribution of material within a design domain such that an objective function is minimized under certain constraints [1]. Contrary to size and shape optimization methods, topology optimization makes it possible to obtain a material distribution without assuming any prior structural configuration. This provides engineering designers with a powerful tool for finding innovative and high-performance conceptual designs at the early stages of the design process, where the optimization of geometry and topology has the greatest impact on structural performance. This problem has sparked broad interest since the early work of Bendsøe and Kikuchi [2], giving rise to a multitude of studies in a wide range of physics problems, such as stiffness maximization of structures [3], design of compliant mechanisms [4], maximization of temperature diffusivity [5], and minimization of acoustic emission [6], to name but a few [7].
Shape and topology optimization methods can be broadly classified into three main categories depending on the representation used to describe the shapes they involve: density-based methods, Eulerian methods and Lagrangian methods. The methods included in the first category operate on a fixed grid of finite elements and seek an optimal void/solid material distribution that minimizes an objective function. The homogenization method [2,8] and the Solid Isotropic Material with Penalization (SIMP) method [1,9] are some of the most popular topology optimization approaches included in this category.
The second category is composed of methods that use an implicit representation of the structural boundary. Such a boundary can be modified by tracking the motion of a level-set function, as is done in the Level-Set Method (LSM) [10,11,12], or by evolving the interfacial dynamics of phase field equations, as occurs in the phase field models [13]. The third category is composed of methods that use an explicit representation of the structural shape by means of a computational mesh or CAD model [14,15]. This work deals with the efficient computation of density-based topology optimization methods using GPU computing.
Despite the great advances made in the theory and practical application of topology optimization in the past decade, the computational requirements still remain a primary challenge [7]. This is due to several demanding tasks involved in the topology optimization pipeline, such as the solving of large systems of equations, the computation of sensitivities and the filtering strategy. Such tasks may meaningfully increase the computation time of the topology optimization process, which may take hours or even days for relatively large models. High Performance Computing (HPC) is then needed to address the topology optimization process, normally making use of task-level parallel computing for the computationally intensive tasks [16,17,18].
The use of Graphics Processing Units (GPUs) for non-graphics applications is rapidly growing in popularity [19,20]. This is due to the high computing capacity of these graphics cards for Massively Parallel Processing (MPP) at reasonable cost. GPU computing consists of the use of a GPU together with a CPU to accelerate compute-intensive applications. This is not a simple goal, since numerous problems prevent the use of GPU computing for certain scientific applications, such as memory-related problems and lack of data-level parallelism. The memory-related problems include excessive global memory transactions, non-coalesced global loads and stores that degrade global memory bandwidth, and shared memory accesses inducing bank conflicts, to name but a few. The lack of data-level parallelism prevents the exploitation of the Single Instruction Multiple Data (SIMD) parallel computation for which GPU architectures are designed. Therefore, the proper implementation of topology optimization methods on GPUs requires a suitable formulation and a selection of techniques that exploit the potential acceleration of massively parallel architectures while preventing memory-related problems [21], which severely constrain GPU performance. GPU computing has been successfully used in diverse engineering and scientific problems requiring numerical analysis. One can mention the acceleration of the solving of parametric integral equations in elasticity [22], systems of equations in finite element problems [23] and peridynamic systems in peridynamic models of solid mechanics [24,25,26]. These graphics devices are also successfully used for real-time simulation and haptic feedback of soft tissue deformations [27,28], which are especially useful for the development of realistic simulators. Besides, relevant results have been obtained for the structural solver of finite element explicit dynamics problems using GPU computing [29].
These graphics cards have also shown promising results in heterogeneous systems addressing large-scale problems in Finite Element Analysis (FEA) [30]. The use of these devices to speed up computationally demanding tasks in the topology optimization pipeline has sparked broad interest in recent years, giving rise to several studies.
The early work of Wadbro and Berggren [31] aims to solve large topology optimization problems using a gradient-based optimality criterion method. It implements a Preconditioned Conjugate Gradient (PCG) method on GPU to solve high resolution finite element models arising in heat conduction topology optimization problems. The grain size of this GPU instance is at the element level. The lack of native double-precision support in early GPUs limited the GPU instance to single-precision format, which does not ensure the convergence of the solver due to round-off errors. A nodal-wise assembly-free GPU implementation of the solver of the SIMP method is proposed in [32]. Applied to the minimization of the structural compliance, this GPU instance achieves significant speedups by loading three successive 2D slices of the third dimension into shared memory to perform the computations required by the middle slice efficiently. Such a slice-wise and nodal-based strategy is also adopted in [33], achieving speedups of one order of magnitude for the solving of the system of equations of elasticity. GPU computing is also used to increase the tractable computational resolution of topology optimization problems using discrete level-set methods [34] and evolutionary structural optimization methods [35].
GPU computing using a sparse-matrix representation makes it possible to efficiently assemble and solve the system of equations of elasticity [36]. However, higher performance can be achieved by exploiting the grid regularity and performing the operations "on-the-fly". The former exploits data locality, reducing memory accesses and making use of on-chip memory, which is much more efficient than global device memory. The latter avoids storing the matrix of coefficients explicitly in global device memory, which seriously affects GPU performance. For these reasons, matrix-free GPU implementations using regular grids show good performance results for Finite Element Analysis (FEA). A GPU instance of the PCG solver with geometric multigrid preconditioning, configured to perform a reduced number of FEAs and iterations per FEA, permitted Wu et al. [37] to solve large-scale problems in a short time. This is done by configuring the iterative method with a low tolerance level along with the SIMP method using the standard Optimality Criteria (OC) method [1]. The GPU instance is based on the node-wise GPU parallelization proposed by Dick et al. [38], where the grid regularity is exploited to perform coarsening and matrix-vector operations efficiently. Besides, the operations at the finest level are performed "on-the-fly" to increase GPU performance.

This paper proposes a multi-granular GPU implementation of the different stages involved in density-based topology optimization methods. On the one hand, a fine-grained GPU implementation of a matrix-free PCG solver for structural analysis is adopted. The regularity of the grid permits exploiting data locality, maximizing the GPU performance for FEA [39]. The granularity of matrix-vector multiplication operations is at the Degree of Freedom (DoF) level, which allows reducing and balancing the workload for all the threads of the MPP architecture [23,40].
Another key point for increasing GPU performance is that matrix-vector multiplication operations are launched by three-dimensional kernels that cache data in shared memory for the corresponding three-dimensional blocks. This strategy increases the use of on-chip memory, i.e. the cached data in shared memory, by the threads performing the operations for the corresponding DoFs, and achieves a significant improvement when solving the system of equations using GPU computing. The performance of the GPU instance of the solving stage is evaluated in terms of speedup and wall-clock time analyzing two preconditioning techniques; in particular, the Jacobi preconditioner and the geometric multigrid preconditioner. On the other hand, the calculation of sensitivities, the filter and the density update are also implemented using GPU computing in order not to limit the theoretical speedup according to Amdahl's law [41]. The granularity of these tasks is at the finite element level because the nature of the operations does not allow reducing the grain size, which usually improves GPU performance. The proposed matrix-free GPU instance is compared to the classical sparse-matrix CPU implementation for diverse topology optimization problems, including stiffness maximization, heat sink design and compliant mechanism design. The speedups and the relative wall-clock time of the stages of the density-based topology optimization pipeline are also studied for such topology optimization problems.
The paper is organized as follows. Section 2 provides an overview of the GPU architecture and the CUDA programming model. The basis and theoretical background of density-based topology optimization methods are briefly reviewed in section 3. The proposed GPU implementation of the SIMP method is presented in section 4. Section 5 is devoted to the numerical experiments and the performance evaluation of the proposed matrix-free GPU instance with respect to the classical sparse-matrix CPU implementation. Finally, the conclusions are presented in section 6.

GPU and CUDA architecture
GPU devices were initially designed to satisfy the market demand for real-time and realistic 3D visualization. The use of these graphics cards, with their massively parallel architecture, in non-graphics HPC applications is becoming very popular due to their high computing capacity at a reasonable cost. Currently, the use of Nvidia devices and their programming model, Compute Unified Device Architecture (CUDA) [42], is the prevailing tendency, which is adopted in the developments presented in this work. Such a programming model allows viewing the GPU as a compute device able to perform data-parallel computation (data/SIMD parallelism) using multiple cores. The parallel code (single instruction) is defined in a C language extension function, called a kernel, which is executed by many CUDA threads on different data (multiple data). The kernel call, invoked from the host (CPU) to the device (GPU) as shown in Figure 1(a) taken from [43], should specify the number of CUDA threads organized as a grid of thread blocks.
The CUDA threads only have access to the device SGRAM (Synchronous Graphic Random-Access Memory), a type of DRAM (Dynamic Random-Access Memory) with a high-bandwidth interface for graphics-intensive functions, and to the on-chip SRAM (Static Random-Access Memory) through the memory spaces depicted in Figure 1(a). Blocks are batches of threads able to cooperate by sharing data through shared memory and to synchronize their execution, coordinating memory accesses. A key point is that the CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). The blocks of the grid invoked by each kernel are distributed to SMs depending on their execution capacity, which includes on-chip memory resources. The use of on-chip memory, much faster than SGRAM, is of paramount importance to significantly increase GPU performance.
For that reason, the CUDA memory hierarchy, shown in Figure 1(b), is crucial to optimize memory accesses and achieve reasonable performance. Each SM has the following on-chip memory: one set of registers (R) per processor (C), a shared memory, a read-only constant cache and a read-only texture cache. These memory resources are shared by all cores (C_i) of the SM. This implies that the number of blocks an SM can process at once depends on the number of registers per thread and the shared memory per block required by a given kernel. For this reason, the use of shared memory can show relatively poor performance for computations using large arrays: CUDA cannot schedule more blocks to SMs than the multiprocessors can support in terms of shared memory and register usage, and thus the occupancy (number of active warps) deteriorates. A key point for the proposed GPU instance is that constant memory is stored in SGRAM but data are read through each multiprocessor's constant cache, which is on-chip memory. Constant memory is also optimized for broadcast, i.e. when the threads of a warp read the same location, but accesses are serialized when the threads of a warp read different locations. For these reasons, the proposed GPU instance uses constant memory to store the common elemental stiffness matrix, whereas the use of shared memory is limited to unknowns and elemental densities. Such a strategy achieves significant GPU performance for the calculations involved in density-based topology optimization.
Software development using CUDA consists of the following steps: i) memory allocation and transfer, ii) kernel execution on the GPU and iii) copying the results back to the host. The strategies to optimize code in GPU computing can be summarized as follows: i) optimization of parallel execution to achieve maximum use of the cores, ii) optimization of memory management to facilitate coalesced memory accesses, iii) optimization of instruction usage to achieve maximum instruction performance, and iv) optimization of communications to achieve minimal synchronization between parallel executions. The different effects of the proposed GPU implementation can be explained using these optimization criteria.

Density-based topology optimization
Topology optimization can be defined as a binary programming problem that aims to find the optimal material layout (solid and void) minimizing an objective function. Such a material layout should satisfy a set of prescribed constraints in the design domain. Density-based methods are the most widely used topology optimization methods due to their conceptual simplicity, which has facilitated their application in industrial software [44]. In these methods, the integer-based topology optimization problem is relaxed to a formulation based on artificial continuous material densities, which permits the use of gradient-based solvers. The topology optimization problem can be stated as

    min_ρ  f(ρ, u)
    s.t.   K u = f,
           V(ρ) ≤ V*,
           0 ≤ ρ(x) ≤ 1  ∀ x ∈ D,    (1)

where f is the objective function, ρ is the vector of density design variables, u is the system response, K is the global stiffness matrix, f is the force vector and x denotes the coordinates of the finite elements. The design domain is denoted by D and the volume of material V(ρ) is constrained to be smaller than a prescribed target V*. The unknown densities, ρ(x), are used to scale the stiffness of the finite elements of the regular grid. In practice, this parametrization leads to designs with large areas of intermediate densities which, even though numerically optimal, are impossible to manufacture. This problem is normally addressed using implicit relaxation/penalization techniques, which drive the topology design towards solid/void configurations. The SIMP method [9,45] makes use of such implicit penalization techniques through a power-law interpolation between void and solid material to determine the stiffness matrix of each element K_e as

    K_e(ρ_e) = K_min + ρ_e^p (K_0 − K_min),    (2)

where K_0 and K_min > 0 are the stiffness matrices of solid and void material respectively, and p > 1 is the penalization power. For problems where the volume constraint is active, Bendsøe and Sigmund [46] prove that the power-law interpolation is perfectly valid when p is sufficiently large.
In particular, p ≥ 3 is usually required to obtain black-and-white designs.
Although the use of material interpolation schemes enables obtaining almost solid-and-void designs, it destroys the convexity of the optimization problem, increasing the risk of ending in local minima. It is therefore common to use continuation methods to mitigate premature convergence to local minima when solving the optimization problem, see e.g. [47]. According to [48], continuation methods take "global" information into account and are more likely to ensure "global" convergence, or at least convergence to better designs. Different continuation methods have been proposed based on the idea of gradually changing the optimization problem from a convex problem to the original non-convex problem.
The topology optimization problem should also be regularized using additional restrictions on the density field to avoid numerical difficulties and modeling problems, such as mesh-dependency of solutions and checkerboard patterns [1]. The sensitivity filter [4] is adopted in this work because it has proven effective in practice, producing mesh-independent solutions. Moreover, the filtering of gradients has a continuum mechanics motivation and may promote the convergence of some length scales over others, thereby speeding up convergence [49]. Furthermore, the sensitivity filter has computational advantages because it is not included in the OC updating scheme loop. One drawback, however, is that discrepancies remain between the filtered sensitivities and the actual sensitivities, i.e. the modified sensitivities do not completely correspond to the objective function. In theory, this may lead to some divergence problems, though proper designs are obtained in practice. The sensitivity filter applies a smoothing filter to the derivatives of the objective function as follows:

    ∂f̂/∂ρ_e = ( Σ_{i ∈ NB_e} w(x_i, x_e) ρ_i ∂f/∂ρ_i ) / ( max(γ, ρ_e) Σ_{i ∈ NB_e} w(x_i, x_e) ),    (3)

where NB_e is the neighborhood set of an element e, w(x_i, x_e) is a weighting function and γ > 0 is a small value to prevent division by zero. The linear weighting function is used in this work, defined as

    w(x_i, x_e) = R − dist(i, e),    (4)

whereas the neighborhood set of an element e is defined as

    NB_e = { i : dist(i, e) ≤ R },

where R is the filter size and dist(i, e) is the Euclidean distance between the center of element i and the center of element e.
The SIMP method can be applied to diverse physics problems. The numerical experiments of this work address the stiffness maximization of continuum structures, the design of heat sinks cooled by heat conduction and the design of compliant mechanisms. The objective function for the minimization of structural compliance (maximization of stiffness) and the minimization of thermal compliance (maximization of heat transfer) is given by

    f(ρ, u) = fᵀ u,    (5)

whereas the objective function for the compliant mechanism design, consisting of the maximization of the output displacement, is as follows:

    f(ρ, u) = lᵀ u = u_out,    (6)

where f and u are the global force/thermal load and displacement/temperature vectors respectively for elasticity/heat transfer problems, and l is a vector with ones at the Degrees of Freedom (DoF) of the output displacements u_out and zeros in all other positions. Considering the discretized linear state system Ku = f and using the adjoint state method, the sensitivity of (5) and (6) with respect to the design variables ρ is as follows:

    ∂f/∂ρ_e = −(u*)ᵀ (∂K/∂ρ_e) u,    (7)

where u* is given by the solution of the adjoint problem

    K u* = ∂f/∂u.    (8)

The right-hand side of the adjoint problem is ∂f/∂u = f for the minimization of both structural and thermal compliance, which means that these problems are self-adjoint and the solution of (8) is u* = u. Conversely, the right-hand side of (8) is ∂f/∂u = l for the compliant mechanism design, which requires solving the adjoint system (8) to obtain u*.
The sensitivities given by (7) permit updating the design variables ρ using sequential convex approximations, such as Sequential Quadratic Programming (SQP) [50] and the Method of Moving Asymptotes (MMA) [51]. The Optimality Criterion (OC) updating scheme proposed by [52] and modified by [53] is adopted in this work due to its numerical efficiency. The OC updating scheme is as follows:

    ρ_e^{k+1} = max(0, ρ_e^k − ζ)    if ρ_e^k B_e^η ≤ max(0, ρ_e^k − ζ),
                min(1, ρ_e^k + ζ)    if ρ_e^k B_e^η ≥ min(1, ρ_e^k + ζ),
                ρ_e^k B_e^η          otherwise,    (9)

where ζ is a positive step width, η is a numerical damping coefficient, q is a penalty factor to further achieve black-and-white topologies (typically q = 2) and

    B_e = −(∂f/∂ρ_e) / (λ ∂V/∂ρ_e)

is the Karush-Kuhn-Tucker (KKT) optimality condition. The Lagrange multiplier λ is found using the bisection method. The algorithm stops when the maximum number of iterations is reached or when the change in the design variables ||ρ^{k+1} − ρ^k||_∞ and the change in the objective function |f_{k+1} − f_k| fall below a prescribed value.
The FEA is the principal bottleneck of the topology optimization pipeline. This stage involves two computationally intensive tasks: the assembly of the local element equations into a global system of equations and the solving of such a system. These tasks can lead to an unaffordable problem in terms of computation time and memory consumption, which is exacerbated when dealing with large-scale models [54] or when the system response needs to be re-evaluated, as occurs in topology optimization. Iterative solvers and assembly-free methods have been extensively used to reduce the memory requirements of FEA at the cost of increasing the processing time of the solving step, which is commonly alleviated using parallel computing.

GPU implementation of SIMP method
GPU computing is used to accelerate the computationally intensive tasks involved in the SIMP method. Such tasks are shown in the flowchart depicted in Figure 3; in particular, the Finite Element Analysis (FEA), the calculation of the sensitivities, the filtering strategy and the OC updating scheme. The custom-developed CUDA kernels and the techniques adopted for the efficient implementation of these computationally intensive tasks on GPU architectures are detailed below.

Finite Element Analysis (FEA)
In this work, the performance of the solving stage is evaluated using a matrix-free PCG method with two different preconditioning techniques: the geometric multigrid preconditioner and the Jacobi preconditioner. Geometric multigrid methods are based on the smoothing property and the coarse grid principle.
The former reduces the high frequency error components whereas the latter approximates the low frequency error components on coarser grids, which are then prolonged to the finer grids. Their major advantage is that they have an asymptotically optimal complexity of O(N ) and provide mesh-independent convergence and good parallel scalability [55]. However, the performance of these methods deteriorates with increasing contrast in material properties [56].
This is attributed to coarsening across discontinuities, which affects the coarse grid correction [57]. Nevertheless, the use of geometric multigrid as a preconditioning technique shows good convergence rates for topology optimization problems using a sufficiently strong smoothing operator [58].
The use of a regular grid permits calculating and storing the common elemental stiffness matrix at the finest grid, K_e^0, only once at the beginning of the optimization, whereas the global matrix K at the finest grid can be applied "on-the-fly" using the elemental properties d_e = E_void + ρ_e^p (E_solid − E_void) (for simplicity) for each analysis. This meaningfully reduces the use of device memory and permits exploiting data locality [39]. Such an approach is enough for the Jacobi preconditioner, but the geometric multigrid preconditioner requires the assembled coefficients at the coarser levels, which are computationally intensive to calculate "on-the-fly" and require significant memory resources when stored. In particular, a Galerkin-based coarsening is required and the assembled coefficients at the coarser levels need to be stored. This deteriorates GPU performance due to global memory accesses through large arrays, which do not permit exploiting data locality.

The GPU instance to calculate and store the assembled matrices of coefficients C at the coarser levels is of paramount importance for an efficient geometric multigrid implementation. The use of a regular grid permits knowing a priori that the contributions to the matrix of coefficients are bounded by 8 elements and 27 nodes per node. This permits setting the maximum size of the global stiffness coefficients per node, which is bounded by 27 matrices of dimension 3 × 3.
Such 27 matrices are related to the contributions of the 3^3 grid neighborhood of the node. This storage scheme requires the indexes of the adjacent nodes, which are stored in a vector I of integers where −1 means that the node does not exist. The storage of the global stiffness matrix C per node has a similar size to a sparse-matrix representation, but permits allocating the required memory for the assembly at the beginning, which has significant computational benefits for GPU computing.
The coefficient matrices for the coarser levels are obtained from the finer levels using a Galerkin-based coarsening following C_{ℓ+1} = R C_ℓ P, where R and P are the restriction and prolongation operators respectively. Following [38], the coarsening operation is computed in a node-by-node matrix-free fashion using a two-step approach. Firstly, a linear combination of the 3^3 fine grid neighborhood of the considered node is performed, corresponding to a linear combination of the rows of C_ℓ. Secondly, these coefficients are interpolated to the coarser grid vertices, corresponding to a linear combination of the columns of C_ℓ.

Additionally, the GPU instance requires the calculation of diverse vector arithmetic operations, which are implemented using custom-developed CUDA kernels, labeled "dbdkerX", with granularity at the DoF level. Some of these kernels require the synchronization of the threads involved in the computation to add the resulting data of all these threads. This can be done using atomic addition in CUDA, which permits reading, modifying and writing a value back to device memory without interference from any other thread. However, its use should be minimized because the operations are serialized when the same memory address is accessed at the same time, which can deteriorate the performance of the kernel execution. For this reason, the addition is computed as a reduction using the Thrust library.

The pseudocode of the Jacobi preconditioner (dbdJacP) kernel is detailed in Algorithm 2. It calculates the Jacobi preconditioner "on-the-fly" using the common stiffness matrix K_e^0 and the vector d containing the elemental Young's modulus/thermal conductivity for elasticity/heat conduction problems. This simple preconditioner is computationally cheap and only requires storing a vector of the dimension of the unknowns. The pseudocode of the matrix-vector product (dbdMVP) kernel is shown in Algorithm 5.
This algorithm performs the matrix-vector operation "on-the-fly" for the finest level (ℓ = 0) using the vector d and the common stiffness matrix K_e^0. For the coarser levels, it takes the matrix of coefficients contributing to the node to perform the operation. In either case, the grain size of the operations is at the DoF level.

The pseudocode of the geometric multigrid preconditioner (Vcycle) kernel is shown in Algorithm 3. Such a preconditioning is carried out by a recursive call to the V-cycle algorithm. The algorithm requires as input data the vector d, the common elemental stiffness matrix K_e^0 and the vector I_c of boundary conditions for the finest level (ℓ = 0), and the assembled matrix of coefficients C for the coarser levels. It also needs the vector I of neighbor nodal indexes per node, the residual r and the parameters ω, µ1 and µ2 for all the levels. The algorithm performs a matrix-vector product (dbdMVP) and the multigrid smoother (dbdDJS) for each level. Besides, diverse vector arithmetic operations are performed using custom-developed kernels. To transfer information between two consecutive grids Ω_{ℓ+1} and Ω_ℓ, nodal-based GPU instances of the prolongation operator P: Ω_{ℓ+1} → Ω_ℓ and the restriction operator R: Ω_ℓ → Ω_{ℓ+1} are introduced. The geometric relationship between hierarchical grids allows us to avoid storing the prolongation and restriction operators and to work with the stencils instead, which are constant or can be computed "on-the-fly" when needed. The number of levels is selected to ensure the coarsest level is small enough to be solved using a sparse LU decomposition on the CPU. When the number of levels is properly selected, the number of DoFs in the coarsest grid is relatively small and the system of equations can be solved with a direct method on the CPU. The pseudocode of the multigrid smoother (dbdDJS) kernel is shown in Algorithm 4. Such a smoother is based on the damped Jacobi method, which uses the inverse of the diagonal of the global stiffness matrix with a relaxation parameter ω.

Calculation of sensitivities
The calculation of sensitivities (7) is implemented on GPU with granularity at the element level, computing the elemental contributions according to (7). Finally, the objective function f is computed using a reduction operation with the Thrust library. The compliant mechanism synthesis also requires the calculation of the objective function (u_out) as well as the adjoint state (u*), which are calculated on GPU using the custom-developed CUDA kernel detailed in Algorithm 1.

Filtering strategy
The filtering strategy aims to prevent numerical artifacts in the optimal solution. Algorithm 7 shows the pseudocode of the EbE GPU implementation of the three-dimensional sensitivity filter (ebeFILTER) following (3). The input data of such an algorithm are: the density design variables ρ, the vector of sensitivities f_ρ, the finite element size (d_x, d_y, d_z), the number of divisions of the grid (n_x, n_y, n_z) and the filter radius R. Note that the ebeFILTER kernel cannot be implemented using a three-dimensional launching approach. This is because the radius size of (4) requires a large halo in all the dimensions of the cached data in shared memory, which normally exceeds the limit of such a resource or forces the launch of very small three-dimensional blocks that do not make use of the parallelization potential of GPU computing. The kernel applies to each element e ∈ E, going through each element that falls within the support of the linear convolution function (3). The regularity of the grid permits efficiently looking for the neighbors of the corresponding element, avoiding the use of a global index table that would require a substantial number of memory accesses. The granularity of the ebeFILTER CUDA kernel is at the element level, assigning one CUDA thread to each element when invoking such a kernel.

Optimality Criterion (OC) update scheme
The pseudo-code of the EbE GPU implementation of the density updating strategy (ebeOC) is shown in Algorithm 8. The implementation makes use of the bisection method and the optimality criterion according to (9). The bisection method repeatedly halves the interval to calculate the midpoint Lagrange multiplier λ m, which is then used to evaluate the KKT optimality condition. The second kernel updates the element densities ρ following (9) and copies back the calculated volume V new to update the Lagrange multiplier, which is used to evaluate the stopping criterion. This procedure is repeated until the convergence criterion is satisfied. The vector ρ is finally copied back to host memory.
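A minimal sketch of the ebeOC logic, assuming the standard damped optimality-criterion update with move limits (the damping exponent η and move limit ζ mirror the parameter names reported for the numerical experiments; the exact update of the paper may differ):

```python
import numpy as np

def oc_update(rho, f_rho, vol_frac, eta=0.3, zeta=0.1,
              lam_lo=0.0, lam_hi=1e9, tol=1e-6):
    """Sketch of ebeOC: bisection on the Lagrange multiplier combined
    with a damped optimality-criterion density update (illustrative)."""
    while lam_hi - lam_lo > tol * (lam_lo + lam_hi):
        lam = 0.5 * (lam_lo + lam_hi)           # midpoint multiplier
        # OC update with damping eta, clipped by move limit zeta and [0, 1]
        cand = rho * np.maximum(-f_rho, 0.0) ** eta / lam ** eta
        cand = np.clip(cand, rho - zeta, rho + zeta)
        cand = np.clip(cand, 1e-3, 1.0)
        # the volume of the candidate design drives the bisection
        if cand.mean() > vol_frac:
            lam_lo = lam                        # too much material: raise lambda
        else:
            lam_hi = lam
    return cand
```

On GPU the candidate volume is obtained with a reduction and copied back to the host, which performs the scalar bisection step.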

GPU implementation and memory management
The GPU instance of density-based topology optimization consists of the custom-developed CUDA kernels for the computationally demanding tasks involved in the algorithm, as outlined in Figure 3. The device memory allocated for the custom-developed kernels, omitting the allocation of scalar values, is shown in Figure 5. The dbdPCG kernel using the Jacobi preconditioner requires the storage in global device memory of the vectors d, f, u, r, p, a, z and I c indicated in the pseudo-code of Algorithm 1.
When the geometric multigrid preconditioner is used, the vectors s, v, I e and I c are also stored in the global device memory for the corresponding n l levels. The common elemental stiffness matrix K e 0 is stored in constant memory. This saves bandwidth because constant memory is cached and consecutive reads of the same address do not incur any additional memory traffic. Besides, a single read from constant memory is broadcast to the threads of a half-warp. Additionally, the vectors ρ, d ρ, u *, f ρ and the filtered sensitivities required by the kernels ebeUKU, ebeFILTER and ebeOC are also stored in the global device memory. In the case of compliance minimization problems, the memory allocation for the adjoint state u * can be omitted because such problems are self-adjoint.
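The roles of the vectors u, r, p, a and z kept in device memory by dbdPCG can be seen in a plain CPU sketch of Jacobi-preconditioned conjugate gradients. Here a dense SPD matrix stands in for the matrix-free EbE operator; this is a reference of the algorithm, not of the CUDA kernel.

```python
import numpy as np

def pcg_jacobi(A, f, tol=1e-8, max_iter=1000):
    """Jacobi-preconditioned CG, mirroring the vector set of dbdPCG:
    u (solution), r (residual), p (search direction),
    a (matrix-vector product), z (preconditioned residual)."""
    u = np.zeros_like(f)
    inv_diag = 1.0 / np.diag(A)       # Jacobi preconditioner M^-1
    r = f - A @ u                     # initial residual
    z = inv_diag * r                  # preconditioned residual
    p = z.copy()                      # search direction
    rz = r @ z
    for _ in range(max_iter):
        a_vec = A @ p                 # the matrix-vector product (dbdMVP on GPU)
        alpha = rz / (p @ a_vec)
        u += alpha * p
        r -= alpha * a_vec
        if np.linalg.norm(r) <= tol * np.linalg.norm(f):
            break
        z = inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return u
```

Each line maps to one or two CUDA kernels on the device; the scalars alpha and rz are obtained via reductions.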

Numerical experiments
The performance of the proposed GPU instance of density-based topology optimization is evaluated using three topology optimization problems in different fields: the stiffness design of continuum structures, the design of a heat sink cooled by heat conduction, and the synthesis of a compliant mechanism.
The first two numerical experiments aim to analyze the use of the two different preconditioning techniques. A continuation strategy for the parameter p is adopted for all the numerical experiments; in particular, following [53], p is kept at p = 1 during the first iterations and then increased by a factor γ per iteration until it saturates at its final value. This continuation strategy is represented in Figure 6 for different γ values. By modifying the parameter p, the optimization problem is gradually changed from a convex problem to the original non-convex problem, at a rate governed by the parameter γ. Based on the experience reported by Groenwold and Etman [53], the γ parameter is set to 1.02 in all the numerical experiments.
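A sketch of a continuation schedule consistent with the behavior described later (p = 1 for the first 20 iterations, then multiplied by γ = 1.02 each iteration until it saturates at p = 3 around iteration 75); the function and parameter names are illustrative:

```python
def penalty_schedule(k, gamma=1.02, p_max=3.0, hold=20):
    """Continuation of the SIMP penalty p at iteration k (sketch).

    p stays at 1 for the first `hold` iterations, is then multiplied
    by gamma per iteration, and is capped at p_max."""
    if k <= hold:
        return 1.0
    return min(gamma ** (k - hold), p_max)
```

With γ = 1.02 this yields p ≈ 2.97 at iteration 75, matching the evolution shown in Figure 6.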
The minimum compliance design problem consists of finding the material density distribution that minimizes the deformation of the structure, a tied-arch bridge in our case, under the prescribed loading and boundary conditions. The resulting structural design of the tied-arch bridge is shown in Figure 7. Only half of the design domain is analyzed by exploiting the horizontal plane of symmetry; the domain is discretized using 160×40×80 eight-node hexahedral linear brick elements, i.e. 512,000 elements or 1,604,043 DoFs. The target volume is about 10% of the volume of the design domain. The topology optimization parameters are: filter radius R = 4 mm and factor q following (12).

The parameters of the bisection method used to infer the Lagrange multiplier λ are: η = 0.3, ζ = 0.1, λ l = 0, λ u = 10^9 and λ min = 10^-40. The resulting mechanism design is shown in Figure 8(b), where the final design is thresholded at ρ = 0.99 and colored by the magnitude of the displacement field. One can observe that the gripping "jaws" along with the hinges resemble the common topology of grippers.
The heat sink design considering conduction heat transfer, also known as the volume-to-point heat conduction problem [59], consists of the minimization of thermal compliance to maximize the heat conduction away from the heat sink. The volume is subjected to a heat generation rate at every point of the design domain and cooled through a small patch (heat sink) located in the middle of its upper face. The thermal conductivity matrix of each element is considered, according to (2), as the distribution of two material phases comprising a good thermal conductor (K 0) and a poor conductor (K min). Both material phases are considered homogeneous and isotropic, with no temperature effect on their conductivities. Therefore, the topology optimization problem aims to find the optimal distribution of "good" thermal conductor that minimizes the highest temperature under a volume constraint. The design domain and the boundary conditions for this benchmark are shown in Figure 9.

The discreteness of the optimized designs is evaluated using the measure of non-discreteness M nd, where M nd = 100% indicates that the design is totally gray and M nd = 0% means that the design is fully discrete. One can observe in Figure 10 that the numerical experiments show stable convergence to fully discrete designs. Figure 11 shows the evolution of the objective function for the different topology optimization problems; all of them show stable convergence of the objective function. All the numerical experiments are performed using γ = 1.02 following (11). We can observe in Figure 6 how the parameter p is fixed to p = 1 for the 20 initial iterations and then increased until p = 3 at iteration 75. The objective function for the different topology optimization problems converges to a minimum with intermediate densities, as shown in Figure 10, at iteration 20. As the penalty parameter p is increased, the objective function converges to local minima with a lower M nd until a fully discrete design is found.
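The M nd values reported here follow the usual measure of non-discreteness due to Sigmund, which is consistent with the 0–100% interpretation above (100% for a uniformly gray design, 0% for a pure 0/1 design):

```python
import numpy as np

def measure_of_non_discreteness(rho):
    """Sigmund's measure of non-discreteness M_nd, in percent:
    M_nd = mean(4 * rho * (1 - rho)) * 100.
    100% for rho = 0.5 everywhere, 0% for a fully discrete design."""
    rho = np.asarray(rho, dtype=float)
    return float(np.mean(4.0 * rho * (1.0 - rho)) * 100.0)
```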
The efficient calculation of matrix-vector multiplication operations is crucial to obtain reasonable performance in the solving stage. The performance of this operation using shared memory is evaluated by modifying the block size used to launch the dbdMVP kernel. This is done by launching one FEA of the tied-arch bridge problem with different three-dimensional block size configurations. Table 2 shows the battery of experiments performed to tune the block size, sorted by wall-clock time. As a general rule, the performance is maximized using the maximum amount of on-chip memory and the maximum number of threads. In addition, the GPU performance of the dbdMVP kernel increases with the x dimension of the block size because the unknowns are stored along that dimension. Table 3 details the thread hierarchy used for invoking the CUDA kernels in the different benchmarks. One can observe that the three-dimensional kernels with granularity at the DoF level use the block size configuration obtained from the experiment shown in Table 2. Three-dimensional kernels with granularity at the element level use symmetric block sizes to maximize the use of shared memory.
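The matrix-free operation that dbdMVP performs can be summarized on the CPU as a gather/local-product/scatter loop over elements. The sketch below is illustrative (one DoF per node, an explicit connectivity list); the actual kernel exploits the regular grid and shared memory instead of an index table.

```python
import numpy as np

def ebe_matvec(K0, conn, d, u):
    """Element-by-element product a = K u without assembling K:
    gather the element unknowns, apply the common elemental matrix K0
    scaled by the element property d_e, and scatter-add the result.

    K0   : common elemental matrix (identical for all elements)
    conn : per-element node index lists
    d    : per-element scaling (e.g. penalized density)
    u    : global vector of unknowns
    """
    a = np.zeros_like(u)
    for e, nodes in enumerate(conn):
        a[nodes] += d[e] * (K0 @ u[nodes])  # gather, local product, scatter
    return a
```

On GPU the scatter-add requires care (atomics or coloring) because neighboring elements share nodes; the CPU loop hides that detail.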
These kernels do not operate over information with one dimension significantly larger than the others. The kernels that do not use on-chip memory are launched with one thread per DoF for elasticity problems. On the contrary, the decrease of the speedup using the geometric multigrid preconditioner is attributed to the larger size of the stiffness matrices of coefficients, which are stored in global device memory.
The memory accesses of these large arrays dominate the potential speedup. Figure 13 shows the percentage of the total wall-clock time of each kernel in the topology optimization using the Jacobi and the geometric multigrid preconditioning. As expected, the principal bottleneck of the topology optimization algorithm is the solving of the system of equations of the finite element model.
The wall-clock time for elasticity problems is around 98% of the total wall-clock time using the Jacobi preconditioning, whereas this percentage is around 87% using the geometric multigrid preconditioning. This percentage is considerably reduced in the heat conduction problem, where the solving stage using the most modern graphics device (GK110b) takes about 55% of the total wall-clock time with both preconditioners. This is mainly attributed to the smaller number of iterations required by PCG to achieve convergence in the heat conduction problem. As a consequence, the percentages of the total wall-clock time consumed by the filtering strategy and the bisection method become significant in the heat conduction problem, and the use of GPU computing for these stages can notably accelerate the topology optimization process.

Conclusion
This paper has investigated the strategies and techniques needed to achieve efficient calculation and reasonable speedups using GPU computing for the computationally intensive tasks of density-based topology optimization methods. The performance of the solving stage using GPU computing is analyzed using two preconditioning techniques: the Jacobi preconditioner and the geometric multigrid preconditioner. Different granularities are used to facilitate the exploitation of massively parallel architectures. The solving of the system of equations is implemented with granularity at the DoF level, whereas element grain size is used for the calculation of sensitivities, the filtering strategy and the optimality criteria method. A nodal-based strategy is adopted for the coarsening of the geometric multigrid preconditioner and for the prolongation and restriction operators. Significant speedup is achieved in the solving stage by using constant memory to store the common elemental matrix K 0 and shared memory to store the unknowns and the vector of elemental properties d of the topology optimization. Besides, the three-dimensional kernel that processes the matrix-vector multiplication operations is well suited to massively parallel processing, showing good performance results.
The numerical results show that the proposed GPU instance achieves significant speedups for the computationally demanding tasks involved in density-based topology optimization. Nevertheless, we remark that such speedups are calculated with respect to the sparse-matrix calculation using one CPU thread, and thus lower speedups would be obtained against a multi-threaded CPU implementation. The numerical results also show that the best performance in wall-clock time is obtained using the geometric multigrid preconditioner in the GPU instance of the PCG. Moreover, the experiments show the efficiency of the proposed GPU implementation for solving topology optimization problems in three different fields: maximum stiffness design, heat sink design and compliant mechanism design. Furthermore, the numerical experiments show that the proposed GPU instance scales with the resources of the graphics units, a promising result for achieving higher speedups on new generations of graphics cards.

Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of some of the GPU devices used for this research. This work has also been supported by the AEI/FEDER and UE under the contract DPI2016-77538-R and by the "Fundación Séneca - Agencia de Ciencia y Tecnología de la Región de Murcia" under the contract 19274/PI/14.