We discuss the design of algorithms and codes for solving large sparse systems of linear equations on extreme-scale computers, which are characterized by many nodes with multicore CPUs or GPUs.

We first use two approaches to obtain good single-node performance. For symmetric systems, we use task-based algorithms built on an assembly tree representation of the factorization, and we use runtime systems to schedule the computation on both multicore CPU nodes and GPU nodes [6]. We are also concerned with the efficient parallel implementation of the solve phase using the computed sparse factors, and we report results that compare favourably with other state-of-the-art codes [3]. For unsymmetric systems, our second approach is a new parallel threshold Markowitz algorithm [4] based on Luby's method [7] for obtaining a maximal independent set in an undirected graph. This is a significant extension of Luby's method because our graph model is a directed graph.

We then extend both approaches to exploit distributed memory parallelism. For the first approach, we base our work on the block Cimmino algorithm [1], using the ABCD software package coded by Zenadi in Toulouse [5, 8]. The kernel of this algorithm is the direct factorization of a symmetric indefinite submatrix, for which we use the symmetric code described above. To extend the unsymmetric code to distributed memory, we use the Zoltan code from Sandia [2] to partition the matrix to singly bordered block diagonal form and then apply the unsymmetric code to the blocks on the diagonal. In both cases, we illustrate the added parallelism obtained by combining distributed memory parallelism with high single-node performance, and we show that our codes outperform other state-of-the-art codes.

This work is joint with a number of people. We developed the algorithms and codes in the EU Horizon 2020 project NLAFET, which finished on 30 April 2019.
Coworkers in this project were Sebastien Cayrols, Jonathan Hogg, Florent Lopez, and Stojce Nakov. Collaborators in the block Cimmino part of the project were Philippe Leleux, Daniel Ruiz, and Sukru Torun. Our codes are available in the GitHub repository https://github.com/NLAFET.

Contact: iain.duff@stfc.ac.uk
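
To make the reference to Luby's method [7] concrete, the following is a minimal sketch of the random-priority variant of that method on an undirected graph. It is not the directed-graph extension used in the parallel Markowitz algorithm [4]. The function names `luby_mis` and `is_maximal_independent`, the adjacency-dictionary input, and the `seed` parameter are assumptions made for illustration. Each round is written sequentially here, but every vertex decision within a round depends only on its neighbours, which is what makes the method attractive in parallel.

```python
import random


def luby_mis(adj, seed=0):
    """Return a maximal independent set of an undirected graph.

    adj maps each vertex to the set of its neighbours (hypothetical input
    format chosen for this sketch; adjacency must be symmetric).
    """
    rng = random.Random(seed)
    active = set(adj)          # vertices not yet decided
    mis = set()
    while active:
        # Every active vertex draws a random priority; the vertex itself
        # is appended so that ties are broken consistently.
        prio = {v: (rng.random(), v) for v in sorted(active)}
        # A vertex wins if its priority is lower than that of every
        # active neighbour. Two adjacent vertices can never both win.
        winners = {
            v for v in active
            if all(prio[v] < prio[u] for u in adj[v] if u in active)
        }
        mis |= winners
        # Winners and their neighbours leave the graph for the next round.
        removed = set(winners)
        for v in winners:
            removed |= adj[v] & active
        active -= removed
    return mis


def is_maximal_independent(adj, s):
    """Check that s is independent and cannot be enlarged."""
    independent = all(adj[v].isdisjoint(s) for v in s)
    maximal = all(v in s or (adj[v] & s) for v in adj)
    return independent and maximal


# Small example: a cycle on six vertices.
cycle = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
result = luby_mis(cycle)
```

The vertex with the globally smallest priority always wins its round, so at least one vertex is removed per round and the loop terminates. The checker `is_maximal_independent` can be used to confirm the result for any seed.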

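The remark that the kernel of block Cimmino is the factorization of a symmetric indefinite submatrix can be illustrated with the standard textbook form of the iteration. The notation below is an assumption of this sketch and does not come from the abstract: the system $Ax = b$ is split into $p$ row blocks $A_i$ with right-hand sides $b_i$, each $A_i$ is assumed to have full row rank, and $\omega$ is a relaxation parameter. The refinements described in [1] and [5] are not shown.

```latex
% Block Cimmino iteration: sum of projections onto the row blocks
x^{(k+1)} = x^{(k)} + \omega \sum_{i=1}^{p} A_i^{+}\left(b_i - A_i x^{(k)}\right),
\qquad A_i^{+} = A_i^{T}\left(A_i A_i^{T}\right)^{-1}.

% Each projection u_i = A_i^{+} r_i, with r_i = b_i - A_i x^{(k)},
% can be obtained from a symmetric indefinite augmented system:
\begin{pmatrix} I & A_i^{T} \\ A_i & 0 \end{pmatrix}
\begin{pmatrix} u_i \\ v_i \end{pmatrix}
=
\begin{pmatrix} 0 \\ r_i \end{pmatrix}.
```

Eliminating $v_i$ from the augmented system gives $u_i = A_i^{T}(A_i A_i^{T})^{-1} r_i = A_i^{+} r_i$. The augmented matrix does not change between iterations, so it can be factorized once per block and reused, and the $p$ projections are independent of one another.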
References

[1] M. ARIOLI, I. S. DUFF, J. NOAILLES, AND D. RUIZ, A block projection method for sparse matrices, SIAM J. Scientific and Statistical Computing, 13 (1992), pp. 47–70.

[2] E. BOMAN, K. DEVINE, L. A. FISK, R. HEAPHY, B. HENDRICKSON, C. VAUGHAN, U. CATALYUREK, D. BOZDAG, W. MITCHELL, AND J. TERESCO, Zoltan 3.0: Parallel Partitioning, Load-balancing, and Data Management Services; User's Guide, Sandia National Laboratories, Albuquerque, NM, 2007. Tech. Report SAND2007-4748W. http://www.cs.sandia.gov/Zoltan/ug_html/ug.html.

[3] S. CAYROLS, I. S. DUFF, AND F. LOPEZ, Parallelization of the solve phase in a task-based Cholesky solver using a sequential task flow model, Int. J. of High Performance Computing Applications, To appear (2019). NLAFET Working Note 20. RAL-TR-2018-008.

[4] T. A. DAVIS, I. S. DUFF, AND S. NAKOV, Design and implementation of a parallel Markowitz threshold algorithm, Technical Report RAL-TR-2019-003, Rutherford Appleton Laboratory, Oxfordshire, England, 2019. NLAFET Working Note 22. Submitted to SIMAX.

[5] I. S. DUFF, R. GUIVARCH, D. RUIZ, AND M. ZENADI, The augmented block Cimmino distributed method, SIAM J. Scientific Computing, 37 (2015), pp. A1248–A1269.

[6] I. S. DUFF, J. HOGG, AND F. LOPEZ, A new sparse symmetric indefinite solver using a posteriori threshold pivoting, SIAM J. Scientific Computing, To appear (2019). NLAFET Working Note 21. RAL-TR-2018-012.

[7] M. LUBY, A simple parallel algorithm for the maximal independent set problem, SIAM J. Computing, 15 (1986), pp. 1036–1053.

[8] M. ZENADI, The solution of large sparse linear systems on parallel computers using a hybrid implementation of the block Cimmino method, Thèse de Doctorat, Institut National Polytechnique de Toulouse, Toulouse, France, décembre 2013.