
Christian Engelmann
Senior Scientist and Group Leader, Intelligent Systems and Facilities Research
Bio
Dr. Christian Engelmann is a Senior Computer Scientist and the Intelligent Systems and Facilities Research Group Leader at Oak Ridge National Laboratory (ORNL), the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory with an annual budget of $2.8 billion and 7,000+ staff. He has more than 24 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems. Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, and interoperability.
Dr. Engelmann’s primary expertise is in resilience, i.e., efficiency and correctness in the presence of faults, errors, and failures. He is a leading HPC resilience expert and was a member of the DOE Technical Council on HPC Resilience from 2013 to 2015. He received the 2015 DOE Early Career Award. Dr. Engelmann’s secondary expertise is in enabling science breakthroughs with autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence (AI) driven design, discovery, and evaluation. He further has expertise in studying the impact of hardware/software properties on performance and resilience for application-architecture co-design. Dr. Engelmann is also an expert in operating system and runtime software for parallel and distributed systems.
Dr. Engelmann earned a Dipl.-Ing. (FH) in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, and an M.Sc. in Computer Science from the University of Reading, UK, both in 2001 as conjoint degrees, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE). He is also a Member of the Society for Industrial and Applied Mathematics (SIAM) and the Advanced Computing Systems Association (USENIX).
Professional Experience
- Group Leader, Intelligent Systems and Facilities – Oak Ridge National Laboratory (10/2020-Present)
- Senior R&D Staff – Oak Ridge National Laboratory (4/2018-Present)
- R&D Staff – Oak Ridge National Laboratory (9/2009-3/2018)
- R&D Associate – Oak Ridge National Laboratory (5/2004-8/2009)
- Post-Master’s Research Associate – Oak Ridge National Laboratory (6/2001-4/2004)
- Software Developer – 91°µÍø (8/2000-1/2001)
- Software Developer – Hewlett-Packard, Germany (10/1998-9/1999)
Awards
Education
- Ph.D. in Computer Science – University of Reading, UK (12/2008)
- M.Sc. in Computer Science – University of Reading, UK (7/2001)
- Dipl.-Ing. (FH) in Computer Systems Engineering – University of Applied Sciences Berlin, Germany (2/2001)
Professional Affiliations
- Association for Computing Machinery (ACM) – Senior Member
- Institute of Electrical and Electronics Engineers (IEEE) – Senior Member
- 91°µÍø
- 91°µÍø
- IEEE Computer Society (CS)
- IEEE Reliability Society (RL)
Publications
Other Publications
Highly Cited Peer-Reviewed Publications
- A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, June, 2007. Accept. rate 23.6% (29/123). 527 citations.
- M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. Debardeleben, P. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing Failures in Exascale Computing. , volume 28, number 2, May, 2014. 526 citations.
- D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, and R. Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the , November, 2012. Accept. rate 21.2% (100/472). 386 citations.
- C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the , November, 2008. Accept. rate 21.3% (59/277). 250 citations.
- J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the , June, 2012. Accept. rate 13.8% (71/515). 203 citations.
Other Significant Publications
- M. Kumar, S. Gupta, T. Patel, M. Wilder, W. Shi, S. Fu, C. Engelmann, and D. Tiwari. Study of Interconnect Errors, Network Congestion, and Applications Characteristics for Throttle Prediction on a Large Scale HPC System. , volume 153, July, 2021.
- G. Ostrouchov, D. Maxwell, R. Ashraf, C. Engelmann, M. Shankar, and J. Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the , November, 2020. Accept. rate 25.1% (95/378).
- H. Jeong, Y. Yang, C. Engelmann, V. Gupta, T. M. Low, P. Grover, V. Cadambe, and K. Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the , August, 2020. Accept. rate 24.5% (39/159).
- D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the , June, 2016. Accept. rate 24.2% (43/178).
- C. Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. , volume 30, number 0, January, 2014.