Christian Engelmann

Senior Scientist and Group Leader, Intelligent Systems and Facilities Research

Dr. Christian Engelmann is a Senior Computer Scientist and the Intelligent Systems and Facilities Research Group Leader at Oak Ridge National Laboratory (ORNL), the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory, with an annual budget of $2.8 billion and 7,000+ staff. He has more than 24 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems. Dr. Engelmann’s research addresses computer science challenges in HPC software, such as scalability, dependability, and interoperability.

Dr. Engelmann’s primary expertise is in HPC resilience, i.e., efficiency and correctness in the presence of faults, errors, and failures. He is a leading HPC resilience expert and was a member of the DOE Technical Council on HPC Resilience from 2013 to 2015. He received the 2015 DOE Early Career Award for his research. Dr. Engelmann’s secondary expertise is in enabling science breakthroughs with autonomous experiments, self-driving laboratories, smart manufacturing, and artificial intelligence (AI) driven design, discovery, and evaluation. He further has expertise in studying the impact of hardware/software properties on performance and resilience for application-architecture co-design. Dr. Engelmann is also an expert in operating system and runtime software for parallel and distributed systems.

Dr. Engelmann earned a Dipl.-Ing. (FH) in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, and an M.Sc. in Computer Science from the University of Reading, UK, both in 2001 as conjoint degrees, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE). He is also a Member of the Society for Industrial and Applied Mathematics (SIAM) and the Advanced Computing Systems Association (USENIX).

  • Group Leader, Intelligent Systems and Facilities – Oak Ridge National Laboratory (10/2020-Present)
  • Senior R&D Staff – Oak Ridge National Laboratory (4/2018-Present)
  • R&D Staff – Oak Ridge National Laboratory (9/2009-3/2018)
  • R&D Associate – Oak Ridge National Laboratory (5/2004-8/2009)
  • Post-Master’s Research Associate – Oak Ridge National Laboratory (6/2001-4/2004)
  • Software Developer – 91°µÍø (8/2000-1/2001)
  • Software Developer – Hewlett-Packard, Germany (10/1998-9/1999)
  • Ph.D. in Computer Science – University of Reading, UK (12/2008)
  • M.Sc. in Computer Science – University of Reading, UK (7/2001)
  • Dipl.-Ing. (FH) in Computer Systems Engineering – University of Applied Sciences Berlin, Germany (2/2001)
  • Association for Computing Machinery (ACM) – Senior Member
  • Institute of Electrical and Electronics Engineers (IEEE) – Senior Member
    • IEEE Computer Society (CS)
    • IEEE Reliability Society (RL)

Highly Cited Peer-Reviewed Publications

  1. A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, June, 2007. DOI . Accept. rate 23.6% (29/123). 527 citations.
  2. M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. Debardeleben, P. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing Failures in Exascale Computing. , volume 28, number 2, May, 2014. DOI . 526 citations.
  3. D. Fiala, F. Mueller, C. Engelmann, K. Ferreira, R. Brightwell, and R. Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the , November, 2012. DOI . Accept. rate 21.2% (100/472). 386 citations.
  4. C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the , November, 2008. DOI . Accept. rate 21.3% (59/277). 250 citations.
  5. J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the , June, 2012. DOI . Accept. rate 13.8% (71/515). 203 citations.

Other Significant Publications

  1. M. Kumar, S. Gupta, T. Patel, M. Wilder, W. Shi, S. Fu, C. Engelmann, and D. Tiwari. Study of Interconnect Errors, Network Congestion, and Applications Characteristics for Throttle Prediction on a Large Scale HPC System. , volume 153, July, 2021. DOI .
  2. G. Ostrouchov, D. Maxwell, R. Ashraf, C. Engelmann, M. Shankar, and J. Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the , November, 2020. DOI . Accept. rate 25.1% (95/378).
  3. H. Jeong, Y. Yang, C. Engelmann, V. Gupta, T. M. Low, P. Grover, V. Cadambe, and K. Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the , August, 2020. DOI . Accept. rate 24.5% (39/159).
  4. D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the , June, 2016. DOI . Accept. rate 24.2% (43/178).
  5. C. Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. , volume 30, number 0, January, 2014. DOI .