Scientific Context.
Over the last decade, new actors have emerged in the space industry, many of them private companies that are sometimes not backed by government agencies. This trend is known as “new-space”, and these actors usually seek to provide new commercial services with low-cost technologies [1]. New-space advocates the use of COTS (commercial off-the-shelf) components to make commercial services as competitive as possible. COTS components provide higher computing and storage capabilities [2,11], which makes it possible to embed AI technologies. They contribute to future commercial services, especially when data has to be exploited in space, close to the sensors, for instance with a DPU (Data Processing Unit) [3]. Unfortunately, COTS components are not specifically hardened for space and are therefore much more prone to failure than traditional space components.
Targeted research questions.
In the COINS project, we investigate the use of COTS processors running AI applications for new-space and, in particular, how transient failures of COTS components may impact the performance, safety and predictability of AI applications in space. We argue that some AI applications do not require the level of fault tolerance provided by traditional space technologies. The project will investigate how AI applications can tolerate transient failures without jeopardizing computation results, performance and timing predictability. In doing so, COINS may contribute to reducing the cost of space systems. Radiation-induced bit flips are an example of the transient faults that COINS will consider. In traditional space components, bit flips are corrected by dedicated devices (e.g. scrubbers [4]) that drastically reduce computing efficiency; for fault-tolerant AI applications, this overhead could be reduced by specialized scrubbers. To summarize, COINS will explore how to co-design, on both the hardware and the software sides, fault tolerance mechanisms that maximize the timing predictability, performance and safety of AI applications running on unhardened COTS components.
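As a minimal illustration of why the impact of a bit flip depends on where it lands, the sketch below inverts a single bit in the encoding of a value. It assumes values stored as IEEE 754 single-precision floats, which is an assumption made for this example and not a format specified by COINS; the helper name flip_bit and the sample weight are hypothetical.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding inverted.

    Bit 0 is the least significant mantissa bit, bits 23-30 hold the
    exponent, and bit 31 is the sign bit (IEEE 754 single precision).
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in 0..31")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the selected bit and reinterpret the result as a float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # hypothetical model weight, chosen only for illustration

low_mantissa = flip_bit(weight, 0)    # tiny perturbation of the value
high_exponent = flip_bit(weight, 30)  # value jumps by many orders of magnitude
sign = flip_bit(weight, 31)           # value changes sign

print(low_mantissa, high_exponent, sign)
```

In this sketch, a flip in a low mantissa bit barely changes the value, whereas a flip in a high exponent bit changes it by dozens of orders of magnitude. This is one way to picture why some faults might be tolerated by an application while others call for detection and correction.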
Assumptions and expected outcomes.
COINS assumes critical AI applications that are suited to SIMD architectures (e.g. GPUs). We focus on systems that must meet timing and safety requirements. We also assume a restricted number of transient failures, such as bit flips, in memory hierarchy components or the GPU [5,6]. At the completion of the COINS project, the expected contributions will be twofold. First, we will propose hardware designs (e.g. new GPU or new memory hierarchy designs) suited to fault-tolerant AI applications. The project will benefit from the SPARROW AI accelerator design [7] proposed by the BSC, which can be implemented on COTS processors such as ARM [8]. COINS will also propose guidelines for the design of AI applications or methods to detect and mitigate faults. Hardware and software proposals will be co-designed to optimize GPU performance (e.g. through efficient task mapping as proposed by LS2N [9]) and to assess timing predictability (e.g. with the Cheddar tool [10] of the Lab-STICC). Second, the COINS proposals will be prototyped on the Lab-STICC CPER POMELOS platform. This platform and any measurements acquired will be made available as open source and open data.
References:
[1] Denis, G., Alary, D., Pasco, X., Pisot, N., Texier, D., & Toulza, S. (2020). From new space to big space: How a commercial space dream is becoming a reality. Acta Astronautica, 166, 431-443.
[2] Pignol, M. (2010, March). COTS-based applications in space avionics. In 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010) (pp. 1213-1219). IEEE.
[3] Fiethe, B., Michalik, H., Dierker, C., Osterloh, B., & Zhou, G. (2007, April). Reconfigurable system-on-chip data processing units for space imaging instruments. In 2007 Design, Automation & Test in Europe Conference & Exhibition (pp. 1-6). IEEE.
[4] Keller, A. M., & Wirthlin, M. J. (2016). Benefits of complementary SEU mitigation for the LEON3 soft processor on SRAM-based FPGAs. IEEE Transactions on Nuclear Science, 64(1), 519-528.
[5] Hodson, R. F., Chen, Y., Pandolf, J. E., Ling, K., Boomer, K. T., Green, C. M., … & Defrancis, M. A. (2020). Recommendations on use of commercial-off-the-shelf (COTS) electrical, electronic, and electromechanical (EEE) parts for NASA missions (No. NESC-RP-19-01490).
[6] Hogan, S. L. (2023). Expanding space design options using COTS. Aerospace Report No. ATR-2023-01935.
[7] Bonet, M. S., & Kosmidis, L. (2022, March). SPARROW: a low-cost hardware/software co-designed SIMD microarchitecture for AI operations in space processors. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1139-1142). IEEE.
[8] Furber, S. B. (2000). ARM system-on-chip architecture. Pearson Education.
[9] Zahaf, H. E., Olmedo, I. S., Singh, J., Capodieci, N., & Faucou, S. (2021, April). Contention-aware GPU partitioning and task-to-partition allocation for real-time workloads. In Proceedings of the 29th International Conference on Real-Time Networks and Systems (pp. 226-236).
[10] Singhoff, F., Plantec, A., Dissaux, P., & Legrand, J. (2009, November). Investigating the usability of real-time scheduling theory with the Cheddar project. Real-Time Systems, 43(3), 259-295. Springer.
[11] Bokil, H. (2020, April). COTS semiconductor components for the new space industry. In 2020 4th IEEE Electron Devices Technology & Manufacturing Conference (EDTM) (pp. 1-4). IEEE.