

Welcome!

## The 1st 128-bit RISC-V European Workshop

HiPEAC Workshop, Wednesday January 22nd, 2025, Barcelona

RISC-V Summit Europe 2025, 12-15th May, Paris, France

> The CfP is open! Deadline Friday February 7th, 2025, AOE. https://riscv-europe.org

Acknowledgments to « Maplurinum — Machinæ pluribus unum »: (Make) one machine out of many. <u>French gov. grant n° ANR-21-CE25-0016</u>

The Benagil team of INRIA and Institut Polytechnique de Paris hires a tenured assistant professor (young researcher). This is a system and distributed systems group at the frontier between hardware and software. Contact: gael.thomas@inria.fr.






#### Maplurinum — One Machine out of Many, or We had 64 bit, yes. What about second 64 bit?

<u>Mathieu Bacou<sup>1</sup></u>, Adam Chader<sup>1</sup>, Chandana Deshpande<sup>2</sup>, Christian Fabre<sup>3</sup>, César Fuguet<sup>6</sup>, Pierre Michaud<sup>4</sup>, Arthur Perais<sup>2</sup>, Frédéric Pétrot<sup>2</sup>, Gaël Thomas<sup>5</sup>, Jana Toljaga<sup>1</sup>, Eduardo Tomasi<sup>2,3</sup>

- <sup>1</sup> Samovar, Télécom SudParis, IMT, IP Paris
- <sup>2</sup> Université Grenoble Alpes, CNRS, Grenoble INP, TIMA
- <sup>3</sup> Université Grenoble Alpes, CEA, List
- <sup>4</sup> Inria, Université de Rennes, IRISA
- <sup>5</sup> Inria Saclay
- <sup>6</sup> Inria, Université Grenoble Alpes


"The 1st 128-bit RISC-V European Workshop", HiPEAC, Wednesday January 22nd, 2025, Barcelona.

#### What is RISC-V 128 bit anyway?

*Excerpt from pages 41-42 of Volume I: RISC-V Unprivileged ISA V20191213 [1]:*

> **Chapter 6: RV128I Base Integer Instruction Set, Version 1.7**
>
> "There is only one mistake that can be made in computer design that is difficult to recover from—not having enough address bits for memory addressing and memory management." Bell and Strecker, ISCA-3, 1976.
>
> This chapter describes RV128I, a variant of the RISC-V ISA supporting a flat 128-bit address space. The variant is a straightforward extrapolation of the existing RV32I and RV64I designs.
>
> The primary reason to extend integer register width is to support larger address spaces. It is not clear when a flat address space larger than 64 bits will be required. At the time of writing, the fastest supercomputer in the world as measured by the Top500 benchmark had over 1 PB of DRAM, and would require over 50 bits of address space if all the DRAM resided in a single address space. Some warehouse-scale computers already contain even larger quantities of DRAM, and new dense solid-state non-volatile memories and fast interconnect technologies might drive a demand for even larger memory spaces. Exascale systems research is targeting 100 PB memory systems, which occupy 57 bits of address space. At historic rates of growth, it is possible that greater than 64 bits of address space might be required before 2030.
>
> History suggests that whenever it becomes clear that more than 64 bits of address space is needed, architects will repeat intensive debates about alternatives to extending the address space, including segmentation, 96-bit address spaces, and software workarounds, until, finally, flat 128-bit address spaces will be adopted as the simplest and best solution.
>
> We have not frozen the RV128 spec at this time, as there might be need to evolve the design based on actual usage of 128-bit address spaces.
>
> RV128I builds upon RV64I in the same way RV64I builds upon RV32I, with integer registers extended to 128 bits (i.e., XLEN=128). Most integer computational instructions are unchanged as they are defined to operate on XLEN bits. The RV64I "\*W" integer instructions that operate on 32-bit values in the low bits of a register are retained but now sign extend their results from bit 31 to bit 127. A new set of "\*D" integer instructions are added that operate on 64-bit values held in the low bits of the 128-bit integer registers and sign extend their results from bit 63 to bit 127. The "\*D" instructions consume two major opcodes (OP-IMM-64 and OP-64) in the standard 32-bit encoding.
>
> To improve compatibility with RV64, in a reverse of how RV32 to RV64 was handled, we might change the decoding around to rename RV64I ADD as a 64-bit ADDD, and add a 128-bit ADDQ in what was previously the OP-64 major opcode (now renamed the OP-128 major opcode).
>
> Shifts by an immediate (SLLI/SRLI/SRAI) are now encoded using the low 7 bits of the I-immediate, and variable shifts (SLL/SRL/SRA) use the low 7 bits of the shift amount source register.
>
> A LDU (load double unsigned) instruction is added using the existing LOAD major opcode, along with new LQ and SQ instructions to load and store quadword values. SQ is added to the STORE major opcode, while LQ is added to the MISC-MEM major opcode.
>
> The floating-point instruction set is unchanged, although the 128-bit Q floating-point extension can now support FMV.X.Q and FMV.Q.X instructions, together with additional FCVT instructions to and from the T (128-bit) integer format.

Starts with a nice quote: "There is only one mistake that can be made in computer design that is difficult to recover from—not having enough address bits for memory addressing and memory management." Bell and Strecker, ISCA-3, 1976.

[1] A. Waterman and K. Asanović, "Chapter 6, RV128I Base Integer Instruction Set, Version 1.7," in The RISC-V Instruction Set Manual - Volume I: Unprivileged ISA, 20191213, The RISC-V Foundation, 2019. Available online at https://riscv.org/technical/specifications/
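The sign-extension and shift-amount rules quoted from the RV128I chapter can be made concrete with a small model. The following is only an illustrative sketch in Python (chosen because its integers are unbounded); the function names `addw`, `addd`, `add` and `sll` are labels picked for this example, and registers are modelled as unsigned 128-bit integers.

```python
XLEN = 128
MASK = (1 << XLEN) - 1  # all-ones 128-bit register value


def sign_extend(value: int, from_bits: int) -> int:
    """Sign-extend the low `from_bits` bits of `value` to XLEN bits.

    The result is returned as an unsigned XLEN-bit integer.
    """
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):      # sign bit of the narrow value is set
        value -= 1 << from_bits
    return value & MASK


def addw(rs1: int, rs2: int) -> int:
    """'*W' style add: 32-bit result, sign-extended from bit 31 to bit 127."""
    return sign_extend(rs1 + rs2, 32)


def addd(rs1: int, rs2: int) -> int:
    """'*D' style add: 64-bit result, sign-extended from bit 63 to bit 127."""
    return sign_extend(rs1 + rs2, 64)


def add(rs1: int, rs2: int) -> int:
    """Plain add operating on all XLEN bits."""
    return (rs1 + rs2) & MASK


def sll(rs1: int, rs2: int) -> int:
    """Variable shift left: only the low 7 bits of rs2 are used."""
    return (rs1 << (rs2 & 0x7F)) & MASK
```

For instance, `addw(0x7FFFFFFF, 1)` overflows the 32-bit range and yields a register whose upper 97 bits are all ones, whereas `addd` with the same operands simply returns `0x80000000`.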

#### This is not needed anytime soon! Why start so early?

Innovations like domain-specific hardware, enhanced security, open instruction sets, and agile chip development will lead the way. BY JOHN L. HENNESSY AND DAVID A. PATTERSON

DOI:10.1145/3282307

#### A New Golden Age for Computer Architecture

WE BEGAN OUR Turing Lecture June 4, 2018<sup>11</sup> with a review of computer architecture since the 1960s. In addition to that review, here, we highlight current challenges and identify future opportunities, projecting another golden age for the field of computer architecture in the next decade, much like the 1980s when we did the research that led to our award, delivering gains in cost, energy, and security, as well as performance.

"Those who cannot remember the past are condemned to repeat it." —George Santayana, 1905

Software talks to hardware through a vocabulary called an instruction set architecture (ISA). By the early 1960s, IBM had four incompatible lines of computers, each with its own ISA, software stack, I/O system, and market niche—targeting small business, large business, scientific, and real time, respectively. IBM engineers, including ACM A.M. Turing Award laureate Fred Brooks, Jr., thought they could create a single ISA that would efficiently unify all four of these ISA bases. They needed a technical solution for how computers as inexpensive as [...]

Communications of the ACM, February 2019, Vol. 62, No. 2, p. 48

Key insights:

- Software advances can inspire architecture innovation.
- Elevating the hardware/software interface creates opportunities for architecture innovation.
- The marketplace ultimately settles architecture debates.

J. L. Hennessy & D. A. Patterson, *Turing Award 2018*

https://iscaconf.org/isca2018/turing\_lecture.html

"People who are serious about software should make their own hardware." Alan Kay, Turing Award 2003, Apple Fellow



#### This presentation as a three-course light meal

# Some ideas about the 128 bit software stack

Back to basics!

Single/Unified view of a large 128 bit machine

1. Operating system
2. Compilation chain

Architecture & Microarchitecture

Just double everything?

(partly) address complexity with:

1. Clustering
2. Compression

## HW/SW Interface & Memory Management

How could the system architecture provide a 128 bit flat address space?

1. Better tailor the virtual memory system to HPC systems
2. Unified 128 bit address space



## Some ideas about the base software stack for future RISC-V 128 bit HPC machines with > 100 M cores

Christian FABRE (CEA LIST, Grenoble)

# Let's explain "Some ideas about the base software stack for future RISC-V 128 bit HPC machines with > 100 M cores"

- Some ideas about ...
  - This is barely the beginning... We are not there yet, far from it!
- ... the base software stack ...
  - The base software stack:
    - Operating system: kernel, commands and libraries. Forget about virtualization and other bleeding-edge topics for a moment.
    - Compilers: code parallelization, code generation, runtime support.
    - File system: who needs directories, files and data serialization, when you can have permanent pointers to data in byte-addressable NV-RAM?
  - Long story short: unroll/remove the multiple software layers that have been coalescing over decades, like OpenMP or MPI.
- ... future ... HPC machines ...
  - Not there before a decade or more.
  - HPC machines are closed systems with simple, heavy workloads, so they are easier to analyze and understand, which helps kick-start building something.
- ... with > 100 M cores.
  - We already have machines with 2-10 M cores in the Top 500.
- ... RISC-V ...
  - The machine will be made of RISC-V cores only: 100 k to 1 M interconnected clusters of 100-1000 RISC-V cores each.
  - RISC-V as a unifying force: different clusters will support different sets of RISC-V extensions.
- ... 128 bit ...
  - Such a large address space is a chance to get a Single System Image (SSI) view of a program itself (process view) and the operating system (the OS as the first abstraction of the machine hardware)
- The 128-ton elephant *not* in the room
  - Compute-intensive ISA extensions: crypto., vector, variable precision, matrices, etc.
  - The ecosystem is there already, alive and kicking both for basic software and hardware
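The cluster and core counts quoted above imply a range of machine sizes that is easy to check with a few lines of arithmetic. The sketch below only multiplies the figures from the slide; the helper name `index_bits` and the idea of counting identifier bits are additions made for illustration.

```python
# Figures taken from the slide: 100 k to 1 M clusters, 100 to 1000 cores each.
CLUSTERS = (100_000, 1_000_000)
CORES_PER_CLUSTER = (100, 1_000)


def total_cores_range() -> tuple[int, int]:
    """Smallest and largest total core count implied by the ranges above."""
    return (CLUSTERS[0] * CORES_PER_CLUSTER[0],
            CLUSTERS[1] * CORES_PER_CLUSTER[1])


def index_bits(count: int) -> int:
    """Number of bits needed to give each of `count` items a distinct index."""
    return (count - 1).bit_length()


low, high = total_cores_range()
cluster_bits = index_bits(CLUSTERS[1])
core_bits = index_bits(CORES_PER_CLUSTER[1])
print(f"{low:,} to {high:,} cores; "
      f"{cluster_bits} + {core_bits} bits identify a core in the largest machine")
```

The range spans 10 M to 1 G cores, so the "> 100 M cores" target sits in the upper part of it, and a core identifier for the largest configuration needs only 30 bits, a small fraction of a 128-bit address.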

#### **128 bit makes for a large address space!**

"Such a large address space is a chance to get a *Single System Image (SSI)* view of a program itself (process view) and the operating system (the OS as the first abstraction of the machine hardware)"

Opportunities to revisit the software stack and its basic concepts:

- Get rid of MPI? Replaced by cluster to cluster *virtual memory remapping* and memory transfers
- What exactly is a 128 bit *process* that would span 1 M shared memory clusters?
- What would be a *kernel* for such a machine?
- Ensure *virtual-to-physical memory mapping consistency* over that many cores
- Will we *still* be programming 100 cores with C code + pragmas?
- Why bother with a file system when you can store your data in non-volatile memory, and get a permanent pointer to it?





## Microarchitectural tricks to support 128-bit RISC-V without "128-bit everywhere"

Arthur PERAIS (CNRS)

- A naive approach to 128-bit hardware risks introducing area/power/latency penalties
- Larger tags/payloads
  - TLBs, branch target predictors (BTB & Co) => Compress tags<sup>1</sup>
- Wider datapath
  - Physical registers
  - Operand bypass
  - Functional units
- "HW Tax" to 128-bit support

• A simple example to illustrate the opportunity to reduce the "HW tax"

```c
int a[64], b[64], c[64];   /* declarations assumed; not shown on the slide */

int main() {
    for (int i = 0; i < 64; i++) {
        c[i] = a[i] + b[i] * 10;
    }
}
```

Compiled with gcc 13 -O1:

```asm
loop:
    lw    a1,0(a3)      // *b
    slliw a5,a1,2       // *b * 4
    addw  a5,a5,a1      // *b * 5
    slliw a5,a5,1       // *b * 10
    lw    a1,0(a4)      // *a
    addw  a5,a5,a1      // *a + *b * 10
    sw    a5,0(a2)      // *c = ...
    addi  a4,a4,4       // a++
    addi  a3,a3,4       // b++
    addi  a2,a2,4       // c++
    addi  a6,a6,1       // i++
    bne   a6,a0,loop
```

What actually needs to use 128-bit?

Address generation (highlighted in blue on the original slide)!
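Since the colour highlighting is lost in this text version, one way to recover which instructions take part in address generation is a small backward slice over the loop. The sketch below is an assumption about what "participates in AGEN" means (memory instructions plus every instruction that produces a register later used as a base address); the encoding of the loop as tuples and the function name `agen_instructions` are invented for this example, and the analysis is deliberately coarse because it tracks register names rather than individual values.

```python
# (mnemonic, destination register, source registers, base-address register)
LOOP = [
    ("lw",    "a1", [],           "a3"),
    ("slliw", "a5", ["a1"],       None),
    ("addw",  "a5", ["a5", "a1"], None),
    ("slliw", "a5", ["a5"],       None),
    ("lw",    "a1", [],           "a4"),
    ("addw",  "a5", ["a5", "a1"], None),
    ("sw",    None, ["a5"],       "a2"),
    ("addi",  "a4", ["a4"],       None),
    ("addi",  "a3", ["a3"],       None),
    ("addi",  "a2", ["a2"],       None),
    ("addi",  "a6", ["a6"],       None),
    ("bne",   None, ["a6", "a0"], None),
]


def agen_instructions(loop):
    """Indices of instructions that compute, or feed, a memory address."""
    # Start from the registers used directly as base addresses.
    addr_regs = {addr for _, _, _, addr in loop if addr}
    changed = True
    while changed:  # iterate to a fixed point across loop-carried dependencies
        changed = False
        for _, dest, srcs, _ in loop:
            if dest in addr_regs:
                for src in srcs:
                    if src not in addr_regs:
                        addr_regs.add(src)
                        changed = True
    return [i for i, (_, dest, _, addr) in enumerate(loop)
            if addr or dest in addr_regs]


agen = agen_instructions(LOOP)
print([LOOP[i][0] for i in agen], f"{len(agen)}/{len(LOOP)} instructions")
```

On this loop the sketch flags the two loads, the store and the three pointer increments, that is 6 of the 12 instructions; the multiply-by-ten chain, the loop counter and the branch only ever handle 32-bit data.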



- Fraction of retired uOps that participate in AGEN (SPEC 2k17) is around 55% on average.
- => Opportunity to save "something" for 45% of the retired uOps

- Assume only address generation actually requires 128-bit
- Implement a 128-bit block to deal with AGEN and occasional 128-bit compute
- Keep a 64-bit block to deal with the rest
- Works only if 64/128 is reasonably balanced
  - It is :)
    - in SPEC :|

- ADdress/Value (ADA) clustered microarchitecture<sup>1</sup>
  - Each cluster has its own resources
- Address cluster: 128b datapath
- Data cluster: 64b datapath
- "Dense" local bypass
- "Shallow" global bypass using explicit copy uOps



You may have caught this on Chandana Deshpande's poster at a coffee break

- One optimization enabled by separating addresses and data: PRF compression
- The address cluster PRF has few different upper bits; region-based compression<sup>1</sup> achieves a 40% storage reduction
- Performance on par with a monolithic 128-bit uarch, but with significant savings (fewer operand-ready broadcasts, less bypass, smaller PRFs)
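To give an intuition for why "few different upper bits" translates into storage savings, here is a toy model of a region-compressed register file. It is a sketch under made-up assumptions, not the design of the cited work: each register keeps its low 64 bits plus a small index into a shared table of upper-bit patterns, and the sizes (128 registers, 16 regions) as well as the class name `RegionCompressedPRF` are hypothetical.

```python
class RegionCompressedPRF:
    """Toy register file: low bits stored per register, upper bits shared."""

    def __init__(self, num_regs=128, low_bits=64, num_regions=16):
        self.low_bits = low_bits
        self.num_regions = num_regions
        self.regions = [0]                      # distinct upper-bit patterns
        self.entries = [(0, 0)] * num_regs      # (region index, low bits)

    def write(self, reg: int, value: int) -> None:
        upper = value >> self.low_bits
        low = value & ((1 << self.low_bits) - 1)
        if upper not in self.regions:
            if len(self.regions) == self.num_regions:
                # A real design would need a fallback path; this toy gives up.
                raise OverflowError("region table full")
            self.regions.append(upper)
        self.entries[reg] = (self.regions.index(upper), low)

    def read(self, reg: int) -> int:
        index, low = self.entries[reg]
        return (self.regions[index] << self.low_bits) | low

    def storage_bits(self) -> int:
        index_bits = (self.num_regions - 1).bit_length()
        per_register = self.low_bits + index_bits
        region_table = self.num_regions * (128 - self.low_bits)
        return len(self.entries) * per_register + region_table


prf = RegionCompressedPRF()
prf.write(3, (0xABCD << 64) | 42)
uncompressed_bits = 128 * 128
saving = 1 - prf.storage_bits() / uncompressed_bits
print(hex(prf.read(3)), f"storage saving with these toy parameters: {saving:.1%}")
```

The saving depends entirely on the chosen parameters, and the toy never reclaims region entries, so it should be read as an illustration of the principle only.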





## Hardware-Software Interface and SoC Integration to efficiently support 128 bit address spaces

César FUGUET (Inria)



#### Scope

- This work focuses on High-Performance Computing (HPC) systems
- Current trends indicate that memory requirements in such systems may exceed 2<sup>64</sup> bytes in 20 years.
- The transition to 128-bit addresses is an opportunity to review some old, well-established architecture mechanisms
  - Virtual addressing
  - Data addressing and orchestration on distributed machines

#### **Virtual Addressing: Page Size**

It is time to review the long-lasting 4K page size in processors

Address translation can be a performance bottleneck because of TLB misses and long page table walks



#### **Page-Size Exploration: Methodology**

- We conducted a study to see the impact of page size on:
  - performance (measured as TLB miss rate)
  - memory bloat (ratio of memory effectively used to memory allocated)

- The study considered different benchmarks: NPB, PARSEC, SPLASH3, SPECInt
- We used the QEMU simulator to redirect execution traces to a TLB simulator

*E. Tomasi, C. Fuguet, C. Fabre, F. Pétrot, "Page size exploration for RISC-V systems: the case for HPC", 35th International Workshop on Rapid System Prototyping* 

#### **Page-Size Exploration: Cost**

- We defined a simple cost function: a weighted mean of the performance and memory bloat as a function of the page size $p$.

$$J_{n,b}(p) = w \cdot mr_{n,b}(p) + (1-w) \cdot mb_{n,b}(p)$$

where $mr_{n,b}$ is the TLB miss rate and $mb_{n,b}$ is the memory bloat for benchmark $b$ and $n$ cores.

- The weight $0 < w < 1$ sets the relative importance of one criterion versus the other:
  - $w < 0.4$: embedded systems
  - $0.4 \le w \le 0.6$: general-purpose systems
  - $w > 0.6$: high-performance computing systems
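The cost function is simple enough to evaluate directly. The sketch below shows how a page size could be selected for a given weight; it assumes that both metrics have been normalized so that larger values are worse, and the input numbers are invented placeholders chosen only to exercise the code, not measurements from the study.

```python
def cost(w: float, miss_rate: float, bloat: float) -> float:
    """J(p) = w * mr(p) + (1 - w) * mb(p) for one page size."""
    return w * miss_rate + (1 - w) * bloat


def best_page_size(w: float, miss_rate_by_size: dict, bloat_by_size: dict) -> int:
    """Page size (in KB) that minimizes the weighted cost."""
    return min(miss_rate_by_size,
               key=lambda p: cost(w, miss_rate_by_size[p], bloat_by_size[p]))


# Placeholder inputs (NOT measured data): page size in KB -> normalized metric.
MISS_RATE = {4: 1.0, 16: 0.5, 64: 0.25}
BLOAT = {4: 0.1, 16: 0.3, 64: 0.9}

for w in (0.2, 0.8):
    print(f"w = {w}: best page size = {best_page_size(w, MISS_RATE, BLOAT)} KB")
```

With these placeholder values a low weight (bloat matters most) favours the smallest page and a high weight (miss rate matters most) favours the largest, which is the qualitative trade-off the weight is meant to capture.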


#### **Page-Size Exploration: Results**

- For both general-purpose systems and HPC systems, a page size of 32 KB fits best.
- For embedded systems, a smaller page size (e.g. 16 KB) gives the best result.





#### **Page-Size Exploration: Conclusions**

- Preliminary results indicate that a transition to 32 KB pages would be beneficial for future HPC systems
- Increasing the page size reduces TLB misses and allows shallower page table "walks" (hence reduces the TLB miss penalty)
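The effect of page size on walk depth can be checked with simple arithmetic. The sketch below assumes a radix page table in which every table node occupies exactly one page and each entry is 8 bytes, as in the RISC-V Sv39/Sv48/Sv57 schemes. The function name `walk_levels` is hypothetical.

```python
def walk_levels(va_bits, page_size, pte_size=8):
    """Number of radix page-table levels needed to translate va_bits of address.

    Assumes page_size and pte_size are powers of two and that each
    page-table node fills exactly one page.
    """
    offset_bits = page_size.bit_length() - 1              # bits of page offset
    index_bits = (page_size // pte_size).bit_length() - 1 # bits resolved per level
    vpn_bits = va_bits - offset_bits                      # bits left to translate
    return -(-vpn_bits // index_bits)                     # ceiling division

for va_bits in (39, 48, 57):
    print(f"{va_bits}-bit VA: "
          f"{walk_levels(va_bits, 4 * 1024)} levels with 4 KB pages, "
          f"{walk_levels(va_bits, 32 * 1024)} levels with 32 KB pages")
```

With 4 KB pages each level resolves 9 bits, which gives the familiar 3, 4 and 5 levels of Sv39, Sv48 and Sv57. With 32 KB pages each level resolves 12 bits, so a 48-bit virtual address needs 3 levels instead of 4.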



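#### But what about 128 bits?

*[Figure: layout of a 128-bit virtual address. Only fragments of the labels survive: fields `R` and `VPN[3]`, and bit positions 127, 95, 51, 50 and 12.]*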



- Our target: a flat address space for distributed machines (nodes)
- Namespace ID: virtual identifier of a node
- Local accesses are translated as usual by the local TLB
- Remote accesses can be forwarded to:
  - the remote node through the NIC
  - a local copy of the remote page
- The remote NIC performs virtual-to-physical translation using an IOMMU
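The access flow listed above can be sketched as a small routing function. The bit layout used here (upper 64 bits as Namespace ID, lower 64 bits as node-local virtual address) is purely an assumption for illustration and is not the layout proposed in the talk. The names `split_address` and `route_access`, the 32 KB page size and the `cached_pages` set are likewise hypothetical.

```python
LOCAL_BITS = 64                      # assumed width of the node-local part
LOCAL_MASK = (1 << LOCAL_BITS) - 1

def split_address(addr):
    """Split a 128-bit address into (namespace_id, local_address)."""
    if not 0 <= addr < (1 << 128):
        raise ValueError("address does not fit in 128 bits")
    return addr >> LOCAL_BITS, addr & LOCAL_MASK

def route_access(addr, local_namespace, cached_pages, page_size=32 * 1024):
    """Decide where an access is served, following the three cases on the slide."""
    namespace_id, local_addr = split_address(addr)
    if namespace_id == local_namespace:
        return "local TLB"                        # ordinary local translation
    if (namespace_id, local_addr // page_size) in cached_pages:
        return "local copy of the remote page"    # remote page replicated locally
    return "remote node through the NIC"          # translated remotely via the IOMMU

# Node 7 holds a local copy of page 2 of node 9.
cached = {(9, 2)}
print(route_access((7 << 64) | 0x1000, 7, cached))
print(route_access((9 << 64) | (2 * 32 * 1024 + 16), 7, cached))
print(route_access((9 << 64) | 0x10, 7, cached))
```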



# **Conclusion & Questions**

1. 128-bit pointers are more than "just more memory"
   - A thought experiment to rethink part of the stack
2. RISC-V 128 bit provides a tangible opportunity for hardware-software co-design
   - Some thoughts on the basic software stack
   - Some thoughts on the system
   - Some thoughts on micro-architecture
   - Some thoughts on virtual memory
   - Also, some thoughts on using 128-bit addresses differently: 2D addressing machines (<u>https://inria.hal.science/hal-04816363v1</u>, P. Michaud)
3. Though 128-bit machines are probably far away, such work will take time, so we are starting now!








#### **Elephant in the Room: 128-bit?**



#### The 128-ton elephant *not* in the room

- Compute-intensive ISA extensions: crypto., vector, variable precision, matrices, etc.
- The ecosystem is there already, alive and kicking both for basic software and hardware

#### For which machines?

- 128-bit addressing within a single socket or server is unlikely; the more plausible targets are:
  - Rack-scale computing
  - Supercomputer-scale machines
- Niche markets: filtering/firewalling of 128-bit IPv6 addresses

#### But this is also an opportunity to revisit the software stack

- File system: who needs directories, files and data serialization, when you can have permanent pointers to data in byte-addressable NV-RAM across the whole machine (machine to be defined :) )?
  - Similarly: Share pointers to in-memory objects in the VA space with other sockets, blades, etc.?
- Such a large address space is a chance to get a *Single System Image (SSI)* view of a program itself (process view) and the operating system (the OS as the first abstraction of the machine hardware)



## Some ideas about the base software stack for future RISC-V 128 bit HPC machines with > 100 M cores

Christian FABRE (CEA LIST, Grenoble)

#### **Overview**





#### HPC TOP 500 — Status & Trends

#### European machines in the TOP 500 as of November 2023:

- #5 HPE Cray, 2,752,704 cores, Finland
- #6 Bull, 1,824,768 cores, Italy
- #8 Bull, 680,960 cores, Spain

$\rightarrow$ Increasing parallelism and distribution

Meanwhile:

$\rightarrow$ Trend towards heterogeneity: GPUs, FPGAs, TPUs, variable precision FPUs...

**Hard to use efficiently, hard to program.**

| Systems                     | 2012 (BG/Q computer)       | 2022                    | Difference (today vs. 2022) |
|-----------------------------|----------------------------|-------------------------|-----------------------------|
| System peak                 | 20 Pflop/s                 | 1 Eflop/s               | O(100)                      |
| Power                       | 8.6 MW (2 Gflops/W)        | ~20 MW (50 Gflops/W)    |                             |
| System memory               | 1.6 PB (16*96*1024)        | 32 - 64 PB              | O(10)                       |
| Node performance            | 205 GF/s (16*1.6GHz*8)     | 1.2 or 15 TF/s          | O(10) - O(100)              |
| Node memory BW              | 42.6 GB/s                  | 2 - 4 TB/s              | O(1000)                     |
| Node concurrency            | 64 threads                 | O(1k) or 10k            | O(100) - O(1000)            |
| Total node interconnect BW  | 20 GB/s                    | 200 - 400 GB/s          | O(10)                       |
| System size (nodes)         | 98,304 (96*1024)           | O(100,000) or O(1M)     | O(100) - O(1000)            |
| Total concurrency           | 5.97 M                     | O(billion)              | O(1,000)                    |
| MTTI                        | 4 days                     | O(<1 day)               | - O(10)                     |



Source: J. Dongarra, Grenoble, Sep. 2019. Big thanks to Henri-Pierre Charles (CEA).

#### A RISC-V HPC machine by 2030: vision and rationale

At historic rates of growth, it is possible that greater than 64 bits of address space might be required before 2030.

Let's assume that a full RISC-V 128-bit HPC machine could have (wild guess) 100 × 10<sup>6</sup> cores, organised as 1,000,000 heterogeneous clusters of 100 cores each, with O(10 TB) of RAM per cluster.
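A back-of-the-envelope check of this guess, taking exactly 10 TB (decimal) per cluster as an assumption, since the slide only gives an order of magnitude:

```python
clusters = 1_000_000
cores_per_cluster = 100
ram_per_cluster = 10 * 10**12   # 10 TB in bytes (assumed value for O(10 TB))

total_cores = clusters * cores_per_cluster
total_ram = clusters * ram_per_cluster

# Address bits needed to give every byte of RAM a distinct flat address.
bits_needed = (total_ram - 1).bit_length()
fraction_of_64_bit_space = total_ram / 2**64

print(f"total cores: {total_cores:,}")
print(f"total RAM:   {total_ram:.1e} bytes")
print(f"address bits needed for RAM alone: {bits_needed}")
print(f"share of a 64-bit address space:   {fraction_of_64_bit_space:.0%}")
```

Under these assumptions the aggregate RAM is 10<sup>19</sup> bytes, which already requires 64 address bits and fills more than half of a 64-bit address space, before anything else is mapped into it.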

The challenge is how to take advantage of RISC-V and 128 bit to:

- Manage the heterogeneity of the machine
- Optimize and simplify the operating system stack
- Increase the performance in distributed computing

Do not beat around the bush: flat 128-bit address spaces will be adopted as the simplest and best solution.

"There is only one mistake that can be made in computer design that is difficult to recover from — not having enough address bits for memory addressing and memory management." (Bell and Strecker, ISCA-3, 1976)

The RV128 spec is not frozen at this time, as the design might need to evolve based on actual usage of 128-bit address spaces.