WP1 New light event readout board prototype
1.1 WP presentation
This task is about providing electric signals from the VUV photons (vacuum ultraviolet, 178 nm) generated during scintillation within the liquid xenon (LXe). The UV photons will be collected by off-the-shelf photodetectors. To detect the low charges (a few hundred pC) deposited on the photodetector by photoelectrons, and their timing, a self-triggered circuit is needed to collect and preprocess these data to ensure proper digital conversion. When a scintillation occurs, only a few photons reach the detector through the LXe. Since the light emission is isotropic, several detectors can output a signal; hence, we have to discriminate correctly between low charges to select the detector receiving the most photons, i.e. the highest charge.
Typically, the photodetector output charge is converted into a time pulse whose duration is proportional to the charge value; this technique is known as "Time Over Threshold" (TOT). Obtaining such a linear conversion is challenging for low charges, yet essential to improve image SNR [18]. The pulse duration is then digitized using an external clock. Delay-line architectures will be studied to determine the best compromise between complexity and accuracy. Finally, the external clock sets a time window during which the photodetector could randomly fire several times; we must ensure that the charge-to-time conversion can be completed within this time window. Also, considering the number of disintegrations expected, the data flow is tremendous: about 1×10^6 pulses/s.
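As a rough illustration, the linear charge-to-time conversion followed by clock-based digitization can be modeled in a few lines. The slope and clock period below are purely illustrative assumptions, not values from the actual design:

```python
# Toy model of TOT digitization (not the actual circuit parameters):
# assume an ideal linear charge-to-time conversion with slope K_NS_PER_PC,
# and an external clock that counts the pulse duration in ticks.

K_NS_PER_PC = 2.0       # illustrative conversion slope (ns per pC)
CLOCK_PERIOD_NS = 5.0   # illustrative external clock period (ns)

def tot_ticks(charge_pc: float) -> int:
    """Digitized pulse duration, in external-clock ticks, for a given charge."""
    pulse_ns = K_NS_PER_PC * charge_pc       # linear charge-to-time conversion
    return int(pulse_ns // CLOCK_PERIOD_NS)  # the external clock counts the ticks
```

With these illustrative values, a 100 pC charge yields a 200 ns pulse, i.e. 40 clock ticks; the challenge described above is keeping this relation linear when the charge is only a few pC.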
To locate where the detection took place, each photodetector will have its digitized output data tagged with the photodetector address, so that the Event Builder (WP2) receives light signals with the proper localisation information.
1.2 Results
Time Over Threshold is not suited when one needs to count a large number of photons, i.e. more than 10, as shown in Fig. 1.2. As the number of PEs increases, the shaper integrates the light pulses into a single analogue pulse, and the duration of the digital signal stemming from the threshold crossing no longer changes appreciably. Thus, this technique cannot count more than a dozen PEs at best. A better solution to correctly count more than 10 PEs is to generate several TOTs using several threshold voltages for the same analogue pulse and then add the TOTs up to get the corresponding number of PEs, Fig. 1.2.
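The idea of summing several TOTs can be sketched with a toy pulse model; the exponential shaper output, time constant and threshold ladder below are assumptions for illustration only. Summing the time spent above a ladder of thresholds approximates the area of the analogue pulse, which scales with the number of PEs, whereas a single threshold only sees the amplitude:

```python
import math

def pulse(t, n_pe, tau=50.0):
    """Toy shaper output: a decaying pulse whose amplitude is
    proportional to the number of photoelectrons (PEs)."""
    return n_pe * math.exp(-t / tau)

def tot(n_pe, threshold, dt=0.1, t_max=300.0):
    """Time spent above a single threshold (numerical toy model)."""
    steps = int(t_max / dt)
    return sum(dt for k in range(steps) if pulse(k * dt, n_pe) > threshold)

def mtot(n_pe, thresholds, dt=0.1):
    """Multi time-over-threshold: add up the TOTs of all thresholds."""
    return sum(tot(n_pe, v, dt) for v in thresholds)
```

In this model, doubling the number of PEs roughly doubles the MTOT value (provided the threshold ladder spans the pulse amplitude), while a single TOT grows only logarithmically with the amplitude.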
A new readout circuit implementing multi time-over-threshold has been designed, Fig. 1.3. The PCB format is fully compatible with the XEMIS2 prototype, hence testing the prototype in real conditions will be possible. Compared to the existing light readout circuit, this one has 6 different digitally tunable (12-bit) threshold voltages and increased amplifier and comparator bandwidths, resulting in better noise performance.
Figure 1.4 shows first measurements comparing the number of PEs obtained using only the first threshold (TOT1) and then four thresholds (MTOT).
WP2 Event builder
2.1 WP presentation
The light and charge information received from the sensors (WP1) must be analyzed to determine the instances where these two pieces of information are synchronous, i.e. both values are above a certain threshold within a certain small time range, indicating the occurrence of an actual phenomenon, called an 'event'. These events are, in turn, output to a computer (WP3). Therefore, these two values must first be sorted by their time of occurrence before events can be detected. In the previous version of the system (XEMIS-2), which was a smaller prototype, the sorting and event-building tasks were accomplished in software. As the new system under development (XEMIS-3) receives almost 256 LVDS channels for PUs (charge info) and 64 LVDS channels for PMs (light info), processing these channels in real time to obtain the events dictates a faster hardware-based processing, which defines the scope of WP2. Moreover, as the number of PM/PU channels could increase in the future, for XEMIS-3 or later versions, the hardware solution must be fast, efficient and flexible enough to be tuned to any speed requirement that the overall system demands. FPGAs have therefore been selected as the framework to provide this performance, due to their high speed, high flexibility and easy-to-use built-in LVDS receivers. The tasks included in WP2 can be listed as below:
- Task 1: Designing the overall system
- Task 2: Receiving & decoding of data from LVDS channels
- Task 3: Distribution of data between FPGAs (Ethernet interconnect)
- Task 4: Sorting Accelerator
- Task 4.1- Exploring the design-space and literature review
- Task 4.2- Designing algorithm & architecture
- Task 4.3- RTL implementation (VHDL)
- Task 5: Event-builder & Outputting
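The event definition above (a charge hit and a light hit coincident within a small time window) can be sketched as a simple two-pointer search over time-sorted hit lists. This is a minimal illustration of why sorting must precede event building, not the actual event-builder logic; the hit lists are assumed to be already threshold-filtered:

```python
def build_events(charge_hits, light_hits, window):
    """Pair charge and light hits whose times differ by at most `window`.
    Both inputs are lists of (time, value) tuples, already sorted by
    time -- the role of the sorting stage described in this WP."""
    events, j = [], 0
    for tc, _ in charge_hits:
        # advance the light pointer past hits that are too early
        while j < len(light_hits) and light_hits[j][0] < tc - window:
            j += 1
        if j < len(light_hits) and abs(light_hits[j][0] - tc) <= window:
            events.append((tc, light_hits[j][0]))  # an 'event': coincident pair
    return events
```

Because both lists are sorted, a single linear pass suffices; on unsorted streams the same search would require comparing every charge hit against every light hit.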
2.2 Results
2.2.1 Overall design
The architecture for implementing the scheme of Fig. 2.1 has been developed. According to this architecture, if the total number of PU channels is N_{PU} and the number of FPGAs is N_{FPGA}, a dataset of size n × N_{PU} is formed by receiving n samples from each channel. This dataset is given to the first FPGA for sorting and outputting. The second dataset is, in turn, given to the second FPGA, and so on. Following the same pattern, the (N_{FPGA} + 1)-th dataset is sent back to the first FPGA, which should have already finished processing the first dataset by that time. This means that each FPGA has a time budget of n × N_{PU} × N_{FPGA} cycles to receive, sort and output a dataset of size n × N_{PU}. In this summary, only the PU receiving and sorting is covered; the PM channels are treated similarly, but separately, up until the 'event-builder' block, where both PM and PU sorted results are processed together to construct events. Regarding practical aspects, FPGA boards at reasonable prices have about 40-50 LVDS receivers. Allocating 32 of them to PU channels and the rest to PM channels means that about 8 FPGAs are needed to accommodate all the existing PU channels, which number around 256. Therefore, N_{PU} = 256, N_{FPGA} = 8 and n = 16 are viable choices, simplifying the above expressions so that each FPGA must be able to receive, sort and output a dataset of size 4K within 32K clock cycles. As n is proportional to the time window of the dataset, higher values of n result in more accuracy, but at increased cost on behalf of the sorter, which is the bottleneck of the design and whose size and throughput are highly dependent on the dataset size.
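The dataset-size and time-budget arithmetic above can be checked directly, using the values chosen in the text:

```python
# Round-robin distribution of datasets across FPGAs, with the values
# chosen in the text (N_PU = 256, N_FPGA = 8, n = 16 samples/channel).

N_PU, N_FPGA, n = 256, 8, 16

dataset_size = n * N_PU                # samples in one dataset
time_budget = dataset_size * N_FPGA    # cycles until the round-robin returns

print(dataset_size, time_budget)       # 4096 ('4K') and 32768 ('32K')
```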
2.2.2 Decoding/Inputting (Task 2)
The block-level diagram of Fig. 2.1 depicts the overall data flow of WP2, in which data is delivered to the 'Event-Builder' after passing through the sorter. Since only the portion of each received packet that contains the time information (26 bits) is needed for sorting, the rest of the data is stored in RAM, to be fetched back and used by the event-builder when the sorted result is ready. This substantially reduces the data width and cost of the sorter, which is the cost/throughput bottleneck of the design. The hardware implementation for this part is currently ongoing.
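A software sketch of this key/payload split follows; the packet layout (time stamp in the low 26 bits, remaining fields above it) is an assumption made purely for illustration:

```python
def sort_by_time(packets, time_bits=26):
    """Toy key/payload split: only the narrow (time_key, index) pairs go
    through the sorter, then the full packets are fetched back in sorted
    order -- the fetch stands in for the RAM look-up at the event-builder."""
    mask = (1 << time_bits) - 1
    keys = sorted((pkt & mask, i) for i, pkt in enumerate(packets))  # narrow sort
    return [packets[i] for _, i in keys]                             # payload fetch
```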
2.2.3 Sorting accelerator (Task 4)
As discussed, the sorting accelerator must be able to input, sort, and output a dataset of size 4K within 32K clock cycles (multi-streaming of inputs/outputs is allowed, accelerating the I/O phase). For this purpose, a bitonic sorter with flexible dataset size has been developed in SystemVerilog. The synthesis results reveal that even for a dataset size of 128, the combinational area consumption exceeds the total available area of a Virtex-7 FPGA. As there must be free space in the FPGAs for other functionalities, e.g. Ethernet and the Event-Builder, and as the dataset size might need to be increased further (n > 32) for improved accuracy, these requirements demand a much more efficient sorting accelerator, which is currently the main area/throughput bottleneck of the design.
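For reference, a software model of the bitonic network is given below. Each compare-exchange in `_bitonic_merge` corresponds to one comparator in the RTL, and the number of such comparators grows as O(N log^2 N), which is why the combinational area blows up well before a dataset size of 4K:

```python
def bitonic_sort(data, ascending=True):
    """Recursive bitonic sorter (software model of the sorting network).
    The dataset size must be a power of two, as in the hardware network."""
    if len(data) <= 1:
        return list(data)
    half = len(data) // 2
    # sort the two halves in opposite directions to form a bitonic sequence
    bitonic = bitonic_sort(data[:half], True) + bitonic_sort(data[half:], False)
    return _bitonic_merge(bitonic, ascending)

def _bitonic_merge(data, ascending):
    if len(data) <= 1:
        return list(data)
    half = len(data) // 2
    d = list(data)
    for i in range(half):
        # one compare-exchange stage: each comparison is a comparator in RTL
        if (d[i] > d[i + half]) == ascending:
            d[i], d[i + half] = d[i + half], d[i]
    return (_bitonic_merge(d[:half], ascending)
            + _bitonic_merge(d[half:], ascending))
```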
2.2.3.1 Merge-sort
Generally, for sorting large datasets with minimal cost, a divide-and-conquer strategy is practiced: the dataset is divided into smaller subsets, each subset is sorted separately, and finally the results are merged to construct the final result. The problem with this highly popular approach, referred to as "merge-sort", is that the hardware for merging at the final stage is quite costly. We have proposed an alternative solution for implementing merge-sort, which is demonstrated in Fig. 2.2. The scheme utilizes a "pivot-finder", inspired by the quick-sort algorithm. In quick-sort, "pivots" are the elements of the dataset that divide it into equal non-overlapping subsets ('non-overlapping' means that the range of data in each subset is distinct from the others). The pivots extracted by the pivot-finder are used in the arranger to merge the two subsets. The use of the pivot-finder reduces the complex design of the merger to much simpler blocks (arranger, 1-merger) that are substantially less costly to implement in hardware.
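The scheme can be sketched in software as follows. Note that the pivot here is obtained with a full sort purely for brevity; it stands in for the proposed hardware pivot-finder, which obtains the same pivot without sorting:

```python
import heapq

def two_way_merge(x, y):
    """Streaming merge of two sorted runs (the simple '1-merger')."""
    return list(heapq.merge(x, y))

def pivot_merge(a, b):
    """Merge two sorted subsets via a median pivot: the pivot splits the
    combined data into two equal, non-overlapping halves (pivot-finder),
    elements are routed to their half (arranger), and each half is then
    merged independently by a cheap 1-merger."""
    pivot = sorted(a + b)[(len(a) + len(b)) // 2]  # stand-in for the pivot-finder
    low = two_way_merge([x for x in a if x < pivot],
                        [x for x in b if x < pivot])
    high = two_way_merge([x for x in a if x >= pivot],
                         [x for x in b if x >= pivot])
    return low + high
```

Because the two halves are non-overlapping, they never interact after the arranger, so each 1-merger only handles half the data and the final concatenation is free.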
2.2.3.2 Bucket-sort
The pivot-finder can also be utilized in the other popular sorting scheme for large datasets ("bucket-sort"), shown in Fig. 2.3. From a hardware perspective, an approximation of the pivot-finder is currently practiced in the literature (known as "samplesort"), which is susceptible to hazards due to its inaccuracy. The proposed pivot-finder can find the exact pivots in two passes over the dataset. The RTL development of the sorting accelerator is currently in progress.
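A toy model of bucket-sort driven by exact pivots is shown below. Again, the pivot-finder is emulated with a full sort purely for brevity, whereas the proposed hardware would find the same pivots in two passes over the data:

```python
def bucket_sort(data, n_buckets=4):
    """Bucket sort with exact pivots: the pivots split the dataset into
    equal, non-overlapping buckets, and each bucket is sorted on its own."""
    ranked = sorted(data)                   # stand-in for the two-pass pivot-finder
    step = len(data) // n_buckets
    pivots = [ranked[i * step] for i in range(1, n_buckets)]
    buckets = [[] for _ in range(n_buckets)]
    for x in data:
        i = sum(x >= p for p in pivots)     # index of the bucket x falls into
        buckets[i].append(x)
    return [x for bucket in buckets for x in sorted(bucket)]
```

With exact pivots the buckets are balanced by construction; samplesort's estimated pivots can instead produce an oversized bucket that overflows its hardware resources, which is the hazard mentioned above.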
2.2.4 Current progress
The current progress of WP2 in each task is summarized below:
Task 1: Designing the overall system – 90%
Task 2: Receiving & decoding of data from LVDS channels – 50%
Task 3: Distribution of data between FPGAs (Ethernet interconnect) – 0%
Task 4: Sorting Accelerator
Task 4.1- Exploring the design-space and literature review – 90%
Task 4.2- Designing algorithm & architecture – 80%
Task 4.3- RTL implementation (VHDL) – 10%
Task 5: Event-builder & Outputting – 25%
Task 6: Publications – 50% (Waiting for RTL results)
WP3 Acceleration of the reconstruction process for real time 3-γ imaging
3.1 WP presentation
Artificial intelligence has already demonstrated its potential within the field of medical image processing for tasks such as segmentation, denoising, super-resolution or classification. On the other hand, in the last couple of years, there has been increasing interest in the deployment of AI within the field of raw-data correction and the image reconstruction process. Potential advantages include real-time execution once the model is trained and the potential to directly correct for the physics of the detection process within the reconstruction task. The disintegration events from emitters used in 3-gamma imaging are processed one by one in order to determine the position of the third gamma emission. Within this context, analytical methods are time-consuming and lead to inaccurate models. Artificial intelligence approaches, coupled with realistic Monte Carlo simulations (MCS), have the potential to solve this problem. Indeed, by using, for each photon, its interaction positions in the liquid xenon and the associated energy information, we will train a convolutional (deep learning, DL) neural network capable of predicting the direction of the third gamma from the measurements provided by the Event Builder to be developed in WP2 above. The use of MCS will provide sufficient data for model learning. We have previously used similar approaches to predict the interaction position in monolithic PET detectors [21]. The geometry of a total-body clinical system based on the XEMIS detector technology will be considered in these simulations, in addition to the geometry of XEMIS2, which will be used to experimentally validate the developed algorithm during the integration work to be carried out in WP4. In the second phase of this work, a direct DL-based image reconstruction algorithm will be developed that will produce 3D reconstructed images using the acquired raw datasets and the previously determined third-gamma location information.
The performance of this algorithm, both in terms of precision and in terms of speed of execution, will be compared with traditional iterative reconstruction algorithms used in PET imaging and with a recently developed reconstruction algorithm for 3-gamma imaging developed by the LaTIM [13].
The work in this work package is closely related to the development of the Advanced Event Builder that will be provided in WP2. The output of the Event Builder will be used as the input of the reconstruction algorithm. The algorithm developed in this WP3 will be benchmarked for its performance using measured datasets within the context of the integrative WP4.
3.2 Results
References
[13] D. Giovagnoli et al., "A Pseudo-TOF Image Reconstruction Approach for Three-Gamma Small Animal Imaging," IEEE Transactions on Radiation and Plasma Medical Sciences, 2020, doi: 10.1109/TRPMS.2020.3046409.
[18] T. Orita et al., "The current mode Time-over-Threshold ASIC for a MPPC module in a TOF-PET system," Nuclear Instruments and Methods in Physics Research, Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, vol. 912, 2018, pp. 303-308.
[21] A. Iborra, D. Visvikis, et al., "Ensemble of Neural Networks for 3D Position Estimation in Monolithic PET Detectors," Physics in Medicine & Biology, vol. 64, no. 19, 2019, p. 195010.