Low-power systems are key for portable imaging devices and help bring them closer to patients. They allow these devices to be available in emergency vehicles, field hospitals and remote health care centers.

To keep power low, we must focus on efficient implementation of data processing in order to take advantage of the high compute performance per watt of the embedded processors widely used in these systems. The processing of data from capture to display is often complicated and requires heavy computation. There are three major elements to pay attention to when designing with these processors: input/output (I/O) bandwidth, memory bandwidth and compute need.

First, let’s focus on I/O bandwidth. Multicore processors come with high-speed I/O interfaces such as Gigabit Ethernet, PCI Express (PCIe) and Serial RapidIO (SRIO), as well as proprietary interfaces such as HyperLink. Even then, it may be necessary to pre-process the data to reduce its volume before it reaches the processor. For example, in medical ultrasound imaging, the conventional pre-processing step is beamforming, which combines the data from all the elements in the transducer into one set. An alternative partitioning of the system has also been demonstrated, in which demodulation is done in the analog front end (AFE) to reduce the I/O bandwidth between the AFE and the processing unit.

Conventional beamforming reduces the I/O throughput by combining the output of all the channels into one.
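To make the data reduction concrete, here is a minimal sketch of beamforming in C, assuming the common delay-and-sum form with integer sample delays. The function name, the channel-major buffer layout and the delay table are illustrative assumptions, not details taken from any particular ultrasound system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Delay-and-sum sketch: combine num_channels input streams into one line.
 * channels : num_channels * num_samples samples, channel-major (assumed layout)
 * delays   : per-channel delay in whole samples (assumed integer delays)
 * out      : num_samples beamformed samples
 */
void delay_and_sum(const int16_t *channels, const size_t *delays,
                   size_t num_channels, size_t num_samples, int32_t *out)
{
    assert(channels != NULL && delays != NULL && out != NULL);

    for (size_t n = 0; n < num_samples; n++) {
        int32_t acc = 0;
        for (size_t ch = 0; ch < num_channels; ch++) {
            size_t idx = n + delays[ch];
            /* Samples delayed past the end of the record contribute nothing. */
            if (idx < num_samples) {
                acc += channels[ch * num_samples + idx];
            }
        }
        out[n] = acc;
    }
}
```

Whatever the number of transducer channels, only one output stream leaves this stage, so the sample rate on the link to the processor drops by roughly a factor of num_channels.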

The second element is memory bandwidth. Moving data efficiently between on-chip and external memory, so as not to overwhelm the memory bandwidth, is a key aspect of embedded system implementation. The idea is simple: do as much processing as you can while the data stays in on-chip memory. This often requires repartitioning the processing tasks, the data or both. Let’s take a typical example of the processing tasks carried out on medical images before they are presented for display. The image first goes through some noise reduction technique, usually data-dependent spatial filtering; the edges are then enhanced, and finally the contrast is adjusted. One can perform each of these tasks on the whole image before moving on to the next task. However, a better way is to perform all of these tasks on a subset of the image that fits in on-chip memory. This significantly reduces the number of times data has to be transferred between levels of the memory hierarchy.
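The strip-wise approach can be sketched in C as follows. This is an illustration under stated assumptions: the three stage functions are simple placeholders for noise reduction, edge enhancement and contrast adjustment, the image width and strip height are arbitrary, and the two static buffers stand in for on-chip memory. The placeholder filters work along rows only, so strips need no overlap; a real two-dimensional filter would need a few extra rows above and below each strip.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { IMG_W = 64, STRIP_ROWS = 8 };   /* illustrative sizes */

static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Placeholder stage 1: 3-tap horizontal smoothing (stand-in for noise reduction). */
static void denoise_rows(const uint8_t *in, uint8_t *out, int rows)
{
    for (int r = 0; r < rows; r++) {
        const uint8_t *p = in + (size_t)r * IMG_W;
        uint8_t *q = out + (size_t)r * IMG_W;
        for (int c = 0; c < IMG_W; c++) {
            int left = p[c > 0 ? c - 1 : c];
            int right = p[c < IMG_W - 1 ? c + 1 : c];
            q[c] = (uint8_t)((left + 2 * p[c] + right + 2) / 4);
        }
    }
}

/* Placeholder stage 2: horizontal unsharp-style edge boost. */
static void enhance_rows(const uint8_t *in, uint8_t *out, int rows)
{
    for (int r = 0; r < rows; r++) {
        const uint8_t *p = in + (size_t)r * IMG_W;
        uint8_t *q = out + (size_t)r * IMG_W;
        for (int c = 0; c < IMG_W; c++) {
            int left = p[c > 0 ? c - 1 : c];
            int right = p[c < IMG_W - 1 ? c + 1 : c];
            q[c] = clamp_u8(2 * p[c] - (left + right) / 2);
        }
    }
}

/* Placeholder stage 3: contrast stretch about mid-gray, done in place. */
static void contrast_rows(uint8_t *buf, int rows)
{
    for (int i = 0; i < rows * IMG_W; i++) {
        buf[i] = clamp_u8(128 + ((int)buf[i] - 128) * 3 / 2);
    }
}

/* Run all three stages on one strip at a time: each pixel is read from
 * external memory once and written back once, instead of three times each. */
void process_image_in_strips(const uint8_t *ext_in, uint8_t *ext_out, int height)
{
    static uint8_t buf_a[STRIP_ROWS * IMG_W];   /* stand-ins for on-chip buffers */
    static uint8_t buf_b[STRIP_ROWS * IMG_W];

    assert(ext_in != NULL && ext_out != NULL && height > 0);

    for (int row = 0; row < height; row += STRIP_ROWS) {
        int rows = (height - row < STRIP_ROWS) ? height - row : STRIP_ROWS;
        size_t bytes = (size_t)rows * IMG_W;

        memcpy(buf_a, ext_in + (size_t)row * IMG_W, bytes);   /* single read  */
        denoise_rows(buf_a, buf_b, rows);
        enhance_rows(buf_b, buf_a, rows);
        contrast_rows(buf_a, rows);
        memcpy(ext_out + (size_t)row * IMG_W, buf_a, bytes);  /* single write */
    }
}
```

With three stages, the whole-image ordering moves the image across the external memory interface about six times (a read and a write per stage), while the strip-wise ordering moves it twice.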

The Direct Memory Access (DMA) capability of these processors allows data to move across the memory hierarchy and to and from I/O peripherals while the cores continue to perform computations. In the ideal case, data movement adds no overhead and the cores can spend all their time processing. The I/O and memory bandwidth utilization should therefore be designed so that the time spent computing on a block of data is at least as long as the time needed to move the next block in and the previous results out; the transfers are then hidden entirely behind the computation.
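A common way to get this overlap is double ("ping-pong") buffering, sketched below in C. The names dma_start_copy and dma_wait are hypothetical stand-ins, not a real driver API: here the copy finishes immediately via memcpy, whereas on real hardware it would proceed in the background while the core computes. The block size and the compute kernel are likewise illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK = 256 };   /* samples per block; illustrative */

/* Hypothetical stand-in for starting an asynchronous DMA transfer.
 * In this sketch the copy completes immediately. */
static void dma_start_copy(int16_t *dst, const int16_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Hypothetical stand-in: would block until the pending transfer is done. */
static void dma_wait(void)
{
}

/* Example compute kernel: sum of absolute values of one block. */
static int64_t process_block(const int16_t *buf, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (buf[i] < 0) ? -(int64_t)buf[i] : (int64_t)buf[i];
    }
    return acc;
}

/* Process num_blocks blocks from external memory using two on-chip buffers:
 * while the core works on one buffer, the next block is fetched into the other. */
int64_t process_stream(const int16_t *ext, size_t num_blocks)
{
    static int16_t ping_pong[2][BLOCK];   /* stand-in for on-chip memory */
    int64_t total = 0;

    assert(ext != NULL || num_blocks == 0);
    if (num_blocks == 0) {
        return 0;
    }

    dma_start_copy(ping_pong[0], ext, BLOCK);               /* prefetch block 0 */
    for (size_t b = 0; b < num_blocks; b++) {
        dma_wait();                                         /* block b has arrived */
        if (b + 1 < num_blocks) {                           /* fetch block b+1 meanwhile */
            dma_start_copy(ping_pong[(b + 1) & 1], ext + (b + 1) * BLOCK, BLOCK);
        }
        total += process_block(ping_pong[b & 1], BLOCK);    /* compute on block b */
    }
    return total;
}
```

If process_block takes at least as long as one transfer, dma_wait never stalls after the first block and the data movement costs the core nothing.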

The last element is how to optimize for the compute need of the application. Each processor is capable of a certain number of computations (e.g., multiplications and additions) per second. Ideally, you would use every computational unit on every cycle. In practice this is not feasible. For example, your algorithm may not have the right mix of multiplications and additions to keep both kinds of unit occupied at every cycle. The challenge is to come up with an implementation that gives the best utilization of the computational units available in the architecture.
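The effect of an unbalanced operation mix can be estimated with a back-of-the-envelope model, sketched here in C. The model is a deliberate simplification and an assumption of this example, not a description of any specific core: each cycle the core can issue up to mul_units multiplications and add_units additions, the two kinds of unit cannot substitute for each other, and dependencies and memory stalls are ignored.

```c
#include <assert.h>

/* Best-case fraction of issue slots used by a kernel with the given number
 * of multiplications and additions, under the simplified model above. */
double unit_utilization(unsigned long muls, unsigned long adds,
                        unsigned mul_units, unsigned add_units)
{
    assert(mul_units > 0 && add_units > 0);

    /* Each unit type needs at least ceil(ops / units) cycles. */
    unsigned long mul_cycles = (muls + mul_units - 1) / mul_units;
    unsigned long add_cycles = (adds + add_units - 1) / add_units;
    unsigned long cycles = (mul_cycles > add_cycles) ? mul_cycles : add_cycles;

    if (cycles == 0) {
        return 0.0;
    }
    return (double)(muls + adds) / ((double)cycles * (mul_units + add_units));
}
```

For instance, with two units of each kind, a kernel with 1000 multiplications and 1000 additions can in principle reach full utilization, while one with 1000 multiplications and 3000 additions is limited to about two thirds because the multipliers sit idle most of the time.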

The recent trend for increasing the computational capability of embedded processors while maintaining low power is to use multicore devices. There are two levels at which processing tasks are mapped onto such an architecture. First, the tasks need to be mapped onto the multiple cores. One way that works well for medical imaging, especially on homogeneous multicore processors, is the data-parallel model. The data set is divided into sub-groups, and each sub-group is processed independently by one core. This idea is illustrated below for Optical Coherence Tomography (OCT) imaging, demonstrated on the TMS320C6678 processor from Texas Instruments.

Different cores work in parallel on separate sections of the Optical Coherence Tomography (OCT) raw data. (The OCT image shown above is courtesy of Prof. Stephen Boppart of the University of Illinois Urbana-Champaign.)
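In the data-parallel model, each core only needs to know which slice of the data belongs to it. The following C sketch shows one way to compute that slice; the function name and the choice of splitting by lines are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Split num_lines items as evenly as possible among num_cores cores and
 * return the contiguous range [*first, *first + *count) owned by core_id. */
void core_slice(size_t num_lines, size_t num_cores, size_t core_id,
                size_t *first, size_t *count)
{
    assert(num_cores > 0 && core_id < num_cores);
    assert(first != NULL && count != NULL);

    size_t base = num_lines / num_cores;
    size_t extra = num_lines % num_cores;   /* the first 'extra' cores take one more line */

    *count = base + (core_id < extra ? 1 : 0);
    *first = core_id * base + (core_id < extra ? core_id : extra);
}
```

Each core would call core_slice with its own core number and then run the same processing code on its range, so no line is handled twice and the load differs by at most one line between cores.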

In the second level of mapping, the computations that each core performs are mapped onto the instructions available in the architecture. A good C compiler is essential for saving development time and achieving decent optimization. Most often the instructions are of the single instruction, multiple data (SIMD) type, which allows additions and multiplications to be performed on several data elements in one cycle. Most processing chains contain a compute-intensive loop that can be vectorized with these SIMD instructions. TI’s C compiler allows quick optimization of such loops through the use of intrinsics embedded in straight C code.
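The shape of such a loop can be sketched in portable C. The helper pair_mac below is a hypothetical stand-in for a two-way packed multiply-add; on a SIMD-capable target it would typically be replaced by a compiler intrinsic (on TI's C6000 compiler, `_dotp2` is a likely candidate, but that mapping is an assumption here and should be checked against the compiler documentation). The function names and the 16-bit data type are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a packed instruction that multiplies two pairs
 * of signed 16-bit values and adds the two products. */
static int32_t pair_mac(const int16_t *x, const int16_t *h)
{
    return (int32_t)x[0] * h[0] + (int32_t)x[1] * h[1];
}

/* Dot product of two 16-bit vectors, written so that the main loop body
 * consumes two elements per iteration, as a 2-way SIMD instruction would. */
int64_t dot_product_16(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;
    size_t i = 0;

    assert((x != NULL && h != NULL) || n == 0);

    for (; i + 1 < n; i += 2) {
        acc += pair_mac(x + i, h + i);
    }
    if (i < n) {                       /* leftover element when n is odd */
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}
```

Structuring the loop this way keeps the code readable as ordinary C while making it clear to both the reader and the compiler where the packed operation belongs.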

In conclusion, the embedded processors used in many medical systems today need to handle high levels of computational complexity in real time and at low power. The three elements discussed here (I/O bandwidth, memory bandwidth and compute need) are the key considerations for keeping power low and enabling portable imaging devices.