The Problem
The client’s legacy CPU-based beamformer could not keep up with the data rate of their new transducer array. Frame rates were dropping below 15 fps, making the system unsuitable for cardiac imaging applications. They needed a solution that could handle 4x the data throughput without changing the physical hardware footprint.
Our Approach
We conducted a 2-week assessment to profile the existing C++ codebase and identify parallelizable bottlenecks. Following the assessment, we executed a 6-week sprint to:
- Port critical delay-and-sum kernels to CUDA
- Implement zero-copy memory transfer to minimize latency
- Build a bit-exact validation suite against the CPU reference
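To illustrate what a bit-exact check against a CPU reference can look like, here is a minimal C++ sketch. It is not the client's code: the function names `delay_and_sum` and `bit_exact`, the channel-major sample layout, and the use of precomputed integer sample delays are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical CPU reference: delay-and-sum with precomputed integer delays.
// Assumed layout: rf[ch * samples + s] holds sample s of channel ch, and
// delays[ch * out_len + i] is the sample index channel ch contributes to output i.
std::vector<float> delay_and_sum(const std::vector<float>& rf,
                                 const std::vector<int>& delays,
                                 std::size_t channels,
                                 std::size_t samples,
                                 std::size_t out_len) {
    assert(rf.size() == channels * samples);
    assert(delays.size() == channels * out_len);

    std::vector<float> out(out_len, 0.0f);
    for (std::size_t i = 0; i < out_len; ++i) {
        float acc = 0.0f;
        // Fixed channel order: a kernel under test would need the same
        // summation order to reproduce these bits.
        for (std::size_t ch = 0; ch < channels; ++ch) {
            const int d = delays[ch * out_len + i];
            // Out-of-range delays contribute nothing.
            if (d >= 0 && static_cast<std::size_t>(d) < samples) {
                acc += rf[ch * samples + static_cast<std::size_t>(d)];
            }
        }
        out[i] = acc;
    }
    return out;
}

// Compares raw bytes rather than values, so 0.0f and -0.0f count as a mismatch.
bool bit_exact(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) return false;
    if (a.empty()) return true;
    return std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0;
}
```

In a setup like this, the GPU output would be copied back to the host and passed to `bit_exact` alongside the reference result. Because floating-point addition is not associative, a byte-for-byte match generally requires the GPU kernel to accumulate channels in the same order as the reference.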
Results
The optimized GPU pipeline not only met the performance requirements but also freed up CPU resources for post-processing and UI rendering. The solution was delivered with full documentation and a regression test suite.
Facing a similar challenge?
Let's discuss how we can help you achieve comparable performance gains.
Request an assessment