Now
Experience
Projects
Three matrix multiplication kernels written from scratch (naive, shared-memory tiled, and double-buffered vectorized), integrated into PyTorch via C++ extensions and benchmarked against cuDNN. Profiled with Nsight Compute to isolate memory-bandwidth versus compute bottlenecks at each step.
Vision-based system for the Anduril AI Grand Prix. Single FPV camera pipeline for gate detection without GPS. Real-time control loop over MAVLink/UDP at 50–120 Hz under strict compute constraints (~100 TOPS).
Skills
Contact