Bryan Anenberg

anenbergb [at] gmail [dot] com

I’m a Deep Learning Research Engineer with over seven years of experience building and deploying machine learning systems at scale across both industry and research environments. My core expertise is real-time computer vision, with a focus on designing, optimizing, and shipping neural networks for production applications. I have a strong track record of driving projects from concept to deployment, bridging research, scalable training, and real-world system integration.

At Anduril Industries, I developed deep learning object detectors for ground surveillance and counter-UAS platforms such as Sentry Tower and RoadRunner, and built a modular C++ vision library for real-time multi-object tracking. I also automated the full ML workflow—from data ingestion and annotation to distributed training and continuous evaluation—enabling rapid iteration and scalable deployment on embedded systems via TensorRT. Earlier, at Meta Platforms (Oculus), I worked on audio-visual expression tracking and avatar animation for virtual reality.

Beyond applied research, I enjoy staying on the cutting edge of AI through hands-on model reimplementation and experimentation. My recent projects span LLMs, diffusion models, object detection transformers (DETR), and neural network optimization, highlighted by a project optimizing the state-of-the-art Co-DETR (Collaborative Detection Transformer) via a custom C++ TensorRT plugin. At heart, I’m excited by the challenge of pushing modern AI models toward greater efficiency, scalability, and real-world impact.

In addition to my product-driven work in industry, I’m committed to continuous learning and broadening my expertise across domains. Recently, I explored medical image processing and radiology in collaboration with Prof. Peter Chang at UC Irvine, where I independently developed a two-stage 3D CNN for atrial segmentation in LGE-MRI — earning 3rd place in the MICCAI 2024 STACOM challenge.

I hold an M.S. in Computer Science with a specialization in AI from Stanford University, where I conducted computer vision research under Prof. Silvio Savarese. I also earned a B.S. in Engineering Physics with a minor in Mathematics from Stanford.

Alongside my work in AI, I’ve developed a growing interest in human health and longevity, which has led me to pursue supplemental coursework in biology, chemistry, and biochemistry through UCLA Extension. To explore this interest more directly, I volunteered at Hoag Hospital in Newport Beach, where I observed real-time imaging technologies in action — from catheter-based interventions in interventional radiology and cardiology to robotic-assisted surgeries with the da Vinci system.


AI Projects

Computer Vision

2025 Object Detection (DETR) GitHub
From-scratch PyTorch implementation of Detection Transformer (DETR), including sinusoidal positional encoding, multi-head attention, bipartite matching loss, and a training pipeline tailored for efficient convergence within 100 epochs.
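To illustrate the bipartite matching at the heart of DETR’s set-based loss, here is a minimal sketch of the Hungarian assignment between predictions and ground-truth boxes (simplified from the repo: class-probability and L1 box costs only, an assumed box-cost weight of 5.0, and no GIoU term):

import torch
from scipy.optimize import linear_sum_assignment

def match_predictions_to_targets(pred_logits, pred_boxes, tgt_labels, tgt_boxes):
    # pred_logits: (num_queries, num_classes); pred_boxes: (num_queries, 4) in cxcywh
    # tgt_labels: (num_targets,); tgt_boxes: (num_targets, 4) in cxcywh
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, tgt_labels]                     # favor queries confident in the target class
    cost_bbox = torch.cdist(pred_boxes, tgt_boxes, p=1)   # pairwise L1 distance between boxes
    cost = cost_class + 5.0 * cost_bbox                   # weighted total matching cost
    query_idx, target_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return query_idx, target_idx                          # one-to-one matches used by the loss
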
2025 Object Detection (YOLO) GitHub
From-scratch PyTorch implementation of YOLOv3 with a ResNeXt backbone, featuring multi-scale anchor-based detection and distributed mixed-precision training with Hugging Face Accelerate.
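The detection head follows the standard YOLOv3 box parameterization, decoding per-cell offsets against anchor priors at each scale; roughly (an illustrative sketch, not the exact repo code):

import torch

def decode_yolo_boxes(raw, anchors, stride):
    # raw: (batch, num_anchors, H, W, 4) holding (tx, ty, tw, th) per grid cell
    # anchors: (num_anchors, 2) prior widths/heights in pixels; stride: cell size in pixels
    tx, ty, tw, th = raw.unbind(-1)
    gy, gx = torch.meshgrid(torch.arange(raw.shape[2]), torch.arange(raw.shape[3]), indexing="ij")
    gx, gy = gx.to(raw.device), gy.to(raw.device)
    cx = (gx + tx.sigmoid()) * stride                     # box center x in pixels
    cy = (gy + ty.sigmoid()) * stride                     # box center y in pixels
    w = anchors[:, 0].view(1, -1, 1, 1) * tw.exp()        # width scaled from the anchor prior
    h = anchors[:, 1].view(1, -1, 1, 1) * th.exp()        # height scaled from the anchor prior
    return torch.stack([cx, cy, w, h], dim=-1)
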
2025 Image Classification (ResNeXt) GitHub
From-scratch PyTorch implementation of ResNeXt for image classification using grouped convolutions and residual blocks.
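The core building block is a bottleneck with a grouped 3x3 convolution, where the number of groups is the “cardinality”; a minimal sketch (shortcut downsampling and stride handling omitted):

import torch.nn as nn

class ResNeXtBlock(nn.Module):
    # Bottleneck residual block: 1x1 reduce -> grouped 3x3 -> 1x1 expand, plus identity shortcut.
    def __init__(self, channels, bottleneck_width=128, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck_width, 1, bias=False),
            nn.BatchNorm2d(bottleneck_width), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_width, bottleneck_width, 3, padding=1, groups=groups, bias=False),
            nn.BatchNorm2d(bottleneck_width), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))
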
2024 3D Image Segmentation (3D U-Net CNN for MRI Medical Imaging) Technical Report GitHub

Won 3rd place in the Multi-class Bi-Atrial Segmentation Challenge at MICCAI 2024 (STACOM workshop) by developing a two-stage cascaded 3D CNN for segmenting atrial structures in late gadolinium enhancement (LGE) MRI volumes, achieving Dice scores up to 0.931, minimizing Hausdorff distance, and earning inclusion in the official benchmarking study paper. The task required separately segmenting the atrial cavities and the thin-walled atrial tissue, a particularly difficult subtask due to anatomical ambiguity and voxel-level sensitivity.

Conducted a comprehensive ablation study exploring over a dozen model, training, and inference configurations to optimize 3D atrial segmentation from LGE-MRI, including systematic variations in architecture (nnU-Net ResEnc vs. MedNeXt), resolution, patch sampling, receptive field size, loss functions, and data augmentation; validated the two-stage cascade design, refined post-processing, and implemented a 50% faster inference strategy to meet competition runtime constraints.
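For context, a minimal sketch of a soft multi-class Dice loss of the kind used in nnU-Net-style training, one of the loss formulations explored (illustrative only, not the competition code):

import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1e-5):
    # logits: (batch, num_classes, D, H, W); target: (batch, D, H, W) integer voxel labels
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 4, 1, 2, 3).float()
    dims = (0, 2, 3, 4)                                   # reduce over batch and spatial axes
    intersection = (probs * onehot).sum(dims)
    denom = probs.sum(dims) + onehot.sum(dims)
    dice = (2 * intersection + eps) / (denom + eps)       # per-class soft Dice coefficient
    return 1 - dice.mean()
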

2023 3D Image Classification (3D U-Net CNN for MRI Medical Imaging) GitHub

Developed a 3D CNN with deep supervision to classify glioblastoma MGMT promoter methylation status, enabling non-invasive prediction of a key chemotherapy response biomarker, by training on 4-channel multimodal MRI data (T1, T1Gd, T2, FLAIR) and optimizing with patch-based sampling, tumor masking, and heavy data augmentation to mitigate overfitting on a small, heterogeneous dataset (~500 patients).

Improved model interpretability and training efficiency on high-dimensional 3D medical data, by building custom tools to visualize MRI slices, tumor segmentations, and prediction metadata, and by implementing efficient 3D data loading and augmentation pipelines using MONAI, TorchIO, and PyTorch Lightning under memory and resolution constraints.

2020 - 2024 Object Detection in Production Anduril Industries

Led development of deep learning object detectors for Sentry Tower and RoadRunner, delivering production-grade models for ground surveillance and aerial threat interception in counter-UAS applications, with continual upgrades driven by systematic architectural and data-driven experimentation.

Improved model robustness and reduced false positives by identifying failure modes in field environments, coordinating test site validation events, and applying targeted data augmentations and pipeline changes.

2016 Metric Learning for Fashion Photography Technical Report
Developed a deep learning system that learns compact visual embeddings for real-world fashion images using weakly labeled data from social media. The model is trained with a triplet loss to capture style similarity based on shared attributes like garments and colors, without requiring explicit style labels. The resulting embedding generalizes well to downstream tasks such as fashion style classification.
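The objective is a triplet margin loss over (anchor, positive, negative) embeddings, where positives share attributes such as garments and colors; a minimal sketch (the margin value and L2 normalization are illustrative assumptions):

import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Each argument is a batch of embeddings; L2 normalization makes distances comparable.
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_pos = (anchor - positive).pow(2).sum(-1)    # distance to an image with shared attributes
    d_neg = (anchor - negative).pow(2).sum(-1)    # distance to an unrelated image
    return F.relu(d_pos - d_neg + margin).mean()  # push negatives a margin farther away than positives
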
2015 Deep Reinforcement Learning Technical Report
Investigated alternative sampling strategies for experience replay in Deep Q-Learning, implementing stratified, recency, reward-based, and low-discrepancy methods in an Atari 2600 game-playing agent. Experiments showed that stratified sampling consistently improved training stability and policy performance, achieving higher average rewards and Q-values than uniform sampling in games like Breakout and Kangaroo, likely due to its enhanced temporal diversity in sampled experiences.
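The stratified variant partitions the replay buffer by recency and draws an equal share of transitions from each partition, which is what increases the temporal diversity of each batch over uniform sampling; a minimal sketch (the number of strata is an illustrative choice):

import random

def stratified_sample(replay_buffer, batch_size, num_strata=4):
    # replay_buffer is an ordered list of transitions, oldest first.
    stratum_len = len(replay_buffer) // num_strata
    per_stratum = batch_size // num_strata
    batch = []
    for s in range(num_strata):
        segment = replay_buffer[s * stratum_len:(s + 1) * stratum_len]
        batch.extend(random.sample(segment, per_stratum))  # uniform draw within the stratum
    return batch                                           # every batch mixes old and recent experience
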
2015 Forecasting Social Navigation in Crowded Complex Scenes arXiv paper
Designed and built scalable infrastructure to preprocess and annotate over 200 hours of aerial video via Amazon Mechanical Turk, enabling the creation of a large-scale dataset for studying human navigation in crowded outdoor environments. This supported research on trajectory forecasting, including a new model that predicts motion by accounting for interactions between different types of agents (e.g., pedestrians, cyclists) and their varying sensitivity to social cues.
[Dataset Webpage]
2015 Image Segmentation (GrabCut) Technical Report
Implemented the GrabCut algorithm for foreground-background segmentation and conducted three experiments to improve segmentation quality: varying the number of Gaussian Mixture Model (GMM) components, reinitializing GMM parameters mid-optimization, and restricting background GMMs to pixels within the bounding box. Results showed that increasing GMM components improved segmentation on certain images, while reinitialization stabilized convergence. The best performance—96.94% accuracy and 87.71% Jaccard similarity—was achieved by constraining the background model spatially, enabling better capture of local color distributions near the object. This project was completed as part of CS231B: The Cutting Edge of Computer Vision at Stanford University.
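GrabCut’s unary terms come from two Gaussian mixture color models, one for foreground and one for background, and the best-performing variant simply restricts the background model to pixels inside the bounding box. A minimal sketch of that spatially constrained fitting using scikit-learn (fit_color_models is a hypothetical helper, not the report’s code):

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_models(image, fg_mask, bbox, n_components=5):
    # image: (H, W, 3) RGB array; fg_mask: boolean foreground estimate; bbox: (x0, y0, x1, y1)
    x0, y0, x1, y1 = bbox
    inside = np.zeros(fg_mask.shape, dtype=bool)
    inside[y0:y1, x0:x1] = True
    fg_gmm = GaussianMixture(n_components).fit(image[fg_mask & inside])
    bg_gmm = GaussianMixture(n_components).fit(image[~fg_mask & inside])  # background restricted to the box
    flat = image.reshape(-1, 3).astype(float)
    # Unary costs for the graph cut: negative log-likelihood under each color model.
    return -fg_gmm.score_samples(flat), -bg_gmm.score_samples(flat)
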
2015 Object Detection (R-CNN) Technical Report
Implemented the R-CNN object detection pipeline, combining selective search for region proposals, CNN-based feature extraction, SVM classification, and bounding box refinement via L2-regularized linear regression. This project was completed as part of CS231B: The Cutting Edge of Computer Vision at Stanford University.

Computer Vision for Video and Tracking

2023 Video Object Detection Anduril Industries
Explored video object detection by extending YOLO-style single-stage detectors to process multiple consecutive frames, integrate auxiliary inputs like optical flow, and incorporate a custom transformer module with attention across spatiotemporal feature maps. Conducted extensive ablation studies showing improved recall in detecting small, fast-moving aerial targets such as drones in cluttered environments.
2022 Optical Flow (FastFlowNet, Self-supervised) Anduril Industries
Demonstrated feasibility of real-time optical flow-assisted detection by adapting FastFlowNet for self-supervised training (using pixel-wise reconstruction loss instead of ground truth flow) and deploying it on NVIDIA Jetson AGX Xavier hardware via a custom CUDA plugin integrated into the C++ tracking engine; tested live on cUAS towers at test ranges.
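Self-supervision replaces ground-truth flow with a photometric reconstruction objective: the next frame is warped back onto the current one using the predicted flow and compared pixel-wise; a minimal sketch (occlusion masking and smoothness regularization omitted):

import torch
import torch.nn.functional as F

def photometric_loss(frame1, frame2, flow):
    # frame1, frame2: (B, 3, H, W); flow: (B, 2, H, W) predicted forward flow in pixels
    b, _, h, w = frame1.shape
    gy, gx = torch.meshgrid(torch.arange(h, device=flow.device),
                            torch.arange(w, device=flow.device), indexing="ij")
    coords = torch.stack((gx, gy)).float().unsqueeze(0) + flow       # where each pixel moved to
    grid = torch.stack((2 * coords[:, 0] / (w - 1) - 1,              # normalize to [-1, 1] for grid_sample
                        2 * coords[:, 1] / (h - 1) - 1), dim=-1)
    warped = F.grid_sample(frame2, grid, align_corners=True)         # frame2 warped back onto frame1
    return (warped - frame1).abs().mean()                            # L1 reconstruction loss
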
2021 End-to-End Multi-Object Tracking Anduril Industries
Experimented with deep learning-based multi-object tracking models, including SiamMOT and CenterTrack, to track aerial targets such as drones for counter-UAS applications, training on internal datasets and benchmarking against single-frame object detection baselines like Faster R-CNN.
2020 Real-time Multi-Object Tracking in C++ Anduril Industries
Improved real-time multi-object tracking accuracy for ground surveillance and counter-drone (UAS) operations on platforms such as Sentry Tower and Dust by replacing fragmented, bespoke integration code with a modular, reusable, testable C++ computer vision library that combined a TensorRT-compiled deep learning detector with asynchronous Kalman filtering and sparse optical flow tracking, validated in live military field exercises and deployed in production.
2018 Face Tracking in VR Meta Platforms - AR/VR Oculus
Developed core face tracking technology that contributed to Meta Quest Pro’s Face Tracking API (released after my tenure at Meta), enabling naturalistic avatar expressions in social VR, by implementing keypoint tracking algorithms that map facial movements to FACS-based blendshape activations in real time.
[Face Tracking SDK]
2015 Video Activity Recognition (CNN) Technical Report GitHub
Proposed a two-stage algorithm for activity recognition in temporally untrimmed videos by first localizing potential activity segments and then classifying them with a two-stream convolutional neural network. Video localization was performed using a shot detection heuristic to segment the video into shots, followed by a tubelet-based heuristic to extract a clip likely containing an activity. Activity classification was then performed using a two-stream CNN, with one stream processing RGB frames and the other processing optical flow computed across 30-frame intervals. Experiments on the UCF-101 dataset showed that the RGB frame-based model outperformed the optical flow stream, likely because static visual cues in individual frames were often sufficient for activity recognition, whereas the optical flow model received only a single frame of motion information, limiting its temporal context. Additionally, uncorrected camera motion may have degraded the quality of the optical flow. This project was completed as part of CS231N: Convolutional Neural Networks for Visual Recognition at Stanford University.
2015 Online Single Target Tracker Technical Report
Implemented the TLD (Tracking-Learning-Detection) algorithm for robust online object tracking and evaluated several extensions to improve accuracy and adaptability. The TLD framework combines optical flow tracking, a patch-based detector, and an online learning component that updates the object model to handle appearance changes across frames. The detector uses an ensemble of fern classifiers, each computing binary hashes over image patches and using nearest neighbor matching to determine whether a patch corresponds to the tracked object. Experiments included varying the number and size of ferns, replacing the fern ensemble with a linear SVM, and using Histogram of Oriented Gradients (HOG) features for patch representation. Results showed that HOG features and SVM-based detection improved robustness under occlusion and motion. This project was completed as part of CS231B: The Cutting Edge of Computer Vision at Stanford University.
2014 Video Activity Recognition (handcrafted feature-based) Technical Report GitHub
Developed an activity recognition algorithm for videos using hand-crafted feature representations and a one-vs-rest linear SVM for final classification on the UCF-101 dataset. Improved Dense Trajectory Features (IDTF) were extracted by sampling feature points, tracking them across frames using optical flow, and computing descriptors—Trajectory, HOG (Histogram of Oriented Gradients), HOF (Histogram of Optical Flow), and MBH (Motion Boundary Histogram)—along the resulting trajectories to capture motion and appearance cues. The hundreds of IDTF descriptors per video were aggregated into a compact representation using Fisher Vectors, computed with respect to a Gaussian Mixture Model (GMM) fitted to the descriptor distribution across the dataset. Dimensionality of the Fisher Vectors was reduced using Principal Component Analysis (PCA) before SVM classification. This approach achieved up to 79.34% mean average precision, significantly outperforming static-frame baselines. This project was completed as part of CS221: Artificial Intelligence at Stanford University.

Optimization & Inference

2025 Optimizing Co-DETR for TensorRT Inference GitHub
Achieved 4.3× faster inference (346ms → 79.5ms) for state-of-the-art Co-DETR (Collaborative Detection Transformer) on NVIDIA RTX 4090 by compiling the model from PyTorch to TensorRT using torch.export and TorchDynamo, and implementing a custom C++ TensorRT plugin for multi-scale deformable attention to enable full-model compilation and FP16 optimization.
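The compilation path captures the whole detector with torch.export and lowers it through Torch-TensorRT’s Dynamo frontend with FP16 enabled; the flow looks roughly like this (a simplified sketch: load_codetr is a hypothetical loader, plugin registration is omitted, and the exact API surface varies across Torch-TensorRT versions):

import torch
import torch_tensorrt

model = load_codetr().cuda().eval()                       # hypothetical loader for the Co-DETR detector
example_input = torch.randn(1, 3, 768, 1280, device="cuda")

exported = torch.export.export(model, (example_input,))  # capture a full graph with no Python fallback
trt_model = torch_tensorrt.dynamo.compile(
    exported,
    inputs=[example_input],
    enabled_precisions={torch.float16},                   # allow FP16 kernels during engine build
)
outputs = trt_model(example_input)
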
2020 - 2024 Real-time Vision Model Acceleration on Embedded Systems with TensorRT and C++ Anduril Industries
Implemented and refactored deep learning object detection models for efficient C++ inference using TensorRT, including adapting models for ONNX export, developing custom TensorRT plugins and CUDA kernels, and benchmarking performance on embedded devices like the NVIDIA Jetson AGX Xavier.

Generative AI

2025 Latent Diffusion GitHub
Work in progress...
2025 Stable Diffusion Finetuning Experimentation via Hugging Face GitHub
Work in progress... exploring DreamBooth, LoRA, and ControlNet.
2019 Facial Video Compression (Neural Network–based Codec) [Patent: US11734952B1] Meta Platforms - AR/VR Oculus
Co-invented a neural-network-based facial video compression technique, reducing bandwidth needs while maintaining visual fidelity, by reconstructing facial video frames from transmitted keypoints and partial image data using a generative model.
2018 Photorealistic Facial Video Synthesis (CycleGAN) Meta Platforms - AR/VR Oculus
Trained a CycleGAN to generate photorealistic facial video by transferring lighting and texture from real headset footage onto rendered 2D facial rig views, creating realistic synthetic training data for neural networks that predict 3D facial blendshape coefficients from Oculus VR face tracking cameras.
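The unpaired translation objective pairs adversarial losses with a cycle-consistency term that preserves facial geometry while lighting and texture are transferred; the cycle term is roughly as follows (the weight of 10.0 is the common default, assumed here):

def cycle_consistency_loss(G_render_to_real, G_real_to_render, render_batch, real_batch, weight=10.0):
    # G_render_to_real maps rendered rig views toward real headset imagery; the other generator inverts it.
    recon_render = G_real_to_render(G_render_to_real(render_batch))
    recon_real = G_render_to_real(G_real_to_render(real_batch))
    # L1 penalty on round-trip reconstructions keeps content intact while style transfers.
    return weight * ((recon_render - render_batch).abs().mean() +
                     (recon_real - real_batch).abs().mean())
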

Large Language Models

2025 From-scratch LLM Implementation GitHub
Built a modern large language model (LLM) from scratch in PyTorch, incorporating Byte Pair Encoding (BPE) tokenization, Rotary Position Embedding (RoPE), RMSNorm, SwiGLU activation, and causal multi-head self-attention. The project was inspired by assignments from Stanford’s CS336: Large Language Models (Spring 2025).
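Two of the components that distinguish this architecture from the original Transformer are RMSNorm and rotary position embeddings (RoPE); minimal sketches of both (interleaved-pair RoPE variant, with theta = 10000 as an assumed base):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Normalizes by the root mean square of the features; no mean-centering, no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def apply_rope(x, theta=10000.0):
    # x: (..., seq_len, dim); rotates adjacent feature pairs by position-dependent angles.
    seq_len, dim = x.shape[-2], x.shape[-1]
    freqs = theta ** (-torch.arange(0, dim, 2, device=x.device).float() / dim)
    angles = torch.arange(seq_len, device=x.device).float()[:, None] * freqs   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)
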
2025 Flash Attention 2 Implementation GitHub
Implemented Flash Attention 2 in Triton and performed extensive benchmarking, achieving a 7× speedup in end-to-end runtime (forward + backward) of the attention operation compared to the naive PyTorch implementation on long sequences using bfloat16 on an RTX 5090 GPU. The speedup comes from GPU-efficient techniques such as operator fusion, tiling to shared memory to minimize global memory latency, and low-precision computation, which together reduce memory overhead and improve parallel execution efficiency.
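The speedup rests on online softmax: keys and values are streamed in tiles while a running maximum and normalizer are maintained, so the full attention matrix is never materialized in global memory. In plain PyTorch the accumulation looks roughly like this (a conceptual sketch of the algorithm, not the Triton kernel):

import torch

def tiled_attention(q, k, v, tile=128):
    # q, k, v: (seq_len, head_dim) for a single head; streams over key/value tiles.
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"), device=q.device)  # running row-wise max
    l = torch.zeros(q.shape[0], 1, device=q.device)                  # running softmax normalizer
    acc = torch.zeros_like(q)                                        # running weighted sum of values
    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        s = (q @ k_t.T) * scale                                      # scores for this tile only
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = (s - m_new).exp()
        correction = (m - m_new).exp()                               # rescale earlier partial results
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_t
        m = m_new
    return acc / l
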

Sequence Modeling (NLP & Audio)

2025 Masked Language Modeling (BERT) GitHub
From-scratch PyTorch implementation of BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding using Byte Pair Encoding (BPE) tokenization, whole-word masking, gradient accumulation, and 1M-step masked language model training on BookCorpus and Wikipedia.
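Whole-word masking selects whole words rather than individual BPE subtokens and masks every subtoken of a chosen word (the standard recipe then applies the 80/10/10 [MASK]/random/unchanged split, omitted here); a simplified sketch:

import random

def whole_word_mask(tokens, word_ids, mask_token="[MASK]", mask_prob=0.15):
    # tokens: list of BPE subtokens; word_ids[i] maps subtoken i to the index of its source word.
    chosen = {w for w in set(word_ids) if random.random() < mask_prob}
    masked, labels = [], []
    for tok, wid in zip(tokens, word_ids):
        if wid in chosen:
            masked.append(mask_token)   # every subtoken of the chosen word is masked together
            labels.append(tok)          # the model must recover the original subtoken
        else:
            masked.append(tok)
            labels.append(None)         # ignored by the MLM loss
    return masked, labels
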
2018 Temporal CNNs (Audio-driven Facial Animation) [Patent: US10755463B1] Meta Platforms - AR/VR Oculus

Contributed to Oculus LipSync’s audio-driven facial animation system, widely adopted by VR developers to power real-time, expressive avatar movement in Unity and Unreal Engine, by implementing Temporal Convolutional Networks (TCNs) to predict viseme shapes from LogMel audio features, and co-inventing patented methods for full-face animation—including lips, eyebrows, and eyelids—based on speech and pitch-driven input.

Designed and implemented the real-time laughter detection neural network shipped with Oculus LipSync, achieving robust performance across diverse vocal profiles, by training a Temporal Convolutional Network (TCN) on a manually labeled and augmented dataset, and optimizing the model for real-time inference on CPU and DSP.
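A TCN stacks dilated 1-D convolutions over the LogMel feature sequence, growing the receptive field without recurrence; a minimal causal block looks roughly like this (illustrative kernel size and residual structure, not the shipped model):

import torch.nn as nn
import torch.nn.functional as F

class CausalTemporalBlock(nn.Module):
    # Dilated 1-D convolution over (batch, channels, time); left-only padding keeps it causal,
    # so the prediction at time t depends only on audio features up to t.
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv(F.pad(x, (self.pad, 0)))   # pad only the past side of the sequence
        return self.relu(x + out)                  # residual connection preserves input timing
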

For technical details and discussion, see the blog posts for Oculus LipSync release v1.28.0 and release v1.30.0, and the [Oculus Lipsync Documentation].

A detailed overview of Oculus LipSync was presented at the PyTorch Developer Conference 2018 [video - my talk begins at 31:45].

2016 Real-time Voice Controlled Meeting Assistant Stanford University
Contributed to a team project to develop an AI-powered meeting assistant integrated with the Amazon Echo, designed to provide real-time voice-driven support during meetings. The system transcribed audio to text, extracted action items, and surfaced insights such as sentiment and energy scores, which were organized into a web dashboard for post-meeting analysis. Built in collaboration with VMware as part of Stanford's CS210 Software Project Experience course.
2015 LSTM for Quiz Bowl Question Answering Technical Report
Developed a dependency-tree-based LSTM model for answering Quiz Bowl questions that outperforms previous recursive neural network approaches by over 3% in accuracy, leveraging word2vec embeddings and tree-structured memory cells to better capture syntactic and semantic information from the questions. The Tree-LSTM extends standard linear-chain LSTMs by propagating information along the syntactic dependency parse tree of each sentence, allowing the model to selectively retain or forget information based on grammatical structure.

Experience & Education

2019 - 2024
Anduril Industries
Computer Vision and Perception Engineer

  • Led development of deep learning object detectors for Sentry Tower and RoadRunner, delivering production-grade models for ground surveillance and aerial threat interception in counter-UAS operations, with continual improvements driven by systematic architectural and data experimentation.
  • Enhanced real-time multi-object tracking accuracy by contributing to a C++ vision library that integrated a TensorRT-compiled detector with an online visual tracker; validated in live military field exercises and deployed at operational sites.
  • Automated the end-to-end ML workflow for the RoadRunner platform by integrating orchestration and visualization tools to manage data ingestion, annotation, and distributed multi-GPU training; enabled CI-driven model iteration with automatic mAP/HOTA benchmarking on continuously updated test sets.
  • Improved detection of challenging aerial targets by developing and evaluating custom multi-frame video models with temporal context.
  • Improved model robustness and reduced false positives by identifying failure modes in field environments, coordinating test site validation events, and applying targeted data augmentations and pipeline changes.
2017 - 2019
Meta Platforms - AR/VR Oculus
Computer Vision Engineer

  • Shipped a real-time laughter detection neural network and contributed to the viseme prediction model for Oculus LipSync, both based on Temporal Convolutional Networks (TCNs) applied to LogMel audio features; optimized for real-time CPU/DSP inference and co-invented patented methods [Patent: US10755463B1] enabling expressive avatar facial animation in Unity/Unreal.
  • Developed facial keypoint tracking for Meta Quest Pro’s Face Tracking API, enabling real-time avatar expressions via FACS-based blendshape mapping.
  • Generated photorealistic synthetic training data using CycleGAN, transferring lighting/texture from headset footage to rigged 2D facial views for 3D facial blendshape prediction.
  • Applied a generative model for neural compression of facial video from real-time keypoints and a single reference image [Patent: US11734952B1], reducing bandwidth while preserving visual fidelity.
2016
Meta Platforms
Software Engineering Intern

Updated the 360 photo and video upload pipeline to use a computer vision algorithm to automatically detect panoramic photos, ensuring correct 360° content display for users.
2014 - 2017
Stanford University
MS in Computer Science (Artificial Intelligence & Computer Vision; Advisor: Fei-Fei Li)

Conducted research in computer vision under Prof. Silvio Savarese at Stanford's Computational Vision & Geometry Lab, focusing on video activity recognition and object tracking. Contributed to the publication Forecasting Social Navigation in Crowded Complex Scenes by building data and annotation infrastructure for over 200 hours of aerial drone footage.
2014
Oracle (formerly BlueKai)
Software Engineering Intern

2011 - 2015
Stanford University
BS in Engineering Physics, Minor in Mathematics (Advisor: Pat Hanrahan)

As an undergraduate at Stanford, I conducted particle physics research at SLAC National Accelerator Lab with Prof. Ariel Schwartzman, applying machine learning techniques to collision data from the ATLAS experiment at the Large Hadron Collider. In a prior summer, I conducted condensed matter physics research with Prof. Hari Manoharan, designing and analyzing nanoscale conductivity experiments using scanning tunneling microscopy.