Advanced Topics in Computer Vision Course
The “Advanced Topics in Computer Vision” course (ECE-GY 9193) at NYU Tandon, led by Professor David Fouhey, is a research-driven seminar that dives deep into cutting-edge developments in computer vision. Instead of traditional lectures, the course centers on reading, analyzing, and discussing recent academic papers. It is designed for students who want not only to stay current with fast-moving vision research, but also to develop the skills needed to engage with research critically: reading thoughtfully, presenting clearly, and giving constructive feedback.
Course Overview
This course explores the latest research in computer vision by engaging directly with recent papers from top conferences and journals. There is no textbook; instead, students read primary literature each week, learning to identify important contributions, evaluate the quality of experiments, and understand the broader implications of each work.
The course focuses on two key objectives: building a strong understanding of current vision research, and developing foundational research skills. Students practice reading papers critically, pitching ideas effectively, writing with clarity, presenting to both technical and non-technical audiences, and offering constructive peer feedback. These are not just skills for academic success, but essential tools for contributing meaningfully to the field.
Class sessions are structured around active participation. Each week, all students read the assigned paper(s) and an accompanying “big picture” reading that puts the work into broader context. A group of students prepares a presentation, while others take on the role of discussion leaders, tasked with asking thoughtful questions and guiding conversation. This format, inspired by the Role-Playing Paper-Reading Seminars of Alec Jacobson and Colin Raffel, breaks away from long, passive lectures and replaces them with collaborative, in-depth discussions.
A standout feature of the course is its emphasis on peer feedback. Students regularly give and receive feedback on presentations and participation. This process, moderated by the instructor and course assistants, encourages a supportive and intellectually honest environment. Feedback is not competitive—students are not ranked against each other—but is used to help everyone improve.
Overall, Advanced Topics in Computer Vision is ideal for students who want to explore state-of-the-art research, sharpen their academic communication skills, and gain a deeper understanding of how ideas in computer vision evolve and take shape. It’s not about memorizing algorithms—it’s about learning how to think, question, and contribute.
Course Work and Grading
Note: this breakdown is subject to change.
- Attendance - 20%
- Course Participation - 15%
- Presentations - 15%
- Project - 50%
Readings
Recognition and Basic Tasks (Data and Architectures)
- Technical Paper 1: Segment Anything, Alexander Kirillov et al. ICCV 2023
- Technical Paper 2: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Alexey Dosovitskiy et al., ICLR 2021 (a toy patch-embedding sketch follows this list)
- Big Picture Paper: 50 Years of Data Science (Up to Page 18), David Donoho, 2015
- Reference Papers:
- Microsoft COCO: Common Objects in Context
- End-to-End Object Detection with Transformers (DETR)
- Deep Residual Learning for Image Recognition (ResNet)
- Detecting Twenty-thousand Classes using Image-level Supervision
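For readers coming to ViT fresh, here is a minimal sketch of its central patch-embedding step, written as toy NumPy code under my own naming (it is not the paper's implementation): the image is cut into fixed-size patches, each patch is flattened, and a learned linear projection turns it into a Transformer token.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened (num_patches, P*P*C) patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)          # (h, w, P, P, C)
    return patches.reshape(-1, patch_size * patch_size * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))    # stand-in for an input image
tokens = patchify(img)                      # (196, 768): 14x14 patches of 16x16x3
W_embed = rng.standard_normal((768, 768))   # stand-in for the learned projection
embeddings = tokens @ W_embed               # one 768-d token per patch
print(embeddings.shape)                     # (196, 768)
```

In the real model, these tokens get positional embeddings and a class token before entering a standard Transformer encoder.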
Vision and Language Models
- Technical Paper 1: Flamingo: a Visual Language Model for Few-Shot Learning, Jean-Baptiste Alayrac et al., NeurIPS 2022
- Technical Paper 2: Visual Instruction Tuning, Haotian Liu et al. NeurIPS 2023
- Big Picture Paper: Scaling Laws for Neural Language Models, Jared Kaplan et al., 2020
- Reference Papers:
- Learning Transferable Visual Models From Natural Language Supervision (a toy contrastive-loss sketch follows this list)
- VQA: Visual Question Answering
- Language Models are Few-Shot Learners
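To make the CLIP reference above concrete, here is a toy sketch of a CLIP-style contrastive objective, simplified by me rather than taken from OpenAI's code: embeddings are L2-normalized, all pairwise similarities are computed, and each modality must pick out its true partner via a symmetric cross-entropy.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; matching pairs sit on the diagonal."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    diag = np.arange(len(img))               # index of each true pair

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)                     # stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))  # log-softmax
        return -logp[diag, diag].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
print(clip_loss(rng.standard_normal((8, 512)), rng.standard_normal((8, 512))))
```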
Diffusion
- Technical Paper 1: Denoising Diffusion Probabilistic Models, Jonathan Ho et al., NeurIPS 2020 (a forward-process sketch follows this list)
- Technical Paper 2: Scalable Diffusion Models with Transformers, William Peebles and Saining Xie, ICCV 2023
- Big Picture Paper: The Heilmeier Catechism, George Heilmeier, 1970s
- Reference Papers:
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics (harder read – it’s math-heavy)
- Conditional Image Generation with PixelCNN Decoders
- Texture Synthesis By Non-Parametric Sampling
- Neural Discrete Representation Learning
- High-Resolution Image Synthesis with Latent Diffusion Models
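To ground the diffusion readings, here is a minimal sketch of the DDPM forward (noising) process with the paper's linear beta schedule; the helper names are illustrative, and the denoising network itself is omitted. The closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps lets training sample any timestep directly.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear variance schedule
alpha_bars = np.cumprod(1.0 - betas)    # cumulative product of (1 - beta_t)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps                      # eps is the network's regression target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 3))   # stand-in for a training image
xt, eps = q_sample(x0, t=500, rng=rng)
# the training loss would then be mean((eps_pred(xt, t) - eps) ** 2)
```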
Datasets
- Technical Paper 1: PASS: An ImageNet replacement for self-supervised pretraining without humans, Yuki Asano et al., NeurIPS 2021 (Datasets and Benchmarks track)
- Technical Paper 2: Do ImageNet Classifiers Generalize to ImageNet?, Benjamin Recht et al. ICML 2019
- Big Picture Paper: Are We Learning Yet?, Thomas Liao et al., 2021
- Reference Papers:
- ImageNet: A Large-Scale Hierarchical Image Database
- Unbiased Look at Dataset Bias
- Does Object Recognition Work for Everyone?
Multiview 3D
- Technical Paper 1: DUSt3R: Geometric 3D Vision Made Easy, Shuzhe Wang et al. CVPR 2024
- Technical Paper 2: VGGSfM: Visual Geometry Grounded Deep Structure From Motion, Jianyuan Wang et al. CVPR 2024
- Big Picture Paper: The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira, 2009
- Reference Papers:
- Structure-from-Motion Revisited (a triangulation sketch follows this list)
- Building Rome in a Day
- SuperGlue: Learning Feature Matching with Graph Neural Networks
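As flagged next to the Structure-from-Motion Revisited entry, here is a small sketch of classical DLT triangulation, one of the building blocks such pipelines rest on; the toy cameras and helper names are my own.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Triangulate one correspondence; P* are 3x4 projections, x* are (u, v)."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)         # null vector of A is the 3D point
    X = Vt[-1]
    return X[:3] / X[3]                 # dehomogenize

# toy setup: identity camera, plus a second camera shifted one unit along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.append([0.5, 0.2, 3.0], 1.0)
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]   # project into each view
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate(P1, P2, x1, x2))          # recovers ~[0.5, 0.2, 3.0]
```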
Neural Fields
- Technical Paper 1: 3D Gaussian Splatting for Real-Time Radiance Field Rendering, Bernhard Kerbl et al., SIGGRAPH 2023
- Technical Paper 2: Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields, Jonathan Barron et al., ICCV 2021
- Big Picture Paper: What Makes a (Graphics) Systems Paper Beautiful, Kayvon Fatahalian, ??
- Reference Papers:
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
- Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains (a positional-encoding sketch follows this list)
- Plenoxels: Radiance Fields without Neural Networks
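Following up on the Fourier-features reference, here is a minimal sketch of NeRF-style positional encoding (illustrative code, not any official release): input coordinates are lifted through sines and cosines at geometrically increasing frequencies so a plain MLP can fit high-frequency detail.

```python
import numpy as np

def positional_encoding(x, num_freqs=10):
    """Map (N, D) coordinates to (N, D * 2 * num_freqs) Fourier features."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi   # pi, 2*pi, 4*pi, ...
    angles = x[..., None] * freqs                   # (N, D, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(x.shape[0], -1)

pts = np.random.default_rng(0).uniform(-1, 1, size=(4, 3))  # toy 3D points
print(positional_encoding(pts).shape)   # (4, 60) with 10 frequency bands
```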
Single-View 3D
- Technical Paper 1: UniDepth: Universal Monocular Metric Depth Estimation, Luigi Piccinelli et al. CVPR 2024
- Technical Paper 2: CAT3D: Create Anything in 3D with Multi-View Diffusion Models, Ruiqi Gao and Aleksander Holynski et al., arXiv 2024
- Big Picture Paper: Statistical Modeling: The Two Cultures, Leo Breiman, 2001
- Reference Papers:
- NYU Depth v2 Dataset
- Learning a predictable and generative vector representation for objects
- Learning to Recover 3D Scene Shape from a Single Image
Self-Supervised Learning
- Technical Paper 1: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities, Roman Bachmann et al., arXiv 2024
- Technical Paper 2: ImageBind: One Embedding Space To Bind Them All, Rohit Girdhar et al., CVPR 2023
- Big Picture Paper: Data Science at the Singularity, David Donoho, 2023
- Reference Papers:
- Unsupervised Visual Representation Learning by Context Prediction
- Masked Autoencoders Are Scalable Vision Learners (a masking sketch follows this list)
- A Cookbook of Self-Supervised Learning
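As noted beside the MAE reference, here is a toy sketch of the random masking at the heart of masked autoencoders (illustrative, not the official code): most patch tokens are dropped, the encoder sees only the visible subset, and the decoder must reconstruct the hidden patches.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Keep a random (1 - mask_ratio) fraction of a token sequence."""
    rng = rng if rng is not None else np.random.default_rng()
    N = tokens.shape[0]
    num_keep = int(N * (1 - mask_ratio))
    perm = rng.permutation(N)
    keep_idx = np.sort(perm[:num_keep])     # visible patches for the encoder
    mask_idx = np.sort(perm[num_keep:])     # patches the decoder must predict
    return tokens[keep_idx], keep_idx, mask_idx

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 768))    # e.g. 14x14 ViT patch tokens
visible, keep_idx, mask_idx = random_mask(tokens, rng=rng)
print(visible.shape)                        # (49, 768) at a 75% mask ratio
```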
Egocentric Vision
- Technical Paper 1: Rescaling Egocentric Vision, Dima Damen et al. IJCV 2021
- Technical Paper 2: Ego-Exo4D, Kristen Grauman et al., arXiv 2024
- Big Picture Paper: The Development of Embodied Cognition: Six Lessons from Babies, Linda Smith and Michael Gasser, 2005
- Reference Papers:
- Understanding Egocentric Activities
Humans
- Technical Paper 1: 3D Hand Pose Estimation in Everyday Egocentric Images, Aditya Prakash et al., ECCV 2024
- Technical Paper 2: Humans in 4D: Reconstructing and Tracking Humans with Transformers, Shubham Goel et al. ICCV 2023
- Big Picture Paper: Sora: Video Generation Models as World Simulators, OpenAI, 2024
- Reference Papers:
- SMPL: A Skinned Multi-Person Linear Model
- Embodied Hands: Modeling and Capturing Hands and Bodies Together
- End-to-end Recovery of Human Shape and Pose
- Learning joint reconstruction of hands and manipulated objects
Robotics
- Technical Paper 1: Open X-Embodiment: Robotic Learning Datasets and RT-X Models, Open X-Embodiment Collaboration, arXiv 2023
- Technical Paper 2: On Bringing Robots Home, Nur Muhammad Mahi Shafiullah et al., arXiv 2023
- Big Picture Paper: Intelligence without representation, Rodney Brooks, 1987
- Reference Papers:
- Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours
- Real-World Robot Learning with Masked Visual Pre-training
- RMA: Rapid Motor Adaptation for Legged Robots
Science
- Technical Paper 1: AstroCLIP: a cross-modal foundation model for galaxies, Liam Parker et al., Monthly Notices of the Royal Astronomical Society, 2024
- Technical Paper 2: Gravitationally Lensed Black Hole Emission Tomography, Aviad Levis et al. CVPR 2022
- Big Picture Paper: Position: Is machine learning good or bad for the natural sciences?, David Hogg and Soledad Villar, 2024
Review
Professor David Fouhey is an excellent instructor who brings energy, clarity, and curiosity into every session. He has a talent for breaking down complex research papers into clear, understandable insights, and asks thoughtful questions that push students to think more critically.
This course stands out for its unique format. It doesn’t cover computer vision fundamentals or teach machine learning/deep learning basics—instead, it assumes you’re already familiar with them. The focus is on exploring cutting-edge research, which makes it especially exciting for students and aspiring researchers looking to dive deeper into real-world applications of deep learning in computer vision.
If you’re passionate about staying at the forefront of vision research and want to improve your ability to analyze, present, and discuss academic work, this course is a must-take.