Keynotes

On the “Rough Use” of Machine Learning Techniques: a Story about Unrealistic Prediction, by Chih-Jen Lin, Distinguished Professor, National Taiwan University, and Affiliated Professor, MBZUAI

Tentative Program

We are excited to see you in Singapore this coming week! Please see the program schedule below.

As part of the workshop, we will hold a “position paper” session after the talks, where we hope to work with you to amplify the work everyone is doing.

09:00–09:15 — Opening (Edward Raff, Odd Erik Gundersen)
Welcome, workshop goals, potential special issue, and position paper plan.

09:15–10:00 — Keynote + Q&A (45 min)
Keynote talk followed by questions.

10:00–10:30 — Session 1: Problem framing & reference baselines (2 papers, 15 min each)
Session Chair: Waqas Ahmed

Automated Reproducibility Has a Problem Statement Problem
Thijs Snelleman, Peter Lundestad Lawrence, Holger H. Hoos, Odd Erik Gundersen

open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
Marianna Nezhurina, Jörg K.H. Franke, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev

10:30–11:00 — Coffee Break (AAAI-defined)

11:00–12:00 — Session 2: Reproducibility in agents, code, benchmarks, and images (4 papers, 15 min each)
Session Chair: Peter Lundestad Lawrence

AI Copilots for Reproducibility in Science: A Case Study
Adrien Bibal, Steven Minton, Deborah Khider, Yolanda Gil

AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents
Bhanu Prakash Vangala, Ali Adibifar, Tanu Malik, Ashish Gehani

Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents
María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone, Luis Javier Navarrete Lozano, Cristobal Ricardo Jesús Veas Chavez, Maite del Mundo de Torres

Learning to be Reproducible: Custom Loss Design for Robust Neural Networks
Waqas Ahmed, Sheeba Samuel, Kevin Coakley, Birgitta Koenig-Ries, Odd Erik Gundersen

12:00–13:00 — Lunch

13:00–14:15 — Session 3: Images, stability & simpler methods in applied settings (5 papers, 15 min each)
Session Chair: Thijs Snelleman

Exploration of Reproducible Generated Image Detection
Yihang Duan

Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context
Anatole Jacquin de Margerie, Alexis Roger

Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology
Vanessa D’Amario, Randy Daniel, Alessandro Zanetti, Dhruv Edamadaka, Nitya Alaparthy, Joshua Tarkoff

Simpler Methods Work Better for L1 Penalized Logistic Models and Large Datasets
Edward Raff, James Holt

Forest vs Tree: The $(N, K)$ Trade-off in Reproducible ML Evaluation
Deepak Pandita

14:15–14:30 — Short Break

14:30–15:30 — Workshop Part 1: Towards a position paper (Odd Erik Gundersen, Thijs Snelleman, Edward Raff)
Goal-setting, scope, key claims, outline, and breakout assignments.

15:30–16:00 — Coffee Break

16:00–17:00 — Workshop Part 2: Towards a position paper (Odd Erik Gundersen, Thijs Snelleman, Edward Raff)
Report-backs, synthesis, drafting plan, and next steps.

17:00–17:15 — Closing remarks & forward plan (Edward Raff, Odd Erik Gundersen)
Special issue next steps + position paper writing timeline and responsibilities.