Topic: (Mechanistic) Interpretability, Explainability, Transparency

As large AI systems — particularly language models — grow increasingly powerful and complex, understanding how they operate “under the hood” is no longer optional. Making these typically opaque systems more transparent has become a central goal in modern AI research.

Several key subfields contribute to this effort:

  • (Mechanistic) Interpretability seeks to reverse-engineer the internal computations of models, revealing how specific behaviors and capabilities emerge (a minimal code sketch follows this list).
  • Explainability focuses on generating human-understandable reasons for a model’s decisions or outputs.
  • Transparency serves as an overarching goal, pushing for AI systems that are open to inspection, analysis, and understanding, rather than black boxes.
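As a small illustration of the kind of analysis mechanistic interpretability involves, the sketch below records the hidden activations of one transformer block in GPT-2 using a PyTorch forward hook. This is a minimal example under illustrative assumptions, not a prescribed workflow: the model choice (gpt2) and the layer index (5) are arbitrary.

```python
# Minimal sketch: capture a transformer block's hidden activations with a
# PyTorch forward hook, a common starting point for interpretability work.
# Model name and layer index are illustrative assumptions only.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # small, publicly available causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

captured = {}

def save_activation(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are its first element.
    captured["block_5"] = output[0].detach()

# Register the hook on block 5 of GPT-2's 12 transformer blocks.
handle = model.h[5].register_forward_hook(save_activation)

with torch.no_grad():
    inputs = tokenizer("Interpretability asks how models compute.",
                       return_tensors="pt")
    model(**inputs)

handle.remove()
print(captured["block_5"].shape)  # (batch, sequence_length, hidden_size)
```

Inspecting activations like these, for example with linear probes or sparse autoencoders, is a typical first step toward identifying the internal features and circuits a model uses.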

Advances in these areas are critical not only for improving the safety, reliability, and alignment of AI systems, but also for enabling effective debugging, responsible governance, and cumulative scientific progress.

Contact: Tanja Bäumel, Simon Ostermann, Patrick Schramowski