“Data Ark: Defining Provenance Frameworks for AI Training Data” addresses a critical gap in AI infrastructure: while models depend on large-scale training datasets, there is currently little to no scholarly framework for documenting where those datasets come from, how they are assembled, and how they circulate or degrade over time.
What distinguishes Data Ark is its humanities-driven, interdisciplinary approach. Rather than treating training data as opaque technical inputs, we approach datasets as cultural objects—with histories of authorship, transformation, and institutional context. The goal is to build a public-facing, catalogue raisonné–style platform that makes these datasets legible, citable, and accountable.
Team & Interdisciplinary Structure (Northeastern + External Partners):
The project is led by a cross-departmental CAMD team and developed in close collaboration with the Network Science Institute:
- Gloria Sutton (PI, Art + Design / Art History): Project framework, provenance methodology, and external partnerships
- Sylke René Meyer (Creative Practice Research): Research-through-practice and computational/creative translation
- Juliana Rowen Barton (Center for the Arts): Curatorial integration, exhibitions, and public programming
- Jennifer Gradecki (Interactive Media & Game Design): AI systems, LLMs, and responsible AI frameworks
- Derek Curry (Art + Design): Machine learning, computational imaging, and dataset construction
- Julia Flanders (English / Digital Humanities): Metadata standards, archival infrastructure, and long-term stewardship
- Albert-László Barabási + BarabásiLab (Network Science Institute): Network modeling and—critically—access to real training datasets for case studies
- Jacopo Mattia Conti (BarabásiLab / CAMD): Platform architecture, data visualization, and interface design
This interdisciplinary team structure allows the project to bridge art history, curatorial practice, digital humanities, and AI research, a breadth essential for treating provenance as both a technical and a cultural problem.
Key Use Case / Contribution:
One concrete use case involves working with active training datasets from BarabásiLab to model how datasets are assembled, versioned, and reused across AI systems. These datasets serve as test cases for developing:
- Provenance documentation standards (analogous to archival metadata or catalogues)
- Dataset “genealogies” that track transformations and reuse
- Ethical and institutional context (licensing, attribution, cultural sourcing)
The outcome is not another dataset repository, but a reference infrastructure—a system for documenting and interpreting training data in ways that support transparency, reproducibility, and accountability.
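To illustrate what a dataset "genealogy" might look like in machine-readable form, the Python sketch below pairs a minimal provenance record with a traversal that reconstructs a dataset's lineage. The field names and structure are illustrative assumptions for this sketch, not a finalized Data Ark schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical provenance record; fields are illustrative,
# not a finalized Data Ark metadata standard.
@dataclass(frozen=True)
class ProvenanceRecord:
    dataset_id: str               # stable identifier for this dataset version
    title: str
    version: str
    derived_from: tuple = ()      # dataset_ids of parent versions (genealogy edges)
    transformations: tuple = ()   # e.g. ("deduplicated", "filtered")
    license: Optional[str] = None # licensing / conditions of use
    sources: tuple = ()           # institutional or cultural sources

def genealogy(records: dict, dataset_id: str) -> list:
    """Walk derived_from links breadth-first, returning ancestor ids."""
    lineage, frontier = [], [dataset_id]
    while frontier:
        current = frontier.pop(0)
        for parent in records[current].derived_from:
            if parent not in lineage:
                lineage.append(parent)
                frontier.append(parent)
    return lineage

# Usage: a three-step lineage (raw crawl -> cleaned -> v2).
raw = ProvenanceRecord("raw", "Raw crawl", "0.1")
cleaned = ProvenanceRecord("cleaned", "Cleaned crawl", "1.0",
                           derived_from=("raw",),
                           transformations=("deduplicated",))
v2 = ProvenanceRecord("v2", "Cleaned crawl", "2.0", derived_from=("cleaned",))
records = {r.dataset_id: r for r in (raw, cleaned, v2)}
print(genealogy(records, "v2"))  # ['cleaned', 'raw']
```

Even a record this small makes transformations and reuse citable; in practice such records would map onto established archival metadata standards rather than an ad hoc schema.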
More broadly, Data Ark speaks to a growing institutional challenge: as museums, archives, and cultural collections are increasingly used to train AI systems, there is an urgent need to ensure that their histories, authorship, and conditions of use remain visible rather than becoming computationally naturalized and detached from context.