DSpace: the next generation
(John Mark Ockerbloom)
Why a new architecture?
- Use, scale, dependence on DSpace growing (code is mushrooming, preservation needs growing, ad-hoc development is tough)
- Architectural needs (set priorities, handle new forms of content, better interoperability with other systems)
- Set directions for an evolutionary, practical system design (try to get ahead of the game a little; create a system in 18-24 months that will suffice for a few years)
DSpace 2
- discussion started in 2004
- summer 2006, group constituted to review architecture
- Online discussion
- weeklong “summit” October 2006
- subgroups, data-model and roadmap development
Survey
- questions and comments about use and customization of DSpace
- 116 responses in one week
- adaptation proved quite common: metadata, database schema, code changes; hard to keep customizations and new version in sync (YES!), so customizations have less uptake than they otherwise might (welcome to my life!)
- common wishes: better modularity, more customizable UI, complex objects, versioning
Architecture manifesto part one: DSpace (Buddha-?)nature
- OSS for digital repositories (avoid scope creep into CMS, wiki, etc.)
- usable based purely on free and OSS (avoid proprietary dependencies, can still support closed-source as option, e.g. Oracle)
- decoupled, stable, and app-neutral core (for integration with other software; “core” = business logic, storage layers)
- useful “out of the box”
Architecture manifesto part two: DSpace development
- Employ and support open standards
- Releases should be minimally disruptive
- Support an exit strategy for content
- Continue to evolve
Scalability issues
- Three scale dimensions (repository size, intensity of use, rate of ingestion/other processing)
- No major architectural limits to scale, but revisions need to accommodate large-scale use
- Performance goals: 10 million items, 10 simultaneous depositors, 100 simultaneous users, 1 second of additional overhead at full repository size, accommodation of clusters, unlimited file sizes, ponies for everyone!!1! (or not)
Interoperability
- three aspects: data, service, and API-level
- needs: concrete data model; published, documented, stable core interface designed to be extended; support for common, standard protocols
DSpace 2 highlights
- more powerful, flexible data model
- shift in UI (JSP to Manakin)
- core overhaul, documentation (to make extensions and customizations easier to add and maintain)
- focus on extended content lifecycle (format migrations)
- more reuse of third-party development
Data model
- can have multiple metadata records, attached to items or sub-items (such as bitstreams) (rather more FRBR-ish)
- bundles become “manifestations,” used for content only (some old bundles, e.g. licenses, become metadata)
- bitstreams become “content files”
- all of these have identifiers (for better linking, yay!), not just handles for entire item; proposed: URIs based on item identifier with qualifiers for manifestation, content file, version
- persistent identifiers for epeople (quasi-authority-files)
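To make the identifier proposal concrete for myself, here is a minimal sketch of a URI built from an item identifier plus qualifiers. The talk did not give a syntax: the `;key=value` form, the function name `qualified_uri`, and the example.org URI are all my own assumptions, not anything DSpace has specified.

```python
from typing import Optional
from urllib.parse import quote


def qualified_uri(item_uri: str,
                  manifestation: Optional[str] = None,
                  content_file: Optional[str] = None,
                  version: Optional[int] = None) -> str:
    """Build a sub-item URI from an item URI plus optional qualifiers.

    The ";key=value" qualifier syntax is invented for this sketch.
    """
    parts = [item_uri.rstrip("/")]
    if manifestation is not None:
        # Percent-encode so the qualifier cannot inject extra delimiters.
        parts.append(";manifestation=" + quote(manifestation, safe=""))
    if content_file is not None:
        parts.append(";file=" + quote(content_file, safe=""))
    if version is not None:
        parts.append(";version=" + str(version))
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical item URI, used only for illustration.
    print(qualified_uri("http://example.org/item/123", "pdf", "thesis.pdf", 2))
```

The point is only that every level (item, manifestation, content file, version) gets something linkable, and that the sub-item identifiers can be derived from the item's own.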
Versioning
- used for non-semantic revisions of content and metadata: e.g. format migrations, metadata revisions, content corrections (typos etc.)
- semantic revisions can be separate items with Relation metadata to link (not enforced by system, but makes citation clearer; needs metadata and UI support for ease-of-use)
- retention of old versions is a matter for repository policy
My question: if we have reliable identifiers for sub-items, why do we have to bother with separate items for semantic versioning?
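For what it's worth, the split described in the versioning list could be sketched roughly as follows. This is my own toy model, not the proposed DSpace 2 data model: the `Item` class, its fields, and the `relation.replaces` / `relation.isReplacedBy` field names (loosely modeled on the Dublin Core refinements of the same names) are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    identifier: str
    # One entry per non-semantic version (format migration, typo fix, ...).
    versions: List[str] = field(default_factory=list)
    # Flat metadata for brevity; the proposed model allows richer records.
    metadata: Dict[str, List[str]] = field(default_factory=dict)

    def add_version(self, note: str) -> int:
        """Record a non-semantic revision and return its 1-based number."""
        self.versions.append(note)
        return len(self.versions)


def link_semantic_revision(old: Item, new: Item) -> None:
    """Link two separate items with Relation-style metadata.

    Nothing here is enforced by the system; it only records the link
    in both directions so a UI could show it.
    """
    new.metadata.setdefault("relation.replaces", []).append(old.identifier)
    old.metadata.setdefault("relation.isReplacedBy", []).append(new.identifier)


if __name__ == "__main__":
    first = Item("item/1")
    first.add_version("original deposit")
    first.add_version("migrated to a newer file format")
    second = Item("item/2")
    link_semantic_revision(first, second)
    print(first.metadata, second.metadata)
```

Non-semantic changes stay inside one item as numbered versions; a semantic revision is a new item that points back at the old one.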
Other recommendations
- more flexible, more preservable metadata (manage and preserve in persistent store; support multiple records; serializable; not constrained to be flat; default schemas for items and content files; views can be projected into DB for ease of access)
- separate abstract data model from concrete data storage (w/abstraction layer, presumably; can use SRB or DB+filesystem or whatever)
- generalize collections, communities
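The "separate abstract data model from concrete data storage" recommendation reads to me like a plain storage abstraction layer. A minimal sketch, assuming a two-method interface that I made up (`ContentStore`, `put`, `get`); the real core interface had not been published when I took these notes:

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict


class ContentStore(ABC):
    """Hypothetical storage interface; the method names are assumptions."""

    @abstractmethod
    def put(self, identifier: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, identifier: str) -> bytes: ...


class MemoryStore(ContentStore):
    """Keeps content in a dict; handy for tests."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, identifier: str, data: bytes) -> None:
        self._blobs[identifier] = data

    def get(self, identifier: str) -> bytes:
        return self._blobs[identifier]


class FilesystemStore(ContentStore):
    """Keeps each content file as one file under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def _path(self, identifier: str) -> Path:
        # Hex-encode the identifier so any string is a safe file name.
        return self._root / identifier.encode("utf-8").hex()

    def put(self, identifier: str, data: bytes) -> None:
        self._path(identifier).write_bytes(data)

    def get(self, identifier: str) -> bytes:
        return self._path(identifier).read_bytes()


def copy_content(identifier: str, source: ContentStore, target: ContentStore) -> None:
    """Move content between backends without caring what either one is."""
    target.put(identifier, source.get(identifier))
```

Code written against `ContentStore` does not care whether the bytes live in SRB, a database plus filesystem, or whatever; each backend would just be another subclass.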
UI
- Manakin!
- should become standard UI for DSpace 2
- requires an add-on mechanism to be integrated into the system (currently there’s a simple one, but a more generalized approach could make DSpace much more extensible and customizable)
Extension frameworks
- OSS has several (OSGi, Spring); we’d rather reuse one than build it from scratch
- in the meantime, simple one released for Manakin
Event mechanism
- core should include event-notification mechanism
- can be used for history system, view maintenance, UI
- prototype in development
- hoping to get it into DSpace 2
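I have not seen the prototype, so the following is only a generic publish/subscribe sketch of what an event-notification mechanism in the core might look like. `EventBus`, `Event`, and the event names are my inventions, not the DSpace prototype's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Event:
    kind: str     # e.g. "item.modified"; event names are made up here
    subject: str  # identifier of the object the event is about


Listener = Callable[[Event], None]


class EventBus:
    """Minimal synchronous publish/subscribe dispatcher."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, kind: str, listener: Listener) -> None:
        self._listeners[kind].append(listener)

    def publish(self, event: Event) -> None:
        # Iterate over a copy so listeners may subscribe during dispatch.
        for listener in list(self._listeners.get(event.kind, [])):
            listener(event)


if __name__ == "__main__":
    bus = EventBus()
    history: List[Event] = []   # stand-in for a history system
    stale_views: set = set()    # stand-in for view maintenance

    bus.subscribe("item.modified", history.append)
    bus.subscribe("item.modified", lambda e: stale_views.add(e.subject))
    bus.publish(Event("item.modified", "item/1"))
    print(len(history), stale_views)
```

The appeal is that the history system, view maintenance, and UI all hang off the same hook without the core having to know about any of them.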
Workflow
- not just for ingestion any more!
- can use third-party software, as with extension frameworks
- need better tools for specifying and modifying workflows
Roadmap to DSpace 2
- core group does detailed specs for core, docco, reimplementation
- architectural oversight committee monitors progress
- wider community supports distribution (development and feedback)
- continued evolution of DSpace 1 meanwhile