Together, we are developing an affordable, open-source, and collaborative institutional repository solution based on the Hyku software.
- Bulking Up, Part 2: Bulk Upload in Hyku Commons
This post is the second of a two-part look at bulk upload in Hyku. The first examined the background of bulk operations and why they are difficult to do well. This post focuses specifically on the application of a bulk import solution in Hyku Commons. The needs for bulk upload in this project are similar to those identified by the large Hyku user community. We too need an “easy in and easy out” data solution for our repository users. (Photo above by Pexels on Pixabay)
In PALNI’s 2018 white paper, we identified several valued repository attributes, and have since adopted them as the shared vision for the Hyku for Consortia project. One of these values speaks directly to the need for bulk upload solutions: “The collaborative institutional repository should be a system which is interoperable and allows free-flow of data. Easy import and export of metadata and objects are possible.” The use cases for bulk ingest are numerous. Migrations from another platform, repurposing data from external sources such as finding aids, and user preference rank high on the list of reasons to import works and their metadata in bulk rather than piecemeal.
To further illustrate the need for bulk importing, you will find a list of workflow examples in the Hyku for Consortia project documentation. These examples provide hypothetical consortial profiles and repository scenarios based on real-life stories contributed by our Product Management Team. From these scenarios:
- Scenario 1, “Midwest Library Consortium”: Tenant-only Editor is in the archives department and has a digitized archives collection to add to the repository. He creates a new collection and uses one of the pre-populated admin set choices. He then bulk uploads the content and saves it but does not publish it.
- Scenario 3, “Wealthy Alumni College”: Tenant-only Editor begins uploading student works in bulk into the repository with draft metadata, licenses, and embargoes.
- Scenario 5, “Sunnydale Community College”: Student staff member is made Tenant Editor. She uploads minutes in batches with a spreadsheet of basic metadata. The collection is not yet published.
These scenarios helped us to envision all the ways that Hyku might be used for various IR users and content, and to define our collaborative workflows and user roles. Also, without us realizing it at the time, they very much highlighted how essential bulk import is to this work. In three out of five of these examples, we envisioned works being uploaded in bulk by Tenant Editors, who might be an archivist/librarian, grad school staff member, or even a trusted student. These users have metadata in an existing external format, and rekeying hundreds or thousands of metadata values would be a waste of their time.
Shifting away from the hypothetical to the actual, now that we are using Hyku Commons for real-world pilot repositories, the need for bulk import functionality is even more apparent. For example, one of our partner institutions moved content from Digital Commons to CONTENTdm as a stop-gap when they lost access to Digital Commons due to cost. Now they want to move that content into Hyku.
Using the Bridge2Hyku project’s CDM Bridge tool, export was a breeze. We were able to extract all the files and metadata from CONTENTdm in a way that Hyku would understand. But how to get the described works into Hyku? The native Hyku batch import did not provide a solution, since it applied identical metadata to each item. The records we wanted to bulk upload have complete, individual descriptions. We soon learned that this kind of desired bulk import was a much more complicated task, and reached out to Notch8 to find a solution.
With Notch8’s help, we investigated HyBridge (the import counterpart to CDM Bridge’s export), Cdm_Migrator, and Bulkrax as potential bulk import solutions for Hyku Commons. We selected Bulkrax for our project because it seemed to work best for our multi-tenant environment and was easiest to configure within our setup.
According to the Samvera Labs webpage, “Bulkrax is a batteries included importer for Samvera applications. It currently includes support for OAI-PMH (DC and Qualified DC), XML, Bagit, and CSV out of the box. It is also designed to be extensible, allowing you to easily add new importers into your application or to include them with other gems. Bulkrax provides a full admin interface including creating, editing, scheduling and reviewing imports.”
Check out this poster from Samvera Connect 2019 for more information about Bulkrax.
After working with Notch8 to install Bulkrax in Hyku Commons and keep it up to date, we viewed developer-supplied walkthrough videos (like this one) and wiki documentation to get a better understanding of how to use the CSV importer. It is now possible to bulk upload to Hyku Commons with Bulkrax by importing a zipped folder containing a folder of files and a properly formulated CSV file. The CSV contains one row per object, holding that object’s descriptive metadata. Additionally, the first four fields are administrative fields, which govern how the importer imports the files.
- item – Lists the name and extension of the item being imported, such as file.jpg.
- source_identifier – Establishes a persistent identifier for the object being imported.
- model – Identifies the worktype the work will be created as.
- collection – Determines what collection(s) the work will be added to.
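To make the package layout concrete, here is a minimal Python sketch that assembles a zip with a CSV and a folder of files. It is an illustration under assumptions, not the Bulkrax implementation: the four administrative column names are the ones listed above, while the `metadata.csv` file name, the `files/` folder name, the `title` and `creator` columns, and the sample values are hypothetical. Check the Bulkrax wiki for the headers and layout your installed version expects.

```python
import csv
import io
import zipfile

# Administrative columns as named in this post; descriptive columns follow them.
ADMIN_FIELDS = ["item", "source_identifier", "model", "collection"]
DESCRIPTIVE_FIELDS = ["title", "creator"]  # hypothetical descriptive fields


def build_import_package(rows, files):
    """Return the bytes of a zip holding metadata.csv plus a files/ folder.

    rows  -- list of dicts keyed by the CSV column names
    files -- dict mapping a file name to that file's bytes
    """
    csv_text = io.StringIO()
    writer = csv.DictWriter(csv_text, fieldnames=ADMIN_FIELDS + DESCRIPTIVE_FIELDS)
    writer.writeheader()
    for row in rows:
        # Fail early if a row points at a file that is not in the package.
        if row["item"] not in files:
            raise ValueError(f"missing file for row: {row['item']}")
        writer.writerow(row)

    package = io.BytesIO()
    with zipfile.ZipFile(package, "w") as archive:
        archive.writestr("metadata.csv", csv_text.getvalue())
        for name, data in files.items():
            archive.writestr(f"files/{name}", data)
    return package.getvalue()


# Sample values are invented for illustration.
rows = [
    {"item": "minutes-001.pdf", "source_identifier": "minutes-001",
     "model": "GenericWork", "collection": "board-minutes",
     "title": "Board Minutes, January", "creator": "Board of Trustees"},
    {"item": "minutes-002.pdf", "source_identifier": "minutes-002",
     "model": "GenericWork", "collection": "board-minutes",
     "title": "Board Minutes, February", "creator": "Board of Trustees"},
]
files = {"minutes-001.pdf": b"%PDF-1.4 sample", "minutes-002.pdf": b"%PDF-1.4 sample"}
package_bytes = build_import_package(rows, files)
```

Checking each row against the supplied files before writing anything is a deliberate choice here: a mismatch between the `item` column and the folder contents is easier to fix before upload than after a partial import.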
One of our challenges is the lack of step-by-step documentation for these processes. It’s a complex process and a tad finicky, so a very detailed guide would be helpful. Another is the need for separate parsers, and the intervention of a developer to create them, for custom worktypes. For our bulk upload to work for our OER worktype, for example, separate work had to be done to add the parser and to allow the relationships between items mentioned in the last post. Lastly, we ran into a few oddities along the way, which we reported and which were added to the Bulkrax project board so that they can receive feedback from the community.
In considering bulk capabilities for our project, the next step is to look towards bulk export with Bulkrax. This functionality currently exists in limited capacity, but it is in further development for wider usability. In keeping with the “easy in, easy out” theme, there are many use cases in which we’d desire the ability to export metadata as well as files from the Hyku Commons tenants. Stay tuned for additional developments on this process!
- Bulking Up
This post is the first of a two-part look at bulk upload and data remediation in Hyku. Part one is going to take a look at the background of bulk operations and why they are difficult to do well. Part two will talk about our specific work to try and address some of these needs in the Hyku for Consortia project. (Photo above by Ryoji Iwata on Unsplash)
Bulk operations in Hyku have a long history. In the initial user survey, conducted way back in 2015, one of the main findings was that Hyku needed to support the “easy in and out” of metadata. Metadata migration/remediation/transformation has always been a major activity in libraries. Think back to what an enormous task retrospective conversion of card catalogs to MARC was. Any library system containing metadata has to be able to manage that data at a large scale.
The design team for Hyku knew that bulk operations would be a key element to allowing potential users to commit to migrating out of their current tools. Hyku entered a market with a number of existing repositories. This new solution might have been able to solve many of the community’s frustrations with those tools, but only if there was an easy way to migrate to it. The initial requirements and personas therefore both reflected the need for tools to upload and transform metadata from one system to another. Mockups reflected the need to both migrate data as well as remediate it.
This work was then reflected in GitHub issues during the project development (see: https://github.com/samvera/hyku/issues?q=is%3Aissue+is%3Aopen+bulk), but other more basic needs for repository development (you need a repository to migrate data to, after all) took a higher priority. So a new grant project called Bridge2Hyku picked up where development left off and explored the issue of migration in more depth (https://bridge2hyku.github.io/). Our colleagues at the Bridge2Hyku project did great work analyzing not only how to upload data and objects to Hyku, but also how to get them out of some of the major repository systems currently in use.
All of this work then…but why are metadata migration and bulk creation/upload so difficult?
The nature of structured data is what makes it so powerful: you can index and search it, you can compare like to like, you can organize and sort. In short, it makes order out of chaos. And as humans, that’s what we naturally do: recognize patterns. But, also like humans, we might all see the world slightly differently. So different metadata schema and repository systems can have their own way of seeing the world. Some are quite simple and allow for the same basic type of description of everything. Others are quite granular, allowing for more nuanced description of subtle details that can be important and powerful. So any system to migrate or convert from one system to another typically relies on a lot of human intelligence to see the patterns and make the connections.
But human capacity is only so much. How do you analyze thousands of records? Analytical tools like OpenRefine can be helpful. So can guidelines for general rules on the major categories of migration as shown in crosswalks from other projects. But, as these examples perhaps show, these tools are not simple and not necessarily easy to pick up and learn. So any migration process is either going to require a lot of manual intellectual effort, or the creation of new tools to help with this business of organizing and translating.
The quirks of particular systems can also provide barriers. You may come up with a great crosswalk that works for one system, but doesn’t capture the nuance of another. Within Hyku for example, all works are sorted into worktypes. These types define the metadata schema used, the relationships between objects that can be created, and in some cases, the way that the object itself is presented and handled within the repository.
Data from other systems that don’t use this type of organization then requires an extra step to define the worktype that the data should be migrated to. The system that data is coming from can also prove a barrier. Some systems are opaque, making it hard to know exactly how data is stored. Others make it difficult to export data. Many systems can provide an XML feed of records through a tool using the OAI-PMH protocol, but these are then just records, not objects themselves. Others might use a newer protocol like ResourceSync for export, but may be incompatible with systems still relying on OAI-PMH.
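The “records, not objects” point can be illustrated with a short Python sketch. It builds an OAI-PMH `ListRecords` request URL and parses a small hand-written response. The `ListRecords` verb, the `oai_dc` metadata prefix, and the two XML namespaces come from the OAI-PMH and Dublin Core specifications; the base URL, identifiers, and record content are made up for illustration, and the sketch makes no network request.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL for an OAI-PMH ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hand-written sample response; a real feed would contain many records.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Campus Yearbook, 1950</dc:title>
          <dc:identifier>https://example.org/items/123</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract the descriptive fields from each record in a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{OAI_NS}record"):
        records.append({
            "oai_identifier": record.findtext(f"{OAI_NS}header/{OAI_NS}identifier"),
            "title": record.findtext(f".//{DC_NS}title"),
            # At best the record links to the object; the file itself is not in the feed.
            "link": record.findtext(f".//{DC_NS}identifier"),
        })
    return records


url = list_records_url("https://example.org/oai")
records = parse_records(SAMPLE_RESPONSE)
```

Everything the parser returns is descriptive text. Retrieving the actual files would take a separate step, such as following the link in each record, and how that works depends on the source system.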
Finally, issues can come from the very nature of materials themselves. A particular challenge we’ve had with migration relates to the inter-relationships between objects. As I’ve talked about before, and will likely write about on this blog in the future, one of the key needs we found to assist in the uptake of Open Educational Resources (OER) is the availability of related teaching tools or ancillary materials. A freely available textbook is great, but if there are also related quizzes, videos, or lecture slides, an educator has all they need to make the switch. In order to make these materials visible in an OER repository, we need to have the ability to define lots of different types of relationship like “translation of”, “part of”, or “replaced by” (for new editions).
Creating these relationships may be easy when materials are being uploaded as they are created on an ad-hoc basis. But migrating them to a new environment presents a new challenge: how do you create a relationship between materials that may be next in the queue to be created? There isn’t a simple solution. For us, it’s meant creating some new code to handle the creation of relationships as a second step in the data migration process. The point of this example isn’t necessarily the solution we found to this problem, but the acknowledgment that many other types of materials may present their own unique needs. While uniformity and standardization are good, it’s the balance between standardization and diversity that makes a repository useful.
So bulk operations in repositories are a hard nut to crack. There are similarities in any migration or conversion, but there are also a lot of specific challenges to every situation. In our next post, we’ll talk about the development of bulk upload functionality for Hyku Commons and how we addressed challenges in our own work.
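The idea of handling relationships as a second step can be sketched in a few lines of Python. This is an illustration of the general two-pass approach, not the code written for Hyku Commons: works are plain dictionaries, and the function name, field names, and sample rows are all hypothetical.

```python
def import_with_relationships(rows):
    """Create every work first, then link them in a second pass."""
    works = {}    # source_identifier -> work record (a plain dict in this sketch)
    pending = []  # (subject id, relationship type, target id), resolved later

    # Pass 1: create the works and set their relationships aside.
    for row in rows:
        works[row["source_identifier"]] = {"title": row["title"], "relationships": []}
        for rel_type, target_id in row.get("relationships", []):
            pending.append((row["source_identifier"], rel_type, target_id))

    # Pass 2: every work now exists, so row order no longer matters.
    unresolved = []
    for subject_id, rel_type, target_id in pending:
        if target_id in works:
            works[subject_id]["relationships"].append((rel_type, target_id))
        else:
            # The target was never part of this import; report it rather than guess.
            unresolved.append((subject_id, rel_type, target_id))
    return works, unresolved


rows = [
    # The quiz comes first in the queue but points at a textbook created later.
    {"source_identifier": "quiz-1", "title": "Chapter 1 Quiz",
     "relationships": [("part of", "textbook-1")]},
    {"source_identifier": "textbook-1", "title": "Open Biology Textbook"},
    {"source_identifier": "textbook-es", "title": "Libro de biología abierto",
     "relationships": [("translation of", "textbook-1"), ("replaced by", "textbook-2")]},
]
works, unresolved = import_with_relationships(rows)
```

Returning the unresolved links instead of raising an error lets a migration finish and leaves a list for a person to review, which fits the point above that much of this work still relies on human judgment.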
- Consortia and Open Source Software
PALNI and PALCI are working together on Hyku because we believe in its potential for improving repository workflows and open access for libraries. But we are also working on it because we understand and value the benefits of collaborative work on open source software. On Friday, June 26th, I am doing a presentation for the West Virginia / Western Pennsylvania chapter of ACRL called “Collaborating for Innovation: Developing Consortial Open Source Software at PALCI,” part of which will involve a discussion of Hyku for Consortia. I’d like to take some time this month to delve into the philosophy that I’ll be discussing there about collaboration and consortia.
We’ve written about collaboration before, and so have others. Generally speaking, consortia help to increase the scale of the work that libraries can do. When consortia work together, they can increase that power even further. Collaboration on open source software projects is a particularly good example of the benefits of this kind of cooperation.
First, a little background on our two consortia. PALNI is a consortium of 24 private academic libraries, and PALCI is a consortium of 70 member academic libraries of varying types, from small to large and public to private. We both include aspects of what Lorcan Dempsey calls (in the second link above) the “classic library consortia activities … some combination of licensing, resource sharing and training, or sometimes manag[ing] a shared library system…” But we are also both trying to increase our value to our networks in new and innovative ways to help them meet new challenges.
One of those challenges we hear about is infrastructure for handling scholarly communication or other types of digital object management. Even with the diversity in our membership, we hear about these issues frequently:
- Cost: Many repository solutions are simply too expensive, either in actual dollars or staff needed to successfully use them.
- Adaptability: Many of our members have diverse needs when it comes to repositories. They have both scholarly communication missions as well as managing their own digital library content, and many solutions fit only one or the other successfully.
- Limited choice: Consolidations and mergers in the vendor market seem to be limiting choice. The fear of getting locked in to an infrastructure that will become untenable is real.
So here is a clear case where open source software may be helpful. Firstly, because it has a relatively low barrier to entry: the communities supporting many tools are open. While it may take some expertise to really engage, we have that expertise within our networks. There are also potential open source software solutions to these problems that could have a really high impact in meeting member needs. Finally, within PALNI and PALCI we already have some infrastructure and experience in repository management through PALNI’s research and other types of repository services and PALCI’s participation in the HykuDirect Pilot.
Our Hyku for Consortia project, then, is an opportunity to help meet these members’ needs in some specific ways. First, we can mitigate the risk they might take on in trying something new. By spreading out that risk among our consortia, with each institution supporting the project through a little bit of staff time or a portion of their membership fee, we decrease that risk for all. Secondly, we also extend the opportunity to our members to help shape a product to specifically suit their needs. We are doing this by including multiple members from both consortia in our Product Management teams, getting feedback from members testing the software, and having open discussions with the membership on our progress.
The sneakiest benefit of all though is that investment in this particular software (or other open source software projects) can have ripple effects in the rest of the environment. The developments we are working on with Hyku will be highly beneficial to other vendors or providers that may offer the service as well. By integrating newer, better standards, we raise the bar for other competing services. So even if our member libraries don’t participate in this specific project, they can experience the benefits in the long-term.
- Hyku in Development and Production + Demo
Within our IMLS grant project, we have been working hard with our product management team and developer Notch8 to define and develop consortially-focused improvements to Hyku. For example, last month we shared some logistics for building collaborative workflows, working towards a master dashboard to control multi-tenant user permissions.
At the same time as these development activities are taking place, this project has also focused on the practical aspect of making the existing version of Hyku usable for our consortial partners to pilot as a working institutional repository. Our work has thus branched into two separate areas: Development and Production.
The Production arm of our work focuses on readying the existing Hyku Commons product for real-world pilot use starting this summer. As a result of user testing from both the PALNI and PALCI sides, we’ve been submitting tickets for small bug fixes and minor improvements which are now happening parallel to the development of features as outlined in the IMLS grant. Notch8 has devoted a lot of resources to our project in both arenas, and we’ve established a great working relationship and clear communication of needs from both sides.
To date we’ve created two clear and separate working instances of Hyku for Development and Production. First, the Development instance serves a number of purposes:
- A sandbox for PALCI and PALNI institutions to preview and test a Hyku tenant
- A staging area for Notch8 to preliminarily roll out updates, bug fixes, and new features
Second, the Production instance is where work is deployed once tested in the Development environment and also where pilot repositories will be built. It will be publicly available as a working repository soon.
We’ll share the Hyku Commons product (i.e., our Production instance) when it has pilot content that is ready to be viewed. For now, check out the demo video below for a brief look at our Development environment. PALCI and PALNI libraries can request a test repository in the Development instance using this form.
Five minute Hyku Commons demo
- An Inside Look at Collaborative Workflows
It’s no surprise that recent global events related to the COVID-19 pandemic have affected libraries across the globe. As we focus on keeping distance to slow the spread, one bright spot is that we have been remotely collaborating on our cross-consortia repository from the beginning, so it’s offered a welcome sense of continuity in troubled times to continue the project.
Our last posts outline the goals and planned activities of our projects, and in the interim we’ve made excellent progress on defining the requirements and designing the planned outcomes for the first two of our major development activities:
- Building collaborative workflows
- Theming and branding development
With this blog post we’d like to focus on introducing what “building collaborative workflows” means to us. Consortial collaboration is more than just sharing costs. We want to create a tool that will allow us to jointly manage a multi-tenant repository infrastructure. Creating the flexibility for both IR workflows and more “traditional” library-owned content within the same instance of Hyku means enhancing the ability to manage user and tenant settings (enabling different workflows) through the consortial dashboard.
To address the rather broad task we’d given ourselves, we leaned into our collaborative process. We asked our Product Management Team to articulate the types of collections they hoped to build with Hyku. They described the types and sources of materials, as well as the people involved, thus identifying where workflows overlapped and diverged among our consortium members.
From these we next began to brainstorm through narrative scenarios of various workflows. These helped to highlight specific shared workflow tasks as well as gaps in the current Hyku product. We also examined the existing user roles and permissions available within and across tenants and articulated the need for some additional levels of permission through narrative documents, matrices, and visualizations of these shared workflows.
By working through this process we realized that a robust dashboard for user/role assignment, and the expansion of a few more roles, would enable us to manage these flexible workflow options. The current multi-tenant administrative dashboard for Hyku only allows for the creation of new tenants and the creation of users. We would need something far more powerful to assign users to our envisioned permission levels in multiple tenants.
With this basic idea and our specific needs for user levels articulated, we turned to work with our development partners at Notch8. Talking through each of our requirements documents, we have come up with a rough development plan. Some of our expectations will likely be adjusted based on the feasibility and difficulty of implementation, but our goal of “building collaborative workflows” will remain the same.
- First is to decouple the “role” functionality in Hyku from the “group” functionality. Currently, permissions are assigned at both levels which can work at cross-purposes.
- Next we will develop the dashboard needed to control these permissions. This part will require us to put our creative thinking caps back on to more fully define what this looks like.
- Finally, we will work on implementing roles at the tenant level through the new dashboard.
We hope you’ve enjoyed this little peek behind the curtain at a “collaborative workflow” of our own: a cross-consortial development process between partners in three different states and two different time zones, using shared online tools, working asynchronously but together. We look forward to sharing our results in the future.