5 Open Source Digital Preservation Tools

By Christopher J. Michael on 2015-12-15 10:29:42 | Featured in Insurance, Information Management

Digital preservation is the foundation of enterprise archiving.

Electronic records are archived when they have long-term retention needs in order to fulfil legal, business and regulatory requirements. A digital archive is a repository that stores collections of digital objects with the intention of preserving and providing long-term access to the information. Authorized users must be able to access these records seven, 20, or 50 years later, or even in perpetuity. Organizations must deal with the challenges of digital preservation:

  • File format/technology obsolescence
  • Technology fragility
  • Lack of understanding about digital preservation best practices
  • Authenticity and provenance of the records
  • Declining knowledge about the records in the enterprise, particularly with respect to structured information
  • Uncertainty about the best organizational infrastructures to achieve digital preservation

Digital archiving and preservation are needed to ensure the authenticity, integrity, and protection of electronic records despite limited resources and a constant stream of new complex technologies.

Good news! There are tools that help during electronic records appraisal and preparation with file identification, file format migration, and metadata extraction. An organization does not need to use all of these tools - some overlap in features - so implementation depends upon an organization's archive capabilities and policies.

Although there are open source digital archive storage solutions, such as Archivematica, this article will not cover open source storage software, as it typically does not handle records retention and is not validated against federal and industry electronic records regulations such as Title 21 CFR Part 11 or GxP for the life sciences industry. Proprietary repository options that can handle these storage needs include EMC’s InfoArchive, HP’s Records Manager, and IBM’s Optim.

Matchbox Tool

The Matchbox tool is able to identify duplicate images. This is a powerful tool, as it can detect duplicate content even when the files themselves differ, for example when an image has been saved in a different format, cropped, or rotated. A related open source tool is GNU Diffutils, which finds the differences between two files. This can be useful to determine the difference between an older draft and a final version, or between two files that were once identical but have been changed by two people. Besides archiving, the Matchbox tool and GNU Diffutils can also be used for file share remediation by identifying duplicates to allow for defensible disposition of redundant data and transitory drafts.
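As a rough illustration of the two ideas in this section, the Python sketch below groups byte-identical files by checksum and produces a line-by-line comparison of two drafts. It is not how Matchbox works: a checksum only catches exact copies, whereas Matchbox compares image content. The function names are hypothetical, and the comparison uses Python's standard difflib module rather than GNU Diffutils itself.

```python
import difflib
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[sha256_of(path)].append(path)
    # Only groups with more than one member are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]


def diff_drafts(old_text: str, new_text: str) -> list[str]:
    """Return a unified diff between an older draft and a final version."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="draft",
            tofile="final",
            lineterm="",
        )
    )
```

Running find_exact_duplicates over a file share gives a candidate list for review; whether a duplicate can be disposed of remains a records management decision.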

DROID (Digital Record Object Identification)

DROID is a software tool created by the National Archives of the United Kingdom for the identification and standardization of file formats and for metadata extraction. DROID is able to identify the exact file format version of digital objects. DROID links its identification of file formats to the authoritative internet-based file format registry PRONOM. DROID and PRONOM can be used as the basis for your enterprise archive’s file format policy to ensure that you only ingest - add to your digital archive - files in open, widely available formats for long term preservation.
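To make the idea of format identification concrete, here is a minimal Python sketch that recognizes a file by its leading bytes instead of trusting its extension. The signature table is a small hand-written assumption for illustration; it is not PRONOM data, and the functions are hypothetical, not part of DROID.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, hand-written signature table for illustration only.
# A real registry such as PRONOM holds far more detailed signatures.
SIGNATURES: list[tuple[bytes, str]] = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_bytes(header: bytes) -> str:
    """Return a format name based on the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def pdf_version(header: bytes) -> Optional[str]:
    """Read the version from a PDF header such as b'%PDF-1.4'."""
    if header.startswith(b"%PDF-") and len(header) >= 8:
        return header[5:8].decode("ascii", errors="replace")
    return None


def identify_file(path: Path) -> str:
    """Identify a file on disk from its first 16 bytes."""
    with path.open("rb") as handle:
        return identify_bytes(handle.read(16))
```

A file named report.doc that starts with %PDF- would be reported as PDF here, which is the kind of mismatch a file format policy needs to catch before ingest.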


Xena

Xena stands for “XML Electronic Normalising for Archives” and was developed by the National Archives of Australia as part of its Digital Preservation Software Platform. Normalising is the conversion of digital files to a range of preservation formats that are open, well supported, universally available, and likely to remain viable for a long period. Xena is similar to DROID in that it detects the file format of a digital object. However, Xena goes one step further and also transforms digital files into open formats for long term preservation. Xena is a good tool in the battle against file format obsolescence, as its conversion ability mitigates the risk of not being able to access files years later, especially those in proprietary formats.

ePADD (email)

ePADD was developed by Stanford University to support the appraisal, ingest, processing, discovery, and delivery processes of email archives. A unique feature of ePADD is that the full texts of emails are only accessible from one site. This capability was created for historical archives to deal with copyright, where the full text could only be read at the library or archives. However, companies can leverage this tool as well to stay compliant with regulations such as HIPAA and to deal with issues of privacy and security. Archived emails could be viewed with sensitive information and PII (Personally Identifiable Information) redacted, with the complete text accessible only to users at one location. Remote users can be granted full access if needed. ePADD works as a viewer not only for archived email messages, but also for email attachments, which could be in a wide variety of file formats.
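The redacted-view idea can be sketched in a few lines of Python. This is an assumption-laden illustration and not ePADD's implementation: the two patterns (email addresses and U.S. Social Security number shapes) and the full_access flag are hypothetical, and real PII detection needs far more than two regular expressions.

```python
import re

# Hypothetical patterns for illustration; real PII detection is broader.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str, full_access: bool = False) -> str:
    """Return the text with matching PII masked unless full access is granted."""
    if full_access:
        return text
    text = EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)
    return SSN_PATTERN.sub("[REDACTED SSN]", text)
```

The stored message stays untouched; only the view served to a user without full access is masked.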

Web Curator Tool

The Web Curator Tool was developed for archiving websites as a collaboration between the British Library and the National Library of New Zealand through the International Internet Preservation Consortium. The Web Curator Tool is a workflow management tool for collecting or “harvesting” websites for archiving. It can capture descriptive metadata and schedule when and how often a target website should be harvested.

Many of the current digital preservation tools were developed for records held in museums, historical and educational archives, and cultural institutions, which, like the private sector, face digital curation and preservation challenges. While developed for more traditional archives, this software can also provide value to an enterprise archive: it comes at no cost, was designed for digital preservation activities across the ISO Open Archival Information System (OAIS) Reference Model, and can be altered to suit company or industry-specific needs. Before using these tools in your enterprise, be sure to check all licensing details for commercial use, and make a plan for monitoring both changes to those details and the upgrade and update paths for the tools themselves. If you choose to implement these tools, remember that while they can be modified, you must credit the origin of the software and mark your changes as a different version.

Know Your Tools

Amendments to the U.S. Federal Rules of Civil Procedure (FRCP) took effect this month (December 2015), setting “reasonable steps” as the standard for preservation of Electronically Stored Information (ESI). Having a clearly documented, defined, and consistently followed policy for retention and disposition, litigation holds, archiving, and retrieval of electronic records will help ensure compliance with the amended FRCP.

Keep in mind, digital preservation is constantly evolving with technology, so while perfection may not always be possible, reasonable steps must be taken for long term preservation of ESI.

  • Using software like this in conjunction with your records management and archives policies will allow your enterprise to know what you will ingest during the appraisal process. As a result, time and money won’t be spent on preserving duplicates or information that is past its retention period and has no business value.
  • There should be a file format policy in place for your enterprise archive so that it only ingests formats that are based on open, freely available standards, have current and widespread use, are robust enough to be used on multiple types of hardware, software, and operating systems, and are not patented proprietary formats.
  • Your policies will change over time as digital preservation practices, standards, and technology continue to evolve. Building tools such as these into your archival process following the OAIS Reference Model will lower the risk of retaining files that may not be accessible years down the road during litigation or an audit, and will support compliance with legal requirements and regulations.
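The appraisal checks described in the points above can be expressed as a small decision function. The Python sketch below is only an assumption-based illustration: the approved format list, the Record fields, and the three outcomes are hypothetical placeholders for whatever your own file format and retention policies define.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical allowlist; a real one comes from your file format policy.
APPROVED_FORMATS = {"PDF/A", "TIFF", "XML", "CSV", "PNG"}


@dataclass
class Record:
    name: str
    file_format: str
    created: date
    retention_years: int


def retention_end(record: Record) -> date:
    """Return the date on which the record's retention period ends."""
    target_year = record.created.year + record.retention_years
    try:
        return record.created.replace(year=target_year)
    except ValueError:
        # Created on 29 February and the target year is not a leap year.
        return record.created.replace(year=target_year, day=28)


def appraise(record: Record, today: date) -> str:
    """Decide whether a record should be disposed of, converted, or ingested."""
    if today > retention_end(record):
        return "dispose"  # past retention, so not worth preserving
    if record.file_format not in APPROVED_FORMATS:
        return "convert"  # normalise to an approved format before ingest
    return "ingest"
```

For example, a record created on 2005-03-01 with a seven-year retention period would be marked for disposal when appraised in December 2015, before any effort is spent converting or storing it. A real process would also check for litigation holds before disposing of anything.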

Remember, these tools are only as strong as those who wield them. It is therefore imperative to first have in place the strategy, processes (including in-house support processes for any tools you implement), policies, and archival storage solution for your enterprise archive.

Christopher J. Michael

Christopher Michael joined Paragon in 2014 and is a member of the Information Governance and Compliance practice. While at Paragon he has worked on several electronic archiving and records management projects with a focus on the pharmaceutical industry. Chris is an active member of ARMA International and is Secretary on the Board of Directors for the ARMA Liberty Bell chapter of Philadelphia. He is also a member of the ARMA Young Professionals Advisory Group and the Information Governance Initiative. His next career goal is to become a CRM (Certified Records Manager). Chris holds a BA in English and History from Ursinus College and earned his MLIS (Master of Library and Information Science) with a concentration in Archival Studies from Drexel University in 2013. Chris worked with archival collections at the Philadelphia Archdiocesan Historical Research Center, Haverford College, and Hagley Museum & Library. Prior to Paragon he worked at the Drexel University School of Law Legal Research Center and Villanova University’s Falvey Memorial Library.
