It’s the second week of the Introduction to Digital Curation module, and today we had a practical lesson on using DROID and understanding digital file formats. This post explores what I took in and learned from the practical lesson, along with some of my musings on the subject. (I feel that if I can explain what I’ve learned, I’ll understand it better.)
DROID (Digital Record Object IDentification) is a free, open-source software tool designed by The National Archives (UK) for identifying the file formats of batches of digital files. It can tell us which format each file is in, flag files that do not match their purported file extension, identify duplicate files in a batch, generate hashes for digital files, and provide a list of the digital holdings in a particular batch. Files can be run through DROID again and again. Because DROID generates a hash, or checksum, for each file, the checksum from one run can be compared with the checksum from an earlier run, which tells you whether a file has changed in some way in between. For archivists, DROID can be extremely useful in providing a list of what we have and creating some metadata for us. It is also fairly easy to use once you’ve had a play around and got used to its capabilities. However, drilled into us in the lecture was the need to understand what the tool is doing.
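To make the checksum idea concrete for myself, here is a minimal Python sketch of the same kind of check. It is not how DROID itself is implemented; the function names `file_checksum` and `changed_files` are my own, and I have assumed MD5 as the algorithm purely for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks to save memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(folder, previous):
    """Checksum every file under `folder` and compare with an earlier run.

    `previous` is a {relative_path: checksum} dict saved from the last run
    (a hypothetical format chosen for this sketch).
    Returns the new checksums and the sorted names of files whose checksum differs.
    """
    current = {
        p.relative_to(folder).as_posix(): file_checksum(p)
        for p in sorted(Path(folder).rglob("*"))
        if p.is_file()
    }
    changed = sorted(
        name for name, checksum in current.items()
        if name in previous and previous[name] != checksum
    )
    return current, changed
```

Saving the first result and passing it back in as `previous` on a later run would list any files whose contents have changed in between, even by a single byte.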
DROID is able to identify digital file formats by using information stored in PRONOM, a file format registry also designed by The National Archives (UK). PRONOM is not the only file format registry in existence, but it’s the registry most commonly used in the UK.
“By definition, electronic records are not inherently human-readable. File formats encode information into a form which can only be processed and rendered comprehensible by very specific combinations of hardware and software. The accessibility of that information is therefore highly vulnerable in today’s rapidly evolving technological environment. This issue is not solely the concern of digital archivists, but of all those responsible for managing and sustaining access to electronic records over even relatively short timescales.
Technical information about the structure of those file formats, and the software products which support them, is therefore a prerequisite for any digital preservation regime. PRONOM was developed to provide this, initially as an internal resource for The National Archives’ staff, but later made publicly available for anyone to use.”
The National Archives (UK), 2017.
So, PRONOM is the registry that holds the information about digital file formats, and DROID pulls its information from it.
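As I understand it, much of this identification rests on signatures: characteristic byte sequences that files of a given format contain. The Python sketch below illustrates the general idea only. The `SIGNATURES` table is a tiny hand-written stand-in of my own, not PRONOM data, and the real signatures are far more detailed than a simple check of the first few bytes.

```python
from pathlib import Path

# Illustrative table only: (leading bytes, format name, usual extensions).
# This is an assumption-laden stand-in, not an extract from PRONOM.
SIGNATURES = [
    (b"%PDF-", "PDF document", {".pdf"}),
    (b"\x89PNG\r\n\x1a\n", "PNG image", {".png"}),
    (b"\xffWPC", "WordPerfect document", {".wpd", ".wp", ".wp5"}),
    (b"PK\x03\x04", "ZIP container", {".zip", ".docx", ".xlsx"}),
]


def identify(data, filename):
    """Guess a format from the leading bytes of a file.

    Returns (format_name, extension_mismatch). format_name is None when no
    signature in the table matches, in which case no mismatch is reported.
    """
    extension = Path(filename).suffix.lower()
    for magic, name, usual_extensions in SIGNATURES:
        if data.startswith(magic):
            return name, extension not in usual_extensions
    return None, False
```

For example, a file named `letter.doc` whose first bytes are `\xffWPC` would come back as a WordPerfect document with the mismatch flag set, which is the kind of file that does not match up with its purported extension.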
The practical session
Our practical class was unstructured so that we could play around with DROID and work out its functions and capabilities, although our lecturer had set us some tasks to have a go at, based on a given zipped file:
- Download the zipped file of personal documents from the mid 1990s. Try to work out what format they are in. Run them through DROID. Try to open them. Try to find out about any formats you are not familiar with and how you might be able to open them if you can’t.
- Export the results of your DROID scan in .csv format and then import the results into Excel. Manipulate the data to look for possible duplicates.
- What other tools can you find which carry out file profiling and identification etc?
I had watched an online video and read a document on how to use and interpret DROID in advance of the class. The online video was really helpful, as the narration talked through how to use DROID whilst the video showed what was happening. This was particularly useful as I started the practical lesson with a basic idea of what to do. Doing the tasks in person was slightly harder, and I had to keep referring back to the notes I’d made from the video about which button to click and how to set particular preferences. Once I was familiar with what to do, it was actually fairly straightforward. It was interesting to see how different files are identified by their particular format, and by version. Exporting the results into .csv format and importing the .csv into Microsoft Excel was fairly straightforward too, and once the results were in Excel, you could filter them however you liked.
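The duplicate hunt can also be done outside Excel. Here is a rough Python sketch that groups the rows of an exported .csv by checksum. I am assuming the export has columns called `MD5_HASH` and `FILE_PATH`; the hash column name depends on which algorithm was switched on in the preferences, so both names are parameters you may need to change.

```python
import csv
from collections import defaultdict


def find_duplicates(csv_path, hash_column="MD5_HASH", path_column="FILE_PATH"):
    """Group rows of an exported CSV by checksum and keep groups of two or more.

    The default column names are assumptions about the export layout.
    Rows with an empty checksum (folders, for instance) are skipped.
    """
    groups = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            checksum = (row.get(hash_column) or "").strip()
            if checksum:
                groups[checksum].append(row.get(path_column, ""))
    return {checksum: paths for checksum, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates("droid_export.csv")` (a made-up file name) would return a dictionary in which each checksum maps to the list of file paths sharing it. Identical checksums mean identical contents, even when the file names differ.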
For me, DROID is a really useful tool for identifying duplicates in any batch of files. For archivists, this can be useful when accessioning digital files, as you can identify any duplicate files that would not need to be kept. Being able to identify the file format is useful too. In our practical, DROID was able to identify files in a WordPerfect format. The WordPerfect software is not used by University College London, so I could not open these files in their original software. However, having identified the type of document (a word-processed document), I was able to open a file using Microsoft Word. There is an issue with provenance here. In migrating the original WordPerfect document into a Word document, the original word-processed information can be accessed, but something has been lost along the way, as the information is not being accessed in the same way as it was created. It’s not all bad news though: accessing the information in a different format is better than not being able to access the information at all!
Header image from Info Tehna.