Wednesday 24 May 2017

Future past: researching archives in the digital age

Last week I took part in this research symposium at the Institute of Historical Research in London. It was a great opportunity to find out what other archives are doing about digitization and born digital records, and how academic users of archives are finding their experience. It was a really interesting day, and my notes go on for pages, so I'm going to attempt to pull out some of the common themes that emerged. There were many opportunities during the day to ask questions, get feedback and talk to others, so my notes are a mixture of speakers and thoughts/ideas found from networking.


The hashtag was #digfuturepast and the symposium was recorded and should be available soon on the IHR website.


Barriers to using digital material

  • Paying for content. Digitization is expensive but academic users are used to having "free" access to collections (actually paid for by their institution). Yet digitization has to be paid for somehow, whether through institutions funding it themselves, grant funding or commercial companies providing a paid-for service (eg Ancestry).
  • Making copies available. Gone are the days when a student or academic would come into an archive every day for a week or a month to do their research. Pressures of time mean they want to make the most of a single visit and be able to take copies away with them or download copies to use at home, yet it is impossible to digitize everything, and there are various reasons why copies may not be allowed at all, eg copyright, commercial sensitivity or preservation.
  • Poor documentation and/or OCR mean that researchers can't find what they're looking for. They may miss relevant items in a plethora of search results, or not get the result they need at all. A reliance on keyword searching misses the opportunity to search the collection more widely and loses the connection between archival sources.
  • Lack of a seamless user experience makes it hard to use the material, eg legacy systems, different systems for library/archive material, systems not optimised for finding archival material.
  • Information literacy issues. We can't always assume that researchers will know how to search in our system, so we need to equip them with the tools to do this. We also need to address the common misconceptions found below.


Misconceptions about online access to archives

    • Any online resource is complete and comprehensive. Many only represent a tiny fraction of an archive's holdings, so how do we alert users to this and encourage them to look beyond the digital? It is impossible to digitize everything, due to copyright, staff and equipment resources, having metadata available, issues with storing electronic files etc.
    • Everything will be catalogued. No, digitizing is not the same as cataloguing. Most (all?!) archives have a cataloguing backlog, and, until the material is catalogued, there is no way to access it. This then gives rise to the question about whether it is better to spend resources digitizing some already catalogued material, or catalogue unlisted material that cannot be used at all yet.
    • Digitized version is just the same as the original. No, frequently this isn't the case and there are users who will still need to see the original. This is also one of the reasons why it is vital never to destroy the original.


    Educating researchers

    Time and again the need to educate researchers came up. It was agreed by all present that this is a vital part of training as a historian and that it should be done as early as possible in an academic career. I was pleased by this as we are already doing several of the suggested activities to encourage researchers to engage with our collections, including:

    Case studies

    • The archivist from Boots Heritage explained how Boots had moved from an entirely internally-focussed business archive to one that was available to researchers, thanks to funding from the Wellcome Trust to develop a new digital resource aimed at academic researchers. She had found that getting the right tools was essential, so proper cataloguing software (CALM) had been acquired and material was catalogued to stringent standards to make it helpful and meaningful, including creating authority files to be a repository of information about buildings, brands and people. For many researchers this has turned out to be the entry point into the collections. Preservation issues affected the usability of some items, and repackaging them into smaller units made them much easier to use. Care had to be taken to protect Boots' interests, so images are watermarked and download prevented, and commercially sensitive information is not available.
    • Transport for London archives are aiming to collect the evidence that every journey matters, including the digital output of the organisation. They took the opportunity presented by needing to archive born digital material to overhaul and restructure their cataloguing. Although this was resource heavy it has created a more useable catalogue for staff and made it much more available to researchers.
    • Kathleen Chater talked about her research into black people in England in the 18th century and how digitized records hadn't helped her solve research problems such as identifying where "black" didn't refer to a person, or those instances where a black person was identified using another term. Keyword searches frequently produced unusable quantities of results. One of the more helpful things she did was spend three months going through 10,000 Old Bailey records on microfilm, which also gave her the helpful context of many other cases (eg how common was it for anyone to be convicted of a particular crime). Although the Old Bailey records have now been digitized, they are difficult to search because of OCR problems (the long s) and context is lost.
    • Jo Pugh, a digital development manager at The National Archives, discussed his PhD research in information journeys in archival collections. He related how the problem now isn't amassing information, but restricting what we see. His research had compared how enquiries are formulated on email, phone calls or Twitter and had looked at how the experts (archivists) worked with researchers to resolve archival queries. He had found that research guides could help to reduce uncertainty, eg by explaining how to get the best out of a search.
    • Tom Scott from Wellcome Collections explained how the context of their collections isn't just medical and so users don't know what's in the collections. Searching digitized collections meant items were isolated from their context: "searchable but not understandable". They wanted to provide access by having a good reading experience, whether in person or online, so had tried to "encapsulate a librarian": building a single domain model from a mix of systems for books, archives etc, and extracting the meaning of enquiries (eg cross references for TB/consumption/tuberculosis). He stressed that it is really important to record the metrics of what people are actually searching for.
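Two of the search problems mentioned in the case studies above (OCR misreading the long s, and cross references such as TB/consumption/tuberculosis) can be illustrated with a small query-expansion sketch. This is only my own illustration, not how any of the speakers' systems work: the synonym table and function names are made up, and the long s rule is simplified to "every s except the last letter gets read as f".

```python
# Hypothetical synonym table: groups of terms that should retrieve the same material.
SYNONYM_GROUPS = [
    {"tb", "consumption", "tuberculosis"},
]


def long_s_variant(word):
    """Return the word as OCR might render it if each long s was read as 'f'.

    Simplifying assumption: the long s is used everywhere except the final letter.
    """
    if len(word) < 2:
        return word
    body, last = word[:-1], word[-1]
    return body.replace("s", "f") + last


def expand_query(term):
    """Return a sorted list of the term, its synonyms, and their OCR variants."""
    term = term.lower()
    terms = {term}
    for group in SYNONYM_GROUPS:
        if term in group:
            terms |= group
    # Add the mis-OCR'd spelling of every term collected so far.
    terms |= {long_s_variant(t) for t in terms}
    return sorted(terms)


print(expand_query("consumption"))
```

Searching for every term in the expanded list, rather than the single keyword typed in, is one way a catalogue search could catch "confumption" in an 18th-century text as well as records described under "tuberculosis".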
    The symposium rounded off with a discussion of how we could futureproof our collections. My takeaways from the day are:


    • Keep doing our existing work on educating researchers as early as possible, and look at how we can expand that with the resources we have.
    • "Futureproofing requires quality cataloguing" - making sure our cataloguing is up-to-scratch.
    • Assess any digitization project to ensure that high quality metadata is in place first and that it will support the needs of researchers wanting to use our collections.



    Wednesday 3 May 2017

    Webinar: preparing to digitise your archives

    Long time no blog! I've been on maternity leave, and am planning to write some reflections on that and returning to work soon. But, in the meantime, here's my write up of a webinar I took part in last week from The National Archives. As with the previous webinar I've taken part in, on forward planning, this was a great opportunity to learn more about a topic in a free and easy format, as it only took an hour of my time at work and there was no need to even leave the building!

    It was clearly structured and covered the basics of planning a digitisation project. This is my summary of the contents:

    Scope your project
    • Spend time deciding what to include and exclude in your project. Digitisation is costly so avoid creating extra work by trying to digitise too much. Be focussed!
    • Start with a small pilot, digitise a small sample and run it through all of the digitisation processes.
    • Consider possible outputs. TIFF files are the most sensible format to capture for the master copy, with 300PPI for most paper originals and 400-600PPI for photographs. PDF is not recommended.
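To get a feel for what those resolutions mean in practice, here is a rough sketch (mine, not from the webinar) that works out the pixel dimensions and uncompressed size of a single scan. The function name is made up, and the 3 bytes per pixel assumes 24-bit RGB colour; real TIFF files will differ depending on bit depth and compression.

```python
def scan_size(width_in, height_in, ppi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned image.

    bytes_per_pixel=3 assumes 24-bit RGB colour; this is an illustrative default.
    """
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# An A4 sheet is 210 x 297 mm, roughly 8.27 x 11.69 inches.
for ppi in (300, 600):
    w, h, size = scan_size(8.27, 11.69, ppi)
    print(f"{ppi} PPI: {w} x {h} px, about {size / 1_000_000:.0f} MB uncompressed")
```

Doubling the PPI quadruples the number of pixels, which is worth bearing in mind when deciding whether the higher setting is really needed.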
    In-house or outsourced?
    This decision depends on the size of the project, type, budget and internal capacities. The pros and cons are:
    Outsourcing
    Pros: Can be cheaper, technical knowledge isn't needed, less stress for staff, saves time  
    Cons: Less control over project, relocation of collection/providing access to the material, fragile or sensitive material, restrictions on rescoping the project once it is underway.
    In-house
    Pros: More control, staff skill development, may save money in the long run, keeps collections in one place
    Cons: Lack of in-house skills, big investment in equipment needed, lack of suitable infrastructure, no in-house experience

    If considering outsourcing: shop around, get quotes and look at each company's existing work. Visit their site and check their set up. Ask for samples early on in the project and have regular project catch ups. Make sure you have a contract.

    Document preparation
    Preservation/conservation: Assess condition of the collection and whether work by a conservator is needed in order to digitise without damaging the originals. Remove all metal pins, clips etc. Digitisation can take place through Melinex sleeves. How are you going to digitise books safely - unbind the volume, use a camera rather than a scanner etc?

    Consider capture and post-processing equipment
    There are pros and cons to using cameras and scanners.
    Document preservation: a camera provides more alternatives to capturing the image without causing damage
    Image quality: cameras tend to produce better results
    Price: bear in mind that equipment needs to be kept up-to-date (this should be factored into the cost of outsourcing). Depending on the size of the project, renting equipment may work out cheaper.
    Useability: scanners tend to be more straightforward to use with fewer settings. Cameras require colour calibration and that the lens be kept clean.
    Versatility: scanners work well with flat materials, but aren't suitable for digitising books. Cameras tend to offer more versatility.

    Post processing
    Images are usually captured in RAW format and then need to be processed. RAW files are very large, so this needs to be considered when assessing file storage needs. The file format must also be compatible with the image processing and storage software being used.

    Metadata and storage
    Technical metadata is included at the capture stage, for instance camera settings, focal length, exposure. It may be embedded within the image and then shared in a spreadsheet.
    Descriptive metadata is the description of what the item is, such as names, dates and places so that the digitised image is discoverable. It can be captured by OCR (although this has severe limitations) or manually (time consuming and expensive).
    Storage: ensure you have the basics, such as a server large enough to store the files and a means of backing them up.
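As a back-of-an-envelope check on whether a server is "large enough", something like the following sketch could be used. It is my own illustration rather than anything presented in the webinar, and every figure in it (image count, file sizes, number of copies) is hypothetical and should be replaced with numbers from a pilot.

```python
def project_storage_gb(image_count, raw_mb, master_mb, copies=2):
    """Estimate total storage in gigabytes (taking 1 GB as 1000 MB).

    copies counts every stored copy of each file,
    eg 2 = one working copy plus one backup.
    """
    per_image_mb = raw_mb + master_mb
    return image_count * per_image_mb * copies / 1000


# Hypothetical pilot: 500 images, each with a 40 MB RAW file and a 26 MB TIFF master.
print(f"{project_storage_gb(500, 40, 26):.1f} GB")
```

Running the numbers from a small pilot through a calculation like this before the main project starts should show early on whether existing storage and backup arrangements will cope.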

    What I've learnt and will take forward:
    Visit other archives/Special Collections to learn from their experiences.
    Keep it as simple as possible and only capture what is relevant. 
    Know what the outcomes of the project are before commencing image capture. 
    Never destroy the originals after digitisation, unless they are acetate negatives.