diff --git a/digitization/av/av_bestpractices.md b/digitization/av/av_bestpractices.md
index 4c02a82..0f1b037 100644
--- a/digitization/av/av_bestpractices.md
+++ b/digitization/av/av_bestpractices.md
@@ -54,7 +54,6 @@ Digitized av content is often multiple components. These components should be pa
 Example SIP for a/v content:
-
 ```
 ms2374_s2_c107d_f7_i1
 ├── ms2374_s2_c107d_f7_i1_001.mov
diff --git a/digitization/av/av_records.md b/digitization/av/av_records.md
index 6ee007d..2d43752 100644
--- a/digitization/av/av_records.md
+++ b/digitization/av/av_records.md
@@ -7,13 +7,13 @@ grand_parent: Digitization
 nav_order: 3
 ---
-When digitizing audiovisual carriers we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.
+When digitizing audiovisual carriers, we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.
 ## Scope and Contents
 Any content notes derived during the digitization process. If we are doing a monitored transfer, we should record notes about the aboutness of the recording. Examples:
-"David Einsenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
+"David Eisenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
 "1991 Convocation honoring Ronald and Nancy Reagan 10 years after Ronald Reagan was shot and brought to GW hospital for treatment."
 ## Physical Description
diff --git a/digitization/imaging/imaging_bestpractices.md b/digitization/imaging/imaging_bestpractices.md
index 9c60899..40545f6 100644
--- a/digitization/imaging/imaging_bestpractices.md
+++ b/digitization/imaging/imaging_bestpractices.md
@@ -11,7 +11,7 @@ nav_order: 1
 This is not policy, but rather a guiding document. For various reasons, projects might not be able to fulfill these recommendations. At present, technical debt related to storage and access systems makes these recommendations difficult to fulfill.
-These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized audio and moving image material.
+These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized texts and graphics.
 ## Documents and manuscripts (unbound)
diff --git a/digitization/imaging/ocr.md b/digitization/imaging/ocr.md
index 6255734..809a040 100644
--- a/digitization/imaging/ocr.md
+++ b/digitization/imaging/ocr.md
@@ -5,4 +5,13 @@ permalink: /imaging_ocr/
 grand_parent: Digitization
 parent: "Digitization: Imaging Text and Graphics"
 ---
-test
\ No newline at end of file
+# Optical Character Recognition (OCR) for Text Based Documents
+
+
+## Adobe Acrobat
+
+## [Tesseract](https://github.com/tesseract-ocr/tesseract)
+
+### [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)
+
+OCRmyPDF uses Tesseract to generate a searchable [PDF/A ](https://en.wikipedia.org/?title=PDF/A) file from a PDF or images.
diff --git a/managing/accessupload.md b/managing/accessupload.md
index fb1cd76..8ee58e3 100644
--- a/managing/accessupload.md
+++ b/managing/accessupload.md
@@ -31,9 +31,8 @@ Presently, GWU Special Collections only ingests access copies of digital collect
 - You may use the [ArchivesSpace_to_InternetArchive script]() to generate metadata from the archival description for each record.
 - If you use this script, you should still review the metadata before upload. The script pulls description from ancestor records (series, resource, ect.) that may not be appropriate for what is represented by the digital content.
 - An example of this might be rights information. The script will pull the rights statement from the resource (collection) record. This may not apply to item-level description for the digital content.
-- You may also include additional derivatives (SRT/VTT caption files, text files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative.
- - Note that the Internet Archive generates OCR for text-based documents automatically. We cannot overwrite this generated OCR. We can still upload any corrected OCR, but it will not be used for full-text search results.
-![CSV screenshot](/assets/images/sidecar_upload.PNG)
+- You may also include additional derivatives (SRT/VTT caption files, full text txt files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative.
+![CSV screenshot](/assets/images/sidecar_upload.png)
 ### Uploading Content to Internet Archive
 - Once both your files and metadata are prepared you are ready to upload! To do so, you can use your command line interface to start the upload. In your command line (PowerShell, Terminal, ect.) navigate to the directory with the files and your CSV spreadsheet.