ocr postprocessing #6

Merged 2 commits on Jun 17, 2024
1 change: 0 additions & 1 deletion digitization/av/av_bestpractices.md
@@ -54,7 +54,6 @@ Digitized av content is often multiple components. These components should be pa

Example SIP for a/v content:


```
ms2374_s2_c107d_f7_i1
├── ms2374_s2_c107d_f7_i1_001.mov
4 changes: 2 additions & 2 deletions digitization/av/av_records.md
@@ -7,13 +7,13 @@ grand_parent: Digitization
nav_order: 3
---

-When digitizing audiovisual carriers we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.
+When digitizing audiovisual carriers, we should update finding aids to reflect new information gained from the digitization process. Use the following fields in ArchivesSpace to hold new information.

## Scope and Contents
Any content notes derived during the digitization process. If we are doing a monitored transfer, we should record notes about the aboutness of the recording.

Examples:
-"David Einsenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
+"David Eisenhower gives speech and takes questions from the audience at the alumni association dinner in 1987."
"1991 Convocation honoring Ronald and Nancy Reagan 10 years after Ronald Reagan was shot and brought to GW hospital for treatment."

## Physical Description
2 changes: 1 addition & 1 deletion digitization/imaging/imaging_bestpractices.md
@@ -11,7 +11,7 @@ nav_order: 1

This is not policy, but rather a guiding document. For various reasons, projects might not be able to fulfill these recommendations. At present, technical debt related to storage and access systems makes these recommendations difficult to fulfill.

-These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized audio and moving image material.
+These specifications remain aspirational, but serious efforts should be made to adhere to them to better ensure the long-term preservation and usability of digitized texts and graphics.

## Documents and manuscripts (unbound)

11 changes: 10 additions & 1 deletion digitization/imaging/ocr.md
@@ -5,4 +5,13 @@ permalink: /imaging_ocr/
grand_parent: Digitization
parent: "Digitization: Imaging Text and Graphics"
---
-test
+# Optical Character Recognition (OCR) for Text Based Documents
+
+
+## Adobe Acrobat
+
+## [Tesseract](https://github.com/tesseract-ocr/tesseract)
+
+### [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)
+
+OCRmyPDF uses Tesseract to generate a searchable [PDF/A ](https://en.wikipedia.org/?title=PDF/A) file from a PDF or images.
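
As a rough illustration of the OCRmyPDF step described above, the sketch below builds and runs an `ocrmypdf` command that writes a searchable PDF/A. It is not part of this pull request. The helper names `build_ocrmypdf_command` and `run_ocr`, the file names, and the default language `eng` are assumptions made for the example; it also assumes the `ocrmypdf` command-line tool is installed and on the PATH.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_path, output_path, language="eng"):
    """Return the argument list for one ocrmypdf run (hypothetical helper).

    --language selects the Tesseract language pack and
    --output-type pdfa asks for PDF/A output.
    """
    return [
        "ocrmypdf",
        "--language", language,
        "--output-type", "pdfa",
        str(input_path),
        str(output_path),
    ]


def run_ocr(input_path, output_path, language="eng"):
    """Run ocrmypdf, failing early if the tool is not installed."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    command = build_ocrmypdf_command(input_path, output_path, language)
    # check=True raises CalledProcessError if ocrmypdf exits with an error.
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder file names; replace with a real scan before running.
    print(" ".join(build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf")))
```

Keeping the command construction separate from the call that runs it makes it easy to review the exact arguments before processing a batch of scans.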
5 changes: 2 additions & 3 deletions managing/accessupload.md
@@ -31,9 +31,8 @@ Presently, GWU Special Collections only ingests access copies of digital collect
- You may use the [ArchivesSpace_to_InternetArchive script]() to generate metadata from the archival description for each record.
- If you use this script, you should still review the metadata before upload. The script pulls description from ancestor records (series, resource, etc.) that may not be appropriate for what is represented by the digital content.
- An example of this might be rights information. The script will pull the rights statement from the resource (collection) record. This may not apply to item-level description for the digital content.
-- You may also include additional derivatives (SRT/VTT caption files, text files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative.
-- Note that the Internet Archive generates OCR for text-based documents automatically. We cannot overwrite this generated OCR. We can still upload any corrected OCR, but it will not be used for full-text search results.
-![CSV screenshot](/assets/images/sidecar_upload.PNG)
+- You may also include additional derivatives (SRT/VTT caption files, full text txt files) in your upload. To do so, add an additional row to the CSV for each identifier. Match the identifier to the file of the additional derivative.
+![CSV screenshot](/assets/images/sidecar_upload.png)
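
The extra-row layout described in the list above can be sketched in Python. This is only an illustration, not the project's script: the column names `identifier` and `file`, the helper names, and the example file names are assumptions, so check them against the screenshot and your own spreadsheet before use.

```python
import csv
import io


def sidecar_rows(identifier, primary_file, derivative_files, metadata=None):
    """Return one row for the primary file plus one row per derivative.

    Derivative rows repeat the identifier so each extra file is matched
    to the same item; metadata is carried on the first row only.
    """
    rows = [{"identifier": identifier, "file": primary_file, **(metadata or {})}]
    for derivative in derivative_files:
        rows.append({"identifier": identifier, "file": derivative})
    return rows


def write_csv(rows, handle):
    """Write the rows to an open text handle, leaving missing cells empty."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical item with a caption file and a full-text file.
    rows = sidecar_rows(
        "example_item_001",
        "example_item_001.mp4",
        ["example_item_001.vtt", "example_item_001.txt"],
        metadata={"title": "Example title"},
    )
    buffer = io.StringIO()
    write_csv(rows, buffer)
    print(buffer.getvalue())
```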

### Uploading Content to Internet Archive
- Once both your files and metadata are prepared, you are ready to upload! To do so, you can use your command line interface to start the upload. In your command line (PowerShell, Terminal, etc.), navigate to the directory with the files and your CSV spreadsheet.