Manuscripts, the dream of OCR, and more

I deal daily with a sizeable collection that is, for the most part, composed of manuscripts (~1815-1940).

In our current pre-project of manual data collection (to produce finding aids), we estimated a delivery date about 100 years from now, and even that is optimistic…

We face obstacles such as:

documents that are falling apart;

pathogenic fungi in the collection, which make the work extremely unhealthy and lead to astronomical spending on PPE;

infernal heat at the storage site.

Our current plan is to put together a project to photograph part of the documentation (since a scanner would likely take more time and cause more damage), so that the "manual data collection" can then be done from the images anywhere, preferably somewhere air-conditioned.

From the standpoint of preservation for posterity, digital photographs may well last less than the deteriorated paper, but they will ease access (which is the function of an archive) and the work of indexing, transcription, description…

QUESTIONS

Back to the pre-project… Open questions remain about how much "data" we should collect: enough that the work does not take 100 years, but also enough that we are not forced to go back and look at everything again to extract new data…

Could OCR be used to recognize the manuscripts? Probably not, but could we run OCR anyway and recover something more or less correct from the manuscripts that could serve as provisional metadata?

I searched quickly for material on OCR processing of historical documents, but found a lot about old typefaces and almost nothing about manuscripts. For example:

OCR Challenges on Historic Documents

Historical Documents in a Digital Library: OCR, Metadata, and Crowdsourcing

The second text, from the blog Lemonade & Information, is quite interesting: it discusses optical character recognition (OCR) of old printed documents (whose typefaces are not always well recognized), the creation of metadata from old indexes, and the collaborative correction of OCR-processed text.

Some excerpts from it follow:

For a researcher, to profitably use a big digital collection of historic materials, he or she needs to be able to search the contents, to winnow down centuries of text. In other words he or she needs either quality OCR or quality metadata. For a large corpus, if you have a collection that is well-OCR’d, then you can get by without robust metadata. But if you have a collection that is poorly OCR’d, text search will not work — you need to have robust metadata for the library to be useful at all. (LEMONADE & INFORMATION, 2010)

This can be contrasted to a pre-digital form of searching: the index. An example is the comprehensive index to the Virginia Gazette from 1737 to 1790, prepared by historians at Colonial Williamsburg in 1950. In this index are contained references to proper names (people and places) and subject terms. (Colonial newspapers generally were populated by anonymous or pseudonymous pieces, so no authors.) An index like this, rigorously compiled and checked, provides a very different profile: very high precision, moderate to good recall, depending on the rigor, and low fallout. (LEMONADE & INFORMATION, 2010)

What the Virginia Gazette index provides, in essence, is metadata. In a pre-digital world, this was the only way of “searching” the corpus. But in a world of digital libraries, such an index would seem unnecessary. And perhaps it would be, were the online text of newspapers acceptably accurate. (LEMONADE & INFORMATION, 2010)

When digitizing the eighteenth-century run of the Virginia Gazette, the digital humanities specialists did not even seriously consider putting searchable text online. OCR was quickly found not to work well on the microfilm versions of the newspaper, and costs to have the text inputted manually were far beyond their budget. Instead they went through a laborious process of scanning and OCRing the index (which, typeset in Courier in the mid-twentieth century, could be done with high accuracy). They then placed the index online in HTML format, with links leading to the scanned images of newspaper pages. In this they were helped by another feature of the print index: it listed not just the issue date, but the page and column of the entry. (LEMONADE & INFORMATION, 2010)

The creators envisioned a workflow that took advantage of the diligent labor of the mid-century index compilers and married it to the speed and convenience of the digital library. When working with the digital Virginia Gazette, a researcher would first search the index web page for a relevant term. Then he or she could tab back and forth between the index and a set of open images, quickly running through a list of results. All in all, the technique was successful; the disadvantage, of course, is that it is not so easy to read a run of consecutive issues, or even consecutive pages. (LEMONADE & INFORMATION, 2010)

OCR might be one option: it can read article titles with a moderate degree of accuracy, and, if it could pick out proper nouns with any consistency whatsoever, could index those. But, given the poor quality of the microfilm that is used to make scans of newspaper pages, OCR simply can’t cope with the demands. The amount of cleanup required would mean that librarians might as well just read the articles and index the text themselves. At least in this way they could index concepts and make a true subject index — not something that literal OCR software can do. (LEMONADE & INFORMATION, 2010)

What is to be done? The Australian Newspaper Digitisation Program came up with an innovative solution: crowdsourcing. They made it possible for users to “view, edit, correct and save OCR text on whatever article they happened to be looking at.” Knowing that particular documents had unusually bad OCR, they highlighted those images to encourage patrons to improve them. The crowdsourcing was an instant success. Within three months of the project’s launch 1200 individuals had edited “700,000 lines of text within 50,000 articles.” Further, the volunteer correctors were — based on information from that two-thirds who had registered for accounts rather than working anonymously — largely experts in the places and time period covered in the newspapers. This meant they were better able to use context to puzzle out difficult words. (LEMONADE & INFORMATION, 2010)

So in at least two cases crowdsourcing has worked as a way to produce usable, index-ready text from image files and low-quality OCR. Old newspapers are but one source for which this technique has potential. Other printed materials could be made accessible, and beyond print is manuscript. Historical archives in the United States and elsewhere are notoriously conservative institutions. But it would take relatively little effort and not much more in the way of resources for them to provide the materials that could generate their own online community of researchers. It would be enough to provide a digital library of reasonably decent image files of manuscripts, and a web interface that allowed researchers to transcribe the material for their own use while also saving the transcription for other patron’s benefit. Allow users to create tags for the material — as the Australian project does — and you also have the beginnings of a robust set of metadata. (LEMONADE & INFORMATION, 2010)

The ideas that stick

The challenge of defining what should be collected manually from the documents: extract as much data as possible, in the least time we can manage, and with some assurance that we will not have to return to the collection for a new round anytime soon. In that sense, the definition of which data to extract should take into account existing indexing and metadata standards and models (Dublin Core, FRBR and/or CIDOC-CRM, NOBRADE, ISDIAH, ISAAR-CPF, ISDF…), the specifics of the collection and of its users, and the optimization of time.
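For a sense of scale, here is a sketch of what a minimal record in simple (15-element) Dublin Core might look like for a single manuscript. Every value below is invented for illustration, and only fields cheap to capture during a fast first pass are filled; the record could later be mapped onto richer standards.

```python
# Sketch: a minimal simple-Dublin-Core record for one manuscript.
# All values are hypothetical; only fields that are cheap to capture
# in a fast first pass are filled.

record = {
    "dc:identifier": "caixa-12/doc-034",   # hypothetical call number
    "dc:title":      "Carta sem título",   # assigned title
    "dc:creator":    "desconhecido",       # unknown until transcribed
    "dc:date":       "1872?",              # approximate, with uncertainty
    "dc:type":       "Text",               # DCMI Type vocabulary
    "dc:format":     "image/tiff",         # of the digital surrogate
    "dc:language":   "pt",
    "dc:coverage":   "Vale do Paraíba",    # place, when legible
    "dc:rights":     "domínio público",
}

# A first pass might require only the fields that need no reading at all:
minimum = {"dc:identifier", "dc:type", "dc:format"}
assert minimum <= set(record)
```

The point of starting from Dublin Core is that it is deliberately small: a student could fill such a record per photograph in minutes, while crosswalks to NOBRADE or CIDOC-CRM remain possible later.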

Could we obtain vestigial metadata with OCR? Anything more or less certain about each document? How could we test this?
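One cheap experiment toward such "vestigial metadata": run an OCR engine that reports per-word confidence (pytesseract's `image_to_data`, for instance, does, when Tesseract is installed) and keep only high-confidence, capitalized tokens as candidate index terms. A minimal sketch; the thresholds and the sample OCR output below are invented for illustration:

```python
# Sketch: distill provisional index terms from noisy OCR output.
# Input: (word, confidence) pairs, e.g. as reported by an OCR engine.
# Only high-confidence, capitalized, alphabetic tokens (likely proper
# nouns) survive the filter; everything else is discarded as noise.

def provisional_keywords(ocr_words, min_conf=60, min_len=4):
    seen, keywords = set(), []
    for word, conf in ocr_words:
        token = word.strip(".,;:!?()[]\"'")
        if (token and conf >= min_conf and len(token) >= min_len
                and token[0].isupper() and token.isalpha()
                and token.lower() not in seen):
            seen.add(token.lower())
            keywords.append(token)
    return keywords

# Simulated (invented) OCR of a 19th-century letter, confidence 0-100:
sample = [("Snr.", 88), ("Joaquim", 85), ("d3", 12), ("Azevedo", 78),
          ("fazenda", 95), ("Paraíba", 81), ("tt~o", 8)]
print(provisional_keywords(sample))  # ['Joaquim', 'Azevedo', 'Paraíba']
```

Even if only a handful of names per document survive the filter, they could seed a provisional index, to be corrected later by human transcribers.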

Collaboration: ask users of the database to transcribe what has not yet been transcribed? In what way? Compulsorily, in order to validate a login? Compulsorily, at regular intervals, to "revalidate" the login? Or simply request contributions?

Build a platform for transcribing documents (or for assembling metadata, i.e., filling out a form) from the images, and take it to history, archival science, library science, and museology courses?
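The core of such a platform need not be elaborate. As a sketch (all names hypothetical): each photographed document becomes a task, any user can submit a transcription, and a task is considered settled once a quorum of independent submissions agree, roughly in the spirit of the Australian project's collaborative correction:

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptionTask:
    image_id: str                                     # photographed document
    submissions: list = field(default_factory=list)   # (user, text) pairs

    def submit(self, user, text):
        self.submissions.append((user, text))

    def needs_review(self, quorum=2):
        """A task remains open until `quorum` independent users agree."""
        texts = [t for _, t in self.submissions]
        return max((texts.count(t) for t in set(texts)), default=0) < quorum

task = TranscriptionTask("caixa-12/doc-034.jpg")
task.submit("aluna1", "Ilmo. Snr. Joaquim de Azevedo…")
task.submit("aluno2", "Ilmo. Snr. Joaquim de Azevedo…")
print(task.needs_review())  # False: two users agree, task is settled
```

Disagreeing submissions keep the task open, which is exactly where an instructor in a history or archival science course could step in as referee.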
