OCR Web Services

This pattern describes a distributed system for Optical Character Recognition (OCR).

Context

There are many open source projects focused on OCR, such as Tesseract, Ocropus and Gamera.
The field is evolving so rapidly that several projects are still in beta or even alpha, with short-lived, unstable releases that are difficult to install and customize.
On the other hand, vendors of commercial OCR software usually show little interest in ancient languages, languages spoken only by linguistic minorities, and documents with rare symbols and/or layouts.

Problem

OCR involves the following entities/resources:

  1. OCR engines
  2. Training sets
  3. Scanned page images (input)
  4. Text documents (output)

The same OCR engines, installed on high-performance computers, could be shared by several users, and the same training sets could be reused for different documents with similar features (same fonts, page quality, etc.).
Sometimes there are multiple scanned images of the same document at different quality levels (e.g. on the Internet Archive). Likewise, the text documents produced by running OCR on the same original document can vary in accuracy and in richness of features (text only, minimal or plain formatting and layout, mapping of the text onto the original image, etc.).

Solution

OCR Web Services can connect the aforementioned entities.

Scenario

OCR engines are installed on clusters of computers behind a front-end dispatcher that receives documents to process, queues them, distributes them to idle nodes and then sends the results back to the original caller.
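
The dispatcher described above can be sketched as a job queue from which nodes pull work whenever they are free. This is only a minimal, single-machine illustration using threads in place of cluster nodes; the names Dispatcher, submit, wait and fake_ocr are hypothetical, and fake_ocr stands in for a call to a real OCR engine.

```python
import queue
import threading


def fake_ocr(page_name):
    # Hypothetical placeholder for a real OCR engine call.
    return f"text of {page_name}"


class Dispatcher:
    """Front-end dispatcher: queues jobs and hands them to free nodes."""

    def __init__(self, num_nodes, engine=fake_ocr):
        self.jobs = queue.Queue()
        self.results = {}
        self.lock = threading.Lock()
        self.engine = engine
        # Each thread plays the role of one node in the cluster.
        self.nodes = [
            threading.Thread(target=self._node_loop, daemon=True)
            for _ in range(num_nodes)
        ]
        for node in self.nodes:
            node.start()

    def _node_loop(self):
        while True:
            # A free node blocks here until a job is available.
            job_id, page = self.jobs.get()
            try:
                text = self.engine(page)
                with self.lock:
                    self.results[job_id] = text
            finally:
                self.jobs.task_done()

    def submit(self, job_id, page):
        # Receive a document and queue it for processing.
        self.jobs.put((job_id, page))

    def wait(self):
        # Block until every queued job is done, then return the results.
        self.jobs.join()
        with self.lock:
            return dict(self.results)


dispatcher = Dispatcher(num_nodes=3)
pages = ["page1.png", "page2.png", "page3.png", "page4.png"]
for job_id, page in enumerate(pages):
    dispatcher.submit(job_id, page)
results = dispatcher.wait()
```

In a real deployment the queue would sit behind a web service endpoint and the nodes would be separate machines, but the pull-when-free pattern stays the same.
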
Training sets are distributed and selectable by the end user.
Zipped folders of page images (or .pdf files) can be sent to the OCR engines.
Resulting text documents can be corrected automatically by aligning and merging them.
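
One possible way to combine several OCR outputs of the same page is word-level alignment followed by majority voting. The sketch below is an assumption about how such merging could work, not a method prescribed by this pattern: it aligns every output against the first one with Python's difflib and keeps, for each word position, the most frequent reading. The function name merge_ocr_outputs and the sample strings are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def merge_ocr_outputs(outputs):
    """Merge OCR outputs by aligning each to the first one and voting per word."""
    reference = outputs[0].split()
    # One vote counter per reference word, seeded with the reference reading.
    votes = [Counter([word]) for word in reference]
    for other in outputs[1:]:
        words = other.split()
        matcher = SequenceMatcher(a=reference, b=words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only same-length 'equal' or 'replace' spans map word to word;
            # insertions and deletions are ignored in this simple sketch.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for offset in range(i2 - i1):
                    votes[i1 + offset][words[j1 + offset]] += 1
    # Ties are resolved in favour of the reading that was seen first.
    return " ".join(counter.most_common(1)[0][0] for counter in votes)


merged = merge_ocr_outputs([
    "The quick brovvn fox",
    "The qu1ck brown fox",
    "The quick brown f0x",
])
print(merged)  # The quick brown fox
```

Each output contains a different recognition error, yet every error is outvoted by the other two readings. Character-level alignment or confidence-weighted voting would be natural refinements.
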

Discussion

Which protocols are best suited to these purposes?
In particular, how can very large quantities of data be moved using open protocols?