<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Digital Philology&#039;s Blog</title>
	<atom:link href="http://digitalphilology.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://digitalphilology.wordpress.com</link>
	<description></description>
	<lastBuildDate>Fri, 16 Oct 2009 20:25:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='digitalphilology.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Digital Philology&#039;s Blog</title>
		<link>http://digitalphilology.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://digitalphilology.wordpress.com/osd.xml" title="Digital Philology&#039;s Blog" />
	<atom:link rel='hub' href='http://digitalphilology.wordpress.com/?pushpress=hub'/>
		<item>
		<title>OCR Web Services</title>
		<link>http://digitalphilology.wordpress.com/2009/06/30/ocr-web-services/</link>
		<comments>http://digitalphilology.wordpress.com/2009/06/30/ocr-web-services/#comments</comments>
		<pubDate>Tue, 30 Jun 2009 15:20:38 +0000</pubDate>
		<dc:creator>digitalphilology</dc:creator>
				<category><![CDATA[Discussion]]></category>

		<guid isPermaLink="false">http://digitalphilology.wordpress.com/?p=9</guid>
		<description><![CDATA[OCR Web Services This pattern discusses a distributed system for Optical Character Recognition (OCR). Context There are many open source projects focused on OCR, such as Tesseract, Ocropus, Gamera, etc. The field is evolving so rapidly that several projects are still in beta or even alpha, with frequent, unstable [...]]]></description>
			<content:encoded><![CDATA[<h2>OCR Web Services</h2>
<p>This pattern discusses a distributed system for Optical Character Recognition (OCR).</p>
<h3>Context</h3>
<p>There are many open source projects focused on OCR, such as <a href="http://code.google.com/p/tesseract-ocr">Tesseract</a>, <a href="http://code.google.com/p/ocropus">Ocropus</a>, <a href="http://gamera.informatik.hsnr.de">Gamera</a>, etc.<br />
The field is evolving so rapidly that several projects are still in beta or even alpha, with frequent, unstable releases that are difficult to install and customize.<br />
On the other hand, vendors of commercial OCR software usually show little interest in ancient languages, in languages spoken only by linguistic minorities, or in documents with rare symbols and/or layouts.</p>
<h3>Problem</h3>
<p>OCR involves the following entities/resources:</p>
<ol>
<li>OCR engines</li>
<li>Training sets</li>
<li>Scanned page images (input)</li>
<li>Text documents (output)</li>
</ol>
<p>The same OCR engines, installed on high-performance computers, could serve several users, and the same training sets could be reused for different documents with similar features (same fonts, page quality, etc.).<br />
Sometimes there are multiple scanned images of the same document at different quality levels (e.g. on <a href="http://archive.net">Internet Archive</a>), and the text documents that OCR produces from the same original can be more or less accurate and more or less rich in features (text only, minimal or plain formatting and layout, mapping of the text onto the original image, etc.).</p>
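<p>When two transcriptions of the same page are available, they can be aligned word by word so that their differences become explicit. The following is a minimal sketch, not a definitive method: it uses Python's standard <code>difflib.SequenceMatcher</code> for the alignment, while the function name <code>merge_ocr_outputs</code> and the word list <code>lexicon</code> are hypothetical and introduced only for illustration. Where the two texts disagree, it keeps the variant containing more words found in the lexicon.</p>

```python
from difflib import SequenceMatcher


def known_words(words: list[str], lexicon: set[str]) -> int:
    """Count the words that appear in the lexicon, ignoring case and punctuation."""
    return sum(word.strip(".,;:!?").lower() in lexicon for word in words)


def merge_ocr_outputs(text_a: str, text_b: str, lexicon: set[str]) -> str:
    """Align two OCR transcriptions word by word and merge them.

    Assumption: `lexicon` is a set of lowercase reference words for the
    language of the document (hypothetical helper data, not part of any tool).
    """
    words_a, words_b = text_a.split(), text_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    merged: list[str] = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        segment_a, segment_b = words_a[a0:a1], words_b[b0:b1]
        if tag == "equal":
            # Both transcriptions agree on this run of words.
            merged.extend(segment_a)
        elif known_words(segment_b, lexicon) > known_words(segment_a, lexicon):
            # The second transcription looks more plausible here.
            merged.extend(segment_b)
        else:
            # Ties are resolved in favour of the first transcription.
            merged.extend(segment_a)
    return " ".join(merged)
```

<p>For example, given <code>"the quick brovvn fox"</code> and <code>"tbe quick brown fox"</code> with a lexicon containing the four correct words, the function returns <code>"the quick brown fox"</code>. A lexicon is only one possible tie-breaker; for ancient languages a suitable word list may itself be hard to obtain.</p>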
<h3>Solution</h3>
<p>OCR Web Services can connect the aforementioned entities.</p>
<h4>Scenario</h4>
<p>OCR engines are installed on clusters of computers behind a front-end dispatcher, which receives documents to process, queues them, distributes them to idle nodes and then sends the results back to the original caller.<br />
Training sets are distributed and can be selected by the end user.<br />
Zipped folders of page images (or .pdf files) can be sent to the OCR engines.<br />
The resulting text documents can be corrected automatically by aligning and merging them.</p>
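<p>The dispatcher described in this scenario can be sketched in a few lines. This is only an illustration under stated assumptions: worker threads stand in for cluster nodes, and the names <code>Dispatcher</code>, <code>OcrJob</code>, <code>Engine</code> and <code>fake_engine</code> are hypothetical. A real deployment would call an actual OCR engine on remote machines instead of the placeholder function used here.</p>

```python
import itertools
import queue
import threading
from dataclasses import dataclass, field
from typing import Callable, Optional

# Assumed engine signature: (page image bytes, training set name) -> text.
Engine = Callable[[bytes, str], str]


@dataclass
class OcrJob:
    job_id: int
    page_image: bytes
    training_set: str
    # Receives exactly one item: the recognized text or the raised exception.
    _outcome: queue.Queue = field(default_factory=queue.Queue, repr=False)

    def wait(self, timeout: Optional[float] = None) -> str:
        """Block until a node has processed the job, then return its text."""
        outcome = self._outcome.get(timeout=timeout)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


class Dispatcher:
    """Queues incoming jobs and hands each one to the next idle node."""

    def __init__(self, engine: Engine, node_count: int) -> None:
        self._engine = engine
        self._ids = itertools.count(1)
        self._pending: queue.Queue = queue.Queue()
        self._nodes = [
            threading.Thread(target=self._node_loop, daemon=True)
            for _ in range(node_count)
        ]
        for node in self._nodes:
            node.start()

    def submit(self, page_image: bytes, training_set: str) -> OcrJob:
        """Enqueue a page with the training set selected by the user."""
        job = OcrJob(next(self._ids), page_image, training_set)
        self._pending.put(job)
        return job

    def _node_loop(self) -> None:
        while True:
            job = self._pending.get()  # blocks while the node is idle
            if job is None:  # shutdown signal
                return
            try:
                job._outcome.put(self._engine(job.page_image, job.training_set))
            except Exception as error:  # report the failure to the caller
                job._outcome.put(error)

    def shutdown(self) -> None:
        """Stop all nodes after the jobs already queued have been processed."""
        for _ in self._nodes:
            self._pending.put(None)
        for node in self._nodes:
            node.join()


def fake_engine(page_image: bytes, training_set: str) -> str:
    """Placeholder for a real OCR engine call."""
    return f"{training_set}:{len(page_image)} bytes"
```

<p>A caller would create <code>Dispatcher(fake_engine, node_count=2)</code>, call <code>submit()</code> once per page and then <code>wait()</code> on each returned job. Because every node takes the next job from a shared queue as soon as it is free, no explicit load balancing is needed.</p>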
<h3>Discussion</h3>
<p>Which protocols are best suited to these purposes?<br />
In particular, how can very large quantities of data be moved using open protocols?</p>
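<p>One possible approach to the data volume question, offered as a sketch rather than an answer, is to split a large archive into fixed-size chunks, each with its own checksum, so that every chunk can travel as the body of a separate request and be verified or resent independently. The names <code>Chunk</code> and <code>iter_chunks</code> and the 8 MiB default size are assumptions made for illustration; the transport itself is deliberately left out.</p>

```python
import hashlib
from typing import BinaryIO, Iterator, NamedTuple


class Chunk(NamedTuple):
    index: int    # position of the chunk in the sequence, starting at 0
    offset: int   # byte offset of the chunk within the original stream
    data: bytes   # the chunk payload
    sha256: str   # hex digest the receiver can use to verify the payload


def iter_chunks(stream: BinaryIO, chunk_size: int = 8 * 1024 * 1024) -> Iterator[Chunk]:
    """Yield consecutive chunks of a binary stream without loading it whole."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    index = 0
    offset = 0
    while True:
        data = stream.read(chunk_size)
        if not data:  # end of stream
            break
        yield Chunk(index, offset, data, hashlib.sha256(data).hexdigest())
        index += 1
        offset += len(data)
```

<p>A sender could open a zipped folder of page images with <code>open(path, "rb")</code> and iterate over <code>iter_chunks()</code>; a receiver that recomputes each digest can ask for a single chunk again instead of the whole archive.</p>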
]]></content:encoded>
			<wfw:commentRss>http://digitalphilology.wordpress.com/2009/06/30/ocr-web-services/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/043a91fbb545f886ef2e6532393153b0?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">digitalphilology</media:title>
		</media:content>
	</item>
	</channel>
</rss>
