<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kyle Anderson &#187; imagemagick</title>
	<atom:link href="http://xkyle.com/tag/imagemagick/feed/" rel="self" type="application/rss+xml" />
	<link>http://xkyle.com</link>
	<description></description>
	<lastBuildDate>Sun, 22 Aug 2010 07:24:37 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Decrypting an eBook to make it Searchable</title>
		<link>http://xkyle.com/2009/06/11/decrypting-an-ebook-to-make-it-searchable/</link>
		<comments>http://xkyle.com/2009/06/11/decrypting-an-ebook-to-make-it-searchable/#comments</comments>
		<pubDate>Thu, 11 Jun 2009 20:23:43 +0000</pubDate>
		<dc:creator>Kyle Anderson</dc:creator>
				<category><![CDATA[Personal]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[drm]]></category>
		<category><![CDATA[ebook]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[imagemagick]]></category>

		<guid isPermaLink="false">http://xkyle.com/?p=364</guid>
		<description><![CDATA[So I spent $22 on an ebook for school. It has this crappy DRM that only lets me view the pdf on one computer using only &#8220;Adobe Digital Editions&#8221;. As if that wasn&#8217;t bad enough, only a small subset of the text is OCR&#8217;d, so most of it isn&#8217;t even searchable! Now I&#8217;m pissed, but wait, [...]]]></description>
			<content:encoded><![CDATA[<p>So I spent $22 on an <a href="http://www.diesel-ebooks.com/cgi-bin/item/0931541611/Voyage-of-Discovery-From-the-Big-Bang-to-the-Ice-Age-eBook.html">ebook</a> for school.</p>
<p>It has this crappy DRM that only lets me view the pdf on one computer using only &#8220;Adobe Digital Editions&#8221;.</p>
<p>As if that wasn&#8217;t bad enough, only a small subset of the text is OCR&#8217;d, so most of it isn&#8217;t even searchable!</p>
<p>Now I&#8217;m pissed, but wait, what&#8217;s that you say? These files are just RSA-encrypted, and I have the key?</p>
<p>Some cool guy named <strong><a href="http://i-u2665-cabbages.blogspot.com/2009/02/circumventing-adobe-adept-drm-for-epub.html">i♥cabbages</a></strong> has released code to extract your key and then decrypt the file to a good ol&#8217; plain pdf. If you want to reproduce my steps you will need to use the <a href="http://www.cs.helsinki.fi/u/vahakang/ineptpdf.pyw">PDF decrypter</a>, unless your books are epubs.</p>
<p>So I use the tool and get a pdf. Now I can use one of the most awesome tools in the world: <a href="http://en.wikipedia.org/wiki/ImageMagick">ImageMagick</a>.</p>
<p>ImageMagick can whip this pdf into shape. The first thing I&#8217;m going to do is convert each page into a tiff:</p>
<blockquote><p>$ convert -density 200 input.pdf[1-124] -depth 8 -monochrome %05d.tif</p></blockquote>
<p>Then I&#8217;m going to run tesseract-ocr on them to get the text:</p>
<blockquote><p>$ for i in $(seq --format=%005.f 1 324)<br />
do<br />
tesseract $i.tif tesseract-$i -l eng<br />
done</p></blockquote>
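<p>The loop above hard-codes the page count, so it has to match whatever convert actually wrote. If you don&#8217;t feel like counting, here is a rough sketch that just loops over every .tif it finds. The function name <code>ocr_all</code> and the <code>OCR_BIN</code> variable are made up for this example (they are not part of tesseract); <code>OCR_BIN</code> only exists so the loop can be dry-run with some other command.</p>

```shell
# Hypothetical helper (not from tesseract itself): OCR every .tif in a directory.
# OCR_BIN is an assumed override hook; it defaults to the real tesseract binary.
ocr_all() {
    dir=${1:-.}
    for tif in "$dir"/*.tif; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$tif" ] || continue
        page=$(basename "$tif" .tif)
        # tesseract adds the .txt extension to the output base on its own.
        "${OCR_BIN:-tesseract}" "$tif" "$dir/tesseract-$page" -l eng
    done
}
```

<p>Run it as <code>ocr_all .</code> from the directory holding the tiffs. The output names match the loop above (tesseract-00001.txt and so on), so the next step works unchanged.</p>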
<p>Now all I have to do is cat all the text together:</p>
<blockquote><p>$ cat tesseract-*.txt &gt; output.txt</p></blockquote>
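<p>One thing a plain cat loses is which page a line came from. A minimal sketch of a variant that writes a marker line before each page, assuming the per-page files are named tesseract-NNNNN.txt as above; <code>join_pages</code> is a hypothetical name of my own. Since the names are zero-padded, the shell glob already sorts them in page order.</p>

```shell
# Hypothetical helper: join the per-page text files into one file, with a
# marker line before each page so a search hit can be traced back to it.
join_pages() {
    dir=$1
    out=$2
    # Start from an empty output file.
    : > "$out"
    for txt in "$dir"/tesseract-*.txt; do
        [ -e "$txt" ] || continue
        printf '=== %s ===\n' "$(basename "$txt" .txt)" >> "$out"
        cat "$txt" >> "$out"
    done
}
```

<p>For example, <code>join_pages . output.txt</code> gives one searchable file where every page starts with a line like <code>=== tesseract-00001 ===</code>.</p>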
<p>Now I have a fully searchable, plain text file. Exactly what I wanted in the first place!</p>
<p>For the REAL magic, I use agrep to search for strings similar to provided example test questions to help &#8220;highlight&#8221; the answers. More technical details on that magic on <a href="http://wiki.xkyle.com/Answer_Finder">my wiki</a>.</p>
<p><a href="http://xkyle.com/wp-content/uploads/answer.JPG"><img class="alignnone size-medium wp-image-369" title="answer" src="http://xkyle.com/wp-content/uploads/answer-300x25.jpg" alt="answer" width="300" height="25" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://xkyle.com/2009/06/11/decrypting-an-ebook-to-make-it-searchable/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
