More Public Domain Texts for User Databases?



Matt Harmon
07-31-2008, 09:45 AM
Over on Justin Taylor's blog there is a post (see here (http://theologica.blogspot.com/2008/07/plundering-archive-dot-org.html)) mentioning several texts on Archive.org that can be downloaded as PDF, TXT or even HTML files. Assuming these texts are in fact out of copyright, is there any reason (other than someone needing to take the time to do it) that they could not be converted into BibleWorks user modules?

Matt

Michael Hanel
07-31-2008, 01:11 PM
I've downloaded roughly 1600 volumes from Archive.org, all of them in .pdf format (EXTREMELY large, photographic reproductions).

I considered what you suggest, but found the problem with their .txt files is that they are direct OCR files from the photographic reproductions, crinkled pages, discolorations and all. Consequently, the .txt files contain a huge amount of garbage characters, and would take an immense amount of work to make them into BW7 databases. (Not having worked with HTML files that much, I can't speak to that issue).
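For anyone who wants to gauge how bad a given .txt file is before committing to cleaning it up, something along these lines might help. It is only a rough sketch in Python, not anything Archive.org or BibleWorks provides: the function names, the set of "expected" characters, and the 0.2 threshold are all my own assumptions, and the character set would need adjusting for texts with Greek or Hebrew in them.

```python
import string

# Characters we would expect in ordinary English prose.
# This set is an assumption; widen it for Greek, Hebrew, accents, etc.
EXPECTED = set(string.ascii_letters + string.digits + ".,;:!?'\"()-")


def noise_ratio(line: str) -> float:
    """Fraction of non-whitespace characters in the line that fall outside EXPECTED."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c not in EXPECTED for c in chars) / len(chars)


def flag_noisy_lines(text: str, threshold: float = 0.2) -> list:
    """Return the 1-based numbers of lines whose noise ratio exceeds the threshold."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if noise_ratio(line) > threshold
    ]


if __name__ == "__main__":
    sample = "In the beginning was the Word.\n~^|}{ #@ *&^%$ ||\nand the Word was with God."
    print(flag_noisy_lines(sample))
```

This only points at lines that look like scanner debris; it does nothing about misread letters inside otherwise normal words, which is where most of the proofreading time would still go.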

As a result I decided to download the much larger .pdf files and, with Adobe Acrobat, add bookmarks to the volumes. I have tried to create workable text files from these, and it can be done (even from Acrobat Reader 8 using copy and paste), but this is also a long process (more Wycliffe, anyone?).

Still labor-intensive, but on a much smaller scale, is to simply make each page image its own page in an HTML file. This is what was done with the old Hebrew grammar series. It's not as good as having real text, but it beats nothing.

Downsides: (1) the compiled HTML files tend to be really large (how large depends mostly on the resolution of the page images you use); (2) you obviously can't copy, paste, search, etc.
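The page-image approach could be scripted roughly as follows. This is a hedged sketch in Python, not the process actually used for the Hebrew grammar series: the function names, the `pageNNNN.html` naming scheme, and the assumption that the scans are .jpg files sitting in the same folder as the generated pages are all mine. It only produces the individual HTML pages with Previous/Next links; compiling them into a single help file is a separate step not shown here.

```python
from html import escape
from pathlib import Path


def build_pages(image_names, title):
    """Return {html_filename: html_text}, one page per image, with prev/next links."""
    pages = {}
    total = len(image_names)
    for i, img in enumerate(image_names):
        nav = []
        if i > 0:
            # Link back to the page before this one.
            nav.append(f'<a href="page{i:04d}.html">Previous</a>')
        if i < total - 1:
            # Link forward to the page after this one.
            nav.append(f'<a href="page{i + 2:04d}.html">Next</a>')
        heading = f"{escape(title)}, page {i + 1} of {total}"
        pages[f"page{i + 1:04d}.html"] = (
            "<!DOCTYPE html>\n"
            f"<html><head><title>{heading}</title></head>\n"
            f"<body><p>{' | '.join(nav)}</p>\n"
            f'<img src="{escape(img)}" alt="{heading}">\n'
            "</body></html>\n"
        )
    return pages


def write_pages(image_dir, out_dir, title):
    """Write one HTML page per .jpg found in image_dir (sorted by file name).

    Assumes the images are (or will be copied) alongside the HTML files,
    since the pages reference them by bare file name.
    """
    images = sorted(p.name for p in Path(image_dir).glob("*.jpg"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, html_text in build_pages(images, title).items():
        (out / name).write_text(html_text, encoding="utf-8")
```

Calling `write_pages("scans", "scans", "Some Grammar")` would drop the pages next to the images. Since every page is just an image reference, the file-size and no-search downsides listed above apply here unchanged.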