
Thread: BW6 CCAT -> XML/Unicode Converter

  1. #1

    BW6 CCAT -> XML/Unicode Converter

    I wrote several applications a few months back to deal with Hebrew texts exported from BW6 by means of the 'Export' module (which I'm very grateful for, to say the least!). One of the applications takes text exported in CCAT format (selectable in the export 'save as' window in Bibleworks) and converts the exported text files to XML, converting the text to Unicode in the process. (Several excellent scholarly Unicode fonts are available on the Internet, such as SBL Hebrew, SIL Ezra, etc., and you will need one of these for the extended characters used in Biblical texts.) More as a toy than anything immediately useful to me (I also wrote this as a quick learning experiment in C#), I've included the ability to run XSL transformations on the compiled text, meaning (if you know XSL) you can easily convert the CCAT data to formatted HTML, or whatever other format you want. (That said, while XSL isn't the hardest thing to learn, it isn't really easy for non-programmers either. I have a very early script I wrote just to test with, which converts BHM XML to rudimentary HTML ... perhaps people can use it as a learning point ... I don't promise I've done anything the best way ... it was just a simple test script.) Maybe this is useful to someone else. (It was only about 10 extra lines of code in .NET anyway, and it is a fairly powerful language if you learn how to use it.)
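
    The XSL step described above can be sketched in a few lines of .NET. This is a minimal sketch, not the posted program's code: the class and method names (XslSketch, Transform) are hypothetical, and it uses XslCompiledTransform, which may well be newer than whatever the original tool called.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XslSketch
{
    // Applies an XSLT stylesheet (given as a string) to an XML document
    // (also a string) and returns the transformed output as text.
    public static string Transform(string xml, string xslt)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader styleReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(styleReader);
        }

        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        {
            // No stylesheet parameters are passed, hence the null argument list.
            transform.Transform(input, null, output);
        }
        return output.ToString();
    }
}
```

    In practice the two strings would be read from the converted XML file and from whatever stylesheet you write for it.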

    I don't mind posting this application on the Internet (I've already got what I need out of it, but maybe some strange person similar to me might see a use in it as well). I'm even amenable to releasing the source code, so long as people realize that this was absolutely nothing more than a 2-day (and incomplete) learning experiment for me, trying to learn C# quickly. Many features (notably Latin-based conversion -- French, English, etc.) are missing, and others are rudimentary. Any programmer could probably add these in about 6 hours if they wanted. It wasn't why I wrote the program, though, so it didn't seem important at the time. It may be utterly full of bugs, and the translation to Unicode likely needs some work, but perhaps it's useful as a starting point for something else.

    Is there any chance something like this could be hosted on the BW site (even the user forum) so that if anyone wants to download it, or work on the source code, they can easily find it? I have several domains and sites of my own, and other people could put it on their own sites easily enough, but these are very transient homes, and I hate broken links, and I hate finding pages that talk about very specialized tools like this only to realize that not a single live copy exists to be downloaded on the net. I know it's a temporary solution, since new WTM texts will be in Unicode and XML anyway, but until then, it would be nice if someone could find this if they wanted it (even if I don't promise it will work!).


    - Requires the Microsoft .NET Framework to run (Windows 98 and higher), which, if you don't already have it (you should!), can be downloaded freely from Microsoft's website.

    - ANSI text uses HTML encoding (meaning " becomes &quot; etc.)

    - As with the CCAT text, Unicode words do NOT have final forms (ex. Final Mem if mem is the last letter, etc.). This is a predictable phenomenon and can easily be changed in any program that uses the XML data (or someone can change the program itself).

    - No work has been done on dealing with Latin-based (English, French, etc.) text, and only rudimentary work has been done on Greek Morphological texts (with a view solely to LXM and BGM). The application was developed with Hebrew morphology/text in mind.

    - There are almost certainly bugs in the program, and translation errors with Unicode may well exist. No guarantees of usefulness are at all implied ... consider this simply a starting point for a better application, or a point of departure in getting the CCAT exported text into something more broadly useful. It's not yet a final destination in itself.

    - At this point, no matching is made between Hebrew (i.e., BHS) and Greek (i.e., most English translations) verse numbering. Both sets of fields in the XML file are populated with the same chapter and verse numbers. This is because there is some disagreement on verse mapping, and I hadn't decided how I wanted to proceed with this.

    - There are some definite design issues with the application. (I had been using C# for a whole 2 weeks when I wrote it.) The processing task, for example, should be in a separate thread; as it stands, the interface may temporarily freeze while processing is underway. I would never release something like this commercially, and am a bit ashamed even to release it as it is, but I'm far too busy to fix these kinds of issues, and I wanted to give at least this, since it might still help someone somewhere (so I don't have to feel bad about ignoring my wife for the whole weekend I was fixated on solving this XML/Unicode problem for myself).
    Last edited by Kevin Townsend; 12-29-2004 at 04:36 AM.
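
    The missing final forms mentioned in the notes above can be patched up after conversion. What follows is a minimal sketch and not part of the posted program: the names (FinalFormSketch, ApplyFinalForms) are hypothetical, and it assumes each input string holds exactly one complete written word, so the last base letter really is word-final. It walks back past any trailing points or accents and swaps the last letter for its final form if it has one.

```csharp
using System;
using System.Collections.Generic;

public static class FinalFormSketch
{
    // Regular letter -> final form (Kaf, Mem, Nun, Pe, Tsade).
    private static readonly Dictionary<char, char> Finals = new Dictionary<char, char>
    {
        { '\u05db', '\u05da' },
        { '\u05de', '\u05dd' },
        { '\u05e0', '\u05df' },
        { '\u05e4', '\u05e3' },
        { '\u05e6', '\u05e5' },
    };

    // Hebrew base letters occupy U+05D0..U+05EA; points and accents sit below that range.
    private static bool IsHebrewLetter(char c)
    {
        return c >= '\u05d0' && c <= '\u05ea';
    }

    public static string ApplyFinalForms(string word)
    {
        for (int i = word.Length - 1; i >= 0; i--)
        {
            if (!IsHebrewLetter(word[i])) continue; // skip trailing marks

            char finalForm;
            if (Finals.TryGetValue(word[i], out finalForm))
            {
                return word.Substring(0, i) + finalForm + word.Substring(i + 1);
            }
            return word; // the last letter has no final form
        }
        return word; // no Hebrew letter found
    }
}
```

    Because prefixes are stored as separate word entries (the next entry carries iscompound="yes"), a caller would want to join those morphemes first, or skip any entry that is followed by a compound, so that a prefixed Kaf or Mem is not turned into a final letter.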

  2. #2

    Missing Attachments

    The attachments seem to be missing. I think I've tried this about 5 times, but what's one more? (You're bound to become a pessimist if you do any long-term work in development. Your whole day consists of looking for problems and wondering how you can solve them, only to discover this creates new problems to solve, ad infinitum. At least you're never out of a job!)

  3. #3

    Xml

    For curiosity's sake, some XML samples are below. Please note that some tags only appear when relevant (such as <isAramaic> [ex. Gen 31:47 root 5], which only occurs when the root word is Aramaic, or <iscompound>, which only occurs when the morpheme is part of the previous morpheme in the normal text), so review the XML file carefully before handling it. I didn't write a description of all the tags. I suppose I should have, but most of it should be self-evident anyway. Also ... the HEB book names are not all complete (I think Pentateuch only). You should always go by the ENGBookShort or ENGBookLong attributes when parsing (see the note above on ENG and HEB verse/chapter references). Don't confuse the num attribute of a verse with the HEBVerse attribute, either. 'num' is the verse's consecutive position within the entire collection (10,412 is a valid 'num' attribute for the verse tag).

    BHM (Gen 1:1):
    <verse num="1" version="WTM" ENGBookShort="Gen" ENGBookLong="Genesis" HEBBookShort="Ber" HEBBookLong="Bereshit" HEBChapter="1" HEBVerse="1" ENGChapter="1" ENGVerse="1" wordsGrouped="7" wordsParsed="11">
    <word num="1" rootANSI="B." rootUnicode="בּ" morphology="Pp+SxxxExHxNxRx" />
    <word num="2" rootANSI="R&quot;)$IYT" rootUnicode="רֵאשִׁית" morphology="ncfsa+SxxxExHxNxRx" iscompound="yes" />
    <word num="3" rootANSI="BR)" rootUnicode="ברא" morphology="vqp3ms+SxxxJxCxAxExHaNxRx" />
    <word num="4" rootANSI="):ELOHIYM" rootUnicode="אֱלֹהִימ" morphology="ncmpa+SxxxExHxNxRx" />
    <word num="5" rootANSI=")&quot;T" rootUnicode="אֵת" morphology="Po+SxxxExHaNxRx" />
    <word num="6" rootANSI="HA" rootUnicode="הַ" morphology="Pa+SxxxExHxNxRx" />
    <word num="7" rootANSI="$FMAYIM" rootUnicode="שָׁמַיִמ" morphology="ncmpa+SxxxExHxNxRx" iscompound="yes" />
    <word num="8" rootANSI="W" rootUnicode="ו" morphology="Pc+SxxxExHxNxRx" />
    <word num="9" rootANSI=")&quot;T" rootUnicode="אֵת" morphology="Po+SxxxExHaNxRx" iscompound="yes" />
    <word num="10" rootANSI="HA" rootUnicode="הַ" morphology="Pa+SxxxExHxNxRx" />
    <word num="11" rootANSI=")EREC" rootUnicode="אֶרֶצ" morphology="ncfsa+SxxxExHxNxRx" iscompound="yes" />
    </verse>

    BHS (Gen 1:1):
    <verse num="1" ENGBookShort="Gen" ENGBookLong="Genesis" HEBBookShort="Ber" HEBBookLong="Bereshit" HEBChapter="1" HEBVerse="1" ENGChapter="1" ENGVerse="1" wordsGrouped="7" wordsParsed="11">
    <word num="1" textANSI="B.:" textUnicode="בְּ" />
    <word num="2" textANSI="R&quot;)$I73YT" textUnicode="רֵאשִׁ֖ית" iscompound="yes" />
    <word num="3" textANSI="B.FRF74)" textUnicode="בָּרָ֣א" />
    <word num="4" textANSI="):ELOHI92YM" textUnicode="אֱלֹהִ֑ימ" />
    <word num="5" textANSI=")&quot;71T" textUnicode="אֵ֥ת" />
    <word num="6" textANSI="HA" textUnicode="הַ" />
    <word num="7" textANSI="$.FMA73YIM" textUnicode="שָּׁמַ֖יִמ" iscompound="yes" />
    <word num="8" textANSI="W:" textUnicode="וְ" />
    <word num="9" textANSI=")&quot;71T" textUnicode="אֵ֥ת" iscompound="yes" />
    <word num="10" textANSI="HF" textUnicode="הָ" />
    <word num="11" textANSI=")F75REC" textUnicode="אָֽרֶצ" iscompound="yes" />
    </verse>
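
    Reading samples like these back in can be sketched as follows. This is only a sketch under assumptions: the names (VerseReaderSketch, Label, GroupWords) are hypothetical, each verse is assumed to be closed with </verse> inside a single root element, and it uses System.Xml.Linq, which is newer than the framework the original tool targeted. It builds the reference from the ENG attributes, as recommended above, and joins every iscompound="yes" morpheme onto the one before it.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class VerseReaderSketch
{
    // Builds a reference such as "Gen 1:1" from the ENG* attributes.
    public static string Label(XElement verse)
    {
        return (string)verse.Attribute("ENGBookShort") + " "
            + (string)verse.Attribute("ENGChapter") + ":"
            + (string)verse.Attribute("ENGVerse");
    }

    // Joins morphemes into written words, reading the given attribute
    // (for example "textUnicode", "textANSI" or "rootUnicode").
    public static List<string> GroupWords(XElement verse, string attributeName)
    {
        var groups = new List<string>();
        foreach (XElement word in verse.Elements("word"))
        {
            string text = (string)word.Attribute(attributeName) ?? "";
            // iscompound is optional; when present it attaches this morpheme to the previous one.
            bool joins = (string)word.Attribute("iscompound") == "yes";
            if (joins && groups.Count > 0)
            {
                groups[groups.Count - 1] += text;
            }
            else
            {
                groups.Add(text);
            }
        }
        return groups;
    }
}
```

    For the Gen 1:1 samples, entries 2, 7, 9 and 11 carry iscompound="yes", so the 11 parsed entries collapse to 7 groups, which agrees with the wordsGrouped="7" attribute on the verse.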


  4. #4

    CCAT to Unicode Conversion Rules (Pt. 1)

    Here are the 'rules' for converting CCAT Hebrew to Unicode characters. I post them because there may very well be (and likely are) errors. Sorry to just post the C# code, but I think it's clear enough. "\u####" is a Unicode character, where #### is the character's hexadecimal code point, such as "05d6" = Zayin, which is encoded as "Z" in the CCAT scheme. Ergo, all "Z"s will be replaced with Unicode character "05d6". (If you understand this code, I know ... shame on me. It's terribly inefficient to run a function so many times, but I've since learned the folly of my ways.)

    public static string HebrewANSItoUnicode(string s) // Replace ANSI with UNICODE Data
    // Alefbet
    s = Replace(s, "_S_", "\u05e1"); // SPECIAL
    s = Replace(s, "_P_", "\u05e4"); // SPECIAL
    s = Replace(s, ")", "\u05d0" ); // Alef
    s = Replace(s, "B" ,"\u05d1" ); // Bet
    s = Replace(s, "G" ,"\u05d2" ); // Gimel
    s = Replace(s, "D" ,"\u05d3" ); // Dalet
    s = Replace(s, "H" ,"\u05d4" ); // He
    s = Replace(s, "W" ,"\u05d5" ); // waw
    s = Replace(s, "Z" ,"\u05d6" ); // Zayin
    s = Replace(s, "X" ,"\u05d7" ); // Het
    s = Replace(s, "+" ,"\u05d8" ); // Tet
    s = Replace(s, "Y" ,"\u05d9" ); // Yod
    // Final Kaf (\u05da) ?
    s = Replace(s, "K" ,"\u05db" ); // Kaf
    s = Replace(s, "L" ,"\u05dc" ); // Lamed
    // Final Mem (\u05dd) ?
    s = Replace(s, "M" ,"\u05de" ); // Mem
    // Final Nun (\u05df) ?
    s = Replace(s, "N" ,"\u05e0" ); // Nun
    s = Replace(s, "S" ,"\u05e1" ); // Samek
    s = Replace(s, "(" ,"\u05e2" ); // Ayin
    // Final Pe (\u05e3) ?
    s = Replace(s, "P" ,"\u05e4" ); // Pe
    // Final Zade (\u05e5) ?
    s = Replace(s, "C" ,"\u05e6" ); // Zade
    s = Replace(s, "Q" ,"\u05e7" ); // Qof
    s = Replace(s, "R" ,"\u05e8" ); // Resh
    s = Replace(s, "#" ,"\u05e9" ); // Sin/Shin
    // SAME ???
    s = Replace(s, "&" ,"\u05e9\u05c2" ); // Sin
    s = Replace(s, "$" ,"\u05e9\u05c1" ); // Shin
    s = Replace(s, "T" ,"\u05ea" ); // Taw

    <Continued in next post due to size restrictions!>

  5. #5

    CCAT -> Unicode Conversion Rules (pt. 2)

    // Vowels (Replace double characters first)

    s = Replace(s, "W." ,"\u05d5\u05bc" ); // Shureq
    s = Replace(s, "OW" ,"\u05b9\u05d5" ); // Holem Waw
    s = Replace(s, ":A" ,"\u05b2" ); // Hateph-Pathah
    s = Replace(s, ":F" ,"\u05b3" ); // Hateph-Qametz
    s = Replace(s, ":E" ,"\u05b1" ); // Hateph-Segol
    s = Replace(s, "A" ,"\u05b7" ); // Patah
    s = Replace(s, "F" ,"\u05b8" ); // Qametz
    s = Replace(s, "I" ,"\u05b4" ); // Hireq
    s = Replace(s, "E" ,"\u05b6" ); // Segol
    s = Replace(s, "\"" ,"\u05b5" ); // Tsere
    s = Replace(s, "O" ,"\u05b9" ); // Holam
    s = Replace(s, "U" ,"\u05bb" ); // Qibbuts
    s = Replace(s, ":" ,"\u05b0" ); // Schwa
    s = Replace(s, "-" ,"\u05be" ); // Maqqeph
    s = Replace(s, "." ,"\u05bc" ); // Dagesh
    s = Replace(s, "," ,"\u05bf" ); // Rape
    // s = Replace(s, "*" ,"\u" ); // Ketiv
    // s = Replace(s, "**" ,"\u" ); // Qere

    // Other
    s = Replace(s, "92", "\u0591"); // Atnah
    s = Replace(s, "01", "\u0592"); // Segolta
    s = Replace(s, "65", "\u0593"); // Shalshelet
    s = Replace(s, "80", "\u0594"); // Zaqep Parvum
    s = Replace(s, "85", "\u0595"); // Zaqep Magnum
    s = Replace(s, "73", "\u0596"); // Tipha Tarha
    s = Replace(s, "81", "\u0597"); // Rebia
    s = Replace(s, "82", "\u0598"); // Sinnorit
    s = Replace(s, "03", "\u0599"); // Pashta, Azla Legarmeh
    s = Replace(s, "10", "\u059a"); // Yetib
    s = Replace(s, "91", "\u059b"); // Tebir
    s = Replace(s, "61", "\u059c"); // Geresh, Teres
    s = Replace(s, "11", "\u059d"); // Mugrash
    s = Replace(s, "62", "\u059e"); // Garshajim
    s = Replace(s, "84", "\u059f"); // Pazer Mag, Qarne Para
    s = Replace(s, "14", "\u05a0"); // Telisha Magnum
    s = Replace(s, "44", "\u05a0"); // Telisha Magnum (2)
    s = Replace(s, "83", "\u05a1"); // Pazer
    s = Replace(s, "74", "\u05a3"); // Munah
    s = Replace(s, "70", "\u05a4"); // Mahpak, Mehuppak
    s = Replace(s, "71", "\u05a5"); // Mereka
    s = Replace(s, "72", "\u05a6"); // Mereka Kepulah
    s = Replace(s, "94", "\u05a7"); // Darga
    s = Replace(s, "33", "\u05a8"); // Pashta
    s = Replace(s, "63", "\u05a8"); // Azla Legarmeh
    s = Replace(s, "04", "\u05a9"); // Telisha Parvum
    s = Replace(s, "24", "\u05a9"); // Telisha Parvum (2)
    s = Replace(s, "93", "\u05aa"); // Galgal, Jerah
    s = Replace(s, "60", "\u05ab"); // Ole, Mahpakatum
    s = Replace(s, "64", "\u05ac"); // Illuj
    s = Replace(s, "13", "\u05ad"); // Edhi, Tipha
    s = Replace(s, "02", "\u05ae"); // Zarqa, Sinnor

    // *** See SIL Ezra notes for next 3
    s = Replace(s, "35", "\u05bd"); // Meteg
    s = Replace(s, "75", "\u05bd"); // Silluq
    s = Replace(s, "95", "\u05bd"); // Meteg
    s = Replace(s, "05", "\u05c0"); // Paseq
    s = Replace(s, "00", "\u05c3"); // Sop Pasuq
    s = Replace(s, "52", "\u05c4"); // Punctum Extraordinaria above
    s = Replace(s, "53", "\u0323"); // Punctum Extraordinaria lower

    // Possible bug in Bibleworks that outputs useless '/' character
    s = Replace(s, "/", "");

    return (s);
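
    The inefficiency admitted to in part 1 (one Replace call per rule) can be avoided with a lookup table and a single pass over the input. The following is a sketch of that alternative, not the posted program: the names (CcatTableSketch, ToUnicode) are hypothetical, and the table holds only a handful of the rules above for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CcatTableSketch
{
    // Subset of the rules above, for illustration only.
    private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { ":A", "\u05b2" }, // Hateph-Pathah (two-character code)
        { ")",  "\u05d0" }, // Alef
        { "B",  "\u05d1" }, // Bet
        { "R",  "\u05e8" }, // Resh
        { "F",  "\u05b8" }, // Qametz
        { ":",  "\u05b0" }, // Schwa
        { ".",  "\u05bc" }, // Dagesh
        { "74", "\u05a3" }, // Munah (two-digit accent code)
        { "/",  "" },       // stray separator, dropped
    };

    public static string ToUnicode(string s)
    {
        var sb = new StringBuilder(s.Length);
        int i = 0;
        while (i < s.Length)
        {
            string replacement;
            // Try the two-character code first so ":A" wins over ":".
            if (i + 1 < s.Length && Map.TryGetValue(s.Substring(i, 2), out replacement))
            {
                sb.Append(replacement);
                i += 2;
            }
            else if (Map.TryGetValue(s.Substring(i, 1), out replacement))
            {
                sb.Append(replacement);
                i += 1;
            }
            else
            {
                sb.Append(s[i]); // unknown characters pass through unchanged
                i += 1;
            }
        }
        return sb.ToString();
    }
}
```

    Trying the longer code first takes the place of the "replace double characters first" ordering, and because the input is scanned only once, text that has already been converted is never matched again by a later rule.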

  6. #6
    Join Date
    Oct 2005

    BW Hebrew to SBL Hebrew


    Did you see my recent post asking about creating font map tables for exporting Bwhebb to SPTiberian? Your post seems to come close to meeting this need. I assume that SPTiberian is what you call SBL Hebrew. Has your program developed at all? I am not a programmer in any sense, so I cannot really use a lot of the info you posted. But I could obviously use a finished program. So I am wondering if you have developed your program to be reliable for turning BW Hebrew into other types of Hebrew, or possibly have created font map tables for exporting Bwhebb as SPTiberian (SBL's main Hebrew font).

    Thank you for any help you can give me.

    God bless,

    Brian Abasciano

  7. #7

    Unicode fonts


    If I understood Kevin's project correctly, he was converting to Unicode. SPTiberian is not a Unicode font, but SBL Hebrew is. Thus the two are not interchangeable.

    SPTiberian predates the wide acceptance of Unicode and SBL Hebrew was developed as its successor as more computers and operating systems were able to handle Unicode.

    As I understand it, BW7 will deal with Unicode directly. If it is possible for you, I suggest using SBL Hebrew directly because it will not be long before the approach used in SPTiberian will fade away.

    If you need the OT in Unicode Hebrew, check out Christopher Kimball's site at Tanach.

    If you want to install a keyboard driver to use Unicode more efficiently see David Instone-Brewer's work in Unicode Fonts for Biblical Studies - made easy. He provides both an explanation of the whats and whys of Unicode as well as a great download for both Windows and Macs that is very helpful.

    Hope this helps,


  8. #8
    Join Date
    Oct 2005

    Oh


    Thank you, that does help. I did not realize that SBL had a newer font. I must have downloaded SPTiberian about a year ago, at which time I believe the SBL site said that was the font to use for JBL. I also know that T & T Clark presently asks authors to use SPIonic and Tiberian, and it seems that some other journals ask for these too. Do you think that anyone asking for SPTiberian would be fine with SBL Hebrew and other Unicode fonts? One reason I have been looking into this is that I am writing a book for T & T Clark and they ask for SPIonic and Tiberian to be used. Perhaps I should ask them if Unicode fonts are acceptable.

    Thanks again for the info.

    God bless,


  9. #9

    Use of Unicode


    Only T & T Clark can answer your question.

    I have used SPTiberian for several years and it does its job quite well. But one drawback of its font technology is that you often must choose from different width characters. For example, the font map lists two different holem. Why? A holem that goes with a vav is much narrower than one that goes with a mem. SPTiberian has five widths of dagesh. It is up to the typist to choose the one that fits best.

    On the other hand, Unicode allows you to input one character and the font rendering technology decides the proper width. Much simpler for the person typing the text.

    Best wishes on your book.


  10. #10
    Join Date
    Oct 2005

    Thanks again


    Thank you for the post. It is helpful to know.

