Dear Sebastian,
You wrote:
The Dharmamitra project is preparing the training of OCR models for typeset Devanagari editions this summer.
Can you say a little more about the project?
1) Is this meant to have different capabilities than SanskritCR (already available at https://ocr.sanskritdictionary.com/)?
From my brief use of SanskritCR, it seems to work well for printed editions from the first half of the 20th century.
If I understand correctly, this is a wrapper for Google (= Cloud Vision?) OCR? Cloud Vision still struggles to some extent with complex ligatures in less common fonts.
We plan to train an end-to-end vision language model. There is no guarantee that it will work better than the best current solutions, but we want to give it a try.
2) Is it the actual fonts you want, or Sanskrit text written in different fonts?
Actual fonts, yes!
3) If it's fonts you are looking for, is it Unicode fonts you want? The reason I'm asking is that the bulk of the literature typeset by computer (anything older than the last 20 years or so) would be in non-Unicode fonts.
We really are interested in everything. While Unicode is best, we might invest time to make other fonts work if needed.
With many thanks,
Sebastian Nehrdich