New version of the Digital Corpus of Sanskrit (DCS)

Sun May 8 11:31:45 UTC 2011

Dear list,

a major new release of the Digital Corpus of Sanskrit is now available at 
http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/.

Apart from new texts, the website has the following new features:

1) Part of speech information for nouns; this means that you can search, 
for instance, for hRdaya in Instr. Sg. etc.

2) A user feedback function for unclear text passages and typos. Unclear 
text is marked in different colours in the corpus, and you may propose the 
correct solution in a web-based form. Your feedback is highly appreciated, 
and the names of contributors are included in future versions of the corpus 
(although anonymous contributions are also possible).

3) Most importantly, the corpus now contains unsupervised (= not manually 
corrected!) analyses of some of the shorter texts from the GRETIL 
directory. According to a recent study, we may expect about 5% of lexical 
errors for linguistically easy texts such as the epics and about 10% for 
more complicated ones with unusual, technical, or Buddhist terminology. I 
started with the shorter texts (approx. up to 50 kb) because each analysis 
process is computationally expensive, and my laptop soon showed signs of 
overexertion. References for words found in these texts are included in the 
detail pages for single words, just below the verbal and part-of-speech 
information. A link leads directly to the lines in the source files where 
the words are found. You may check these functions with a popular term such 
as rAjan.
In addition, you find XML files that contain the results of the 
unsupervised analysis for each text. The complete list of these XML files 
is printed at the bottom of the page http://kjc-fs-cluster.kjc.uni-
heidelberg.de/dcs/index.php?contents=corpus. This page also contains an XSL 
file (unsupervised.xsl) that can be used to display the XML files in a more 
appealing format. Please note that you have to include the line 
<?xml-stylesheet type="text/xsl" href="unsupervised.xsl"?>
after the first tag (<?xml version="1.0" encoding="UTF-16"?>) of each XML 
file to display the results in the intended manner. Stylesheet information 
will be included automatically in each file in the next release of the 
corpus.

--- Some technicalities concerning unsupervised analysis, on which I would 
really appreciate off-list feedback ---
a) XML files are currently a popular scheme for storing data in the 
Humanities. I am not an expert in this field. However, processing medium 
sized files phrases using XSL files was extremely slow on my computer. This 
is especially true for operations that use <xsl:if ...> and similar 
constructions (e.g., used for searching lexemes). Performing a <xsl:if>-
based query with a file that contained the analyses of only five short 
texts from GRETIL took more than five minutes on my computer. Did I miss a 
point? Are there more effective ways for doing this, still using XML?
b) As said above, my computer was overexerted when performing the analysis. 
Could some of you imagine to provide computing power, so that the large 
files (Puranas, phil. treatises, ...) could also be processed (32-bit 
systems preferred)?
--- End technicalities ---

Best regards,

Oliver Hellwig

-------------

PD Dr. Oliver Hellwig
South-Asia Institute, University of Heidelberg
Im Neuenheimer Feld 330
69120 Heidelberg, Germany