New version of the Digital Corpus of Sanskrit (DCS)
Oliver Hellwig
hellwig7 at GMX.DE
Sun May 8 11:31:45 UTC 2011
Dear list,
a major new release of the Digital Corpus of Sanskrit is now available at
http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/.
Apart from new texts, the website has the following new features:
1) Part-of-speech information for nouns; this means that you can search,
for instance, for hRdaya in Instr. Sg. etc.
2) A user feedback function for unclear text passages and typos. Unclear
text is marked in different colours in the corpus, and you may propose the
correct solution in a web-based form. Your feedback is highly appreciated,
and the names of contributors will be included in future versions of the corpus
(although anonymous contributions are also possible).
3) Most importantly, the corpus now contains unsupervised (= not manually
corrected!) analyses of some of the shorter texts from the GRETIL
directory. According to a recent study, we may expect a lexical error rate
of about 5% for linguistically easy texts such as the epics and about 10% for
more complicated ones with unusual, technical, or Buddhist terminology. I
started with the shorter texts (approx. up to 50 kb) because each analysis
process is computationally expensive, and my laptop soon showed signs of
overexertion. References for words found in these texts are included in the
detail pages for single words, just below the verbal and part-of-speech
information. A link leads directly to the lines in the source files where
the words are found. You may check these functions with a popular term such
as rAjan.
In addition, you will find XML files that contain the results of the
unsupervised analysis for each text. The complete list of these XML files
is printed at the bottom of the page
http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php?contents=corpus.
This page also contains an XSL
file (unsupervised.xsl) that can be used to display the XML files in a more
appealing format. Please note that you have to include the line
<?xml-stylesheet type="text/xsl" href="unsupervised.xsl"?>
after the XML declaration (<?xml version="1.0" encoding="UTF-16"?>) of each XML
file to display the results in the intended manner. Stylesheet information
will be included automatically in each file in the next release of the
corpus.
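
Until then, the stylesheet line can be added by script rather than by hand. The following is a minimal Python sketch, not part of the DCS distribution: the function names add_stylesheet and patch_file are made up for illustration, and it assumes that each file really is UTF-16 encoded and starts with the XML declaration shown above.

```python
from pathlib import Path

# The processing instruction quoted in the announcement.
STYLESHEET_PI = '<?xml-stylesheet type="text/xsl" href="unsupervised.xsl"?>'


def add_stylesheet(text: str, pi: str = STYLESHEET_PI) -> str:
    """Return text with the stylesheet line placed after the XML declaration."""
    if "<?xml-stylesheet" in text:
        return text  # a stylesheet line is already present; leave the text alone
    if not text.lstrip().startswith("<?xml"):
        raise ValueError("no XML declaration found at the start of the text")
    end = text.index("?>") + 2  # position just after the declaration
    return text[:end] + "\n" + pi + text[end:]


def patch_file(path: Path) -> None:
    """Rewrite one file in place (assumption: the file is UTF-16 encoded)."""
    # newline="" keeps the original line endings untouched.
    with open(path, encoding="utf-16", newline="") as handle:
        original = handle.read()
    with open(path, "w", encoding="utf-16", newline="") as handle:
        handle.write(add_stylesheet(original))
```

Calling patch_file(Path("some_text.xml")) on each downloaded file (the file name is a placeholder) should be enough, provided unsupervised.xsl sits in the same directory as the XML files.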
--- Some technicalities concerning unsupervised analysis, on which I would
really appreciate off-list feedback ---
a) XML files are currently a popular scheme for storing data in the
Humanities. I am not an expert in this field. However, processing
medium-sized files using XSL files was extremely slow on my computer. This
is especially true for operations that use <xsl:if ...> and similar
constructions (e.g., used for searching lexemes). Performing an
<xsl:if>-based query on a file that contained the analyses of only five short
texts from GRETIL took more than five minutes on my computer. Did I miss
something? Are there more efficient ways of doing this, still using XML?
b) As mentioned above, my computer was overexerted when performing the
analysis. Would some of you be willing to provide computing power, so that
the large files (Puranas, phil. treatises, ...) could also be processed
(32-bit systems preferred)?
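
On point a), one possible approach that still uses the XML files is to search them with a streaming parser instead of an XSL transformation. The Python sketch below makes a single pass over a file with the standard library's iterparse. The element name "word" and the attribute name "lemma" are assumptions made for illustration only; the actual names in the DCS files may differ, and no timing is claimed here.

```python
import io
import xml.etree.ElementTree as ET


def find_lexeme(source, lemma, tag="word", attr="lemma"):
    """Yield the attributes of every <tag> element whose attr equals lemma.

    source is a file name or a binary file object. The defaults for tag
    and attr are hypothetical and must be adapted to the real files.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            if elem.get(attr) == lemma:
                yield dict(elem.attrib)  # copy before the element is cleared
            elem.clear()  # release this element's attributes and children


# Small made-up sample, only to show the call.
sample = io.BytesIO(
    b'<text><line n="1">'
    b'<word lemma="rAjan" form="rAjA"/>'
    b'<word lemma="hRdaya" form="hRdayena"/>'
    b"</line></text>"
)
hits = list(find_lexeme(sample, "rAjan"))
```

Within XSLT 1.0 itself, the xsl:key element together with the key() function is the standard mechanism for indexed lookups, and it may be worth trying in place of repeated <xsl:if> tests.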
--- End technicalities ---
Best regards,
Oliver Hellwig
-------------
PD Dr. Oliver Hellwig
South-Asia Institute, University of Heidelberg
Im Neuenheimer Feld 330
69120 Heidelberg, Germany