Introducing TBRC’s Tibetan eText Repository
We are thrilled to announce the release of our Tibetan eText Repository on tbrc.org. The eText repository is a collaborative effort between TBRC and many publishers, authors and institutions to archive a large corpus of Tibetan texts and make it available through search.
The Power of a Tibetan eText Repository
Imagine being able to search for a place name, a person’s name, a topic, a title, or a term inside texts, across many collections, in many different traditions, at different points in time. You would discover connections that might otherwise be impossible to imagine. The power of the Tibetan eText repository is discovery. As one of TBRC’s lineage patriarchs, His Holiness Drikung Chetsang Rinpoche, said, “this will end sectarianism.”
Impact on Scholarship
The impact of this type of resource is best expressed by scholars in the field.
“As Tibetan Studies specialists increasingly move to using searchable digital text, the Tibetan Buddhist Resource Center is once again at the forefront of the discipline, anticipating the needs of scholars and making freely available a body of literature that is unprecedented in its scope and accuracy. These cutting-edge technological innovations will revolutionize the way we do research, and the way that Tibetan Buddhism is studied and taught in universities… A tremendous contribution to the field.” Jose I. Cabezon, University of California, Santa Barbara
“This wonderful tool is going to transform the way Tibetan studies can be done. Previously, searching for how a specific term was used in different texts required a tremendous amount of work, asking colleagues for references, guessing at where it might be used and painstakingly searching for the term. Moreover, thanks to this tool, we may also soon have a proper Tibetan dictionary, with dated examples of the use of specific terms.” Gray Tuttle, Columbia University.
“In philological research, and particularly in translation, progress can be brought to a halt over the meaning of a single term (e.g. la bzla’ ba). A large body of e-texts produced by OCR, coupled with search tools, enables the novice to have the same broad context of literature that hitherto only belonged to the greatest scholars.” Professor Kurt Keutzer, University of California, Berkeley.
“Anyone accessing the TBRC website will immediately realize that its new design is a major achievement in terms of its visual appeal and its user-friendliness. The enormous number of e-text resources that are now prominently on display and available for research will no doubt change the way in which most, if not all, of us do their research. This is a major event in the history of Tibetan Studies and an amazing tool for Tibetan philology. Jeff Wallman and his team are to be warmly congratulated and surely deserve the gratitude of anyone working in this field.” Leonard W.J. van der Kuijp, Harvard University
“Large corpora of Buddhist texts have recently become available in digital format thanks to TBRC. This enables the implementation of powerful search functions and corpus-linguistic methods. A methodological approach based on searching inside large collections of texts in a systematic way opens new horizons for Indo-Tibetan research. We are in a position to ‘mine content’ faster than ever. Therefore we can better understand complex linguistic, philological, anthropological, philosophical, and cultural phenomena in their own textual context. To illustrate this point, we can now for example carry out a research program consisting of a corpus-based discourse analysis of specific Buddhist systems of thought, which would be impossible without tools such as those developed by TBRC. This is the approach I followed for my own research on ’Ju Mi pham rnam rgyal rgya mtsho’s interpretation of the two truths. Searching key terms, collocates, synonyms, definitions, and clusters of technical terms inside the 15,000 folios of Mipham’s gsung ’bum would be fastidious without such powerful searching tools. Due to the sheer size of some Buddhist corpora of texts and the complexity of the research topics at hand, drawing inferences on the basis of specific or isolated occurrences of some technical terms without fully understanding their usage in their own textual context could be methodologically unsound. Search functions as well as corpus-linguistic methods provide a solution to this quandary and represent an extremely promising approach to better understanding Buddhist philosophy and practices.” Gregory Forgues, University of Vienna
“Over the past decade, the Tibetan Buddhist Resource Center has revolutionized the study of Tibetan literature with its extensive library of digitized texts. Now with their latest offering, a repository of searchable e-texts, they promise to do it again. The ramifications are hard to imagine; certainly, they will be far-reaching!” Jake Dalton, University of California, Berkeley
“The Tibetan Buddhist Resource Center’s release of an extended database of searchable Tibetan e-texts represents a paradigm shift in the kinds of research that are now possible. This collection opens up the Tibetan literary world to data-driven quantitative and qualitative analysis of technical vocabulary, place and personal names, literary topics and genres, etc. Its impact will continue to grow as the corpus of machine-readable texts expands and as more advanced and complex forms of searching become possible. This is certainly one of the most exciting developments in recent history for the study of Tibetan literature and textual production.” Andrew Quintman, Yale University
“Ever since its inception, TBRC has been the premier online destination for translators and scholars of Tibetan writings. Now, with this vast repository of e-texts, we have something like our very own Hubble telescope, a powerful tool for exploring the rich universe of Tibetan literature, allowing us to chart the evolution of terms and ideas, and discover previously unobserved intertextual connections—thereby radically transforming the way we work.” Adam Pearcey (Lotsawa House)
“A huge leap forward for Tibetan studies! Searching electronically through multiple texts for terms and topics will uncover a rich treasure of associations and new contexts for our knowledge of Tibetan religion, culture, and history. The possibilities are endless.” Janet Gyatso, Harvard University
What are eTexts?
eText is short for electronic text (as opposed to physical text). When we scan a text, we create an image of the text. A scanned text is not an eText. An eText is a digital document whose content is readily accessible to the user as machine-readable characters rather than as a picture of a page. An eText is said to be “born digital” when it was first created in electronic format.
A Short History of TBRC’s Interest in eTexts
TBRC’s interest in eTexts began with the formation of TBRC itself, as Gene Smith foresaw the extraordinary power of searching inside texts. In the formation documents, Gene Smith states:
“Many Tibetan text publication projects utilize computer input only to obtain the printed output and then lack the capability to preserve this input. In order to preserve texts that have already been input electronically, TBRC will advocate appropriate levels of technology, description, and storage for the preservation and archiving of these digital documents being produced worldwide. The Center will assist owners of electronic text input and digital images of Tibetan materials by offering advisory services, and will create a Digital Text Databank to serve as a repository and a link to these files.”
We have done exactly that.
Gene’s early efforts at gathering input centered on the genres of literature he was most interested in, mostly historical texts and biographies. At the time, these documents were in fonts created specifically for printing texts, and they were not amenable to digital storage and retrieval. With the adoption of Tibetan Unicode, however, digital storage and retrieval has become a reality. Toward the end of his life, Gene embarked on gathering input for what we called at the time the Dharmacloud. Since then, we have built the framework necessary to carry out Gene’s vision.
Sources of eTexts
There are two main sources of eTexts:
- Lineage Masters (input eTexts)
- Optical Character Recognition (OCR eTexts)
We are indebted to our lineage masters and the noble monasteries that work tirelessly to print Tibetan works. Tibetan book culture is an extraordinary activity, and these institutions create enormous volumes of born-digital texts. In many cases, these efforts are aimed at producing print editions. We hope, however, that with the release of a robust Tibetan eText Repository we can assist these projects in the digital storage, archiving and use of the input eTexts beyond the printed form, just as Gene Smith envisioned.
We have been entrusted to care for the following collections:
- Karma Lekshey Ling
- Drikung Chetsang Rinpoche
- Vajra Vidya – Thrangu Rinpoche
- Palri Parkhang
- Karma Delek
- Guru Lama
- Shechen Monastery
- Tulku Sangag Rinpoche
- Larung Gar
- Various institutions
These texts come in a variety of document formats and fonts. Therefore, we have created a suite of tools to normalize and convert these documents into Tibetan Unicode so that they can be properly stored and searched.
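The clean-up half of that process can be sketched as follows. This is only an illustration, not TBRC's actual tool suite: the function names, the choice of NFC normalization, and the list of stripped characters are all assumptions. Converting legacy font encodings would additionally require per-font mapping tables, which are not shown here.

```python
import re
import unicodedata

# Characters deleted during clean-up (an assumption for this sketch):
# the byte-order mark and the zero-width space.
INVISIBLE = dict.fromkeys(map(ord, "\ufeff\u200b"))


def normalize_etext(text):
    """Return text in a consistent form for storage and search."""
    text = unicodedata.normalize("NFC", text)  # one canonical code point order
    text = text.translate(INVISIBLE)           # drop invisible debris
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()


def tibetan_ratio(text):
    """Fraction of non-space characters in the Tibetan block (U+0F00-U+0FFF)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\u0f00" <= c <= "\u0fff" for c in chars) / len(chars)
```

A low `tibetan_ratio` on a converted file is one possible signal that the source was still in a legacy font encoding and needs a mapping step before it can be searched.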
Optical Character Recognition (OCR)
In 2009, I began to work with the Russian team at the Rime Center, Moscow. With a grant from the Trace Foundation, we were able to scan and optically recognize the Kanjur and Tenjur dpe sdur ma. This effort was primarily experimental; much of the OCR technology the Rime Center built is still being developed, and they are using it to produce large numbers of eTexts.
In 2013, we began a partnership with the University of California, Berkeley to build a stable platform for producing eTexts from our scanned archive. Under the direction of TBRC Board member Professor Kurt Keutzer, Research and Development Engineer Zach Rowinski created a Tibetan OCR engine called Namsel OCR. Through this program, TBRC scans are being converted into OCR eTexts at an enormous rate, and we hope to have 25% of the TBRC Library completely searchable within one year. A large collection has already been OCR’ed and, with this release of our new search application on tbrc.org, is now fully searchable within TBRC.
Discovery – Deep Search & Deep Context
TBRC’s primary focus for the Tibetan eText Repository is two-fold:
- Deep Search
- Deep Context
We are building the necessary technology to allow readers to search inside eTexts. This core technology includes Tibetan analyzers for the Lucene search engine, eXist database extensions for Tibetan sorting and keyword matching, and query routines that allow us to handle text search from the online library at tbrc.org. When a reader searches an eText, they find matches in the texts themselves. These matches are then contextualized: they are identified by the classification system within TBRC. Within the deep context of the TBRC Library metadata, readers can then match the text they found with scanned sources. Searching in this framework allows readers to connect each match with its bibliographic information and its scanned source.
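The two ideas can be illustrated with a small sketch. It is a plain-Python stand-in, not the Lucene analyzers or eXist extensions themselves. Tibetan marks syllable boundaries with the tsheg rather than spaces between words, so the sketch splits on tsheg and shad, builds a positional index, and then attaches catalog metadata to each hit. The function names, the work identifiers `W1` and `W2`, and the catalog entries are hypothetical.

```python
import re
from collections import defaultdict

# Tsheg (U+0F0B, U+0F0C) separates syllables; shad (U+0F0D, U+0F0E) and
# whitespace end phrases. All are treated as token boundaries here.
BOUNDARY = re.compile("[\u0f0b\u0f0c\u0f0d\u0f0e\\s]+")


def syllables(text):
    """Split Tibetan text into syllables."""
    return [s for s in BOUNDARY.split(text) if s]


def build_index(texts):
    """Map each syllable to the set of (text_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for text_id, text in texts.items():
        for position, syllable in enumerate(syllables(text)):
            index[syllable].add((text_id, position))
    return index


def search_phrase(index, phrase):
    """Deep search: find every place the syllables of phrase occur in sequence."""
    query = syllables(phrase)
    if not query:
        return []
    hits = [
        (text_id, position)
        for text_id, position in index.get(query[0], set())
        if all((text_id, position + i) in index.get(s, set())
               for i, s in enumerate(query))
    ]
    return sorted(hits)


def contextualize(hits, catalog):
    """Deep context: attach catalog metadata to each raw match."""
    return [{"text_id": t, "position": p, **catalog.get(t, {})} for t, p in hits]


# Hypothetical two-text corpus and catalog, for illustration only.
TERM = "བསྟན་པ"
texts = {"W1": "སངས་རྒྱས་ཀྱི་" + TERM + "།", "W2": TERM + "་རིན་པོ་ཆེ།"}
catalog = {"W1": {"title": "Example title A"}, "W2": {"title": "Example title B"}}

index = build_index(texts)
hits = search_phrase(index, TERM)
results = contextualize(hits, catalog)
```

A production analyzer would also need to handle particles, spelling variants and normalization, none of which this sketch attempts.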
Lastly, at the encouragement of Professors David Germano and Marcus Bingenheimer, we have selected TEI-XML as the data format for the eTexts. This will allow greater interoperability between collections, as well as corpus-wide connectivity to other projects in different languages, such as Chinese.
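As a rough illustration of what a TEI-XML wrapper for an eText might look like, the sketch below builds a minimal TEI document using Python's standard library. The element layout follows the general TEI skeleton (a `teiHeader` with a `fileDesc`, followed by `text` and `body`). The function name, the placeholder publication note, and the choice to tag the text with `xml:lang="bo"` are assumptions, not TBRC's actual schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def q(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def to_tei(title, body_text, source_note):
    """Wrap one passage of Unicode Tibetan in a minimal TEI document."""
    root = ET.Element(q("TEI"))

    # Header: the three parts of fileDesc that TEI expects.
    header = ET.SubElement(root, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Illustrative example only"
    source = ET.SubElement(file_desc, q("sourceDesc"))
    ET.SubElement(source, q("p")).text = source_note

    # Text: language-tagged as Tibetan ("bo").
    text = ET.SubElement(root, q("text"), {f"{{{XML_NS}}}lang": "bo"})
    body = ET.SubElement(text, q("body"))
    ET.SubElement(body, q("p")).text = body_text

    return ET.tostring(root, encoding="unicode")


document = to_tei("Example title", "བཀྲ་ཤིས་བདེ་ལེགས།", "Hypothetical source note")
```

Because the language is declared on the `text` element, a multilingual pipeline could in principle route Tibetan and Chinese documents to different analyzers while sharing the same header structure.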
Tibetan Buddhist Resource Center
September 23, 2014