Wednesday, June 29, 2022

Accessing the MMDL database for links to the manuscripts


It wasn’t easy browsing the titles of palm-leaf manuscripts listed on the Myanmar Manuscript Digital Library (MMDL) website hosted by the University of Toronto. My problem was that I couldn’t read their transliterations as easily as I would read the titles in our Myanmar (Burmese) script.

To begin with, the title for the very first manuscript file in the U Pho Thi Library database is: “A-me:-tō-puṃ”. I was at a loss until I looked up the first page of the corresponding PDF file, and it was just “အမေးတော်ပုံ”. Aha!

As it is, the transliteration may serve the needs of international and local scholars well. But for us ordinary folks it would be good to add the Burmese titles as well. Besides, Burmese titles would remove the chance of misidentifying a manuscript by, for example, misreading its transliteration.

On the other hand, I was aware of the quirks of using Burmese text with computer applications. I think this was largely because we were so slow in adopting the Unicode system.

To see if I could add Burmese titles to the database I tried (i) scraping the MMDL database for the links to the manuscripts and their URLs, (ii) adding a few Burmese titles (for demonstration only), and (iii) writing them out to a text file. All of this was done within the R statistical environment via the RStudio software.

Scraping the MMDL database

To see if the MMDL website allows scraping, I looked into their “robots.txt” file at https://mmdl.utoronto.ca/robots.txt. The result was:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

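Trying https://mmdl.utoronto.ca/databases/robots.txt for their databases page, I found no “robots.txt” file there. That is to be expected, since crawlers read robots.txt only from the root of a site, so the file above applies to the whole website. As it disallows only /wp-admin/, I think there is no restriction on scraping the MMDL database.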
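As a rough cross-check of this reading, the rules quoted above can be tested against a path in R. This is only a minimal sketch: the helper `is_disallowed` is a hypothetical name, it handles just plain `Disallow:` prefixes, and it ignores `Allow:` overrides, wildcards and per-agent groups that a full robots.txt parser would honour.

```r
# robots.txt rules as quoted above (copied by hand, not fetched)
robots_txt <- c(
  "User-agent: *",
  "Disallow: /wp-admin/",
  "Allow: /wp-admin/admin-ajax.php"
)

# Hypothetical helper: TRUE if `path` starts with any Disallow prefix.
# Simplification: Allow lines, wildcards and per-agent groups are ignored.
is_disallowed <- function(path, rules) {
  dis <- grep("^Disallow:", rules, value = TRUE)
  prefixes <- trimws(sub("^Disallow:", "", dis))
  prefixes <- prefixes[nzchar(prefixes)]   # an empty Disallow blocks nothing
  any(startsWith(path, prefixes))
}

is_disallowed("/databases/u-po-thi-library/", robots_txt)
is_disallowed("/wp-admin/options.php", robots_txt)
```

The first call checks the database page used below; the second checks a path under the one directory the site excludes.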

Now, to get all the page links (Page: A – C / D – F / G – I / J – L / M – O / P – R / S – U / V – X / Y – Z) for UPT, I ran the following script, due to Paul Rougieux.

library(magrittr)
library(tibble)
scraplinks <- function(url){
    # Create an html document from the url
    webpage <- xml2::read_html(url)
    # Extract the URLs
    url_ <- webpage %>%
        rvest::html_nodes("a") %>%
        rvest::html_attr("href")
    # Extract the link text
    link_ <- webpage %>%
        rvest::html_nodes("a") %>%
        rvest::html_text()
    return(tibble(link = link_, url = url_))
}
UPT_mainLinks <- scraplinks("https://mmdl.utoronto.ca/databases/u-po-thi-library/")

By inspecting the tibble output UPT_mainLinks we find the urls as:

UPT_mainLinks[c(13, 15:23),]
link     url
<chr>    <chr>
Home     https://mmdl.utoronto.ca/
A – C    /databases/u-po-thi-library/a-c/
D – F    /databases/u-po-thi-library/d-f/
G – I    /databases/u-po-thi-library/g-i/
J – L    /databases/u-po-thi-library/j-l/
M – O    /databases/u-po-thi-library/m-o/
P – R    /databases/u-po-thi-library/p-r/
S – U    /databases/u-po-thi-library/s-u/
V – X    /databases/u-po-thi-library/v-x/
Y – Z    /databases/u-po-thi-library/y-z/


Thus we can construct the complete URLs for pages “A-C” to “Y-Z” in UPT as follows:

Purl0 <- substring(UPT_mainLinks$url[13],1, nchar(UPT_mainLinks$url[13])-1)
UPT_Purl <- character()
for (i in 15:23) {
  UPT_Purl[i-14] <- paste0(Purl0,UPT_mainLinks$url[i])
}
UPT_Purl
[1] "https://mmdl.utoronto.ca/databases/u-po-thi-library/a-c/"
[2] "https://mmdl.utoronto.ca/databases/u-po-thi-library/d-f/"
[3] "https://mmdl.utoronto.ca/databases/u-po-thi-library/g-i/"
[4] "https://mmdl.utoronto.ca/databases/u-po-thi-library/j-l/"
[5] "https://mmdl.utoronto.ca/databases/u-po-thi-library/m-o/"
[6] "https://mmdl.utoronto.ca/databases/u-po-thi-library/p-r/"
[7] "https://mmdl.utoronto.ca/databases/u-po-thi-library/s-u/"
[8] "https://mmdl.utoronto.ca/databases/u-po-thi-library/v-x/"
[9] "https://mmdl.utoronto.ca/databases/u-po-thi-library/y-z/"
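The same result can be had without a loop, since `paste0` is vectorised. The sketch below uses a hand-typed stand-in for the `url` column (three of the relative paths shown above) so that it runs without fetching the page; `base_url` and `rel_paths` are illustrative names, not objects from the script above.

```r
# Stand-in values copied from the output above
base_url  <- "https://mmdl.utoronto.ca/"
rel_paths <- c("/databases/u-po-thi-library/a-c/",
               "/databases/u-po-thi-library/d-f/",
               "/databases/u-po-thi-library/y-z/")

# Drop the trailing slash of the base, then glue it to every path at once
page_urls <- paste0(sub("/$", "", base_url), rel_paths)
page_urls
```

With the real tibble, the equivalent call would presumably be `paste0(sub("/$", "", UPT_mainLinks$url[13]), UPT_mainLinks$url[15:23])`.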


Get the links on each of the nine pages in UPT_Purl.

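UPT_pLink <- list()
for (j in 1:9) {
  # Scrape one page and keep only the rows whose URL points to a manuscript.
  # fixed = TRUE matches the text literally, so "." is not read as a regex wildcard.
  UPT_pLink[[j]] <- scraplinks(UPT_Purl[j]) %>%
    .[grep("https://digicoll.", .$url, fixed = TRUE), ]
}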

Combine all the manuscript links into a single tibble.

UPT_mLink <- do.call(rbind, UPT_pLink)
UPT_mLink
link
<chr>
A-me:-to-pu<U+1E43>
Abhidhammattasa<U+1E45>gaha-ga<U+1E47><U+1E6D>hi-sac-nissaya
Abhidhammattha-vibhavani (va) <U+1E6C>ika-kyo
Abhidhammatthasa<U+1E45>gaha-dipani
Abhidhammatthasa<U+1E45>gaha-dipani-nissaya
Abhidhammatthasa<U+1E45>gaha-dipani-nissaya (Mon)
Abhidhammatthasa<U+1E45>gaha-nissaya
Abhidhammatthavibhavani (<U+1E6C>ika-kyo)
Abhidhamma 7-kyam: Anu<U+1E6D>ika (8 sections)
Abhidhamma 7-kyam: Anu<U+1E6D>ika (8 sections)

So we find that the U Pho Thi Library database contains 1202 manuscripts.
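The `<U+1E43>` sequences in the printout above are not part of the titles. They look like the escapes R prints when the console cannot display a character in the current locale (typically on Windows without a UTF-8 locale); that the plain `o` and `a` standing where the website has `ō` and `ā` come from the same conversion is an assumption. A small sketch, using only base R, to see which characters the escapes stand for:

```r
# Code points that appear as <U+....> escapes in the printout above
code_points <- c(0x1E43, 0x1E45, 0x1E47, 0x1E6D, 0x1E6C)

# intToUtf8() turns each code point back into its character
chars <- intToUtf8(code_points, multiple = TRUE)
data.frame(escape = sprintf("<U+%04X>", code_points), char = chars)

# The first title written with \u escapes, so the script itself stays ASCII-only
title1 <- "A-me:-t\u014D-pu\u1E43"
utf8ToInt(substring(title1, nchar(title1)))  # code point of its last character
```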


Collecting the Myanmar language titles of the manuscripts and adding them to the database

Now, to get the title of a manuscript in the Myanmar language, I had to open the corresponding PDF file, one by one, from the UPT database site and look at the title page. An example of a title page is given below:

The following are the transliterated titles of the first five manuscripts in the UPT database, followed by their Myanmar-language titles, which I am using for the present exercise:

A-me:-tō-puṃ
Abhidhammattasaṅgaha-gaṇṭhi-sac-nissaya
Abhidhammattha-vibhāvanī (vā) Ṭīkā-kyō
Abhidhammatthasaṅgaha-dīpanī
Abhidhammatthasaṅgaha-dīpanī-nissaya

အမေးတော်ပုံ
အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ
အဘိဓမ္မတ္ထဝိဘာဝနီ (ဝါ) ဋီကာကျော်
အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ
အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီနိသျ

But this method of acquiring the titles is clearly impractical for the whole database. It takes too much effort and time. Also, in some of the manuscripts I picked at random, I couldn’t locate the Myanmar titles at all!

The most efficient approach would be to use the list of titles in the Myanmar language that was originally used for the transliterations. Most likely, the MMDL project has stored this metadata together with the data for the manuscripts. If so, the MMDL project itself could add the Myanmar-language titles to their databases without much effort, if they care to do so.
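If such a list existed as a table keyed by the transliterated title, attaching the Myanmar titles would be a simple lookup. The sketch below is hypothetical throughout: `scraped` stands in for `UPT_mLink`, `metadata` for a table the MMDL project might hold, the URLs are placeholders, and the column names `link` and `linkMM` are assumed. The two titles are taken from the examples above.

```r
# Stand-in for the scraped tibble (URLs are placeholders, not real links)
scraped <- data.frame(
  link = c("A-me:-t\u014D-pu\u1E43",
           "Abhidhammatthasa\u1E45gaha-d\u012Bpan\u012B",
           "Some-other-title"),
  url  = c("url-1", "url-2", "url-3"),
  stringsAsFactors = FALSE
)

# Hypothetical metadata table: transliteration -> Myanmar title
metadata <- data.frame(
  link   = c("A-me:-t\u014D-pu\u1E43",
             "Abhidhammatthasa\u1E45gaha-d\u012Bpan\u012B"),
  linkMM = c("အမေးတော်ပုံ", "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ"),
  stringsAsFactors = FALSE
)

# match() keeps the row order of `scraped`; titles without a match become NA
scraped$linkMM <- metadata$linkMM[match(scraped$link, metadata$link)]
scraped
```

Looking up by position with `match()` rather than calling `merge()` keeps the manuscripts in their scraped order, which matters if the rows are later compared with the website.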

Create a column for the titles in Burmese.

newCol <- character(length = 1202)  # one empty string per manuscript
newCol[1:5] <- c("အမေးတော်ပုံ", "အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ", "အဘိဓမ္မတ္ထဝိဘာဝနီ (ဝါ) ဋီကာကျော်", "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ", "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီနိသျ")
head(newCol)
[1] "အမေးတော်ပုံ"                  "အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ"      
[3] "အဘိဓမ္မတ္ထဝိဘာဝနီ (ဝါ) ဋီကာကျော်" "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ"           
[5] "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီနိသျ"         ""                         


Add “linkMM” to the tibble after “url”.

UPT_mLink.1 <- UPT_mLink %>% add_column(linkMM = newCol, .after = "url")
UPT_mLink.1
link
<chr>
A<U+1E0D><U+1E0D>arasi-dhammasat[tha]-la<U+1E45>ka
A<U+1E0D><U+1E0D>asa<U+1E45>khepava<U+1E47><U+1E47>ana (Dhammasat[tha])
A<U+1E45>guttara-nikaya-pali-to dutiya thup (Chakka mha Ekadasa-nipata)
A<U+1E45>guttara-nikaya-pa<U+1E37>i-to (pa<U+1E6D>hama) thup (Eka to Pañcaka sections)
A<U+1E45>guttara-pa<U+1E37>i-to-nissaya (Dasa<U+1E45>guttara, Ekadasa<U+1E45>guttara)
A<U+1E45>guttuir-a<U+1E6D><U+1E6D>hakatha (Catuka mha Ekadasa)
A<U+1E45>guttuir-a<U+1E6D><U+1E6D>hakatha (Catukka mha Ekadasa-kathi) dutiya thup (7 sections)
A<U+1E45>guttuir-a<U+1E6D><U+1E6D>hakatha pa<U+1E6D>hama thup (4 sections)
A<U+1E45>guttuir-pa<U+1E37>i-to (Eka mha Pañcaka) (pa<U+1E6D>hama) thup (4 sections)
A<U+1E45>guttuir-pa<U+1E37>i-to (pa<U+1E6D>hama) thup (5 sections)
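For step (iii), writing the table out to a text file, the main thing to watch with Burmese text is the file encoding. The following is a minimal sketch, assuming a tab-separated file is acceptable; it uses a two-row stand-in data frame with placeholder URLs and a temporary file rather than the real `UPT_mLink.1`, and the names `out`, `out_file` and `back` are illustrative.

```r
# Two-row stand-in for UPT_mLink.1 (URLs are placeholders)
out <- data.frame(
  link   = c("A-me:-t\u014D-pu\u1E43",
             "Abhidhammatthasa\u1E45gaha-d\u012Bpan\u012B"),
  url    = c("url-1", "url-2"),
  linkMM = c("အမေးတော်ပုံ", "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ"),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".txt")

# fileEncoding = "UTF-8" asks R to encode the file as UTF-8
write.table(out, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back the same way to check that the Burmese titles survived
back <- read.delim(out_file, fileEncoding = "UTF-8", encoding = "UTF-8",
                   stringsAsFactors = FALSE)
all(back$linkMM == out$linkMM)
```

Whether the round trip succeeds can depend on the R version and the session locale, so it is worth running the final check on the machine that will produce the real file.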