Accessing the MMDL database for links to the manuscripts
It wasn’t easy browsing the titles of palm-leaf manuscripts listed on the Myanmar Manuscript Digital Library (MMDL) website hosted by the University of Toronto. My problem was that I couldn’t read the transliterated titles nearly as well as I would read them in our Myanmar (Burmese) script.
To begin with, the title of the very first manuscript file in the U Pho Thi Library database is “A-me:-tō-puṃ”. I was at a loss until I looked up the first page of the corresponding pdf file, and it was just “အမေးတော်ပုံ”. Aha!
As it is, the transliteration may serve the needs of international and local scholars well. But for us ordinary folks it would be good to add the Burmese titles as well. Besides, Burmese titles would reduce the chance of misidentifying a manuscript, for example by misreading its transliteration.
On the other hand, I was aware of the quirks of using Burmese text with computer applications. I think this was largely because we were so slow in adopting the Unicode system.
To see if I could add Burmese titles to the database I tried (i) scraping the MMDL database for the manuscript titles and their URLs, (ii) adding a few Burmese titles (for demonstration only), and (iii) writing them out to a text file. All of this was done within the R statistical environment via the RStudio software.
Scraping the MMDL database
To see if the MMDL website allows scraping, I looked into their “robots.txt” file at https://mmdl.utoronto.ca/robots.txt. The result was:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
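Looking at their databases page with https://mmdl.utoronto.ca/databases/robots.txt, I found no “robots.txt” file there. In any case, crawlers only consult the “robots.txt” at the root of a site, and the one above disallows only /wp-admin/. So I think there is no restriction on scraping the MMDL database pages.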
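As a rough cross-check, the same reasoning can be tried in R. This is only a minimal sketch: it looks at the Disallow lines quoted above and ignores the Allow line and per-agent groups, so it is not a full robots.txt parser. The helper name is_disallowed() and the path "/wp-admin/options.php" are hypothetical, made up for illustration.

```r
# The robots.txt lines quoted above, typed in by hand (no network access needed)
robots_txt <- c(
  "User-agent: *",
  "Disallow: /wp-admin/",
  "Allow: /wp-admin/admin-ajax.php"
)

# Keep only the path prefixes listed under "Disallow:"
disallow_lines <- grep("^Disallow:", robots_txt, value = TRUE)
disallowed <- sub("^Disallow:[[:space:]]*", "", disallow_lines)

# Hypothetical helper: TRUE if the path starts with any disallowed prefix
is_disallowed <- function(path, rules) {
  any(startsWith(path, rules))
}

# The database pages, and a made-up admin path for comparison
db_blocked    <- is_disallowed("/databases/u-po-thi-library/", disallowed)
admin_blocked <- is_disallowed("/wp-admin/options.php", disallowed)
print(c(databases = db_blocked, admin = admin_blocked))
```

A prefix test like this is enough for the three rules shown here; a site with wildcards or several user-agent groups would need a proper parser.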
Now, to get all the page links (Page: A – C / D – F / G – I / J – L / M – O / P – R / S – U / V – X / Y – Z) for UPT, I ran the script by Paul Rougieux from here.
library(magrittr)
library(tibble)
scraplinks <- function(url){
  # Create an html document from the url
  webpage <- xml2::read_html(url)
  # Extract the URLs
  url_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_attr("href")
  # Extract the link text
  link_ <- webpage %>%
    rvest::html_nodes("a") %>%
    rvest::html_text()
  return(tibble(link = link_, url = url_))
}
UPT_mainLinks <- scraplinks("https://mmdl.utoronto.ca/databases/u-po-thi-library/")
Inspecting the tibble UPT_mainLinks, we find the links we need in rows 13 and 15 to 23:
UPT_mainLinks[c(13, 15:23),]
link | url |
---|---|
Home | https://mmdl.utoronto.ca/ |
A – C | /databases/u-po-thi-library/a-c/ |
D – F | /databases/u-po-thi-library/d-f/ |
G – I | /databases/u-po-thi-library/g-i/ |
J – L | /databases/u-po-thi-library/j-l/ |
M – O | /databases/u-po-thi-library/m-o/ |
P – R | /databases/u-po-thi-library/p-r/ |
S – U | /databases/u-po-thi-library/s-u/ |
V – X | /databases/u-po-thi-library/v-x/ |
Y – Z | /databases/u-po-thi-library/y-z/ |
Thus we can construct the complete URLs for pages “A-C” to “Y-Z” in UPT as follows:
# Site root (row 13, "Home") without its trailing slash
Purl0 <- substring(UPT_mainLinks$url[13], 1, nchar(UPT_mainLinks$url[13]) - 1)
UPT_Purl <- character()
for (i in 15:23) {
  # Prepend the site root to each relative page path
  UPT_Purl[i - 14] <- paste0(Purl0, UPT_mainLinks$url[i])
}
UPT_Purl
[1] "https://mmdl.utoronto.ca/databases/u-po-thi-library/a-c/"
[2] "https://mmdl.utoronto.ca/databases/u-po-thi-library/d-f/"
[3] "https://mmdl.utoronto.ca/databases/u-po-thi-library/g-i/"
[4] "https://mmdl.utoronto.ca/databases/u-po-thi-library/j-l/"
[5] "https://mmdl.utoronto.ca/databases/u-po-thi-library/m-o/"
[6] "https://mmdl.utoronto.ca/databases/u-po-thi-library/p-r/"
[7] "https://mmdl.utoronto.ca/databases/u-po-thi-library/s-u/"
[8] "https://mmdl.utoronto.ca/databases/u-po-thi-library/v-x/"
[9] "https://mmdl.utoronto.ca/databases/u-po-thi-library/y-z/"
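Since paste0() is vectorised, the same URLs can also be built without a loop and without hard-coded row numbers. The sketch below is an alternative, not the method used above. It works on a small stand-in data frame (main_links, a made-up name) holding three of the rows shown earlier, and it assumes that the site root is the row labelled "Home" and that every page link is a relative path starting with "/".

```r
# Stand-in for UPT_mainLinks: three rows copied from the table above
main_links <- data.frame(
  link = c("Home", "A – C", "D – F"),
  url  = c("https://mmdl.utoronto.ca/",
           "/databases/u-po-thi-library/a-c/",
           "/databases/u-po-thi-library/d-f/"),
  stringsAsFactors = FALSE
)

# Site root without its trailing slash
root <- sub("/$", "", main_links$url[main_links$link == "Home"])

# Relative paths start with "/"; complete them all in one vectorised call
is_relative <- startsWith(main_links$url, "/")
page_urls <- paste0(root, main_links$url[is_relative])
print(page_urls)
```

On the real tibble the "starts with /" test may pick up other relative links from the page menu, so the result would still need a quick look before use.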
Get the links from each of the nine pages in UPT_Purl:
UPT_pLink <- list()
for (j in 1:9) {
  # select only the links for manuscripts (literal match on the digicoll prefix)
  UPT_pLink[[j]] <- scraplinks(UPT_Purl[j]) %>%
    .[grep("https://digicoll.", .$url, fixed = TRUE), ]
}
Combine all the manuscript links into a single tibble.
UPT_mLink <- do.call(rbind, UPT_pLink)
UPT_mLink
link |
---|
A-me:-tō-puṃ |
Abhidhammattasaṅgaha-gaṇṭhi-sac-nissaya |
Abhidhammattha-vibhāvanī (vā) Ṭīkā-kyō |
Abhidhammatthasaṅgaha-dīpanī |
Abhidhammatthasaṅgaha-dīpanī-nissaya |
Abhidhammatthasaṅgaha-dīpanī-nissaya (Mon) |
Abhidhammatthasaṅgaha-nissaya |
Abhidhammatthavibhāvanī (Ṭīkā-kyō) |
Abhidhammā 7-kyam: Anuṭīkā (8 sections) |
Abhidhammā 7-kyam: Anuṭīkā (8 sections) |
So we find that the U Pho Thi Library database contains 1202 manuscripts.
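The filter-and-combine steps can also be sketched with lapply() on small stand-in data, which makes it easy to check the logic without touching the website. Everything here is an assumption for illustration: page1 and page2 are made-up data frames, only the first manuscript row is taken from the output in this post, and keep_manuscripts() is a hypothetical helper name.

```r
# Made-up stand-ins for two scraped pages (only the first row is real)
page1 <- data.frame(
  link = c("A-me:-tō-puṃ", "Home", "Skip to content"),
  url  = c("https://digicoll.library.utoronto.ca/mmdl/UPT097F.pdf",
           "https://mmdl.utoronto.ca/",
           NA),
  stringsAsFactors = FALSE
)
page2 <- data.frame(
  link = c("Example title", "Databases"),
  url  = c("https://digicoll.library.utoronto.ca/mmdl/EXAMPLE.pdf",
           "/databases/"),
  stringsAsFactors = FALSE
)
pages <- list(page1, page2)

# Hypothetical helper: keep rows whose url starts with the digicoll prefix.
# Anchors without an href come back as NA, so they are excluded explicitly.
keep_manuscripts <- function(df) {
  is_ms <- !is.na(df$url) & startsWith(df$url, "https://digicoll.")
  df[is_ms, , drop = FALSE]
}

# Filter every page, then stack the results into one table
m_links <- do.call(rbind, lapply(pages, keep_manuscripts))
print(nrow(m_links))  # number of manuscript links found
```

nrow() on the combined table is the count reported above. Note that it counts links, so a manuscript listed twice would be counted twice.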
Collecting the Myanmar language titles of the manuscripts and adding them to the database
Now, to get the title of a manuscript in the Myanmar language, I had to open the corresponding pdf files one by one from the UPT database site and look at the title page. An example of a title page is given below:
The following are the titles of the first five manuscripts in the UPT database, in transliteration and in the Myanmar language, which I am using for my present exercise:
Transliterated title | Title (Myanmar) |
---|---|
A-me:-tō-puṃ | အမေးတော်ပုံ |
Abhidhammattasaṅgaha-gaṇṭhi-sac-nissaya | အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ |
Abhidhammattha-vibhāvanī (vā) Ṭīkā-kyō | အဘိဓမ္မတ္ထဝိဘာဝနီ (ဝါ) ဋီကာကျော် |
Abhidhammatthasaṅgaha-dīpanī | အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ |
Abhidhammatthasaṅgaha-dīpanī-nissaya | အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီနိသျ |
But this method of acquiring the titles is clearly impractical for handling the whole database. It takes too much effort and time. Also, in some of the manuscripts I picked at random, I couldn’t locate the Myanmar titles at all!
The most efficient approach would be to use the list of titles in the Myanmar language that was originally used for the transliterations. Most likely, the MMDL project has stored this metadata together with the data for the manuscripts. If so, the MMDL project itself would be able to add Myanmar language titles to their databases without much effort, if they care to do so.
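Create a column for the titles in Burmese, one entry per manuscript.
newCol <- character(length = 1202)
# Fill in the five titles collected so far; the rest stay as empty strings
newCol[1:5] <- c(
  "အမေးတော်ပုံ",
  "အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ",
  "အဘိဓမ္မတ္ထဝိဘာဝနီ (ဝါ) ဋီကာကျော်",
  "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ",
  "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီနိသျ"
)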
head(newCol)
[1] "အမေးတော်ပုံ" "အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ"
[3] "အဘိဓမ္မတ္ထဝိဘာဝနီ (ဝါ) ဋီကာကျော်" "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ"
[5] "အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီနိသျ" ""
Add “linkMM” to the tibble as a new column after “url”:
UPT_mLink.1 <- UPT_mLink %>% add_column(linkMM = newCol, .after = "url")
UPT_mLink.1
link |
---|
Aḍḍarasi-dhammasat[tha]-laṅka |
Aḍḍasaṅkhepavaṇṇana (Dhammasat[tha]) |
Aṅguttara-nikaya-pali-to dutiya thup (Chakka mha Ekadasa-nipata) |
Aṅguttara-nikaya-paḷi-to (paṭhama) thup (Eka to Pañcaka sections) |
Aṅguttara-paḷi-to-nissaya (Dasaṅguttara, Ekadasaṅguttara) |
Aṅguttuir-aṭṭhakatha (Catuka mha Ekadasa) |
Aṅguttuir-aṭṭhakatha (Catukka mha Ekadasa-kathi) dutiya thup (7 sections) |
Aṅguttuir-aṭṭhakatha paṭhama thup (4 sections) |
Aṅguttuir-paḷi-to (Eka mha Pañcaka) (paṭhama) thup (4 sections) |
Aṅguttuir-paḷi-to (paṭhama) thup (5 sections) |
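Filling the new column by position only works as long as the row order never changes. A sketch of an alternative is to key the Burmese titles on the URL and look them up with match(). The data frames links and titles_mm below are stand-ins made up for this illustration, using three rows and two titles that appear elsewhere in this post.

```r
# Three rows of the link table, copied from the output in this post
links <- data.frame(
  link = c("A-me:-tō-puṃ",
           "Abhidhammattasaṅgaha-gaṇṭhi-sac-nissaya",
           "Abhidhammatthasaṅgaha-dīpanī-nissaya (Mon)"),
  url  = c("https://digicoll.library.utoronto.ca/mmdl/UPT097F.pdf",
           "https://digicoll.library.utoronto.ca/mmdl/UPT642F.pdf",
           "https://digicoll.library.utoronto.ca/mmdl/UPT295F.pdf"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: Burmese titles collected so far, keyed by url.
# The order is deliberately different from the order in `links`.
titles_mm <- data.frame(
  url    = c("https://digicoll.library.utoronto.ca/mmdl/UPT642F.pdf",
             "https://digicoll.library.utoronto.ca/mmdl/UPT097F.pdf"),
  linkMM = c("အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ", "အမေးတော်ပုံ"),
  stringsAsFactors = FALSE
)

# match() finds each url in the lookup; rows without a title get ""
idx <- match(links$url, titles_mm$url)
links$linkMM <- ifelse(is.na(idx), "", titles_mm$linkMM[idx])
print(links[, c("url", "linkMM")])
```

With this layout, new titles can be appended to the lookup table in any order as they are collected.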
Creating a pdf file of links for the manuscripts in the U Pho Thi database
The workflow for creating a pdf file is: (i) export the data from R as a semicolon-delimited text file, and (ii) open it in a spreadsheet and export it as a pdf file. Instead of this manual approach, I tried a number of alternatives for programmatically creating the pdf file inside R. Failed miserably!
Creating the text file
In my setup, the standard ways of writing out a text file in R, such as write.csv() or write.csv2(), could not produce the text in the Myanmar language. The only way that worked for me was to use writeLines() with the “useBytes = TRUE” option.
# Paste the three columns of each row together with ";" and put a header line on top
y <- do.call("paste", c(UPT_mLink.1, sep = ";")) %>%
  c("Title;URL;Title(Myanmar)", .)
y[1:10]
[1] "Title;URL;Title(Myanmar)"
[2] "A-me:-tō-puṃ;https://digicoll.library.utoronto.ca/mmdl/UPT097F.pdf;အမေးတော်ပုံ"
[3] "Abhidhammattasaṅgaha-gaṇṭhi-sac-nissaya;https://digicoll.library.utoronto.ca/mmdl/UPT642F.pdf;အဘိဓမ္မတ္ထသင်္ဂဟကဏ္ဍိသစ်နိသျ"
[4] "Abhidhammattha-vibhāvanī (vā) Ṭīkā-kyō;https://digicoll.library.utoronto.ca/mmdl/UPT725_3F.pdf;အဘိဓမ္မတ္ထဝိဘာဝနီ (ဝါ) ဋီကာကျော်"
[5] "Abhidhammatthasaṅgaha-dīpanī;https://digicoll.library.utoronto.ca/mmdl/UPT530_2F.pdf;အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ"
[6] "Abhidhammatthasaṅgaha-dīpanī-nissaya;https://digicoll.library.utoronto.ca/mmdl/UPT393F.pdf;အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီနိသျ"
[7] "Abhidhammatthasaṅgaha-dīpanī-nissaya (Mon);https://digicoll.library.utoronto.ca/mmdl/UPT295F.pdf;"
[8] "Abhidhammatthasaṅgaha-nissaya;https://digicoll.library.utoronto.ca/mmdl/UPT527_2F.pdf;"
[9] "Abhidhammatthavibhāvanī (Ṭīkā-kyō);https://digicoll.library.utoronto.ca/mmdl/UPT520_4F.pdf;"
[10] "Abhidhammā 7-kyam: Anuṭīkā (8 sections);https://digicoll.library.utoronto.ca/mmdl/UPT700F.pdf;"
writeLines(y, "UPT_db.txt", useBytes = TRUE)
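To check that the Myanmar text survives the trip to disk, the file can be read back and compared with what was written. The following is a minimal sketch under a few assumptions: it uses two sample lines instead of the full table, writes to a temporary file rather than UPT_db.txt, and calls enc2utf8() first so that the bytes written are UTF-8 whatever the session encoding is.

```r
# Two sample lines in the same "Title;URL;Title(Myanmar)" layout as above
y <- enc2utf8(c(
  "Title;URL;Title(Myanmar)",
  "A-me:-tō-puṃ;https://digicoll.library.utoronto.ca/mmdl/UPT097F.pdf;အမေးတော်ပုံ"
))

# Temporary file, so the real UPT_db.txt is left untouched
out_file <- tempfile(fileext = ".txt")
writeLines(y, out_file, useBytes = TRUE)

# Read the lines back and declare them as UTF-8
back <- readLines(out_file, encoding = "UTF-8")
round_trip_ok <- identical(back, y)
print(round_trip_ok)

# Split the data line into its three fields on the literal ";"
fields <- strsplit(back[2], ";", fixed = TRUE)[[1]]
print(length(fields))
```

One thing this makes visible: a title that itself contains a semicolon would split into extra fields, so the separator is only safe while no title uses it.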
Opening the text file in LibreOffice Calc, and exporting to pdf
The “UPT_db.txt” file is a semicolon-separated text file. To create a pdf file from it, we (i) create a blank spreadsheet in LibreOffice Calc and import the text file, (ii) format the page as legal, landscape, and with dotted row-borders, (iii) select the area containing data, and (iv) export to pdf.
The remaining Myanmar language titles, when available, could easily be added on the above spreadsheet, or programmatically by adding them to the source text file within R.
With this pdf file you can search for English or Myanmar text, and when you click any of the URLs, the corresponding file will be downloaded from the MMDL website.
The Bagaya Monastery Database (BGY)
Accessing the BGY database gives links to 206 pdf manuscript files. This database could also be handled with exactly the same workflow as for UPT.