As a musically illiterate song lover, I love all beautiful songs, irrespective of genre and language. This remark, I should add, is not specifically designed to irritate my long-lost friend, who won’t care for anything that is not soft, slow and pleasing to the ear, and invariably in the Myanmar language.
Anyway, I was looking for Myanmar classical songs sung by the late Win Oo, a successful actor, short story writer, and songster. I had once heard him sing the classical song “Kha-nway-sun” or “Eve of Spring”, accompanied on the Myanmar harp by the virtuoso harpist U Ba Than. It was a fabulous performance by both artists, and I wanted to enjoy more of that genre of music, which is known as “Yodaya” in our language. For some reason or other, I wasn’t able to find any more songs in this genre on YouTube until a few days back. Then I discovered that the same duo had recorded a number of classical songs from the pinnacle of classical music known as “Maha Gita” or Greater Music.
After enjoying a few songs in this series I discovered the “Sylvan” composition, better known as ပန်းဟေဝန် or “Flowering woods”, which I read somewhere was the song that inspired Prince Pyinsi to compose his classic ပန်းမြိုင်လယ် or “In the midst of the flowering woods”. Alas, the author of this beautiful composition is unknown; the name must have been lost as the lyrics were passed down orally, and the music by ear, from generation to generation.
As a fledgling in NLP, but an old birdie in biological age, barely able to spread its wings, I was suddenly inspired to find out the author of this piece of music. I guess it could be done by using what are known as stylometric methods in natural language processing. Wikipedia describes stylometry as “… the application of the study of linguistic style, usually to written language, but it has successfully been applied to music and to fine-art paintings as well.” I am citing two interesting applications of stylometry from the Wikipedia article:
In April 2015, researchers using stylometry techniques identified a play, Double Falsehood, as being the work of William Shakespeare. Researchers analyzed 54 plays by Shakespeare and John Fletcher and compared average sentence length, studied the use of unusual words and quantified the complexity and psychological valence of its language.
In 2018, Mark Glickman, senior lecturer in statistics at Harvard University worked with Ryan Song, a former statistics student at Harvard, and Jason Brown, a professor at Dalhousie University in Nova Scotia, applied stylometry to find that, most likely, The Beatles’ song “In My Life” was composed by John Lennon, but with a 50% chance that Paul McCartney wrote the middle eight.
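To make the idea a little more concrete, here is a minimal sketch of one very simple stylometric comparison. It is not the method used in either study above, and it is only my assumption that it suits song lyrics: each text is reduced to counts of overlapping character n-grams (counted on raw Unicode code points, which sidesteps the fact that Myanmar text does not put spaces between words), and the unknown text is compared with each candidate author's pooled songs by cosine similarity. The function names, the author labels and the placeholder English strings are all hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, ignoring all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_authors(unknown_text, known_songs, n=3):
    """Rank candidate authors by similarity to the unknown text.

    known_songs maps an author label to a list of that author's lyrics.
    """
    target = char_ngrams(unknown_text, n)
    scores = {}
    for author, songs in known_songs.items():
        profile = Counter()
        for lyrics in songs:
            profile.update(char_ngrams(lyrics, n))  # pool all songs per author
        scores[author] = cosine(target, profile)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder strings stand in for real Unicode lyrics (hypothetical data).
known = {
    "author_a": ["the flowering woods in spring", "spring flowers in the woods"],
    "author_b": ["golden palace by the river", "the river runs past the palace"],
}
ranking = rank_authors("woods flowering in the spring", known)
print(ranking)
```

A ranking like this only says which known author looks most similar; it cannot tell whether the true author is among the candidates at all, so with so few songs per author any result would need to be treated with caution.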
Assuming that the ပန်းဟေဝန် song was written by one of the authors of other Yodaya songs with known authors, I would need to have the lyrics of those songs in Unicode text, as well as those of the one I would be investigating for authorship. The Rectified National standard version of the text of Maha Gita songs has been downloaded from here.
It contains a total of 35 songs in the Yodaya genre, of which 16 have no author shown. That means trying to identify the author of any one of those songs from the remaining 19 songs with known authorship would be doomed to fail unless one is extremely lucky, I guess. Well, first things first: I could start by converting the song images from the pdf file to Myanmar Unicode text.
Looking for an OCR (Optical Character Recognition) application, I was lucky to find at once the blog post by Hla Hla Htay, “OCR for Myanmar Unicode text”, which shows how Myanmar text images can conveniently be converted to Unicode text simply by opening them with Google Docs.
Here’s how I did the OCR for the ပန်းဟေဝန် song:
- (1) I opened the Maha Gita pdf file with GIMP software (GIMP is open source and very powerful)
- (2) Selected the page containing the ပန်းဟေဝန် song to import; it contained two songs
- (3) Cropped the song I wanted to process and saved it as a png file (maybe you could use the jpeg format as well)
- (4) Uploaded the graphic file to Google Drive
- (5) Opened the uploaded file with Google Docs, and it was done!
You’ll see that the graphic file of the song, when opened in Google Docs, shows the image at the top and the Unicode text below it. Now you can download the file to your computer's hard disk. I saved it in the odt (OpenDocument) format for use with LibreOffice.
As we would expect, the resulting text is not perfect. Some characters came out incorrect and a few characters were missed altogether. In addition to the default 72-dpi resolution of the graphic file, I tried increasing it to 200-dpi and 300-dpi to see if that would improve the OCR. It seems that there wasn’t much improvement, as seen below.
The yellow highlight shows incorrect characters and red highlight shows omissions.
Result for 72-dpi
Result for 200-dpi
Result for 300-dpi
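I marked the highlights by eye. Once a hand-corrected version of the text exists, a rough count of the same two kinds of error could be sketched with Python's standard `difflib` module, as below. This is only an illustration under my own assumptions: the function name `ocr_errors` and the sample strings are hypothetical, and the counts are per Unicode code point, so a single Myanmar syllable made up of several code points may be counted more than once.

```python
from difflib import SequenceMatcher


def ocr_errors(ocr_text, corrected_text):
    """Count wrong, missed and extra characters in OCR output.

    The comparison is made against a hand-corrected reference text.
    """
    counts = {"wrong": 0, "missed": 0, "extra": 0}
    matcher = SequenceMatcher(None, corrected_text, ocr_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_ref, n_ocr = i2 - i1, j2 - j1
        if tag == "replace":
            # Characters that came out as something else (yellow highlight).
            counts["wrong"] += min(n_ref, n_ocr)
            counts["missed"] += max(0, n_ref - n_ocr)
            counts["extra"] += max(0, n_ocr - n_ref)
        elif tag == "delete":
            # Present in the reference, absent from the OCR (red highlight).
            counts["missed"] += n_ref
        elif tag == "insert":
            # Spurious characters that the OCR added.
            counts["extra"] += n_ocr
    return counts


# Placeholder strings (hypothetical): "c" was misread as "X", "e" was dropped.
result = ocr_errors("abXdf", "abcdef")
print(result)
```

Running the same function on the 72-dpi, 200-dpi and 300-dpi outputs against one corrected text would put a number on how little the resolution mattered.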
You’ll see that it is not too much work to correct the errors in the text file. Still, it will be some work for me to convert all 19 “Yodaya” songs with known authors before I can begin my research in stylometry. Maybe you will be inspired to beat me at this game.