Sylvan Stroll: Likely source of problems with text search in pdf files

This is the one-page pdf file containing Myanmar language text I shared for testing text search.

Here I tested searching for a two-character string consisting only of stacked consonants (ပါဌ်ဆင့်). Surprisingly, Acrobat Reader, which performed poorly in my last post, beat the Chrome browser this time.

Searching “မ္မတ္တ” with Chrome: failed

Searching “မ္မတ္တ” with Acrobat Reader: success!
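To see why such a short string is a demanding test, it may help to look at what it is made of. The sketch below uses only base R and assumes the search string is stored as the Unicode sequence shown (written with \u escapes so it does not depend on the editor's encoding): each stacked pair is a consonant, then U+1039 (MYANMAR SIGN VIRAMA), then another consonant.

```r
# The search string from the test, written as Unicode escapes (assumed encoding):
# MA + VIRAMA + MA, then TA + VIRAMA + TA
needle <- "\u1019\u1039\u1019\u1010\u1039\u1010"

# Break the string into its Unicode code points
code_points <- utf8ToInt(needle)

# Show them in the usual U+XXXX notation
print(sprintf("U+%04X", code_points))

# Count how many of the code points are the stacking sign U+1039
n_virama <- sum(code_points == 0x1039)
```

What looks like two characters on screen is a sequence of six code points, so a search tool has to match the underlying sequence rather than the rendered glyphs.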

Note that the Chrome, Microsoft Edge, and Opera browsers generally outperformed Acrobat Reader in Myanmar-language text search in pdf files (see my last post).
Could such erratic performance be due to differences in the text-search capability of each browser and Acrobat Reader? Or was the conversion of Myanmar-language text to pdf format itself the source of the problems?

To get some idea, I took the pdf file in question, converted its contents into text, and tried text searching with regex (regular expressions).

“A regular expression … is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.”

Import data from pdf file into R

library(pdftools)
UPT_pdfTxt <- pdf_text("extractP1_UPT_db_fw.pdf")
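Before searching, it may be worth checking what was imported. pdf_text() returns a character vector with one element per page, with the lines of each page separated by newline characters. The following sketch uses base R only; `page` is a hypothetical stand-in for UPT_pdfTxt so that the snippet runs without the pdf file.

```r
# Hypothetical stand-in for UPT_pdfTxt: one string per page, lines split by "\n"
page <- c("\u1019\u1039\u1019\u1010\u1039\u1010 line one\nline two\n")

# One element per page
n_pages <- length(page)

# TRUE if every page is valid UTF-8
all_valid <- all(validUTF8(page))

# Split the first page into its lines
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
n_lines <- length(page_lines)
```

If validUTF8() returned FALSE here, the extraction step itself would be the first thing to suspect.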

Run text search

I created some search strings that either the browsers or Acrobat Reader couldn’t handle and ran the search.

library(stringr)
# add ".*" in search string to retrieve the entire line from source text
patt <- c(".*အဘိဓမ္မတ္ထသင်္ဂဟဒီပနီ.*",  ".*မ္မတ္ထ.*",  ".*င်.*",  ".*ဍိ.*",  ".*ဏိ.*")
# one list element per pattern; unlist() merges the matches from all pages
m <- lapply(patt, function(p) unlist(str_extract_all(UPT_pdfTxt, p)))

The searches were completed effortlessly.
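As a cross-check that does not depend on the regex engine, the same kind of lookup can be sketched in base R with literal (fixed) matching. This is a swapped-in technique, not the stringr code above; find_lines() and the sample text are hypothetical, introduced only for illustration.

```r
# Return every line that contains `needle` as a literal substring.
# `pages` is a character vector with one string per page, as pdf_text() returns.
find_lines <- function(pages, needle) {
  lines <- unlist(strsplit(pages, "\n", fixed = TRUE))
  grep(needle, lines, fixed = TRUE, value = TRUE)
}

# Hypothetical two-page sample; the first line contains MA + VIRAMA + MA
sample_pages <- c("abc \u1019\u1039\u1019 def\nno match here", "second page")
hits <- find_lines(sample_pages, "\u1019\u1039\u1019")
```

With the real data the call would be something like find_lines(UPT_pdfTxt, "မ္မတ္ထ"); splitting on newlines first gives the same whole-line results that the ".*" wrapper gives in the regex version.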

Write the results to a text file

# Replace each ".*" with a single quote so the search string appears quoted in the output
patt.1 <- gsub(".*", "'", patt, fixed = TRUE)
q <- list()
for (i in 1:5){
  q[[i]] <- c(paste0("Search string = ", patt.1[i], collapse = ""), "Results:", m[[i]], "\n")
}
# Write the results to a text file
writeLines(unlist(q), "q.txt", useBytes = TRUE)
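To confirm that the Myanmar text survives the write, the file can be read back and compared with what was written. This is a minimal sketch, assuming the text is UTF-8; it writes to a temporary file rather than "q.txt", and `out` is a hypothetical stand-in for unlist(q).

```r
# Hypothetical stand-in for unlist(q)
out <- c("Search string = '\u1019\u1039\u1019'",
         "Results:",
         "\u1019\u1039\u1019\u1010\u1039\u1010")

tmp <- tempfile(fileext = ".txt")

# Write the UTF-8 bytes unchanged, as in the code above
writeLines(out, tmp, useBytes = TRUE)

# Read the lines back and mark them as UTF-8
back <- readLines(tmp, encoding = "UTF-8")

# TRUE if what was read matches what was written
round_trip_ok <- identical(enc2utf8(back), enc2utf8(out))
unlink(tmp)
```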

Results shown in the “q.txt” file

Conclusion

The source text used for this exercise was imported into R from the same pdf file used for testing text search with the browsers and Adobe Acrobat Reader. The regex searches on that text returned results without any error. Therefore, we can rule out problems in the conversion of Myanmar text into pdf format as the source of the erratic performance in Myanmar text search.

So we are left with the particular implementation of the text search function in each browser or in Acrobat Reader as the likely source of the problem, among others.

Friday, July 8, 2022
