Reading Japanese text from an image is very easy. Using this image as an example, it is simple to use an OCR utility to read the text. Install the tesseract utility.
[root@localhost Pictures]# dnf in tesseract |
Then download the Japanese language data from GitHub.
And put the files into the /usr/share/tesseract/tessdata/ directory.
[root@localhost Pictures]# ls -hula /usr/share/tesseract/tessdata/ total 41M drwxr-xr-x. 4 root root 129 Aug 19 08:00 . drwxr-xr-x. 3 root root 22 Aug 19 07:59 .. drwxr-xr-x. 2 root root 4.0K Apr 1 2022 configs -rw-r--r--. 1 root root 4.0M Aug 19 07:33 eng.traineddata -rw-r--r--. 1 jcartwright jcartwright 35M Aug 19 08:00 jpn.traineddata -rw-r--r--. 1 jcartwright jcartwright 2.9M Aug 19 07:59 jpn_vert.traineddata -rw-r--r--. 1 root root 572 Dec 27 2019 pdf.ttf drwxr-xr-x. 2 root root 98 Apr 1 2022 tessconfigs |
Then we are all set to try this out. This works quite well, to be honest.
(jcartwright@localhost) 192.168.1.5 Pictures $ tesseract japaneseadsammydavisjrsuntorywhiskywhitealksdf_465_683_int.jpg stdout -l jpn --dpi 150 Detected 7 diacritics 選 ぶ ウ イ ス キ ー で 、 男 が 分 か る 。 ゥ サ ン ト ソ ー ホ ワ イ f ト |
This works very well to find and print the correct Japanese characters. Even on this image, it worked very well.
This is a great example of the usage of the Linux command line to solve interesting problems.