Posted: . At: 8:44 AM. This was 9 months ago. Post ID: 18387






How to use simple OCR to read Japanese text from an image.


Reading Japanese text from an image is easy with an OCR utility such as tesseract. This post uses a Japanese Suntory whisky advertisement as the example image. First, install the tesseract utility.

[root@localhost Pictures]# dnf in tesseract

Then download the Japanese language data files, jpn.traineddata and jpn_vert.traineddata, from GitHub, and copy them into the /usr/share/tesseract/tessdata/ directory.
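The download and copy steps can be scripted. This is only a sketch: the post does not say which GitHub repository the files came from, so the tesseract-ocr/tessdata repository and its main branch are assumptions here, and lang_url, fetch_lang and install_lang are hypothetical helper names.

```shell
# Assumption: language data is taken from the tesseract-ocr/tessdata repository (main branch).
TESSDATA_URL="https://github.com/tesseract-ocr/tessdata/raw/main"
# Directory used in this post; can be overridden from the environment.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/share/tesseract/tessdata}"

# Print the download URL for a language code such as jpn or jpn_vert.
lang_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_URL" "$1"
}

# Download one language file into the current directory.
# -f fails on HTTP errors, -L follows GitHub's redirects.
fetch_lang() {
    curl -fL -o "$1.traineddata" "$(lang_url "$1")"
}

# Copy a downloaded language file into the tessdata directory, world-readable.
# Needs root when the target is under /usr/share.
install_lang() {
    install -m 644 "$1.traineddata" "$TESSDATA_DIR/"
}

# Example usage (not run automatically):
#   fetch_lang jpn && fetch_lang jpn_vert
#   sudo sh -c '. ./this_script.sh; install_lang jpn; install_lang jpn_vert'
```

Using install rather than cp sets the file mode in one step, so the files are readable by every user no matter who downloaded them.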

[root@localhost Pictures]# ls -hula /usr/share/tesseract/tessdata/
total 41M
drwxr-xr-x. 4 root        root         129 Aug 19 08:00 .
drwxr-xr-x. 3 root        root          22 Aug 19 07:59 ..
drwxr-xr-x. 2 root        root        4.0K Apr  1  2022 configs
-rw-r--r--. 1 root        root        4.0M Aug 19 07:33 eng.traineddata
-rw-r--r--. 1 jcartwright jcartwright  35M Aug 19 08:00 jpn.traineddata
-rw-r--r--. 1 jcartwright jcartwright 2.9M Aug 19 07:59 jpn_vert.traineddata
-rw-r--r--. 1 root        root         572 Dec 27  2019 pdf.ttf
drwxr-xr-x. 2 root        root          98 Apr  1  2022 tessconfigs
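Listing the directory shows that the files are there, but it does not prove that tesseract picks them up. A small check, assuming the installed tesseract supports the --list-langs option, could look like this. The has_lang helper is a hypothetical name, not part of tesseract.

```shell
# Succeed if the language code in $1 appears as a whole line on stdin.
# -x matches the entire line, so "jpn" does not match "jpn_vert".
has_lang() {
    grep -qx "$1"
}

# Example usage (not run automatically); 2>&1 is there because some older
# tesseract releases print the language list on stderr:
#   tesseract --list-langs 2>&1 | has_lang jpn && echo "jpn is available"
```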

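With the language data in place, run tesseract on the image. The -l jpn option selects the Japanese language data, and stdout prints the recognized text to the terminal instead of writing it to a file.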

(jcartwright@localhost) 192.168.1.5 Pictures  $ tesseract japaneseadsammydavisjrsuntorywhiskywhitealksdf_465_683_int.jpg stdout -l jpn --dpi 150
Detected 7 diacritics
 
 
 
選 ぶ ウ イ ス キ ー で 、 男 が 分 か る 。
 
 
 
ゥ サ ン ト ソ ー ホ ワ イ f ト

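Tesseract finds and prints most of the Japanese characters correctly. The first line, 選ぶウイスキーで、男が分かる。, comes out as a coherent sentence. The second line should read サントリーホワイト (Suntory White), but リ was misread as ソ and stray ゥ and f characters crept in. Even so, this is a good result for an advertisement image.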
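The output above has a space between every character and several blank lines, which makes it awkward to paste elsewhere. One way to tidy it is to pipe the text through sed. This is a minimal sketch: strip_ocr_spacing is a hypothetical helper name, and it removes every ASCII space, so it only suits text with no Latin words that need their spaces.

```shell
# Remove all ASCII spaces, then drop lines that are left empty.
# UTF-8 multibyte characters never contain the space byte, so Japanese text is untouched.
strip_ocr_spacing() {
    sed -e 's/ //g' -e '/^$/d'
}

# Example usage (not run automatically):
#   tesseract image.jpg stdout -l jpn --dpi 150 | strip_ocr_spacing
```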

This is a good example of using the Linux command line to solve an interesting problem.

