
Chinese Character Recognition

Python / EasyOCR / Machine Learning

PROJECT

Overview


Chinese is one of the most widely spoken languages in the world, with over 1 billion native speakers, and it has branched into many regional dialects within the Sino-Tibetan language family. With well over 3000 characters in everyday use, digitizing the language is an immense challenge. Various OCR tools are currently available on the market, but they are unreliable and inconsistent. The structure of Chinese differs from that of Latin-derived languages, which makes it difficult to apply the same principles when building OCR for it.

We ran our algorithm mainly on the MNIST data set of digits and on the CASIA data set, which covers over 7000 characters with 50+ samples per character. The samples consisted mainly of handwritten brush strokes of Chinese characters. Due to time and compute constraints, we had to work with a randomly picked subset of the CASIA data set.

So far, our algorithm's OCR accuracy is at an acceptable rate only for digitally generated images, and such images are available only to a limited extent in real-world settings. General real-life images currently reach decent accuracy, but pushing it to an acceptable level requires more thorough data cleaning and pre-processing. At this point the pipeline works only on images in JPEG format.
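The random subset selection mentioned above could be sketched roughly as follows. This is only an illustration, not the project's actual code: the function name `sample_subset`, its parameters, and the toy labels are all assumptions, and loading the real CASIA files is left out.

```python
import random
from collections import defaultdict


def sample_subset(labels, num_classes, per_class, seed=0):
    """Pick random character classes and a capped number of samples from each.

    `labels` is a list where labels[i] is the character shown in sample i.
    Returns a dict mapping each chosen character to a list of sample indices.
    (Hypothetical helper, not taken from the project.)
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group sample indices by the character they depict.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Choose which characters to keep, then which samples of each to keep.
    chosen = rng.sample(sorted(by_class), k=min(num_classes, len(by_class)))
    return {
        label: rng.sample(by_class[label], k=min(per_class, len(by_class[label])))
        for label in chosen
    }


# Toy stand-in for a labelled data set: 4 characters with 50 samples each.
labels = [ch for ch in "永和山水" for _ in range(50)]
subset = sample_subset(labels, num_classes=2, per_class=10)
```

Sampling per character, rather than over the whole data set at once, keeps the number of samples per class balanced in the reduced set.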

Tech Stack

EasyOCR
Python
Matplotlib
OpenCV
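
As a rough sketch of how the tools listed above might fit together, the snippet below reads a JPEG image with OpenCV, applies light pre-processing, and passes it to EasyOCR. The helper names `is_jpeg` and `read_chinese_text`, the `min_confidence` threshold, and the specific pre-processing steps (grayscale conversion and a small Gaussian blur) are assumptions made for illustration, not the project's actual pipeline.

```python
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}


def is_jpeg(path):
    """Return True if the file name has a JPEG extension (case-insensitive)."""
    return Path(path).suffix.lower() in JPEG_SUFFIXES


def read_chinese_text(path, min_confidence=0.3):
    """Run OCR on a JPEG image and return a list of (text, confidence) pairs.

    Hypothetical helper: the pre-processing and threshold are illustrative.
    """
    if not is_jpeg(path):
        raise ValueError(f"expected a JPEG image, got: {path}")

    # Imported here so the lightweight helpers above work without these
    # heavy dependencies installed.
    import cv2
    import easyocr

    image = cv2.imread(str(path))
    if image is None:
        raise FileNotFoundError(f"could not read image: {path}")

    # Assumed pre-processing: grayscale plus a small blur to reduce noise.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (3, 3), 0)

    # 'ch_sim' selects Simplified Chinese; 'en' is added for mixed text.
    reader = easyocr.Reader(["ch_sim", "en"], gpu=False)
    results = reader.readtext(gray)  # list of (bounding box, text, confidence)

    return [(text, conf) for _bbox, text, conf in results if conf >= min_confidence]
```

A call such as `read_chinese_text("sample.jpg")`, where `sample.jpg` is a placeholder file name, would return the recognized strings with their confidence scores. Creating the `Reader` downloads the recognition models on first use, so in practice it would be built once and reused across images.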
