EasyOCR Manual


EasyOCR is an OCR component implemented in Java and built on the open source Tesseract-OCR engine. With a few simple API calls, Java code can invoke the Tesseract-OCR engine to recognize the content of images.

**EasyOCR's main feature is integrated, automatic recognition of some common CAPTCHA types: it cleans up the CAPTCHA image and recognizes its content in one step.**

CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart

**Features:**

  • Integrates image cleanup and recognition in dedicated code; supports three kinds of processing (image cleanup, deformation, and rotation) and has built-in cleanup options for many common CAPTCHA types
  • Minimal API: one method, one line of code
  • Because it is based on the Tesseract-OCR engine, every language the engine supports can be recognized
  • Supports plugins: you can write extensions that add integrated cleanup and recognition for other CAPTCHA types on top of EasyOCR

**Character Recognition Description:**

Tesseract-OCR is a relatively accurate, free, open source OCR engine. Character recognition is nonetheless a complex task, and results are often less than ideal for small text, unclear fonts, paragraph text with tight spacing, and scene text. This is a common problem for OCR engines; in this respect Tesseract-OCR is not inferior to top commercial engines such as ABBYY. However, if the image is pre-processed and optimized before recognition, using methods suited to the specific text (image cleanup, spacing adjustment, aspect ratio adjustment, and so on), the recognition rate can be greatly improved.
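To make the idea of pre-processing concrete, here is a minimal sketch of two common steps: turning an image into pure black and white, and stretching it by separate width and height ratios. The class `PreprocessSketch` and its methods are hypothetical names used only for illustration; they are not part of EasyOCR, and the cleanup that EasyOCR actually performs may differ.

```java
import java.awt.image.BufferedImage;

/** Hypothetical pre-processing helpers for illustration; NOT part of the EasyOCR API. */
class PreprocessSketch {

    /** Makes each pixel black if its brightness (0-255) is below the threshold, otherwise white. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Weighted brightness; the weights are the usual luma approximation.
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                out.setRGB(x, y, gray < threshold ? 0x000000 : 0xFFFFFF);
            }
        }
        return out;
    }

    /** Stretches the image by independent width and height ratios (nearest-neighbor sampling). */
    static BufferedImage scale(BufferedImage src, double widthRatio, double heightRatio) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * widthRatio));
        int h = Math.max(1, (int) Math.round(src.getHeight() * heightRatio));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            int sy = Math.min(src.getHeight() - 1, (int) (y / heightRatio));
            for (int x = 0; x < w; x++) {
                int sx = Math.min(src.getWidth() - 1, (int) (x / widthRatio));
                out.setRGB(x, y, src.getRGB(sx, sy));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Synthetic 100x40 image standing in for a real CAPTCHA.
        BufferedImage image = new BufferedImage(100, 40, BufferedImage.TYPE_INT_RGB);
        BufferedImage cleaned = scale(binarize(image, 128), 1.6, 0.7);
        System.out.println(cleaned.getWidth() + " x " + cleaned.getHeight());
    }
}
```

The threshold of 128 and the ratios 1.6 and 0.7 are arbitrary illustration values; suitable values depend on the images being recognized.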

Steps to use EasyOCR:

  1. Download and install [Tesseract-OCR (project home)](https://code.google.com/p/tesseract-ocr/ "Tesseract-OCR Homepage") on the server. Add the Tesseract-OCR executable directory to the PATH environment variable (optional, but recommended).

  2. Add easyocr-3.0.4-RELEASE.jar

  • Maven
<dependency>
	<groupId>cn.easyproject</groupId>
	<artifactId>easyocr</artifactId>
	<version>3.0.4-RELEASE</version>
</dependency>
  3. Call the API
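For the first step, it can be useful to confirm that the Tesseract executable is reachable through PATH before calling the API. The following is a minimal sketch under the assumption that the executable is named `tesseract` (or `tesseract.exe` on Windows); the class `TesseractPathCheck` and the method `findOnPath` are hypothetical helpers, not part of EasyOCR.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper for illustration; NOT part of the EasyOCR API. */
class TesseractPathCheck {

    /** Looks for a file called name (or name.exe) in each directory listed in pathValue. */
    static Optional<Path> findOnPath(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {name, name + ".exe"}) {
                try {
                    Path p = Paths.get(dir, candidate);
                    if (Files.isRegularFile(p)) {
                        return Optional.of(p);
                    }
                } catch (InvalidPathException ignored) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> found = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(found.map(p -> "Found: " + p).orElse("tesseract was not found on PATH"));
    }
}
```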

EasyOCR API:

EasyOCR has two main built-in API classes:

  1. ImageClean: the CAPTCHA cleanup class, which cleans various kinds of CAPTCHA images and outputs the result. It supports three kinds of processing: image cleanup (several predefined cleanup modes are built in and can be switched flexibly), deformation, and rotation. These can be applied together to improve the text recognition rate.

  2. EasyOCR: the core class for recognizing text in images, which makes the call to the OCR engine. Internally it uses ImageClean to integrate automatic cleanup and recognition.
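As a rough illustration of what "making the call to the OCR engine" can look like from Java, the sketch below runs the Tesseract command line in the form `tesseract <image> <outputbase> -l <lang>` and reads the resulting `<outputbase>.txt` file. This is an assumption about how such a wrapper could work, not EasyOCR's actual implementation; the class `TesseractCliSketch` and its methods are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper for illustration; NOT EasyOCR's actual implementation. */
class TesseractCliSketch {

    /** Builds: exe image outputBase [-l lang]. */
    static List<String> buildCommand(String exe, String image, String outputBase, String lang) {
        List<String> cmd = new ArrayList<>();
        cmd.add(exe);
        cmd.add(image);
        cmd.add(outputBase);
        if (lang != null && !lang.isEmpty()) {
            cmd.add("-l");
            cmd.add(lang);
        }
        return cmd;
    }

    /** Runs tesseract (assumed to be on PATH) and returns the recognized text. */
    static String recognize(String imagePath, String lang) throws IOException, InterruptedException {
        Path outputBase = Files.createTempFile("ocr-", "");
        // Tesseract appends ".txt" to the output base name.
        Path textFile = Paths.get(outputBase.toString() + ".txt");
        try {
            Process process = new ProcessBuilder(buildCommand("tesseract", imagePath, outputBase.toString(), lang))
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8).trim();
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            System.out.println(recognize(args[0], "eng"));
        } else {
            // Without an image argument, only show the command that would be run.
            System.out.println(buildCommand("tesseract", "image.png", "out", "eng"));
        }
    }
}
```

Keeping the command construction in its own method makes it possible to check the arguments without having Tesseract installed.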

EasyOCR Examples:

demo_eurotext.png

img_INTERFERENCE_LINE.png

img_NORMAL.jpg

EasyOCR e = new EasyOCR();
// Recognize the text in an ordinary image directly
System.out.println(e.discern("images/demo_eurotext.png"));
// Recognize a CAPTCHA with interference lines: the image is cleaned automatically, then recognized
System.out.println(e.discernAndAutoCleanImage("images/img_INTERFERENCE_LINE.png", ImageType.CAPTCHA_INTERFERENCE_LINE));
// Recognize a CAPTCHA with general cleanup plus deformation (width ratio 1.6, height ratio 0.7)
System.out.println(e.discernAndAutoCleanImage("images/img_NORMAL.jpg", ImageType.CAPTCHA_NORMAL, 1.6, 0.7));
		

Tip: Suitably deforming a CAPTCHA image helps improve the recognition rate. In special cases where the ratio needs tuning, you can try multiple ratios and compare the results to find the right one.

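// Try width ratios 0.8-2.0 and height ratios 0.8-2.8 in steps of 0.1.
// Integer counters avoid accumulated floating-point error; each result is printed with its ratios.
for (int w = 8; w <= 20; w++) {
	for (int h = 8; h <= 28; h++) {
		double imageWidthRatio = w / 10.0;
		double imageHeightRatio = h / 10.0;
		System.out.println(imageWidthRatio + " x " + imageHeightRatio + ": "
				+ e.discernAndAutoCleanImage("images/d.jpg", ImageType.CAPTCHA_NORMAL, imageWidthRatio, imageHeightRatio));
	}
}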
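The loop above produces a long list of candidate results to inspect by eye. One possible way to summarize them, shown here only as a heuristic sketch, is to count how often each non-empty result appears over the same ratio grid and keep the most frequent one. The class `RatioVoteSketch` and the method `mostFrequentResult` are hypothetical; the recognizer is passed in as a function so the sketch runs without Tesseract installed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical helper for illustration; NOT part of the EasyOCR API. */
class RatioVoteSketch {

    /**
     * Calls the recognizer for width ratios 0.8-2.0 and height ratios 0.8-2.8 (step 0.1)
     * and returns the most frequent non-empty result, or null if there is none.
     */
    static String mostFrequentResult(BiFunction<Double, Double, String> recognizer) {
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (int w = 8; w <= 20; w++) {
            for (int h = 8; h <= 28; h++) {
                String text = recognizer.apply(w / 10.0, h / 10.0);
                if (text == null) {
                    continue;
                }
                text = text.trim();
                if (!text.isEmpty()) {
                    votes.merge(text, 1, Integer::sum);
                }
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            // Strict comparison keeps the earliest-seen result on ties.
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in recognizer: pretends that narrow images are misread.
        BiFunction<Double, Double, String> fake = (w, h) -> w >= 1.0 ? "X7K9" : "X1K9";
        System.out.println(mostFrequentResult(fake));
    }
}
```

With EasyOCR, the recognizer argument could be `(w, h) -> e.discernAndAutoCleanImage("images/d.jpg", ImageType.CAPTCHA_NORMAL, w, h)`. The most frequent result is not guaranteed to be the correct one, so it is still worth checking it against a few known samples.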

EasyOCR Chinese Identification Example:

Tesseract recognizes English by default; set the tesseractOptions property to change the recognition language.

EasyOCR e = new EasyOCR();
// Set the command-line options so that Simplified Chinese is recognized (the default is English)
e.setTesseractOptions(EasyOCR.OPTION_LANG_CHI_SIM);

System.out.println(e.discern("C:\\novel.png"));
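Recognizing a non-English language requires the matching Tesseract language data file, named `<lang>.traineddata` (for Simplified Chinese, `chi_sim.traineddata`), in Tesseract's `tessdata` directory. The sketch below checks for that file before recognition is attempted. The class `LanguageDataCheck` is a hypothetical helper, and the location of the `tessdata` directory depends on your installation, so it is taken as a command-line argument.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; NOT part of the EasyOCR API. */
class LanguageDataCheck {

    /** Returns true if tessdataDir contains the data file for the given language code. */
    static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(tessdataDir.resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.out.println("usage: java LanguageDataCheck <path-to-tessdata>");
            return;
        }
        Path tessdata = Paths.get(args[0]);
        System.out.println(hasLanguage(tessdata, "chi_sim")
                ? "chi_sim language data found"
                : "chi_sim.traineddata is missing from " + tessdata);
    }
}
```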

End

Comments

If you have more comments, suggestions or ideas, please contact me.

Email:[email protected]

http://www.easyproject.cn