OCRing & Identifying page structure — Tesseract + hOCR
Extracting text from images with the help of OCR engines is more fun than it sounds. The input images can be tilted, contain broken text, or have thick lines around the text, making it difficult for our systems to identify the correct characters. Sometimes we also need to consider the page structure and extract only specific sections of text. So if your OCR engine just randomly extracts the text from images, even if it is accurate, it will be of little use. In such cases, you need the text to be extracted in a systematic manner that retains the page structure.
After trying different OCR engines (Tesseract, Cuneiform), tweaking them, and experimenting with different image processing techniques, I have put my learnings here in the hope that someone can benefit from them.
Let’s begin the journey…
Tesseract 4
Tesseract is an open-source OCR engine developed by Google since 2006. The latest stable version is Tesseract 4, which is LSTM-based.
To recognise an image containing a single character, we typically use a Convolutional Neural Network (CNN). Text of arbitrary length, however, is a sequence of characters, and such problems are solved using Recurrent Neural Networks (RNNs), of which LSTM is a popular form. Read more about RNNs and LSTMs here.
hOCR
hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes and style information. hOCR is a kind of XML file. More on hOCR here.
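To give an idea of what this looks like, here is a minimal, hand-trimmed hOCR fragment (the ids and coordinates are made up for illustration). Note the title attribute carrying the bbox coordinates; we will parse it in step 4 below:
<div class='ocr_page' title='bbox 0 0 2479 3508'>
 <p class='ocr_par'>
  <span class='ocr_line' id='line_1_1' title='bbox 36 92 619 116; baseline 0 -3'>
   <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>U.S.</span>
   <span class='ocrx_word' title='bbox 108 92 210 116; x_wconf 93'>PATENT</span>
  </span>
 </p>
</div>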
Input
This is one of the PDF files I am trying to OCR. In the end, I want to extract all the US Patents, US Publications, Foreign References and Non-Patent Literature (NPL) documents. For foreign references, I also want to extract the country code.
Steps
1. Converting the PDF to images: I have used the convert_from_path function of the pdf2image library of Python, as shown in the sketch below.
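A minimal sketch of this step (the file names are placeholders, and the dpi value is an assumption; higher values generally help OCR accuracy):
from pdf2image import convert_from_path

# Convert each page of the PDF into a PIL image and save it as a JPEG
pages = convert_from_path("input.pdf", dpi=300)
for i, page in enumerate(pages):
    page.save("page_{}.jpg".format(i), "JPEG")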
2. Image processing to remove thick blocks or lines from the images: I used the convert program, which is a member of the ImageMagick suite of tools. ImageMagick is open source and comes built in with most Linux distributions. (Source)
convert input.jpg \
-type Grayscale \
-negate \
-define morphology:compose=darken \
-morphology Thinning 'Rectangle:1x30+0+0<' \
-negate \
converted_image.jpg
Explanation
- convert input.jpg: load the picture.
- -type Grayscale: make sure ImageMagick knows it’s a grayscale image.
- -negate: invert image color layers. Lines and characters will be white and background black.
- -define morphology:compose=darken: define that areas identified by morphology will be darkened.
- -morphology Thinning 'Rectangle:1x30+0+0<': define a 1px-by-30px rectangular kernel that will be used to identify the line shapes. Only if this kernel fits inside a white shape (remember we negated the colors) this big or bigger will the shape be darkened. The < flag allows the kernel to rotate.
- -negate: Invert colors a second time. Now characters will be black again, and background will be white.
- converted_image.jpg: the output file to be generated.
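Since the rest of the pipeline is in Python, you may prefer to invoke the same command from your script. Here is a minimal sketch using the standard subprocess module (this assumes the convert binary is on your PATH; the flags are exactly the ones explained above):
import subprocess

# Run the ImageMagick command shown above from Python
subprocess.run([
    "convert", "input.jpg",
    "-type", "Grayscale",
    "-negate",
    "-define", "morphology:compose=darken",
    "-morphology", "Thinning", "Rectangle:1x30+0+0<",
    "-negate",
    "converted_image.jpg",
], check=True)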
Here are the input and converted images:
3. OCR using Tesseract and getting the output as hOCR: I have used the pytesseract module of Python.
from pytesseract import pytesseract

pytesseract.run_tesseract("converted_image", "output_hocr", extension='jpg',
                          lang=None, config="--psm 4 -c tessedit_create_hocr=1")

# "converted_image" is the output of the convert utility and the input to tesseract
# "output_hocr" is the name of the output hOCR file. The output file will have
# the .hocr extension: "output_hocr.hocr"
4. Parsing the hOCR file to get all the lines:
hOCR gives us the text of paragraphs, lines and words, along with the coordinates of their bounding boxes. We will parse our hOCR file (which is basically XML) using Beautiful Soup and lxml.
import bs4

xml_input = open("output_hocr.hocr", "r", encoding="utf-8")
soup = bs4.BeautifulSoup(xml_input, 'lxml')
ocr_lines = soup.findAll("span", {"class": "ocr_line"})

# We will save the coordinates of each line and the text it contains in the lines_structure list
lines_structure = []
for line in ocr_lines:
    line_text = line.text.replace("\n", " ").strip()
    title = line['title']
    # The coordinates of the bounding box, e.g. title = "bbox 36 92 619 116; ..."
    x1, y1, x2, y2 = map(int, title[5:title.find(";")].split())
    lines_structure.append({"x1": x1, "y1": y1, "x2": x2, "y2": y2, "text": line_text})
5. Identifying different sections using page structure:
To identify the different sections of text on the page, you can leverage the coordinates of the words, lines and paragraphs provided in the hOCR file. In my case, I needed to group the text on the same line, so the words will be on the same horizontal axis, i.e., their y-coordinates will be equal or very close to each other.
The same goes for paragraph text. Sometimes the same paragraph is broken in two and the system treats the parts as different entities. The geometry of the page is then analysed to club such lines together, as in the sketch below.
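Here is a minimal sketch of that grouping (not my exact code): it merges entries from the lines_structure list of step 4 whose y-coordinates lie within a tolerance. The tolerance value is an assumption you will want to tune for your scan resolution:
Y_TOLERANCE = 10  # pixels; an assumed value, tune for your scans

# Sort by vertical position, then merge entries that sit on (almost) the same row
lines_structure.sort(key=lambda l: (l["y1"], l["x1"]))
merged_rows = []
for line in lines_structure:
    if merged_rows and abs(line["y1"] - merged_rows[-1]["y1"]) <= Y_TOLERANCE:
        merged_rows[-1]["text"] += " " + line["text"]  # same visual row: append text
        merged_rows[-1]["x2"] = max(merged_rows[-1]["x2"], line["x2"])  # widen the bbox
    else:
        merged_rows.append(dict(line))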
Last but not least…
When using tesseract, you can try out different OCR engine modes (OEM) and page segmentation modes (PSM). You can get the list of the different OEMs and PSMs with the following commands:
tesseract --help-oem
tesseract --help-psm
In my case, psm 4 worked best and I left the OEM at its default.
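Both modes go into the same config string passed to tesseract. For example, a hypothetical combination forcing the LSTM engine (--oem 1) together with psm 4 would look like this:
from pytesseract import pytesseract

# --oem 1 selects the LSTM engine; --psm 4 assumes a single column of text of variable sizes
pytesseract.run_tesseract("converted_image", "output_hocr", extension='jpg',
                          lang=None, config="--oem 1 --psm 4 -c tessedit_create_hocr=1")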
For tesseract, you may also need to preprocess your image and apply various image processing techniques for better accuracy. I'll write a separate article on this.
This was all from my side. Feel free to share your thoughts/feedback in the responses section below. If you found this article helpful and interesting, don't hesitate to hit the clap button 👏.