Drishti: Goggles for Indian Languages


Drishti is a real-time, multi-language snapshot translation, speech, and search mobile application whose architecture uses the best available components to achieve efficient extraction and processing of text from captured images, keeping in mind the storage and computational limitations of mobile devices.

The work involves a study of existing desktop and mobile OCR applications for Indian and foreign languages, followed by a discussion of the components of Drishti: 1) OCR, 2) text correction, 3) text translation, and 4) speech synthesis.

A thorough analysis of various free OCR engines, namely Tesseract, OCRopus, and Ocrad, was carried out using the UNLV-ISRI performance toolkit as well as a variety of test images, and the Tesseract engine was found to have the highest accuracy. The Leptonica image-processing library is used to enhance and smoothen the input image and make it Tesseract-recognisable. Since the string returned from OCR often contained single mis-recognized letters, a dictionary of valid words is used to identify the nearest neighbour of each word using the Levenshtein distance algorithm. Google was found to be the best text translator for foreign as well as Indian languages in a comparison with translators such as Babelfish, Bing, dictionary.com, WorldLingo, and Free Translator. Android TTS is used as the speech synthesizer for English, Spanish, German, and French, but for regional languages an HTTP POST request is sent to Google's servers, since Android TTS is still limited to a few languages. The proposal and implementation of the architecture are followed by testing and measurement of its efficiency. Various use cases for the application, performance factors, and future improvements are also discussed.



Figure 5.1: Architecture of the Drishti application

The above diagram shows the architecture of Drishti and its working. The image is captured using the high-resolution camera of a smartphone. Care is taken that the image is brightly lit and is not captured at an angle, which would make it a skewed or perspective image. It has been found that brightness and colour do not hamper OCR results as much as geometry does. We make the following assumptions while capturing pictures:

1) The application will recognize only a few commonly used font types.

2) The text should be sufficiently large and thick.

3) The text should be evenly illuminated.

4) The text should be roughly horizontal; a significant skew angle is undesirable.

After the picture is taken, the image is preprocessed to enhance OCR results: YUV enhancement is applied to remove the extra white spaces present. The next step is to convert the image into a Tesseract-detectable format, which is done with the Leptonica image-processing library. The training data of the required language is then downloaded. The OCR engine uses this training data to analyse the text: first, individual characters are determined, which are joined together to form words and hence sentences, while page analysis is done using the tab-stop detection algorithm. The extracted text is post-processed to improve accuracy; this process is similar to a spell checker. The corrected text is sent to Google's servers for translation, and the translated text is spoken out loud to aid tourists and people with low eyesight. The extracted or translated text can also be used for targeted search. Various detailed and useful use cases are discussed in the next sections.
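The spell-checker-like post-processing step, which snaps each OCR'ed word to its nearest dictionary neighbour by Levenshtein distance, can be sketched as follows. This is a minimal Python sketch (Drishti itself is an Android application); the word list and the distance threshold are illustrative assumptions, not values taken from the report.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def correct(word: str, dictionary: list, max_dist: int = 2) -> str:
    """Replace an OCR'ed word by its nearest dictionary neighbour,
    but only if that neighbour is within max_dist edits."""
    best = min(dictionary, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_dist else word

# Example: a single mis-recognized letter ('hcllo') is snapped back.
words = ["hello", "world", "help"]
print(correct("hcllo", words))  # -> hello
```

Scanning the whole dictionary per word is O(N) distance computations; for a mobile device a trie or BK-tree over the dictionary would reduce that cost considerably.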


The user interface of the application is designed to be user-friendly. The following section shows the interface and the functions of the various components visible in it.

When the application starts, the user sees the following interface:


Figure 5.2: First screen / capture activity

The square rectangle is an adjustable focus box, which is positioned over the text to be extracted so that only useful text, and no junk, is captured in the picture.
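Conceptually, the focus box simply restricts OCR to a rectangular sub-region of the captured frame. A toy Python sketch of that crop, treating the image as a row-major 2D array of pixel values (the actual application crops the camera frame through Android APIs):

```python
def crop(pixels, left, top, width, height):
    """Return the sub-rectangle of a row-major 2D pixel array
    corresponding to the user's focus box."""
    return [row[left:left + width] for row in pixels[top:top + height]]

# A 4x4 toy "image"; crop the central 2x2 region.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(crop(image, 1, 1, 2, 2))  # -> [[5, 6], [9, 10]]
```

Restricting OCR to this region both speeds up recognition and avoids feeding surrounding clutter to the engine.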

The camera button is pressed when we are sure that the text has been focused properly and we want to perform the OCR process.

The pencil button is pressed if the user does not want to enter data through an OCR scan but wishes to type it manually for translation and speech.

When OCR is successful, the extracted text is sent as an intent to the translation-speech activity, as shown in the figure below:


Figure 5.3: Translation activity

The first activity performs OCR on an image to obtain the word "hello" in English, which is translated to Hindi as "Namaste" and spoken out in a Hindi voice. Another example of English-to-Hindi OCR translation is shown in the second figure.

The radio buttons select whether the search will be targeted on the original text or the translated text.

The 12 buttons shown provide the methods for targeted search of the text extracted through the snapshot. They are:


1) Text to Speech

Text is converted into speech using the speech-synthesis engine provided by the Android framework for languages such as German, Spanish, and English. For regional languages there is no Android TTS support, so we use the Google Translate API to synthesize speech for the text sent. The results of this operation were satisfactory.
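The regional-language fallback amounts to assembling a request to Google's translate speech endpoint with the text and a language code. A minimal Python sketch of assembling the query string; note that `translate_tts` and its parameter names (`tl`, `q`) are assumptions based on the unofficial, undocumented interface of that era, and the report itself sends the request as an HTTP POST from the app:

```python
from urllib.parse import urlencode

def tts_request_url(text: str, lang: str) -> str:
    """Build a request URL for Google's speech endpoint.
    NOTE: endpoint and parameter names are an unofficial,
    undocumented interface and may change; shown only as a sketch."""
    base = "https://translate.google.com/translate_tts"
    return base + "?" + urlencode({"tl": lang, "q": text})

print(tts_request_url("नमस्ते", "hi"))
```

The response, when the request succeeds, is an MP3 stream that the app plays back through Android's media player.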

2) Word Meaning

For a given word input, the word meaning is found on dictionary.com for foreign languages. For regional languages, shabdkosh.com is used to provide the meaning of Hindi words. The accuracy of shabdkosh was found to be quite good.


3) News

Suppose you are reading a news item and want to read more articles related to it: all you have to do is take a picture of the news so that OCR is performed on it to extract the heading/keyword, and then press the news button to retrieve news articles to date.

4) Google Maps

If the user wants to look up a location and has its address printed, all he has to do is perform OCR on the text and hit the Google Maps button to get directions to the required location.


5) Web Search

A general web search of any kind of printed text is done by pressing this button.

6) Google Images

Images are a smarter, more efficient, and faster way of finding things on the net. This allows the user to search for images related to the text they extracted.

7) Flickr Images

Searches through Flickr data to provide the user with a list of pictures tagged with the extracted text.


8) LinkedIn

Have a name and want to know more details about the person concerned? Just take a picture of the name and press the button to view the LinkedIn professional profile of anyone the user desires.


9) Share

This is a share intent, which seamlessly uploads the captured and recognized content to the cloud for later use.


10) Mail / Message

If you have a long printed text that needs to be sent via mail or message, OCR can be performed on it rather than attaching an image; this is a highly useful feature for students exchanging notes.

11) Search Videos

Searches YouTube videos related to the text extracted through OCR.

12) Search Books

If you are in a bookstore and want to find out more about a particular book, all you have to do is take a picture of the title and perform a targeted search by pressing the books icon.
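Internally, each targeted-search button can be thought of as filling the extracted (or translated) text into a search-URL template and firing a view intent with the result. A Python sketch of such a dispatch table; the concrete URL patterns here are illustrative assumptions, since the actual app dispatches Android intents:

```python
from urllib.parse import quote_plus

# Illustrative templates for a few of the 12 targeted searches.
SEARCH_TEMPLATES = {
    "web":    "https://www.google.com/search?q={q}",
    "images": "https://www.google.com/search?tbm=isch&q={q}",
    "news":   "https://news.google.com/search?q={q}",
    "videos": "https://www.youtube.com/results?search_query={q}",
    "books":  "https://www.google.com/search?tbm=bks&q={q}",
}

def targeted_search_url(button: str, text: str) -> str:
    """Return the search URL for the chosen button and OCR'ed text."""
    return SEARCH_TEMPLATES[button].format(q=quote_plus(text))

print(targeted_search_url("books", "Wings of Fire"))
# -> https://www.google.com/search?tbm=bks&q=Wings+of+Fire
```

Keeping the targets in a table like this makes it easy to add a thirteenth search button without touching the dispatch logic.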

The third screen is the preferences window, which is opened by pressing the menu button from the first or second screen. It keeps track of the user's choices: the language to be recognized and the language to translate to. There are also lists of blacklisted and whitelisted characters; blacklisted characters are removed from the OCR output during post-processing.
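The blacklist preference amounts to a simple character filter applied to the raw OCR string during post-processing. A minimal Python sketch (the blacklist contents here are just an example):

```python
def apply_blacklist(text: str, blacklist: str) -> str:
    """Drop every blacklisted character from the raw OCR output."""
    banned = set(blacklist)
    return "".join(ch for ch in text if ch not in banned)

# e.g. strip stray symbols that OCR sometimes hallucinates
print(apply_blacklist("he|llo, w@orld!", "|@"))  # -> hello, world!
```

A whitelist works the same way with the condition inverted: keep a character only if it appears in the allowed set.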


Figure 5.4: Preference activity



Figure 5.5: List of languages

There are various languages that can be recognized by the Tesseract engine. Drishti is capable of recognizing 36 languages, including Hindi, and Tesseract can be extended to other regional languages such as Gujarati and Bengali.

Use Case Scenarios

Examples of scenarios where Drishti can be of use are discussed below:

1) Book Search

This is an example of a targeted book search. The title of the book is captured, Drishti runs the OCR engine on it to recognize the title as text, and a book-search intent is sent to get the result shown in the figure.

[Screenshot: book search results]

2) Image Search

The Hindi OCR engine is selected for the captured image to obtain the word "saree" in Hindi, which is converted into English using the translator. The image-search intent produces the following output:

[Screenshot: image search results]


3) Translation and Speech

The figure below demonstrates translation of English text into Tamil. The Tamil text is also spoken in a Tamil voice so that the user learns the pronunciation of the words as well.


4) People Search

[Screenshot: LinkedIn people search results]

Here the name extracted from a journal or business card is passed as an intent to a LinkedIn search task to get the required list of suggested profiles.

5) Video Search – the name of the artist "A. R. Rahman" is OCR'ed here and a YouTube video search is performed.

[Screenshot: YouTube video search results]

6) Word Meaning – the word to be searched is placed in the box and passed to the OCR engine for text extraction. The extracted text is sent to the dictionary for its meaning.

[Screenshot: dictionary word-meaning results]

7) Location Search – location search from a given address is one of the most interesting as well as useful searches. GPS can be used to improve the search results.

[Screenshot: Google Maps location search results]