AI Orange Belt - Assignments Correction
ASSIGNMENT 2 - Dropbox
What is the category of the task accomplished by an OCR?
OCR = classification problem (classify each letter, then assemble letters > words > sentences > text)
What are the inputs and outputs of an OCR? In other words, what is the task accomplished?
Image Document (pixels) => Text
How would you go about training an OCR? In other words, what type of data would you use? How would you enrich it?
Data = documents, text databases, synthetic text, transformed documents (distortions)
What is the name of the models that have been used?
Word Deep Net (CNN, bidirectional LSTM, CTC Layer)
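To make the role of the CTC layer more concrete, here is a minimal sketch of how a per-frame prediction is turned into a word at inference time: take the best label for each frame, merge consecutive repeats, then drop the blank label. This is plain greedy (best path) decoding, not necessarily the decoder Dropbox used; the blank symbol "-", the function names and the toy scores are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "-"  # assumed symbol for the CTC blank label


def best_path(frame_scores, alphabet):
    """Pick the highest-scoring label for every frame (greedy best path)."""
    return [
        alphabet[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


# Toy per-frame scores over a 3-symbol alphabet (made-up numbers).
alphabet = [BLANK, "a", "b"]
scores = [
    [0.1, 0.8, 0.1],    # "a"
    [0.1, 0.7, 0.2],    # "a" again (repeat, will be merged)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],    # "b"
]
decoded = ctc_greedy_decode(best_path(scores, alphabet))  # "ab"
print(ctc_greedy_decode(list("hh-e-ll-lo--")))  # prints "hello"
```

The blank between the two "l" runs is what allows a doubled letter to survive the merge step.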
What were the tradeoffs considered between old, off-the-shelf existing commercial solutions and creating an in-house OCR?
The cost of the API calls was too high, and the commercial solutions would not be fine-tuned for Dropbox's documents, so they would perform worse.
How did Dropbox gather the data? What are the strategies used?
First, Dropbox asked a small percentage of users to donate their data. They then generated synthetic data by taking words from Project Gutenberg texts, rendering them onto images with different fonts, and applying geometric and photometric transformations.
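The synthetic data recipe above (render a word, then distort it) can be sketched in a few lines. This is only an illustration under strong assumptions: the 3x5 bitmap "font", the shear used as the geometric transformation, the contrast / brightness / noise used as photometric transformations and all parameter ranges are hypothetical stand-ins, not Dropbox's pipeline, which rasterised words with real fonts.

```python
import random

# Hypothetical 3x5 bitmap "font"; a real pipeline would use many real fonts.
TINY_FONT = {
    "o": ["111", "101", "101", "101", "111"],
    "c": ["111", "100", "100", "100", "111"],
    "r": ["110", "101", "110", "101", "101"],
}
INK, PAPER = 0, 255  # grayscale values for text and background


def render_word(word, font=TINY_FONT):
    """Render a word as a 5-row grid of pixels, one blank column between letters."""
    rows = []
    for y in range(5):
        bits = "0".join(font[ch][y] for ch in word)
        rows.append([INK if b == "1" else PAPER for b in bits])
    return rows


def shear(image, max_shift):
    """Geometric: shift each row right, from 0 pixels (top) to max_shift (bottom)."""
    height = len(image)
    out = []
    for y, row in enumerate(image):
        shift = round(max_shift * y / (height - 1))
        out.append([PAPER] * shift + row + [PAPER] * (max_shift - shift))
    return out


def adjust(image, contrast, brightness, noise, rng):
    """Photometric: rescale contrast around mid-gray, shift brightness, add noise."""
    def px(p):
        v = contrast * (p - 128) + 128 + brightness + rng.uniform(-noise, noise)
        return max(0, min(255, round(v)))
    return [[px(p) for p in row] for row in image]


def make_sample(word, rng):
    """Return one (image, label) training pair; the label comes for free."""
    image = shear(render_word(word), max_shift=rng.randint(0, 3))
    image = adjust(image, contrast=rng.uniform(0.6, 1.0),
                   brightness=rng.uniform(-30, 30), noise=10, rng=rng)
    return image, word


image, label = make_sample("ocr", random.Random(0))
```

The point of the approach is that every generated image comes with its exact label, so no manual annotation is needed for this part of the training set.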
How did Dropbox annotate the data?
They built an in-house annotation platform, named DropTurk, so that they could hire annotators directly and have them sign NDAs.
Dropbox actually refined the task of the OCR into three subtasks. What are the characteristics of those subtasks? For each, give the experience (annotated data), performance measure, input / output, model used, and category of the task (classification, regression, etc.).
Name and describe three problems encountered when putting the system end to end, and in production with the refinements
Poor performance from the word detector, which would wrongly merge two words into one or split a single word
Orientation detection, needed to handle rotated scans and straighten them
PDF formatting issues
The Dropbox team made heavy use of precision and recall to evaluate their system. In terms of cost and user experience, do you think both should be maximized? Or should one be preferred over the other?
For the word detector, Dropbox chose to optimise cost (in terms of training time) by maximising recall and accepting low precision. The next task (reading one word) was optimised instead, so that it could handle the false positives (empty images with no words). All in all, missing a word is the real problem, so it is better not to risk any miss.
For the word deep net, they tried to strike a balance between precision and recall and simply beat the baseline (the commercial state-of-the-art solution at their disposal).
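The precision / recall reasoning above can be checked with a small worked example. The box ids and the recognizer outputs below are made up purely for illustration (they are not Dropbox's figures): a detector that proposes every real word plus several empty boxes has perfect recall but low precision, and letting the next stage reject empty crops recovers most of the precision without losing any word.

```python
def precision_recall(predicted, actual):
    """Precision = TP / predicted, recall = TP / actual, on sets of box ids."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up example: 4 real words on the page.
true_words = {"w1", "w2", "w3", "w4"}

# High-recall detector: finds every word, plus 4 boxes with no word inside.
detector_boxes = true_words | {"e1", "e2", "e3", "e4"}

# Hypothetical output of the word reader; "" means "no word in this crop".
read_text = {
    "w1": "hello", "w2": "from", "w3": "the", "w4": "scan",
    "e1": "", "e2": "", "e3": "", "e4": "l",  # one false positive survives
}

# The reader filters out the crops it reads as empty.
kept_boxes = {box for box in detector_boxes if read_text[box]}

print(precision_recall(detector_boxes, true_words))  # (0.5, 1.0)
print(precision_recall(kept_boxes, true_words))      # (0.8, 1.0)
```

A missed word, on the other hand, can never be recovered by a later stage, which is why recall is the quantity to protect at the detection step.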