• AI Orange Belt - Assignments Correction

    ASSIGNMENT 2 - Dropbox




    What is the category of the task accomplished by an OCR?

    OCR = classification problem (assign each image region its letter, then build up letter > word > sentence > text)


    What are the inputs and outputs of an OCR? In other words, what is the task accomplished?

    Image Document (pixels) => Text


    How would you go about training an OCR? In other words, what type of data would you use? How would you enrich it?

    Data = documents, text databases, synthetic text, transformed documents (distortions)


    What is the name of the models that have been used?

    Word Deep Net (CNN, bidirectional LSTM, CTC Layer)
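
    To illustrate what the CTC layer contributes at inference time, here is a minimal sketch of greedy (best-path) CTC decoding: the network emits one label per image column, repeated labels are collapsed, then blanks are dropped. The blank symbol "-" and the function name are assumptions made for this example, not Dropbox's code.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank label


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then remove blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


# One label per time step, e.g. the argmax of each per-frame softmax.
frames = list("hh-e-ll-lloo")
print(ctc_greedy_decode(frames))  # prints "hello"
```

    The blank between the two runs of "l" is what lets the decoder keep a genuine double letter instead of collapsing it.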


    What were the tradeoffs considered between old, off-the-shelf existing commercial solutions and creating an in-house OCR?

    The cost of the API calls was too high, and the commercial solutions would not be fine-tuned for Dropbox's documents, hence lower performance.


    How did Dropbox gather the data? What are the strategies used?

    First, Dropbox asked a small percentage of users to donate their data. Then it generated synthetic data by taking words from Project Gutenberg texts, rendering them on images with different fonts, and applying geometric/photometric transformations.
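
    The synthetic data strategy can be sketched as a sampling loop that picks a word, a font, and random transformation parameters for each sample. This is a minimal sketch that only produces the rendering "recipe"; actual image rendering is left out. The font names, parameter ranges, and the names SyntheticSample and make_samples are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    text: str            # ground-truth label, known for free
    font: str            # font used to render the word
    rotation_deg: float  # geometric transformation
    blur_radius: float   # photometric transformation
    contrast: float      # photometric transformation


FONTS = ["serif-regular", "sans-bold", "mono-italic"]  # hypothetical font names


def make_samples(corpus_words, n, seed=0):
    """Draw n rendering recipes; the ranges below are illustrative guesses."""
    rng = random.Random(seed)
    return [
        SyntheticSample(
            text=rng.choice(corpus_words),
            font=rng.choice(FONTS),
            rotation_deg=rng.uniform(-5.0, 5.0),
            blur_radius=rng.uniform(0.0, 1.5),
            contrast=rng.uniform(0.6, 1.4),
        )
        for _ in range(n)
    ]


for sample in make_samples(["whale", "ship", "captain"], 3):
    print(sample)
```

    The appeal of this approach is that every generated image comes with its label, so no annotation is needed for the synthetic part of the dataset.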


    How did Dropbox annotate the data?

    They built their own annotation platform, named DropTurk, so that they could hire annotators directly and have them sign NDAs.


    Dropbox actually refined the task of the OCR to three subtasks. What are the characteristics of those subtasks: experience (annotated data), performance measure, input / output, model used, and category of the task (classification, regression, etc.)?

    • Task 1 - read a word (word deep net)
    • Data = word images
    • Performance = recall, precision using SWA (single word accuracy)
    • Model = Neural Network (CNN LSTM)
    • Input / output = image => word on the image
    • Category of task = classification
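
    As a quick illustration of the performance measure for Task 1, single word accuracy can be computed as the fraction of word images whose predicted transcription exactly matches the label. This is a minimal sketch assuming exact, case-sensitive matching; the function name and the sample words are made up for the example.

```python
def single_word_accuracy(predictions, ground_truth):
    """Fraction of word images transcribed exactly right."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


predicted = ["invoice", "tota1", "2017", ""]
labels = ["invoice", "total", "2017", "paid"]
print(single_word_accuracy(predicted, labels))  # prints 0.5
```

    A single wrong character makes the whole word count as an error, which is why this measure is stricter than character-level accuracy.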


    • Task 2 - word detector
    • Data = documents (image)
    • Performance = number of true words detected / total number of words in the document (recall)
    • Model = Maximally Stable Extremal Regions
    • Input / output = documents (image) => images (bounding boxes)
    • Category of task = segmentation
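
    The recall measure for Task 2 can be sketched as follows: a true word counts as detected if some predicted box overlaps it enough. The overlap criterion (intersection over union with a 0.5 threshold) and the box format (x1, y1, x2, y2) are assumptions chosen for this example, not necessarily what Dropbox used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def detector_recall(true_boxes, detected_boxes, threshold=0.5):
    """Number of true words detected / total number of words in the document."""
    if not true_boxes:
        return 1.0  # nothing to miss
    found = sum(
        any(iou(t, d) >= threshold for d in detected_boxes) for t in true_boxes
    )
    return found / len(true_boxes)


true_words = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]
detections = [(0, 0, 10, 10), (21, 0, 31, 10), (100, 100, 110, 110)]
print(detector_recall(true_words, detections))  # 2 of the 3 words are found
```

    Note that the stray detection at (100, 100, 110, 110) does not lower recall; it would only hurt precision.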


    • Task 3 - Orientation detector
    • Data = text
    • Performance = accuracy
    • Model = software rule based (no AI)
    • Input / output = text, image => word bounding box
    • Category of task = N/A


    • Task 4 - Wordinator
    • Data = Multi-word window
    • Performance = accuracy
    • Input / output = multi-word bounding box, image => word bounding box
    • Category of task = segmentation


    Name and describe three problems encountered when putting the system end to end, and in production with the refinements

    Poor performance from the word detector: it would wrongly merge two words into one or split a single word.

    Orientation detection: rotated scans had to be detected and straightened.

    PDF formatting issues


    The Dropbox team made heavy use of precision and recall to evaluate their system. In terms of cost and user experience, do you think both should be maximized? Or should one be preferred over the other?

    For the word detector, Dropbox chose to maximise recall and accept low precision, which kept the cost (in terms of training time) down. Instead, the next task (read one word) was optimised to handle the false positives (empty images with no words). All in all, missing a word is the real problem, so it is better not to risk any miss.


    For the word deep net, they tried to strike a balance between precision and recall and simply beat the baseline (the commercial state-of-the-art solution at their disposal).
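
    The word detector reasoning above can be checked with a small calculation: a high-recall, low-precision detector followed by a reader that rejects empty crops. All counts below are hypothetical numbers chosen for illustration, not figures reported by Dropbox.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical document with 100 real words.
# Detector alone: finds 95 words but also proposes 60 empty boxes.
detector = precision_recall(tp=95, fp=60, fn=5)

# Detector + reader: assume the reader rejects 55 of the 60 empty boxes
# and keeps every real word the detector found.
pipeline = precision_recall(tp=95, fp=5, fn=5)

print("detector alone    precision=%.3f recall=%.3f" % detector)
print("detector + reader precision=%.3f recall=%.3f" % pipeline)
```

    Under these assumptions the second stage repairs precision while recall stays where the detector left it. The reverse is not possible: a word the detector missed can never be recovered downstream, which is why recall is the measure to favour at that stage.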