Dropbox is a company that hosts files in the cloud for companies and individuals, like an online hard drive available from anywhere. Google Drive and Box are direct competitors. The files have to be fast and easy to access.
Carefully read the Dropbox use case below, which follows a new machine learning system from its implementation from scratch to its deployment. Then answer the questions below to the best of your ability.
Read the assignment and send your answers to the email address firstname.lastname@example.org.
- What is the category of the task accomplished by an OCR system? Classification, regression, clustering…
- What are the inputs and outputs of an OCR system? In other words, what task does it accomplish?
- How would you go about training an OCR system? In other words, what type of data would you use? How would you enrich it?
- What are the names of the models that were used? No need to know how they work.
- What trade-offs were considered between the old, off-the-shelf commercial solutions and creating an in-house OCR system?
- How did Dropbox gather the data? What are the strategies used?
- Dropbox actually refined the OCR task into three subtasks. What are the characteristics of each subtask: experience (annotated data), performance measure, input/output, model used, and category of task (classification, regression, etc.)?
- Name and describe three problems encountered when putting the system together end to end and running it in production with the refinements.
- The Dropbox team made heavy use of precision and recall to evaluate their system. In terms of cost and user experience, do you think both should be maximised, or should one be preferred over the other?
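
As a reminder for the last question, precision and recall can be computed from counts of true positives, false positives and false negatives. The sketch below is only an illustration: the function name and the counts are hypothetical and do not come from the Dropbox use case.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from raw counts.

    tp: items correctly flagged as positive
    fp: items wrongly flagged as positive
    fn: positive items that were missed
    """
    # Precision: among everything flagged positive, the share that is correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: among everything actually positive, the share that was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, chosen only to show that the two measures can differ.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f}")
```

With these made-up counts, precision is higher than recall: most of what the system flags is correct, but it misses a third of the true positives. Which of the two matters more depends on what a false positive and a false negative each cost.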