Description
This article describes the steps and levers available to you for improving the results and accuracy of your model. But first, let's understand why you might need these improvements.
Some examples of improvements required
- A phone number is labelled as 'Invoice_ID'
- The text 'Date:' is getting captured along with the data '12.02.2020'
- There is no entry found for 'Client_email' even though an email is present in the file.
- The model is picking up 'Invoice ID' correctly for Vendor A on 80 of 100 files, but labelling the phone number as 'Invoice ID' on the other 20 files for the same vendor.
- The model is picking up a 'Due date' as 'Invoice date'
- The tables sometimes have extra rows or fields
- The columns extracted from the table have the wrong label
All of these examples boil down to 2 scenarios
- The model incorrectly predicted a piece of text
- The model missed predicting a piece of text
This is great! We've simplified our problem into 2 scenarios, both of which can be solved using the same steps. Next time you have an issue with your results think of it in terms of these 2 statements.
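To make the two scenarios concrete, here is a minimal sketch that compares the fields a model returned for one file against values you have checked by hand. It is only an illustration: the function name `classify_errors`, the label-to-text dictionaries, and the sample values are assumptions made for this example, not part of the Nanonets product or API.

```python
from typing import Dict, List, Tuple


def classify_errors(
    predicted: Dict[str, str], expected: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Sort every mismatch on one file into one of the two scenarios.

    Both arguments map a label to the captured text. This shape is a
    hypothetical simplification chosen for the example.
    """
    errors = []
    for label, true_text in expected.items():
        if label not in predicted:
            # Scenario 2: the model missed predicting a piece of text.
            errors.append((label, "missed"))
        elif predicted[label] != true_text:
            # Scenario 1: the model incorrectly predicted a piece of text.
            errors.append((label, "incorrect"))
    for label in predicted:
        if label not in expected:
            # A label that should not be on the file at all is also
            # an incorrect prediction.
            errors.append((label, "incorrect"))
    return errors


# Hypothetical sample data modelled on the examples in this article.
predicted = {"Invoice_ID": "555-0100", "Invoice_date": "Date: 12.02.2020"}
expected = {
    "Invoice_ID": "INV-001",
    "Invoice_date": "12.02.2020",
    "Client_email": "client@example.com",
}
print(classify_errors(predicted, expected))
```

In this sample, the phone number labelled as 'Invoice_ID' and the date captured together with 'Date:' both fall under the first scenario, while the absent 'Client_email' falls under the second.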
How to improve the results and accuracy?
In almost all of these cases, the results can be improved by a simple retraining of your model. Let's get straight to how you can do it.
One thing to note here is that you will only know whether you need these improvements by uploading documents to the model and seeing the results. The more documents you upload, the better your understanding of what improvements are required. So if you haven't already, please go to the extract data section of your model and upload some files :)
Once you've done this, here are the 3 steps to get better results
Step 1: Verify files
- Go to the extract data section of your model -> https://app.nanonets.com/#/ocr/test/<MY_MODEL_ID>
- Please inspect these files for the 2 errors that we've discussed above and make the following changes
  - The model incorrectly predicted a piece of text
    - Delete the incorrect box and draw a new box around the right piece of text
    - OR change the label of the text that the model predicted incorrectly to the correct label
  - The model missed predicting a piece of text
    - Draw a box around the piece of text that the model missed predicting
- Once everything on the file seems correct, you can hit the Approve file button in order to mark the results as correct/reviewed/verified by a human.
Once the files are Approved, they will show up on the Approved tab as well as on the list view under Review status as 'Approved'.
We recommend verifying/approving at least 10 files before retraining your model, but it can be retrained with even a single file.
Step 2: Click the "Improve Model" button
Step 3: Confirm the retraining
This will trigger a model retrain, i.e. the model will learn from the corrections you've provided so that it does not make the same mistakes in the future.
FAQs
Q: How long does retraining take?
A: The time depends on how many images are in your dataset and the plan tier you are subscribed to. Typically, most models finish training within 45 minutes to 2 hours.
Q: Is retraining a continuous process or a one time thing?
A: Retraining can be repeated. We recommend retraining the model whenever you need improvements.
Q: What if my results don't improve after retraining?
A: We share your frustration here, but we can almost guarantee that the results will improve. In the exceptional case that they don't, please reach out to support@nanonets.com and we'll resolve this for you.
Q: Okay, I've understood the basic principles of retraining now. Can someone help me with marking the files as approved or verified?
A: Yes, if you're on the Enterprise Plan, our annotation team can help you verify files. Drop an email to support@nanonets.com.
Q: Can I query the files that are marked as verified from an API?
A: Yes! You can do this from this API -> https://nanonets.com/documentation/#operation/OCRModelListPredictionFiles
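Once you have fetched and parsed the file list from that API, a small helper can tell you how many files are approved and whether you have reached the 10 files recommended above before retraining. This is a sketch under stated assumptions: the record shape and the `review_status` key are hypothetical placeholders, so check the linked API reference for the actual response fields.

```python
from typing import Dict, List

# Recommendation from this article: approve at least 10 files before retraining.
RECOMMENDED_MINIMUM = 10


def approved_files(files: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose review status is 'Approved'.

    The "review_status" key is a hypothetical field name used for
    illustration; the real API response may name it differently.
    """
    return [f for f in files if f.get("review_status") == "Approved"]


def enough_for_retraining(
    files: List[Dict[str, str]], minimum: int = RECOMMENDED_MINIMUM
) -> bool:
    """Return True when the number of approved files reaches the minimum."""
    return len(approved_files(files)) >= minimum


# Hypothetical parsed records: 10 approved files and 1 still pending.
sample = [{"review_status": "Approved"} for _ in range(10)]
sample.append({"review_status": "Pending"})
print(len(approved_files(sample)), enough_for_retraining(sample))
```

Remember that 10 is only a recommendation: the model can be retrained with even a single approved file, so you can pass a lower `minimum` if that suits your workflow.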