Designing Your First NLP Annotation Job

Beth Bell
Principal Machine Learning Engineer

Machine learning models are only as good as the data on which they are trained. For many NLP problems, preparing data for model training requires setting up an annotation project to create labels for the data. Mistakes in the labeling step will incur costs in the more expensive modeling step, so getting the labels right the first time will save you time and hassle. We at KUNGFU.AI have put together the following step-by-step practices to ensure that once you have the relevant data for your business problem, your annotation project goes smoothly and produces high-quality results on which you can reliably train machine learning models.

Pick labels that describe the actual content of your data. You might know ahead of time what labels your content suggests. If you want to train a model to make restaurant recommendations based on customer reviews, you might want to extract the restaurant name, type of cuisine, and location, all of which can help customers decide where to eat.

If you want to train an intent model to identify customer issues in e-commerce, you might know the most common and easily solvable problems ahead of time.

If you don’t know what the common issues are, perform textual analysis to identify them before deciding on labels.
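One simple form of textual analysis is counting the most frequent short phrases in your raw text. The sketch below is only an illustration: the sample messages, the tiny stopword list, and the `top_bigrams` helper are assumptions made for this example, and a real project would likely use a fuller stopword list and far more data.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "my", "is", "was", "i", "to",
             "and", "it", "for", "of", "in", "not", "has"}

def top_bigrams(messages, n=5):
    """Return the n most frequent two-word phrases after dropping stopwords."""
    counts = Counter()
    for text in messages:
        tokens = [t for t in re.findall(r"[a-z']+", text.lower())
                  if t not in STOPWORDS]
        # Pair each remaining token with the one that follows it.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)

# Made-up customer messages standing in for real e-commerce data.
messages = [
    "My order has not arrived",
    "Where is my refund for the damaged item",
    "The order arrived late",
    "I want a refund, the item was damaged",
    "Damaged item, need refund",
]

for phrase, count in top_bigrams(messages, n=3):
    print(" ".join(phrase), count)
```

Phrases that recur, such as "order arrived" or "damaged item" in this toy sample, are candidates for intent labels that you can then check against a manual read of the data.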

Determine what level of granularity you need from your labels. Are sentence-level labels good enough, or do you need token-level labels? The type of task that you want to perform will determine your granularity requirements. If you want to train a model to de-identify text data, then you should label identifying information at the token level.

If you want to train a sentiment model, then you might only need sentence-level or phrase-level annotation.

There is usually a tradeoff between time to annotate and the minimum level of granularity needed to train models that satisfy the business problem.
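To make the difference between the two granularities concrete, here is a rough sketch of how each kind of annotation might be stored. The dictionary layout, the label names, and the `redact` helper are assumptions for illustration and do not reflect any particular annotation tool's export format.

```python
# Sentence-level annotation: a single label covers the whole text.
sentence_example = {
    "text": "The pasta was cold and the service was slow.",
    "label": "negative",
}

# Token-level annotation: each labeled span records where it starts and ends.
# Offsets are character positions, with `end` exclusive.
token_example = {
    "text": "Call Jane Doe at 555-0100.",
    "spans": [
        {"start": 5, "end": 13, "label": "NAME"},
        {"start": 17, "end": 25, "label": "PHONE"},
    ],
}

def redact(example):
    """Replace each labeled span with its label in brackets.

    Spans are processed right to left so that earlier offsets stay valid
    while the text changes length.
    """
    text = example["text"]
    for span in sorted(example["spans"], key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + "[" + span["label"] + "]" + text[span["end"]:]
    return text

print(redact(token_example))
```

De-identification needs the span offsets, which is why it cannot be done from a sentence-level label alone. The token-level record also takes noticeably more annotator effort to produce.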

Make sure that your labels are neither overly specific nor overly general. Say you want to train an intent model for a travel agent AI. Too many intent labels can increase cognitive load for annotators and result in poor annotation quality. However, your labels should be mutually exclusive and specific enough to guide the conversation to the appropriate action.

Do some annotation yourself with real data in your chosen annotation tool. You might find that your labels or your data do not quite suit the task and that you need to modify the annotation job in some way, such as chunking text differently or adding labels, as in this e-commerce intent example:

This step will also give you an idea of how long the annotation takes so that you can estimate the timeline and number of annotators required.
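Once you have timed yourself, the estimate is simple arithmetic. The sketch below assumes a hypothetical `annotators_needed` helper and made-up numbers; substitute your own measured seconds per document and the hours your annotators can realistically commit.

```python
import math

def annotators_needed(n_docs, seconds_per_doc, labels_per_doc,
                      hours_per_week, weeks):
    """Estimate how many annotators are needed to finish within the deadline.

    labels_per_doc is how many different annotators label each document
    (use 2 or more if you plan to measure agreement).
    """
    total_hours = n_docs * seconds_per_doc * labels_per_doc / 3600
    hours_per_annotator = hours_per_week * weeks
    return math.ceil(total_hours / hours_per_annotator)

# Illustrative numbers: 10,000 documents at 45 seconds each, every document
# labeled twice, annotators available 10 hours a week for 4 weeks.
print(annotators_needed(10_000, 45, 2, hours_per_week=10, weeks=4))
```

With these illustrative inputs the job totals 250 annotator-hours, so at 40 hours per person it takes 7 annotators. It is worth padding the estimate, since this calculation ignores onboarding, review, and rework.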

Write clear documentation and be as succinct as possible. Make sure to include definitions of and examples for all of your labels, especially edge cases, with screenshots from your annotation tool. Update this documentation frequently with feedback from annotators, and if possible, watch someone label a few documents using only your instructions to determine what is missing.

Use internal annotators rather than outsourcing to a crowd. This is a must when a project requires any amount of subject matter expertise. Internal annotators can ask clarifying questions, you can provide corrective feedback based on your review of the labels, and they can provide direct feedback on or even updates to the annotation instructions. Consider incentives for internal annotators to prevent hurried annotations that yield poor quality.

Have multiple annotators label the same document and measure agreement. If annotation accuracy is a challenge, rely on records for which there is inter-annotator agreement. Inter-annotator agreement is a good proxy for the quality of your labels. Cohen’s kappa is one measure of agreement, though not the only one. Be careful to review, and not just discard, documents that have label disagreement; they might indicate a section of poor documentation or a tricky class in your dataset.
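For two annotators, Cohen's kappa compares the agreement you observed with the agreement you would expect by chance given how often each annotator used each label. The following is a minimal sketch with made-up labels; the `disagreements` helper is a hypothetical addition for pulling out documents to review.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def disagreements(labels_a, labels_b):
    """Indices of items the annotators labeled differently, for manual review."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]

# Made-up labels from two annotators on the same ten documents.
annotator_1 = ["yes"] * 5 + ["no"] * 5
annotator_2 = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]

print(round(cohens_kappa(annotator_1, annotator_2), 2))
print(disagreements(annotator_1, annotator_2))
```

Here the annotators agree on 8 of 10 documents, chance agreement is 0.5, and kappa comes out at roughly 0.6. If you already use scikit-learn, `sklearn.metrics.cohen_kappa_score` computes the same statistic.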

Review labels early and often, especially when using external annotators. Require a short evaluation period before launching the full annotation job, or introduce gold-standard questions if your annotation tool allows it. You can expect some mistakes due to human error, but guard against systematic errors that will bias your model. If you are unsure whether or where problems exist, start by reviewing a random sample of the labels. If your model is performing poorly on a particular class, review model errors with labels for that class. If a particular class has been tricky for annotators in the past, continue to target that class for review.
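Two of these checks are easy to script: drawing a random sample for review and scoring an annotator against gold-standard items. The sketch below assumes annotations are stored as a mapping from document ID to label; the function names and sample data are hypothetical.

```python
import random

def sample_for_review(doc_ids, k, seed=0):
    """Draw a reproducible random sample of document IDs to review."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(k, len(doc_ids)))

def gold_accuracy(annotations, gold):
    """Fraction of gold-standard documents the annotator labeled correctly.

    Returns None if the annotator has not labeled any gold documents yet.
    """
    scored = [doc_id for doc_id in gold if doc_id in annotations]
    if not scored:
        return None
    correct = sum(annotations[doc_id] == gold[doc_id] for doc_id in scored)
    return correct / len(scored)

# Made-up data: two gold-standard documents mixed into an annotator's queue.
gold = {"d1": "refund", "d2": "shipping"}
annotations = {"d1": "refund", "d2": "other", "d3": "refund", "d4": "shipping"}

print(gold_accuracy(annotations, gold))
print(sample_for_review(list(annotations), k=2))
```

A fixed seed keeps the review sample reproducible, so a second reviewer can look at the same documents. A low score on gold items for one class is a hint to target that class in later reviews.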

These guidelines represent a list of lessons learned throughout our own annotation projects and are by no means exhaustive. Hopefully they prove useful as you begin your first NLP annotation project. We have included additional resources below should you require a deeper dive. We strive to improve our annotation approaches, and welcome feedback on the suggested guidelines outlined here.


Many thanks to Reed Coke for his valuable feedback on this post.

