Stemming? Lemmatization? What?
A high-level look at what stemming and lemmatization do for natural language processing tasks and how they do it.
Stemming and Lemmatization
In natural language processing, there may come a time when you want your program to recognize that the words “ask” and “asked” are just different tenses of the same verb. This is the idea of reducing different forms of a word to a core root. Words that are derived from one another can be mapped to a central word or symbol, especially if they have the same core meaning.
Maybe this is in an information retrieval setting and you want to boost your algorithm’s recall. Or perhaps you are trying to analyze word usage in a corpus and wish to condense related words so that you don’t have as much variability. Either way, this technique of text normalization may be useful to you.
This is where something like stemming or lemmatization comes in, something that you may have heard of before! But what’s the difference between the two? And what do they actually do? These are two questions that we are going to explore today!
So What Are They?
At their core, both of these techniques tackle the same idea: reduce a word to its root or base unit. It’s a common data pre-processing step and something good to be familiar with. Though they both aim to solve the same problem, they go about it in completely different ways. Let’s take a look!
Stemming is definitely the simpler of the two approaches. With stemming, words are reduced to their word stems. A word stem need not be the same as the word’s dictionary-based morphological root; it is simply a form equal in length to or shorter than the word itself.
Stemming algorithms are typically rule-based. You can view them as a heuristic process that lops off the ends of words. A word is run through a series of conditionals that determine how to cut it down.
For example, we may have a suffix rule that, based on a list of known suffixes, cuts them off. In the English language, we have suffixes like “-ed” and “-ing” which may be useful to cut off in order to map the words “cook,” “cooking,” and “cooked” all to the same stem of “cook.”
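A suffix rule like that can be sketched in a few lines of Python. This is a toy illustration only; the suffix list and the minimum-stem-length guard are my own assumptions, not rules from any real stemming algorithm:

```python
# A toy rule-based stemmer: strip the first matching suffix from a fixed
# list. The guard keeps us from reducing a word below three characters.
SUFFIXES = ["ing", "ed", "es", "s"]

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["cook", "cooking", "cooked"]:
    print(w, "->", naive_stem(w))  # all three map to the stem "cook"
```

Real stemmers like Porter’s are built from many such rules, plus conditions on what the remaining stem looks like, but the basic lop-off-the-ending idea is the same.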
Overstemming and Understemming
However, because stemming is usually based on heuristics, it is far from perfect. In fact, it commonly suffers from two issues in particular: overstemming and understemming.
Overstemming occurs when too much of a word is cut off. This can result in nonsensical stems, where all the meaning of the word is lost or muddled. Or it can result in words being resolved to the same stem, even though they probably should not be.
Take the four words university, universal, universities, and universe. A stemming algorithm that resolves these four words to the stem “univers” has overstemmed. While it might be nice to have universal and universe stemmed together and university and universities stemmed together, all four do not fit. A better resolution might have the first two resolve to “univers” and the latter two resolve to “universi.” But enforcing rules that make that so might result in more issues arising.
Understemming is the opposite issue. It occurs when we have several words that actually are forms of one another. It would be nice for them all to resolve to the same stem, but unfortunately, they do not.
This can be seen if we have a stemming algorithm that stems the words data and datum to “dat” and “datu.” And you might be thinking, well, just resolve these both to “dat.” However, then what do we do with date? And is there a good general rule? Or are we just enforcing a very specific rule for a very specific example?
Those questions quickly become issues when it comes to stemming. Enforcing new rules and heuristics can quickly get out of hand. Solving one or two overstemming or understemming issues can result in two more popping up! Making a good stemming algorithm is hard work.
Speaking of which…
Stemming Algorithm Examples
Two stemming algorithms I immediately came in contact with when I first started using stemming were the Porter stemmer and the Snowball stemmer from NLTK. While I won’t go into a lot of details about either, I will highlight a little bit about them so that you can know even more than I did when I first started using them.
- Porter stemmer: This stemming algorithm is an older one. It’s from the 1980s and its main concern is removing the common endings of words so that they can be resolved to a common form. It’s not too complex, and development on it is frozen. Typically, it’s a nice basic stemmer to start with, but it isn’t really advised for production or complex applications. Instead, it has its place in research as a simple, well-known stemming algorithm that can guarantee reproducibility. It is also a very gentle stemming algorithm compared to others.
- Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer. That being said, it is also more aggressive than the Porter stemmer. A lot of the things added to the Snowball stemmer were because of issues noticed with the Porter stemmer. There is about a 5% difference in the way that Snowball stems versus Porter.
- Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is another one you can use. It’s the most aggressive stemming algorithm of the bunch. If you use the NLTK implementation, though, you can add your own custom rules to it very easily, which makes it a good choice for experimentation. One common complaint is that it can be overly aggressive and really transform words into strange stems. Just make sure it does what you want it to before you go with this option!
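If you want to compare all three side by side, NLTK exposes each of them (this sketch assumes you have NLTK installed; the word list is just a handful of illustrative examples):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the Porter2 algorithm
lancaster = LancasterStemmer()

# A few words that show the stemmers' different levels of aggressiveness.
for word in ["caresses", "ponies", "running", "generously", "maximum"]:
    print(f"{word:12} porter={porter.stem(word):10} "
          f"snowball={snowball.stem(word):10} "
          f"lancaster={lancaster.stem(word)}")
```

Running a sample of your own corpus through all three like this is a quick way to see which level of aggressiveness suits your application.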
What About Lemmatization?
We’ve talked about stemming, but what about the other side of things? How is lemmatization different? Well, if we think of stemming as just taking a best guess at where to snip a word based on how it looks, lemmatization is a more calculated process. It involves resolving words to their dictionary form. In fact, a lemma of a word is its dictionary or canonical form!
Because lemmatization is more nuanced in this respect, it requires a little more machinery to actually make it work. For lemmatization to resolve a word to its lemma, it needs to know the word’s part of speech. That requires extra computational linguistics power, such as a part-of-speech tagger. This allows it to do better resolutions (like resolving “is” and “are” to “be”).
Another thing to note about lemmatization is that it’s often harder to create a lemmatizer for a new language than it is a stemming algorithm. Because lemmatizers require a lot more knowledge about the structure of a language, building one is a much more intensive process than setting up a heuristic stemming algorithm.
Luckily, if you’re working in English, you can quickly use lemmatization through NLTK just like you do with stemming. To get the best results, however, you’ll have to feed part-of-speech tags to the lemmatizer; otherwise, it might not reduce all the words to the lemmas you desire. More than that, it’s based on the WordNet database (which is kind of like a web of synonyms or a thesaurus), so if there isn’t a good link there, then you won’t get the right lemma anyway.
One more thing before I wrap up here: If you choose to use either lemmatization or stemming in your NLP application, always be sure to test performance with that addition. In many applications, you may find that either one hurts performance just as often as it helps. Both of these techniques are really designed with recall in mind, and precision tends to suffer as a result. But if recall is what you’re aiming for (like with a search engine), then maybe that’s alright!
Also, this blog post mostly centers on the English language. Other languages, even ones that seem closely related, can behave very differently under stemming and lemmatization. The general concepts remain the same, but the specific implementations will differ drastically. Hopefully this blog at least helps with the high-level picture if you’re planning on working with a different language entirely!
If you enjoyed this post and are hungry for more NLP readings, why not check out another blog post I wrote about how word embeddings work and the different types you can encounter? Or, if you like sentences more, why not check out my summary of a paper that analyzed how different sentence embeddings affect downstream and linguistic tasks?