The Beginner’s Guide to Natural Language Processing with Python



Learning natural language processing can be a super useful addition to your developer toolkit. From the basics to building LLM-powered applications, you can get up to speed with natural language processing in a few weeks, one small step at a time. And this article will help you get started.

In this article, we’ll learn the basics of natural language processing with Python, taking a code-first approach using the Natural Language Toolkit (NLTK). Let’s begin!

▶️ Link to the Google Colab notebook for this tutorial

Installing NLTK

Before diving into NLP tasks, we need to install the Natural Language Toolkit (NLTK). NLTK provides a suite of text processing tools: tokenizers, lemmatizers, POS taggers, and preloaded datasets. Think of it as a Swiss Army knife for NLP. Setting it up involves installing the library and downloading the necessary datasets and models.

Install the NLTK Library

Run the following command in your terminal or command prompt to install NLTK:
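```
pip install nltk
```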

This installs the core NLTK library, which contains the main modules needed for text processing tasks.

Download NLTK Resources

After installation, download NLTK’s pre-packaged datasets and tools. These include stopword lists, tokenizers, and lexicons like WordNet:
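A minimal download script covering the resources used in this tutorial might look like the following; the exact set isn’t prescriptive, but these cover the examples below (the tagger and chunker models are needed for the POS tagging and NER sections later on):

```python
import nltk

# Tokenizer models, stop word lists, and the WordNet lexicon
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # multilingual WordNet data used by the lemmatizer

# Tagger and chunker models for the POS tagging and NER sections
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
```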

Text Preprocessing

Text preprocessing is an essential step in NLP, transforming raw text into a clean and structured format that makes it easier to analyze. The goal is to zero in on the meaningful components of the text while also breaking down the text into chunks that can be processed.

In this section, we cover three important preprocessing steps: tokenization, stop word removal, and stemming.

Tokenization

Tokenization is one of the most common preprocessing tasks. It involves splitting text into smaller units called tokens. These tokens can be words, sentences, or even sub-word units, depending on the task.

  • Sentence tokenization splits the text into sentences
  • Word tokenization splits the text into words and punctuation marks

In the following code, we use NLTK’s sent_tokenize to split the input text into sentences, and word_tokenize to break it down into words. But we also perform a super simple preprocessing step of removing all punctuation from the text:
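The original sample text isn’t reproduced here, so this sketch uses a hypothetical two-sentence example (with parentheses and an exclamation mark) to make the behavior concrete:

```python
import string
from nltk.tokenize import sent_tokenize, word_tokenize

# Hypothetical sample text: two sentences, with parentheses
# and an exclamation mark
text = "Natural language processing (NLP) is super interesting! It helps computers understand human language."

# Split the text into sentences
sentences = sent_tokenize(text)
print(sentences)

# Strip all punctuation, then split the text into words
text_no_punct = text.translate(str.maketrans('', '', string.punctuation))
words = word_tokenize(text_no_punct)
print(words)
```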

This allows us to analyze the structure of the text at both the sentence and word levels.

In this example, sent_tokenize(text) splits the input string into sentences, returning a list of sentence strings. The output of this function is a list with two elements: one for each sentence in the original text.
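With the sample text above, that list looks like:

```
['Natural language processing (NLP) is super interesting!', 'It helps computers understand human language.']
```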

Next, word_tokenize is applied to the text. It breaks the text down into individual words and would normally treat things like parentheses and exclamation marks as separate tokens. Since we’ve already stripped all punctuation, though, the output is as follows:
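```
['Natural', 'language', 'processing', 'NLP', 'is', 'super', 'interesting', 'It', 'helps', 'computers', 'understand', 'human', 'language']
```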

Stop Word Removal

Stopwords are common words such as “the,” “and,” or “is” that occur frequently but carry little meaning in most analyses. Removing these words helps focus on the more meaningful words in the text.

In essence, you filter out stop words to reduce noise in the dataset. We can use NLTK’s stopwords corpus to identify and remove stop words from the list of tokens obtained after tokenization:
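```python
from nltk.corpus import stopwords

# Build the set of English stop words
stop_words = set(stopwords.words('english'))

# 'words' is the token list from the tokenization sketch above;
# keep only the tokens that aren't stop words (case-insensitive check)
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)
```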

Here, we load the set of English stop words using stopwords.words('english') from NLTK. Then, we use a list comprehension to iterate over the list of tokens generated by word_tokenize. By checking whether each token (converted to lowercase) is in the set of stop words, we remove common words that don’t contribute to the meaning of the text.

Here’s the filtered result:
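```
['Natural', 'language', 'processing', 'NLP', 'super', 'interesting', 'helps', 'computers', 'understand', 'human', 'language']
```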

Stemming

Stemming is the process of reducing words to their root form by removing affixes like suffixes and prefixes. The root form may not always be a valid word in the dictionary, but it helps in standardizing variations of the same word.

The Porter stemmer is a common stemming algorithm that works by removing suffixes. Let’s use NLTK’s PorterStemmer to stem the filtered_words list from the sketch above:
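```python
from nltk.stem import PorterStemmer

# Initialize the Porter stemmer
stemmer = PorterStemmer()

# Reduce each filtered word to its (possibly non-dictionary) root form
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print(stemmed_words)
# Roughly: ['natur', 'languag', 'process', 'nlp', 'super', 'interest',
#           'help', 'comput', 'understand', 'human', 'languag']
```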

Here, we initialize the PorterStemmer and use it to process each word in the list filtered_words.

The stemmer.stem() function strips common suffixes like “-ing,” “-ed,” and “-ly” from words to reduce them to their root form. While stemming helps reduce the number of variations of words, it’s important to note that the results may not always be valid dictionary words.

Before we proceed, here’s a summary of the text preprocessing steps:

  • Tokenization breaks text into smaller units.
  • Stop word removal filters out common, non-meaningful words to focus on more significant terms in the analysis.
  • Stemming reduces words to their root forms, simplifying variations and helping standardize text for analysis.

With these preprocessing steps completed, you can move on to learn about lemmatization, part-of-speech tagging, and named entity recognition.

Lemmatization

Lemmatization is similar to stemming in that it also reduces words to their base form. But unlike stemming, lemmatization returns valid dictionary words. It factors in context, such as a word’s part of speech (POS), to reduce the word to its lemma. For example, both “running” and “ran” are reduced to “run.”

Lemmatization generally produces more accurate results than stemming, as it keeps the word in a recognizable form. The most common tool for lemmatization in NLTK is the WordNetLemmatizer, which uses the WordNet lexical database.

  • Lemmatization reduces a word to its lemma by considering its meaning and context—not just by chopping off affixes.
  • WordNetLemmatizer is the NLTK tool commonly used for lemmatization.

In the code snippet below, we use NLTK’s WordNetLemmatizer to lemmatize words from the previously filtered list:
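```python
from nltk.stem import WordNetLemmatizer

# Initialize the WordNet-based lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word from the filtered_words list built above,
# treating each as a verb (pos='v') so verb forms reduce to their base
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in filtered_words]
print(lemmatized_words)
# e.g. 'helps' -> 'help', 'processing' -> 'process'
```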

Here, we initialize the WordNetLemmatizer and use its lemmatize() method to process each word in the filtered_words list. We specify pos='v' to tell the lemmatizer to treat the words as verbs, so it applies the correct lemmatization rule for that part of speech and reduces verb forms to their base form.

So why is lemmatization helpful? Lemmatization is particularly useful when you want to reduce words to their base form while retaining their meaning. It’s a more accurate and context-sensitive method than stemming, which makes it ideal for tasks that require high accuracy, such as text classification or sentiment analysis.

Part-of-Speech (POS) Tagging

Part-of-speech (POS) tagging involves identifying the grammatical category of each word in a sentence, such as nouns, verbs, adjectives, and adverbs. POS tagging helps in understanding the syntactic structure of a sentence, enabling better handling of tasks such as text parsing, information extraction, and machine translation.

The POS tags assigned to words can be based on a standard set such as the Penn Treebank POS tags. For example, in the sentence “The dog runs fast,” “dog” would be tagged as a noun (NN), “runs” as a verb (VBZ), and “fast” as an adverb (RB).

  • POS tagging assigns labels to words in a sentence.
  • Tagging helps analyze the syntax of the sentence and understand word functions in context.

With NLTK, you can perform POS tagging using the pos_tag function, which tags each word in a list of tokens with its part of speech. In the following example, we first tokenize the text and then use NLTK’s pos_tag function to assign POS tags to each word.
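Here’s a sketch using the “The dog runs fast” example from above:

```python
from nltk import pos_tag
from nltk.tokenize import word_tokenize

sentence = "The dog runs fast."

# Tokenize, then tag each token with its Penn Treebank part-of-speech tag
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
print(pos_tags)
```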

This should output:
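```
[('The', 'DT'), ('dog', 'NN'), ('runs', 'VBZ'), ('fast', 'RB'), ('.', '.')]
```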

POS tagging is necessary for understanding sentence structure and for tasks that involve syntactic analysis, such as named entity recognition (NER) and machine translation.

Named Entity Recognition (NER)

Named Entity Recognition (NER) is an NLP task used to identify and classify named entities in a text, such as the names of people, organizations, locations, and dates. This technique is essential for understanding and extracting useful information from text.

Here is an example:
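The original example isn’t shown here, so this sketch uses a hypothetical sentence mentioning a landmark and a city:

```python
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize

# Hypothetical sentence with a landmark and a city
sentence = "The Eiffel Tower is located in Paris."

# NER in NLTK builds on tokenization and POS tagging
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

# Chunk the tagged tokens into a tree of named entities
tree = ne_chunk(pos_tags)
print(tree)
```

In the resulting tree, “Paris” typically shows up as a GPE (geopolitical entity) node, and the landmark is grouped into a named-entity chunk of its own.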

In this case, NER helps extract geographical references, such as the landmark and the city.

This can then be used in various tasks, such as summarizing articles, extracting information for knowledge graphs, and more.

Wrap-Up & Next Steps

In this guide, we’ve covered essential concepts in natural language processing using NLTK—from basic text preprocessing to slightly more involved techniques like lemmatization, POS tagging, and named entity recognition.

So where do you go from here? As you continue your NLP journey, here are a few next steps to consider:

  • Work on simple text classification problems using algorithms like logistic regression, support vector machines, and Naive Bayes.
  • Try sentiment analysis with tools like VADER or by training your own classifier.
  • Dive deeper into topic modeling or text summarization tasks.
  • Explore other NLP libraries such as spaCy or Hugging Face’s Transformers for state-of-the-art models.

What would you like to learn next? Let us know! Keep coding!


