How to easily get Stop Words for your Language Learning Model

Photo by Drew Beamer on Unsplash


Streamlining insights, empowering language models: Unlocking meaning by excluding stop words from textual analysis.

It is almost impossible to read the news without ChatGPT or similar LLMs being mentioned at least a couple of times. LLM stands for large language model, sometimes loosely expanded as language learning model, and refers to a group of AI models that work with gigantic datasets made up primarily of text.

Let us ask the most famous of all LLMs to see how it describes a language learning model.

ChatGPT: What is a language learning model?

LLMs are typically based on natural language processing (NLP) and machine learning techniques. They are trained on large amounts of language data to develop an understanding of grammar, vocabulary, syntax, and other linguistic elements. These models can then generate responses, provide explanations, or offer practice exercises to aid learners in their language acquisition journey.

LLMs can be used in various ways to support language learning. They can serve as virtual tutors, providing personalized feedback and guidance to learners. They can simulate conversations, allowing learners to practice speaking and listening skills.

Creating your own LLM requires deep knowledge of machine learning, statistics and possibly also linguistics, and is simply not something you do overnight. What you can do, however, is build a simple word cloud to visualize the word density in a text.

What are stop words and why are they important?

Stop words, in the context of word clouds, refer to common words that are often excluded from the analysis. These are words that appear frequently in a given language but typically do not carry significant meaning or contribute to the overall context. Examples of stop words in English include "the," "and," "of," "is," "in," and "to." By removing stop words from the text before generating a word cloud, the focus can be placed on more meaningful and contextually important words.

The exclusion of stop words allows the word cloud to highlight the keywords and concepts that are more distinctive and informative in the analyzed text. However, the selection of stop words may vary depending on the specific analysis and the desired outcome. Some applications may choose to include certain stop words that are considered important in their specific context.
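To make the idea concrete, here is a minimal sketch of the counting step that sits behind a word cloud. The sample sentence, the tiny hand-picked stop word set and the function name word_frequencies are assumptions made for illustration; a real stop word list would be much longer.

```python
import re
from collections import Counter

# Hand-picked sample for illustration only; a real list would be much longer.
SAMPLE_STOP_WORDS = {"the", "and", "of", "is", "in", "to", "a"}


def word_frequencies(text, stop_words):
    """Count how often each non-stop word occurs in the text."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


text = "The cat sat in the hat and the cat is happy to nap in a hat."
frequencies = word_frequencies(text, SAMPLE_STOP_WORDS)

# The most frequent remaining words are the ones a word cloud would draw largest.
print(frequencies.most_common(3))
```

Without the filter, "the" would be the most frequent word in the sample and would dominate the cloud; with it, only the content words are counted.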

One way to find stop words is to use the free Stop Word API. The API allows you to fetch the most common stop words for different languages and categories.

Using the Stop Word API

The API is really simple to use, and since it is free it requires no complicated authentication. The documentation can be found here, but let me show you a few examples of how to get the stop words into a Python list.

import urllib.request, json 

url = "https://stopwordapi.com/api/v1/stopwords?langs=en&categories=4"

# Fetch the data and parse the JSON response into the list words
with urllib.request.urlopen(url) as u:
    words = json.loads(u.read().decode())

Line 1: imports the urllib.request and json modules, used to perform the HTTP request and to parse the response.
Line 2: defines the URL to call in a variable called url.
Line 3: opens a connection to the URL.
Line 4: reads the response and passes it to the json module, which parses it into the words variable as a list.

What you can do now with the list is to simply remove each of its words from your text. Using the API lets you focus on extracting knowledge from the text instead of copying and pasting words from the internet into an array in your programming language.
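As a rough sketch of that clean-up step, the snippet below wraps the request in a function and filters a text against the returned list. The helper names build_url, fetch_stop_words and remove_stop_words are hypothetical, and so is the 10 second timeout; the langs and categories parameters are taken from the URL used earlier, and the code assumes the API returns a flat JSON list of strings like the one shown in this article.

```python
import json
import re
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the example URL in this article.
BASE_URL = "https://stopwordapi.com/api/v1/stopwords"


def build_url(langs="en", categories=4):
    """Build the request URL from the two query parameters used above."""
    query = urlencode({"langs": langs, "categories": categories})
    return f"{BASE_URL}?{query}"


def fetch_stop_words(url, timeout=10):
    """Fetch the stop words and return them as a set for fast lookups."""
    # Assumes the response body is a flat JSON list of strings.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return set(json.loads(response.read().decode()))


def remove_stop_words(text, stop_words):
    """Return the text with every stop word dropped (case-insensitive)."""
    # Tokens are runs of letters and apostrophes, so punctuation is lost too.
    tokens = re.findall(r"[A-Za-z']+", text)
    return " ".join(t for t in tokens if t.lower() not in stop_words)


# Usage (needs network access, so it is left commented out here):
# stop_words = fetch_stop_words(build_url())
# print(remove_stop_words("This is a text about the weather", stop_words))
```

Converting the list to a set is a design choice: membership checks on a set stay fast even when the text is long and the stop word list has hundreds of entries.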

The response you get from the API looks something like this.

[
    "a",
    "a's",
    "able",
    "about",
    "above",
    "according",
    ...
]

I hope you found this useful and that you will use the Stop Word API the next time you have to clean up your text.