How to Count the Number of Words in a String in R

Key points

  • A word count is the number of words in a text or a collection of texts. It can be useful for many purposes, such as measuring the length and complexity of a text or comparing texts from different sources.
  • In R, there are many ways to count words in a dataset, depending on the format and structure of your data. In this article, we focused on counting words in a character string, a sequence of letters, numbers, symbols, or spaces enclosed by quotation marks. A character string can represent a word, a sentence, a paragraph, or a document.
  • There are two main steps to count words in a character string using R: split the character string into individual words based on criteria, such as spaces, punctuation marks, or special symbols, and count the number of elements in the resulting vector or list.
  • Different functions and packages in R can help you perform these steps. This article introduced three methods: the base R function strsplit(), the stringr package from the tidyverse, and the stringi package.
  • We also showed some examples of how to count words in different types of datasets using R, such as counting words in a column of a data frame, counting words in a row of a data frame, or counting words in the entire dataset.
| Package | Function | Description |
|---|---|---|
| base R | strsplit() | Splits a character string into substrings based on a specified pattern |
| base R | length() | Returns the length of an object |
| base R | sum() | Returns the sum of all the elements in an object |
| base R | paste() | Concatenates strings with an optional separator |
| stringr | str_count() | Counts the number of matches of a pattern in a string |
| dplyr | mutate() | Adds new variables or modifies existing variables in a data frame |
| stringi | stri_count_words() | Counts the number of words in a string based on Unicode rules |

How to Count the Number of Words in a String Using R

Introduction

I’m Zubair Goraya, a Ph.D. scholar, certified data analyst, and freelancer who loves sharing knowledge and experience with R programming.

In this article, I will show you how to count the number of words in a dataset using R. This is a common task in text analysis and natural language processing, and it can help you explore and summarize your data. 

You will learn how to use different functions and packages in R to count words in a column, a row, or the entire dataset. You will also see some examples and resources to help you practice and learn more.

What is a word count, and why is it useful?

A word count is the number of words in a text or a collection of texts. Similar counts can be calculated at other levels, such as characters, sentences, paragraphs, or documents.

 A word count can be useful for many purposes, such as:

  • Measuring the length and complexity of a text
  • Comparing texts from different sources or genres
  • Analyzing the frequency and distribution of words or topics
  • Evaluating the readability and quality of a text
  • Estimating the time and effort required to read or write a text

In R, there are many ways to count words in a dataset, depending on the format and structure of your data. I will focus on counting words in a character string, a sequence of letters, numbers, symbols, or spaces enclosed by quotation marks. A character string can represent a word, a sentence, a paragraph, or a document.

How to count words in a character string using R?

There are two main steps to counting words in a character string using R:

  • Split the character string into words based on criteria such as spaces, punctuation marks, or special symbols.
  • Count the number of elements in the resulting vector or list.
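
The two steps can be sketched as a small base R helper. The function name count_words() is hypothetical (it is not part of R or of any package), and splitting on whitespace is only one possible definition of a word:

```r
# Hypothetical helper: split on whitespace, then count the pieces
count_words <- function(x) {
  # Step 1: split each string into tokens on runs of whitespace
  tokens <- strsplit(trimws(x), "\\s+")
  # Step 2: count the non-empty tokens of each element
  vapply(tokens, function(tok) sum(nzchar(tok)), integer(1))
}

count_words(c("I like cheese.", "  two   words ", ""))
# → 3 2 0
```

Trimming first avoids an empty leading token when a string starts with spaces. Missing values (NA) are not handled in this sketch.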

Different functions and packages in R can help you perform these steps.

This article will cover these three methods:

  • Using the base R function strsplit()
  • Using the stringr package from the tidyverse
  • Using the stringi package

Method 1: Using the base R function strsplit()

The base R function strsplit() splits a character string into substrings based on a specified pattern. The pattern can be a single character, such as a space " ", or a regular expression, such as "\\W+", which matches one or more consecutive non-word characters (anything other than letters, digits, and the underscore). The output of strsplit() is a list of character vectors, one per input element, each containing the substrings of that element.

For example, suppose we have a character vector called text that contains three sentences:

text <- c("I like cheese.", "I don't want to be here.", "I am alone.")

Split Sentence into Words

We can use strsplit() to split each sentence into words based on spaces:

words <- strsplit(text, " ")
words

The output is a list of three vectors containing each sentence's words. Note that the punctuation marks are still attached to some words.

Split Sentence based on Regular Expression

To remove them, we can use a regular expression that matches any non-word character:

words <- strsplit(text, "\\W+")
words

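Now the punctuation marks are gone. Note, however, that the apostrophe is also a non-word character, so "don't" is split into two pieces, "don" and "t".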
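
If contractions should stay in one piece, one option is to split on anything that is not a letter, a digit, or an apostrophe. This is only a sketch: the character class below is an assumption about what belongs to a word, and the variable name words_keep_apostrophe is made up for illustration.

```r
text <- c("I like cheese.", "I don't want to be here.", "I am alone.")

# Split on runs of characters that are neither alphanumeric nor an apostrophe,
# so "don't" survives as a single token
words_keep_apostrophe <- strsplit(text, "[^[:alnum:]']+")
words_keep_apostrophe[[2]]

# Number of tokens per sentence
lengths(words_keep_apostrophe)
# → 3 6 3
```

A string that begins with punctuation would produce an empty first token with this pattern, so real data may need trimming first.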

Count the Number of words in each Sentence

To count the number of words in each sentence, we can use the function length() to get the length of each vector:

word_count <- sapply(words, length)
word_count

The output is an integer vector containing each sentence's word count: 3, 7, and 3. The second sentence counts seven because "don't" was split in two.
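
As a side note, base R also offers lengths(), which returns the length of every list element in one call. A minimal sketch that reuses the same sentences:

```r
text <- c("I like cheese.", "I don't want to be here.", "I am alone.")
words <- strsplit(text, "\\W+")

# lengths() returns the length of each list element as an integer vector
word_count <- lengths(words)
word_count
# → 3 7 3

# Same result as the sapply() version
identical(word_count, sapply(words, length))
# → TRUE
```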

Total word count

To get the total word count for the entire text, we can use the function sum() to add up all the elements:

total_word_count <- sum(word_count)
total_word_count

The output is a single number, 13, representing the total word count of the entire text.

Method 2: Using the stringr package from the tidyverse

The stringr package is part of the tidyverse, a collection of packages that provide a consistent and easy-to-use set of tools for data manipulation and analysis in R. 

The stringr package provides a set of wrapper functions around the stringi package, a fast and comprehensive package for string processing.

The stringr functions have consistent names, arguments, and outputs, and they handle missing values and zero-length vectors in a uniform way.

Install and load the stringr package

To use the stringr package, you need to install it first:

install.packages("stringr")

Then you need to load it into your R session:

library(stringr)

Count words in a Character string

To count words in a character string using the stringr package, you can use the function str_count(), which counts the number of matches of a pattern in a string. 

The pattern can be a single character, a regular expression, or a fixed string. The output of str_count() is a numeric vector with the same length as the input.
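
The difference between a regular expression and a fixed string matters as soon as the pattern contains special characters. A small sketch, assuming stringr is installed; the input "a.b.c" is just an illustration:

```r
library(stringr)

x <- "a.b.c"

# As a regular expression, "." matches any character, so all five characters count
str_count(x, ".")
# → 5

# fixed() compares the pattern literally, so only the two dots count
str_count(x, fixed("."))
# → 2

# Escaping the dot in a regular expression gives the same literal match
str_count(x, "\\.")
# → 2
```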

For example, suppose we have the same character vector text as before:

text <- c("I like cheese.", "I don't want to be here.", "I am alone.")

We can use str_count() to count the spaces in each sentence:

word_count <- str_count(text, " ")
word_count

The output is a numeric vector that contains the number of spaces in each sentence (2, 5, and 2). The words here are separated by single spaces, so to get the number of words, we add one to each element:

word_count <- word_count + 1
word_count

Alternatively, we can use a regular expression that matches runs of word characters, "\\w+":

word_count <- str_count(text, "\\w+")
word_count

The output is 3, 7, and 3, which differs from the space-based count of 3, 6, and 3: the apostrophe is not a word character, so "don't" is counted as two matches, "don" and "t". To get the total word count for the entire text, we can use the function sum() as before:

total_word_count <- sum(word_count)
total_word_count

The total is 13, the same as with strsplit(text, "\\W+") in Method 1, and one more than the 12 obtained by counting spaces.
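
If "don't" should count as one word, stringr can also count word boundaries instead of regular-expression matches. The sketch below uses boundary("word"), which relies on the locale-aware word-break rules that stringr takes from stringi. It is an alternative to the patterns above, not the approach the examples in this section use.

```r
library(stringr)

text <- c("I like cheese.", "I don't want to be here.", "I am alone.")

# Count words using word-boundary rules rather than a regular expression
word_count <- str_count(text, boundary("word"))
word_count

# Total across all sentences
sum(word_count)
```

With these sentences, the contraction is kept together, so the second sentence yields six words rather than seven.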

Method 3: Using the stringi package

The stringi package is a fast and comprehensive package for string processing in R. It provides many functions compatible with Unicode, a standard for encoding and representing text in different languages and scripts. 

The stringi package also supports many features, such as locale-sensitive operations, case conversions, transliterations, collation order, and regular expressions.

Install and Load the stringi package 

To use the stringi package, you need to install it first:

install.packages("stringi")

Then you need to load it into your R session:

library(stringi)

Count words in a Character String

To count words in a character string using the stringi package, you can use the function stri_count_words(), which counts the number of words in a string based on Unicode rules. 

The output of stri_count_words() is a numeric vector with the same length as the input.

For example, suppose we have the same character vector text as before:

text <- c("I like cheese.", "I don't want to be here.", "I am alone.")

We can use stri_count_words() to count the number of words in each sentence:

word_count <- stri_count_words(text)
word_count

The output is 3, 6, and 3. Unlike the "\\w+" pattern, the Unicode word-boundary rules keep "don't" together as a single word. To get the total word count for the entire text, we can use the function sum() as before:

total_word_count <- sum(word_count)
total_word_count

The total is 12, one less than with the regular-expression approaches, which split "don't" in two.
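
To see how the three approaches relate, it can help to put their results side by side. The sketch below assumes stringr and stringi are installed; the column names of the comparison data frame are made up for illustration.

```r
library(stringr)
library(stringi)

text <- c("I like cheese.", "I don't want to be here.", "I am alone.")

comparison <- data.frame(
  sentence      = text,
  split_spaces  = lengths(strsplit(text, " ")),  # base R, split on single spaces
  regex_w       = str_count(text, "\\w+"),       # stringr, runs of word characters
  unicode_words = stri_count_words(text)         # stringi, Unicode word boundaries
)
comparison
```

Only the "\\w+" column treats "don't" as two tokens. The other two columns agree on these sentences, although they can diverge on text with repeated spaces or other punctuation.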

Examples and resources

In this section, I will show you examples of how to count words in different datasets using R. I will also provide some resources to learn more about text analysis and natural language processing using R.

Example 1: Counting words in a column of a data frame

Suppose we have a data frame called df that contains two columns: id and content. The column content contains some sentences that we want to count words for. Here is what the data frame looks like:

df <- data.frame(
  id = c(2356, 3456),
  content = c("I like cheese.\n Positive\nI don't want to be here.\n Negative\n", 
              "I am alone.\n Neutral\n")
)
df

We can use the methods described above to count the number of words in the column content of the data frame df. 

For example, using the stringr package, we can do the following:

library(stringr)
word_count <- str_count(df$content, "\\w+")
word_count


The output is a numeric vector that contains the word count for each element of the column content. To add this vector as a new column to the data frame, we can use the function mutate() from the dplyr package, which is also part of the tidyverse:

library(dplyr)
df <- mutate(df, word_count = word_count)
df


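The output is a modified data frame that contains a new column called word_count. The first element has 12 matches, and the second element has 4. The labels (Positive, Negative, Neutral) are counted too, and "don't" counts as two because the pattern "\\w+" splits it at the apostrophe.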
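
For readers who prefer to avoid extra packages, the same column can be added with base R alone. This is a sketch of an equivalent approach rather than the method used above: gregexpr() locates the matches, regmatches() extracts them, and lengths() counts them.

```r
df <- data.frame(
  id = c(2356, 3456),
  content = c("I like cheese.\n Positive\nI don't want to be here.\n Negative\n",
              "I am alone.\n Neutral\n"),
  stringsAsFactors = FALSE
)

# Extract every run of word characters from each cell, then count the matches
matches <- regmatches(df$content, gregexpr("\\w+", df$content))
df$word_count <- lengths(matches)
df$word_count
# → 12 4
```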

Example 2: Counting words in a row of a data frame

Suppose we have a different data frame called df2 that contains three columns: id, title, and body. The columns title and body contain some texts for which we want to count words. Here is what the data frame looks like:

df2 <- data.frame(
  id = c(1234, 5678),
  title = c("How to count words in R", "Why you should learn R"),
  body = c("In this article, I will show you how to count words in a dataset using R. This is a common task in text analysis and natural language processing, and it can help you explore and summarize your data. You will learn how to use different functions and packages in R to count words in a column, a row, or the entire dataset. You will also see some examples and resources to help you practice and learn more.", 
           "R is a powerful and versatile data analysis and visualization programming language. It has many features and advantages make it a popular choice among data scientists, statisticians, researchers, and educators. In this article, I will explain why you should learn R and how it can benefit you in your career and personal projects.")
)
df2


To count the number of words in each row of the data frame df2, we need to combine the texts from the columns title and body into one character string for each row.

One way to do this is to use the function paste() from base R, which concatenates strings with an optional separator. For example, we can use a space " " as the separator:

text <- paste(df2$title, df2$body, sep = " ")
text


The output is a character vector containing each row's concatenated texts. Then, we can use the methods described above to count words in this vector. For example, using the stringi package, we can do the following:

library(stringi)
word_count <- stri_count_words(text)
word_count

The output is a numeric vector containing each row's word count. To add this vector as a new column to the data frame, we can use the function mutate() as before:

df2 <- mutate(df2, word_count = word_count)
df2


The output is a modified data frame that contains a new column called word_count. We can see that the first row has 83 words (6 in the title and 77 in the body), and the second row has 58 (5 in the title and 53 in the body).
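
An alternative to pasting the columns together is to count each column separately and add the counts, which also keeps the per-column numbers available. The sketch below uses a small made-up data frame called articles and a hypothetical helper count_words(); neither comes from the example above.

```r
articles <- data.frame(
  id = c(1, 2),
  title = c("How to count words", "Why learn R"),
  body = c("Split the text, then count.", "It is free and widely used."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: whitespace-separated tokens per string
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

articles$title_words <- count_words(articles$title)
articles$body_words  <- count_words(articles$body)
articles$total_words <- articles$title_words + articles$body_words

articles[c("id", "title_words", "body_words", "total_words")]
```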

Example 3: Counting words in the entire dataset

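We want to count the number of words in the entire dataset, regardless of the columns or rows. One way to do this is to flatten the text columns of the data frame into a single character vector and then use any of the methods described above to count words in the vector. Calling as.character() directly on a data frame does not work for this, because it deparses each column into one string such as c("How to count words in R", ...), which adds spurious tokens to the count.

For example, using the stringr package, we can do the following:

library(stringr)
# Flatten the text columns into one character vector, one element per cell
text <- unlist(lapply(df2[c("title", "body")], as.character), use.names = FALSE)
word_count <- str_count(text, "\\w+")
total_word_count <- sum(word_count)
total_word_count

The output is a single number representing the word count of all text in the dataset. We can see that there are 141 words in the title and body columns of the data frame df2.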
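
When flattening a data frame, it is worth checking what the conversion actually returns. The sketch below uses a tiny made-up data frame called small to contrast as.character() on the whole data frame with a column-by-column conversion:

```r
small <- data.frame(
  id = c(1, 2),
  note = c("one two", "three"),
  stringsAsFactors = FALSE
)

# as.character() on a data frame returns one deparsed string per column,
# not one string per cell
deparsed <- as.character(small)
length(deparsed)
# → 2

# Converting column by column and flattening keeps one string per cell
cells <- unlist(lapply(small, as.character), use.names = FALSE)
cells
# → "1" "2" "one two" "three"
```

Counting words on the deparsed strings would also count the "c" of each c(...) call, which inflates the total.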

Conclusion

In this article, I have shown you how to count words in a dataset using R. You have learned how to use different functions and packages in R to count words in a character string, a column, a row, or the entire dataset. You have also seen some examples and resources to help you practice and learn more.

I hope you have found this article helpful and informative. If you have any questions or feedback, please comment below or contact me at info@data03.online.

Frequently Asked Questions (FAQs)

How do I install and load the packages used in this article?

You can use the function install.packages() with the package's name as an argument to install a package. For example, to install the stringr package, you can run install.packages("stringr"). To load a package into your R session, you can use the function library() with the package's name as an argument. For example, to load the stringr package, you can run library(stringr).

How do I choose which method to use to count words in R?

This question has no definitive answer, as different methods may have different advantages and disadvantages depending on your data and goals. Some factors that you may consider are:

  • The speed and performance of the functions
  • The consistency and compatibility of the functions with other packages or tools
  • The flexibility and customization of the functions for different scenarios or languages
  • The readability and simplicity of the code

Try different methods and compare their results and outputs to see which suits your needs best.

How do I count words in other languages using R?

Depending on the language and the script you are working with, you may need different functions or packages to handle different encoding systems or word segmentation rules.

For example, if you are working with Chinese texts, you may need a package like jiebaR, which performs word segmentation for Chinese. If you are working with Arabic texts, a package such as arabicStemR offers cleaning and stemming utilities for Arabic script. The function stri_count_words() from stringi also applies Unicode word-boundary rules that are not tied to English.

How do I count words in multiple columns or rows at once using R?

To count words across several columns at once, first combine the columns into one and then count as before. The function unite() from the tidyr package, which is also part of the tidyverse, combines multiple columns into one column with an optional separator. For example, suppose we have a data frame called df3 that contains three columns: id, title, and body. We can use unite() to combine the columns title and body into one column called text with a space as the separator, and then apply any of the word-counting methods above to df3$text:

library(tidyr)
df3 <- unite(df3, text, title, body, sep = " ")
df3
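
The FAQ does not define df3, so here is a self-contained sketch with a made-up df3 that carries the idea through to the word count. It assumes tidyr is installed, and the whitespace-based count is just one of the methods discussed in the article.

```r
library(tidyr)

# Hypothetical data, only for illustration
df3 <- data.frame(
  id = c(1, 2),
  title = c("Counting words", "Learning R"),
  body = c("Split, then count.", "Practice every day."),
  stringsAsFactors = FALSE
)

# Combine title and body into a single column called text
combined <- unite(df3, text, title, body, sep = " ")

# Count whitespace-separated tokens in the combined column
combined$word_count <- lengths(strsplit(combined$text, "\\s+"))
combined
```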

How do I count words in a document or a file using R?

To count words in a document or a file, you first need to read it into R. For a plain text file, you can use the function readLines() from base R, which reads text lines from a connection (such as a file) into a character vector with one element per line. For example, suppose we have a text file called example.txt that contains some text we want to count words for. We can use readLines() to read the file into R and then count words in the resulting vector with any of the methods above:

text <- readLines("example.txt")
text
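
Here is a minimal end-to-end sketch of that workflow. Since example.txt may not exist on your machine, the sketch writes a temporary file first; the file contents are made up for illustration.

```r
# Create a small temporary text file to stand in for example.txt
path <- tempfile(fileext = ".txt")
writeLines(c("I like cheese.", "", "I am alone."), path)

# Read the file: one element per line
lines <- readLines(path)

# Count whitespace-separated tokens per line (empty lines give 0)
words_per_line <- lengths(regmatches(lines, gregexpr("\\S+", lines)))
words_per_line
# → 3 0 3

# Total number of words in the file
sum(words_per_line)
# → 6

unlink(path)  # clean up the temporary file
```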

How do I count words in different formats or sources using R?

If you want to count words in different formats or sources using R, such as PDF files, HTML files, web pages, tweets, etc., you may need to use some functions or packages to help you extract and process the texts from these sources. For example, if you want to count words in a PDF file, you may need to use a package like pdftools to extract text from PDF documents. If you want to count words in an HTML file or a web page, you may need to use a package like rvest that can scrape web data. If you want to count words in tweets, you may need to use a package like rtweet that can access Twitter’s API.

How do I count other elements in a text using R?

To count other elements in a text, such as characters, sentences, or paragraphs, you need functions that split or identify these elements based on different criteria. To count characters, you can use the function nchar() from base R, which returns the number of characters in each string. To count sentences, you can use the function str_count() from the stringr package with a regular expression that approximates sentence endings, such as "[.!?]+(\\s+|$)". Abbreviations like "Dr." will be counted as sentence endings too. To count paragraphs, use str_count() with a regular expression matching the blank lines between paragraphs, such as "\\n\\s*\\n", and add one, since the pattern counts separators rather than paragraphs.
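
These counts can be tried out in base R without any packages. In the sketch below, count_matches() is a hypothetical helper, the sample text is made up, and the sentence pattern is only a rough heuristic:

```r
doc <- "I like cheese. Do you?\n\nI am alone."

# Characters, including spaces and newlines
nchar(doc)

# Hypothetical helper: number of regular-expression matches per string
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# Sentences: end punctuation followed by whitespace or the end of the text
sentences <- count_matches(doc, "[.!?]+(\\s+|$)")
sentences
# → 3

# Paragraphs: number of blank-line separators plus one
paragraphs <- count_matches(doc, "\\n\\s*\\n") + 1L
paragraphs
# → 2
```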

How do I visualize or summarize the word counts using R?

If you want to visualize or summarize the word counts using R, you may need to use some functions or packages to help you create plots or tables based on your data. For example, if you want to create a bar plot of the word counts for each sentence in your text using R, you can use the function barplot() from base R, which creates a bar plot with vertical or horizontal bars, and which takes a vector or matrix of values as an argument. For example, suppose we have a character vector called text that contains three sentences:

text <- c("I like cheese.", "I don't want to be here.", "I am alone.")
library(stringr)
word_count <- str_count(text, "\\w+")
word_count
barplot(word_count, names.arg = text, main = "Word counts for each sentence", xlab = "Sentence", ylab = "Word count")

We can see that the second sentence has the highest word count, while the first and third sentences have the same word count.

If you want to create a table of the word counts for each sentence using R, you can use the function table() from base R, which creates a contingency table of counts for factors or categorical variables. For example, suppose we have the same character vector text and the same numeric vector word_count as before. We can use the table() function to cross-tabulate the sentences against their word counts:

table(text, word_count)

#>                           word_count
#> text                       3 7
#>   I am alone.              1 0
#>   I don't want to be here. 0 1
#>   I like cheese.           1 0

The output is a table that shows the frequency of each word count for each sentence. We can see that there are two sentences with 3 words and one sentence with 7, because the pattern "\\w+" counts "don't" as two matches.
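
For a more compact summary, the counts can be tabulated on their own or kept next to the sentences in a data frame. A base R sketch follows; the name summary_table is made up for illustration, and the count uses gregexpr() instead of stringr so that the block runs without extra packages.

```r
text <- c("I like cheese.", "I don't want to be here.", "I am alone.")

# Count runs of word characters per sentence
word_count <- lengths(regmatches(text, gregexpr("\\w+", text)))

# One row per sentence
summary_table <- data.frame(sentence = text, words = word_count)
summary_table

# How many sentences have each word count
table(word_count)

# Average number of words per sentence
mean(word_count)
```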


About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.