Key points
- Descriptive analysis is the process of summarizing, describing, and presenting the main features of a dataset.
- Descriptive statistics are numerical or graphical summaries of a dataset, such as mean, median, mode, standard deviation, quartiles, range, skewness, kurtosis, correlation, etc.
- R is a powerful and versatile tool for data science, which provides various functions and packages for calculating and displaying descriptive statistics.
- The summary() and sapply() functions are built-in functions in R that can obtain descriptive statistics for a data frame or a list of data frames.
- The descstat and summarytools packages are add-on packages for R that can compute and display descriptive statistics conveniently and generate a descriptive analysis report in various formats.
Code used in this article
| Package | Function | Description |
| --- | --- | --- |
| base | summary() | Summarize an object |
| base | sapply() | Apply function to elements |
| base | table() | Create frequency table |
| base | cut() | Create factor from numeric |
| base | quantile() | Calculate sample quantiles |
| descstat | desc() | Compute descriptive statistics |
| descstat | descby() | Descriptive stats by group |
| descstat | descall() | Descriptive stats for all |
| descstat | descplot() | Display graphical stats |
| summarytools | dfSummary() | Summary for data frame |
| summarytools | freq() | Frequency table |
| summarytools | descr() | Display descriptive stats |
| summarytools | ctable() | Cross-tabulation |
| summarytools | view() | Display summarytools results in the RStudio Viewer or a browser |
Descriptive Analysis in R
Descriptive analysis is the process of summarizing, describing, and presenting the main features of a dataset. It helps us understand the data's distribution, central tendency, and variability and identify outliers or anomalies. Descriptive analysis is often the first step in any data analysis project, as it provides a quick overview of the data and helps us decide what further steps to take.
In this article, you will learn:
- What the main types of descriptive statistics are and how to calculate them in R
- How to use the summary() and sapply() functions to obtain descriptive statistics for a data frame
- How to use the descstat package to compute and display descriptive statistics in a convenient and elegant way
- How to create and interpret various graphical representations of descriptive statistics, such as histograms, boxplots, scatterplots, and correlation plots
- How to generate a descriptive analysis report using the summarytools package
By the end of this article, you will know about descriptive analysis in R and be able to apply it to your data. You will also get tips and tricks on improving your data analysis skills and avoiding common pitfalls.
What is Descriptive Statistics?
Descriptive statistics are numerical or graphical summaries of a dataset. They help us describe the basic characteristics of the data, such as the range, mean, median, mode, standard deviation, quartiles, interquartile range, skewness, kurtosis, correlation, and so on. There are two main types of descriptive statistics:
- Measures of Central Tendency
- Measures of Dispersion.
Measures of Central Tendency
Measures of central tendency are values that indicate a dataset's center or typical value. They include:
Mean
The arithmetic average of the data values. It is calculated by adding up all the values and dividing by the number of values. The mean is sensitive to outliers, which can skew it up or down.
Median
The middle value of the data when it is arranged in ascending or descending order. It is calculated by finding the value that splits the data into two halves. With an even number of values, it is the average of the two middle values. The median is less sensitive to outliers than the mean, as it only depends on the middle values.
Mode
The most frequent value in the data. It is calculated by finding the value that occurs most in the data. The mode can be useful for categorical or discrete data, showing the most common category or value. There can be multiple modes in a dataset or none at all.
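The three measures above can be sketched in R as follows. This is a minimal illustration on a made-up vector of exam scores. Base R provides mean() and median(), but it has no built-in function for the statistical mode, so get_mode() is a hypothetical helper defined here for illustration only (it assumes numeric input).

```r
# Hypothetical sample of eight exam scores (made up for illustration)
scores <- c(70, 85, 85, 90, 72, 85, 60, 100)

mean_score   <- mean(scores)    # arithmetic average
median_score <- median(scores)  # middle value of the sorted data

# Base R's mode() reports the storage type, not the most frequent value,
# so we define a small helper. get_mode() is a hypothetical name; it
# returns every value tied for the highest frequency.
get_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
mode_score <- get_mode(scores)

print(c(mean = mean_score, median = median_score, mode = mode_score))
```

Because get_mode() returns all values tied for the highest count, it can return more than one number when a dataset has several modes.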
Measures of Dispersion
Measures of dispersion are values that indicate the spread or variability of the data. They include:
Range
The difference between the maximum and minimum values in the data. It is calculated by subtracting the minimum value from the maximum value. The range shows the overall extent of the data, but it does not account for the distribution of the values.
Variance
The average of the squared deviations of the data values from the mean. It is calculated by adding up the squared differences between each value and the mean and dividing by the number of values (for a population) or by the number of values minus one (for a sample, which is what R's var() function computes). The variance measures how far the data values are spread around the mean, but it is not in the same units as the data.
Standard Deviation
The square root of the variance. The standard deviation measures how far the data values are spread around the mean and is in the same units as the data.
Quartiles
The values that divide the data into four equal parts when arranged in ascending or descending order. They are calculated by finding the medians of the data's lower and upper half. The quartiles are:
- First Quartile (Q1): The median of the lower half of the data. It is also the 25th percentile of the data, meaning 25% of the data values are below it.
- Second Quartile (Q2): The median of the whole data. It is also the 50th percentile of the data, meaning that 50% of the data values are below it.
- Third Quartile (Q3): The median of the upper half of the data. It is also the 75th percentile of the data, meaning 75% of the data values are below it.
- Interquartile Range (IQR): The difference between the third and first quartiles. It is calculated by subtracting Q1 from Q3. The IQR measures the spread of the middle 50% of the data and is less affected by outliers than the range.
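The measures of dispersion above can be sketched in R on a small made-up vector. Two details are worth noting as assumptions of this sketch: var() and sd() use the sample formula (dividing by n - 1), and quantile() interpolates between values by default, so its quartiles can differ slightly from the "median of each half" rule described above. The fivenum() function follows the median-of-halves idea more closely.

```r
# Made-up values, already sorted for readability
x <- c(2, 4, 4, 5, 7, 9, 11, 14)

value_range <- max(x) - min(x)          # range as a single number
sample_var  <- var(x)                   # sample variance, divides by n - 1
pop_var     <- mean((x - mean(x))^2)    # population variance, divides by n
std_dev     <- sd(x)                    # square root of the sample variance

quartiles <- quantile(x)                # min, Q1, median, Q3, max (interpolated)
hinges    <- fivenum(x)                 # Tukey's five-number summary
iqr       <- IQR(x)                     # Q3 - Q1 based on quantile()

print(c(range = value_range, var = sample_var, sd = std_dev, IQR = iqr))
print(quartiles)
print(hinges)
```

On this vector, quantile() reports a third quartile of 9.5, while the median of the upper half (7, 9, 11, 14) is 10, which is the value fivenum() returns.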
How to Calculate Descriptive Statistics in R
There are many ways to calculate descriptive statistics in R, but in this article, we will focus on two built-in and easy-to-use functions: summary() and sapply().
Using the summary() Function
The summary() function is a generic function that produces a summary of an object, so it returns different results depending on the object type. For example, if the object is a numeric vector, summary() returns the vector's minimum, first quartile, median, mean, third quartile, and maximum. If the object is a factor, summary() returns the frequency of each factor level. If the object is a data frame, summary() returns the summary of each data frame column.
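To make this behavior concrete, here is a small sketch that calls summary() on a numeric vector, a factor, and a data frame. The vectors and their names are made up for illustration.

```r
# Made-up inputs of three different types
num_vec <- c(3, 8, 1, 9, 4)
fac_vec <- factor(c("low", "high", "low", "mid", "low"))
small_df <- data.frame(score = num_vec, level = fac_vec)

num_summary <- summary(num_vec)   # six-number summary of a numeric vector
fac_summary <- summary(fac_vec)   # counts per factor level
df_summary  <- summary(small_df)  # one summary per column

print(num_summary)
print(fac_summary)
print(df_summary)
```

The numeric summary is a named vector, so individual statistics can be extracted by name, for example num_summary[["Median"]].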
In this tutorial, we will use the iris dataset, a built-in dataset in R that contains the measurements of 150 iris flowers from three species: setosa, versicolor, and virginica. The dataset has five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
To load the iris dataset and assign it to a variable called dat, we can use the following code:
# load the iris dataset and rename it dat
dat <- iris
To view the first six rows and the structure of the dataset, we can use the following code:
# view the first six rows of the dataset
head(dat)
# view the structure of the dataset
str(dat)
We can see that the dataset has 150 observations (rows) and five variables (columns). The first four variables are numeric, and the last is a factor with three levels.
To calculate the descriptive statistics for each variable in the dataset, we can use the summary() function as follows:
# calculate the descriptive statistics for each variable
summary(dat)
We can see that the summary() function returns the minimum, first quartile, median, mean, third quartile, and maximum for each numeric variable and the frequency of each level for the factor variable. This gives us a quick overview of the data, but it does not show the standard deviation, variance, or interquartile range, which are also important measures of dispersion, and the range has to be worked out from the minimum and maximum. To calculate these statistics directly, we can use another function: sapply().
Using the sapply() Function
The sapply() function is a function that applies a function to each element of a vector, list, or data frame and returns a simplified result. For example, if we apply the mean() function to a data frame using sapply(), we will get a vector of the mean values of each data frame column. Similarly, we can apply other functions, such as sd(), var(), range(), or IQR(), to obtain other descriptive statistics.
To illustrate how to use the sapply() function, we will use the same iris dataset that we used before. To calculate the standard deviation, variance, range, and interquartile range for each numeric variable in the dataset, we can use the following code:
# calculate the standard deviation for each numeric variable
sapply(dat[, -5], sd)
# calculate the variance for each numeric variable
sapply(dat[, -5], var)
# calculate the range for each numeric variable
sapply(dat[, -5], range)
# calculate the interquartile range for each numeric variable
sapply(dat[, -5], IQR)
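We can see that the shape of the result depends on the function we apply. sapply() returns a named vector when the function returns a single value per column (as sd(), var(), and IQR() do) and a matrix when it returns several values per column (range() returns the minimum and the maximum). If the results have different lengths, sapply() falls back to a list.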
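Instead of calling sapply() once per statistic, we can also pass a function that returns several statistics at once. The following is a minimal sketch of that idea; the anonymous function and the names of its elements are our own choices, not part of any package.

```r
# load the iris dataset and keep the four numeric columns
dat <- iris
num_dat <- dat[, -5]

# apply one function that returns several named statistics per column;
# sapply() simplifies the result to a matrix (statistics in rows)
stats <- sapply(num_dat, function(x) {
  c(mean = mean(x), sd = sd(x), var = var(x),
    min = min(x), max = max(x), IQR = IQR(x))
})

print(round(stats, 2))
```

Each column of the resulting matrix corresponds to one variable, so a single value can be looked up with, for example, stats["sd", "Petal.Width"].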
Note that we use dat[, -5] to select only the numeric variables, as the fifth variable (Species) is a factor and cannot be used with these functions. To calculate the descriptive statistics for the factor variable, we can use the table() function, which returns the frequency of each factor level.
# calculate the frequency of each level of the factor variable
table(dat$Species)
We can see that the table() function returns a vector of the frequency of each level of the factor variable, which is the same as the summary() function. However, the table() function can also be used to create cross-tabulations of two or more factors, which can be useful for exploring the relationship between categorical variables.
For example, if we want to see how the species of the iris flowers vary by the quartile of the sepal length, we can use the following code:
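# create a cross-tabulation of species by quartile of sepal length
# include.lowest = TRUE keeps the minimum value in the first interval
table(dat$Species, cut(dat$Sepal.Length, breaks = quantile(dat$Sepal.Length), include.lowest = TRUE))
We can see that the table() function returns a matrix of the frequency of each combination of the species and the quartile of the sepal length. We use the cut() function to create a factor variable from the sepal length variable, using the quantile() function to specify the breaks. By default, cut() builds intervals that exclude their lower bound, so we set include.lowest = TRUE to keep the smallest sepal length in the first interval instead of turning it into a missing value.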
This table shows us how the species of the iris flowers are distributed across the range of the sepal length. We can see that most of the setosa flowers have a sepal length between 4.3 and 5.1 cm, most of the versicolor flowers have a sepal length between 5.1 and 6.4 cm, and most of the virginica flowers have a sepal length between 5.8 and 7.9 cm. This suggests that sepal length differs noticeably among the species, although a formal statistical test would be needed to call the difference significant.
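Raw counts can be easier to compare when they are converted to proportions. The sketch below is one possible way to do this with prop.table(), which is part of base R. It uses include.lowest = TRUE in cut() because the lowest break is otherwise excluded and the smallest observation would be dropped from the table.

```r
# bin sepal length by its quartiles, keeping the minimum in the first bin
sepal_bins <- cut(iris$Sepal.Length,
                  breaks = quantile(iris$Sepal.Length),
                  include.lowest = TRUE)

# counts of each species in each quartile bin
counts <- table(iris$Species, sepal_bins)

# share of each species that falls in each bin (each row sums to 1)
row_props <- prop.table(counts, margin = 1)

print(counts)
print(round(row_props, 2))
```

With margin = 1 the proportions are computed within each species. Using margin = 2 instead would give the species mix within each sepal-length bin.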
Conclusion
In this article, we learned how to perform descriptive analysis in R, a powerful and versatile tool for data science. We have learned:
- What are the main types of descriptive statistics and how to calculate them in R
- How to use the summary() and sapply() functions to obtain descriptive statistics for a data frame or a list of data frames.
We have also seen examples of using these functions and packages on the iris dataset, a built-in dataset in R that contains the measurements of 150 iris flowers from three species. We have learned how to customize the appearance and content of the output and export or save the output as a report in various formats, such as HTML, Markdown, PDF, and Word.
Descriptive analysis is essential for any data analyst, as it helps us understand the data and decide further steps. Using R, we can perform descriptive analysis efficiently and effectively, producing high-quality results that can communicate our findings to others.
Frequently Asked Questions (FAQs)
What are the descriptive statistics in R?
In R, descriptive statistics are numerical measures that summarize and describe the main features of a dataset. Mean, median, mode, standard deviation, variance, range, minimum, maximum, quartiles, and percentiles are some of the most commonly used descriptive statistics in R.
What is descriptive statistics? Explain with an example.
Descriptive statistics is a branch of statistics that deals with organizing, summarizing, and presenting data in an understandable way. It gives a concise summary of the data and helps us understand its properties. Assume we have a dataset that contains the heights (in inches) of a group of people. We can compute the mean height, a single value that summarizes the typical height of the group, and the standard deviation, which describes how much the heights vary around that mean.
What are descriptive statistics on the dataset?
Descriptive statistics on a dataset involve analyzing and summarizing the data to gain insights into its characteristics. They include measures of the dataset's central tendency (mean, median, mode), dispersion (standard deviation, range), and shape (skewness, kurtosis).
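Base R has no built-in functions for skewness and kurtosis, so the following is a minimal sketch that computes them from the moments of the data. The function names skewness_moment() and kurtosis_moment() are hypothetical, and the formulas are the simple moment-based versions (other definitions with small-sample corrections exist and give slightly different numbers).

```r
# moment-based skewness: third central moment divided by (second central moment)^(3/2)
skewness_moment <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# moment-based kurtosis: fourth central moment divided by (second central moment)^2
# (a normal distribution has a kurtosis of about 3 under this definition)
kurtosis_moment <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

symmetric    <- c(1, 2, 3, 4, 5)            # made-up symmetric data
right_skewed <- c(1, 2, 2, 3, 3, 3, 10)     # made-up data with a long right tail

print(skewness_moment(symmetric))
print(skewness_moment(right_skewed))
print(kurtosis_moment(symmetric))
```

A skewness close to zero indicates a roughly symmetric distribution, while a positive value indicates a longer right tail.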
What are descriptive statistics of a data frame?
A data frame in R is a two-dimensional tabular data structure that organizes information into rows and columns. Descriptive statistics of a data frame involve calculating various statistical measures for each column or variable within the data frame. These measures summarize the distribution and properties of the data in the data frame.
What are the three basic descriptive statistics?
The three basic descriptive statistics are:
- Mean: The average value of a dataset, obtained by adding all values and dividing by the number of observations.
- Median: The value in the center when the values in a dataset are sorted in ascending or descending order.
- Standard deviation: A measure of the spread of data around the mean. It offers information on the dataset's variability.
Which two are examples of descriptive statistics?
Two examples of descriptive statistics are the mean and the standard deviation. The mean provides information about the central tendency of the dataset, while the standard deviation gives insights into the variability or spread of the data.
What is the main purpose of descriptive statistics?
The primary goal of descriptive statistics is to summarize and describe the main features of a dataset. It helps us understand the data's central tendency, dispersion, and distribution. Descriptive statistics give useful insights and support sound data analysis and decision-making.
What are the 5 descriptive statistics?
The five common descriptive statistics are Median, Mean, Mode, Range, and Standard Deviation.
What is descriptive statistics used for?
Data is summarised, described, and analyzed using descriptive statistics. They give information on a dataset's features, such as its central tendency, variability, and distribution. Descriptive statistics aid in the understanding of trends, the comparison of data, the identification of outliers, and the support of decision-making.
Which is the best definition of descriptive statistics?
The best definition of descriptive statistics is the use of numerical and graphical summaries to describe the key features of a dataset. Descriptive statistics give a concise description of the data and aid in its interpretation and analysis.
How do you analyze descriptive statistics?
To analyze descriptive statistics, you compute and examine the dataset's measures of central tendency, dispersion, and distribution. This includes determining the mean, median, mode, standard deviation, range, and other relevant statistics. Data visualization techniques such as histograms, box plots, and scatter plots can be used to interpret the data visually.
What is the use of descriptive statistics in machine learning?
In machine learning, descriptive statistics are used to understand and analyze the characteristics of datasets. They provide insights into data patterns, distributions, and relationships between variables. Descriptive statistics help preprocess data, identify outliers, select features, and gain initial insights before applying machine learning algorithms.
What are data visualization and descriptive statistics?
Data visualization represents data in visual forms like charts, graphs, and plots. It complements descriptive statistics by visually representing the data, making patterns and trends easier to understand. Descriptive statistics provide numerical measures, while data visualization enhances the interpretation and communication of those measures.
What are descriptive and inferential statistics on the dataset?
Inferential statistics includes making inferences and drawing conclusions about a population based on a sample, whereas descriptive statistics involves summarising and explaining the key elements of a dataset. Descriptive statistics shed light on the observed data, whereas inferential statistics aid in generalizing those conclusions to a larger population.
Is correlation a descriptive statistic?
Yes. A correlation coefficient is a statistical measure of the strength and direction of the relationship between two variables, and computing it for a dataset is a descriptive summary of that dataset. It becomes part of inferential statistics only when we test whether the correlation is statistically significant or use it to draw conclusions about a population based on a sample.
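Computing a correlation coefficient in R can be sketched with the base cor() function, which uses the Pearson method by default. The example below uses the built-in iris dataset; the variable names petal_cor and cor_matrix are our own.

```r
# Pearson correlation between two numeric variables
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)

# correlation matrix for all four numeric columns
cor_matrix <- cor(iris[, 1:4])

print(round(petal_cor, 3))
print(round(cor_matrix, 2))
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one. Petal length and petal width are strongly positively correlated in this dataset.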
Is standard deviation a descriptive statistic?
Yes, the standard deviation is a descriptive statistic. It measures the spread or dispersion of data around the mean and provides insights into the variability of the dataset. Standard deviation helps understand how individual data points deviate from the average, giving a sense of the data's distribution.
What are the advantages of descriptive statistics?
The advantages of descriptive statistics include:
- Summarizing complex data into meaningful measures.
- Providing insights into the central tendency, variability, and distribution of data.
- Facilitating comparisons between different datasets or subgroups.
- Detecting outliers or unusual observations.
- Supporting decision-making by presenting data concisely and understandably.
Is Chi-Square a descriptive statistic?
No, Chi-Square is not a descriptive statistic. It is a statistical test used to determine if there is a significant association between categorical variables. Chi-Square falls under inferential statistics, as it tests hypotheses and makes inferences about the population based on sample data.
Is ANOVA descriptive or inferential?
ANOVA (Analysis of Variance) is an inferential statistical technique used to compare means between multiple groups. It tests whether there are any significant differences among the group means. Therefore, ANOVA is not considered a descriptive statistic but rather an inferential statistic.
Is the t-test a descriptive statistic?
The t-test is an inferential statistical test used to assess whether or not there is a significant difference in the means of two groups. It is not a descriptive statistic since it requires hypothesis testing and drawing conclusions about the population based on sample data.
Is SPSS a descriptive statistic?
SPSS (Statistical Package for the Social Sciences) is a software package used for statistical analysis. While it provides tools and functionality to calculate descriptive statistics, it is not a descriptive statistic. SPSS facilitates calculating, analyzing, and presenting various statistical measures, including descriptive statistics.
What are the limitations of descriptive statistics?
The limitations of descriptive statistics include the following:
- Descriptive statistics provide a summary of the data but may not capture the full complexity or nuances of the dataset.
- Descriptive statistics cannot establish causation or explain underlying relationships between variables.
- Descriptive statistics may be influenced by outliers or extreme values in the data.
- Descriptive statistics alone may not be enough when dealing with large or complex datasets that require more advanced analytical techniques.
What are the methods of descriptive analysis?
Methods of descriptive analysis include calculating measures of central tendency (mean, median, mode), measures of dispersion (standard deviation, range), analyzing shape or distribution (skewness, kurtosis), and using data visualization techniques (charts, graphs, plots) to present the data in a meaningful way.
Are descriptive statistics nominal or ordinal?
Descriptive statistics can be applied to both nominal and ordinal data, although the appropriate measures differ. Nominal data represents categories or groups with no inherent order and is usually summarized with frequencies and the mode. Ordinal data represents categories with a specific order or ranking, so it can also be summarized with the median and percentiles.
What is the z-score in descriptive statistics?
The z-score, also known as the standard score, is a measure in descriptive statistics that indicates how many standard deviations a data point is away from the mean. It allows for standardizing and comparing values from different distributions by transforming them into a common scale.
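As a minimal sketch, z-scores can be computed by hand or with the base scale() function, which centers and scales a vector using the sample standard deviation by default. The heights below are made-up values.

```r
# made-up heights in inches
heights <- c(60, 62, 65, 68, 70)

# z-score by hand: distance from the mean in standard deviations
z_manual <- (heights - mean(heights)) / sd(heights)

# scale() returns a one-column matrix, so we convert it back to a vector
z_scale <- as.numeric(scale(heights))

print(round(z_manual, 3))
```

After standardizing, the values have a mean of 0 and a standard deviation of 1, which is what makes z-scores comparable across variables measured on different scales.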
What are descriptive variables?
Descriptive variables are variables that are used to describe or summarize a dataset. They provide information about the data's characteristics, attributes, or properties. Descriptive variables can be categorical (e.g., gender, occupation) or numerical (e.g., age, income) and are used in descriptive statistics to analyze and understand the dataset.
What are descriptive statistics, and how are they used in R and RStudio?
Descriptive statistics are numerical or visual summaries of the important characteristics of a dataset. In R and RStudio, you can use functions like summary(), mean(), median(), and sd() to compute various summary statistics for your data.
How can I compute summary statistics for a variable in a data frame using R?
You can use the summarise() function from the dplyr package to compute summary statistics for a specific variable in a data frame. This function allows you to calculate various statistics like mean, median, standard deviation, and more.
What is the significance of handling missing values when computing descriptive statistics?
Missing values can affect the accuracy of your summary statistics calculations. In R, functions such as mean() and sd() return NA when the data contains missing values. You can handle this by passing na.rm = TRUE to the function or by using na.omit() to exclude rows with missing data points before computing summary statistics.
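The effect of missing values can be sketched on a small made-up vector and data frame. The example shows that mean() returns NA unless na.rm = TRUE is set, and that na.omit() drops every row of a data frame that contains at least one NA.

```r
# made-up vector with two missing values
x <- c(4, 8, NA, 15, 16, NA, 23)

n_missing  <- sum(is.na(x))          # count the missing values
mean_raw   <- mean(x)                # NA, because of the missing values
mean_clean <- mean(x, na.rm = TRUE)  # mean of the five observed values

# made-up data frame with missing values in both columns
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
df_complete <- na.omit(df)           # keeps only rows without any NA

print(c(missing = n_missing, mean = mean_clean))
print(df_complete)
```

Note that na.omit() removes a whole row even if only one column is missing, so it can discard more data than using na.rm = TRUE column by column.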
How can I create a contingency table for two categorical variables in R?
You can use the table() function in R to create a contingency table that displays the frequency counts of the combinations of two categorical variables.
How do I compute summary statistics by group in R?
To compute summary statistics by group in R, you can use functions like group_by() and summarize() from the dplyr package. By grouping your data based on a specific variable and summarizing the results, you can obtain summary statistics for each group.
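If dplyr is not available, grouped summaries can also be computed with base R. The sketch below is a base R alternative to the dplyr approach described above, not a reproduction of it. It uses aggregate() and tapply() on the built-in iris dataset.

```r
# mean sepal length per species, returned as a data frame
group_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# standard deviation of sepal length per species, returned as a named vector
group_sds <- tapply(iris$Sepal.Length, iris$Species, sd)

print(group_means)
print(round(group_sds, 3))
```

aggregate() is convenient when the result should stay in a data frame, while tapply() is a quick way to get one number per group.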
How can I calculate the median absolute deviation for numerical variables in R?
You can calculate the median absolute deviation (MAD) in R using the mad() function. MAD is a robust measure of the variability or dispersion of a dataset that is less sensitive to outliers than standard deviation.
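A short sketch on made-up data shows how mad() behaves next to sd() when an outlier is present. By default, mad() multiplies the raw median absolute deviation by the constant 1.4826 so that it is comparable to the standard deviation for normally distributed data. Setting constant = 1 returns the unscaled value.

```r
# made-up data with one extreme value
x <- c(1, 2, 3, 4, 100)

mad_scaled <- mad(x)                # median absolute deviation times 1.4826
mad_raw    <- mad(x, constant = 1)  # plain median of |x - median(x)|
std_dev    <- sd(x)                 # strongly inflated by the outlier

print(c(mad_scaled = mad_scaled, mad_raw = mad_raw, sd = std_dev))
```

Here the single outlier pushes the standard deviation above 40, while the MAD stays close to the spread of the other four values.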