Contingency Tables in R: Insights from a PhD

Key points

  • A contingency table shows how often the categories of two or more variables occur together.
  • You can make a two-way contingency table in R with the table() function (or xtabs() when the data are already stored as counts), and add totals and proportions with the addmargins() and prop.table() functions.
  • You can draw a mosaic plot in R by passing a table to the plot() function. A mosaic plot is a picture of a contingency table in which the area of each tile is proportional to the count in that cell.
  • You can make a three-way contingency table in R with the ftable() function, which prints the counts for three variables in a single flat layout.
  • You can test whether two variables are independent in R with the chisq.test() function. It compares how often each combination of categories actually occurs with how often it would be expected to occur if the variables were unrelated.
  • You can measure how strong the association between two categorical variables is with the phi coefficient or Cramer’s V, both derived from the chi-squared statistic. They range from 0 (no association) to 1 (perfect association). The cor() function is meant for numeric variables and does not return these measures for a table.

Function          Description
data()            Loads a built-in dataset in R
str()             Displays the structure of an object
table()           Creates a contingency table by counting the rows of raw data (one row per observation)
xtabs()           Creates a contingency table from a formula, summing a count column such as Freq when one is given
addmargins()      Adds row and column sums to a contingency table
prop.table()      Converts frequencies in a contingency table into proportions
plot()            Creates a mosaic plot from a table object
ftable()          Creates a flat contingency table
as.data.frame()   Converts a table or array into a data frame
chisq.test()      Performs a chi-squared test of independence on a table object
cor()             Calculates correlations between numeric variables (the phi coefficient and Cramer’s V are computed from the chi-squared statistic instead)

Hi, I am Zubair Goraya, a PhD Scholar, Certified data analyst and freelancer with 5 years of experience. I’m also a contributor to Data Analysis, a website that provides tutorials related to RStudio.

I am writing this article based on my PhD research, where I faced many challenges in analysing categorical data and found contingency tables to be a useful tool for exploring and testing the relationship between variables. In this article, I’ll show you how to create and interpret a contingency table in R.

What is a Contingency table?

A contingency table, also known as a cross-tabulation or crosstab, is a table that displays the frequency distribution of the categories of two or more variables. It can help you to summarize and compare the proportions of different groups, test hypotheses about the independence or association of the variables, and measure the strength and direction of the relationship.

Why use contingency tables?

Contingency tables are a simple and effective way to summarize and compare the frequency distribution of two or more categorical variables. Their main benefits are that they:
  • Help you to identify patterns and trends in the data
  • Help you to test hypotheses about the independence or association of the variables
  • Help you to measure the strength and direction of the relationship between the variables
  • Help you to visualise the data using mosaic plots or other graphical methods

How to Create a One-Way Contingency Table in R

For this tutorial, I’ll use the Titanic dataset that is built into R. It records four facts about the 2,201 people on board the Titanic: class, sex, age group (child or adult), and survival status. You can load it with the data() function, which creates a four-dimensional table named Titanic in your workspace. You can view its structure with str():

data(Titanic)
str(Titanic)

The output will look something like this:

Structure of the titanic data set

The Titanic array has four dimensions: Class, Sex, Age, and Survived. Each dimension has two or more levels (e.g., Class has four levels: 1st, 2nd, 3rd, and Crew). The values in the array are the frequencies of each combination of levels (e.g., there were 35 male children in the third class who did not survive).
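
As a quick check on that description, the sketch below reads a single cell by naming one level per dimension and then collapses the array to a one-way table of class. It uses datasets::Titanic so that it always refers to the built-in table, whatever is in your workspace; titanic and class_counts are illustrative variable names only.

```r
# Read the built-in table directly from the datasets package
titanic <- datasets::Titanic

dim(titanic)        # 4 2 2 2: Class, Sex, Age, Survived
dimnames(titanic)   # the levels of each dimension

# One cell: male children in the 3rd class who did not survive
titanic["3rd", "Male", "Child", "No"]

# One-way table: total number of people in each class
class_counts <- margin.table(titanic, margin = 1)
class_counts
```

The one-way table simply sums the array over the other three dimensions, which is the same operation the two-way and three-way tables use with more dimensions kept.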

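You can cross-tabulate this array over any combination of its dimensions. One way is to convert it to a data frame, which has one row per combination of levels and a Freq column holding the counts, and then build the table with xtabs(). For example, to see the survival frequencies by sex, class, and age, you can type:

xtabs(Freq ~ Sex + Class + Age + Survived, data = as.data.frame(Titanic))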

The output will look something like this:

Analysis of Survival Frequencies on the Titanic

The table displays the survival frequencies by sex, class, and age on the Titanic. It is divided into four sections, each representing a different combination of age and survival status: Child/No, Adult/No, Child/Yes, and Adult/Yes.

For the "Child, Survived = No" section, no children in the 1st or 2nd class died, while 52 children in the 3rd class did (35 boys and 17 girls). The Crew column is zero because there were no children among the crew.

In the "Adult, Survived = No" section, far more adult males than adult females died in every class, with the largest count among the crew (670). The counts for adult females are much lower: 4 in the 1st class, 13 in the 2nd, 89 in the 3rd, and 3 among the crew.

For the "Child, Survived = Yes" section, children survived in all three passenger classes: 6 in the 1st, 24 in the 2nd, and 27 in the 3rd. These counts are much smaller than those for adults because there were only 109 children on board.

In the "Adult, Survived = Yes" section, the largest count of surviving adult males is among the crew (192), and the largest count of surviving adult females is in the 1st class (140).

Taken together, the table shows that every child in the 1st and 2nd classes survived, while most children in the 3rd class did not (27 of 79 survived). Among adults, females had a higher survival rate than males in every class.

How to Create a Two-Way Contingency Table in R

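A two-way contingency table is a table that shows the frequency distribution of two categorical variables. The table() function builds one by counting rows, so it expects raw data with one row per observation. The Titanic data are already aggregated: after conversion to a data frame, each row is one combination of levels and the Freq column holds the number of people. For data in this form, use xtabs() with Freq on the left-hand side of the formula so that the counts are summed. For example, to see how survival status and sex are related in the Titanic dataset:

Titanic <- as.data.frame(Titanic)
xtabs(Freq ~ Sex + Survived, data = Titanic)

The output will look something like this:

        Survived
Sex        No  Yes
  Male   1364  367
  Female  126  344

Of the 1,731 males on board, 1,364 did not survive and 367 survived. Of the 470 females, 126 did not survive and 344 survived. Most males died and most females survived, which suggests that survival was related to sex. Calling table(Titanic[, c("Sex", "Survived")]) on this data frame would instead return 8 in every cell, because it counts the eight rows for each combination rather than the people those rows represent.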
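
Because table() counts rows, it gives the real counts only when the data have one row per person. The sketch below is one way to see this, under the assumption that we expand the aggregated counts back into raw form; counts, people, and the by_* names are illustrative, and datasets::Titanic is used so the code does not depend on the workspace.

```r
# Aggregated form: 32 rows, one per combination of levels, counts in Freq
counts <- as.data.frame(datasets::Titanic)

# Expand to raw form: repeat each row Freq times (one row per person)
people <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                 c("Class", "Sex", "Age", "Survived")]

by_rows  <- table(people[, c("Sex", "Survived")])         # counts rows of raw data
by_freq  <- xtabs(Freq ~ Sex + Survived, data = counts)   # sums the Freq column
by_array <- margin.table(datasets::Titanic, c(2, 4))      # sums the original array

by_rows
```

All three routes lead to the same two-way table; which one is most convenient depends on whether your data arrive as raw rows, as a count column, or as an array.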

How to Add Margins and Proportions to a Contingency Table

You can add the row and column sums to a contingency table using the addmargins() function. This can help you to compare the frequencies of different groups more easily. For example, to add margins to the sex by survival table, you can type:

addmargins(xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic)))

The output will look something like this:

        Survived
Sex        No  Yes  Sum
  Male   1364  367 1731
  Female  126  344  470
  Sum    1490  711 2201

The table breaks down survival by sex on the Titanic. The Sum column shows 1,731 males and 470 females, and the Sum row shows 1,490 people who did not survive and 711 who did, out of 2,201 in total.

The margins make the imbalance easier to see: males make up about 79% of the people on board (1,731 of 2,201) but only about 52% of the survivors (367 of 711). However, this two-way table does not consider other variables, such as class and age, that also affected survival rates.
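
If you only want one set of totals, addmargins() takes a margin argument, and margin.table() returns the totals on their own. The following is a small sketch of those options; tab is an illustrative name, and the table is built from datasets::Titanic so the block runs by itself.

```r
tab <- margin.table(datasets::Titanic, c(2, 4))   # Sex x Survived counts

addmargins(tab)               # totals for rows and columns
addmargins(tab, margin = 1)   # extra "Sum" row only (column totals)
addmargins(tab, margin = 2)   # extra "Sum" column only (row totals)

# The margins alone, without the body of the table
margin.table(tab, 1)          # people per sex
margin.table(tab, 2)          # people per survival status
```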

Proportions or Percentages in a Contingency Table

You can also convert the frequencies in a contingency table into proportions by using the prop.table() function. This can help you to compare the relative frequencies of different groups more easily. For example, to see each combination of sex and survival as a proportion of everyone on board, you can type:

prop.table(xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic)))

The output will look something like this:

Proportions or Percentages in a Contingency Table

The results show that about 62.0% of everyone on board were males who did not survive, 16.7% were males who survived, 5.7% were females who did not survive, and 15.6% were females who survived. Because no margin is specified, each count is divided by the grand total of 2,201, so the four proportions add up to 1.


Proportions by Row, Column, or Total using prop.table()

You can also specify whether you want proportions by row, by column, or of the grand total by using the margin argument in the prop.table() function: margin = 1 divides each count by its row total, margin = 2 divides by its column total, and leaving margin out divides by the grand total. For example, to see the proportions of survival within each sex group, you can type:

a <- prop.table(xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic)), margin = 1)
addmargins(a, margin = 2)

The output will look something like this:

Proportions by Row, Column, or Total using prop.table()
Within the "Male" category, the proportion of individuals who did not survive ("No") is about 0.79, and the proportion who survived ("Yes") is about 0.21. Within the "Female" category, the proportions are about 0.27 and 0.73. The "Sum" column confirms that the proportions in each row add up to 1.0. Only the column of row sums is added here, because the column sums of row proportions have no useful interpretation.
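
Row and column proportions answer different questions, so it can help to compute both side by side. The sketch below does that and converts them to percentages; tab, row_prop, and col_prop are illustrative names, and the counts come from datasets::Titanic.

```r
tab <- margin.table(datasets::Titanic, c(2, 4))   # Sex x Survived counts

row_prop <- prop.table(tab, margin = 1)   # each row sums to 1: survival within each sex
col_prop <- prop.table(tab, margin = 2)   # each column sums to 1: sex within each outcome

round(100 * row_prop, 1)   # row percentages
round(100 * col_prop, 1)   # column percentages
```

Row percentages tell you what share of each sex survived; column percentages tell you what share of survivors (or non-survivors) belonged to each sex.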

How to Create a Mosaic Plot of a Contingency Table

A mosaic plot is a graphical representation of a contingency table that shows the frequencies or proportions of each combination of levels as rectangles with areas proportional to their values. It can help you to visualize the relationship between two or more categorical variables more easily. 

To create a mosaic plot in R, you can use the plot() function with a table object as an argument. For example, if you want to create a mosaic plot of survival by sex, you can type:

plot(xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic)), main = "Mosaic plot of a Contingency table")

The output will look something like this:

Mosaic plot of a Contingency table
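
The geometry of the plot can be read off the table. In the default layout for a two-way table, the first variable is split along the horizontal axis and the second variable is split within each column, so tile widths follow the share of each sex and tile heights follow the survival proportions within that sex (ignoring the small gaps between tiles). The sketch below computes those proportions; the variable names are illustrative only.

```r
tab <- margin.table(datasets::Titanic, c(2, 4))   # Sex x Survived counts

tile_widths  <- prop.table(margin.table(tab, 1))  # share of each sex
tile_heights <- prop.table(tab, margin = 1)       # survival split within each sex

round(tile_widths, 3)
round(tile_heights, 3)

# Tile area = width * height = that cell's share of everyone on board
tile_areas <- sweep(tile_heights, 1, as.vector(tile_widths), "*")
round(tile_areas, 3)
```

This is why the areas in a mosaic plot match the overall proportions from prop.table() with no margin.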

Customize the Appearance of the Mosaic plot 

You can also customize the appearance of your mosaic plot by using additional arguments in the plot() function, such as main (title), xlab (x-axis label), ylab (y-axis label), color (fill colors), border (border color), and shade (color the tiles by the size of their residuals under independence, which takes the place of color). For example, to create a mosaic plot with a title, axis labels, custom colors, and no borders, you can type:

plot(xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic)),
     main = "Survival by Sex on Titanic",
     xlab = "Sex", ylab = "Survived",
     color = c("pink", "lightblue"), border = NA)

The output will look something like this:

How to Create a Three-Way Contingency Table in R

A three-way contingency table is a table that shows the frequency distribution of three categorical variables. For example, to see how survival status, sex, and class are related in the Titanic dataset, you can build a three-way table with xtabs() and print it in a compact layout with the ftable() function:

ftable(xtabs(Freq ~ Class + Sex + Survived, data = as.data.frame(Titanic)))

The output will look something like this:

             Survived  No Yes
Class Sex
1st   Male            118  62
      Female            4 141
2nd   Male            154  25
      Female           13  93
3rd   Male            422  88
      Female          106  90
Crew  Male            670 192
      Female            3  20

The table presents the survival counts on the Titanic by class and sex. In every class, a much larger share of females than males survived: about 97% of females against 34% of males in the 1st class, 88% against 14% in the 2nd class, 46% against 17% in the 3rd class, and 87% against 22% among the crew.

Class mattered as well. Females in the 3rd class (90 of 196) survived far less often than females in the 1st class (141 of 145) or the 2nd class (93 of 106), and males in the 1st class (62 of 180) survived at roughly twice the rate of males in the 2nd and 3rd classes.

If you pass columns of the aggregated data frame straight to ftable() or table(), for example ftable(Titanic[, c("Class", "Sex", "Age", "Survived")]), every cell is 1, because each combination of levels appears in exactly one row. A perfectly uniform table like that is a sign that rows are being counted instead of the Freq column, not that survival was balanced across groups, so it is worth validating the input before interpreting the result.
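
Since the Titanic data store counts rather than one row per person, a three-way table has to sum those counts. The sketch below does this directly on the array, chooses the row and column variables of the flat table explicitly, and then turns the counts into survival rates per class and sex. The names three_way and survival_rate are illustrative.

```r
three_way <- margin.table(datasets::Titanic, c(1, 2, 4))   # Class x Sex x Survived

# Flat layout with explicit row and column variables
ftable(three_way, row.vars = c("Class", "Sex"), col.vars = "Survived")

# Proportion surviving within each Class x Sex group
survival_rate <- prop.table(three_way, margin = c(1, 2))[, , "Yes"]
round(100 * survival_rate, 1)
```

Rates are usually easier to compare than raw counts here, because the groups differ greatly in size.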

How to Test the Independence of Two Categorical Variables

One of the most common questions that arise when analyzing a contingency table is whether the two categorical variables are independent or not. Independence means that there is no relationship between the variables and that the frequency distribution of one variable does not depend on the value of the other variable. For example, if sex and survival status are independent on the Titanic, then we would expect that the proportion of survivors would be the same for males and females.

To test the independence of two categorical variables, we can use the chi-squared test of independence. It compares the observed frequencies in a contingency table with the frequencies expected under independence (for each cell, the row total times the column total divided by the grand total) and calculates a chi-squared statistic and a p-value.

The chi-squared statistic measures how much the observed frequencies deviate from the expected frequencies, and the p-value is the probability of seeing a deviation at least that large if the variables really were independent. If the p-value is less than a chosen significance level (usually 0.05), we reject the null hypothesis of independence and conclude that there is a significant association between the variables.
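
To make the calculation concrete, here is a minimal sketch that computes the expected counts, the Pearson chi-squared statistic, and the p-value by hand for the sex by survival table, without a continuity correction. The variable names are illustrative, and the table is taken from datasets::Titanic.

```r
observed <- margin.table(datasets::Titanic, c(2, 4))   # Sex x Survived counts
n <- sum(observed)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic (no continuity correction)
x2 <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(expected, 1)
c(statistic = x2, df = df, p_value = p_value)
```

These values should agree with chisq.test(observed, correct = FALSE), which performs the same calculation.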

To perform a chi-squared test of independence in R, you can use the chisq.test() function with a table object as an argument. For example, to test whether sex and survival status are independent on the Titanic, you can type:

chisq.test(xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic)))

The output will look something like this:

Pearson's Chi-squared test with Yates' continuity correction
X-squared = 454.5, df = 1, p-value < 2.2e-16

This shows that the chi-squared statistic is very large (about 454.5 on 1 degree of freedom) and the p-value is far below 0.05. We therefore reject the null hypothesis of independence and conclude that there is a significant association between sex and survival status on the Titanic. For 2 x 2 tables, chisq.test() applies Yates' continuity correction by default; set correct = FALSE to get the uncorrected statistic.

How to Measure the Strength and Direction of the Association Between Two Categorical Variables

Another question when analyzing a contingency table is how strong the relationship between two categorical variables is, and in what direction it runs. Strength means how well one variable predicts the other. Direction means whether higher values of one variable go with higher or lower values of the other, which only makes sense when the categories have a natural order.

For example, if sex and survival status are strongly associated on the Titanic, then knowing a person’s sex would help us predict their survival status better than guessing at random. Sex has no natural order, so instead of a positive or negative direction we describe which group fared better, for example that females survived at a higher rate than males.

To measure the strength of the association between two categorical variables, we can use measures of association such as the phi coefficient and Cramer’s V. Computed from the chi-squared statistic, these measures range from 0 to 1, where 0 indicates no association, and 1 indicates a perfect association. Because they carry no sign, the direction of the association is read from the contingency table itself, for example by comparing row proportions.

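Base R has no dedicated function for the phi coefficient or Cramer’s V, and cor() does not compute them from a table, because it expects numeric variables. You can calculate Cramer’s V from the chi-squared statistic as the square root of X-squared / (n * (k - 1)), where n is the total count and k is the smaller of the number of rows and the number of columns. For a 2 x 2 table, this equals the phi coefficient. For example, to measure the strength of the association between sex and survival status on the Titanic, you can type:

tab <- xtabs(Freq ~ Sex + Survived, data = as.data.frame(Titanic))
sqrt(chisq.test(tab, correct = FALSE)$statistic / (sum(tab) * (min(dim(tab)) - 1)))

The result is about 0.46, which indicates a moderately strong association between sex and survival status.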
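
If you need this measure for several tables, it can be wrapped in a small function. The helper below, cramers_v(), is hypothetical (it is not part of base R) and is only a sketch built on the uncorrected chi-squared statistic; it is applied to a 2 x 2 table and to a larger 4 x 2 table taken from datasets::Titanic.

```r
# Hypothetical helper: Cramer's V for any two-way table of counts
cramers_v <- function(tab) {
  # Uncorrected Pearson chi-squared; small-count warnings are suppressed for brevity
  x2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  k  <- min(dim(tab))                      # smaller of the number of rows and columns
  unname(sqrt(x2 / (sum(tab) * (k - 1))))
}

sex_surv   <- margin.table(datasets::Titanic, c(2, 4))   # 2 x 2 table
class_surv <- margin.table(datasets::Titanic, c(1, 4))   # 4 x 2 table

cramers_v(sex_surv)     # equals the phi coefficient for a 2 x 2 table
cramers_v(class_surv)
```

The result is a single number between 0 and 1 for each table, which makes tables of different sizes easier to compare than raw chi-squared statistics.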

What analysis can we do after the contingency table?

Some of the analyses that you can perform after building a contingency table in R:

  • You can calculate descriptive statistics, such as mean, median, mode, standard deviation, variance, and range, for each numeric variable or group using the summary() function or a describe() function from an add-on package such as psych or Hmisc.
  • You can perform inferential statistics, such as t-tests, ANOVA, chi-squared tests, and correlation tests, to compare the means or proportions of different groups or test the significance of the relationship between variables using the t.test(), aov(), chisq.test(), and cor.test() functions.
  • You can fit multivariate models, such as logistic regression, decision trees, and cluster analysis, to model the relationship between one or more dependent variables and one or more independent variables using functions such as glm(), rpart() (from the rpart package), and kmeans().
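
As one example of such a follow-up, the sketch below fits a logistic regression of survival on sex and class with glm(). It assumes the aggregated form of the data, so each row is weighted by its Freq count; the choice of predictors is only illustrative.

```r
counts <- as.data.frame(datasets::Titanic)   # one row per combination, counts in Freq

# Logistic regression of survival on sex and class, weighting each row by its count
fit <- glm(Survived ~ Sex + Class, family = binomial,
           data = counts, weights = Freq)

# Odds ratios relative to the reference levels (Male, 1st class)
round(exp(coef(fit)), 2)
```

An odds ratio above 1 means higher odds of survival than the reference group, and a value below 1 means lower odds, with the other variable held fixed.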

Tips and best practices 

Some tips and best practices for creating and interpreting a contingency table in R:

  • You should always check the quality and validity of your data before creating a contingency table. Look for missing values, outliers, errors, and inconsistencies, and deal with them appropriately using functions such as is.na(), na.omit(), and boxplot().
  • You should always choose the appropriate level of measurement for your variables. Use nominal variables for categories with no inherent order or rank, such as sex or colour. Use ordinal variables for categories with a natural order or rank, such as class or age group. Keep variables that have a numerical value or scale, such as income or height, as numeric, and group them into categories before tabulating them. Use the factor() or ordered() functions to create categorical variables.
  • You should always choose the appropriate type and size of contingency table for your analysis. Use a two-way contingency table for two categorical variables, a three-way contingency table for three categorical variables, and so on. Avoid tables with so many cells that they become difficult to read or interpret. Use the ftable() function to create flat contingency tables that are easier to display and manipulate.
  • You should always interpret your contingency table with caution and context. Do not make causal claims based on association alone. Consider other factors that may influence or confound the relationship between variables. Use appropriate measures of association and significance tests to support your conclusions, and report your results clearly and accurately using proper notation and terminology.
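
The first two tips can be sketched in a few lines. The ages below are made up purely for illustration, and the cut points are an assumption, not a rule; the code checks for missing values, groups a numeric variable into ordered categories with cut(), and tabulates the result.

```r
# Made-up ages, including one missing value, purely for illustration
ages <- c(4, 35, 61, 17, NA, 8, 45)

sum(is.na(ages))   # how many values are missing

# Turn the numeric variable into an ordered categorical variable
# (assumed cut point: ages up to and including 17 count as "Child")
age_group <- cut(ages, breaks = c(0, 17, Inf),
                 labels = c("Child", "Adult"), ordered_result = TRUE)

table(age_group)                    # missing values are dropped by default
table(age_group, useNA = "ifany")   # show them as their own category
```

Showing missing values explicitly with useNA makes it harder to overlook observations that silently dropped out of the table.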

Conclusion

In this article, I showed you how to create and interpret a contingency table in R, a useful tool for analyzing the relationship between two or more categorical variables. You learned how to:

  • Create a two-way contingency table in R using the table() function, or xtabs() when the data are stored as counts
  • Add margins and proportions to a contingency table using the addmargins() and prop.table() functions
  • Create a mosaic plot of a contingency table using the plot() function
  • Create a three-way contingency table in R using the ftable() function
  • Test the independence of two categorical variables using the chisq.test() function
  • Measure the strength and direction of the association between two categorical variables using the phi coefficient and Cramer’s V.

I hope you found this article helpful and informative. If you have any questions or comments, please leave them below. If you want to learn more about data analysis in R, you can check out our website Data Analysis, where we provide tutorials on various topics related to RStudio. You can also hire us if you need help with your data analysis projects. Thank you for reading, and happy coding!

Frequently Asked Questions (FAQs)

What is the difference between a two-way and a three-way contingency table?

A two-way contingency table shows the frequency distribution of two categorical variables, while a three-way contingency table shows the frequency distribution of three categorical variables.

What is the difference between the phi coefficient and Cramer’s V?

The phi coefficient and Cramer’s V are both measures of association between two categorical variables, but they differ in how they account for the size of the contingency table. The phi coefficient is the square root of the chi-squared statistic divided by the total count, while Cramer’s V also divides by k - 1, where k is the smaller of the number of rows and the number of columns. The two are equal whenever the table has only two rows or only two columns, and Cramer’s V is smaller than the phi coefficient when the table has more than two rows and more than two columns.

How can I interpret the p-value of the chi-squared test of independence?

The p-value of the chi-squared test of independence is the probability of observing a deviation from the expected frequencies at least as large as the one in your data if the variables were independent. If the p-value is less than a significance level (usually 0.05), then you can reject the null hypothesis of independence and conclude that there is a significant association between the variables. If the p-value is greater than or equal to the significance level, then you cannot reject the null hypothesis: the data do not provide sufficient evidence of an association, which is not the same as showing that the variables are independent.

How can I interpret the sign and magnitude of the correlation coefficient?

The phi coefficient and Cramer’s V, as computed from the chi-squared statistic, range from 0 to 1, where 0 indicates no association, and 1 indicates a perfect association; they have no sign. A sign is meaningful only when both variables are ordinal or numeric. In that case you can use a rank correlation on the underlying observations, such as cor(x, y, method = "spearman") or cor(x, y, method = "kendall"), which is positive when higher values of one variable correspond to higher values of the other and negative when they correspond to lower values.

How can I create and interpret a contingency table with more than three categorical variables?

You can create a contingency table with more than three categorical variables by passing a table of all the variables to ftable() and choosing which variables form the rows and columns with the row.vars and col.vars arguments, or by using other packages such as gmodels or vcd. However, these tables can become very large and difficult to read or visualize. Consider other ways to analyze your data, such as using logistic regression or decision trees.
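
For instance, all four Titanic variables fit in one flat table. This is a minimal sketch using datasets::Titanic; four_way is an illustrative name, and the choice of row and column variables is just one reasonable layout.

```r
# All four Titanic variables in one flat table:
# three row variables and one column variable
four_way <- ftable(datasets::Titanic,
                   row.vars = c("Class", "Sex", "Age"),
                   col.vars = "Survived")
four_way
```

Putting the outcome in the columns and the grouping variables in the rows keeps each line of the printed table readable as one group with its two survival counts.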




About the author

Zubair Goraya
Ph.D. Scholar | Certified Data Analyst | Blogger | Completed 5000+ data projects | Passionate about unravelling insights through data.