Python and R are two of the most popular programming languages for statistical analysis, data analysis and machine learning. Both offer a rich collection of tools and libraries for working with data and are widely used in academia, research and industry. R was designed around statistical analysis, while Python is a general-purpose language with a broader range of applications and libraries.

In this article, we will focus on the best pre-installed R datasets commonly used for statistical analysis, including classification, regression analysis, clustering and time series analysis.

Pre-installed datasets are datasets that come with a piece of software or a platform. In R, these datasets provide a convenient way for users to get started with statistical analysis and machine learning without having to spend time searching for or creating their own datasets.
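As a minimal sketch of what "pre-installed" means in practice, the snippet below lists the datasets bundled with the datasets package and loads one of them. The variable name available is only illustrative, and the exact number of bundled datasets depends on the R version.

```r
# List the datasets bundled with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
length(available)   # count varies by R version
head(available)

# Built-in datasets are lazy-loaded, so they can be used by name directly
head(mtcars, 3)

# data() explicitly copies a dataset into the current environment
data(iris)
dim(iris)
```

Because names in R are case-sensitive, the dataset name must be typed exactly as it appears in this listing.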

Ultimate List of Pre-Installed R Datasets

1. Mtcars

This dataset includes information about various car models and their performance characteristics. The mtcars dataset was extracted from the 1974 Motor Trend US magazine and comprises 32 observations on 11 variables.

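The variables include: mpg (miles per US gallon), cyl (number of cylinders), disp (displacement in cubic inches), hp (gross horsepower), drat (rear axle ratio), wt (weight in 1000 lbs), qsec (quarter-mile time), vs (engine shape, 0 = V-shaped, 1 = straight), am (transmission, 0 = automatic, 1 = manual), gear (number of forward gears) and carb (number of carburetors).

The mtcars dataset can be loaded into R by typing data(mtcars).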
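To illustrate a typical first analysis, here is a short sketch that regresses fuel efficiency on weight and horsepower. The choice of predictors is our own assumption for the example, not a recommended model.

```r
# Linear regression of miles per gallon on weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table with estimates, standard errors and p-values
summary(fit)$coefficients

# Slope for weight: change in mpg per additional 1000 lbs
wt_slope <- coef(fit)[["wt"]]
wt_slope
```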

2. ChickWeight

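The ChickWeight dataset includes the weights of chicks measured over time under different experimental diets. It has 578 observations on 4 variables. It should not be confused with chickwts, a separate built-in dataset with 71 observations on 2 variables.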

The variables include: weight (body weight in grams), Time (days since birth), Chick (an identifier for each chick) and Diet (the experimental diet, 1 to 4).

The ChickWeight dataset can be loaded into R by typing data(ChickWeight).
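As a rough sketch of how the longitudinal structure can be summarised, the code below compares mean weights across diets on the last measurement day. Restricting to day 21 is an assumption made to keep the example short.

```r
# Keep only the final measurement day (day 21)
final <- subset(ChickWeight, Time == 21)

# Mean final weight for each of the four diets
mean_by_diet <- aggregate(weight ~ Diet, data = final, FUN = mean)
mean_by_diet

# ChickWeight and chickwts are different datasets
dim(ChickWeight)
dim(chickwts)
```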

3. CO2

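R ships two similarly named datasets, and dataset names are case-sensitive. The lowercase co2 dataset includes monthly measurements of atmospheric carbon dioxide (CO2) concentrations at the Mauna Loa Observatory in Hawaii, from January 1959 to December 1997, stored as a time series of 468 observations. The uppercase CO2 dataset is a different data frame with 84 observations on 5 variables, recording carbon dioxide uptake in grass plants.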

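The variables include: for co2, a single monthly series of CO2 concentrations in parts per million; for CO2, the columns Plant, Type, Treatment, conc and uptake.

The Mauna Loa series can be loaded into R by typing data(co2), and the plant uptake data frame by typing data(CO2).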
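Because the two names differ only by case, it is worth checking which object is actually in hand. The sketch below inspects both and then splits the Mauna Loa series into trend and seasonal parts with decompose(); the variable name parts is just illustrative.

```r
# Lowercase co2: monthly time series from Mauna Loa
class(co2)
length(co2)
frequency(co2)
start(co2)
end(co2)

# Uppercase CO2: data frame on CO2 uptake in grass plants
dim(CO2)
names(CO2)

# Classical decomposition of the monthly series into trend and seasonality
parts <- decompose(co2)
head(parts$seasonal, 12)
```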

4. Iris

This dataset includes measurements of the sepal length, sepal width, petal length and petal width of 150 iris flowers, which belong to 3 different species: setosa, versicolor and virginica. The iris dataset has 150 rows and 5 columns, which are stored as a dataframe, including a column for the species of each flower.

The variables include: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width (all in centimetres), plus Species.

The iris dataset can be loaded into R by typing data(iris).
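Since the species labels are known, iris is a convenient way to check how well an unsupervised method recovers real groups. The sketch below runs k-means on the four measurements and compares the clusters with the species; the seed and nstart values are arbitrary choices for the example.

```r
set.seed(42)  # k-means starts from random centres

# Cluster on the four numeric measurements only
features <- iris[, 1:4]
km <- kmeans(features, centers = 3, nstart = 20)

# Cross-tabulate cluster assignments against the true species
confusion <- table(cluster = km$cluster, species = iris$Species)
confusion

# Share of flowers that fall in their cluster's majority species
purity <- sum(apply(confusion, 1, max)) / nrow(iris)
purity
```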

5. Boston Housing

The Boston Housing dataset includes housing values and related factors for suburbs of Boston. The dataset was obtained from information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It ships in the MASS package, which is included with standard R installations, and comprises 506 observations on 14 variables.

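The variables include: crim (per capita crime rate), zn (proportion of residential land zoned for large lots), indus (proportion of non-retail business acres), chas (Charles River dummy variable), nox (nitrogen oxides concentration), rm (average number of rooms per dwelling), age (proportion of owner-occupied units built before 1940), dis (weighted distance to employment centres), rad (index of accessibility to radial highways), tax (property-tax rate), ptratio (pupil-teacher ratio), black (a transformed proportion of Black residents by town), lstat (percentage of lower-status population) and medv (median home value in $1000s).

The Boston Housing dataset can be loaded into R by typing data(Boston, package = "MASS"), or by typing library(MASS) first and then data(Boston).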
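The following sketch loads Boston from MASS and fits a small regression of median home value. It assumes the MASS package is installed, which is the case for standard R distributions, and the three predictors are picked only for illustration.

```r
# Boston lives in the MASS package, not in 'datasets'
data(Boston, package = "MASS")
dim(Boston)

# Median home value explained by rooms, lower-status share and crime rate
fit <- lm(medv ~ rm + lstat + crim, data = Boston)
coef(fit)
```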

6. Airquality

This dataset includes daily air quality measurements in New York from May to September 1973. It consists of 153 observations on 6 variables.

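The variables include: Ozone (mean ozone in parts per billion), Solar.R (solar radiation in Langleys), Wind (average wind speed in miles per hour), Temp (maximum daily temperature in degrees Fahrenheit), Month and Day.

The airquality dataset can be loaded into R by typing data(airquality).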
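One practical point with airquality is that it contains missing values, so it doubles as practice for handling NAs. Here is a minimal sketch that counts them and fits a regression on the complete rows; dropping incomplete rows with na.omit() is a simplifying assumption, not the only option.

```r
# Count missing values per column
na_counts <- colSums(is.na(airquality))
na_counts

# Keep only rows with no missing values
complete <- na.omit(airquality)
nrow(complete)

# Ozone as a function of the weather variables
fit <- lm(Ozone ~ Temp + Wind + Solar.R, data = complete)
coef(fit)
```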

7. Titanic

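The Titanic dataset includes information about the people aboard the Titanic, including whether they survived. The built-in Titanic object is a 4-dimensional table that cross-tabulates 2,201 passengers and crew by class, sex, age group and survival. The widely used version with 891 rows and 12 variables is the Kaggle training set, which is not pre-installed but is available as titanic_train in the titanic package on CRAN. Both are based on the ill-fated maiden voyage of the Titanic, which sank in the North Atlantic Ocean on April 15th, 1912, after colliding with an iceberg.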

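The variables include: for the built-in table, Class (1st, 2nd, 3rd or Crew), Sex, Age (Child or Adult) and Survived (No or Yes). The 891-row version, distributed as titanic_train in the titanic package, has the columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked.

The built-in table can be loaded into R by typing data(Titanic). Note the capital T, since dataset names are case-sensitive.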
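Because the built-in Titanic object is a contingency table rather than a data frame, it usually needs reshaping before modelling. The sketch below shows one way to do that and to compute survival rates by class; the variable names titanic_df, by_class and survival_rate are illustrative.

```r
# Built-in object: 4-dimensional table of counts
class(Titanic)
dim(Titanic)
sum(Titanic)   # total number of people counted

# Long format: one row per combination, with a Freq column of counts
titanic_df <- as.data.frame(Titanic)
head(titanic_df)

# Collapse to Class x Survived, then convert counts to row proportions
by_class <- apply(Titanic, c(1, 4), sum)
survival_rate <- prop.table(by_class, margin = 1)[, "Yes"]
survival_rate
```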

8. Faithful

This dataset includes eruption durations and the waiting times between eruptions for the Old Faithful geyser in Yellowstone National Park. The faithful dataset contains 272 observations on 2 variables.

The variables include: eruptions (eruption time in minutes) and waiting (waiting time to the next eruption in minutes).

The faithful dataset can be loaded into R by typing data(faithful).
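A quick sketch of the kind of exploration this dataset invites: the correlation between the two variables, a simple regression, and a split into short and long eruptions. The 3-minute cutoff is our own assumption for illustration.

```r
# Strength of the linear relationship between the two variables
r <- cor(faithful$eruptions, faithful$waiting)
r

# Predict the waiting time from the eruption duration
fit <- lm(waiting ~ eruptions, data = faithful)
coef(fit)

# Eruption durations are bimodal; count short vs long eruptions
groups <- cut(faithful$eruptions, breaks = c(0, 3, Inf),
              labels = c("short", "long"))
table(groups)
```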

9. Orange

The Orange dataset includes growth measurements of orange trees. The dataset also contains 35 observations on 3 variables.

The variables include: Tree (an identifier for each tree), age (age of the tree in days since 1968/12/31) and circumference (trunk circumference in millimetres).

The Orange dataset can be loaded into R by typing data(Orange).
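Growth data like this is often fitted with a saturating curve. As a sketch, the code below fits a single logistic curve to all trees using nls() with the self-starting SSlogis() model from the stats package; pooling the five trees into one curve is a simplification.

```r
# Logistic growth: circumference approaches the asymptote Asym as age grows
fit <- nls(circumference ~ SSlogis(age, Asym, xmid, scal), data = Orange)

# Asym: upper limit, xmid: age at half of Asym, scal: growth time scale
params <- coef(fit)
params
```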

10. PlantGrowth

This dataset includes the results of an experiment comparing plant yields, measured as dried weight, under a control and two different treatment conditions. The dataset contains 30 observations on 2 variables.

The variables include: weight (dried weight of the plant) and group (ctrl, trt1 or trt2).

The PlantGrowth dataset can be loaded into R by typing data(PlantGrowth).
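With one numeric outcome and one three-level factor, PlantGrowth fits a one-way ANOVA naturally. A minimal sketch, with Tukey's HSD added as one possible follow-up for pairwise comparisons:

```r
# One-way ANOVA: does mean weight differ across the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)

# p-value of the overall F test
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value

# Pairwise group comparisons
TukeyHSD(fit)
```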

11. Swiss

The swiss dataset includes a standardized fertility measure and socio-economic indicators for 47 French-speaking provinces of Switzerland in about 1888. It consists of 47 observations across 6 variables.

The variables include: Fertility, Agriculture, Examination, Education, Catholic and Infant.Mortality.

The swiss dataset can be loaded into R by typing data(swiss).
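As a sketch of a first look at these indicators, the code below prints how each variable correlates with fertility and then fits a regression on all five predictors. Using every predictor at once is a convenience for the example, not a modelling recommendation.

```r
# Correlation of each variable with Fertility
fert_cor <- cor(swiss)["Fertility", ]
round(fert_cor, 2)

# Fertility regressed on all other indicators
fit <- lm(Fertility ~ ., data = swiss)
coef(fit)
```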

12. Women

This dataset includes the average heights and weights of American women aged 30-39. The dataset consists of 15 observations on 2 variables.

The variables include: height (in inches) and weight (in pounds).

The women dataset can be loaded into R by typing data(women).
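With only two columns, women is a handy dataset for a first simple linear regression. A minimal sketch:

```r
# Weight as a linear function of height
fit <- lm(weight ~ height, data = women)

# Slope: pounds of weight per additional inch of height
slope <- coef(fit)[["height"]]
slope

# Share of the variance in weight explained by height
r_squared <- summary(fit)$r.squared
r_squared
```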

Common Use Cases for Pre-Installed R Datasets

Mtcars - The Mtcars dataset is used for regression analysis and exploratory data analysis to study the relationship between car specifications and fuel efficiency.

ChickWeight - The ChickWeight dataset is used for analyzing longitudinal growth data, such as the effects of diet on the growth of chicks.

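CO2 - The Mauna Loa co2 series is used for time series analysis, such as studying the trend and seasonality of atmospheric CO2 concentration, while the CO2 data frame is used for analyzing how CO2 uptake in grass plants responds to ambient concentration and treatment.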

Iris - The Iris dataset is used for exploratory data analysis and classification analysis to study the relationship between iris flower species and their physical attributes.

Boston Housing - The Boston Housing dataset is used for regression analysis to study the relationship between housing prices and various factors, such as crime rate and accessibility to radial highways.

Airquality - The Airquality dataset is used for exploratory data analysis and regression analysis to study the relationship between air pollution and various weather factors.

Titanic - The Titanic dataset is used for classification analysis and the analysis of contingency tables to study the factors that influenced the survival of passengers on the Titanic.

Faithful - The Faithful dataset is used for exploratory data analysis and modelling to study the patterns of eruption time and waiting time of the Old Faithful geyser in Yellowstone National Park.

Orange - The Orange dataset is used for regression analysis and growth modelling to study the growth of orange trees.

PlantGrowth - The PlantGrowth dataset is used for hypothesis testing and ANOVA to study the effect of different treatment conditions on plant growth.

Swiss - The Swiss dataset is used for exploratory data analysis and hypothesis testing to study the relationship between fertility and socio-economic indicators in the provinces of Switzerland.

Women - The Women dataset is used for exploratory data analysis and hypothesis testing to study the relationship between the height and weight of women.

Final Thoughts

Most of these datasets ship with the datasets package, which is loaded by default, and they can be explored with R packages like "tidyverse", "ggplot2" and "data.table".

They are also available in public repositories such as Kaggle, GitHub and the UCI Machine Learning Repository.


