This article explains how to compute the main descriptive statistics in R and how to present them graphically. To learn more about the reasoning behind each descriptive statistic, how to compute it by hand and how to interpret it, read the article “Descriptive statistics by hand”.

To briefly recap what has been said in that article, descriptive statistics (in the broad sense of the term) is a branch of statistics aiming at summarizing, describing and presenting a series of values or a dataset. Descriptive statistics is often the first step and an important part of any statistical analysis. It allows us to check the quality of the data and helps to “understand” the data by giving a clear overview of it. If well presented, descriptive statistics is already a good starting point for further analyses. There exist many measures to summarize a dataset. Location measures give an understanding of the central tendency of the data, whereas dispersion measures give an understanding of the spread of the data. In this article, we focus only on the implementation in R of the most common descriptive statistics and their visualizations (when deemed appropriate). See online or in the above-mentioned article for more information about the purpose and usage of each measure.

We use the dataset iris throughout the article. This dataset is included in R by default; you only need to load it by running iris:

    dat <- iris # load the iris dataset and rename it dat

Below is a preview of this dataset and its structure:

    head(dat) # first 6 observations
    # Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    str(dat) # structure of the dataset
    # $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The dataset contains 150 observations and 5 variables, representing the length and width of the sepal and petal and the species of 150 flowers. The length and width of the sepal and petal are numeric variables and the species is a factor with 3 levels (indicated by num and Factor w/ 3 levels after the variable names). See the different variable types in R if you need a refresher.

Regarding plots, we present the default graphs and the graphs from a well-known graphics package.

The stat.desc() function from the pastecs package computes descriptive statistics for every variable at once:

    library(pastecs)
    stat.desc(dat)
    # Sepal.Length Sepal.Width Petal.Length Petal.Width Species

You can have even more statistics (i.e., skewness, kurtosis and normality test) by adding the argument norm = TRUE in the previous function. Note that the variable Species is not numeric, so descriptive statistics cannot be computed for this variable and NA is displayed.

The table() function can also be used on two qualitative variables to create a contingency table. The dataset iris has only one qualitative variable, so we create a new qualitative variable just for this example. We create the variable size, which is small if the length of the sepal is smaller than the median of all flowers, and big otherwise:

    dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

Here is a recap of the occurrences by size:

    table(dat$size)

We now create a contingency table of the two variables Species and size with the table() function:

    table(dat$Species, dat$size)

Or with the xtabs() function:

    xtabs(~ dat$Species + dat$size)

The contingency table gives the number of cases in each subgroup. For instance, there is only one big setosa flower, while there are 49 small setosa flowers in the dataset.
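As a minimal, self-contained sketch of the contingency-table workflow on iris, the block below builds the size variable from the sepal length and cross-tabulates it with Species in two ways. Passing data = dat to xtabs() and the closing stopifnot() check are additions made here for illustration; they are assumptions about a convenient way to write and verify the example, not part of the article's own code.

```r
# Sketch: build a contingency table on a copy of the built-in iris data.
dat <- iris

# Assumption: "small" means a sepal length strictly below the overall median.
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")

# Two equivalent ways to cross-tabulate species and size.
tab <- table(dat$Species, dat$size)
xt <- xtabs(~ Species + size, data = dat)

print(tab)
print(xt)

# Both tables hold the same counts, and every flower is counted exactly once.
stopifnot(all(tab == xt), sum(tab) == nrow(dat))
```

Using the data argument of xtabs() keeps the formula short and labels the table dimensions with the variable names, which can make the printed output easier to read than the dat$ form.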