Organizing data
● select()– include or exclude certain variables (columns)
● filter()– include or exclude certain observations (rows)
● arrange()– change the order of rows
● summarise()– produce descriptive statistics
● group_by()– organize observations into groups
● mutate()– create new columns
● inner_join()– combines two dataframes on specified columns, keeping only rows that match in both
● gather()– change data from wide to long format
● pull()– extracts one variable (column) from a data frame and returns it as a vector
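The verbs above are usually chained with the pipe. The following is a minimal sketch, assuming the dplyr and tidyr packages are installed; the scores and teachers tables and all their column names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: two test scores per student
scores <- tibble(
  student = c("A", "B", "C", "D"),
  group   = c("x", "x", "y", "y"),
  test1   = c(10, 12, 8, 15),
  test2   = c(11, 14, 9, 13)
)
# Hypothetical lookup table, one row per group
teachers <- tibble(group = c("x", "y"), teacher = c("Lee", "Kim"))

group_means <- scores %>%
  mutate(total = test1 + test2) %>%      # create a new column
  filter(total > 17) %>%                 # keep only rows with total above 17
  group_by(group) %>%                    # one group per value of `group`
  summarise(mean_total = mean(total), n = n()) %>%
  arrange(desc(mean_total))              # highest mean first

# Wide to long: test1/test2 become rows of (test, score)
long <- scores %>% gather(key = "test", value = "score", test1, test2)

# Keep rows whose `group` appears in both tables, then pick columns
with_teacher <- scores %>%
  inner_join(teachers, by = "group") %>%
  select(student, group, teacher)

# Extract one column as a plain vector
mean_vector <- pull(group_means, mean_total)
```

Here student C is dropped by filter() because the total is exactly 17, so group y is summarised from one row only.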
Data visualization
● ggplot(dataframe, aes(x = variable)) + geom_histogram()
● ggplot(dataframe, aes(x = variable, y = dependent_variable)) + geom_point()
● ggplot(dataframe, aes(x = variable, y = dependent_variable)) + geom_boxplot()
○ geom_violin() → shows the shape of the distribution within each group
● ggplot(dataframe, aes(x = variable, y = dependent_variable)) + geom_bar(stat = "identity")
○ fill = as.factor(variable) inside aes() → a separate bar colour for each level (e.g. each occasion)
● geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD)) → error bars one SD either side of the mean
● facet_wrap(~ variable1 + variable2) → split the figure into panels by category
● tibble() → creates new dataframe
○ tibble(first_column_name = column_contents, second_column_name = column_contents, …)
● n() → counts the number of rows (per group when used inside summarise() after group_by())
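The plotting pieces above can be combined into one figure. This is a sketch only, assuming dplyr and ggplot2 are installed; the heights data and the column names Mean, SD and N are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical raw data built with tibble()
heights <- tibble(
  sex    = c("F", "F", "F", "M", "M", "M"),
  height = c(160, 165, 170, 172, 178, 184)
)

# One row per group: mean, standard deviation and row count
height_summary <- heights %>%
  group_by(sex) %>%
  summarise(Mean = mean(height), SD = sd(height), N = n())

# Bars show the group means; error bars span Mean - SD to Mean + SD
p <- ggplot(height_summary, aes(x = sex, y = Mean, fill = as.factor(sex))) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = Mean - SD, ymax = Mean + SD), width = 0.2)

# print(p) draws the figure; adding facet_wrap(~ sex) would give one panel per group
```

stat = "identity" is needed because the bar heights come from the already summarised Mean column rather than from counting rows.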
Probability → use the binomial distribution when the outcome is a discrete count of successes in a fixed number of independent trials
● Chance of 1 girl in a 3 child family when the prob of a girl each time is 0.5
○ dbinom(x=1, size=3,prob=0.5)
● Chance of getting 1 or fewer girls in a family of 3
○ pbinom(q=1, size=3, prob=0.5)
● Chance of getting at least 2 girls in a family of 3
○ pbinom(q=1, size=3, prob=0.5, lower.tail=FALSE) → P(more than 1 girl), i.e. 2 or 3 girls
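● How few heads in 5 flips puts someone in the bottom 10% of outcomes?
○ qbinom(p=0.1, size=5, prob=0.5)
■ p = desired cumulative probability (not a count); the result is the smallest number of heads x with P(X ≤ x) ≥ p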
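The binomial calls above can be checked by hand, since a family of 3 has only 8 equally likely outcomes when prob = 0.5. A short sketch in base R; the variable names are illustrative.

```r
# Exactly 1 girl in 3 children: 3 of the 8 outcomes
p_exactly_one <- dbinom(x = 1, size = 3, prob = 0.5)    # 0.375

# 1 or fewer girls: P(0) + P(1) = 0.125 + 0.375
p_at_most_one <- pbinom(q = 1, size = 3, prob = 0.5)    # 0.5

# At least 2 girls: upper tail beyond q = 1, i.e. P(2) + P(3)
p_at_least_two <- pbinom(q = 1, size = 3, prob = 0.5, lower.tail = FALSE)  # 0.5

# Smallest number of heads x in 5 flips with P(X <= x) >= 0.1
# P(X <= 0) = 1/32 is below 0.1, P(X <= 1) = 6/32 is above it
heads_cutoff <- qbinom(p = 0.1, size = 5, prob = 0.5)   # 1
```

Note that lower.tail = FALSE gives P(X > q), so "at least 2" needs q = 1, not q = 2.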
Descriptive statistics
● Standard deviation <- standard error * sqrt(sample size), because standard error = standard deviation / sqrt(sample size)
● dnorm()– density function for normal distribution
○ x= values of variable
○ mean= mean
○ sd= standard deviation
● pnorm() – cumulative distribution function; gives the probability that a value falls below a cut-off point (use lower.tail=FALSE for the probability above it)
○ Probability of a woman shorter than 150cm
■ pnorm(q=150, mean= m_fem, sd= sd_fem)
● qnorm() – quantile function; gives the cut-off value below which a given proportion of values falls
○ How short does a woman need to be to be in the bottom 10%?
■ qnorm(p=0.1, mean= x, sd= y)
● rnorm() – simulate random draws from a normal distribution
○ rnorm(n=5, mean= x, sd= y)
● na.rm=TRUE – argument to summary functions such as mean() and sd(); drops missing values (NA) before summarising
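The normal-distribution functions and the SD/SE relationship above can be tried together in base R. This is a sketch under assumptions: the notes do not give values for m_fem and sd_fem, so 165 and 7 are placeholders, and the heights vector is made up.

```r
# Assumed values for illustration only; not taken from the notes
m_fem  <- 165   # hypothetical mean height in cm
sd_fem <- 7     # hypothetical standard deviation in cm

# Height of the density curve at the mean
density_at_mean <- dnorm(x = m_fem, mean = m_fem, sd = sd_fem)

# P(height < 150): pnorm() returns the lower tail by default
p_shorter <- pnorm(q = 150, mean = m_fem, sd = sd_fem)

# P(height > 150): the upper tail
p_taller <- pnorm(q = 150, mean = m_fem, sd = sd_fem, lower.tail = FALSE)

# Height below which the shortest 10% fall
cutoff_10 <- qnorm(p = 0.1, mean = m_fem, sd = sd_fem)

# qnorm() reverses pnorm(): feeding p_shorter back recovers 150
recovered <- qnorm(p = p_shorter, mean = m_fem, sd = sd_fem)

# Five simulated heights; set.seed() makes the draws repeatable
set.seed(1)
simulated <- rnorm(n = 5, mean = m_fem, sd = sd_fem)

# Summaries with a missing value: without na.rm = TRUE these return NA
heights <- c(160, 165, NA, 170)
n_obs   <- sum(!is.na(heights))            # 3 non-missing values
sd_h    <- sd(heights, na.rm = TRUE)       # 5
se_h    <- sd_h / sqrt(n_obs)              # standard error of the mean
sd_back <- se_h * sqrt(n_obs)              # recovers the standard deviation
```

The two tails from pnorm() always sum to 1, which is a quick check that the cut-off and lower.tail arguments are set as intended.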