Exam (elaborations)



Grade

A+

Uploaded on

03-08-2024

Written in

2024/2025

STAT 404 - MIDTERM 2
Terms in this set (44)

Why do we need functions?
- Data structures tie related values into one object
- Functions tie related commands into one object
- In both cases: easier to understand, easier to work with, easier to build into larger things

Function structure: name, arguments, body, return
The structure of a function has three basic parts:
- Inputs (or arguments): what should a user provide to the function?
- Body: code that is executed
- Output (or return value): what does the function give back to the caller?

What should be a function?
- Things you're going to re-run, especially if they will be re-run with changes to arguments
- Chunks of code which are small parts of bigger analyses
- Chunks of code which are very similar to other chunks

Best practices when creating a function
- Test code outside a function first
- Put code in a function and test
- Replace hard-coded values with arguments and test again
- Use parentheses liberally
- Only include in the function what can be repeated

What is the default return value?
With no explicit return() statement, the default is to return whatever is on the last line (the value of the last evaluated expression).

How and when to use default inputs
A function can also specify default values for its inputs: if the user doesn't specify an input in the function call, the default value is used.

Calling functions we define works just like calling built-in functions: named arguments, defaults
- Inputs can be called by name, or without
- Inputs can be called by partial names (if uniquely identifying)
- When inputs aren't specified, default values are used
- Named inputs can go in any order

How are argument values matched to arguments when a function is called?
While named inputs can go in any order, unnamed inputs must go in the proper order (as they are specified in the function's definition).
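
A minimal sketch of the matching rules above. The function pow_shift(), its arguments, and its default values are hypothetical and made up for illustration; they are not from the course material.

```r
# Hypothetical helper: raise x to a power, then add an offset.
# `power` and `offset` have default values, so callers may omit them.
pow_shift <- function(x, power = 2, offset = 0) {
  x^power + offset  # no explicit return(): the last evaluated expression is returned
}

a  <- pow_shift(3)                             # defaults used: 3^2 + 0 = 9
b  <- pow_shift(3, 3)                          # unnamed inputs matched by position: 3^3 + 0 = 27
c1 <- pow_shift(offset = 1, power = 3, x = 2)  # named inputs in any order: 2^3 + 1 = 9
d  <- pow_shift(2, off = 10)                   # partial name `off` uniquely matches `offset`: 2^2 + 10 = 14
```

Partial matching works here only because no other argument name starts with "off"; spelling names out in full is usually easier to read.
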
When calling a function with multiple arguments, use input names for safety, unless you're absolutely certain of the right order for (some) inputs.

How to return more than one output
A function in R cannot return more than one object, but it can return a list, which (by definition) can contain an arbitrary number of arbitrary objects.

What is a side effect?
A side effect of a function is something that happens as a result of the function's body, but is not returned. Examples:
- Printing something out to the console
- Plotting something on the display
- Saving an R data file, or a PDF, etc.

Interfaces control what the function can see (arguments, environment) and change (its internals, its return value)
- Interfaces mark out a controlled inner environment for our code
- Interact with the rest of the system only at the interface
- Advice: arguments explicitly give the function all the information
  - Reduces risk of confusion and error
  - Exception: true universals like π
- Likewise, output should only be through the return value

What do we mean by functions are objects?
In R, functions are objects, just like everything else. This means that they can be passed to functions as arguments and returned by functions as outputs.

What are the four higher order classifications of functions?
Beyond regular functions, we can use:
- Functionals: functions that take another function as an argument (like the apply family). Most likely to use.
- Function factories: functions that create functions (like ecdf()). Less common.
- Function operators: functions that take functions as input and output a function

In what situations are functionals considered the most useful?
- We often want to do very similar things to many different functions
- The procedure is the same, only the function we're working with changes
- So write one function to do the job, and pass the function as an argument
- Because R treats a function like any other object, we can do this simply by invoking the function by its argument name in the body
- We have already seen examples:
  - apply(), sapply(), etc.: take a function and use it on all of these objects
  - curve(), surface(): evaluate a function over a range, and plot the results

Variable creation in function factories
- Function factories take vector inputs and output a function
- Variables other than the arguments to the function are fixed by the environment of creation of the outputted function

Side effects: bad
Not all side effects are desirable. One particularly bad side effect is when the function's body changes the value of some variable outside of the function's environment.

Summary of Functions
- Function: formal encapsulation of a block of code; generally makes your code easier to understand, to work with, and to modify
- Functions are absolutely critical for writing (good) code for medium or large projects
- A function's structure consists of three main parts: inputs, body, and output
- R allows the function designer to specify default values for any of the inputs
- R doesn't allow the designer to return multiple outputs, but a function can return a list
- Side effects are things that happen as a result of a function call, but that aren't returned as an output
- In R, functions are objects, and can be arguments to other functions
  - Use this to do the same thing to many different functions
  - Separates writing the high-level operations and the first-order functions
  - Use sapply (etc.), wrappers, anonymous functions as adapters
- Functions can also be returned by other functions
  - Variables other than the arguments to the function are fixed by the environment of creation

Why do we simulate?
- Often, simulations can be easier than hand calculations
- Often, simulations can be made more realistic than hand calculations

Sampling from a given vector with and without replacement
To sample from a given vector, use sample():
- Without replacement (the default): sample(x = letters, size = 10)
- With replacement: sample(x = letters, size = 10, replace = TRUE)

Four main function prefixes for statistical distributions
To work with the normal distribution, we have the functions:
- rnorm(): generate normal random variables
- pnorm(): normal distribution function, Φ(x) = P(Z ≤ x)
- dnorm(): normal density function, φ(x) = Φ′(x)
- qnorm(): normal quantile function, q(y) = Φ⁻¹(y), i.e., Φ(q(y)) = y
Replace "norm" with the name of another distribution and all the same functions apply, e.g., "t", "exp", "gamma", "chisq", "binom", "pois", etc.

What is pseudorandom?
- Random numbers generated in R (in any language) are not "truly" random; they are what we call pseudo-random
- These are numbers generated by computer algorithms that very closely mimic "truly" random numbers
- The study of such algorithms is an interesting research area in its own right!
- The default algorithm in R (and in many other languages) is called the "Mersenne Twister"
- Type ?Random in your R console to read more about this (and to read how to change the algorithm used for pseudorandom number generation, which you should never really have to do, by the way)

Setting seeds: what happens if we set the same seed? different seed? multiple seeds?
- All pseudorandom number generators depend on what is called a seed value
- This puts the random number generator in a well-defined "state", so that the numbers it generates, from then on, will be reproducible
- The seed is just an integer, and can be set with ()
- The reason we set it: so that when someone else runs our simulation code, they can see the same (albeit, still random) results that we do

Why do we repeat simulations?
- One single simulation is not always trustworthy (depends on the situation, of course)
- In general, simulations should be repeated and aggregate results reported, which requires iteration!

What is reproducibility?
- To make random number draws reproducible, we must set the seed with ()
- More than this, to make simulation results reproducible, we need to follow good programming practices
- Gold standard: any time you show a simulation result (a figure, a table, etc.), you have code that can be run (by anyone) to produce exactly the same result

What is the recommendation for structuring functions in an iterated simulation project?
- Writing a function to complete a single run of your simulation is often very helpful
- This allows the simulation itself to be intricate (e.g., intricate steps, several simulation parameters), but makes running the simulation simple
- Then you can use iteration to run your simulation over and over again
- Good design practice: write another function for this last part (running your simulation many times)

Summary of Simulations
- Running simulations is an integral part of being a statistician in the 21st century
- R provides us with utility functions for simulations from a wide variety of distributions
- To make your simulation results reproducible, you must set the seed, using ()
- There is a natural connection between iteration, functions, and simulations
- Saving and loading results can be done in two formats: rds and rdata

What are the two R file types for saving objects? Reading R file types? How do they differ?
- readRDS(), saveRDS(): functions for reading/writing single R objects from/to a file
- load(), save(): functions for reading/writing any number of R objects from/to a file

What are the primary functions for reading and writing tabular data files?
() and ()

What are the helpful arguments for ()?
The following inputs apply to either () or () (though these two functions actually have different default inputs in general, e.g., header defaults to TRUE in () but FALSE in ()):
- header: boolean, TRUE if the first line should be interpreted as column names
- sep: string, specifies what separates the entries; the empty string "" is interpreted to mean any whitespace
- quote: string, specifies what set of characters signify the beginning and end of quotes; the empty string "" disables quotes altogether
- Other helpful inputs: skip, , . You can read about them in the help file ()

How do () and () compare?
Have a table full of data, just not in the R file format? Then () is the function for you. It works as in:
- (file=, sep=" "), to read data from a local file on your computer , assuming (say) space separated data
- (file=, sep="\t"), to read data from a webpage up at , assuming (say) tab separated data
Most common file extensions: .csv, .dat, .txt
Main function: ()
- Presumes whitespace-separated fields, one line per row
- Main argument is the file name or URL
- Returns a data frame
- Lots of options for things like field separator, column names, forcing or guessing column types, skipping lines at the start of the file...
The function () is just a shortcut for using () with sep=",". (But note: these two actually differ on some of the default inputs!)

How can we reorder a data frame?
- Sometimes it's convenient to reorder our data, say the rows of our data frame (or matrix)
- Recall: the function order() takes in a vector, and returns the vector of indices that put this vector in increasing order
- Set the input decreasing=TRUE in order() to get decreasing order
- We can compute an appropriate vector of indices, and then use this on the rows of our data frame to reorder all of the columns simultaneously

What does source() do?
source() reads a .R script and runs it in the current session, so any objects the script creates are saved in the global environment. Use case: store workflow functions in a script and source it in analysis scripts.

What are the different cases we can have when merging data frames?
Suppose you have two data frames X, Y, and you want to combine them:
- Simplest case: the data frames have exactly the same number of rows, the rows represent exactly the same units, and you want all columns from both; just use (X,Y)
- Next best case: you know that the two data frames have the same rows, but you only want certain columns from each; just use, e.g., (X$col1,X$col5,Y$col3)
- Next best case: same number of rows but in different order; put one of them in the same order as the other, with order(). Alternatively, use merge()
- Worse cases: different numbers of rows... hard to line up rows...

Compare and contrast the two methods to combine data frames
The merge() function tries to merge two data frames according to common columns, as in merge(x, y, by.x="SomeXCol", by.y="SomeYCol"), to join two data frames x, y by matching the columns "SomeXCol" and "SomeYCol".
- Default (no by.x, by.y specified) is to match all columns with common names
- Output will be a new data frame that has all the columns of both data frames
- If you know databases, then merge() is doing a JOIN
Using order() and manual tricks versus merge():
- Reordering is easier to grasp; merge() takes some learning
- Reordering is simplest when there's only one column to merge on; merge() handles many columns
- Reordering is simplest when the data frames are the same size; merge() handles different sizes automatically

Summary of Reading Data
- Read in data from a previous R session with readRDS(), load()
- Read in data from the outside with (), ()
- It can sometimes be tricky to get arguments right in (), ()
- It sometimes helps to take a look at the original data files to see their structure
- Read in Excel spreadsheets with ()
- Read in other R code with source()
- For reordering data, use order(), rev(), and proper indexing
- For merging data, use merge(); but you can do it manually using reordering tricks

Components of an exploratory data analysis (EDA)
Before pursuing a specific model, it's generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:
- What are the distributions of the variables?
- Are there distinct subgroups of samples?
- Are there any noticeable outliers?
- Are there interesting relationships/trends to model?

Visualizations for EDA
Visualizing relationships among variables, with pairs(): we can easily look at multiple scatter plots at once using the pairs() function. The first argument is written like a formula, with no response variable.

Numerical summaries for EDA
summary(), applied to a fitted linear model, tells us:
- The quantiles of the residuals: ideally, these match a normal distribution
- The coefficient estimates
- Their standard errors
- P-values for their individual significances
- (Adjusted) R-squared value: how much variability is explained by the model?
- F-statistic for the significance of the overall model

Statistical linear regression model
- The linear model is arguably the most widely used statistical model, and has a place in nearly every application domain of statistics
- Given response Y and predictors X1, ..., Xp, in a linear regression model, we posit: Y = β0 + β1X1 + ... + βpXp + ε, where ε ∼ N(0, σ²)
- Goal is to estimate the parameters, also called coefficients, β0, β1, ..., βp

Fitting linear regression model
Fitting a linear regression model with lm(): we can use lm() to fit a linear regression model. The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, ..., Xp are the predictors.
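
As a small sketch of the formula interface just described, the following fits a linear model to simulated data. The data frame df and its columns y, x1, x2 are hypothetical, and the true coefficients (1, 2, -0.5) are chosen only for illustration.

```r
# Simulated data for illustration only; column names y, x1, x2 are hypothetical.
set.seed(1)
n <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(n, sd = 0.1)

# Formula: response on the left of ~, predictors on the right.
fit <- lm(y ~ x1 + x2, data = df)

coef(fit)  # estimates of the intercept and the two slopes
```

Here y, x1, and x2 are the names that appear in the formula.
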
The names in the formula refer to column names of variables in a data frame, which we pass in through the data argument.

Formula notation: handy shortcuts and "on-the-fly" calculations
Here are some handy shortcuts for fitting linear models with lm() (there are also many others):
- No intercept (no β0 in the mathematical model): use 0 + or - 1 on the right-hand side of the formula, as in:
  summary(lm(lpsa ~ 0 + lcavol + lweight, data = ))
  summary(lm(lpsa ~ -1 + lcavol + lweight, data = ))
- Include all predictors: use . on the right-hand side of the formula, as in:
  summary(lm(lpsa ~ ., data = ))
- Include all predictors but some: use - <var> on the right-hand side of the formula, as in:
  summary(lm(lpsa ~ . - lcavol, data = ))
- Include interaction terms: use : between two predictors of interest, to include the interaction between them as a predictor (the * form expands to lcavol + svi + lcavol:svi), as in:
  summary(lm(lpsa ~ lcavol + lweight + lcavol:svi, data = ))
  summary(lm(lpsa ~ lweight + lcavol:svi, data = ))
  summary(lm(lpsa ~ lweight + lcavol*svi, data = ))

Utility functions: what are they? why are they advocated for over manual extraction?
Linear models in R come with a bunch of utility functions (methods for generics), such as coef(), fitted(), residuals(), summary(), plot(), predict(), for retrieving coefficients, fitted values, residuals, producing summaries, producing diagnostic plots, and making predictions, respectively.
These tasks can also be done manually, by extracting the components of the object returned by lm() and manipulating them appropriately. But this is discouraged, because:
- The manual strategy is more tedious and error prone
- Once you master the utility functions, you'll be able to retrieve coefficients, fitted values, make predictions, etc., in the same way for model objects returned by glm(), gam(), and many others

Fitting a logistic regression model
Fitting a logistic regression model with glm(): we can use glm() to fit a logistic regression model.
The arguments are very similar to lm(). The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, ..., Xp are the predictors. These refer to column names of variables in a data frame, which we pass in through the data argument. We must also specify family="binomial" to get logistic regression.

Summary of Fitting Models
- Fitting models is critical to both statistical inference and prediction
- Exploratory data analysis is a very good first step and gives you a sense of what you're dealing with before you start modeling
- Linear regression is the most basic modeling tool of all, and one of the most ubiquitous
- lm() allows you to fit a linear model by specifying a formula, in terms of column names of a given data frame
- Utility functions coef(), fitted(), residuals(), summary(), plot(), predict() are very handy and should be used over manual access tricks
- Logistic regression is the natural extension of linear regression to binary data; use glm() with family="binomial" and all the same utility functions
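
To illustrate the last point, here is a minimal sketch of a logistic regression fit on simulated binary data, followed by the same utility functions used for lm(). The data frame df, its columns, the seed, and the true coefficients are all assumptions made for this example.

```r
# Simulated binary data for illustration only; names and coefficients are hypothetical.
set.seed(404)
n <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
p <- plogis(-0.5 + 1.5 * df$x1)        # true success probabilities; x2 has no effect
df$y <- rbinom(n, size = 1, prob = p)  # binary (0/1) response

# Same formula interface as lm(), plus family = "binomial" for logistic regression.
fit <- glm(y ~ x1 + x2, data = df, family = "binomial")

coef(fit)             # coefficient estimates on the log-odds scale
p_hat <- fitted(fit)  # fitted probabilities for the training rows

# Predicted probabilities for two new observations.
new_obs <- data.frame(x1 = c(-1, 1), x2 = c(0, 0))
p_new <- predict(fit, newdata = new_obs, type = "response")
```

Because coef(), fitted(), and predict() are generics, the calls look the same as they would for an lm() fit; only type = "response" is specific to getting probabilities rather than log-odds.
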


Institution

CGFO - Certified Government Finance Officer

Module

CGFO - Certified Government Finance Officer



Document information

Uploaded on: August 3, 2024
Number of pages: 9
Written in: 2024/2025
Type: Exam (elaborations)
Contains: Questions & answers

Subjects

stat 404 midterm 2 terms in this set 44

Content preview

8/3/24, 2:01 PM

STAT 404 - MIDTERM 2
Jeremiah

