Introduction to Posit, R and RStudio IDE

Workshop Guide

Authors

Kamarul Imran Musa

Jason Ng

Published

November 6, 2024

Introduction to Posit Cloud, R and RStudio IDE

Posit Cloud

Posit Cloud lets you access Posit’s powerful set of data science tools right in your browser, with no installation or complex configuration required.

Register for a free Posit Cloud account and access our course

Steps:

  • Click the Posit Cloud folder for our workshop at this link
  • Proceed with the free registration if required
  • Join the space
  • Go to the Members area (your workspace)
  • Create a new RStudio project
  • Get to know the RStudio interface

R Programming Language

There are some terms we need to be familiar with:

  • packages
  • code
  • parameters
  • arguments

R packages

R code is distributed in packages. CRAN (the Comprehensive R Archive Network) provides a list of the R packages it hosts.

  • It is accessible at the CRAN webpage
  • As of 10 July 2023, there were almost 20,000 R packages listed on the CRAN packages webpage
  • Look at CRAN Task Views if you want to find R packages grouped by topic, such as Bayesian, Causal Inference, Epidemiology and many others.
  • There are also many more packages that are not listed on CRAN but are hosted on other repositories such as GitHub.

If you need to use certain R packages, you have to install them first. There are two ways of installing R packages from your R IDE:

  • Writing code in the R Console: type install.packages("nlme", dependencies = TRUE), then press ENTER
  • Using the GUI in the Packages pane

For this workshop, install these packages:

  • fpp3
  • tidyverse
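The console route can also be scripted so that only packages that are not yet installed get downloaded. The following is a minimal sketch, not part of the workshop material; the helper name missing_packages is hypothetical, and the install step is guarded so that it only runs in an interactive session such as the RStudio Console.

```r
# Hypothetical helper: return the packages in `pkgs` that cannot be loaded,
# which usually means they are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The two packages named in the list above.
to_install <- missing_packages(c("fpp3", "tidyverse"))

# Install only what is missing, and only when run interactively.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}
```

If to_install is empty, everything is already available and nothing is downloaded.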

R functions

We write R scripts to perform the desired tasks in R. An R script consists of a set of R code. Users write R code in the Console pane, in an R script file, or in another type of document such as a Quarto or R Markdown document.

Some simple code examples:

ChickWeight[1:20,]

lm(weight ~ Time, data = ChickWeight)

ChickWeight[1:20,]
   weight Time Chick Diet
1      42    0     1    1
2      51    2     1    1
3      59    4     1    1
4      64    6     1    1
5      76    8     1    1
6      93   10     1    1
7     106   12     1    1
8     125   14     1    1
9     149   16     1    1
10    171   18     1    1
11    199   20     1    1
12    205   21     1    1
13     40    0     2    1
14     49    2     2    1
15     58    4     2    1
16     72    6     2    1
17     84    8     2    1
18    103   10     2    1
19    122   12     2    1
20    138   14     2    1
lm(weight ~ Time, data = ChickWeight)

Call:
lm(formula = weight ~ Time, data = ChickWeight)

Coefficients:
(Intercept)         Time  
     27.467        8.803  
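The result of a function call can also be stored in an object and reused instead of only being printed. A minimal sketch using the same built-in ChickWeight data; the object names first_rows and fit are arbitrary choices made for this illustration.

```r
# ChickWeight ships with R, so no data file is needed.
first_rows <- head(ChickWeight, 20)   # the same rows as ChickWeight[1:20, ]

# Store the fitted model in an object.
fit <- lm(weight ~ Time, data = ChickWeight)

# Extract the estimated coefficients from the stored model.
round(coef(fit), 3)
```

The rounded coefficients should agree with the values printed in the output above.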

Arguments in R functions

Now, let’s type a question mark in front of a function name.

For example, type a question mark in front of lm, then press ENTER.

?lm

A help page will appear with detailed information about the function. You can see the function's parameters and their default values, such as:

lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)

The help page describes the argument expected for each parameter, for example:

  • formula
  • data
  • subset
  • weights

and others
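The parameters can also be inspected from the console, and arguments can be passed to them by name. A minimal sketch on the built-in ChickWeight data; the object names param_names and fit_diet1 are chosen only for this illustration.

```r
# List the parameter names of lm() without opening the help page.
param_names <- names(formals(lm))
print(param_names)

# A parameter with a default value can be left out of the call;
# for example, the default for `method` is "qr".
formals(lm)$method

# Pass arguments by name: here `subset` keeps only the rows with Diet equal to 1.
fit_diet1 <- lm(weight ~ Time, data = ChickWeight, subset = Diet == 1)
nobs(fit_diet1)   # number of observations actually used in the fit
```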

RStudio IDE

RStudio is an integrated development environment (IDE) for R and Python. It includes a console, syntax-highlighting editor that supports direct code execution, and tools for plotting, history, debugging, and workspace management. RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux).

There are a few ways to use RStudio:

  • Install it on your computer
  • Use it in the cloud (Posit Cloud, Amazon, Microsoft, Saturn Cloud)
  • Install and use it on a server (RStudio Workbench)

Hands on with RStudio IDE

  • Prepare the R environment

Steps:

  • Create a new RStudio project
  • Load required R libraries
library(tidyverse) #data wrangling and data visualization
library(haven) #read statistical data
library(broom) #tidier results
library(gtsummary) #to perform EDA
library(fpp3) #for time series and forecasting

If you receive the error there is no package called XXXXXX, install the package first, then run the library function again.

Error when package not yet installed
  • Read data

First, we read the data that we want to analyze into R. In this example, we will read a file named health_dataset.csv that sits in the project's working directory.

R will then store the data in an object named dataset. You are free to choose any name for the object.

dataset <- read_csv('health_dataset.csv')

You may check the columns and rows of the dataset:

  • columns are variables
  • rows are observations
glimpse(dataset)
Rows: 1,000
Columns: 12
$ hba1c       <dbl> 8.346606, 8.711193, 8.745453, 8.971826, 9.852680, 12.52782…
$ fbs         <dbl> 5.910906, 6.670592, 10.785029, 7.362169, 7.497362, 11.1446…
$ sex         <chr> "male", "male", "male", "female", "female", "male", "male"…
$ age         <dbl> 36.79013, 41.92743, 35.97902, 51.27069, 56.20355, 66.27214…
$ sbp         <dbl> 102.87252, 116.03028, 102.60260, 104.39317, 118.13104, 105…
$ dbp         <dbl> 85.05267, 65.43385, 62.18899, 73.11391, 77.92691, 74.12425…
$ weight      <dbl> 77.42964, 58.26995, 83.36551, 54.74778, 76.34981, 82.72246…
$ whr         <dbl> 0.12138468, 0.15365110, 0.04139635, 1.22958802, 0.19269123…
$ chol        <dbl> 5.259782, 4.243214, 6.836742, 6.531657, 3.523359, 6.291934…
$ tg          <dbl> 5.2650048, 6.1474392, 3.6183344, -0.5153024, 2.5157862, 2.…
$ obese       <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "no",…
$ historyofdm <chr> "no", "no", "no", "no", "no", "yes", "no", "yes", "no", "n…
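The same checks can be made with base R functions. A minimal sketch, shown on the built-in ChickWeight data so that it runs without the workshop file; replacing ChickWeight with dataset applies it to the data read above.

```r
# Number of rows (observations) and columns (variables).
dim(ChickWeight)

# Variable names.
column_names <- names(ChickWeight)
print(column_names)

# Compact overview of each variable's type and first values.
str(ChickWeight)
```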
  • Explore data

You must always explore your data. Examine each variable and each observation.

dataset |> 
  tbl_summary(by = sex)

Characteristic   female, N = 484 [1]    male, N = 516 [1]
hba1c            9.54 (8.74, 10.38)     9.45 (8.59, 10.37)
fbs              7.29 (5.83, 8.83)      7.17 (5.71, 8.57)
age              45 (39, 52)            45 (38, 52)
sbp              110 (103, 117)         109 (103, 116)
dbp              75 (70, 79)            75 (70, 80)
weight           67 (59, 75)            67 (59, 75)
whr              0.77 (0.42, 1.16)      0.84 (0.49, 1.20)
chol             4.43 (3.65, 5.33)      4.50 (3.65, 5.46)
tg               4.62 (3.30, 5.94)      4.60 (3.17, 5.98)
obese            239 (49%)              267 (52%)
historyofdm      234 (48%)              262 (51%)

[1] Median (Q1, Q3); n (%)
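To see what kind of numbers such a table contains, the quartiles of a numeric variable by group, and the counts and percentages of a categorical variable, can be computed with base R. This is only a sketch on the built-in ChickWeight data, not the method tbl_summary uses internally, and the object names are illustrative.

```r
# Split the numeric variable by group, then compute Q1, median and Q3 per group.
weight_by_diet <- split(ChickWeight$weight, ChickWeight$Diet)
quartiles <- sapply(weight_by_diet, quantile, probs = c(0.25, 0.5, 0.75))
round(quartiles, 1)

# Counts and percentages for a categorical variable.
diet_counts <- table(ChickWeight$Diet)
diet_counts
round(100 * prop.table(diet_counts), 1)
```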

Example: Revision on linear regression

  • Estimation from linear regression in R

Let’s show you how to run a linear regression model. From the analysis, we will be able to estimate the regression parameters.

  • Constant only model

In the constant-only model, we specify only the dependent variable and no covariates. Here, the dependent variable is hba1c. The estimated intercept is then the sample mean of the dependent variable.

mod0 <- lm(hba1c ~ 1, data = dataset)
summary(mod0)

Call:
lm(formula = hba1c ~ 1, data = dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3492 -0.8376  0.0059  0.8815  4.1506 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.49864    0.04187   226.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.324 on 999 degrees of freedom
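In a constant-only linear model the intercept is simply the sample mean of the dependent variable. A minimal sketch that checks this on the built-in ChickWeight data; the object name fit0 is arbitrary.

```r
# Constant-only model: no covariates on the right-hand side.
fit0 <- lm(weight ~ 1, data = ChickWeight)

# The estimated intercept ...
coef(fit0)

# ... equals the sample mean of the dependent variable.
mean(ChickWeight$weight)
```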

Univariable model

In a univariable model, we specify:

  • one dependent variable
  • one covariate or independent variable
mod1 <- lm(hba1c ~ fbs, data = dataset)
summary(mod1)

Call:
lm(formula = hba1c ~ fbs, data = dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3801 -0.7773 -0.0080  0.8146  3.8961 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.03422    0.13070   61.47   <2e-16 ***
fbs          0.20235    0.01722   11.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.242 on 998 degrees of freedom
Multiple R-squared:  0.1215,    Adjusted R-squared:  0.1206 
F-statistic:   138 on 1 and 998 DF,  p-value: < 2.2e-16
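The coefficients define the fitted line hba1c = 8.03422 + 0.20235 × fbs. A minimal sketch of how to read them, with the values copied from the output above; the helper name predict_hba1c is hypothetical.

```r
# Coefficients copied from the summary(mod1) output above.
intercept <- 8.03422
slope_fbs <- 0.20235

# Hypothetical helper: predicted hba1c for a given fbs value.
predict_hba1c <- function(fbs) intercept + slope_fbs * fbs

# Predicted hba1c at fbs = 7: 8.03422 + 0.20235 * 7 = 9.45067
predict_hba1c(7)

# A one-unit increase in fbs changes the prediction by the slope.
predict_hba1c(8) - predict_hba1c(7)
```

With the fitted model object available, predict(mod1, newdata = data.frame(fbs = 7)) gives the same prediction up to rounding of the copied coefficients.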

Multivariable model

Now, let’s estimate a multivariable model. A multivariable model has one dependent variable and more than one independent variable. For example:

  • One dependent variable: hba1c
  • Two covariates: fbs, obese
mod2 <- lm(hba1c ~ fbs + obese, data = dataset)
summary(mod2)

Call:
lm(formula = hba1c ~ fbs + obese, data = dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2965 -0.7541 -0.0249  0.8425  3.9800 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.96066    0.13521  58.877   <2e-16 ***
fbs          0.20113    0.01721  11.689   <2e-16 ***
obeseyes     0.16282    0.07846   2.075   0.0382 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.24 on 997 degrees of freedom
Multiple R-squared:  0.1253,    Adjusted R-squared:  0.1235 
F-statistic: 71.38 on 2 and 997 DF,  p-value: < 2.2e-16
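The coefficient name obeseyes arises because lm() turns a categorical covariate into an indicator (dummy) variable, with the first level ("no") as the reference. A minimal sketch using a tiny made-up data frame; toy and its values are hypothetical and are not taken from the workshop dataset.

```r
# Hypothetical toy data, only to show how a yes/no variable is coded.
toy <- data.frame(obese = c("no", "yes", "no"))

# The design matrix that lm() would build for this covariate.
design <- model.matrix(~ obese, data = toy)
design

# The column `obeseyes` is 1 when obese is "yes" and 0 otherwise.
colnames(design)
```

The obeseyes estimate in the output above is therefore the estimated difference in hba1c between the obese and non-obese groups at the same fbs value.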

Reflection

  • Could you access our course?
  • What do you think would be your main challenges in running R and RStudio?
  • Are you interested in running R on your own machine?
  • How would you get help if you got stuck with an analysis?
  • Have you tried generative AI such as ChatGPT?
  • What is an R package?
  • What is R code?
  • What is a function?
  • What does a function contain?