New to R?

What is R? RStudio?

R is a free, open-source language focused on statistical data analysis. Its market share among researchers is increasing (second only to SPSS and outpacing SAS, Stata, JMP, Matlab, and other solutions you may have used), and it has multiple advantages over "point and click" statistical data analysis.

RStudio is a free tool made by Posit (formerly RStudio) that makes using R much simpler than using R alone. RStudio is provided as a tool in Arcus labs and is nearly identical to the RStudio you might be accustomed to using in CHOP's HPC, on CHOP's RStudio Connect server, or on your computer. It allows you to analyze data stored in your lab for your research project.

In R, you write scripts. Scripts are computer code that record a series of operations you want to perform on your data. Operations could include things like:

  • Ingesting data (bringing it into R) from an outside source like a .csv or a database
  • Cleaning data (say, removing rows in which not every Likert scale question was answered)
  • Performing statistical tests (like a t-test on subjects and controls)
  • Visualizing data (for example, creating an ROC curve)
  • And much more!
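
As a rough illustration of the operations listed above, here is a minimal sketch of what such a script might look like. It uses only base R. The survey data, the column names (`group`, `q1`, `q2`, `score`), and the temporary .csv file are all made up for this example, and a box plot stands in for the visualization step because drawing an ROC curve typically requires an add-on package.

```r
# Hypothetical survey data, written to a temporary .csv so the
# example is self-contained. With real data you would skip this part.
set.seed(42)
survey <- data.frame(
  id    = 1:40,
  group = rep(c("subject", "control"), each = 20),
  q1    = sample(1:5, 40, replace = TRUE),
  q2    = sample(1:5, 40, replace = TRUE)
)
survey$q2[c(3, 17, 28)] <- NA  # simulate unanswered Likert questions
csv_path <- tempfile(fileext = ".csv")
write.csv(survey, csv_path, row.names = FALSE)

# 1. Ingest: bring the data into R from the .csv file
raw <- read.csv(csv_path)

# 2. Clean: keep only rows where every Likert question was answered
clean <- raw[complete.cases(raw[, c("q1", "q2")]), ]
clean$score <- clean$q1 + clean$q2

# 3. Test: compare total scores of subjects and controls with a t-test
result <- t.test(score ~ group, data = clean)
print(result)

# 4. Visualize: box plot of total score by group
boxplot(score ~ group, data = clean, ylab = "Total score")
```

With your own data, you would replace the simulated section with a call such as `read.csv()` pointed at your own file, and adapt the column names to match.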

Why is R Popular?

With a script, you run code that can chain together multiple steps, such as combining data, de-identifying and cleaning data, performing analysis and statistical tests, and creating visualizations. If more data are added, you simply run the script again. You already did the hard work of writing the script, so now all you have to do is essentially hit "run".

If you realize that an early step in your workflow needs tweaking, you can update that part of your script and leave the rest untouched. Again, you just run the script with your changes, which saves a lot of time compared with a workflow in which changing something far upstream of your analysis means hours of manually re-cleaning data or re-creating files.
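
To make the idea of re-running and tweaking concrete, here is a small sketch in base R. The function name `analyze`, the `min_age` setting, and the tiny data sets are hypothetical and exist only for illustration. The same code is run first on an initial batch of data, then again after more data arrive, and once more after an upstream cleaning rule changes.

```r
# A hypothetical analysis wrapped in a function so it can be re-run as is.
analyze <- function(data, min_age = 0) {
  # Upstream cleaning step: drop missing scores and apply an age cutoff
  kept <- data[!is.na(data$score) & data$age >= min_age, ]
  # Downstream step: mean score per group
  aggregate(score ~ group, data = kept, FUN = mean)
}

# First batch of (made-up) data
batch1 <- data.frame(
  group = c("control", "control", "subject", "subject"),
  age   = c(4, 10, 6, 12),
  score = c(10, 20, 30, NA)
)
first_run <- analyze(batch1)

# More data get added: combine the batches and run the same code again
batch2 <- data.frame(
  group = c("control", "subject"),
  age   = c(8, 3),
  score = c(30, 50)
)
combined <- rbind(batch1, batch2)
second_run <- analyze(combined)

# The cleaning rule changes: adjust one upstream setting and re-run
third_run <- analyze(combined, min_age = 5)

print(first_run)
print(second_run)
print(third_run)
```

Only the input data or the one upstream setting changes between runs. The downstream summary code is never touched.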

What Makes R Difficult?

Unless you recently left graduate school, you probably learned a different paradigm of data analysis, one that depended on point-and-click software. Or perhaps you don't work directly with data but hire a statistician to do that work for you. Learning to "DIY" is rewarding but can definitely feel frustrating, especially when you already have a system that works. It is difficult to transition from using a system you're comfortable with to one that you're less adept at. We do think that the gains of using R, in terms of research reproducibility, greater publication options, and more fine-grained control over things like visualizations, outweigh the annoyances of having to learn to write code.

CHOP Has an R User Group!

CHOP has a vibrant R User Group made up of employees from all over the institution who use R for many different use cases. This is a great place to start connecting with other people, asking for help, and seeking advice.

Please fill out this form to join the CHOPR User Group! This will add you to the Outlook distribution list for emails as well as give you instructions on how to add yourself to our Slack workspace, where people ask coding questions (and answer them!).

Joining the R User Group means you'll be informed about periodic intro to R workshops, R User Group talks, and other resources you'll find useful. Especially if you're the only person in your lab who uses R, it can be important to find a community of practice that can help guide you. The R User Group, along with Arcus, provides introduction to R training periodically. To learn more about the Introduction to R for Clinical Data course, visit the R101 course website.

As you gain expertise, we also invite you to participate by leading an R User Group meeting! You don't need years of expertise to share your skills. Even if you only know a little, you know more than some people, and you can share the pitfalls to avoid and the routes to success you found while analyzing your type of research data.

Arcus-Specific R Training

Arcus On-Ramp

If you're already an Arcus user (you've signed our Terms of Use and completed CITI training), you can sign up for our Arcus On-Ramp webinars. In these webinars, you work in a real Arcus lab analyzing CHOP's electronic health record (EHR) to replicate an actual published study. Workshops focus either on exploring the data and defining a query for your study using SQL, or running the analysis in R/Python. No coding experience is required to attend. Registration closes one week before each workshop so we have time to add registered attendees as users in the webinar training lab. To sign up, please visit https://arcus.chop.edu/education/webinar-signup/. This link is only available for Arcus customers on the CHOP network.

Lab Training Videos

For an example of how to use R in your Arcus lab, start with the training videos on your lab's landing page.

These videos are very introductory, but they help you understand specifically how to work in your Arcus lab.

We strongly encourage you to watch all of the videos, in order, even the ones that don't refer to R specifically. It's only about an hour of your time, and we think it will answer many of your questions and save time in the long run.

Additional Resources

Arcus training is a great place to get started with your R education, but you will probably want to continue your education on your own, growing in skills that are specific to your own research goals or career needs.

You have several options when it comes to growing in your R skills.

There are a number of university classes, online courses and live workshops that go in depth about how to use R. Simply search for courses at the university or MOOC (e.g. Coursera) you prefer to use.

If you prefer something a bit more "just in time", however, we suggest the R modules from the DART (Data and Analytics for Research Training) program.

DART includes dozens of data science modules, each an hour or less long, with a narrow focus and clear learning objectives. They are asynchronous, so you can take them at any time!

Arcus Education's DART modules are the result of a study funded by an NIH grant aimed at educating biomedical researchers. The active research phase of this program is complete, so we are no longer recruiting learners to be our subjects. However, if you'd like to receive updates about publications or applications of this research, please email us at dart@chop.edu.

Training modules:

To begin learning R, the DART self-guided tutorial modules give you a couple of options.

If you want a comprehensive curriculum of nearly twenty modules, you might enjoy our Suggested Pathway 4: Analysis in R curriculum, which includes overview materials about reproducible research and data organization, introductory material in R, and some advanced topics you'll need as a biomedical researcher. While you're there, check out the other suggested pathways, too!


Preview of Suggested Pathway 4: Analysis in R:

| Order | Module | Description | Estimated Time |
|-------|--------|-------------|----------------|
| 1 | Reproducibility, Generalizability, and Reuse | This module provides learners with an approachable introduction to the concepts and impact of research reproducibility, generalizability, and data reuse, and how technical approaches can help make these goals more attainable. | 60 min |
| 2 | How to Troubleshoot | Learning to use technical methods like coding and version control in your research inevitably means running into problems. Learn practical methods for troubleshooting and moving past error codes and other difficulties. | 30 min |
| 3 | R Basics: Introduction | Introduction to R and hands-on first steps for brand new beginners. | 60 min |
| 4 | R Basics: Visualizing Data With ggplot2 | Learn how to visualize data using R's ggplot2 package. | 60 min |
| 5 | R Basics: Transforming Data With dplyr | Learn how to transform (or wrangle) data using R's dplyr package. | 60 min |
| 6 | Tidy Data | Tidy is a technical term in data analysis and describes an optimal way for organizing data that will be analyzed computationally. | 45 min |
| 7 | Directories and File Paths | In this module, learners will explore what a directory is and how to describe the location of a file using its file path. | 15 min |
| 8 | R Basics Practice | Use the basics of R coding, data transformation, and data visualization to work with real data. | 60 min |
| 9 | Reshaping Data in R: Long and Wide Data | A module that teaches how to reshape tabular data in R, concentrating on some typical shapes known as "long" and "wide" data. | 60 min |
| 10 | Missing Values in R | A practical demonstration of how missing values show up in R and how to deal with them. Note that this module does not cover statistical approaches for handling missing data, but instead focuses on the code you need to find, work with, and assign missing values in R. | 45 min |
| 11 | Summary Statistics in R | Learn to calculate summary statistics in R, and how to present them in a table for publication. | 30 min |
| 12 | Data Visualization in Open Source Software | Introduction to principles of data visualization and typical data visualization workflows using two common open source libraries: ggplot2 and seaborn. | 20 min |
| 13 | Data Visualization in ggplot2 | This module includes code and explanations for several popular data visualizations, using R's ggplot2 package. It also includes examples of how to modify ggplot2 plots to customize them for different uses (e.g. adhering to journal requirements for visualizations). | 60 min |
| 14 | Introduction to Null Hypothesis Significance Testing | This is an introduction to NHST for biomedical researchers. | 40 min |
| 15 | Statistical Tests in Open Source Software | This module provides an overview of the most commonly used kinds of statistical tests and links to code for running many of them in both R and Python. | 20 min |
| 16 | R Practice | Use the basics of R coding, data transformation, and data visualization to work with real data. | 60 min |
| 17 | Demystifying Machine Learning | An approachable and practical introduction to machine learning for biomedical researchers. | 60 min |
| 18 | Understanding the Bias-Variance Tradeoff | The bias-variance tradeoff is a central issue in nearly all machine learning analyses. This module explains what the tradeoff is, why it matters for machine learning, and what you can do to manage it in your own analyses. | 20 min |

If these pathways are close, but not quite right, you can also build your own pathway through these materials using our prototype curriculum development tool at https://learn.arcus.chop.edu.

If you're in a hurry and you want to just get a bit of specific R instruction, we recommend starting with these modules:

Beyond the NIH-funded DART modules, we also suggest other articles and resources, whether ones we've created in Arcus or ones we recommend from the larger R community.

Compendia of Resources:

  • Our "R 101" Guide includes links to articles, webinars, and other materials on a variety of topics.

Other Resources: