Storage & Analysis�(& Security)

What a re my optio ns for effectively organizing, storing, securing, computing, and analyzing my research data?

Goals:

Know what resources are available

Understand best practices

Know where to get help

Storage & Analysis: Questions

What speaks to you about Data Security?

What storage resources are available at HBS?

How might you collaborate with other when using data?

What compute resources do you have to use??

Storage & Analysis

This is the most difficult & time-consuming RDM stage
Likely need to perform, rinse, & repeat
So..
Should be effortless if one has planned well…
- 5Ps: Proper Planning Prevents Poor Performance
…and if done well 1st time around
Security is just as important during these steps!

Resources for Storage

This will vary based on data sensitive data level and indicated by DUA, IRB, or Data Security plan
See IT Security handout for appropriate considerations
May often be directed by faculty or RC Center member
Consult local research computing center / environment
- Research storage associated with a compute cluster
- Database server
- School and HU collaboration tools (E.g. SharePoint, OneNote)
HBS:
IT-issued desktops / laptops storage (usually SSD)
Collaboration or project folders on research storage for group work
- Associated with HBSGrid cluster
- \\hbsfiles storage
Other schools:
Lab folders offer equivalent functionality

https://researchdatamanagement.harvard.edu /storage-analysis-computation

Other storage resources?

FASRC's compute environment*
IQSS' compute environment
Cloud providers: Mass OpenCloud, AWS, Azure, GCP, etc*
3rd party-licensed providers
- Qualtrics, Zotero, etc
- DropBox, OneDrive, Box, etc
See websites for data transfer options:
- https://grid.rcs.hbs.org/transferring-data
- https://security.harvard.edu/secure-file-transfer

*Some costs may be associated with use. Please contact RCS first

Collaboration Tools: Data Security & Privacy

The University has determined that the Zoom cloud does not have the appropriate controls to protect Level 4 data. This means that it cannot be used to record research interviews, as recordings include conversations which could cause social harm to the participants should they be obtained by individuals with ill intent, which is considered to be Level 4 data even if the full scope of the video is not intended to be used . Unfortunately, at this time, the University has not approved any cloud-based solutions for video recording research interviews.

What does "consumer" mean? A "consumer" account is a service which you have signed up for on your own. Even if it is being paid for with a Harvard credit card, it is considered a consumer account, unless it is protected by a Harvard contract. Consumer Versions of cloud software not recommended for University business.

https://security.harvard.edu/collaboration-tools-matrix

Resources for Analysis

Local computing environment:
- HBS-issued desktop / laptops (data-intensive work – please talk to RCS/RSS)
- Home computer, with appropriate security measures
Remote environments
- HBSGrid compute cluster , FASRC Cannon* cluster, IQSS' RCE, HMS O2
- Be thoughtful and strategic about use and efficiency
  - Offload long-running work to the compute cluster
  - If something isn't running as expected, troubleshoot or ask for help
Cloud commercial vendors*:
- Amazon Web Services (AWS), Google Cloud, Microsoft Azure
- Please sign-up under Harvard contract (tenant)
- They provide support for secure storage & compute, BUT ensure they meet your security requirements (storage location, sufficient security)
Open-source Cloud systems (not vetted)
- OpenStack, OpenNebula, Mass. OpenCloud
National Supercomputing Centers
- XSEDE umbrella of compute resources

*some costs may be associated with use

https://researchdatamanagement.harvard.edu /storage-analysis-computation

Storage and Analysis

Storage & Analysis: Questions

How might one use Version Control?

Describe an example of good project organization?

Why are workflow tools important?

Doing the work…

We organize our recommendations into the following topics ( Box 1 ):

Data management: saving both raw and intermediate forms, documenting all steps, creating

tidy data amenable to analysis.

Software: writing, organizing, and sharing scripts and programs used in an analysis.

Collaboration: making it easy for existing and new collaborators to understand and contribute to a project.

Project organization: organizing the digital artifacts of a project to ease discovery and understanding.

Tracking changes: recording how various components of your project change over time.

Manuscripts: writing manuscripts in a way that leaves an audit trail and minimizes manual merging of conflicts.

https://doi.org/10.1371/ journal.pcbi.1005510

https://drivendata.github.io/cookiecutter-data-science/

Project Setup & Organization, part 1

Put each project in its own directory, which is named after the project
Create folders that will separate your code and data
In your data folder, ensure that your raw data are separated from any data you have processed (i.e., your clean datasets)

Create additional folders as needed for project. E.g., report folder for output; references folder for reference material such as survey instrument. Create a "README" file that outlines basic information about the project and the folder/file structure. Name files in a way that their content or function can be easily identified. Use relative addressing to make the project portable

Languages & Tools Recommendations*

Programming languages for processing and analyzing data in research:
- Most used: Python (Spyder as editor) and R (RStudio as editor)
- Others: Scala, Java, Julia
Statistical packages:
- Stata, R, SAS, & SPSS
Big Data tools:
- Spark (Hadoop), Kubernetes/containers
Data Visualization tools:
- ggplot2, Tableau, D3, Shiny, Plotly, Pandas, WorldMap (from HU's CGA)

*This list is not meant to be exhaustive!

Documenting workflows & analyses

Whatever tool you use, document all steps in your analysis and data transformations. Some tools to help with that:
RMarkdown and RMarkdown Notebooks (used with R)
Jupyter Notebooks: support for most languages (Python, R, Stata, MATLAB)
Dyndoc (notebook for Stata)
Templates with OneNote (or EverNote)
Workflow/Pipeline Tools
- These help document and track process order:
- Consider Drake (R); SnakeMake, doit or py-Make (Python); and make for other systems
- See https://github.com/pditommaso/awesome-pipeline for a full list of options
Be sure to update your data dictionary/codebook as you make changes to your data
Your future self will thank you!

Versioning and version control

Incredibly important given the duration and lifespan of projects
What may start out as small, test idea may grow organically into multi-person and multi-site research project
Two approaches given, for small to large…
Manual versioning
- In most cases, data will be organized in files under directories:
  - Use phase title, unique identifiers, and descriptive filenames
  - Prefix by date (yyyy, yyyy.mm.dd, yyyy_mmdd, yymmdd)
  - Reserve / display 3-letter file extension for file format, such as .txt, .pdf, or .csv.
- Note all changes in a ReadMe.txt or Changes.txt document
Use a version control system via Git or Github.com
- Use your judgement, and talk to your faculty advisor; BUT their non-use does not prevent your use
- Utilize Github.com web interface for external (non-HU) collaboration, and code.harvard.edu for internal-only use
- Command-line (terminal/shell) Git or Git-GUI like GitKraken
NB ! HBS/IQSS Version Control Class offered each semester

Source : PHD Comics. 2012. __Piled Higher and Deeper. __ http://phdcomics.com/comics/archive.php?comicid=1531

Databases for (un)structured data

Storing and use data in a database for easy, fast queries!
When data contain complex relationships or relating data to multiple sets of files
"Structured" data: SQL databases (MySQL, PostgreSQL, MariaDB)
Textual/"unstructured" data: Non-SQL databases (MongoDB, Cassandra)
Benefit: data is read-only, unless explicitly changed
Consider versioning the data / databases also
Update your data dictionary describing data, types, use, etc.
- Consider storing it, as well as change information, as a part of the database
HBS RCS provides MariaDB as a part of the RC environment:
- Provision and guidance on data modeling and DB development
- Advise on best practices and performance tuning
FASRC & HMS offer equivalent resources
Cloud vendors have similar offerings

Training Opportunities

RC Centers & HU Libraries offers tool & analysis environment training, both on-campus and remote
- E.g. Intro to R / Python, Automating Work, Version Control, etc.
Data collections & analytical methods
- Web scraping, causal inference, natural language processing
Offered fall and spring, and announced through newsletters, websites, and Harvard Training Portal (Category: Research Computing)

Storage, Security, & Analysis Exercise

You have transferred the data and are storing it on the _ HBSGrid _ along with the older data. You notice that the data and code from five years ago are a bit...disorganized.

Recall that project materials will need to be easily accessible to a collaborator from USC. Who could you contact to ensure that a collaborator can access the data and code?

How would you go about organizing and documenting the previous files?

How will you organize the new code and files?

How will you document your processes?

What type of version control system will you employ?

Storage, Security, & Analysis: Recap

Many resources are available for storage and computation (desktop, laptop, HBS grid, cloud), but the storage must be appropriate for the security level.

Important to organize code/files and document processes

RCS offers consultations and trainings on storage and analysis.

References:

Baker Library Research Data Program: https://www.library.hbs.edu/Services/Research-Data-Program

HU Libraries Data Networks: https://hlrdm.library.harvard.edu /network

HU Working Remotely: https://www.harvard.edu /coronavirus/work-remotely

http://security.harvard.edu/dct

https://github.com/pditommaso/awesome-pipeline

Github.com

Github Enterprise @ Harvard: http://code.harvard.edu

Cookie Cutter Data Science: https://bit.ly/2NXTVGI

Wilson G, Bryan J, Cranston K, Kitzes J, Nederbragt L, Teal TK (2017) Good enough practices in scientific computing. PLoS Comput Biol 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510

https://researchdatamanagement.harvard.edu/storage-analysis-computation

Harvard Training Portal: Research Computing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

03_storage_and_analysis.md

03_storage_and_analysis.md

Storage & Analysis�(& Security)

Storage & Analysis: Questions

Storage & Analysis

Resources for Storage

Other storage resources?

Collaboration Tools: Data Security & Privacy

Resources for Analysis

Storage and Analysis

Storage & Analysis: Questions

Doing the work…

Project Setup & Organization, part 1

Languages & Tools Recommendations*

Documenting workflows & analyses

Versioning and version control

Databases for (un)structured data

Training Opportunities

Storage, Security, & Analysis Exercise

Storage, Security, & Analysis: Recap

Storage and Analysis

Files

03_storage_and_analysis.md

Latest commit

History

03_storage_and_analysis.md

File metadata and controls

Storage & Analysis�(& Security)

Storage & Analysis: Questions

Storage & Analysis

Resources for Storage

Other storage resources?

Collaboration Tools: Data Security & Privacy

Resources for Analysis

Storage and Analysis

Storage & Analysis: Questions

Doing the work…

Project Setup & Organization, part 1

Languages & Tools Recommendations*

Documenting workflows & analyses

Versioning and version control

Databases for (un)structured data

Training Opportunities

Storage, Security, & Analysis Exercise

Storage, Security, & Analysis: Recap

Storage and Analysis