Tracks SURF Research Bootcamp

Overview of the tracks

TRACKS 

Track 1 High Performance Computing, if your laptop is not sufficient

Track 2 Introduction to Machine Learning 

Track 3 Analytics using R

Track 4 Data management and computational workflows

Track 1 High Performance Computing, if your laptop is not sufficient

Lykle Voort, Ander Astudillo and Nuno Ferreira (SURFsara)

Sometimes your laptop is not sufficient for your research, or your analyses take too much time. In this course you will learn how to make the move to High-Performance Computing systems, such as a batch system or the HPC Cloud.

1. Introduction to Linux 
You will get acquainted with some basic Linux commands and SSH needed to start using data, compute, and network facilities.
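To give a flavour of the session, a few commands of the kind covered there (the file and directory names below are just examples, not course material):

```shell
pwd                              # print the current working directory
mkdir -p results                 # create a directory (no error if it already exists)
echo "hello" > results/note.txt  # write a small text file
cat results/note.txt             # print its contents
ls results                       # list the directory's contents
```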

2. Cluster computing
You will learn how the national compute cluster Lisa and the national supercomputer Cartesius are set up, and work hands-on on one of these systems using short, easy-to-follow examples.
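On these systems, work is submitted as a batch job: a shell script with scheduler directives in comment lines at the top. The sketch below assumes a SLURM-style scheduler; the exact directives, partitions and limits are site-specific, so treat the values as placeholders. Because the #SBATCH lines are ordinary comments, the script also runs as a plain shell script:

```shell
#!/bin/bash
#SBATCH --job-name=demo      # job name shown in the queue (placeholder)
#SBATCH --nodes=1            # number of nodes requested
#SBATCH --ntasks=1           # number of tasks to run
#SBATCH --time=00:05:00      # wall-clock time limit

# the actual work; here just a placeholder that writes to a log file
echo "job started" > job.log
```

On a SLURM system such a script would typically be submitted with `sbatch job.sh` and monitored with `squeue`.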

3. SURF Cloud computing
Computing in the cloud gives you flexible and easy access to computing and data resources that you would otherwise have to host yourself. SURFsara runs the HPC Cloud, which provides an “Infrastructure as a Service” (IaaS) model. You will get a general introduction to cloud computing, learn about the HPC Cloud's characteristics, and use it hands-on, so that you can start using high performance computing in your research.

Training objectives

  • To work with a Linux terminal/shell and use some basic commands
  • To prepare, submit and analyse a batch job on the national cluster Lisa / supercomputer Cartesius
  • To use the HPC Cloud
  • To understand and apply different scaling models for parallel computing
  • To build (clusters of) Virtual Machines
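The scaling models mentioned above can be made concrete with Amdahl's law: if a fraction p of a program can be parallelised, the speedup on n workers is at most 1 / ((1 − p) + p/n). A small sketch (the 90% figure is just an example value, not from the course):

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Upper bound on speedup when a fraction p of the work
    is parallelised over n workers (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / n)

# with 90% parallel work, 8 workers give at most ~4.7x speedup,
# and even infinitely many workers can never exceed 10x
print(round(amdahl_speedup(0.9, 8), 2))  # prints 4.71
```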

Prerequisites
Some affinity with programming or scripting.

Hard-/software required
Please install the following tool on your laptop beforehand, depending on your operating system:
Windows – Install MobaXterm (http://mobaxterm.mobatek.net)
Linux, Mac OSX – nothing needed

Before the day starts, you will receive a username and password to log in to the compute systems. To log in, a public/private key pair is also needed. In the ‘Introduction to Linux’ session you will learn how to create such a public/private key pair. 
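On the command line, creating such a key pair typically looks like the sketch below. The file name is just an example, and the empty passphrase (`-N ""`) is only to keep the sketch non-interactive; in practice you should choose a passphrase:

```shell
# generate an ed25519 key pair; -f sets the file name, -N the passphrase,
# -q suppresses the informational output
ssh-keygen -t ed25519 -f demo_key -N "" -q
ls demo_key demo_key.pub   # the private key and the public key
```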

 

Track 2 Introduction to Machine Learning 

Valeriu Codreanu and Sagar Dolas (SURFsara)

This track introduces a number of machine learning techniques, distinguishing between unsupervised and supervised learning. K-means clustering and dimensionality reduction are discussed as examples of the most basic unsupervised techniques; deep learning with neural networks is discussed as an example of a supervised technique.

Both topics include hands-on exercises that will be performed in a Jupyter notebook. The examples are in Python, but since templates are provided, no Python or general programming knowledge is required.
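As a taste of the unsupervised part, here is a minimal K-means sketch in plain Python (the course exercises use prepared notebook templates instead; the data points below are made up to show two obvious groups):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Tiny 1-D K-means: assign each point to the nearest centre,
    then move each centre to the mean of its assigned points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # pick k starting centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda i: abs(x - centres[i]))
            clusters[i].append(x)            # nearest-centre assignment
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# two obvious groups, around 1 and around 10
print(kmeans_1d([0.9, 1.0, 1.1, 9.9, 10.0, 10.1], k=2))
```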

The track starts with the presentation of an actual use case by Michiel Punt (VU, HU): "How to predict falls in older adults using time series from sensors? The importance of feature creation".

Training objectives

  • To learn the basics of Machine Learning
  • To learn the supervised and unsupervised learning techniques
  • To work with Jupyter notebook (basics)

Prerequisites
Some affinity with programming or scripting.

Hard-/software required
Please bring a laptop with a recent web browser. You will need it to access the Jupyter notebook service of SURFsara.

Before the day starts, you will receive a username and password to log in to the Jupyter notebook service of SURFsara. 

 

Track 3 Analytics using R

Marc Teunis (HU) and Jonathan de Bruin (UU)

The use of R for data science has been booming in recent years. Thanks to its broad application domain and a very active user community, R has grown into a strong tool that goes well beyond statistics; it allows you to do much more with your data. 

This track consists of a hands-on tutorial that helps you get started with R quickly. It focuses on how R can support you in conducting reproducible research. No prior knowledge of R is required, but the track will also be useful for more experienced users. 

You will be introduced to RStudio, a user-friendly working environment for R, and learn about the version control software git and the GitHub platform for sharing R analyses. In addition, you will work with tidyverse, a collection of very useful R packages with a host of handy tools for data cleaning and analysis, and with rmarkdown, for easy documentation of your analyses and so-called literate programming.
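The basic git workflow taught there looks roughly like this on the command line (the repository name, file content and commit message are just examples; sharing via GitHub additionally needs a remote repository):

```shell
git init -q demo-analysis                    # create a new, empty repository
echo "x <- 1:10" > demo-analysis/analysis.R  # a first (toy) script
git -C demo-analysis config user.email "you@example.com"  # identity, set once
git -C demo-analysis config user.name "Your Name"
git -C demo-analysis add analysis.R          # stage the new file
git -C demo-analysis commit -q -m "Add first analysis script"
git -C demo-analysis log --oneline           # history now shows one commit
```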

The track starts with two R use case presentations provided by University Medical Center Utrecht (Karin Hagoort) and Open Analytics, Antwerp (Tobias Verbeke).

Training objectives

  • To get familiar with RStudio
  • To learn how to use git and the GitHub platform 
  • To learn how to cleanse and transform data with tidyverse
  • To learn how to document analyses with rmarkdown
  • To get introduced to interactive R apps with Shiny*

*Tentative, might not be covered

Prerequisites
A basic understanding of the steps involved in data analysis.

Hard-/software required
Please bring a laptop with a recent web browser. It will be needed for accessing the RStudio server.

You will receive a username and password for RStudio server a few days in advance of the bootcamp.

 

Track 4 Data management and computational workflows

Arthur Newton and Christine Staiger (SURFsara)

With the advance of new technologies, data volumes and numbers of files are constantly increasing. Good data management is therefore an essential part of data-driven research. Incorporating data management into computational workflows is not always straightforward. In this course we will introduce how to efficiently manage data with the Integrated Rule-Oriented Data System (iRODS), how to build computational pipelines that employ the research data, and how to enable a full audit trail of data between storage and compute resources.

Topics in this course include:

  • Data Life Cycle and FAIR principles
  • Data management concepts viewed in iRODS
  • Metadata and searching for data in iRODS
  • Building a computational pipeline that draws on data managed in iRODS
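To give an idea of the metadata bullet above: iRODS lets you attach attribute-value pairs to data objects and then search for objects by those attributes. The sketch below mimics that idea in plain Python; it is not the iRODS API, and the paths, attributes and values are made up:

```python
# a toy "catalogue": paths with attached attribute-value metadata,
# mimicking how iRODS annotates data objects with key-value pairs
catalogue = {
    "/zone/home/alice/run1.csv": {"experiment": "run1", "instrument": "MRI"},
    "/zone/home/alice/run2.csv": {"experiment": "run2", "instrument": "MRI"},
    "/zone/home/alice/notes.txt": {"type": "documentation"},
}

def find(attr, value):
    """Return all paths whose metadata has attr == value."""
    return sorted(p for p, meta in catalogue.items()
                  if meta.get(attr) == value)

print(find("instrument", "MRI"))
# prints ['/zone/home/alice/run1.csv', '/zone/home/alice/run2.csv']
```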

The track starts with the presentation of a data management tool by Ton Smeele (UU): “Using YODA to manage and publish research data”. After an introduction, the training will be purely hands-on and will be taught in the style of Software and Data Carpentry, i.e. live-coding sessions with exercises.

Training objectives

  • To learn about the concept of iRODS’s resource abstraction
  • To steer data flows across storage and compute resources
  • To work with metadata attached to data and to query for data using this metadata in compute workflows

Prerequisites
Some affinity with programming or scripting.

Hard-/software required
Please install the following tool on your laptop beforehand, depending on your operating system:
Windows – Install MobaXterm (http://mobaxterm.mobatek.net)
Linux, Mac OSX – nothing needed

Before the day starts, you will receive a username and password to log in to a Linux system (in this case the Lisa compute cluster and the national supercomputer Cartesius). 

Latest modifications 19 Nov 2018