The three best-known public cloud providers, Amazon Web Services (AWS), Microsoft Azure and Google Cloud, are operated by some of the largest companies in the world.
The grand scale at which the public clouds operate allows them to offer their services in a cost-efficient manner. Their pay-as-you-go model can reduce costs dramatically if you cleverly select services that minimize running costs and switch services off when they are not in use. As such, we see great potential in using these hyperscale capabilities for research purposes.
The challenges of using the public cloud are well known. One of the risks is that the companies behind it become too powerful and customers find themselves in vendor lock-in. Having an exit strategy ready is, in our opinion, a prerequisite for going to the cloud. Security is another of these challenges: security in public clouds, as in any architecture, requires attention from the design phase onwards.
Call for pilots
The public clouds provide a great opportunity for research, as many of their new services and their speed of development can accelerate research. We also feel that the only way to cope with the aforementioned challenges is to start using public clouds. Therefore, last year, we launched a first public cloud call that resulted in four successful projects. As a result, we have decided to organize a second call for researchers who share our interest in the public cloud and the great opportunities it can bring.
Throughout the year we will hold a quarterly call in which researchers can apply. The application deadline for each round is the first day of the calendar quarter. The maximum number of projects depends on the available human and financial resources. We are most experienced in developing solutions on AWS, but we are also interested in developing on Microsoft Azure, as many SURF research members use the latter. SURF will provide design and implementation support (for a maximum of 160 hours per selected project), and the public cloud resources are subsidized by NWO (fair use principle).
This call is really meant for pilot projects. We will help develop a Proof of Concept (PoC) for your research environment in the public cloud and advise you on how to continue after the funded pilot project. Projects should preferably have a need for data-intensive processing, possibly in combination with data analytics, data storage, data sharing or streaming data. Past and present examples from the Scalable Data Analytics (SDA) group are solutions to process the data of IoT devices, traffic data, large amounts of Twitter data, and a machine learning project, but we are open to exploring other application areas as well. Below we describe a few of our past projects in more detail.
Conditions to apply and preparing the first meeting
Are you working on a data-intensive research project that could benefit from public cloud services, and do you like what you read? Please contact us by sending your proposal, with the subject “Public cloud call”, to email@example.com.
Make sure that your proposal addresses the following points:
- The project is a PoC in which we demonstrate a component or service of a public cloud that can be used by your research project. The proposal must describe your research and the high-level requirements for the service you want to develop.
- Briefly describe the data-intensive component of your project and the application area of your research.
- You accept and will actively collaborate in the creation of a use case after completion of the pilot project, which SURF may use to demonstrate the value of public clouds to Dutch and international researchers.
In a first meeting we will discuss your idea and work it out in more detail with the help of the points below. Please prepare answers to them if possible.
- The project may take three months at most. Include a rough planning in your proposal.
- Sufficient participation from the applicant: give details of available resources (number of individuals, available workforce, skill sets and available time), preferably including people with at least basic cloud skills who can actively co-develop with us.
- We help create components or services, but will not maintain them after the project. If you have an idea, explain your vision for how to continue the component or service after the project ends.
- The total budget for public cloud resources is approximately 5K per project. To stay within budget constraints, we may develop the PoC in a scaled-down version. If possible, explain your estimated need for compute and storage during the project so that costs can be derived from it.
- Use of AWS or Microsoft Azure standard, managed or serverless public cloud services. If you already have an idea, describe the services you want to use.
- A project will start as soon as it has been approved and resources are available.
- Preferably, the components or services resulting from the PoC are reusable. If applicable, describe which parts of your research environment might be of interest to other researchers.
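To help with the budget point above, here is a minimal sketch of how an estimated need for compute and storage can be turned into a rough cost figure. The rates below are hypothetical placeholders, not actual AWS or Azure prices; always check the providers' current pricing calculators before submitting.

```python
# Rough cost sketch for a PoC budget check.
# NOTE: these rates are HYPOTHETICAL examples, not real provider prices.
HYPOTHETICAL_RATES = {
    "vcpu_hour": 0.05,        # assumed price per vCPU-hour
    "storage_gb_month": 0.025,  # assumed price per GB-month
}

def estimate_project_cost(vcpus: int, hours: float,
                          storage_gb: float, months: float) -> float:
    """Return a rough total for the pilot's compute and storage needs."""
    compute = vcpus * hours * HYPOTHETICAL_RATES["vcpu_hour"]
    storage = storage_gb * months * HYPOTHETICAL_RATES["storage_gb_month"]
    return round(compute + storage, 2)

# Example: 16 vCPUs for 500 hours, plus 2 TB stored for 3 months.
print(estimate_project_cost(vcpus=16, hours=500, storage_gb=2000, months=3))
```

Even a back-of-the-envelope estimate like this makes it much easier for us to check a proposal against the available budget.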
Furthermore, the same criteria apply as for small NWO Computing Time on National Computer facilities applications. Researchers may submit an application if they are employed (i.e. hold a salaried position) at one of the following organizations:
- Universities established in the Kingdom of the Netherlands;
- University Medical Centers;
- NWO and KNAW institutes;
- Dutch universities of applied sciences;
- the Netherlands Cancer Institute;
- the Max Planck Institute for Psycholinguistics in Nijmegen;
- the DUBBLE beam line at the ESRF in Grenoble;
- the Princess Máxima Centre for paediatric oncology;
- NCB Naturalis;
- the institutes participating in the SURF Cooperative: KNMI, RIVM, TNO, the National Archives, the National Library, University of Humanistic Studies and the Police Academy;
- all TO2 institutes: TNO, NLR, MARIN, Deltares and WUR/DLO.
Note: applications from researchers with a temporary position may require a signature, as a guarantee, from a supervising staff member with a permanent contract. By signing the application, the supervising staff member declares that they are responsible for the awarded computing time after the project's expiration date.
Procedure for selecting the projects
Projects will be approved if they meet the above-described criteria and financial and human resources are available.
TwiXL

Analyzing billions of Dutch tweets to find out who tweeted what, where and when. In the TwiXL project we designed a solution in which Dutch tweets are collected and made available to researchers and students for large-scale analysis. On the TwiXL platform you can look up words and find out where, when and how often they are used, by whom, and with which other words they frequently occur. Over the years this Twitter archive has grown considerably, to several terabytes. The analysis infrastructure scales with it automatically to keep analysis performance constant.
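The core idea, looking up a word and seeing when and how often it occurs, can be sketched in a few lines. The tweet records and field names below are made up for illustration; the real platform runs the same kind of aggregation over terabytes of archived tweets.

```python
import re
from collections import defaultdict

# Toy archive; a real deployment would stream records from cloud storage.
tweets = [
    {"text": "Lekker weer vandaag", "month": "2023-06"},
    {"text": "Wat een weer, weer regen", "month": "2023-07"},
]

def monthly_counts(word: str, archive: list) -> dict:
    """Count whole-word occurrences of `word` per month in the archive."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    counts = defaultdict(int)
    for tweet in archive:
        counts[tweet["month"]] += len(pattern.findall(tweet["text"].lower()))
    return dict(counts)

print(monthly_counts("weer", tweets))
```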
Crunchbase

Scraping text data from the websites of 30,000 start-up companies, analyzing the data, and identifying which of these companies are developing products or services to limit CO2 emissions. In the Crunchbase project we designed a cost-efficient and scalable solution to collect website text data from 30,000 start-up companies. During collection the text data was automatically filtered and transformed, which reduced the data size and helped speed up the researcher's analysis. For the data analysis, compute infrastructure was deployed in an automated fashion, which enables easy rescaling when more compute power is needed. It also enables future reuse if more analysis is required for the Crunchbase project or for another project with similar infrastructure needs.
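Filtering during collection is what kept the corpus small. A minimal sketch of such a filter is shown below; the keyword list and the matching rule are assumptions for illustration, not the project's actual filter logic.

```python
import re

# Hypothetical keyword list; the real project's filter criteria differed.
CLIMATE_KEYWORDS = {"co2", "carbon", "emission", "emissions", "net-zero"}

def keep_page(text: str) -> bool:
    """Keep a scraped page only if it mentions a climate-related keyword."""
    tokens = set(re.findall(r"[a-z0-9-]+", text.lower()))
    return bool(tokens & CLIMATE_KEYWORDS)

pages = [
    "We build battery tech that cuts CO2 emissions in shipping.",
    "Our app helps you order pizza faster.",
]
print([keep_page(p) for p in pages])
```

Dropping irrelevant pages at collection time, rather than after storage, reduces both storage costs and downstream analysis time.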
Automated Machine Learning (AutoML)
Automatically selecting which machine learning algorithm to use, and how to tune it, in a data-driven way. Benchmarking machine learning models requires a complex orchestration of hundreds of compute-intensive tasks on a large infrastructure stack. In the AutoML project we co-developed a more cost-effective deployment of the AutoML benchmark framework on public cloud resources. This reduced the total runtime of the benchmark runs and saves infrastructure costs. In addition, we explored other tools and mechanisms available in the public cloud that could further improve the AutoML benchmark deployment.
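The orchestration pattern, fanning a grid of framework-by-dataset tasks out over workers, can be sketched as follows. The framework names, datasets and `run_benchmark` body are hypothetical stand-ins; a real run would launch containerized training jobs on cloud instances rather than local threads.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

# Hypothetical benchmark grid for illustration.
FRAMEWORKS = ["autosklearn", "tpot", "h2o"]
DATASETS = ["adult", "credit-g"]

def run_benchmark(task):
    """Placeholder for a real train-and-evaluate cycle on one task."""
    framework, dataset = task
    return {"framework": framework, "dataset": dataset, "status": "done"}

def run_all():
    # Build the full framework x dataset grid and run tasks concurrently;
    # on real cloud infrastructure each task would map to its own VM or
    # container, so the grid scales out instead of queuing locally.
    tasks = list(itertools.product(FRAMEWORKS, DATASETS))
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_benchmark, tasks))

results = run_all()
print(len(results))
```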
Expertise at SURF
The Scalable Data Analytics group uses cloud-native solutions to process and analyze (streaming) big data for research and data science. We have certified AWS solution architects on staff, as well as expertise and experience with open-source tools such as Docker, Kubernetes (container orchestration), Apache Kafka and Apache Spark. With these and other tools, we have built platforms to process, store and/or analyze (streaming) data. These solutions allow for maximum portability and can be deployed either on SURF's own private cloud or on any of the public clouds.