Big Data Services
Researchers whose work involves analysing and processing big data can now access one of the largest Hadoop clusters in the Netherlands, enabling them to use several large public data sets and multiple frameworks, such as Apache Spark, Hive, Pig and HBase. They can also do data analyses using NoSQL.
Big data analysis
Big data refers to collections of data so large or complex that standard data management and processing resources no longer suffice. These data collections require a different set of knowledge and tools. SURFsara offers such tools in the form of Hadoop and NoSQL, which enable researchers to study complex, structured or unstructured data sets and are particularly useful for research in the fields of linguistics, bioinformatics and the social sciences.
Hadoop shows major promise as a solution for big data analysis. Originally based on the MapReduce framework developed by Google, Hadoop is easy to program and enables users to quickly come to grips with even extremely large structured and unstructured data sets. SURFsara's Hadoop cluster is among the biggest in the Netherlands, comprising 170 nodes, 1,400 cores and 2 petabytes of data storage (as at March 2015). Cluster users also gain access to several public data sets, including Wikipedia (English and Dutch) and the large CommonCrawl data set of Internet pages. These data sets are accessible locally.
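To give an impression of the MapReduce model that Hadoop builds on, here is a minimal sketch of a word count in plain Python. It runs on a single machine and does not use Hadoop or the SURFsara cluster; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample texts are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(documents):
    """Chain the three steps; on a real cluster each step runs in parallel across nodes."""
    return reduce_phase(shuffle(map_phase(documents)))


# Hypothetical sample input, standing in for a large text collection.
sample = ["big data needs big tools", "data tools"]
counts = word_count(sample)
```

On a cluster, the map and reduce steps are distributed over many nodes and the framework takes care of the grouping in between, which is what allows the same simple program structure to scale to very large data sets.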
YARN resource management
In 2014 SURFsara upgraded to Hadoop 2.0, whose resource manager is YARN (short for Yet Another Resource Negotiator). The cluster can now run multiple frameworks, such as Apache Spark, thereby enabling real-time data analysis and the use of SQL. Other available frameworks are Pig (a high-level language on top of MapReduce), Hive (SQL on MapReduce) and Giraph (iterative graph analyses).
NoSQL is the umbrella name for a new approach to databases. The ‘No’ stands for ‘not only’, referring to the fact that relational databases are not always the best option. NoSQL is particularly useful for researchers who work with large volumes of data with no definite structure and who are doing so without clear-cut predetermined research questions. SURFsara's NoSQL service is tailored to users based on their own particular needs. For more information, please contact us at email@example.com.
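As an illustration of what "no definite structure" can mean in practice, the following minimal sketch stores records with differing fields side by side and queries them without a predefined schema. It uses plain Python dictionaries rather than an actual NoSQL database; the records, field names and the find helper are hypothetical and do not describe SURFsara's NoSQL service.

```python
# Hypothetical records: each one carries a different set of fields,
# which a fixed relational table layout would not accommodate easily.
records = [
    {"id": 1, "text": "hallo wereld", "lang": "nl"},
    {"id": 2, "text": "hello", "hashtags": ["bigdata"]},
    {"id": 3, "user": {"name": "anna"}, "lang": "nl"},
]


def find(collection, **criteria):
    """Return the records whose fields match all criteria; missing fields never match."""
    return [
        doc
        for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


# Select the records marked as Dutch, regardless of which other fields they have.
dutch_ids = [doc["id"] for doc in find(records, lang="nl")]
```

Because no schema is fixed in advance, new fields can be added to later records as research questions take shape, without restructuring the data already stored.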
Data storage on demand
As a Hadoop user you get a standard 200 gigabytes of storage capacity to use for your own files (home file system), which are backed up daily. Besides this, you also have access to 8 terabytes of temporary storage capacity. Files in temporary storage are not backed up and are deleted after 2 weeks. If you require additional storage, this can also be arranged.
Support & consultancy
When you use our services, you can also turn to us for support, for example if you need advice on the architecture and application of technologies. Our team can ensure you get the most out of our services. We can also organise introductory courses on the use of big data services upon request.
Our helpdesk is available by telephone and email, but can also assist you in person. If you have a question or want to report a problem, please send an email to firstname.lastname@example.org or phone +31-208001400. The helpdesk can be reached during office hours (9:00–17:00).
For specific advice about optimising your code for improved performance, contact our consultancy service.
You can find more information about big data research in the following best practices:
- Meertens Institute: Two billion tweets used as research materials
- Informatics Institute (University of Amsterdam): Intelligent search engines