eTEC-BIG Scalable Machine Learning
An important application of AI is helping scientists deal with massive amounts of data. We aim to develop a tool to make this easier and more efficient, even for extremely large datasets.
Why are we working on this project?
Astronomical amounts of data
Astrophysics and high-energy physics are called big science for a reason: modern telescopes and supercomputers produce astronomical amounts of data. But more data does not automatically make the phenomena out there in space easier to understand. Just handling the data is already a huge challenge, and crunching all of it in search of answers would be prohibitively expensive.
Solving the inverse problem
One common and notoriously difficult problem in astrophysics and high-energy physics is the inverse problem: given an observation, how do we understand its cause? For example, Albert Einstein famously predicted that light from a remote galaxy is distorted by the gravity of a foreground object. This phenomenon is known as “gravitational lensing”. If we already know the properties of the foreground object, it is straightforward to calculate the effects of gravitational lensing. But through the telescope we can only observe the lensed result. So how do we infer the properties of the foreground object that caused it?
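The asymmetry between the two directions can be seen even in a toy one-parameter model (purely an illustration, not the project's actual lens physics): simulating an observation from known parameters is a single cheap function call, while recovering the parameter from a noisy observation already requires a search over many forward evaluations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical forward model standing in for lensing: given a lens
# "mass" parameter m, predict a distortion profile over radius r.
def forward(m, r):
    return m / (r + 1.0)  # toy deflection falling off with radius

r = np.linspace(0.1, 5.0, 50)
true_m = 2.0
obs = forward(true_m, r) + 0.05 * rng.normal(size=r.size)  # noisy "telescope" data

# Forward direction: one evaluation. Inverse direction: a search over
# candidate parameters, comparing each simulation to the observation.
candidates = np.linspace(0.5, 4.0, 200)
losses = [np.mean((forward(m, r) - obs) ** 2) for m in candidates]
best_m = float(candidates[int(np.argmin(losses))])
print(best_m)  # recovers a value near true_m = 2.0
```

In realistic lensing models the parameter space is high-dimensional, so this brute-force search becomes intractable, which is what motivates the simulation-based approach described below.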
What are we doing in this project?
Building a highly efficient AI model
In this eTEC-BIG project, researchers at the University of Amsterdam and NIKHEF collaborate with SURF and the Netherlands eScience Center to build a highly efficient AI model to tackle this problem. They use neural networks to directly learn marginal likelihood-to-evidence ratios from a sequence of simulations.
This approach is called Truncated Marginal Neural Ratio Estimation (TMNRE). It is highly effective because it avoids the heavy compute load of ‘traditional’ Monte Carlo simulations, which have the extra disadvantage that their results are usually impossible to convert into reusable algorithms.
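In spirit, neural ratio estimation trains a classifier to distinguish jointly drawn (θ, x) pairs from independently drawn ones; the classifier's logit then approximates the log likelihood-to-evidence ratio. The minimal sketch below is an illustration, not the project's implementation: the simulator, feature map, and training loop are all assumptions, and TMNRE's truncation and marginalization steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulator (an assumption): parameter theta -> noisy observation x.
def simulate(theta):
    return theta + 0.5 * rng.normal(size=theta.shape)

# Joint samples (theta, x) versus "marginal" samples where x is shuffled
# so that theta and x are independent.
n = 5000
theta = rng.uniform(-2.0, 2.0, size=(n, 1))
x = simulate(theta)
x_marg = rng.permutation(x)

# Binary classification task: label 1 for joint pairs, 0 for marginal pairs.
feats = np.vstack([np.hstack([theta, x]), np.hstack([theta, x_marg])])
labels = np.concatenate([np.ones(n), np.zeros(n)])

# A quadratic feature map plus logistic regression stands in for the
# neural network used in the actual method.
def phi(z):
    t, xx = z[:, :1], z[:, 1:]
    return np.hstack([np.ones_like(t), t, xx, t * xx, t**2, xx**2])

F = phi(feats)
w = np.zeros(F.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-F @ w))       # sigmoid
    w -= 0.1 * F.T @ (p - labels) / len(labels)  # gradient descent step

# The trained classifier's logit approximates
# log r(x; theta) = log p(x|theta) - log p(x).
def log_ratio(theta_val, x_val):
    return (phi(np.array([[theta_val, x_val]])) @ w).item()

print(log_ratio(1.0, 1.0) > log_ratio(-1.0, 1.0))  # matched pair should score higher
```

Once trained, evaluating the ratio for any new parameter–observation pair is a single cheap forward pass, which is what makes the approach amortizable across observations, in contrast to Monte Carlo runs that must start from scratch each time.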
Big compute infrastructure for big problems
However, this is easier said than done. Despite the effectiveness of the TMNRE algorithm, the approach still requires vast computing power. And the data are very diverse, which adds to the challenge. So SURF and NLeSC are helping the researchers to optimize their algorithms to leverage the power of the national supercomputing infrastructure. This involves developing a sophisticated workflow to carry out simulations and scaling up the neural networks on the supercomputer – first Cartesius, and then Snellius. Making the tool run as efficiently as possible requires further computational expertise from SURF.
Another challenge is generalization: the researchers also aim to solve inverse problems in other physical sciences. Part of the project is therefore making the tool capable of tackling a family of problems. The developers have made the software as modular as possible, so that scientists can easily adapt it to their own research. They are drafting two papers to present and explain this further.
What are the main activities?
- Development of the neural network tool
- Development of the data generation and handling pipeline
- Two papers: one explaining the method of the tool and potential applications; the other more technical, showing programmers how to use the code for specific purposes.
- Testing and optimizing the tool, working with real data from the researchers.
Who are we collaborating with?
- Radboud Universiteit