Machine-Learning Models from Large Cross-CPP Data

The Cross-CPP Data Analytics Toolbox provides specialised functions for the tasks that service providers identified as most critical to their interests. However, many analytics problems cannot be solved by a single pre-defined method, and the functions that address them should be applicable to a wide range of problems. Such functions need to combine several analytics techniques, test and evaluate alternative methods, and remain free to select the best method for each particular case. The need to examine a large pool of candidate methods in a toolbox is further motivated by the rapid pace of research and development in the analysis of big data sets.

The most successful methods for particular problems are often based on machine learning (ML), which builds complex analytics models from the available data (annotated or plain). Data scientists do not usually implement ML-based solutions from scratch. They employ sophisticated libraries and packages to initialise and construct the underlying ML structures and to train ML models. Python is used as the primary language for defining the data analytics task and for building and interacting with advanced ML systems. Even though the core training functionality may be implemented in a systems language such as C, there are often Python bindings or specialised interfaces that take advantage of the elegant syntax and expressive power of the high-level dynamic language. For example, the Python SciKit-Learn package is one of the most popular frameworks and implements a wide range of traditional ML methods.

As in many other disciplines, the analytics of very large data sets has recently become dominated by deep learning methods. Large companies such as Google, Facebook, Microsoft, and Baidu, as well as various research teams, have made available sophisticated frameworks (TensorFlow, Torch, etc.) that facilitate defining, training, and experimenting with various neural network architectures and settings. The training phase can be accelerated using fast implementations of linear algebra operations on modern GPUs. Some algorithms also benefit from distributing the computation across the nodes of today’s data centres. Consequently, the biggest cloud providers such as Amazon, Microsoft, and Google offer special subscriptions for running ML tasks (using the above-mentioned frameworks) in their computing facilities.

To support these trends, the Cross-CPP Data Analytics Toolbox defines a generic interface that can potentially access any function offered by the available frameworks and libraries. To demonstrate this functionality, our realisation within the Cross-CPP project interfaces with three existing ML and data analytics libraries (SciKit-Learn, TensorFlow, PyTorch) and specifies the protocols necessary to interlink other popular solutions. Furthermore, the final project demonstrators will show the applicability and scalability of the implemented solution by means of selected analytics functions operating on the data collected within the project (from cars, buildings, weather forecasts, etc.).
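
One way such a generic interface might reach an arbitrary library function is to resolve a dotted name at run time. The sketch below is an illustration only, not the Toolbox implementation: the helper name resolve_method is hypothetical, and a standard-library function stands in for an ML estimator so that the example runs without additional packages.

```python
import importlib


def resolve_method(dotted_path):
    """Return the object named by 'package.module.attribute'.

    Hypothetical helper: the Toolbox's actual lookup mechanism is not
    described in this document.
    """
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attribute', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attr_name)
    except AttributeError as exc:
        raise ValueError(f"{module_name} has no attribute {attr_name!r}") from exc


# A standard-library function stands in for a library method here.
mean = resolve_method("statistics.mean")
average = mean([2, 4, 6])
```

With SciKit-Learn installed, the same call with "sklearn.linear_model.SGDClassifier" would return the estimator class, which could then be instantiated with the parameters supplied by the caller. A deployed service would probably restrict the lookup to an allow-list of supported methods rather than importing arbitrary names.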

The ML model connector needs to distinguish between the phase of building an analytical model (for example, training a deep neural network) and the application of the model to new data. This distinction is reflected in the functional schema of the initial steps of analytics model preparation shown in the following figure.

The scheme shows the process of model initialisation (training) from the provided data. The service invocation needs to fully specify the method to be used (for example, the implementation of the Stochastic Gradient Descent algorithm from the SciKit-Learn package), the set of parameters necessary to start the process (for example, the structure, the batch size, and the learning rate for a feed-forward neural network), and the way the training data can be obtained (for example, a URI corresponding to a query run in the Big Data Marketplace and joined with additional metadata).
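
The three elements of a training invocation (method, parameters, and data source) could be captured in a small request structure. The following is a minimal sketch under stated assumptions: the class name TrainingRequest, its field names, the JSON encoding, and the example URI are all hypothetical and are not taken from the Toolbox specification.

```python
import json
from dataclasses import dataclass, field, asdict
from urllib.parse import urlparse


@dataclass
class TrainingRequest:
    """Hypothetical request payload; field names are illustrative."""
    method: str      # dotted name of the method to be used
    data_uri: str    # where the training data can be obtained
    params: dict = field(default_factory=dict)  # start-up parameters

    def validate(self):
        # Reject requests that do not fully specify method and data source.
        if "." not in self.method:
            raise ValueError("method must be a dotted 'package.name' path")
        parsed = urlparse(self.data_uri)
        if not parsed.scheme or not parsed.netloc:
            raise ValueError("data_uri must be an absolute URI")

    def to_json(self):
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


request = TrainingRequest(
    method="sklearn.linear_model.SGDClassifier",
    data_uri="https://marketplace.example/query/123",  # placeholder URI
    params={"alpha": 0.0001, "max_iter": 1000},
)
payload = request.to_json()
```

Validating the request before it is sent lets an incomplete specification fail early, before any training resources are allocated.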

The ‘model ready’ response to the status check indicates that the model can be used for the analysis of new data. The invocation specifies the model ID and a URI of the data to be processed. The caller is responsible for the format of the data sent, which has to correspond to the format of the data used in the model training phase. The form of the analysis result matches the form in which the expected results were provided in the training data. The module also implements additional functionality enabling future updates of existing models (for example, further training of a neural network on newly acquired data), as well as simple model management tasks (removing previous models that are no longer needed, stopping or killing the training process when an issue is identified in the data, etc.).
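
The model life cycle described here (status check, application once the model is ready, stopping, and removal) can be illustrated with a small in-memory registry. This is a sketch only: the class ModelRegistry, its method names, and the status values other than ‘model ready’ are assumptions made for the example, and a plain function stands in for a trained model.

```python
import uuid
from enum import Enum


class Status(Enum):
    TRAINING = "training"
    READY = "model ready"
    STOPPED = "stopped"


class ModelRegistry:
    """In-memory stand-in for the model store (hypothetical API)."""

    def __init__(self):
        self._models = {}

    def start_training(self):
        # Register a new model and return the ID used in later calls.
        model_id = str(uuid.uuid4())
        self._models[model_id] = {"status": Status.TRAINING, "model": None}
        return model_id

    def finish_training(self, model_id, model):
        entry = self._models[model_id]
        if entry["status"] is not Status.TRAINING:
            raise RuntimeError("training is not in progress")
        entry["status"] = Status.READY
        entry["model"] = model

    def status(self, model_id):
        return self._models[model_id]["status"]

    def apply(self, model_id, records):
        # Only a model reported as ready may be applied to new data.
        entry = self._models[model_id]
        if entry["status"] is not Status.READY:
            raise RuntimeError(f"model is not ready: {entry['status'].value}")
        return [entry["model"](record) for record in records]

    def stop(self, model_id):
        # Stopping only affects a training run that is still in progress.
        entry = self._models[model_id]
        if entry["status"] is Status.TRAINING:
            entry["status"] = Status.STOPPED

    def remove(self, model_id):
        del self._models[model_id]


registry = ModelRegistry()
model_id = registry.start_training()
registry.finish_training(model_id, lambda x: x > 0.5)  # toy "model"
predictions = registry.apply(model_id, [0.2, 0.9])
```

In this sketch, applying a model that is still training raises an error instead of returning partial results, which mirrors the requirement to wait for the ‘model ready’ response.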