The Cross-CPP Data Analytics Toolbox provides specialised functions for particular tasks that service providers identified as the most critical for their respective interests. At the same time, many analytics tasks cannot simply be solved by a single pre-defined method, yet are generally applicable to a large set of problems. Such tasks require combining various analytics techniques, testing and evaluating alternative methods, and retaining the freedom to select the best method for each particular case. The need to examine a large pool of potential methods available in a toolbox is also motivated by the extreme pace of research and development in the advanced analysis of big data sets.
The most successful methods for particular problems are often based on machine learning (ML) concepts that build complex analytics models from available data (annotated or plain). Data scientists do not usually implement ML-based solutions from scratch; they employ sophisticated libraries and packages to initialize and construct the underlying ML structures and to train ML models. Python is used as the primary language to define the data analytics task and to build and interact with advanced ML systems. (Even though the core training functionality can be implemented in a system language such as C, there are often Python bindings or specialised interfaces taking advantage of the elegant syntax and expressive power of the high-level dynamic language.) For example, the Python SciKit-Learn package is one of the most popular frameworks, implementing a wide range of traditional ML methods.
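As a purely illustrative taste of what such a library call looks like, the snippet below trains a SciKit-Learn classifier on a toy data set; the data and parameter choices are invented for this example and have nothing to do with real CPP data:

```python
# Illustrative only: training a classical ML model with scikit-learn.
from sklearn.linear_model import SGDClassifier

# Toy data: two features per sample, binary labels
# (a stand-in for real CPP sensor data).
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y = [0, 0, 1, 1]

# A Stochastic Gradient Descent classifier, one of the "traditional"
# methods the toolbox can delegate to.
model = SGDClassifier(max_iter=1000, random_state=42)
model.fit(X, y)

print(model.predict([[0.05, 0.1], [0.95, 1.0]]))
```

A few lines suffice to define, train, and apply a model, which is exactly the style of workflow the toolbox interface wraps.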
As in many other disciplines, the field of analytics of very large data has also recently become dominated by deep learning methods. Large companies such as Google, Facebook, Microsoft, or Baidu, as well as various research teams, have made sophisticated frameworks available (TensorFlow, Torch, etc.) that facilitate the definition of, training of, and experimentation with various neural network architectures and settings. The training phase can then be accelerated using the very fast implementation of linear algebra operations on modern GPUs. Some algorithms also benefit from the distributed nature of computation nodes in today’s data centres. Consequently, the biggest cloud providers such as Amazon, Microsoft, or Google offer special subscriptions for running ML tasks (using the above-mentioned frameworks) in their computing facilities.
To support these modern trends, the Cross-CPP Data Analytics Toolbox defines a generic interface that can potentially access any function available in the supported frameworks or libraries. To demonstrate the functionality, our realisation interfaces three existing ML and data analytics libraries within the Cross-CPP project (SciKit-Learn, TensorFlow, PyTorch) and specifies the protocols necessary to interlink other popular solutions. Furthermore, the final project demonstrators will show the applicability and scalability of the implemented solution by means of selected analytics functions operating on the data collected within the project (from cars, buildings, weather forecasts, etc.).
The functionality of the ML model connector needs to distinguish between the phase of building an analytical model (for example, training a deep neural network) and the application of the model to new data. This is also reflected in the functional schema of the initial steps of the analytics model preparation shown in the following figure.
The scheme demonstrates the process of model initialization (training) from the provided data. The service invocation needs to fully specify the method to be used (for example, the implementation of the Stochastic Gradient Descent algorithm from the SciKit-Learn package), a set of parameters necessary to start the process (for example, the structure, the batch size, and the learning rate for a feed-forward neural network), and the way training data can be obtained (for example, a URI corresponding to a query run in the Big Data Marketplace, joined with additional metadata).
The ‘model ready’ response to the status check indicates that the model can be used for the analysis of new data. The invocation specifies the model ID and a URI of the data to be processed. The caller must ensure that the format of the sent data corresponds to the data used in the model training phase. The analysis result matches the way expected results were provided in the training data. The module also implements additional functionality enabling future updates of existing models (for example, further training of a neural network on newly acquired data), as well as simple model management tasks (removal of previous models that are no longer needed, stopping/killing the training process in the case of an identified issue in the data, etc.).
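The lifecycle just described (train, check status, apply, manage) can be sketched as a minimal mock in Python. All class, method, and status names below are assumptions made for illustration; they are not the actual Cross-CPP connector API:

```python
# Conceptual sketch of the ML model connector lifecycle; not the real API.
import uuid


class ModelConnector:
    """Mimics the train / status-check / apply / manage cycle."""

    def __init__(self):
        self._models = {}

    def train(self, method, params, data_uri):
        # A real implementation would dispatch to e.g. SciKit-Learn's SGD
        # implementation and fetch training data from the given URI.
        model_id = str(uuid.uuid4())
        self._models[model_id] = {"method": method, "params": params,
                                  "data": data_uri, "status": "training"}
        # Training is assumed to finish immediately in this sketch.
        self._models[model_id]["status"] = "model ready"
        return model_id

    def status(self, model_id):
        return self._models[model_id]["status"]

    def predict(self, model_id, data_uri):
        if self.status(model_id) != "model ready":
            raise RuntimeError("model not ready")
        # A real implementation would run the trained model on the new data.
        return {"model": model_id, "input": data_uri, "result": "..."}

    def remove(self, model_id):
        del self._models[model_id]


connector = ModelConnector()
mid = connector.train("sklearn.SGDClassifier", {"alpha": 1e-4},
                      "marketplace://query/1234")
print(connector.status(mid))  # 'model ready'
```

The essential point is the separation of concerns: the caller only names a method, supplies parameters and a data URI, and later refers to the resulting model by its ID.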
Last time, we introduced you to the fundamental building blocks of a weather model and demonstrated how important it is to have a high-resolution, fine-grained weather model in order to project the underlying topography.
Another important aspect is that the quality of a weather model is highly dependent on what often gets referred to as “ground truth”.
Ground truth describes actual measurements and observations of weather stations that are used for initialization of the weather model, so it “knows” where to start from. Besides ground weather stations, other measurement methods like weather balloon soundings, radars or satellites are used to complement the initialization data set.
The more measured data the weather model has from its point of initialization, the better the forecast will be as there will be less inaccuracy to begin with. This is especially true for the first forecast hours of very high resolution models because they are able to reflect small-scale features. If the starting point of the model is already a poor guess, any forecast resulting from it will be likely even worse than that.
The density of weather observations across the world varies greatly: some countries have hundreds of weather stations, while others contain regions that are almost unknown in terms of measured meteorological parameters.
Cross-CPP can help with that, as it aims to provide a platform where service providers like us can buy weather data from sensors that originate from cars, buildings, or other technology. These sensors require different handling than regular weather station data and must undergo a special plausibility check that we are developing within the project. However, because of their density, their data will still help with the model initialization process, especially in regions and areas where “ground truth” is currently rare.
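As a rough illustration of what such a plausibility check could look like (the actual check developed in the project may work quite differently), one simple approach is to discard readings that deviate too far from the local median:

```python
# Hypothetical sketch of a plausibility check for crowd-sourced sensor data.
from statistics import median


def plausible_readings(readings, max_deviation=5.0):
    """Keep temperature readings within max_deviation degrees of the median.

    readings: list of (sensor_id, temperature_celsius) tuples from one area.
    """
    if not readings:
        return []
    ref = median(t for _, t in readings)
    return [(sid, t) for sid, t in readings if abs(t - ref) <= max_deviation]


# Four cars report ~12 degrees C; one parked in the sun reports 31 degrees C.
area = [("car1", 12.3), ("car2", 11.8), ("car3", 12.9),
        ("car4", 31.0), ("car5", 12.1)]
print(plausible_readings(area))
# [('car1', 12.3), ('car2', 11.8), ('car3', 12.9), ('car5', 12.1)]
```

Because many car sensors cover the same area, robust statistics like the median can filter out individual implausible readings while keeping the dense coverage that makes this data valuable.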
Individualized weather forecasts
Another aspect of access to different data sources is that we are able to enhance our services for the data providers themselves. For example, imagine building owners who want to automate the operation of window blinds or optimize energy usage (heating, cooling, etc.): these owners would benefit greatly from a tailored weather forecast for their buildings that takes the special local meteorological characteristics into account and adjusts for them. Smart buildings usually have a weather station located on the roof, whose data we can use to refine our forecast for that specific building after collecting at least about a year of measurements from this station.
Weather sensors and weather data can be derived from various sources and put to use in a variety of use cases. In our next blog, we will show you some other products we are working on, like weather-based navigation and a car-sensor-derived precipitation map!
Thanks for reading and stay with us 🙂
Your Meteologix Team and Cross-CPP consortium partners
We all have experience of online shopping on Amazon, and probably of (binge) watching videos on YouTube or Netflix. However, we hardly think about the suggestions that these kinds of online platforms create for us, which are completely or partially tailor-made.
In Data Science and AI, we usually distinguish between two types of recommendation systems: collaborative and content-based. Collaborative systems predict what individual users might like based on data gathered about what other users have liked, while content-based systems predict based on what the individual user liked in the past. Both approaches have their pros and cons, which is why we came up with a context-related recommendation system. For example, it would be great to have music recommendation systems consider places of interest based on location and time, the activity the user is performing, and, at the extreme, even the mood of the listener. This is where the value of knowledge modelling, or context modelling, comes in. In Cross-CPP we have developed an ontology-based algorithm that makes suggestions about data that can be relevant for our data customers when they are out looking for CPP data.
How does this work?
The first step was to develop an ontology detailing the relationships among CPP sensors and signals, which models the environment of the entire CPP system. We took the simple concept of binding sensor signals with relationships and dependencies and made an informative context model that holds all the relationship information about the sensor signals hosted by the Cross-CPP Big Data Marketplace. The second step involved using the ontology to develop an algorithm that receives the data customer’s selection of signals and uses the ontology to suggest further signals that could also be selected.
The figure below shows the part of the vehicle-specific ontology model that contains sensor measurement values and static information, or, as we call it, basic CPP information gathered from vehicles. The relations we introduced primarily indicate the relationships between different sensor measurement values, which is of high importance for generating recommendations for the data customer. As you can see, two very frequently used relationships stand out: the “affectedBy” relationship, which indicates that a specific sensor value may be influenced by another sensor value – e.g. the indoor temperature may be influenced by the air conditioning mode – and the “relatedWith” relationship, which indicates that there is a correlation between two sensor values – e.g. there is a direct correlation between engine RPM and engine coolant temperature.
Figure 1 Vehicle specific context model
Based on the user’s current interest, we then query this model to determine what kind of sensor signals should be recommended to him/her and which can additionally be chosen and added to the shopping basket. A similar model creation process is under construction for sensor signals collected by smart buildings.
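To make the idea tangible, here is a minimal sketch of such a recommendation step in plain Python, using a hand-written relation list instead of a real ontology store; the signal names and relations are modelled on the examples above:

```python
# Minimal sketch of the ontology-based recommendation idea.
# The relation triples below are illustrative, not the project ontology.
RELATIONS = [
    ("IndoorTemperature", "affectedBy", "AirConditioningMode"),
    ("EngineRPM", "relatedWith", "EngineCoolantTemperature"),
    ("WindshieldWiperState", "relatedWith", "RainSensorIntensity"),
]


def recommend(selected_signals):
    """Suggest signals connected to the customer's current selection."""
    suggestions = set()
    for subject, _relation, obj in RELATIONS:
        if subject in selected_signals:
            suggestions.add(obj)
        if obj in selected_signals:
            suggestions.add(subject)
    return sorted(suggestions - set(selected_signals))


print(recommend({"EngineRPM"}))  # ['EngineCoolantTemperature']
```

In the real system, the relation triples live in the ontology and the query traverses “affectedBy” and “relatedWith” links, but the principle is the same: signals connected to the current selection become candidate recommendations.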
In the next context related blog, we will give a more detailed look on the context monitoring and extraction aspects of our Context Analysis journey, so stay tuned!
The Siemens service scenario is related to e-mobility, which is of great business interest today. Our desired service is built on data exchange between vehicles and buildings. It provides end-users (vehicle drivers) with the most suitable e-chargers according to their needs and building capabilities.
The idea of this service is to send simple information about the presence of a charging station inside a building or outside (public parking lots, airports, hospitals) to the vehicle.
Real-time data is used in the communication with a car/building about the occupancy of e-chargers: the vehicle sends out information about its battery capacity, which, together with its current position and speed, could be used to calculate the time of arrival and to reserve an e-charger for this specific car. Within the scope of the project, such a reservation will be made manually by the vehicle driver via the application.
Besides the actual connection to the socket and the battery status of the car, the energy performance of the building is also considered. It is not possible to charge vehicles without limit while ignoring the building’s own power consumption, expected power load, or input power.
Energy output has to be distributed in a way that sufficiently satisfies multiple users. We have to avoid the situation where, on the one hand, a customer fully charges the vehicle and then keeps blocking the socket for other users, while, on the other hand, there is not enough energy to charge the rest of the customers. Energy has to be divided accordingly and the energy flow controlled systematically.
Putting together vehicle information and requirements, energy will be distributed based on a scheme chosen by the service provider (first come, first served vs. fair distribution to those in need, where a vehicle with 2% battery left takes precedence over one that is 80% charged, etc.).
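The two allocation schemes mentioned can be sketched as follows; the vehicle data model and field names are assumptions made for this example:

```python
# Sketch of the two charging allocation schemes named above.


def order_fcfs(vehicles):
    """First come, first served: order by arrival time."""
    return sorted(vehicles, key=lambda v: v["arrival"])


def order_fair(vehicles):
    """Fair distribution: vehicles with the lowest battery charge first."""
    return sorted(vehicles, key=lambda v: v["battery_pct"])


queue = [
    {"id": "A", "arrival": 1, "battery_pct": 80},
    {"id": "B", "arrival": 2, "battery_pct": 2},
    {"id": "C", "arrival": 3, "battery_pct": 40},
]

print([v["id"] for v in order_fcfs(queue)])  # ['A', 'B', 'C']
print([v["id"] for v in order_fair(queue)])  # ['B', 'C', 'A']
```

The service provider simply swaps the ordering policy; under the fair scheme, the nearly empty vehicle B jumps the queue ahead of the almost fully charged vehicle A.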
An indication that charging has finished will be available in the marketplace, derived from the vehicle’s movement; in addition, the building will report that the e-charger has been unplugged.
Once these conditions have been evaluated, the e-charger will again be listed as available for the next customer.
The future of mobility is our biggest passion. We are determined to make it come alive for you and, more importantly, usable! The Cross-CPP project will make numerous new applications and services conceivable!
For example, could you imagine that your vehicle contributes to predicting the weather for others? When you are on the road with your Volkswagen, your vehicle provides various information: you are informed about the outside temperature, and a rain sensor detects the intensity of rainfall so that your wipers adapt accordingly. Even the electronic stabilization program (ESP) might be activated automatically. This kind of information sure comes in handy for you personally, but it is even more useful when combined with other vehicle data!
Combined with data from other vehicles, applications can create weather reports for your planned route ahead. And in case the transmitted data of others indicates a weather risk, the system can suggest alternatives to your previously planned route. A real milestone in autonomous driving!
Sounds convenient? Applications that result from the EU-funded Cross-CPP project could look quite similar: the project links data from various industry representatives. Besides Volkswagen, companies like Siemens participate in the funded project, providing information that revolves around, for example, building automation or charging infrastructure.
The graphic above illustrates the corresponding process:
- Product manufacturers and additional partners generate and transmit data.
- Data is harmonized into CPP-Format.
- Data is saved and made available in the cloud.
- Depending on the desired service, data is made accessible and usable.
- That way new applications and services develop, benefiting all of us.
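To illustrate the harmonization step of this process conceptually, the sketch below maps vendor-specific field names onto one common record layout; the actual CPP-Format and the field names used here are not taken from the project specification:

```python
# Hypothetical illustration of harmonizing heterogeneous source data into
# a common record layout (the real CPP-Format will differ).
FIELD_MAP = {
    "car": {"temp_out": "outside_temperature", "ts": "timestamp"},
    "building": {"roof_temp": "outside_temperature", "time": "timestamp"},
}


def harmonize(source_type, record):
    """Translate a vendor-specific record into the common layout."""
    common = {}
    for vendor_field, common_field in FIELD_MAP[source_type].items():
        if vendor_field in record:
            common[common_field] = record[vendor_field]
    common["source"] = source_type
    return common


print(harmonize("car", {"temp_out": 11.5, "ts": "2019-06-01T12:00:00Z"}))
```

Once car and building records share one layout, a service provider can query “outside_temperature” without caring which kind of CPP produced the value, which is the whole point of the standardized format.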
The usage of data by different cooperation partners and the standardized data format offer many possibilities, ranging from “helpful” to “entertaining”. Locating charging infrastructure for electric vehicles and notifications about remaining charging times are only a few examples of further possible applications.
The collection and integration of data coming from different sources will probably be one of the key elements of many future markets and services, and is indeed the vision buttressing all efforts being made in Cross-CPP. Still, as any analyst would tell you, having data is only a necessary condition, not a sufficient one: what is also required are capabilities to analyse and manipulate those data. This realisation was the seed behind the introduction of the CPP Data Analytics Toolbox, a suite of modules designed to simplify the analysis of data, covering everything from basic statistical functions to complex predictive models.
Yet, could the ambition be to provide a toolbox able to solve any foreseen (and yet to be foreseen) analytics task, over data whose nature will evolve and change, while satisfying the needs of services yet to be specified? This is clearly beyond the reach of any three-year research project. Furthermore, some service providers will prefer to resort to their in-house algorithms and models, especially when these are part of their core business – to illustrate, a weather forecast company would not rely on external models to predict tomorrow’s rain. Instead, the project decided to follow a different strategy: provide basic, yet comprehensive tools that allow service providers to rapidly develop prototypes and test ideas.
The Data Analytics Toolbox is based on a modular structure, with different components offering different types of analysis; yet all of them share the same way of communicating with the user, and of retrieving data from and returning results to the system. Here we start reviewing these modules, focusing on two of them: trajectory analysis and network analysis.
Trajectories Analysis Component. The concept of “trajectory analysis” is a very general one, encompassing many different analyses of data representing a spatio-temporal evolution. With the exception of buildings, all CPPs composing the Cross-CPP system are expected to move at some point in their lives. With these concepts in mind, this component aims at providing a set of basic tools to simplify the handling and manipulation of this mathematical object. On the one hand, this includes a set of functions to analyse trajectories in an individual fashion, i.e. without considering their interconnections. On the other hand, a second level deals with the analysis of multiple trajectories by taking into account the relationships between them, for instance to detect groups of similar trajectories, or the presence of causal relationships between them.
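As a toy illustration of the individual-trajectory level, one very simple way to compare two trajectories is to resample them to the same number of points and average the pointwise distances; the component itself may use more elaborate similarity measures:

```python
# Toy trajectory-similarity sketch: resample, then average point distances.
from math import hypot


def resample(trajectory, n):
    """Pick n points evenly spaced along the trajectory's point list."""
    step = (len(trajectory) - 1) / (n - 1)
    return [trajectory[round(i * step)] for i in range(n)]


def mean_distance(traj_a, traj_b, n=10):
    """Average Euclidean distance between two resampled trajectories."""
    a, b = resample(traj_a, n), resample(traj_b, n)
    return sum(hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / n


straight = [(x, 0.0) for x in range(21)]
shifted = [(x, 1.0) for x in range(21)]
print(mean_distance(straight, shifted))  # 1.0
```

A small mean distance marks two trajectories as candidates for the same group, which is the building block for the multi-trajectory level (clustering similar trajectories).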
Network analysis. Sensors in the Cross-CPP ecosystem are organised in complex interaction structures. These structures may be physical: for instance, sensors in a car can be connected through the CAN bus and can therefore directly share information. Yet such structures can also be functional, i.e. the result of the fact that sensors are embedded in a common context. To illustrate, two temperature sensors in two different cars can yield the same (or very similar) time series, provided the two cars travel along similar paths. From a mathematical point of view, such connectivity networks can be analysed by means of complex network theory, a statistical-physics understanding of classical graph theory. Complex networks have been used, for instance, to assess and reduce the vulnerability of the resulting communication patterns, or to optimise the spread of new information in the system. This component provides several functions to both manage and analyse networks, like the extraction of metrics or the identification of groups of strongly connected objects.
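A minimal sketch of the functional-network idea: link sensors whose time series are strongly correlated, as in the two-cars example above. The threshold and the toy data are invented for illustration:

```python
# Building a functional sensor network from time-series correlations.


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def build_network(series, threshold=0.9):
    """Link sensors whose time series are strongly correlated."""
    names = list(series)
    edges = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(series[a], series[b])) >= threshold:
                edges.add((a, b))
    return edges


series = {
    "carA_temp": [10, 11, 12, 13, 14],
    "carB_temp": [10.2, 11.1, 12.0, 13.2, 13.9],  # similar route
    "carC_rpm": [3000, 1500, 2500, 1200, 2800],
}
print(build_network(series))  # {('carA_temp', 'carB_temp')}
```

On top of such a graph, complex-network metrics (degree, centrality, community structure) can then identify groups of strongly connected sensors, as the component offers.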
The project Cross-CPP deals with cross-sectorial Cyber Physical Products – CPPs in short – such as vehicles and smart buildings. CPPs can have many sensors that collect information about the CPP’s environment and use.
The project offers a big data marketplace as a “One-Stop-Shop” to data customers who want to tap into the enormous opportunity that arises from collecting data from various cross-sectorial CPPs. But is it enough just to collect data from a CPP and use it for different applications? Can we be sure that the data coming from a CPP is not influenced by other factors, such as the weather, the geographical location, or simply the colour of a car?
Have you ever heard of the word “context”? According to the Oxford dictionary, context is “the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood.” In the artificial intelligence domain, the concept of context is usually defined as the “generalization of a collection of assumptions”. For Cross-CPP, “context can be a set of information which characterizes the situation under which sensor data are obtained (e.g. the situation under which the data from a temperature sensor in a car is obtained)”. Sounds difficult, right? Well, let’s take a simple example to understand what context means for a vehicle. Did you know that the mobile sensor networks of modern vehicles can produce over 4000 signals per second per vehicle? This is a huge amount of data, isn’t it? Now imagine if this raw sensor data came with additional information, such as the circumstances under which the data was collected, or the factors that can influence the sensor measurements being observed from vehicles. Such answers can be provided by context. Context information is additional information that data customers get when they are looking for data collections in the Big Data Marketplace. Still not quite clear?
Let’s say we have a black car equipped with an exterior temperature sensor: wouldn’t it be great if we could retrieve data from this temperature sensor and provide it to a data customer who might build a new service making use of this data? We also know that many factors influence the value measured by the sensor: the colour of the car (black), the current location of the car, such as its altitude, the height at which the sensor is installed in the car, the time of day or year, and many other factors. All this information, which is either the car’s metadata or can be measured with other sensors, we will from now on call enhanced monitored data. Furthermore, we can deduce certain situations for the temperature value based on this enhanced monitored data, which defines the context of this car. One such situation is that the temperature value measured by a black car, with its sensor located 20 cm above ground level, while the car was standing at midday on a summer day in the south of France, is not very reliable.
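The deduction in this example can be sketched as a simple rule over enhanced monitored data; the field names and thresholds below are assumptions made purely for the illustration:

```python
# Illustrative context rule over "enhanced monitored data".


def temperature_reliable(reading):
    """Flag an exterior temperature reading using a simple context rule."""
    # A dark, stationary car in midday summer sun heats up its own sensor,
    # so the measured value is deemed unreliable.
    if (reading["car_colour"] == "black"
            and reading["sensor_height_cm"] <= 20
            and reading["speed_kmh"] == 0
            and 11 <= reading["hour"] <= 15
            and reading["season"] == "summer"):
        return False
    return True


parked = {"car_colour": "black", "sensor_height_cm": 20, "speed_kmh": 0,
          "hour": 12, "season": "summer"}
moving = dict(parked, speed_kmh=80)

print(temperature_reliable(parked))  # False
print(temperature_reliable(moving))  # True
```

The same reading becomes trustworthy once the context changes (the car is moving and airflow cools the sensor), which is exactly the kind of situational knowledge context adds on top of raw sensor values.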
We hope that the above example is clear enough to convey the concept of context in the Cross-CPP project. We would also like to use context not only on the data collection side but also for the security aspects of Cross-CPP modules and the usage of services, to provide the CPP user/owner with flexible (context-based) protection for his or her CPP information.
For Cross-CPP modules, context will be extracted as specified by the request from the data customer or as needed by internal modules such as the Cross-CPP security module. And as we learned above, in order to extract context, the extractor will use enhanced monitored data (a combination of metadata for a particular CPP and raw sensor data) together with rules and defined context models.
In case you are wondering how all this is going to be realised step by step within the scope of the project, we are offering several blogs on the context topic, and we will make sure that you get enough insights to work with context! In the following blogs, we will explain how data customers can work with enhanced monitored and contextual data. We will explain how a context tool can extract context data and how it can help the data customer make informed decisions. Furthermore, we will explain context-based security for Cross-CPP modules, where we will learn how context can help improve security for CPP owners. Last but not least, we will also provide insights into context-related tools that give service providers like Meteologix a toolkit they can use to improve their innovative services. All of these interesting topics will be provided as a series of subsequent blogs, so …
Stay tuned 🙂
Your ATB Team and Cross-CPP consortium partners
The Cross-CPP (Cross-Cyber Physical Products) project and its consortium partners aim to build a cross-sectorial marketplace that offers data from various sources.
Service providers can then use this data to enhance their services and offer them, for example, back to the data owner – e.g. if you are driving a car and opt to share its outside temperature data.
We at Meteologix, as a meteorological service provider, can use this data to enhance our own “SwissHD” forecast system and, in turn, provide you with a tailored and even better weather forecast for your car and travel.
To understand this whole process, it might be helpful to dig a little into the theory of how modern weather forecasting is done in the first place.
Modern forecast systems are highly complex computer programs consisting of thousands of lines of code. With the help of algorithms that process vast amounts of data for grid points around the world, they compute a forecast for a specific location at a certain point in time.
What’s a grid point then?
Imagine laying a mesh around the globe – then each node within this mesh is a grid point.
For each grid point a forecast is calculated that takes the height and other geographical features of this specific location into account. Of course, you can also get a forecast for any other location that is not a grid point: this is achieved by interpolation between nearby grid points.
Thus, the farther apart the grid points are and the more coarse-meshed a weather model is, the poorer its resolution and the more interpolation is needed, and vice versa.
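For the curious, interpolation between grid points can be as simple as a bilinear weighting of the four surrounding points; real forecast systems use more sophisticated schemes, so this is only a conceptual sketch:

```python
# Bilinear interpolation between four surrounding grid points.


def bilinear(x, y, corners):
    """Interpolate at (x, y) inside the unit square from corner values.

    corners: values at (0,0), (1,0), (0,1), (1,1).
    """
    v00, v10, v01, v11 = corners
    return (v00 * (1 - x) * (1 - y) + v10 * x * (1 - y)
            + v01 * (1 - x) * y + v11 * x * y)


# Temperatures (in degrees C) at the four surrounding grid points.
print(bilinear(0.5, 0.5, (10.0, 12.0, 14.0, 16.0)))  # 13.0
```

Note how the interpolated value is just a weighted blend of the corners: whatever terrain detail lies between the grid points is invisible to the model, which is why coarse meshes produce poor forecasts in complex terrain.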
There are a lot of weather models on the market, and they differ tremendously in resolution. Probably the most famous and widely used one, the Global Forecast System (GFS), has a grid point only every ~22 km in mid-latitudes. Its data is free to use, which is why it is the basis of a lot of (low-quality) weather apps.
You can easily observe the problems that arise from low resolution in the following comparison of pictures of the terrain in Liechtenstein, showing what each model can “see” and differentiate with its grid point density.
Let’s take a look how well these different model resolutions reflect the topography of Liechtenstein:
The first one is a model with grid points every 22 km, then one with grid points every 13 km, then a ~7 km grid, and the last one is our Meteologix Swiss HD 1 km model. The differences are quite obvious: the coarse-meshed models capture only two to four different terrain heights, as these get averaged and smoothed out. This means that these models take only a few different regional features into account when computing their forecast, which leads to very biased weather predictions. The two more fine-grained models differentiate the regional ground features much better.
Of course, there is more to a weather model than just the density of its grid points; its inner logic and formulas are very important as well. But if the mesh is too broad, the underlying topography cannot be projected realistically. The same applies to forecasts of small-scale weather events such as showers and thunderstorms, where higher-resolution models can predict their evolution more accurately than coarser models. Thus, all mathematical sophistication does not help when the weather model does not “know” what kind of terrain it calculates the forecast for.
Hence, it is important to have a fine-grained weather model to begin with in order to make reasonable forecasts, although it is also important to have as much “ground truth” as possible to enhance the model’s forecasts.
In our next weather blog post, we will explain what exactly is meant by “ground truth” and how the Cross-CPP project aims to help with it, so that you as a consumer can get the best weather predictions possible.
Stay tuned 🙂
Your Meteologix Team and Cross-CPP consortium partners
The Cross-CPP project defines a new concept of identification services that enables users to share their identity and the identity of related entities with service providers (for example, to get a cheaper vehicle insurance plan if the insurance company is allowed to monitor the user’s driving behaviour). At the same time, it lets the user retain full control over information that does not directly identify an entity (such as a geo-located temperature measurement) but could reveal the user’s identity when combined with other data (for example, regular travel from a distant place at a specific time). The following figure describes the overall schema of the system and positions it in the context of other Cross-CPP modules.
Identification services primarily interact with the CPP Cloud storage and the CPP Big Data Marketplace and interlink the data with additional information. Service providers or, potentially, Cyber-Physical Products can ask for particular functions by invoking the relevant services and reading the results. The data access policy is managed by the Cross-CPP Security module, but the policy can also specify that the only way a particular service provider may receive data is in a privacy-aware transformed form (for example, data aggregated over relevant map tiles rather than exact GPS locations). Similarly, a rule for data filtration can employ context (by invoking the Context Awareness module) to deliver only the relevant subset of the data agreed upon in a contract between the data owner (for example, a building operator or a vehicle owner) and a service provider (for example, a weather forecast service asking only for plausible measurements of the outside temperature). The implemented functionality will help Cross-CPP guarantee privacy-aware data sharing in the CPP Big Data Marketplace.
Discuss on Twitter @CrossCPP