Introduction
With the rapid integration of renewable energy sources, it is now widely accepted that flexible loads capable of absorbing the volatility in energy generation are essential. It is equally important to be able to steer the consumption of these loads while respecting the comfort constraints of the users involved. Smart control of heating in residential and industrial buildings is one of the most researched topics in this context. To control a building’s heating, one first has to characterize the thermal mass of the building. While several methodologies exist for doing so, data-driven building characterization has recently gained popularity. The keyword here is data: these characterization methods rely on data collected from buildings, which requires sensors (e.g., indoor temperature sensors, heat meters) that provide information about the building. The placement of these sensors is thus an essential element of the entire setup.
We can look at the placement of sensors from two points of view. In the “one-to-many” view, sensors are added one after another, each in the most “informative” position. In the “many-to-one” view, a number of sensors are already placed in a building, and only the “important” ones are scheduled for future maintenance (e.g. battery replacement). This second approach is useful when the sensors themselves are inexpensive but their eventual maintenance is expensive. To determine sensor positions using the one-to-many approach, one needs a very detailed model that can quantify the gain from adding a sensor at a specific position, and this model could be building dependent. If one is looking for a generic and expert-free approach, the many-to-one approach is more appropriate. Going forward, we focus on the latter.
There are again many ways to assess the importance of the sensors that are available. We could use the underlying building models, study their performance in the presence or absence of a sensor, and decide on the relevance of each sensor. Alternatively, we could use statistical tests, which are agnostic of the model itself, to select the most informative set of sensors. We are interested in the latter because it can be used across various choices of building models. We will use statistical tests to study the (in)dependence, in the probabilistic sense, of the readings from the various sensors placed in a building. These tests tell us how the (pairwise) readings from sensors influence or follow each other, and the goal is then to retain the set of sensors that conveys the most information. For instance, if out of three sensors two are dependent on each other but independent of the third, the choice would be to retain the independent sensor and only one sensor from the dependent pair.
Methodology
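The goal is to identify whether the readings from the various sensors are statistically dependent. The simplest criterion to use here would be the covariance or the Pearson correlation coefficient, r = cov(X, Y) / (σ_X σ_Y), where σ_X and σ_Y are the standard deviations of the two sets of readings.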
The problem with this metric is that it captures only linear relationships. In particular, it gives little information about more complex non-linear relationships, and it is also sensitive to outliers. The following examples illustrate these limitations.
Examples
See Figure 1: variable y clearly depends on variable x, yet the correlation coefficient between the two is only 0.02, which would suggest negligible dependence.
The data sets shown in Figure 2 and Figure 3 also look vastly different on visual inspection. The variables in Figure 2 appear to be uncorrelated, while the variables in Figure 3 are strongly correlated. However, the correlation coefficient is 0.812 for both data sets. Hence, the Pearson correlation coefficient is not a reliable metric for understanding the dependence between the variables in a dataset.
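The first limitation can be reproduced with a small synthetic example. The sketch below does not use the data behind Figure 1; it assumes a hypothetical quadratic relationship y = x^2 with a little noise, and the seed, sample size and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed

# Hypothetical data: y is almost a deterministic function of x,
# but the relationship is symmetric around zero rather than linear.
x = np.linspace(-1.0, 1.0, 201)
y = x**2 + 0.01 * rng.standard_normal(x.size)


def pearson(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c**2).sum() * (b_c**2).sum()))


r = pearson(x, y)
print(f"Pearson r = {r:.3f}")  # close to zero despite the clear dependence
```

Because the positive and negative halves of x contribute covariance terms that cancel, the coefficient stays near zero even though y is almost fully determined by x.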
The Hilbert-Schmidt Independence Criterion (HSIC)
To overcome the shortcomings of the Pearson correlation coefficient, we can use another metric, the Hilbert-Schmidt Independence Criterion (HSIC). The HSIC is a measure of the (in)dependence of variables that can detect relations beyond the linear case: the higher the score, the more dependent the variables.
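As a rough illustration of how such a score can be computed, the sketch below uses a commonly used biased empirical estimator, trace(K H L H) / n^2, where K and L are kernel Gram matrices of the two sets of readings and H is the centering matrix. The Gaussian kernel, the median-distance bandwidth and the synthetic data are assumptions made for this example and are not prescribed by the text.

```python
import numpy as np


def gaussian_gram(v):
    """Gaussian-kernel Gram matrix of a 1-D sample.

    The bandwidth is set to the median pairwise distance (an assumed heuristic).
    """
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    dists = np.abs(v - v.T)
    positive = dists[dists > 0]
    sigma = np.median(positive) if positive.size else 1.0
    return np.exp(-(dists**2) / (2.0 * sigma**2))


def hsic(x, y):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(x)
    K = gaussian_gram(x)
    L = gaussian_gram(y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H) / n**2)


# Hypothetical readings: one dependent pair and one independent pair.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
hsic_dep = hsic(x, x**2)                          # non-linear dependence
hsic_ind = hsic(x, rng.uniform(-1.0, 1.0, 200))   # independent draws
print(hsic_dep, hsic_ind)
```

In this setting the score for the dependent pair should come out clearly larger than the score for the independent pair, although neither number is interpretable on its own without a reference distribution.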
In general, measuring the HSIC alone is not enough: from a single value, one cannot conclude whether the dependence it represents is statistically significant. For this, we perform statistical hypothesis testing. The hypotheses in the HSIC test are as follows (see http://papers.nips.cc/paper/3201-a-kernel-statistical-test-of-independence.pdf for more details):
- The null hypothesis is that the readings from a given pair of sensors are independent, that is P(X, Y) = P(X)P(Y).
- The alternative hypothesis is that P(X, Y) != P(X)P(Y), suggesting that there is some dependence.
To perform the test, the following steps are carried out:
- We compute the HSIC for the readings from two sensors. This is the observed HSIC between the two sensors.
- Then, we estimate the distribution of the HSIC under the null hypothesis, that is, the values the HSIC would take if the two variables were independent. To do so, we permute the readings of one sensor, keep the readings of the other fixed, and compute the HSIC for several permutations. Permuting the readings of one sensor breaks any pairing between the two variables and thus makes them independent.
- Finally, the p_value of the test is the fraction of permutations for which the observed HSIC is less than the HSIC computed under the null hypothesis.
- If p_value < 0.05, the null hypothesis is rejected at the 5% significance level and the variables are considered dependent.
- Otherwise, we cannot reject the null hypothesis, and hence it cannot be stated that the variables are dependent.
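The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure used in the study: the Gaussian kernel, the median-distance bandwidth, the number of permutations (n_perm) and the synthetic readings are all choices made for this example. It uses the identity trace(K H L H) = sum of the element-wise product of H K H and L, so that the centered matrix is computed only once.

```python
import numpy as np


def gram(v):
    """Gaussian-kernel Gram matrix; bandwidth = median pairwise distance (assumed)."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    d = np.abs(v - v.T)
    positive = d[d > 0]
    sigma = np.median(positive) if positive.size else 1.0
    return np.exp(-(d**2) / (2.0 * sigma**2))


def hsic_permutation_test(x, y, n_perm=500, seed=0):
    """Return (observed HSIC, p_value) for the null hypothesis of independence."""
    rng = np.random.default_rng(seed)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ gram(x) @ H  # centered Gram matrix of the sensor kept fixed
    L = gram(y)           # Gram matrix of the sensor whose readings are permuted

    def stat(idx):
        # trace(K H L H) / n^2 with the readings of y reordered by idx
        return float(np.sum(Kc * L[np.ix_(idx, idx)]) / n**2)

    observed = stat(np.arange(n))
    null = np.array([stat(rng.permutation(n)) for _ in range(n_perm)])
    # Fraction of permutations whose HSIC exceeds the observed one.
    p_value = float(np.mean(null > observed))
    return observed, p_value


# Hypothetical readings from three sensors.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 150)
y_dep = x**2 + 0.05 * rng.standard_normal(150)  # depends on x non-linearly
y_ind = rng.uniform(-1.0, 1.0, 150)             # drawn independently of x

_, p_dep = hsic_permutation_test(x, y_dep)
_, p_ind = hsic_permutation_test(x, y_ind)
print(f"dependent pair: p = {p_dep:.3f}, independent pair: p = {p_ind:.3f}")
```

With a finite number of permutations the p_value can only take multiples of 1/n_perm, so a reported value of 0 should be read as "smaller than 1/n_perm".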
To illustrate the usefulness of the above test, we apply it to the datasets for which the correlation coefficient falls short.
For the data set in Figure 1, the p-value is 0, which suggests that the null hypothesis can be rejected and hence the variables are dependent. The non-linear dependence in the data set is thus well captured by the statistical test with HSIC, in contrast with the Pearson correlation coefficient.
For the data set in Figure 2, the p-value is 0.905, which suggests that the null hypothesis cannot be rejected. This in turn shows that the test is resistant to outliers that would otherwise produce false positives for dependence.
For the data set in Figure 3, the p-value is 0, which suggests that the null hypothesis should be rejected and hence the variables are dependent. This also demonstrates robustness against outliers, as the p-value is not affected by the single outlier.
Some final remarks:
The hypothesis test explained above can similarly be applied pairwise to the readings from all available sensors. As stated earlier, there are several advantages to adopting this method. First, it is more informative than the naïve correlation coefficient. Second, it can be used across building model choices. Finally, it can be applied to the data from any building, without specific knowledge about the building itself.
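Once the pairwise p-values are available, one possible way to turn them into a list of sensors to retain is a simple greedy pass. The rule below is an assumption made for illustration and not a method stated in this study; the function name select_sensors, the threshold alpha and the p-values are all made up, with the values chosen to mirror the three-sensor example from the introduction.

```python
import numpy as np


def select_sensors(p_values, alpha=0.05):
    """Greedy pass: keep a sensor unless it tests as dependent on one already kept.

    p_values[i, j] is the p-value of the independence test between sensors i and j.
    """
    p_values = np.asarray(p_values, dtype=float)
    kept = []
    for i in range(p_values.shape[0]):
        if all(p_values[i, j] >= alpha for j in kept):
            kept.append(i)
    return kept


# Made-up p-values for three hypothetical sensors: sensors 0 and 1 are
# dependent on each other, while sensor 2 is independent of both.
p = np.array([
    [0.00, 0.00, 0.62],
    [0.00, 0.00, 0.48],
    [0.62, 0.48, 0.00],
])
print(select_sensors(p))  # [0, 2]
```

The result of such a pass depends on the order in which sensors are visited, and failing to reject the null hypothesis is not proof of independence, so the output is best treated as a starting point for a maintenance plan rather than a final answer.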
Footnote: This study is part of research conducted under the FHP project (http://fhp-h2020.eu/).