A sensor network serves as a vital source of raw sensory data. Sensor data are later processed, analyzed, visualized, and reasoned over with the help of several decision-making tools. A decision-making process can be disastrously misled by even a small portion of anomalous sensor readings. Therefore, there is strong demand for mechanisms that identify and then eliminate such anomalies in order to ensure the quality, integrity, and/or trustworthiness of the raw sensory data before they are interpreted.
Before anomalies can be identified, it is essential to understand the various anomalous behaviors prevalent in a sensor network deployment. Therefore, we begin this work with a comprehensive study of anomalies that exist in current sensor network deployments or are likely to arise in future ones. After this systematic analysis, we identify those anomalies that actually degrade the quality and/or trustworthiness of the collected sensor data.
One approach to reducing the negative impact of misleading sensor readings is to perform off-line analysis after storing a large amount of sensor data in a centralized database. To this end, we propose an off-line abnormal node detection mechanism rooted in machine learning and data mining. Our proposed mechanism achieves high detection accuracy with a low false-positive rate.
The major disadvantage of a centralized architecture is the tremendous amount of energy spent communicating the sensor readings. Therefore, we further propose an on-line distributed anomaly detection framework that is capable of accurately and rapidly identifying data-centric anomalies in-network while maintaining a low energy profile. Unlike previous approaches, our proposed framework uses a very small amount of data memory by extracting a few statistical features on-line over the sensor data stream. In addition, previous detection mechanisms rely on sensor datasets obtained from an earlier deployment, or on synthetic data, to test their effectiveness. Our framework, on the other hand, has been entirely implemented in TinyOS as a prototype that is readily deployable into existing sensor networks alongside other essential protocols, such as sensor data collection protocols.
An advantage of our system is that it relies on supervised learning. Supervised machine learning algorithms usually achieve higher accuracy than their unsupervised counterparts, provided that a highly representative common ground truth is available. Thus, in this work, we also design highly expressive anomaly models that may be leveraged to inject anomalous readings into existing sensor network deployments. To do so, we have developed a tool called SNMiner, which enables us not only to inject anomalies into a network of sensors, but also to extract important statistical features and evaluate the accuracy of a number of supervised machine learning algorithms.
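To make the idea of on-line feature extraction with a small memory footprint concrete, the following is a minimal sketch, not the framework's actual implementation. The text above does not say which statistical features are extracted or how, so the choice of mean, standard deviation, minimum, and maximum is an assumption, as is the use of Welford's single-pass update for the variance. The class name `StreamingStats` and its methods are hypothetical, and Python is used purely for illustration rather than the nesC/TinyOS environment of the prototype.

```python
import math


class StreamingStats:
    """Constant-memory running statistics over a stream of sensor readings.

    Hypothetical illustration: the feature set and the update rule
    (Welford's algorithm) are assumptions, not taken from the framework.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, reading):
        """Fold one reading into the summary without storing the reading."""
        self.count += 1
        delta = reading - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (reading - self.mean)
        self.minimum = min(self.minimum, reading)
        self.maximum = max(self.maximum, reading)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two readings have arrived."""
        if self.count < 2:
            return 0.0
        return self._m2 / (self.count - 1)

    def features(self):
        """Return (mean, standard deviation, minimum, maximum)."""
        return (self.mean, math.sqrt(self.variance), self.minimum, self.maximum)


if __name__ == "__main__":
    stats = StreamingStats()
    for reading in [20.0, 22.0, 24.0]:  # made-up temperature readings
        stats.update(reading)
    print(stats.features())
```

Under these assumptions, the state kept per sensed quantity is five numbers regardless of how many readings have been seen, which is the property that matters on memory-constrained nodes. A feature vector of this kind could then be passed to a trained supervised classifier.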