Posted on 2020-07-21, 09:44. Authored by Yasemin Asan Kalaz.
In recent years, many emerging technologies, such as radio-frequency identification (RFID) networks and wireless sensor networks, have produced large amounts of uncertain data, which has drawn considerable attention to the problem of managing and analysing such data. Pattern mining has been studied extensively for certain data, and it is an equally important problem for uncertain data. Probabilistic databases are a commonly used framework for modelling uncertain databases. There are many studies on uncertain databases; however, most of them rely on the independence assumption. In this thesis, we first propose a correlated tuple model that enables us to define dependencies between tuples in tuple-level uncertain databases. As an improvement to this model, we define a general model that can capture existing dependencies in dependent uncertain databases. However, finding the support of an itemset in such a model is an NP-complete problem. We therefore propose a restricted version of this model and define a dynamic program that finds frequent itemsets efficiently. Finally, we formulate a pattern matching problem on transcription factor binding profiles. We generate uncertain dependent sequence data, to which we apply a mining algorithm to find frequent sub-sequences. Once the frequent sub-sequences have been found for each motif whose family is already known, we use the Jaccard index to compare them with each other. We then apply a distance measure to the Jaccard similarity values to identify the right family for each motif. We validate our solutions through extensive experiments and discuss potential future research directions for mining patterns over dependent uncertain databases.
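
The family-identification step described above can be illustrated with a minimal sketch. It assumes that each motif is represented by a set of frequent sub-sequences, that the distance measure is the Jaccard distance (one minus the Jaccard index), and that a motif is assigned the family of its nearest known motif. The abstract does not specify the distance measure or the assignment rule, so those choices, along with the names `jaccard_index`, `nearest_family`, and the toy data, are hypothetical.

```python
from typing import Dict, Set, Tuple


def jaccard_index(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; taken to be 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def nearest_family(query: Set[str], known: Dict[str, Tuple[str, Set[str]]]) -> str:
    """Return the family of the known motif closest to `query`.

    Assumption: closeness is the Jaccard distance, 1 - jaccard_index.
    `known` maps a motif name to (family label, set of frequent sub-sequences).
    """
    best_name = min(
        known,
        key=lambda name: 1.0 - jaccard_index(query, known[name][1]),
    )
    return known[best_name][0]


# Hypothetical toy data: frequent sub-sequences mined for two known motifs.
known_motifs = {
    "motif_A": ("family_1", {"ACG", "CGT", "GTA"}),
    "motif_B": ("family_2", {"TTA", "TAT", "ATT"}),
}
query_motif = {"ACG", "CGT", "TTT"}

similarity = jaccard_index(query_motif, known_motifs["motif_A"][1])
predicted = nearest_family(query_motif, known_motifs)
print(similarity, predicted)
```

Here the query shares two of four distinct sub-sequences with `motif_A` and none with `motif_B`, so it is assigned to `family_1`. A nearest-neighbour rule is only one possible way to turn pairwise similarities into a family label.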