Announcement - Defense No. 200

Student: Anderson Vinicius Alves Ferreira

Title: “Identifying Relevant Subspaces Using Subspace Search”

Advisor: Prof. Carmelo José Albanez Bastos Filho

Date and time: 30 September 2019 (15:00)
Location: Escola Politécnica de Pernambuco, Room I 4


Abstract:

High-dimensional data pose challenges in many domains, ranging from optimization to machine learning and from numerical analysis to databases. These challenges stem from irrelevant or correlated attributes, complexity that grows exponentially with the number of dimensions, or even the concentration effect of distance measures. Unsupervised learning methods, such as clustering algorithms, often rely on the underlying structure of the data to learn about unknown patterns. Because they do not use prior knowledge or rewards to control the learning process, unsupervised methods tend to be more sensitive to the effects of high-dimensional data. These issues have led to the development of methods for learning in subspaces. Subspace clustering addresses high-dimensional data by searching for relevant subspaces where clusters may reside. Some algorithms interleave the subspace search and the clustering process so that each depends on the other, while others decouple the two tasks and treat them as independent processes. Either way, the search for relevant subspaces remains a challenging open research problem. The notion of correlation is commonly used as an indication of the existence of patterns, and it guides the search for subspaces. However, using correlation alone as a measure of subspace quality might neglect how variables are associated with each other and ignore important information about the structure of the data. In this work, we present a subspace search process that aims to identify relevant subspaces that might lead to more meaningful results. This search process builds upon the Greedy Maximum Deviation heuristic and incorporates the idea that the interconnection between dimensions is an important factor in discovering indicative relationships.
We illustrate the performance and benefits of this technique on an Alzheimer's disease patient population data set, on the well-known Wine data set from the UCI Repository, and on the 500 Cities Project data set. We show that the search process leads to more meaningful results than other methods, such as using full-dimensional data, feature selection with Principal Component Analysis, and Greedy Maximum Deviation itself.

 
