Learning from very large data sets
- Limas, M.C. 1
- Ordieres Mere, J.B. 1
- Ciampi, A. 1, 2
- Elias, F.A. 1, 3
- 1 Universidad de León
- 2 McGill University
- 3 Universidad de La Rioja
ISSN: 1790-0832
Publication date: 2005
Volume: 2
Issue: 10
Pages: 1641-1648
Type: Article
Other publications in: WSEAS Transactions on Information Science and Applications
Abstract
Knowing a process from its recorded data yields advantages of great interest. The number of firms that seek and find solutions to their production problems by analyzing their production data is increasing every day. Current technology makes it possible to routinely store in databases the control variables of special interest and the command history of the processes. Later analysis of these databases provides a potentially precious source of high-quality information that is of great help in decision making. Nevertheless, analyzing such a database is frequently not a trivial task: it requires the simultaneous use of tools of widely varied origin, in what has now become a field of research in itself, known as 'data mining'. Among the first challenges the data miner must tackle is summarizing the complexity of the data into a number of distinct clusters, which represent 'interesting', often unexpected, behavior patterns of the process under analysis. Despite the powerful computers and efficient clustering algorithms currently available, their limits are typically exceeded when mining massive databases such as those arising from industrial processes. Hence the need for new clustering algorithms that directly address the problem of size. We present here one such algorithm, successfully applied to a variety of industrial processes as well as to data sets of a different nature originating in medicine, biology and epidemiology. Our algorithm, named CiTree, yields a hierarchical structure of the clusters present in the process, thus providing a detailed representation of the relationships among sample units. As an example of the use of CiTree, we also present a real case study whose analysis provided useful information.