Learning from very large data sets

Limas, M.C.; Ordieres Mere, J.B.; Ciampi, A.; Elias, F.A.

Limas, M.C. ¹
Ordieres Mere, J.B. ¹
Ciampi, A. ¹²
Elias, F.A. ¹³

1 Universidad de León, León, Spain
ROR https://ror.org/02tzt0b78

2 McGill University, Montreal, Canada
ROR https://ror.org/01pxwe438

3 Universidad de La Rioja, Logroño, Spain
ROR https://ror.org/0553yr311

Journal: WSEAS Transactions on Information Science and Applications

ISSN: 1790-0832

Year of publication: 2005

Volume: 2

Issue: 10

Pages: 1641-1648

Type: Article

SCOPUS: 2-s2.0-24344465900

Abstract

Understanding a process from its recorded data offers considerable advantages. A growing number of firms seek, and find, solutions to their production problems by analyzing their production data. Current technology makes it routine to store in databases the control variables of special interest and the command history of a process. Later analysis of these databases provides a potentially valuable source of high-quality information that greatly helps decision making. Nevertheless, analyzing such a database is frequently not a trivial task: it requires the simultaneous use of tools of very varied origin, in what has now become a field of research in itself, known as 'data mining'. Among the first challenges the data miner must tackle is summarizing the complexity of the data into a number of distinct clusters that represent 'interesting', often unexpected, behavior patterns of the process under analysis. Despite the powerful computers and efficient clustering algorithms currently available, their limits are typically exceeded when mining massive databases such as those arising from industrial processes. Hence the need for new clustering algorithms that directly address the problem of size. We present one such algorithm, which has been successfully applied to a variety of industrial processes as well as to data sets of a different nature, drawn from medicine, biology and epidemiology. Our algorithm, named CiTree, yields a hierarchical structure of the clusters present in the process, thus providing a detailed representation of the relationships among sample units. As an example of the use of CiTree, we also present a real case study whose analysis provided useful information.
