Learning from very large data sets

  1. Limas, M.C. 1
  2. Ordieres Mere, J.B. 1
  3. Ciampi, A. 1, 2
  4. Elias, F.A. 1, 3
  1. 1 Universidad de León

    León, Spain

    ROR https://ror.org/02tzt0b78

  2. 2 McGill University

    Montreal, Canada

    ROR https://ror.org/01pxwe438

  3. 3 Universidad de La Rioja

    Logroño, Spain

    ROR https://ror.org/0553yr311

Journal:
WSEAS Transactions on Information Science and Applications

ISSN: 1790-0832

Year of publication: 2005

Volume: 2

Issue: 10

Pages: 1641-1648

Type: Article

Abstract

Understanding a process from its recorded data offers considerable advantages. A growing number of firms seek, and find, solutions to their production problems by analyzing their production data. Current technology makes it possible to routinely store in databases the control variables of special interest and the command history of a process. Subsequent analysis of these databases provides a potentially valuable source of high-quality information that greatly helps decision making. Nevertheless, analyzing such a database is frequently not a trivial task: it requires the simultaneous use of tools of very varied origin, in what has now become a field of research in itself, known as 'data mining'. Among the first challenges the data miner must tackle is summarizing the complexity of the data into a number of distinct clusters, which represent 'interesting', often unexpected, behavior patterns of the process under analysis. Despite the powerful computers and efficient clustering algorithms currently available, their limits are typically exceeded when mining massive databases such as those arising from industrial processes. Hence the need for new clustering algorithms that directly address the problem of size. We present one such algorithm, which has been successfully applied to a variety of industrial processes as well as to data sets of a different nature, drawn from medicine, biology and epidemiology. Our algorithm, named CiTree, yields a hierarchical structure of the clusters present in the process, thus providing a detailed representation of the relationships among sample units. As an example of the use of CiTree, we also present a real case study whose analysis provided useful information.