Overview of the CORON System

Coron is a domain- and platform-independent, multi-purpose data mining toolkit that incorporates not only a rich collection of data mining algorithms, but also a number of auxiliary operations. To the best of our knowledge, no other data mining toolkit is designed specifically for itemset extraction and association rule generation the way Coron is. Coron also provides support for preparing and filtering data, and for interpreting the extracted units of knowledge.

In our case, the extracted knowledge units are mainly association rules. At present, finding association rules is one of the most important tasks in data mining. Association rules reveal "hidden" relationships in a dataset. Finding association rules first requires the extraction of frequent itemsets.
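
To illustrate these two steps, the following minimal Python sketch (not part of Coron; the transaction data and the support and confidence thresholds are purely illustrative) enumerates frequent itemsets by brute force and then derives association rules from them.

```python
from itertools import combinations

# Toy transaction database (illustrative only, not a real dataset)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

min_support = 0.4      # itemset must occur in >= 40% of the transactions
min_confidence = 0.7   # rule must hold in >= 70% of the cases where its body occurs

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 1: extract frequent itemsets (brute-force enumeration for clarity)
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s

# Step 2: generate association rules body -> head from the frequent itemsets
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for body in combinations(itemset, k):
            body = frozenset(body)
            confidence = s / support(body)
            if confidence >= min_confidence:
                head = itemset - body
                print(f"{set(body)} -> {set(head)} "
                      f"(support={s:.2f}, confidence={confidence:.2f})")
```

Real algorithms such as Apriori avoid this exhaustive enumeration by pruning the search space, but the two phases, itemset extraction followed by rule generation, are the same.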

Currently, several data mining algorithms and tools are freely available. For instance, the goal of the FIMI workshops is to develop ever more efficient algorithms in three categories: (1) frequent itemset (FI) extraction, (2) frequent closed itemset (FCI) extraction, and (3) maximal frequent itemset (MFI) extraction. However, they tend to overlook one thing: the motivation for looking for these itemsets. Once they have been found, what can be done with them? Extracting only FIs, FCIs, or MFIs is not enough to generate really useful association rules. The FIMI algorithms may be very efficient, but they are not always suitable for our needs. Furthermore, these algorithms are independent, i.e. they are not grouped together in a unified software platform. We also experimented with other toolkits, such as Weka. Weka covers a wide range of machine learning tasks, but it is not really suitable for finding association rules, because it provides only one algorithm for this task, the Apriori algorithm. Apriori finds FIs only, and it is not efficient on large, dense datasets.
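
To make the three FIMI categories concrete, the following small Python sketch (illustrative only; the transactions and the 0.5 support threshold are assumptions, and brute-force enumeration is used in place of an actual FIMI algorithm) computes the FIs of a toy dataset and then filters them down to the closed and the maximal ones.

```python
from itertools import combinations

# Toy transaction database (illustrative only)
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a"},
]
min_support = 0.5  # assumed threshold for this example

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))

# Frequent itemsets (FIs): every itemset whose support reaches the threshold
fis = {frozenset(c): support(frozenset(c))
       for size in range(1, len(items) + 1)
       for c in combinations(items, size)
       if support(frozenset(c)) >= min_support}

# Frequent closed itemsets (FCIs): FIs with no frequent superset of equal support
fcis = {i: s for i, s in fis.items()
        if not any(i < j and s == t for j, t in fis.items())}

# Maximal frequent itemsets (MFIs): FIs with no frequent superset at all
mfis = {i: s for i, s in fis.items()
        if not any(i < j for j in fis)}

print("FIs: ", sorted(map(sorted, fis)))
print("FCIs:", sorted(map(sorted, fcis)))
print("MFIs:", sorted(map(sorted, mfis)))
```

On this toy data the sketch finds five FIs, three FCIs, and two MFIs. The MFIs record which itemsets are frequent but not the supports of their subsets, which illustrates why the raw output of such extraction algorithms is, by itself, not sufficient for generating useful association rules.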

For all these reasons, we decided to group the most important algorithms into a software toolkit aimed at data mining. We also decided to build a methodology and a platform that implements this methodology in its entirety. Another advantage of the platform is that it includes the auxiliary operations that are often missing from implementations of single algorithms, such as filtering and pre-processing the dataset, or post-processing the extracted association rules. Of course, the use of the methodology and the platform is not restricted to one kind of dataset; both can be applied to arbitrary datasets.