Text to Matrix Generator (TMG) is a MATLAB Toolbox that can be used for various Data Mining (DM) and Information Retrieval (IR) tasks. TMG uses the sparse matrix infrastructure of MATLAB, which is especially suited to Text Mining (TM) applications, where data are extremely sparse. Initially built as a preprocessing tool, TMG now offers a wide range of DM tools. In particular, TMG is composed of six Graphical User Interface (GUI) modules, presented in Figure 1 (arrows show module dependencies).
Installation
Installation of TMG is straightforward by means of the init_tmg script. In particular, the user has to perform the following steps:
TMG 6.0R7 relies on several third-party software packages. The ANLS, NNDSVD, PROPACK, SDDPACK and SPQR packages are included in TMG, while the user has to download MySQL and JRE. Note, however, that the MySQL-related software is necessary only if the user intends to use the database support offered by TMG. Ordinary TMG will run in a MATLAB 7.0 environment without any other special software.
Indexing Module (tmg_gui)
TMG can be used for the construction of new and the update of existing term-document matrices (tdms) from text collections, in the form of MATLAB sparse arrays. To this end, TMG implements various steps such as:
The resulting tdms can be stored as "mat" files, while text can also be stored in MySQL for further processing. TMG can also update existing tdms through efficient incremental updating or downdating operations. Finally, TMG can construct query vectors from the existing dictionary, for use by the retrieval and classification modules.
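The idea of a sparse term-document matrix and of query vectors built on an existing dictionary can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and does not reproduce TMG's code; the function names `build_tdm` and `query_vector`, the tokenization rule, and the stopword and minimum-length filters are assumptions chosen for illustration only.

```python
import re
from collections import Counter


def build_tdm(documents, stopwords=frozenset(), min_length=2):
    """Build a sparse term-document matrix from raw text (illustrative only).

    Returns (dictionary, tdm): dictionary is a sorted list of terms and
    tdm maps (term_index, doc_index) -> raw term frequency.
    """
    tokenized = []
    for text in documents:
        # Lowercase, keep alphabetic runs, drop stopwords and short tokens.
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in stopwords and len(t) >= min_length]
        tokenized.append(tokens)

    dictionary = sorted({t for tokens in tokenized for t in tokens})
    index = {term: i for i, term in enumerate(dictionary)}

    # Only nonzero entries are stored, which is what makes the matrix sparse.
    tdm = {}
    for j, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            tdm[(index[term], j)] = count
    return dictionary, tdm


def query_vector(query, dictionary):
    """Map a query onto an existing dictionary; unknown terms are ignored."""
    index = {term: i for i, term in enumerate(dictionary)}
    counts = Counter(t for t in re.findall(r"[a-z]+", query.lower())
                     if t in index)
    return {index[t]: c for t, c in counts.items()}


docs = ["Sparse matrices store text data.", "Text mining uses sparse data."]
dictionary, tdm = build_tdm(docs, stopwords={"uses"})
q = query_vector("sparse text sparse query", dictionary)
```

Storing only the nonzero (term, document) pairs mirrors the motivation for sparse arrays: most terms do not occur in most documents.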
See details and a demonstration of tmg_gui.
Dimensionality Reduction module (dr_gui)
This module deploys a variety of techniques designed to handle high-dimensional data efficiently. Dimensionality Reduction (DR) is a widely used technique with two goals: (a) a more economical representation of the data, and (b) a better semantic representation. TMG implements six DR techniques.
DR data can be stored as "mat" files and used for further processing.
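As an illustration of what a reduced representation looks like, the following Python sketch computes a rank-k truncated SVD of a small dense matrix, SVD being one of the DR techniques named later in this text. It uses plain power iteration with deflation, which is a simple stand-in chosen for readability and not the method TMG uses; all function names are hypothetical.

```python
import math
import random


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def top_singular_triplet(A, iterations=200, seed=0):
    """Power iteration on A^T A; returns (sigma, u, v) with A v ~= sigma u."""
    rng = random.Random(seed)
    At = [list(col) for col in zip(*A)]
    v = [rng.random() + 0.1 for _ in A[0]]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # A is (numerically) zero; keep the current v
            break
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av] if sigma > 1e-12 else [0.0] * len(A)
    return sigma, u, v


def truncated_svd(A, k):
    """Rank-k SVD by repeated deflation (fine for tiny examples only)."""
    residual = [list(row) for row in A]
    sigmas, U, V = [], [], []
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        sigmas.append(sigma)
        U.append(u)
        V.append(v)
        # Remove the component just found before looking for the next one.
        for i, row in enumerate(residual):
            for j in range(len(row)):
                row[j] -= sigma * u[i] * v[j]
    return sigmas, U, V


def reduce_documents(sigmas, V):
    """k-dimensional coordinates of each document (column of A)."""
    n = len(V[0])
    return [[s * v[j] for s, v in zip(sigmas, V)] for j in range(n)]


A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
sigmas, U, V = truncated_svd(A, 2)
coords = reduce_documents(sigmas, V)
```

Each document is now described by k numbers instead of one entry per term, which is the "more economical representation" referred to above.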
See details and a demonstration of dr_gui.
Non-Negative Factorizations module (nnmf_gui)
This module deploys a set of Non-Negative Matrix Factorization (NNMF) techniques. Since these techniques are iterative, the final result depends on the initialization. A common approach is to initialize the non-negative factors randomly; however, newer approaches appear to yield higher-quality approximations. TMG implements four initialization techniques:
Resulting factors can be further refined by means of two NNMF algorithms:
See details and a demonstration of nnmf_gui.
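The dependence of an iterative NNMF on its starting factors can be seen in a minimal Python sketch. It uses random initialization followed by the classic multiplicative update rule of Lee and Seung, picked here only because it is short; this text does not say which algorithms TMG uses, and the names `nmf_multiplicative` and `frobenius_error` are hypothetical.

```python
import random


def matmul(X, Y):
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    WH = matmul(W, H)
    return sum((a - b) ** 2
               for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf_multiplicative(A, k, iterations=200, seed=0, eps=1e-9):
    """Approximate A ~= W H with W, H >= 0 (illustrative sketch)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Random non-negative initialization; a different seed can give a
    # different final factorization.
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    errors = [frobenius_error(A, W, H)]
    for _ in range(iterations):
        # H <- H .* (W^T A) ./ (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W .* (A H^T) ./ (W H H^T)
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
        errors.append(frobenius_error(A, W, H))
    return W, H, errors


A = [[1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
W, H, errors = nmf_multiplicative(A, k=2)
```

Because the updates only multiply by non-negative ratios, factors that start non-negative stay non-negative, and a better starting point simply replaces the random `W` and `H`.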
Retrieval module (retrieval_gui)
TMG offers two alternatives for Text Mining.
using a combination of any DR technique and Latent Semantic Indexing (LSI). Using the corresponding GUI, the user can submit a query to an existing dataset using any of the aforementioned techniques and receive the response in HTML.
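Answering a query against an existing dataset can be sketched as ranking documents by their similarity to the query vector. The Python example below assumes cosine similarity over sparse term-weight vectors, a common choice that this text does not confirm for TMG; `cosine_rank` is a hypothetical name. With LSI, the same ranking would be applied after projecting the query and documents into the reduced space.

```python
import math


def cosine_rank(query, doc_vectors):
    """Rank documents by cosine similarity to the query.

    query and every document are dicts mapping term index -> weight.
    Returns a list of (doc_id, score), best match first.
    """
    def norm(vec):
        return math.sqrt(sum(w * w for w in vec.values()))

    query_norm = norm(query)
    scores = []
    for doc_id, vec in enumerate(doc_vectors):
        doc_norm = norm(vec)
        if query_norm == 0.0 or doc_norm == 0.0:
            score = 0.0  # empty query or empty document
        else:
            dot = sum(w * vec.get(t, 0.0) for t, w in query.items())
            score = dot / (query_norm * doc_norm)
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


documents = [{0: 1.0, 1: 1.0}, {1: 1.0, 2: 1.0}, {2: 2.0}]
ranking = cosine_rank({0: 1.0}, documents)
```

Only terms present in the query contribute to the dot product, so the cost per document depends on the query length rather than on the dictionary size.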
See details and a demonstration of retrieval_gui.
Clustering module (clustering_gui)
TMG implements three clustering algorithms.
Regarding PDDP, TMG implements the basic algorithm as well as PDDP(l) [15] and some hybrid variants of PDDP and k-means [19].
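For readers unfamiliar with k-means, a minimal Python sketch of the standard Lloyd iteration follows. It uses Euclidean distance and a seeded random choice of initial centroids, both assumptions made for illustration; it is not TMG's implementation and does not cover PDDP or the hybrid variants. The function name `kmeans` is hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on dense points (lists of floats)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, k=2)
```

Like the NNMF iterations, the result depends on the initial centroids, which is one motivation for hybrid schemes that start k-means from another method's output.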
See details and a demonstration of clustering_gui.
Classification module (classification_gui)
TMG implements three classification algorithms.
All these algorithms can be combined with CLSI, CM and SVD DR techniques.
See details and a demonstration of classification_gui.
Acknowledgements
TMG was conceived after a motivating discussion with Andrew Knyazev regarding a collection of MATLAB tools we had put together to aid in our clustering experiments. We thank our colleagues Ioannis Antonellis, Anastasios Zouzias, Efi Kokiopoulou and Constantine Bekas for many helpful suggestions, Jacob Kogan and Charles Nicholas for inviting us to contribute to [18], Elias Houstis for his help in the initial phases of this research, and Michael Berry, Tamara Kolda, Rasmus Munk Larsen, Christos Boutsidis and Haesun Park for letting us use and distribute the SPQR, SDDPACK, PROPACK, NNDSVD and ANLS software, respectively. Special thanks are due to the many users who offered constructive comments regarding TMG. This research was supported in part by a University of Patras "Karatheodori" grant. The first author was also supported by a Bodossaki Foundation graduate fellowship.
References
[1] M. Berry, Z. Drmac, and E. Jessup, Matrices, vector spaces, and information retrieval, SIAM Review 41 (1999), 335–362.