WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

Knowledge Discovery and Data Mining Research Group 
KDDRG

Research Projects on
Sequence Mining 

Description | Categical Sequences

------------------------------------------
GENERAL DESCRIPTION

Finding patterns in sequences is a challenging problem. In many domains, data are represented as sequences. In the medical domain, symptoms exhibited by a patient can be ordered according to their occurrence in time, and some patterns can be found that relate a certain subsequence of symptoms with a particular disease. Also, genetic analysis must take into account the sequential nature of DNA. In the financial domain, the daily price of a stock during say a quarter or a year can be naturally represented as a sequence of values. Finding patterns in stock sequences is valuable for predicting stock market prices. In the market analysis domain, finding patterns in the sequence of items (e.g. books or supermarket products) that a person buys will help us predicting the buying behavior of the person and target him/her for future sales of products. In the WEB domain, finding patterns in the sequence of webpages that a person visits, helps in predicting which pages the person will visit next. That is useful to recommend links to the person as well as for organizing web sites. These are just a few examples of domains in which sequences are the natural way of representing information and in which finding patterns on those sequences is of great importance.

------------------------------------------
TRANSFORM-BASED SIMILARITY METHODS FOR SEQUENCE MINING

Project Members

Project Description

We investigate transform-based algorithms for performing similarity queries in sequential databases. The transform-indexing method of Agrawal, Faloutsos, and Swami (1993) was evaluated, using both Fourier and Walsh transforms to extract features. We extended the method to enable it to handle categorical data as well as different notions of distance. Analytical and experimental evaluation of these methods was performed, using financial data and simulated genetic data.

------------------------------------------
EXPLORING TEMPORAL ASSOCIATIONS IN THE STOCK MARKET

Project Members

Project Description

This project explores data mining for temporal associations within the stock market. The system used was a combination of the WPI WEKA data mining system as well as an event identification pre-processing module implemented as an extension of an existing algorithm. This implementation allows for complex templates to be identified in a sequence of numerical data in a more efficient manner than previously described. Applying this system to the financial domain produced a wide variety of descriptive associations.

------------------------------------------
AN EFFICIENT AND INCREMENTAL SYSTEM TO MINE CONTIGUOUS FREQUENT SEQUENCES

Project Members

Project Description

Mining frequent patterns is an important component of many prediction systems. One common usage in web applications is the mining of users' access behavior for the purpose of predicting and hence pre-fetching the web pages that the user is likely to visit.

Frequent sequence mining approaches in the literature are often based on the use of an Apriori-like candidate generation strategy, which typically requires numerous scans of a potentially huge sequence database. In this paper we instead introduce a more efficient strategy for discovering frequent patterns in sequence databases that requires only two scans of the database. The first scan obtains support counts for subsequences of length two. The second scan extracts potentially frequent sequences of any length and represents them as a compressed frequent sequences tree structure (FS-tree). Frequent sequence patterns are then mined from the FS-tree. Incremental and interactive mining functionalities are also facilitated by the FS-tree. As part of this work, we developed the FS-Miner, an system that discovers frequent sequences from web log files. The FS-Miner has the ability to adapt to changes in users' behavior over time, in the form of new input sequences, and to respond incrementally without the need to perform full re-computation. Our system also allows the user to change the input parameters (e.g., minimum support and desired pattern size) interactively without requiring full re-computation in most cases.

We have tested our system using two different data sets, comparing it against two other algorithms from the literature. Our experimental results show that our system scales up linearly with the size of the input database. Furthermore, it exhibits excellent adaptability to support threshold decreases. We also show that the incremental update capability of the system provides significant performance advantages over full re-computation even for relatively large update sizes.

------------------------------------------
[Return to the WPI Homepage]  [Return to the CS Homepage]
ruiz@cs.wpi.edu