Friday, August 05, 2005

Stemming, information filtering

I've started work on an information filtering system that returns only the documents relevant to a query. I'll be using my own implementation of probabilistic latent semantic analysis (PLSA) at first. A couple of ideas for extending it have already occurred to me, and a buddy of mine has started playing around with a probabilistic model based on least squares.

I'm doing all of this in Python and have already found a decent word stemmer implemented in Python. There's not much to it, but it seems to work well. The next step is building a vocabulary. I've found a few existing vocabularies through Google searches, but I'm going to go ahead and write a Python module to ingest text files, strip out punctuation and digits, stem the remaining words, and write the result to a master vocabulary.
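
Here's a rough sketch of what that vocabulary module might look like. It's only an outline under a few assumptions: the function names (tokenize, build_vocabulary, write_vocabulary) are made up for illustration, the input is assumed to be plain English text, and toy_stem is a crude stand-in that strips a handful of suffixes. It is not the stemmer I found, which would be passed in through the stem argument instead.

```python
import re
from pathlib import Path

# Any run of characters that is not a lowercase ASCII letter is treated as a
# separator, which removes punctuation and digits in one pass.
NON_LETTERS = re.compile(r"[^a-z]+")


def toy_stem(word):
    """Hypothetical placeholder stemmer: strips a few common suffixes.

    This is NOT a real stemming algorithm; swap in a proper stemmer.
    """
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and split it into purely alphabetic words."""
    return [w for w in NON_LETTERS.split(text.lower()) if w]


def build_vocabulary(paths, stem=toy_stem):
    """Read each text file, stem every word, and return the sorted unique stems."""
    vocab = set()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        vocab.update(stem(word) for word in tokenize(text))
    return sorted(vocab)


def write_vocabulary(vocab, out_path):
    """Write the master vocabulary to disk, one stem per line."""
    Path(out_path).write_text("\n".join(vocab) + "\n", encoding="utf-8")
```

Using a set keeps the vocabulary free of duplicates no matter how many files contribute the same word, and sorting before writing makes the output file stable from run to run.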

