Bayesian Feed Filtering
About
The Bayesian Feed Filtering (BayesFF) project aims to identify, from a set of RSS feeds of journal tables of contents, those articles that are of interest to specific researchers, by applying the same approach that is used to filter out junk email.
We will develop and investigate the performance of a tool that will aggregate and filter a range of RSS and ATOM feeds selected by a user. The algorithm used for the filtering is similar to that used to identify spam in many email filters, except that in this case it will be “trained” to identify items that are interesting and should be highlighted, not those that should be junked.
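As a rough illustration of the idea, the following is a minimal sketch of a two-category naive Bayes text classifier written in Python. It is not the sux0r implementation: the class name, the category labels, the simple word tokenizer and the add-one smoothing are all assumptions made for the example, and the training texts are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesFilter:
    """Hypothetical two-category naive Bayes classifier with add-one smoothing."""

    def __init__(self, categories=("interesting", "not_interesting")):
        self.word_counts = {c: Counter() for c in categories}
        self.doc_counts = {c: 0 for c in categories}

    def train(self, text, category):
        """Record the words of one item the user has marked with a category."""
        self.word_counts[category].update(tokenize(text))
        self.doc_counts[category] += 1

    def log_scores(self, text):
        """Return an unnormalised log probability for each category."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for category, counts in self.word_counts.items():
            # Smoothed prior: no category ever gets probability zero.
            score = math.log(
                (self.doc_counts[category] + 1) / (total_docs + len(self.doc_counts))
            )
            total_words = sum(counts.values())
            for token in tokenize(text):
                # Smoothed likelihood; the extra +1 leaves room for unseen words.
                score += math.log((counts[token] + 1) / (total_words + len(vocab) + 1))
            scores[category] = score
        return scores

    def classify(self, text):
        """Return the category with the highest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented training data, for illustration only.
feed_filter = NaiveBayesFilter()
feed_filter.train("Bayesian spam filtering for RSS feeds", "interesting")
feed_filter.train("Personalised alerting with naive Bayes classifiers", "interesting")
feed_filter.train("Crystal structure of a zinc complex", "not_interesting")
feed_filter.train("Thermal properties of zinc alloys", "not_interesting")

print(feed_filter.classify("Naive Bayes filtering of journal feeds"))  # prints "interesting"
```

The only difference from a spam filter is the choice of labels: the category that matters is the one to highlight rather than the one to discard.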
An important element of the project is investigating whether the filtering is effective enough to be helpful to users (specifically, in this case, researchers looking at journal tables of contents for interesting newly-published papers) and disseminating information about the potential of this approach within the JISC community. We appreciate that the potential applicability of the technique is much wider: it applies to any area where a user might want to monitor alerts from a wide range of sources in the knowledge that many of the items in the feeds will be irrelevant. Anyone who has subscribed to dozens of seemingly relevant feeds, only to find that they are presented with more items than they can scan, is familiar with this problem.
Aims & Objectives
Aims
- To test the potential of Bayesian filtering of RSS and ATOM feeds for providing a personalised alerting service; and,
- Should the filtering be shown to work, to raise awareness of the potential of this approach among the JISC community (developers, service managers, policy makers).
Objectives
- To develop a demonstrator service that can be used by an individual to aggregate selected RSS and ATOM feeds and which, when provided with sufficient information concerning the user's interests, will use a naïve Bayesian filtering algorithm to indicate which new items from the feeds being aggregated are likely to be of interest to the user.
- To test the ability of the recommender service to identify new journal papers of interest to researchers based on a knowledge of the papers which they have recently read.
- To raise awareness of the potential of this approach.
- To manage the project well and deliver in a timely fashion.
Approach
The demonstrator service will be built, as far as is practicable, out of existing open source software modules, for example the Bayesian filtering routine used by sux0r, and the RSS aggregator and the user interfaces from sux0r and ticTOCs. All software will be developed as open source software, i.e. using open source applications such as Apache, MySQL and PHP, with code hosted on SourceForge or Google Code and available through an open source licence. An API is intended to allow users to interact remotely with the filtering mechanism, i.e. by indicating which items are and are not relevant to their interests. A typical use for the API would be a widget, on a site such as iGoogle or Netvibes, that displays the items the system suggests are of interest and lets the user indicate any that actually were not.
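The aggregation step can be pictured with a short sketch. The Python below is an assumption-laden illustration, not the sux0r or ticTOCs aggregator: it uses only the standard library, handles plain RSS 2.0 and Atom documents (not RSS 1.0/RDF, which some journal feeds use), and the function names and sample feeds are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of title/link/summary dicts from an RSS 2.0 or Atom document."""
    root = ET.fromstring(xml_text)
    items = []
    if root.tag == "rss":
        for item in root.iter("item"):
            items.append({
                "title": (item.findtext("title") or "").strip(),
                "link": (item.findtext("link") or "").strip(),
                "summary": (item.findtext("description") or "").strip(),
            })
    elif root.tag == ATOM_NS + "feed":
        for entry in root.iter(ATOM_NS + "entry"):
            link = entry.find(ATOM_NS + "link")
            items.append({
                "title": (entry.findtext(ATOM_NS + "title") or "").strip(),
                "link": link.get("href", "") if link is not None else "",
                "summary": (entry.findtext(ATOM_NS + "summary") or "").strip(),
            })
    else:
        raise ValueError("unrecognised feed format: " + root.tag)
    return items


def aggregate(feeds):
    """Merge items from several feed documents, dropping repeated links."""
    seen, merged = set(), []
    for xml_text in feeds:
        for item in parse_feed(xml_text):
            if item["link"] not in seen:
                seen.add(item["link"])
                merged.append(item)
    return merged


# Invented sample feeds, for illustration only.
RSS_SAMPLE = """<rss version="2.0"><channel><title>Journal A</title>
<item><title>Paper one</title><link>http://example.org/a/1</link>
<description>Abstract one</description></item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"><title>Journal B</title>
<entry><title>Paper two</title><link href="http://example.org/b/2"/>
<summary>Abstract two</summary></entry>
</feed>"""

for item in aggregate([RSS_SAMPLE, ATOM_SAMPLE, RSS_SAMPLE]):
    print(item["title"], item["link"])
```

Reducing every item to the same title, link and summary fields means that a filter only ever sees one shape of data, whichever feed format the item came from.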
We will guide a group of approximately 20 researchers through the use of the system, training the Bayesian filter with information about their interests. RSS feeds for the tables of contents of journals which the researchers are interested in will be sourced from ticTOCs. Ideally, information about which items they find useful will come from those feeds; however, the time scale for the project means that there may not be a sufficient number of interesting items in the journal issues for which table of contents feeds are available during the project. To allow for this, the system may be trained using text from the abstracts of papers that have been identified by the researchers as interesting, e.g. papers they have recently read, written or cited. The Bayesian filter will then be used to select items from subsequent journal TOC feeds and the researchers will provide feedback through interviews or questionnaires on the success of the filtering. Researchers will be recruited locally, from Heriot-Watt where possible, in order to facilitate easy interaction with them; the project budget includes a sum for a small incentive for researchers to take part in the trial.
Outputs
Development
We have created a local installation of the open source software sux0r in order to trial the system with researchers. Sux0r is a platform for blogging, bookmarking, sharing photos and reading RSS feeds. Our installation includes only the RSS Reader with Bayesian Filtering in order to simplify the experience for our users. http://icbl.macs.hw.ac.uk/sux0r210/
We have also developed an API for Sux0r to allow other applications to include Bayesian Feed Filtering functionality.
Related Blog Posts:
- About sux0r http://bayesianfeedfilter.wordpress.com/2009/08/25/about-sux0r/
- New Features Planned for sux0r http://bayesianfeedfilter.wordpress.com/2009/08/25/new-features-planned-for-sux0r/
- OAuth http://bayesianfeedfilter.wordpress.com/2009/09/15/oauth/
- Feature Implemented: Return RSS Items for a user http://bayesianfeedfilter.wordpress.com/2009/12/07/feature-implemented-return-rss-items-for-a-user/
- Features: ReturnVectors and ReturnCategories http://bayesianfeedfilter.wordpress.com/2009/12/09/features-returnvectors-and-returncategories/
User Trialling
We recruited 20 research staff and students from Heriot-Watt University as volunteers to trial Bayesian Feed Filtering. The trials consisted of five main stages.
- Initial Meeting: Each volunteer submitted a list of journals that they had interests in. An account was created on our local installation for each user. The users were shown how to train sux0r to learn which articles were of interest to them and which articles were not of interest.
- Initial Questionnaire: Each volunteer completed a questionnaire to gauge their current methods of keeping up to date with journal articles, their experience of using RSS and what their expectations of a filtering service would be.
- Training: The users spent between 4 and 6 weeks marking articles from the latest issues of journals they were interested in as "Interesting" or "Not Interesting". The training could be topped up by adding to the system older articles that they had written or cited, or that were of particular interest to their research.
- Follow up meeting: Once training had concluded, access to the system was blocked for 5 weeks to allow new articles to be published. The system automatically classified the new articles as either "Interesting" or "Not Interesting" for each user, based upon the training. The new Return RSS Items For a User feature allowed us to provide two RSS feeds for each user, one with the interesting items and one with the not interesting items. These feeds were loaded into a version of Thunderbird and the volunteers were asked to review each feed to confirm whether the articles had been placed in the correct feed.
- Follow up satisfaction survey: The final stage of the trials was a questionnaire to gauge how satisfied the users were with the filtering, whether they would consider using a similar tool in the future and what the advantages of such a tool would be.
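The review of the two feeds in the follow up meeting lends itself to standard classification measures. The Python sketch below shows one way such confirmations could be summarised; the function name, the choice of precision, recall and accuracy, and the sample data are assumptions for illustration and are not the project's results or its actual analysis.

```python
def evaluate(reviews):
    """Summarise (predicted_interesting, actually_interesting) pairs of booleans."""
    tp = fp = fn = tn = 0
    for predicted, actual in reviews:
        if predicted and actual:
            tp += 1  # correctly placed in the interesting feed
        elif predicted and not actual:
            fp += 1  # wrongly placed in the interesting feed
        elif not predicted and actual:
            fn += 1  # interesting item missed by the filter
        else:
            tn += 1  # correctly placed in the not interesting feed
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
        "accuracy": (tp + tn) / total if total else None,
    }


# Invented review outcomes for one hypothetical volunteer.
sample_reviews = [
    (True, True), (True, False), (False, False),
    (False, True), (True, True), (False, False),
]
print(evaluate(sample_reviews))
```

Precision indicates how much of the interesting feed was worth reading, while recall indicates how many of the genuinely interesting articles the filter found; a researcher deciding whether to trust the filter is likely to care about both.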
Related Blog posts:
- Trialling of Bayesian Feed Filter http://bayesianfeedfilter.wordpress.com/2009/08/04/trialling/
- User Trialling http://bayesianfeedfilter.wordpress.com/2009/08/25/user-trialling/
- Preliminary Findings of User Trials http://bayesianfeedfilter.wordpress.com/2009/10/30/user-trials/
- User Activity http://bayesianfeedfilter.wordpress.com/2009/12/03/user-activity/
- Statistics of User Trial Results http://bayesianfeedfilter.wordpress.com/2009/12/07/statistics-of-user-trial-results/
- User Trials Follow Up Satisfaction Survey http://bayesianfeedfilter.wordpress.com/2009/12/09/user-trials-follow-up-satisfaction-survey/
Community Engagement
The project disseminated its findings through blog posts. The customised installation of sux0r was made available for external users to test. The installation will be sustained after the project is completed so that these external users, the trial volunteers and potential future users can continue to use the tool.
Related Blog Posts:
- BayesFF in 45 Seconds http://bayesianfeedfilter.wordpress.com/2009/09/03/bayesff-in-45-seconds/
- Idea: Extension to previous literature http://bayesianfeedfilter.wordpress.com/2009/08/11/idea-extension-to-previous-literature/
- About Sux0r http://bayesianfeedfilter.wordpress.com/2009/08/25/about-sux0r/
- New Features Planned For Sux0r http://bayesianfeedfilter.wordpress.com/2009/08/25/new-features-planned-for-sux0r/
Project Management
We adopted the Feature Driven Development approach to project management. The development was broken down into a list of features which could be designed and built rapidly.
Related Blog Posts:
- Project Kicks Off http://bayesianfeedfilter.wordpress.com/2009/06/18/project-kicks-off/
- SWOT Analysis http://bayesianfeedfilter.wordpress.com/2009/07/29/swot-analysis/
- Final Post
Plan and Progress
The project proposal is available on Scribd.
Who did this and who paid them
The project was managed by Phil Barker, of ICBL, Heriot-Watt University. Santiago Chumbe, of ICBL, was responsible for development and Lisa Rogers, also of ICBL, conducted the user trials. Funding for the project was from the JISC as part of the Rapid Innovation programme.