Information Gain is a well-known and empirically proven method for high-dimensional feature selection. The author found that it and other existing methods failed to produce good results on an industrial text classification problem. Investigating the root cause revealed that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring the features needed to discriminate difficult classes. This paper demonstrates that this pitfall hurts performance even on a relatively uniform text classification task. Based on this understanding, it presents solutions inspired by round-robin scheduling that avoid the pitfall without resorting to costly wrapper methods.
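The abstract does not spell out the round-robin procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: each class has its own ranking of features by some per-class score (for example, Information Gain computed one-vs-rest), and the classes take turns nominating their best feature that has not yet been selected. The function names `global_top_k` and `round_robin_select` and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical per-class feature scores (class -> feature -> score).
# The "easy" class has a surplus of strongly predictive features;
# the "hard" class only has weak ones.
scores: Dict[str, Dict[str, float]] = {
    "easy": {"goal": 0.9, "match": 0.8, "team": 0.7, "vote": 0.1},
    "hard": {"vote": 0.3, "poll": 0.2, "goal": 0.05},
}


def global_top_k(scores: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Baseline: rank features by their best score over all classes, keep the top k."""
    best: Dict[str, float] = {}
    for feats in scores.values():
        for feature, score in feats.items():
            best[feature] = max(best.get(feature, float("-inf")), score)
    # Ties are broken by feature name so the result is deterministic.
    return sorted(best, key=lambda f: (-best[f], f))[:k]


def round_robin_select(scores: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Assumed round-robin scheme: classes take turns picking their best unused feature."""
    # Per-class rankings, highest score first.
    rankings = {
        cls: sorted(feats, key=lambda f: (-feats[f], f))
        for cls, feats in scores.items()
    }
    positions = {cls: 0 for cls in rankings}
    classes = sorted(rankings)  # fixed turn order
    selected: List[str] = []
    chosen = set()

    while len(selected) < k:
        progressed = False
        for cls in classes:
            if len(selected) >= k:
                break
            ranking = rankings[cls]
            i = positions[cls]
            # Skip features that another class has already nominated.
            while i < len(ranking) and ranking[i] in chosen:
                i += 1
            if i < len(ranking):
                chosen.add(ranking[i])
                selected.append(ranking[i])
                i += 1
                progressed = True
            positions[cls] = i
        if not progressed:
            break  # every class has run out of features
    return selected


print(global_top_k(scores, 3))        # ['goal', 'match', 'team']
print(round_robin_select(scores, 3))  # ['goal', 'vote', 'match']
```

With these toy scores, the global ranking spends all three slots on features for the easy class, while the round-robin pass reserves one slot for `vote`, the best feature of the hard class. This illustrates the pitfall the abstract describes; it does not reproduce the paper's experiments.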

