My research interests lie primarily in the field of text mining, with the goal of discovering knowledge and structure in large-scale text databases. My research spans closely related fields such as Web Mining, Information Retrieval and Machine Learning. In recent years, an increasingly large amount of information, much of it in unstructured text format, has become available electronically. The sheer volume necessitates the development of automatic tools that help analyze the text. This is the primary goal of text mining and a key motivation for my research in this area.

There has been growing interest in text mining over the last decade or so, and a variety of successful solutions addressing different problems have been proposed. However, the field is still young, with several open questions. (I present a detailed review of a sample of existing research and outline some open problems in [3]. This is joint work with my advisor, Padmini Srinivasan, and graduate student Xin Ying Qiu.) My research focuses on developing new solutions that overcome some of the limitations of existing research while also exploring problems that have not been previously addressed.

Another key motivation for my research stems from the fact that much of the existing research in text mining has been in the biomedical domain. Far less research has been done in other domains, especially the Web. Thus I am interested in developing general text mining algorithms that apply to different domains as well as specific algorithms for the web domain. Overall, I believe that the ability of text mining methods to process large volumes of text data, coupled with their potential to provide access to interesting information, makes this a very promising area of research.

Within the overall context of text mining, I have focused on two distinct but related research areas. The first is knowledge discovery, with the goal of automatically identifying novel hypotheses from the text.
The second is automatically and accurately identifying sources of relevant information. Much of my prior research falls under biomedical text mining. My current focus is on more general text mining problems and is more web-centric.

Knowledge Discovery

In his seminal work in the mid-1980s, Don Swanson showed that it was possible to discover new knowledge from the existing literature by linking the information present in complementary but disconnected articles. Along with Neil Smalheiser, he postulated a number of novel biomedical hypotheses, which were later verified by domain experts. They developed two approaches, known as Open and Closed discovery, for generating new hypotheses. However, their research required substantial manual input. Since then, a number of efforts have aimed to automate the discovery process. Most of these still require substantial human intervention in some key steps and have not been as successful as the original efforts of Swanson and Smalheiser.

My research in this area has primarily been in developing and implementing new algorithms for hypothesis generation. These algorithms are substantially (and in certain settings fully) automatic. In collaboration with my advisor, I have developed a web-based biomedical text mining system called Manjal (http://sulu.info-science.uiowa.edu/Manjal.html [4]). Manjal uses metadata profiles, derived from MeSH terms assigned to MEDLINE documents, to represent biomedical topics. The relationship between two topics is determined by the similarity of their profiles. Profiles may be limited to MeSH terms within the semantic categories of interest to a user. Manjal implements Open and Closed discovery on top of MeSH profiles. Manjal has been evaluated extensively by replicating many of the discoveries made by Swanson and Smalheiser. (Padmini Srinivasan. Text Mining: Generating Hypotheses from MEDLINE. Journal of the American Society for Information Science and Technology, 55(5):396-413, 2004.)
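
As a rough illustration of how profile similarity could work, the following Python sketch builds a profile for a topic from the MeSH terms assigned to its documents and compares two profiles. The function names, the document-frequency weighting, the cosine measure and the toy data are all assumptions made for this example; they are not taken from Manjal's implementation.

```python
import math
from collections import Counter


def build_profile(docs_mesh_terms):
    """Build a unit-length term-weight profile for one topic.

    docs_mesh_terms: list of lists, one list of MeSH terms per document
    retrieved for the topic. Weights are document frequencies
    (an illustrative assumption, not Manjal's actual weighting).
    """
    counts = Counter(term for terms in docs_mesh_terms for term in set(terms))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def cosine(p, q):
    """Cosine similarity of two unit-length profiles."""
    if len(p) > len(q):
        p, q = q, p  # iterate over the smaller profile
    return sum(w * q.get(term, 0.0) for term, w in p.items())


if __name__ == "__main__":
    # Toy, hand-made term lists (hypothetical data, not real MEDLINE output).
    topic_a = build_profile([["Magnesium", "Migraine Disorders"],
                             ["Magnesium", "Calcium"]])
    topic_b = build_profile([["Magnesium", "Calcium"]])
    print(round(cosine(topic_a, topic_b), 3))
```

Restricting a profile to particular semantic categories would, under the same assumptions, amount to filtering the term lists before calling build_profile.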
Manjal has also been used to propose a beneficial relationship between the dietary substance Curcumin Longa (or Turmeric) and different disorders such as retinal diseases and Crohn's disease [8]. Additionally, we have used MeSH profiles to explore connections between genes [2] and identify groups of related genes and drugs [9]. We are currently adding to the range of functions Manjal provides, for example, analyzing relationships between topics in bipartite sets.

Identifying Relevant Sources of Information

The ability to distinguish between relevant and non-relevant sources of information is at the heart of research in Information Retrieval. It is also crucial for text mining methods, as they depend significantly upon the quality of the underlying text. My research in this area focuses on two application areas: gene queries and proteins.

Gene Queries

MEDLINE is a large-scale database containing records for over 14 million biomedical research articles. The web-based interface to MEDLINE, PubMed, offers a sophisticated range of search functions within a Boolean framework. However, retrieved documents are sorted only chronologically and not by relevance. Ranking documents by their potential relevance can be very valuable, especially when large sets of documents are retrieved, as is the case with many gene queries. In recent research [6], we explored five information retrieval-based methods to rank documents retrieved for over 9000 human gene queries. We also addressed three different kinds of ambiguity in gene nomenclature. (In a related study [7], we focus on one specific type of ambiguity, viz., gene terms that are also English words.) Two of our strategies worked very well and offered significant improvements over our baseline. They did well even when there was ambiguity in the gene terms.
We also developed an approach to successfully predict which strategy is more appropriate for a given gene. (We present a more detailed study on predicting which strategy to use for a subset of these gene queries in [5].) We provide access to ranked lists of documents generated by our best strategy via a web-based system known as GeneDocs (http://sulu.info-science.uiowa.edu/genedocs/). Currently, we are expanding the system to display ranked lists for more genes and also making it more dynamic, so that it automatically fetches and ranks relevant documents recently added to MEDLINE.

Proteins

In the summer of 2006, I had the opportunity to work at the Saier Lab in the Division of Biology at the University of California San Diego as a visiting student researcher. The Saier Lab has created and maintains the Transporter Classification Database (TCDB, http://www.tcdb.org), a widely used resource that provides information on transmembrane transporter proteins. Currently, human experts update this database manually. Our goal was to automate this process, which would save significant time and effort and dramatically increase the comprehensiveness of the TCDB. While our specific focus was on the TCDB, our research has more general application.

We designed and compared two approaches that automatically identify sources of information to be included in the TCDB. The first is a document-based approach that builds classifiers to automatically recognize relevant documents in MEDLINE. The second is a record-based approach that uses a set of rules to automatically recognize relevant records in the SwissProt and TrEMBL databases. We also carefully analyzed how to evaluate the usefulness of these methods and performed experiments to test our approaches. Our experiments showed that the document-based approach provides more comprehensive coverage. However, the record-based approach may be more practical, as information can be extracted more easily from structured database records than from documents.
We believe that a machine learning approach to recognize relevant records will combine the strengths of both approaches. A paper describing this research has been submitted to BMC Bioinformatics [1].

Currently, my primary focus is on the research towards my Ph.D. dissertation. The major theme of my dissertation research is the discovery of interesting and potentially novel associations and hypotheses from web data. This work builds on the work I have previously done with biomedical data. I have recently and successfully presented my dissertation proposal to my committee.

References:

[1] Aditya K. Sehgal, Sanmay Das, Charles Elkan and Milton H. Saier Jr. Automatically Updating a Specialized Database: Comparing Literature-Based and Database-Based Approaches. Submitted to BMC Bioinformatics, November 2006.

[2] Aditya K. Sehgal, Xin Ying Qiu and Padmini Srinivasan. Mining MEDLINE Metadata to Explore Genes and their Connections. In Proceedings of the SIGIR 2003 Workshop on Text Analysis and Search for Bioinformatics, July 2003.

[3] Aditya K. Sehgal, Xin Ying Qiu and Padmini Srinivasan. Analyzing LBD methods using a General Framework. In: P. Bruza, J.M. Owen and R. van Berckelaar (eds.). Edited volume on Literature-Based Discovery. Information Science and Knowledge Management Series, Springer Verlag. To appear, 2007.

[4] Aditya K. Sehgal and Padmini Srinivasan. Manjal - A Text Mining System for MEDLINE (Demonstration). In Proceedings of the 28th Annual International ACM SIGIR, p. 680, August 2005.

[5] Aditya K. Sehgal and Padmini Srinivasan. Predicting Performance for Gene Queries. In Proceedings of the SIGIR 2005 Workshop on Predicting Query Difficulty, August 2005.

[6] Aditya K. Sehgal and Padmini Srinivasan. Retrieval with Gene Queries. BMC Bioinformatics, 7:220, March 2006.

[7] Aditya K. Sehgal, Padmini Srinivasan and Olivier Bodenreider. Gene Terms and English Words: An Ambiguous Mix. In Proceedings of the SIGIR 2004 Workshop on Search and Discovery for Bioinformatics, July 2004.

[8] Padmini Srinivasan, Bisharah Libbus and Aditya K. Sehgal. Mining MEDLINE: Postulating a Beneficial Role for Curcumin Longa in Retinal Diseases. In Proceedings of the HLT-NAACL 2004 Workshop: BioLink 2004, Linking Biological Literature, Ontologies and Databases, pp. 33-40, May 2004.

[9] Padmini Srinivasan and Aditya K. Sehgal. Mining MEDLINE for Similar Genes and Similar Drugs. Technical Report #03-02, Dept. of Computer Science, The University of Iowa, July 2003.