
<html>
<head>
<title>William W. Cohen</title>
</head>
<body bgcolor="white">

<h3><a name="bio"></a>Areas of expertise</h3>

I have extensive experience in machine learning and discovery,
information retrieval, information extraction, and data integration.

<h3>Biography</h3>

William Cohen received his bachelor's degree in Computer Science from
<a href="http://www.duke.edu">Duke University</a> in 1984, and a PhD
in Computer Science from <a href="http://www.rutgers.edu">Rutgers
University</a> in 1990.  From 1990 to 2000 Dr. Cohen worked at AT&T <a
href="http://www.bell-labs.com/">Bell Labs</a> and later <a
href="http://www.research.att.com">AT&T Labs-Research</a>, and from
April 2000 to May 2002 Dr. Cohen worked at <a
href="http://www.whizbang.com">Whizbang Labs</a>, a company
specializing in extracting information from the web.  Dr. Cohen is
currently an action editor for the <a
href="http://www.jmlr.org"><i>Journal of Machine Learning
Research</i></a>, has served as an editor for the journal <a
href="http://www.cs.ualberta.ca/~holte/mlj/"><i>Machine
Learning</i></a> and the <a href="http://www.jair.org"><i>Journal of
Artificial Intelligence Research</i></a>, co-organized the 1994
International Machine Learning Conference, and has served on more than
20 program committees or advisory committees.  In addition to 
his position at CMU, Dr. Cohen also serves on the advisory board of 
<a href="http://www.intelliseek.com">Intelliseek</a>.

<p>
Dr. Cohen's research
interests include information integration and machine learning,
particularly text categorization and learning from large datasets. He
holds six patents related to learning, discovery, information
retrieval, and data integration, and is the author of more than 60
refereed publications.

<!-- <h3><a name="cv">Curriculum vitae</a></h3>

<ul>
<li><a href="cv.pdf">My c.v. in PDF.</a>
</ul>

-->

<h3><a name="sw">Software systems</a></h3>

<ul>
<li>My
latest baby is <a
href="http://minorthird.sourceforge.net">Minorthird</a>,
an open-source Java package of information extraction 
and text classification learning tools.

<li>
<a href="http://secondstring.sourceforge.net">SecondString</a> is
another open-source Java package, of approximate string matching
techniques.

<li><a href="slipper/">SLIPPER</a> and <a href="whirl/">WHIRL</a> are
now being distributed via Rutgers University.  They are free for research
purposes.

<li>Send me email to find out how to get a copy of RIPPER.  

As an alternative to that ancient code: I haven't used it myself, but
I've heard good things about
<a href="http://www.oefai.at/~alexsee/WEKA/doc/weka.classifiers.rules.JRip.html">
J-RIP</a>, a Ripper clone written for WEKA.
</ul>

<h3><a name="data">Datasets</a></h3>

The following datasets are available for anyone to use for research
purposes:
<ul>
<li><a href="classify.tar.gz">classify.tar.gz</a> (0.4Mb) contains
nine problems in which the goal is to classify short entity names.
This data was used in <i>Joins that Generalize: Text Classification
Using WHIRL</i> (KDD-98).

<li><a href="match.tar.gz">match.tar.gz</a> (0.7Mb) contains a suite of
<i>labeled</i> entity-name matching and clustering problems
(i.e. problems for which the correct matches/clusters are provided),
in a single consistent format. In most cases WHIRL's
performance is given as a benchmark.

<li><a href="ranking-data.tar.gz">ranking.tar.gz</a> (8Mb) contains the
data used for the meta-search experiments in my JAIR paper <a
href="http://www.jair.org/abstracts/cohen99a.html">Learning to Order
Things</a> (with Rob Schapire and Yoram Singer).

<li><a href="http://www.cs.cmu.edu/~vitor/codeAndData.html">617
messages from 20 Newsgroups, annotated for reply bodies and
signatures</a>, prepared by my student <a
href="http://www.cs.cmu.edu/~vitor">Vitor Carvalho</a>

<li><a href="http://www.cs.cmu.edu/~einat/datasets.html">
Two subsets of the Enron data, annotated with person names</a>,
prepared by my student <a href="http://www.cs.cmu.edu/~einat">Einat
Minkov</a>.

<li><a href="http://www.cs.cmu.edu/~enron">Enron email dataset</a>
(400Mb, once you get there) contains 800,000+ emails from 150+ users,
organized into 4700+ folders.

<li><a href="repository.tgz">A collection of various extraction datasets
in Minorthird format</a> (6Mb), including about 1000 Enron emails tagged
for person names and temporal expressions.
</ul>

<h3><a name="talks">Recent talks and presentations</a></h3>


<p>
<ul>
<li>Tutorials:
<ul>

  <li><a href="ie-survey.ppt">Information extraction</a> (PowerPoint;
  4.8Mb), aimed at folks somewhat familiar with statistical NLP
  methods.  Two earlier versions of this are also available, both
  given with Andrew McCallum at recent conferences, <a
  href="kdd2003-tutorial.ppt">KDD-2003</a> (PowerPoint; 6.8Mb) and <a
  href="nips-ie-tutorial.ppt">NIPS-2002</a>.

  <li><a href="text-cat-tutorial.ppt">Text classification</a>
  (PowerPoint; 3Mb), given at a recent CALD Summer Course.

  <li><a href="collab-filtering-tutorial.ppt">Collaborative
  filtering</a> (PowerPoint; 9.1Mb), given at a recent DIMACS workshop.

</ul> 

<p>
<li>A mini-course on record linkage and matching:
  <ul>
  <li><a href="Matching-1.ppt">Overview of record linkage methods</a> (PowerPoint; 250kb).
  <li><a href="Matching-2.ppt">Overview of distance metrics for strings</a> (PowerPoint; 530kb).
  <li><a href="Matching-2.ppt">Overview of using HMMs for normalizing
text in record linkage tasks</a> (PowerPoint; 640kb). <br>
  It's not a presentation, but I have also put together a <a
  href="matching/">short annotated bibliography of record linkage and
  matching papers</a>.
  </ul>

<p>
<li><a href="nips-2002.ppt">A presentation of my NIPS-2002 results</a>
on using bootstrapping techniques to improve web page classification,
given at CMU in October 2002. (PowerPoint; 3.2Mb).
<li><a href="www-2002.pdf">A presentation of my WWW-2002 results</a>
on wrapper learning,
presented in April 2002. (PDF; 170kb).
<li><a href="whirl-talk.pdf">An overview of experiments with WHIRL.</a> (PDF; 800kb).
</ul>


<h3><a name="teach">Teaching</a></h3>

June 21,23,25: A mini-course on Minorthird.  
<p>

Materials:
<ul>

<li><a href="day1.tgz">Slides, notes, and sample files from first
day's lecture</a>.

<li><a href="day2.tgz">Slides, notes, and sample files from second
day's lecture</a>.

<li><a href="day3.ppt">PowerPoint slides from third
day's lecture</a>.

<li><a href="minorthird.jar">Jar file for minorThird</a>, if you
only want to run the code, not compile it or read it.
The installation process here is:
  <ol>
  <li>Install Java 1.4 or higher (actually, the JRE is all you need).
  <li>Download the <a href="minorthird.jar">jar for minorThird</a>
    and stick it in some directory.
  <li>Optionally, download the <a href="repository.tgz">sample data
  repository</a> and unpack it into the same directory.
  <li>Change to that same directory and
  then run Minorthird with the command <br>
  <code>java -Xmx500M -jar minorthird.jar</code>

  <p>
  What will pop up will be a small launch pad that can be used to 
  start any of the UI programs.  You can also start a particular
  main by specifying minorthird.jar as your classpath, for 
  instance: <br>

  <code>java -Xmx500M -cp minorthird.jar edu.cmu.minorthird.ui.Help</code>
  </ol>

<li>If you want to do a real install, here's the <a
href="http://minorthird.sourceforge.net">home page on Sourceforge</a>, and
a document on <a href="10-707/QUICKSTART.txt">how to do a CVS
install of Minorthird</a>.
</ul>

<p>

From Spring 2004: <a href="10-707/">"Learning to Turn Words into Data:
Machine Learning Approaches to Information Extraction and Information Integration"</a>, CALD 10-707 and LTI 11-748.


<h3><a name="pubs">Publications</a></h3>

<ul>
<li><a href="pubs-s.html">Recent and selected publications</a>.  These
are some representative publications for which on-line copies can be
distributed.

<li><a href="pubs.html">All publications</a>. Here is a more-or-less
complete chronological list of my publications.  The bibliography
includes pointers to on-line versions when I can provide them, but
unfortunately copyright restrictions don't allow me to make all of my
publications available on-line.  Of course, reprints are always
available from me on request.

<li>Publications by topic:
  <ul>
   <li><a href="pubs-m.html">Matching/Data Integration</a>
   <li><a href="pubs-t.html">Text categorization</a>
   <li><a href="pubs-x.html">Information Extraction</a>
   <li><a href="pubs-r.html">Rule Learning</a>
   <li><a href="pubs-c.html">Collaborative Filtering</a>
   <li><a href="pubs-a.html">Applications</a>
   <li><a href="pubs-f.html">Formal Results</a>
   <li><a href="pubs-i.html">Inductive Logic Programming</a>
   <li><a href="pubs-e.html">Explanation-Based Learning</a>
  </ul>

</ul>

I keep recent papers in HTML or PDF (which requires <a
href="http://www.adobe.com/prodindex/acrobat/readstep.html">Adobe
Acrobat Reader</a> to view).  Older papers are mostly in PostScript.
For Windows, I use the <a
href="http://www.cs.wisc.edu/~ghost/gsview/">GSView</a> reader for
PostScript.  Most of these papers are viewable in several formats in
<a href="http://www.researchindex.com">ResearchIndex</a>.

<h3><a name="buddies">Students</a></h3>

<!-- Students: -->
<ul>
<li><a href="http://www.cs.cmu.edu/~vitor">Vitor Rocha de Carvalho</a>
<li>Zhenzhen Kou 
(co-advised with <a href="http://www.andrew.cmu.edu/user/murphy/">Bob Murphy</a>) 
<li><a href="http://www.cs.cmu.edu/~einat">Einat Minkov</a>
<li>Richard C. Wang
(co-advised with <a href="http://www.cs.cmu.edu/~ref/">Bob Frederking</a>)
<li><a href="http://www.cs.cmu.edu/~mazda">Noboru Matsuda</a>
(postdoc, co-supervised with <a href="http://pact.cs.cmu.edu/koedinger.html">Ken Koedinger</a>)
<li><a href="http://www.cs.cmu.edu/~eairoldi">Edoardo Airoldi</a>
(former student, co-advised with <a href="http://www.stat.cmu.edu/~fienberg/">Steve Fienberg</a>)
<li><a href="http://www.cs.cmu.edu/~pradeepr">Pradeep Ravikumar</a>
(former student, co-advised with <a href="http://www.stat.cmu.edu/~fienberg/">Steve Fienberg</a>)
</ul>

<h3><a name="contact">Contact Info</a></h3>

<p>
William Cohen<br>
Associate Research Professor<br>
Center for Automated Learning & Discovery<br>
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213<br>
Wean Hall 5317 / 412-268-7664 (voice) / 412-268-3431 (fax) <br>

<p><a href="http://people.cs.cmu.edu/person/49142.html">Official CMU Contact Info</a>

<p>My preferred email address is: <font color=blue>wcohen AT cs DOT cmu DOT edu</font>


<h3><a name="misc">Other Stuff</a></h3>

<p>For those many friends whose research I have built on, be warned.
My full name, "William Weston Cohen", is an anagram of the phrase "I now
cite shallow men".

<p>I am often praised for my highly artistic and functional web site designs.
An example is the site for <a href="http://www.scindexing.com">SC Indexing,
a professional book indexer</a>.  However, I accept few clients - this
one happens to be my wife.

<p>Through my advisor, Alex Borgida, I can trace my <a
href="lineage.html">"academic lineage"</a> back to luminaries like
Leibniz and Alfred Whitehead.

<p><a href="hp.html">Poetry anyone?</a>
<hr>

</BODY>
</HTML>