<html>
<head>
<title>William W. Cohen</title>
</head>
<body bgcolor="white">
<h3><a name="bio"></a>Areas of expertise</h3>
I have extensive experience in machine learning and discovery,
information retrieval, information extraction, and data integration.
<h3>Biography</h3>
William Cohen received his bachelor's degree in Computer Science from
<a href="http://www.duke.edu">Duke University</a> in 1984, and a PhD
in Computer Science from <a href="http://www.rutgers.edu">Rutgers
University</a> in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T <a
href="http://www.bell-labs.com/">Bell Labs</a> and later <a
href="http://www.research.att.com">AT&T Labs-Research</a>, and from
April 2000 to May 2002 Dr. Cohen worked at <a
href="http://www.whizbang.com">Whizbang Labs</a>, a company
specializing in extracting information from the web. Dr. Cohen is
currently an action editor for the <a
href="http://www.jmlr.org"><i>Journal of Machine Learning
Research</i></a>, has served as an editor for the journal <a
href="http://www.cs.ualberta.ca/~holte/mlj/"><i>Machine
Learning</i></a> and the <a href="http://www.jair.org"><i>Journal of
Artificial Intelligence Research</i></a>, co-organized the 1994
International Machine Learning Conference, and has served on more than
20 program committees or advisory committees. In addition to
his position at CMU, Dr. Cohen also serves on the advisory board of
<a href="http://www.intelliseek.com">Intelliseek</a>.
<p>
Dr. Cohen's research
interests include information integration and machine learning,
particularly text categorization and learning from large datasets. He
holds six patents related to learning, discovery, information
retrieval, and data integration, and is the author of more than 60
refereed publications.
<!-- <h3><a name="cv">Curriculum vita</cv></h3>
<ul>
<li><a href="cv.pdf">My c.v. in PDF.</a>
</ul>
-->
<h3><a name="sw">Software systems</a></h3>
<ul>
<li>My
latest baby is <a
href="http://minorthird.sourceforge.net">Minorthird</a>,
an open-source Java package of information extraction
and text classification learning tools.
<li>
<a href="http://secondstring.sourceforge.net">SecondString</a> is
another open-source Java package, of approximate string matching
techniques.
<li><a href="slipper/">SLIPPER</a> and <a href="whirl/">WHIRL</a> are
now being distributed via Rutgers University. They are free for research
purposes.
<li>Send me email to find out how to get a copy of RIPPER.
As an alternative to that ancient code: I haven't used it myself, but
I've heard good things about
<a href="http://www.oefai.at/~alexsee/WEKA/doc/weka.classifiers.rules.JRip.html">
J-RIP</a>, a Ripper clone written for WEKA.
</ul>
<h3><a name="data">Datasets</a></h3>
The following datasets are available for anyone to use for research
purposes:
<ul>
<li><a href="classify.tar.gz">classify.tar.gz</a> (0.4Mb) contains
nine problems in which the goal is to classify short entity names.
This data was used in <i>Joins that Generalize: Text Classification
Using WHIRL</i> (KDD-98).
<li><a href="match.tar.gz">match.tar.gz</a> (0.7Mb) contains a suite of
<i>labeled</i> entity-name matching and clustering problems
(i.e. problems for which the correct matches/clusters are provided),
in a single consistent format. In most cases WHIRL's
performance is given as a benchmark.
<li><a href="ranking-data.tar.gz">ranking.tar.gz</a> (8Mb) contains the
data used for the meta-search experiments in my JAIR paper <a
href="http://www.jair.org/abstracts/cohen99a.html">Learning to Order
Things</a> (with Rob Schapire and Yoram Singer).
<li><a href="http://www.cs.cmu.edu/~vitor/codeAndData.html">617
messages from 20 Newsgroups, annotated for reply bodies and
signatures</a>, prepared by my student <a
href="http://www.cs.cmu.edu/~vitor">Vitor Carvalho</a>.
<li><a href="http://www.cs.cmu.edu/~einat/datasets.html">
Two subsets of the Enron data, annotated with person names</a>,
prepared by my student <a href="http://www.cs.cmu.edu/~einat">Einat
Minkov</a>.
<li><a href="http://www.cs.cmu.edu/~enron">Enron email dataset</a>
(400Mb, once you get there) contains 800,000+ emails from 150+ users
organized into 4700+ folders.
<li><a href="repository.tgz">A collection of various extraction datasets
in Minorthird format</a> (6Mb), including about 1000 Enron emails tagged
for person names and temporal expressions.
</ul>
<h3><a name="talks">Recent talks and presentations</a></h3>
<p>
<ul>
<li>Tutorials:
<ul>
<li><a href="ie-survey.ppt">Information extraction</a> (PowerPoint;
4.8Mb), aimed at folks somewhat familiar with statistical NLP
methods. Two earlier versions of this are also available, both
given with Andrew McCallum at recent conferences, <a
href="kdd2003-tutorial.ppt">KDD-2003</a> (PowerPoint; 6.8Mb) and <a
href="nips-ie-tutorial.ppt">NIPS-2002</a>.
<li><a href="text-cat-tutorial.ppt">Text classification</a>
(PowerPoint; 3Mb), given at a recent CALD Summer Course.
<li><a href="collab-filtering-tutorial.ppt">Collaborative
filtering</a> (PowerPoint; 9.1Mb), given at a recent DIMACS workshop.
</ul>
<p>
<li>A mini-course on record linkage and matching:
<ul>
<li><a href="Matching-1.ppt">Overview of record linkage methods</a> (PowerPoint; 250kb).
<li><a href="Matching-2.ppt">Overview of distance metrics for strings</a> (PowerPoint; 530kb).
<li><a href="Matching-2.ppt">Overview of using HMMs for normalizing
text in record linkage tasks</a> (PowerPoint; 640kb). <br>
It's not a presentation, but I have also put together a <a
href="matching/">short annotated bibliography of record linkage and
matching papers</a>.
</ul>
<p>
<li><a href="nips-2002.ppt">A presentation of my NIPS-2002 results</a>
on using bootstrapping techniques to improve web page classification,
given at CMU in October 2002. (PowerPoint; 3.2Mb).
<li><a href="www-2002.pdf">A presentation of my WWW-2002 results</a>
on wrapper learning,
presented in April 2002. (PDF; 170kb).
<li><a href="whirl-talk.pdf">An overview of experiments with WHIRL.</a> (PDF; 800kb).
</ul>
<h3><a name="teach">Teaching</a></h3>
June 21, 23, 25: A mini-course on Minorthird.
<p>
Materials:
<ul>
<li><a href="day1.tgz">Slides, notes, and sample files from first
day's lecture</a>.
<li><a href="day2.tgz">Slides, notes, and sample files from second
day's lecture</a>.
<li><a href="day3.ppt">PowerPoint slides from third
day's lecture</a>.
<li><a href="minorthird.jar">Jar file for minorThird</a>, if you
only want to run the code, not compile it or read it.
The installation process here is:
<ol>
<li>Install Java 1.4 or higher (actually, JRE is all you need).
<li>Download the <a href="minorthird.jar">jar for minorThird</a>
and stick it in some directory.
<li>Optionally, download the <a href="repository.tgz">sample data
repository</a> and unpack it into the same directory.
<li>Change to that same directory and
then run Minorthird with the command <br>
<code>java -Xmx500M -jar minorthird.jar</code>
<p>
A small launch pad will pop up, which can be used to
start any of the UI programs. You can also start a particular
main class by specifying minorthird.jar as your classpath, for
instance: <br>
<code>java -Xmx500M -cp minorthird.jar edu.cmu.minorthird.ui.Help</code>
</ol>
<li>If you want to do a real install, here's the <a
href="http://minorthird.sourceforge.net">home page on Sourceforge</a>, and
a document on <a href="10-707/QUICKSTART.txt">how to do a CVS
install of Minorthird</a>.
</ul>
<p>
From Spring 2004: <a href="10-707/">"Learning to Turn Words into Data:
Machine Learning Approaches to Information Extraction and Information Integration"</a>, CALD 10-707 and LTI 11-748.
<h3><a name="pubs">Publications</a></h3>
<ul>
<li><a href="pubs-s.html">Recent and selected publications</a>. These
are some representative publications for which on-line copies can be
distributed.
<li><a href="pubs.html">All publications</a>. Here is a more-or-less
complete chronological list of my publications. The bibliography
includes pointers to on-line versions when I can provide them, but
unfortunately copyright restrictions don't allow me to make all of my
publications available on-line. Of course, reprints are always
available from me on request.
<li>Publications by topic:
<ul>
<li><a href="pubs-m.html">Matching/Data Integration</a>
<li><a href="pubs-t.html">Text categorization</a>
<li><a href="pubs-x.html">Information Extraction</a>
<li><a href="pubs-r.html">Rule Learning</a>
<li><a href="pubs-c.html">Collaborative Filtering</a>
<li><a href="pubs-a.html">Applications</a>
<li><a href="pubs-f.html">Formal Results</a>
<li><a href="pubs-i.html">Inductive Logic Programming</a>
<li><a href="pubs-e.html">Explanation-Based Learning</a>
</ul>
</ul>
Recent papers I'm keeping in HTML or PDF (which requires <a
href="http://www.adobe.com/prodindex/acrobat/readstep.html">Adobe
Acrobat Reader</a> to view). Older papers are mostly in PostScript.
For Windows, I use the <a
href="http://www.cs.wisc.edu/~ghost/gsview/">GSView</a> reader for
PostScript. Most of these papers are viewable in several formats in
<a href="http://www.researchindex.com">ResearchIndex</a>.
<h3><a name="buddies">Students</a></h3>
<!-- Students: -->
<ul>
<li><a href="http://www.cs.cmu.edu/~vitor">Vitor Rocha de Carvalho</a>
<li>Zhenzhen Kou
(co-advised with <a href="http://www.andrew.cmu.edu/user/murphy/">Bob Murphy</a>)
<li><a href="http://www.cs.cmu.edu/~einat">Einat Minkov</a>
<li>Richard C. Wang
(co-advised with <a href="http://www.cs.cmu.edu/~ref/">Bob Frederking</a>)
<li><a href="http://www.cs.cmu.edu/~mazda">Noboru Matsuda</a>
(postdoc, co-supervised with <a href="http://pact.cs.cmu.edu/koedinger.html">Ken Koedinger</a>)
<li><a href="http://www.cs.cmu.edu/~eairoldi">Edoardo Airoldi</a>
(former student, co-advised with <a href="http://www.stat.cmu.edu/~fienberg/">Steve Fienberg</a>)
<li><a href="http://www.cs.cmu.edu/~pradeepr">Pradeep Ravikumar</a>
(former student, co-advised with <a href="http://www.stat.cmu.edu/~fienberg/">Steve Fienberg</a>)
</ul>
<h3><a name="contact">Contact Info</a></h3>
<p>
William Cohen<br>
Associate Research Professor<br>
Center for Automated Learning &amp; Discovery<br>
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213<br>
Wean Hall 5317 / 412-268-7664 (voice) / 412-268-3431 (fax) <br>
<p><a href="http://people.cs.cmu.edu/person/49142.html">Official CMU Contact Info</a>
<p>My preferred email address is: <font color=blue>wcohen AT cs DOT cmu DOT edu</font>
<h3><a name="misc">Other Stuff</a></h3>
<p>For those many friends whose research I have built on, be warned.
My full name, "William Weston Cohen", is an anagram of the phrase "I now
cite shallow men".
<p>I am often praised for my highly artistic and functional web site designs.
An example is the site for <a href="http://www.scindexing.com">SC Indexing,
a professional book indexer</a>. However, I accept few clients - this
one happens to be my wife.
<p>Through my advisor, Alex Borgida, I can trace my <a
href="lineage.html">"academic lineage"</a> back to luminaries like
Leibniz and Alfred Whitehead.
<p><a href="hp.html">Poetry anyone?</a>
<hr>
</BODY>
</HTML>