instructions.txt
<instructions>
<initial_instructions>
The final instructions are at the end of this file. Please review them carefully. You will implement three rounds of modifications. The goal is to reproduce the results of the paper. I will provide the paper, my original plan, and some criticism. You are in an environment managed by "uv" (NOT uvicorn). To install dependencies, use "uv pip install [dependency]".
</initial_instructions>
<paper>
arXiv:2110.15794v1 [cs.CL] 26 Oct 2021
CLAUSEREC: A Clause Recommendation Framework for AI-aided
Contract Authoring
Aparna Garimella
Adobe Research
Vinay Aggarwal
Adobe Research
Anandhavelu N
Adobe Research
Balaji Vasan Srinivasan
Adobe Research
Rajiv Jain
Adobe Research
Abstract
Contracts are a common type of legal document that appears frequently in several day-to-day business workflows. However, there has been very limited NLP research in processing such documents, and even less in generating them.
These contracts are made up of clauses, and
the unique nature of these clauses calls for
specific methods to understand and generate
such documents. In this paper, we intro-
duce the task of clause recommendation, as
a first step to aid and accelerate the author-
ing of contract documents. We propose a two-
staged pipeline to first predict if a specific
clause type is relevant to be added in a con-
tract, and then recommend the top clauses for
the given type based on the contract context.
We pretrain BERT on an existing library of
clauses with two additional tasks and use it
for our prediction and recommendation. We
experiment with classification methods and
similarity-based heuristics for clause relevance
prediction, and generation-based methods for
clause recommendation, and evaluate the re-
sults from various methods on several clause
types. We provide analyses on the results, and
further outline the advantages and limitations
of the various methods for this line of research.
1 Introduction
A contract is a legal document between at least two
parties that outlines the terms and conditions of the
parties to an agreement. Contracts are typically in
textual format, thus providing a huge potential for
NLP applications in the space of legal documents.
However, unlike most natural language corpora that
are typically used in NLP research, contract lan-
guage is repetitive with high inter-sentence similar-
ities and sentence matches (Simonson et al., 2019),
calling for new methods specific to legal language
to understand and generate contract documents.
A contract is essentially made up of clauses,
which are provisions to address specific terms of
the agreement, and which form the legal essence
of the contract. Drafting a contract involves select-
ing an appropriate template (with skeletal set of
clauses), and customizing it for the specific pur-
pose, typically via adding, removing, or modifying
the various clauses in it. Both these stages involve
manual effort and domain knowledge, and hence
can benefit from assistance from NLP methods that
are trained on large collections of contract docu-
ments. In this paper, we attempt to take the first
step towards AI-assisted contract authoring, and
introduce the task of clause recommendation, and
propose a two-staged approach to solve it.
There have been some recent works on item-
based and content-based recommendations. Wang
and Fu (2020) reformulated the next sentence pre-
diction task in BERT (Devlin et al., 2019) as
next purchase prediction task to make a collabora-
tive filtering based recommendation system for e-
commerce setting. Malkiel et al. (2020) introduced
RecoBERT leveraging textual description of items
such as titles to build an item-to-item recommen-
dation system for wine and fashion domains. In
the space of text-based content recommendations,
Bhagavatula et al. (2018) proposed a method to rec-
ommend citations in academic paper drafts without
using metadata. However, legal documents remain
unexplored, and it is not straightforward to extend
these methods to recommend clauses in contracts,
as these documents are heavily domain-specific
and recommending content in them requires spe-
cific understanding of their language.
In this paper, clause recommendation is defined
as the process of automatically providing recom-
mendations of clauses that may be added to a given
contract while authoring it. We propose a two-
staged approach: first, we predict if a given clause
type is relevant to be added to the given input con-
tract; examples of clause types include governing
laws, confidentiality, etc. Next, if a given clause
type is predicted as relevant, we provide context-aware recommendations of clauses belonging to the given type for the input contract.
Figure 1: CLAUSEREC pipeline: Binary classification + generation for clause recommendation.
We develop
CONTRACTBERT, by further pre-training BERT
using two additional tasks, and use it as the under-
lying language model in both the stages to adapt
it to contracts. To the best of our knowledge, this
is the first effort towards developing AI assistants
for authoring and generating long domain-specific
legal contracts.
2 Methodology
A contract can be viewed as a collection of clauses
with each clause comprising: (a) the clause la-
bel that represents the type of the clause and (b)
the clause content. Our approach consists of two
stages: (1) clause type relevance prediction: pre-
dicting if a given clause type that is not present in
the given contract may be relevant to it, and (2)
clause recommendation: recommending clauses
corresponding to the given type that may be rele-
vant to the contract. Figure 1 shows an overview of
our proposed pipeline.
First, we build a model to effectively represent a
contract by further pre-training BERT, a pre-trained
Transformer-based encoder (Devlin et al., 2019),
on contracts to bias it towards legal language. We
refer to the resulting model as CONTRACTBERT.
In addition to masked language modelling and next
sentence prediction, CONTRACTBERT is trained
to predict (i) if the words in a clause label belong
to a specific clause, and (ii) if two sentences be-
long to the same clause, enabling the embeddings
of similar clauses to cluster together. Figure 2
and 3 show the difference in the performance of
BERT and CONTRACTBERT to get a meaningful
clause embedding. BERT is unable to differen-
tiate between the clauses of different types as it
is unfamiliar with legal language. On the other hand, CONTRACTBERT is able to cluster similar clause types closely while ensuring the separation between clauses of two different types.
Figure 2: Clustering of clauses using BERT embeddings.
Figure 3: Clustering of clauses using ContractBERT embeddings.
2.1 Clause Type Relevance Prediction
Given a contract and a specific target clause type,
the first stage involves predicting if the given type
may be relevant to be added to the contract. We
train binary classifiers for relevance prediction for
each of the target clause types. Given an input
contract, we obtain its CONTRACTBERT repre-
sentation as shown in Figure 1. Since the number of tokens in a contract is usually very large (≫ 512), we obtain the contextual representations
of each of the clauses present and average their
[CLS] embeddings to obtain the contract represen-
tation ct_rep. This representation is fed as input to
a binary classifier which is a small fully-connected
neural network that is trained using binary cross
entropy loss. We use a probability score of over
0.5 as a positive prediction, i.e., the target clause
type is relevant to the input contract.
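As a minimal sketch of this stage (assuming a Hugging Face BERT-style encoder stands in for CONTRACTBERT; the helper name and the 512-token truncation are illustrative assumptions, not the authors' code):

import torch
from transformers import AutoModel, AutoTokenizer

def contract_representation(clause_texts, tokenizer, encoder, device="cpu"):
    # ct_rep: average of the [CLS] embeddings of the clauses in one contract.
    encoder.eval()
    cls_vectors = []
    with torch.no_grad():
        for text in clause_texts:
            enc = tokenizer(text, truncation=True, max_length=512,
                            return_tensors="pt").to(device)
            out = encoder(**enc)
            cls_vectors.append(out.last_hidden_state[:, 0, :])  # [1, hidden]
    return torch.cat(cls_vectors, dim=0).mean(dim=0)            # [hidden]

# Example with plain BERT standing in for CONTRACTBERT:
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# encoder = AutoModel.from_pretrained("bert-base-uncased")
# ct_rep = contract_representation(["Clause one ...", "Clause two ..."], tokenizer, encoder)
# ct_rep is then fed to the small fully connected classifier described in Appendix B.1.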
2.2 Clause Content Recommendation
Once a target clause type is predicted as relevant,
the next stage is to recommend clause content cor-
responding to the given type for the contract. We
model this as a sequence-to-sequence generation
task, where the input includes the given contract
and clause label, and the output contains relevant
clause content that may be added to the contract.
We start with a transformer-based encoder-decoder
architecture (Vaswani et al., 2017), follow (Liu
and Lapata, 2019) and initialize our encoder with
CONTRACTBERT. We then train the transformer
decoder for generating clause content. As men-
tioned above, the inputs for the encoder comprise a contract and a target clause type.
We calculate the representations of all possible
clauses belonging to the given type in the dataset
using CONTRACTBERT, and their [CLS] token’s
embeddings are averaged, to obtain a target clause
type representation trgt_cls_rep. This trgt_cls_rep
and the contract representation ct_rep are averaged
to obtain the encoding of the given contract and
target clause type, which is used as input to the de-
coder. Note that since CONTRACTBERT is already
pre-trained on the contracts, we do not need to train
the encoder again for clause generation. Given the
average of the contract and target clause type repre-
sentation as input, the decoder is trained to generate
the appropriate clause belonging to the target type
which might be relevant to the contract. Note that
our generation method provides a single clause as
recommendation. On the other hand, with retrieval-
based methods, we can obtain multiple suggestions
for a given clause type using similarity measures.
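A minimal sketch of the conditioning described above, assuming trgt_cls_rep is computed with the same [CLS] averaging used for ct_rep; the function names are illustrative assumptions:

import torch

def clause_type_representation(clause_texts, tokenizer, encoder, device="cpu"):
    # trgt_cls_rep: average of the [CLS] embeddings of every clause of the
    # target type in the dataset (same averaging as for ct_rep).
    encoder.eval()
    vecs = []
    with torch.no_grad():
        for text in clause_texts:
            enc = tokenizer(text, truncation=True, max_length=512,
                            return_tensors="pt").to(device)
            vecs.append(encoder(**enc).last_hidden_state[:, 0, :])
    return torch.cat(vecs, dim=0).mean(dim=0)

def decoder_conditioning(ct_rep, trgt_cls_rep):
    # The decoder is conditioned on the average of the contract and the
    # target clause type representations.
    return (ct_rep + trgt_cls_rep) / 2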
3 Experiments and Evaluation
We evaluate three methods for clause type rele-
vance prediction + clause recommendation: (1)
Binary classification + clause generation, which
is our proposed approach; (2) Collaborative filtering + similarity-based retrieval; and (3) Document
similarity + similarity-based retrieval.
Collaborative filtering (CF) + similarity-based
retrieval. Clause type relevance prediction can be
seen as an item-item based CF task (Linden et al.,
2003) with contracts as users and clause types as
items. We construct a contract-clause type matrix,
equivalent to the user-item matrix. If contract u
contains clause type i, the cell (u, i) gets the value
1, otherwise 0. We then compute the similarity
between all the clause type pairs (i, j), using an
adjusted cosine similarity, given by,
$$\mathrm{sim}(i, j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u \in U} r_{u,i}^2}\,\sqrt{\sum_{u \in U} r_{u,j}^2}} \quad (1)$$
We obtain the item similarity matrix using this co-
sine score, and use it to predict if a target clause
type t is relevant to a given contract. We compute
the score for t using the weighted sum of the score
of the other similar clause types, given by,
$$\mathrm{score}(u, t) = \frac{\sum_{j \in I} \mathrm{sim}(t, j)\,(r_{u,j} - \bar{r}_j)}{\sum_{j \in I} \mathrm{sim}(t, j)} + \bar{r}_t \quad (2)$$
If t gets a high score and is not already present
in the contract, it is recommended. We experiment
with multiple thresholds above which a clause type
may be recommended.
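A rough NumPy sketch of this scoring, following Eqs. (1) and (2) above; R is the binary contract-clause type matrix, and using absolute similarities in the denominator of Eq. (2) is an added assumption to keep it positive. This is an illustration, not the authors' implementation:

import numpy as np

def clause_type_scores(R, t, eps=1e-8):
    # R: (num_contracts, num_clause_types) binary matrix; t: target type index.
    r_u = R.mean(axis=1, keepdims=True)   # per-contract mean, r_bar_u
    r_j = R.mean(axis=0)                  # per-clause-type mean, r_bar_j
    # Eq. (1): adjusted cosine similarity between t and every clause type j.
    num = ((R[:, t:t + 1] - r_u) * (R - r_j)).sum(axis=0)
    den = np.linalg.norm(R[:, t]) * np.linalg.norm(R, axis=0) + eps
    sim = num / den
    sim[t] = 0.0                          # exclude the target type itself
    # Eq. (2): weighted score of t for every contract u.
    scores = ((R - r_j) * sim).sum(axis=1) / (np.abs(sim).sum() + eps) + r_j[t]
    return scores  # recommend t for contract u if scores[u] exceeds the tuned threshold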
Given a clause library containing all possi-
ble clause types and their corresponding clauses,
clause content recommendation can be seen as a
similarity-based retrieval task. For a given con-
tract and a target clause type t, we use ct_rep and
trgt_cls_rep, and find cosine similarities with each
of the clauses belonging to t to find the most similar
clauses that may be relevant to the given contract.
We do so by computing the similarity of either (i)
ct_rep or (ii) (ct_rep + trgt_cls_rep)/2, with indi-
vidual clause representations.
Document similarity + similarity-based re-
trieval. This is based on using similar documents
to determine if a target clause type t can be rec-
ommended for a given contract. The hypothesis is
that similar contracts tend to have similar clause
types. To find similar documents, we compute
cosine similarities between the given contract’s rep-
resentations ct_rep with those of all the contracts
in the (training) dataset to identify the top k similar
contracts. If t is present in any of the k similar con-
tracts and is not present in the given contract, it is
recommended as a relevant clause type to be added
to the contract. We experiment with k ∈ {1, 5}. Similarity-based retrieval for clause content recommendation is the same as above.

CLAUSE TYPE         METHOD                 PREC.   REC.    ACC.    F1
Governing Laws      CF-based               0.5889  0.8166  0.6243  0.6843
                    Doc sim-based          0.7882  0.6225  0.7276  0.6957
                    Binary classification  0.6898  0.7535  0.7082  0.7203
Severability        CF-based               0.6396  0.9091  0.6987  0.7509
                    Doc sim-based          0.7156  0.8182  0.7467  0.7635
                    Binary classification  0.7654  0.8042  0.7790  0.7843
Notices             CF-based               0.5533  0.8810  0.5885  0.6797
                    Doc sim-based          0.7825  0.7257  0.7640  0.7530
                    Binary classification  0.6850  0.7605  0.7079  0.7208
Counterparts        CF-based               0.6133  0.8899  0.6657  0.7262
                    Doc sim-based          0.7156  0.8182  0.7467  0.7635
                    Binary classification  0.7784  0.8259  0.7961  0.8014
Entire Agreements   CF-based               0.6197  0.8173  0.6591  0.7049
                    Doc sim-based          0.9006  0.6623  0.7953  0.7633
                    Binary classification  0.7480  0.8158  0.7713  0.7804
Table 1: Clause type relevance prediction results.
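A minimal sketch of the document-similarity heuristic described above, assuming ct_rep vectors are precomputed for the query contract and all training contracts; the variable names are illustrative assumptions:

import torch
import torch.nn.functional as F

def doc_sim_relevant(ct_rep, train_reps, train_clause_types, t, k=5):
    # ct_rep: [H] query contract representation.
    # train_reps: [N, H] representations of training contracts.
    # train_clause_types: list of N sets of clause types per training contract.
    sims = F.cosine_similarity(ct_rep.unsqueeze(0), train_reps, dim=1)  # [N]
    top_k = torch.topk(sims, k=k).indices.tolist()
    # t is relevant if any of the k most similar contracts contains it.
    return any(t in train_clause_types[i] for i in top_k)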
Metrics. We evaluate the performance of clause
type relevance prediction using precision, recall,
accuracy and F1-score metrics, and that of the
clause content recommendation using ROUGE
(Lin, 2004) score.
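For reference, ROUGE can be computed with the rouge-score package (e.g., uv pip install rouge-score); treating this as the paper's exact evaluation setup is an assumption:

from rouge_score import rouge_scorer

# ROUGE-1, ROUGE-2 and ROUGE-L between a reference clause and a recommended clause.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "This agreement shall be governed by the laws of the state of delaware.",
    "This agreement shall be governed by and construed in accordance with delaware law.",
)
print(scores["rouge1"].fmeasure, scores["rouge2"].fmeasure, scores["rougeL"].fmeasure)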
Data. We use the LEDGAR dataset introduced
by Tuggener et al. (2020). It contains contracts
from the U.S. Securities and Exchange Commis-
sion (SEC) filings website, and includes material
contracts (Exhibit-10), such as shareholder agree-
ments, employment agreements, etc. The dataset
contains 12,608 clause types and 846,274 clauses
from around 60,000 contracts. Further details on
the dataset are provided in the appendix.
Since this dataset cannot readily be used for our work, we preprocess it to create proxy datasets
for clause type relevance prediction and clause rec-
ommendation tasks. For the former, for a target
clause type t, we consider the labels relevant and
not relevant for binary classification. For relevant
class, we obtain contracts that contain a clause cor-
responding to t, and remove this clause; given such
a contract as input in which t is not present, the
classifier is trained to predict t as relevant to be
added to the contract. For the not relevant class,
we randomly sample an equal number of contracts
that do not contain t in them. For recommenda-
tion, we use the contracts that contain t (i.e., the
relevant class contracts); the inputs consist of the
contract with the specific clause removed and t,
and the output is the clause that is removed. For
both the tasks, we partition these proxy datasets
into train (60%), validation (20%) and test (20%)
sets. These ground truth labels ({relevant, not relevant} for the first task and the clause content for the second task) that we removed are used for evaluation. The implementation details are provided in the appendix.

CLAUSE TYPE         METHOD                    R-1    R-2    R-L
Governing Laws      Sim-based (w/o cls_rep)   0.441  0.213  0.327
                    Sim-based (with cls_rep)  0.499  0.280  0.399
                    Generation-based          0.567  0.395  0.506
Severability        Sim-based (w/o cls_rep)   0.419  0.142  0.269
                    Sim-based (with cls_rep)  0.444  0.155  0.288
                    Generation-based          0.521  0.264  0.432
Notices             Sim-based (w/o cls_rep)   0.341  0.085  0.207
                    Sim-based (with cls_rep)  0.430  0.144  0.309
                    Generation-based          0.514  0.271  0.422
Counterparts        Sim-based (w/o cls_rep)   0.466  0.214  0.406
                    Sim-based (with cls_rep)  0.530  0.279  0.474
                    Generation-based          0.666  0.495  0.667
Entire Agreements   Sim-based (w/o cls_rep)   0.433  0.183  0.306
                    Sim-based (with cls_rep)  0.474  0.201  0.331
                    Generation-based          0.535  0.312  0.485
Table 2: Clause content recommendation results.
4 Results and Discussion
Table 1 summarizes the results of the three methods
(CF-based, document similarity-based and binary
classification) for the clause type relevance predic-
tion task. For both tasks, we report results for the thresholds, k, and learning rates that gave the best results on the validation set (the ablation results are reported in the appendix).
The CF-based method gives the best recall val-
ues for all the clause types, while the precision,
accuracy and F1 scores are worse compared to
the other two methods. This method does not in-
corporate any contextual information of the con-
tract clause content and relies only on the presence
or absence of clause types to predict if a target
type is relevant, thus resulting in high recall and
low precision and F1 scores. While the results of
document similarity-based and classification meth-
ods are comparable, both have merits and demer-
its. While the document similarity-based method
is simpler and more extensible than classification
which requires training a new classifier for each
new clause type, the former requires a large collec-
tion of possible contracts to obtain decent results
(particularly the recall values), which may not be
available always. Further, the performance of docu-
ment similarity method is dependent on k. This can
be seen in the lower recall values for the document
similarity method compared to those of classifica-
tion. The storage costs associated with the contract
collection can also become a bottleneck for the doc-
ument similarity method. Also, currently there is
no way to rank the clauses in the similar contracts,
and hence its recommendations cannot be scoped,
while in classification, the probability scores can
be used to rank the clause types for relevance. On average, binary classification achieves the highest F1 scores of the three methods, while
the accuracies are comparable with the document
similarity method.
Table 2 shows the results for clause content
recommendation using similarity and generation-
based methods. For the sim-based method, we use
the clause with the highest similarity to compute
ROUGE. The scores using only ct_rep are lower
than those with trgt_cls_rep. This is expected as
trgt_cls_rep adds further information on the clause
type for which the appropriate clauses are to be
retrieved. Finally, the generation-based method re-
sults in the best scores for clause recommendation,
thus indicating the usefulness of our proposed ap-
proach for this task. Some qualitative examples
using both the methods are provided in appendix.
For clause content recommendation, we focused
primarily on relevance (in terms of ROUGE). In
general, retrieval-based frameworks, like the one
we proposed, are mostly extractive in nature, and
hence might be perceived as “safer” (or factual) to
avoid any noise and vocabulary change in clauses
that may be incorporated by generation methods,
particularly in domains like legal. However, they
can also end up retrieving clauses irrelevant to the
contract context at times, as we note from their
lower ROUGE scores, as retrieval is based on sim-
ilarity heuristics which may not always capture
relevance, while generation is trained to generate
the specific missing clause in each contract.
We also notice that generated clauses have lower
linguistic variations in them, i.e., generated clauses
belonging to one type often look alike. However,
this is expected as most clauses look very simi-
lar with only a few linguistic and content varia-
tions. We believe because clauses have this repet-
itive nature, there is a large untapped opportunity
to leverage NLP methods for legal text generation
while accounting for the nuances and factuality in
them, to build more accurate clause recommenda-
tion frameworks. We believe our work can provide
a starting point for future works to build powerful
models to capture the essence of legal text and aid
in authoring them. In the future, we aim to focus
on balancing the relevance and factuality of clauses
recommended by our system.
5 Conclusions
We addressed AI-assisted authoring of con-
tracts via clause recommendation. We proposed
CLAUSEREC pipeline to predict clause types rele-
vant to a contract and generate appropriate content
for them based on the contract content. The results
we get on comparing our approach with similarity-
based heuristics and traditional filtering-based tech-
niques are promising, indicating the viability of AI
solutions to automate tasks for the legal domain. Ef-
forts in generating long contracts are still in their
infancy and we hope our work can pave the way for
more research in this area.
References
Chandra Bhagavatula, Sergey Feldman, Russell Power,
and Waleed Ammar. 2018. Content-based citation
recommendation. In Proceedings of the 2018 Con-
ference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long Papers), pages
238–251, New Orleans, Louisiana. Association for
Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, Minnesota. Associ-
ation for Computational Linguistics.
Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Text Summariza-
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics.
G. Linden, B. Smith, and J. York. 2003. Amazon.com
recommendations: item-to-item collaborative filter-
ing. IEEE Internet Computing, 7(1):76–80.
Yang Liu and Mirella Lapata. 2019. Text summariza-
tion with pretrained encoders.
Itzik Malkiel, Oren Barkan, Avi Caciularu, Noam
Razin, Ori Katz, and Noam Koenigstein. 2020. Re-
coBERT: A catalog language model for text-based
recommendations. In Findings of the Association
for Computational Linguistics: EMNLP 2020, pages
1704–1714, Online. Association for Computational
Linguistics.
Dan Simonson, Daniel Broderick, and Jonathan Herr.
2019. The extent of repetition in contract language.
In Proceedings of the Natural Legal Language Pro-
cessing Workshop 2019, pages 21–30, Minneapolis,
Minnesota. Association for Computational Linguis-
tics.
Don Tuggener, Pius von Däniken, Thomas Peetz, and
Mark Cieliebak. 2020. LEDGAR: A large-scale
multi-label corpus for text classification of legal pro-
visions in contracts. In Proceedings of the 12th Lan-
guage Resources and Evaluation Conference, pages
1235–1241, Marseille, France. European Language
Resources Association.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need.
Tian Wang and Yuyangzi Fu. 2020. Item-based col-
laborative filtering with BERT. In Proceedings of
The 3rd Workshop on e-Commerce and NLP, pages
54–58, Seattle, WA, USA. Association for Computa-
tional Linguistics.
Appendix
A Data
Figure 4 shows some of the clause types present in
the LEDGAR dataset.
B Implementation Details
To train CONTRACTBERT, we crawl and use a larger collection of 250k contracts and train until the losses converge.
B.1 Binary Classifiers
We use a small 7-layer fully connected neural network with ReLU activation and dropout of 0.3 as the binary classifier. The input is the 768-dimensional contract representation and the output is a probability score in [0, 1]. We use a batch size of 64 and
train them for 5000 epochs. We experiment with
4 learning rates: [1e−5, 5e−6, 1e−6, 5e−7].
Adam optimizer is used with Binary Cross Entropy
Loss as criterion. The model with the highest accuracy on the validation set is stored, and results are reported on a held-out test set. The training takes
around 150 minutes for each clause type.
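A sketch of such a classifier is below; the seven linear layers, 768-dimensional input, dropout of 0.3, Adam optimizer, and BCE loss follow this appendix, while the intermediate layer widths are assumed for illustration:

import torch
import torch.nn as nn

def build_relevance_classifier(input_dim=768, dropout=0.3):
    # Seven fully connected layers with ReLU and dropout, ending in a
    # sigmoid probability; hidden widths are illustrative guesses.
    dims = [input_dim, 512, 384, 256, 128, 64, 32]
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
    layers += [nn.Linear(dims[-1], 1), nn.Sigmoid()]
    return nn.Sequential(*layers)

# Training setup described above: Adam optimizer, BCE loss, batch size 64.
model = build_relevance_classifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.BCELoss()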
For the document similarity method, we experimented with k ∈ {1, 5}, and for the CF-based method, we evaluated F-scores and accuracies for different threshold values and report the best results.
Clause Label        k-value  threshold  learning rate
Governing Law       1        0.27       5e-07
Counterparts        2        0.18       1e-06
Notices             2        0.15       5e-06
Entire Agreements   1        0.20       1e-05
Severability        3        0.13       1e-06
Table 3: Implementation details for clause type prediction.
Table 3 summarizes the k values, thresholds, and learning rates corresponding to the best results.
B.2 Transformer Decoder
The clause text is preprocessed by removing punctuation, single-letter words, and multiple spaces, and is then tokenized using NLTK's word tokenizer1. We keep the maximum generation
length to be 400 including <SOS> and <EOS> to-
kens. All the clauses with more than 398 tokens
are discarded. The vocabulary is 7,185 tokens, which is the output dimension. We use 3 decoder
layers. The hidden dimension is 768, i.e., the length of the input embedding. A dropout of 0.1 is used. A constant learning rate of 1e−05 is used with a batch size of 16, and training runs for 300 epochs. A validation split of 0.2 is used. The results are reported on a held-out test set.
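A minimal decoder sketch with these hyperparameters (7,185-token vocabulary, 3 layers, hidden size 768, dropout 0.1); the number of attention heads and the single-vector memory are assumptions about details not specified here:

import torch
import torch.nn as nn

class ClauseDecoder(nn.Module):
    def __init__(self, vocab_size=7185, hidden=768, num_layers=3, heads=8, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=heads,
                                           dropout=dropout, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tgt_ids, ct_rep, trgt_cls_rep):
        # ct_rep, trgt_cls_rep: [B, H]. Their average serves as a single-step
        # encoder "memory" for the decoder.
        memory = ((ct_rep + trgt_cls_rep) / 2).unsqueeze(1)          # [B, 1, H]
        tgt = self.embed(tgt_ids)                                    # [B, T, H]
        T = tgt_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)                                      # [B, T, vocab]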
C Qualitative Results
Table 4 shows the qualitative results for a few
clause types comparing the similarity-based re-
trieval with generation-based methods. The
ROUGE-1 F-scores are mentioned in the brack-
ets to compare the results quantitatively as well.
1https://www.nltk.org/
Governing Laws
Original: This agreement and the obligations of the parties here under shall be governed by and construed and enforced in
accordance with the substantive and procedural laws of the state of delaware without regard to rules on choice of law.
Sim-based: This agreement shall be governed by and construed in accordance with the laws of the state of illinois without giving
effect to the principles of conflicts of law rules the parties unconditionally and irrevocably consent to the exclusive
jurisdiction of the courts located in the state of illinois and waive any objection with respect thereto for the purpose of
any action suit or proceeding arising out of or relating to this agreement or the transactions contemplated hereby. (R1:
0.456)
Generated: This agreement shall be governed by and construed in accordance with the laws of the state of delaware without regard
to the conflicts of law principles thereof. (R1: 0.718)
Notices
Original: Any notices required or permitted to be given under this agreement shall be sufficient if in writing and if personally
delivered or when sent by first class certified or registered mail postage prepaid return receipt requested in the case
of the executive to his residence address as set forth in the books and records of the company and in the case of the
company to the address of its principal place of business to such person or at such other address with respect to each
party as such party shall notify the other in writing.
Sim-based: Any notice required or permitted by this agreement shall be in writing and shall be delivered as follows with notice
deemed given as indicated by personal delivery when delivered personally ii by overnight courier upon written verifica-
tion of receipt iii by telecopy or facsimile transmission upon acknowledgment of receipt of electronic transmission or
iv by certified or registered mail return receipt requested upon verification of receipt notice shall be sent to executive at
the address listed on the company personnel records and to the company at its principal place of business or such other
address as either party may specify in writing. (R1: 0.588)
Generated: Any notice required or permitted to be given under this agreement shall be sufficient if in writing and if sent by
registered or certified mail return receipt requested to executive at the last address the executive has filed in writing
with the company or in the case of the company to its main offices attention of the board. (R1: 0.640)
Severability
Original: If any term provision covenant or restriction of this agreement is held by court of competent jurisdiction to be invalid
illegal void or unenforceable the remainder of the terms provisions covenants and restrictions set forth herein shall
remain in full force and effect and shall in no way be affected impaired or invalidated and the parties hereto shall use
their commercially reasonable efforts to find and employ an alternative means to achieve the same or substantially
the same result as that contemplated by such term provision covenant or restriction it is hereby stipulated and
declared to be the intention of the parties that they would have executed the remaining terms provisions covenants and
restrictions without including any of such that may be hereafter declared invalid illegal void or unenforceable.
Sim-based: Each party hereto agrees that should any court or other competent authority hold any provision of this agreement or part
of this agreement to be null void or unenforceablen the or order any party to take any action inconsistent herewith or
not to take an action consistent with the terms of or required by this agreement the validity legality and or enforceability
of the remaining provisions and obligations contained or set forth in this agreement shall not in any way be affected or
impaired unless the foregoing inconsistent action or the failure to take an action constitutes material breach of this
agreement or makes this agreement impossible to perform in which case this agreement shall terminate. (R1: 0.396)
Generated: If any provision of this agreement is held to be invalid or unenforceable under present or future laws effective during
the term hereof the term hereof the term or provisiagenton shall be fully severable and this agreement shall be construed
and enforced as if such illegal invalid or unenforceable term or provision had never comprised part hereohim hf and
the remaining terms hereof shall remain in full force and effect and shall not be affected by the illegal invalid or
unenforceable term or provision or by the parties hereto shall negotiate in good faith substitute valid and enforceable
provision that most nearly reflects the original intent of the parties and all other provisions hereof shall remain in full
force and effect and shall be enforceable. (R1: 0.504)
Counterparts
Original: This agreement may be executed in one or more counterparts each of which will constitute an original but all of which
when taken together will constitute but one contract and will become effective as provided in section hereof delivery of
an executed counterpart to this agreement by facsimile or other electronic transmission will be as effective as delivery
of manually signed original.
Sim-based: This agreement may be executed by the parties hereto in separate counterparts each of which when so executed and
delivered shall be an original but all such counterparts shall together constitute one and the same agreement and all
signatures need not appear on any one counterpart executed copies of the signature pages of this agreement sent by
facsimile or transmitted electronically in portable document format or any similar format shall be treated as originals
fully binding and with full legal force and effect and the parties waive any rights they may have to object to such
treatment. (R1: 0.427)
Generated: This agreement may be executed in any number of counterparts and by different parties hereto in separate counterparts
each of which when so executed shall be deemed to be an original and all of which taken together shall constitute
one and the same agreement delivery of an executed counterpart of signature page to this agreement by facsimile or
other electronic imaging means shall be effective as delivery of manually executed counterpart of this agreement. (R1:
0.603)
Entire Agreements
Original: This agreement constitutes the entire agreement of the signing parties with respect to the subject matter hereof and
supersedes all other understandings oral or written with respect to the subject matter hereof there are no oral or implied
obligations of the control agent or the other lenders to any third party in connection with this agreement.
Sim-based: This agreement consisting of sections through with schedules and the technology license agreement which is expressly
incorporated by reference herein constitutes the entire understanding between the parties concerning the subject matter
hereof and supersedes all prior discussions agreements and representations whether oral or written this agreement may
be amended altered or modified only by an instrument in writing duly executed by the authorized representations of
both parties. (R1: 0.435)
Generated: This agreement and the other transaction documents constitute the entire agreement among the parties hereto with
respect to the subject matter hereof and thereof and supersede all other prior agreements and understandings both
written and oral among the parties or any of them with respect to the subject matter hereof. (R1: 0.626)
Table 4: Qualitative comparison of retrieved and generated clauses
</paper>
<plan_files>
<first_round_modifications>
Revised Project Structure and Implementation Details:
1. src/
__init__.py
No modification needed.
2. src/data/
__init__.py
No modification needed.
dataset.py
LedgarDataset Class:
__init__(self, data_dir: str, tokenizer_name: str = "bert-base-uncased"):
Purpose: Initializes the dataset object.
Implementation:
self.data_dir = data_dir (e.g., "data/ledgar/"). Assumes a structured directory or a path to a JSON file.
self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
self.preprocessor = TextPreprocessor() Initialize the TextPreprocessor object.
load_contracts(self, file_pattern: str = "*.json") -> List[Dict]:
Purpose: Loads contracts from JSON files.
Implementation:
import json
import glob
import os
contracts = []
for filepath in glob.glob(os.path.join(self.data_dir, file_pattern)):
with open(filepath, 'r') as f:
contract_data = json.load(f)
contracts.append(contract_data)
return contracts
Assumes JSON structure:
{
"id": "contract_id",
"text": "full contract text",
"clauses": [
{"text": "clause 1 text", "label": "clause_1_label"},
...
]
}
preprocess_text(self, text: str) -> str:
Purpose: Preprocesses contract text (cleaning, normalization).
Implementation:
return self.preprocessor.preprocess(text)
extract_clauses(self, contract_data: Dict) -> List[Dict]:
Purpose: Extracts clauses and their labels from a contract.
Implementation:
return contract_data["clauses"]
create_proxy_datasets(self, clause_types: List[str], train_ratio: float = 0.6, val_ratio: float = 0.2) -> Tuple[List[Dict], List[Dict], List[Dict]]:
Purpose: Creates proxy datasets for training, validation, and testing for clause type relevance prediction.
Implementation:
from sklearn.model_selection import train_test_split
import random
train_data = []
val_data = []
test_data = []
all_contracts = self.load_contracts()
for clause_type in clause_types:
relevant_contracts = []
non_relevant_contracts = []
for contract in all_contracts:
has_clause = False
for clause in contract["clauses"]:
if clause["label"] == clause_type:
has_clause = True
# Create a new contract with the clause removed
new_contract = {
"id": contract["id"],
"text": contract["text"], # You might want to remove the clause text from here too
"clauses": [c for c in contract["clauses"] if c["label"] != clause_type],
"label": clause_type # Add the target clause type as a label
}
relevant_contracts.append(new_contract)
break
if not has_clause:
non_relevant_contracts.append(contract)
# Ensure we have the same number of relevant and non-relevant examples
random.shuffle(non_relevant_contracts)
non_relevant_contracts = non_relevant_contracts[:len(relevant_contracts)]
# Split into train, val, and test
train_rel, test_rel = train_test_split(relevant_contracts, test_size=1 - train_ratio, stratify=[c["label"] for c in relevant_contracts]) # added stratify
train_non_rel, test_non_rel = train_test_split(non_relevant_contracts, test_size=1 - train_ratio, stratify=[clause_type] * len(non_relevant_contracts)) # added stratify
val_rel, test_rel = train_test_split(test_rel, test_size=0.5, stratify=[c["label"] for c in test_rel]) # added stratify
val_non_rel, test_non_rel = train_test_split(test_non_rel, test_size=0.5, stratify=[clause_type] * len(test_non_rel)) # added stratify
train_data.extend(train_rel + train_non_rel)
val_data.extend(val_rel + val_non_rel)
test_data.extend(test_rel + test_non_rel)
return train_data, val_data, test_data
get_clause_embeddings(self, model, clauses: List[str]) -> torch.Tensor:
Purpose: Gets embeddings for a list of clauses using the ContractBERT model.
Implementation:
encoded = model.tokenizer(clauses, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
outputs = model(**encoded)
return outputs["hidden_states"][-1][:, 0, :]  # hidden_states is a tuple of layers; take the last layer's [CLS] token embedding
tokenize_text(self, text: str) -> Dict:
Purpose: Tokenizes text using the specified tokenizer.
Implementation:
return self.tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")
validate_data(self, data: Dict) -> bool:
Purpose: Validates the format of the contract data.
Implementation:
if not all(key in data for key in ["id", "text", "clauses"]):
return False
if not isinstance(data["clauses"], list):
return False
for clause in data["clauses"]:
if not all(key in clause for key in ["text", "label"]):
return False
return True
create_contract_clause_matrix(self) -> np.ndarray:
Purpose: Creates a matrix indicating the presence/absence of clause types in each contract.
Implementation:
all_contracts = self.load_contracts()
all_clause_types = set()
for contract in all_contracts:
for clause in contract["clauses"]:
all_clause_types.add(clause["label"])
all_clause_types = sorted(list(all_clause_types))
num_contracts = len(all_contracts)
num_clause_types = len(all_clause_types)
matrix = np.zeros((num_contracts, num_clause_types))
for i, contract in enumerate(all_contracts):
for clause in contract["clauses"]:
j = all_clause_types.index(clause["label"])
matrix[i, j] = 1
return matrix
get_contract_representation(self, contract: Dict, model) -> torch.Tensor:
Purpose: Gets the vector representation (ct_rep) of a contract.
Implementation:
# Averages the [CLS] embeddings of the contract's clauses; expects the full
# contract dict so clause boundaries are available (raw text alone has none).
clause_embeddings = self.get_clause_embeddings(model, [c["text"] for c in self.extract_clauses(contract)])
return torch.mean(clause_embeddings, dim=0)  # ct_rep: average of clause embeddings
get_clause_type_representation(self, clause_type: str, model) -> torch.Tensor:
Purpose: Gets the vector representation of a clause type.
Implementation:
all_contracts = self.load_contracts()
clause_texts = []
for contract in all_contracts:
for clause in contract["clauses"]:
if clause["label"] == clause_type:
clause_texts.append(clause["text"])
if not clause_texts:
return None # Handle cases where a clause type might not be present
clause_embeddings = self.get_clause_embeddings(model, clause_texts)
return torch.mean(clause_embeddings, dim=0)
data_utils.py
split_data(data: List[Any], train_ratio: float, val_ratio: float) -> Tuple[List[Any], List[Any], List[Any]]:
Purpose: Splits data into train, validation, and test sets.
Implementation:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data, test_size=1 - train_ratio)
val_data, test_data = train_test_split(test_data, test_size=0.5)
return train_data, val_data, test_data
3. src/models/
__init__.py
Update __all__ to include: ['ContractBERT', 'ClauseClassifier', 'ClauseGenerator', 'BinaryClassifier']
contract_bert.py
ContractBERT Class:
__init__(self, model_name: str = "bert-base-uncased", num_labels: int = 2):
Purpose: Initializes the ContractBERT model.
Implementation:
super().__init__()
self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True) # Ensure hidden states are output
self.dropout = nn.Dropout(0.1)
self.num_labels = num_labels # This will likely be updated per dataset/task
self.classification_layer = nn.Linear(self.bert.config.hidden_size, num_labels) # For clause type classification
self.clause_label_prediction_layer = nn.Linear(self.bert.config.hidden_size, num_labels) # For predicting if clause follows heading
self.sentence_similarity_layer = nn.Linear(self.bert.config.hidden_size, 1) # For predicting if two sentences are from the same clause
forward(self, input_ids: torch.Tensor, attention_mask: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, clause_heading_labels: Optional[torch.Tensor] = None, sentence_pair_labels: Optional[torch.Tensor] = None) -> Dict[str, torch.Tensor]:
Purpose: Defines the forward pass, including the two additional tasks.
Implementation:
outputs = self.bert(
input_ids=input_ids,
attention_mask=attention_mask
)
sequence_output = outputs.last_hidden_state # [batch_size, seq_len, hidden_size]
pooled_output = outputs.pooler_output # [batch_size, hidden_size]
hidden_states = outputs.hidden_states
# Clause type classification
pooled_output = self.dropout(pooled_output)
classification_logits = self.classification_layer(pooled_output)
# Clause label prediction (assuming you pass [CLS] token output for this)
cls_output = sequence_output[:, 0, :] # Get the [CLS] token output
clause_label_logits = self.clause_label_prediction_layer(cls_output)
# Sentence similarity prediction (assuming you pass sentence pairs through the model)
# You might need to adapt how you feed sentence pairs to the model
sentence_pair_logits = self.sentence_similarity_layer(cls_output).squeeze(-1)
loss = None
if labels is not None:
loss_fct = nn.CrossEntropyLoss()
classification_loss = loss_fct(classification_logits.view(-1, self.num_labels), labels.view(-1))
loss = classification_loss
if clause_heading_labels is not None:
loss_fct = nn.CrossEntropyLoss() # Or another appropriate loss
clause_label_loss = loss_fct(clause_label_logits.view(-1, self.num_labels), clause_heading_labels.view(-1))
if loss is not None:
loss += clause_label_loss
else:
loss = clause_label_loss
if sentence_pair_labels is not None:
loss_fct = nn.BCEWithLogitsLoss()
sentence_similarity_loss = loss_fct(sentence_pair_logits, sentence_pair_labels.float())
if loss is not None:
loss += sentence_similarity_loss
else:
loss = sentence_similarity_loss
return {
"loss": loss,
"classification_logits": classification_logits,
"clause_label_logits": clause_label_logits,
"sentence_similarity_logits": sentence_pair_logits,
"hidden_states": hidden_states
}