-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathproject-9.qmd
160 lines (115 loc) · 5.67 KB
/
project-9.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
---
title: "Programming Project 9"
---
```{r}
#| echo: false
#| message: false
#| warning: false
library(tidyverse)
library(readxl)
assignments <- read_excel("assessment_schedule.xlsx") %>%
mutate(formatted_date = format(due_date, "%A, %B %d, %Y"))
```
Programming Projects are to be submitted to [gradescope](https://www.gradescope.com/courses/934148).
**Due date: `r assignments %>% filter(assessment == "Programming Project 9") %>% pull(formatted_date)` at 9pm**
This programming project is an adaptation of Ben Dicken's Infographic Programming Assignment.
# Creating information files
In this programming problem, you will be writing a program that reads in a text file (perhaps containing the contents of a book, poem, article, etc) and produces an information file based on the text. You will need to use one or more dictionaries to count words in the program. You may also use lists and / or sets.
The information file should contain the following information about the text:
* The name of the text file
* The total number of words.
* The most used small, medium, and large words in the file. For the purposes of this project, the definitions of a small, medium, and large word is:
* small: length 0-4, inclusive.
* medium: length 5-7, inclusive.
* large: length 8+, inclusive.
* The frequencies and percentages for each word length category
* The frequencies and percentages for each capitalized word category
* The frequencies and percentages for each punctuated word category (consider only `.` and `,`)
Name your Python file `infofile.py`.
# Development Strategy
## (1) Reading and processing the file
The first step is to open an input file and read in all of the lines and strip the newlines off of the ends. To get the words, all you have to do is split each line on whitespace. To accomplish this, just call `split()` on each line string, with no argument. You can then append each individual word to a python list. You do not need to strip off punctuation from the words or normalize the cases.
For example, if you input file had the following content:
```{html}
two forks.
one knife.
two glasses.
one plate.
one napkin.
his glasses.
his knife.
```
The words list should have the following content after processing the file:
```{python}
#| echo: true
#| eval: false
words = ['two', 'forks.', 'one', 'knife.', 'two', 'glasses.', \
'one', 'plate.', 'one', 'napkin.', 'his,' 'glasses.', 'his', 'knife.']
```
## (2) Counting words
Once you have a list of all of the words from the file, you can count the occurrences. You should use a dictionary for this. Continuing from the example of the last step, the dictionary should have the following contents:
```{python}
#| echo: true
#| eval: false
word_counts = {'two':2, 'one':3, 'forks.':1, 'knife.':2, \
'glasses.':2, 'plate.':1, 'napkin.':1, 'his':2}
```
## (3) Finding most occurrences
Next, you can find the small, medium, and large words that occur the most. In the example, it would be `one` for small, `knife.` for medium, and `glasses.` for large. You can do this by iterating through the key and value pairs in the `word_counts` dictionary, and keeping track of which has the highest count.
## (4) Counting Capitalized
Next, compute how many unique capitalized and non-capitalized. You can do this by getting the keys of the `word_counts` dictionary as a list, and then looping through them all.
## (5) Counting Punctuated
Compute how many unique punctuated and non-punctuated words there are. Use a similar process as you did for step (4).
## (6) Writing the information file
You should write a text file with three sections on the information you compiled. Here's an example of what the file should look like for the contents of the file in section `(1)`.
Your output file name should be the original file name plus `"_results"` -- for example, if the file name is `"foo.txt"` the output file you create should be names `"foo_results.txt"`
```{html}
Total number of unique words: 8
Most frequent words by size:
* Small: one
* Medium: knife.
* Large: glasses.
Length:
* Small: 3 (37.5%)
* Medium: 4 (50.0%)
* Large: 1 (12.5%)
###################%%%%%%%%%%%%%%%%%%%%%%%%%******
Capitalization:
* Capitalized: 0 (0.0%)
* Non-capitalized: 8 (100.0%)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Punctuation:
* Punctuated: 5 (62.5%)
* Non-punctuated: 3 (37.5%)
###############################%%%%%%%%%%%%%%%%%%%
```
To calculate how many characters in each bar use the formula:
```{python}
#| eval: false
#| echo: true
(category_item_count/total_item_count) * 50
```
Note for the percentages please keep two decimal places. For example, `3/8` in percentage should be `37.5%` and `3/9` should be `33.33%`.
# Resource Files
Below are some resource files you might find interesting to run your program with.
* [poem.txt](data/poem.txt)
* [Cat_in_the_Hat.txt](data/Cat_in_the_Hat.txt)
If you'd like to try the program on other books, you can visit [gutenberg.org](https://gutenberg.org/).
# Test case
```{python}
#| eval: false
#| echo: true
if __name__ == '__main__':
filename = "test.txt"
word_list = text_to_list(filename)
assert word_list == ['two', 'forks.', 'one', 'knife.', 'two', 'glasses.', \
'one', 'plate.', 'one', 'napkin.', 'his', 'glasses.', 'his', 'knife.']
word_counts = count_words(word_list)
assert word_counts == {'two':2, 'one':3, 'forks.':1, 'knife.':2, \
'glasses.':2, 'plate.':1, 'napkin.':1, 'his':2}
assert most_frequent(word_counts) == ('one', 'knife.', 'glasses.')
assert count_capitalized(word_counts) == (0, 8)
assert count_punctuated(word_counts) == (5, 3)
assert count_sizes(word_counts) == (3, 4, 1)
write_results(word_counts, filename)
```