-
Notifications
You must be signed in to change notification settings - Fork 0
/
run_log.txt
214 lines (171 loc) · 9.86 KB
/
run_log.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
## Run 1
N=8
### Top terms per cluster:
Cluster 0: github hyrax rb com https app blob samvera master error
Cluster 1: collection collections works work add page metadata items able need
Cluster 2: use add like https need data image work images test
Cluster 3: ingest princeton edu figgy catalog https pudl mets collection jpstroop
Cluster 4: search results advanced facet text page terms date like searching
Cluster 5: file files preservation work metadata fileset upload set csv acceptance
Cluster 6: metadata fields work field form related edit type behavior display
Cluster 7: page user users admin dashboard link add publisher content work
### Notes
We can see princeton tickets have their own cluster (3). Clearly we need to
augment the stopwords list or clean the data beforehand to remove institution-specific terms.
## Run 2
N=16
### Top terms per cluster:
Cluster 0: scholar_uc_legacy uclibs api uc scholar moved descriptive summary large student
Cluster 1: add use need like data field create new date resource
Cluster 2: file files work text thumbnail set version csv zip extracted
Cluster 3: hyrax files github com https blob rb master samvera app
Cluster 4: ingest pudl jpstroop mets joycebcat bulk ingested migrate ping collection
Cluster 5: work behavior works related descriptive expected summary edit form user
Cluster 6: error rb app project_root backtrace line honeybadger errors lib io
Cluster 7: embargo visibility embargoed prod date hydrus embargoes notification cron private
Cluster 8: page download link add view user button pages items resource
Cluster 9: preservation criteria metadata context acceptance given curate file notes occur
Cluster 10: search results facet advanced text page terms date searching like
Cluster 11: user scholarsphere orcid users dashboard admin profile ldap allow create
Cluster 12: https com github google analytics png githubusercontent scholar docs user
Cluster 13: edu princeton figgy catalog https concern psu http scholarsphere maps
Cluster 14: metadata fields field work uris controlled data use need form
Cluster 15: collection collections works work page add metadata new need able
### Notes
More institutions are represented here.
Individuals begin to show up as well. Need to clean up anything starting with
`@`. Cleaning these may also get rid of some code, where instance variables are
referenced.
We created a cleaner.
### Run 3
Top terms per cluster:
Cluster 0: use add need data like hyrax see work new user
Cluster 1: page link user pages download admin publisher view viewer content
Cluster 2: file files work fileset preservation metadata upload set thumbnail csv
Cluster 3: ingest mets ingested bulk collection notes metadata migrate ping files
Cluster 4: search results facet advanced text page terms date collection searching
Cluster 5: metadata fields field work form related behavior edit use add
Cluster 6: rb error app project_root backtrace line lib 500 honeybadger gems
Cluster 7: collection collections works work add metadata user like need able
### Notes
Need to do stemming.
## Run 4
Top terms per cluster:
Cluster 0: test : run work testing feature # hyrax collection coverage
Cluster 1: : ( # ) ? work user field need use
Cluster 2: collection work metadata # ( ) user page add field
Cluster 3: ` : [ ] > ' ( ) `` -
Cluster 4: file work : upload set # ( bag ) version
Cluster 5: page search result link user show : ( ) add
Cluster 6: preservation metadata ) ( event file criterion context workflow object
Cluster 7: '' `` : ( ) ? ingest field > n't
### Notes
Using the NLTK WordNetLemmatizer. Clearly need to get rid of punctuation
## Run 5
Uses a porter stemmer
Top terms per cluster:
Cluster 0: search result page advanc text term user collect thi item
Cluster 1: thi use user need work test add like imag updat
Cluster 2: ingest note migrat collect ping record metadata thi anywher bulk
Cluster 3: date facet sort field rang year thi edtf index modifi
Cluster 4: file upload work thi metadata preserv set version fileset user
Cluster 5: error backtrac line thi info messag fail work test file
Cluster 6: collect work metadata thi user add page need creat like
Cluster 7: page field work thi metadata download link display type item
### Notes:
We put punctuation stripping and stemming into a tokenizer method
It's sort of hard to read some of the stemmed words; especially what is `thi`,
which appears in many clusters?
Try adding the stemmed stopwords back to the stopwords list
## Run 6
Top terms per cluster:
Cluster 0: file upload work version set delet user use csv need
Cluster 1: error page backtrac user line link work messag fail info
Cluster 2: ingest metadata preserv note record file map migrat event collect
Cluster 3: field metadata display valu requir use work data form relat
Cluster 4: search result page advanc user text term collect work item
Cluster 5: use user imag need add download like resourc updat link
Cluster 6: collect work page user add metadata need item object creat
Cluster 7: work test date embargo uri edit need item creat object
## Notes:
We added stemmed stopwords; looks much better.
We'll also try a different lemmatizer:
- this run with PorterStemmer
- try with WordNetLemmatizer, which doesn't change any word that's not in its
dictionary
## Run 7
Top terms per cluster:
Cluster 0: search result page advanced term text user facet like field
Cluster 1: field metadata work form value type date vocabulary controlled need
Cluster 2: collection work page user add item need object metadata able
Cluster 3: test job run fixity rake task server work check testing
Cluster 4: error backtrace line info message file 500 honeybadgerio page view
Cluster 5: user page work add need link use item like object
Cluster 6: image ingest iiif viewer thumbnail ping download anywhere tiff migrate
Cluster 7: file work upload set preservation version user thumbnail metadata fileset
### Notes:
We are happy!
## Run 8
Homogeneity: 1.000
Completeness: 0.000
V-measure: 0.000
Adjusted Rand-Index: 0.000
Silhouette Coefficient: 0.011
Top terms per cluster:
Cluster 0: field value controlled vocabulary form term label uris metadata add
Cluster 1: embargo embargoed visibility expired work expiring notification object date prod
Cluster 2: error backtrace line info 500 page file view honeybadgerio message
Cluster 3: image thumbnail iiif viewer download page manifest file tiff object
Cluster 4: resource type work page new child add object map wireframe
Cluster 5: role user press registered collection admin page staff publisher restricted
Cluster 6: metadata preservation work field file object event workflow record collection
Cluster 7: ingest note migrate ping collection ingested aid finding record anywhere
Cluster 8: file work set upload csv user need preservation bag large
Cluster 9: work button page behavior related child show descriptive user summary
Cluster 10: collection work page user add item like object need metadata
Cluster 11: version file batch upload upgrade update one item need hyrax
Cluster 12: user page add use need link like content see change
Cluster 13: search result page advanced text term field facet searching user
Cluster 14: date range facet year field edtf format sort created need
Cluster 15: test run testing feature work accessibility coverage add travis error
## Notes
About half of these look like recognizably coherent.
## Run 9
Homogeneity: 1.000
Completeness: 0.000
V-measure: 0.000
Adjusted Rand-Index: 0.000
Silhouette Coefficient: 0.011
Top terms per cluster:
Cluster 0: vocabulary controlled term uris use local authority field metadata ephemera
Cluster 1: readonly script mode arabic letter cap voyager working language pdfs
Cluster 2: facet number result display collection list view sorting sort item
Cluster 3: work collection need file export bag user feature set object
Cluster 4: use version update data gem need handle hyrax upgrade server
Cluster 5: embargo visibility expired expiring work object notification embargoed date item
Cluster 6: doi work field batch link citation metadata version publisher record
Cluster 7: link text form change summary please oasis page download contact
Cluster 8: resource record title fedora 1 agent get name child data
Cluster 9: map ephemera scanned resource note voyager ark convert case collection
Cluster 10: page add content link edit item show home work user
Cluster 11: thumbnail file set image representative work fileset resource display default
Cluster 12: metadata work type field need schema new export collection data
Cluster 13: csv row warning value import file header number type instantiation
Cluster 14: image iiif viewer download tiff manifest file universal riiif work
Cluster 15: file work upload version bag set user need directory size
Cluster 16: error message 500 file log 404 work problem user back
Cluster 17: preservation event metadata file context workflow object curate mapping occur
Cluster 18: google analytics drive service statistic download file downloads browseeverything view
Cluster 19: email job fixity notification check run send contact cron address
Cluster 20: publisher delete admin platform entry asset page typography variable book
Cluster 21: object backtrace line view honeybadgerio info full project child type
Cluster 22: label uri uris location value geonames field language add form
Cluster 23: field metadata value display required collection add form project work
Cluster 24: ingest migrate ping note anywhere ingested collection bulk aid displayed
Cluster 25: collection work page add user new metadata object group create
Cluster 26: add url set related button user work citation behavior want
Cluster 27: search result text advanced term page searching user field browse
Cluster 28: date range edtf facet year format field sort yyyymmdd need
Cluster 29: test run testing feature coverage travis collection accessibility mvp work
Cluster 30: user dashboard group admin log role activity like collection logged
Cluster 31: batch edit upload work item approval file editing need uploads