WIP Add reports generation as ActiveJob #3620

Open
wants to merge 3 commits into
base: main
1 change: 1 addition & 0 deletions Gemfile
@@ -38,6 +38,7 @@ gem 'bcrypt', '>= 3.1.13'
gem 'omniauth', '~> 2.1'
gem 'omniauth-rails_csrf_protection', '~> 1.0'
gem 'omniauth-saml', '~> 2.1'
gem 'omniauth_openid_connect', '~> 0.8'
Contributor Author

This should not have been added; I will remove the Gemfile changes in a later push.


# Authorization
gem 'pundit', '2.3.2'
50 changes: 50 additions & 0 deletions Gemfile.lock
@@ -126,7 +126,9 @@ GEM
tzinfo (~> 2.0)
addressable (2.8.7)
public_suffix (>= 2.0.2, < 7.0)
aes_key_wrap (1.1.0)
ast (2.4.2)
attr_required (1.0.2)
base64 (0.2.0)
bcp47_spec (0.2.1)
bcrypt (3.1.20)
@@ -142,6 +144,7 @@ GEM
parser (>= 2.4)
smart_properties
bigdecimal (3.1.8)
bindata (2.5.0)
bindex (0.8.1)
binding_of_caller (1.0.1)
debug_inspector (>= 1.2.0)
@@ -214,6 +217,8 @@ GEM
unicode-types (~> 1.8)
edtf (3.1.1)
activesupport (>= 3.0, < 8.0)
email_validator (2.2.4)
activemodel
erb_lint (0.5.0)
activesupport
better_html (>= 2.0.1)
@@ -228,6 +233,8 @@ GEM
i18n (>= 1.8.11, < 2)
faraday (2.9.0)
faraday-net_http (>= 2.0, < 3.2)
faraday-follow_redirects (0.3.0)
faraday (>= 1, < 3)
faraday-http-cache (2.5.1)
faraday (>= 0.8)
faraday-net_http (3.1.0)
@@ -300,6 +307,13 @@ GEM
jsbundling-rails (1.3.0)
railties (>= 6.0.0)
json (2.7.2)
json-jwt (1.16.6)
activesupport (>= 4.2)
aes_key_wrap
base64
bindata
faraday (~> 2.0)
faraday-follow_redirects
json-schema (4.3.1)
addressable (>= 2.8)
jwt (2.7.1)
@@ -378,7 +392,23 @@ GEM
omniauth-saml (2.1.0)
omniauth (~> 2.0)
ruby-saml (~> 1.12)
omniauth_openid_connect (0.8.0)
omniauth (>= 1.9, < 3)
openid_connect (~> 2.2)
open4 (1.3.4)
openid_connect (2.3.0)
activemodel
attr_required (>= 1.0.0)
email_validator
faraday (~> 2.0)
faraday-follow_redirects
json-jwt (>= 1.16)
mail
rack-oauth2 (~> 2.2)
swd (~> 2.0)
tzinfo
validate_url
webfinger (~> 2.0)
os (1.1.4)
paper_trail (15.1.0)
activerecord (>= 6.1)
@@ -398,6 +428,13 @@ GEM
raabro (1.4.0)
racc (1.8.0)
rack (2.2.9)
rack-oauth2 (2.2.1)
activesupport
attr_required
faraday (~> 2.0)
faraday-follow_redirects
json-jwt (>= 1.11.0)
rack (>= 2.1.0)
rack-protection (3.2.0)
base64 (>= 0.1.0)
rack (~> 2.2, >= 2.2.4)
@@ -588,6 +625,11 @@ GEM
strong_migrations (2.0.0)
activerecord (>= 6.1)
strscan (3.1.0)
swd (2.0.3)
activesupport (>= 3)
attr_required (>= 0.0.5)
faraday (~> 2.0)
faraday-follow_redirects
sxp (1.3.0)
matrix (~> 0.4)
rdf (~> 3.3)
@@ -603,6 +645,9 @@ GEM
unicode-types (1.8.0)
uri (0.13.0)
uuidtools (2.2.0)
validate_url (1.0.15)
activemodel (>= 3.0.0)
public_suffix
vcr (5.0.0)
voight_kampff (2.0.0)
rack (>= 1.4)
@@ -611,6 +656,10 @@ GEM
activemodel (>= 6.0.0)
bindex (>= 0.4.0)
railties (>= 6.0.0)
webfinger (2.1.3)
activesupport
faraday (~> 2.0)
faraday-follow_redirects
webmock (3.23.1)
addressable (>= 2.8.0)
crack (>= 0.3.2)
@@ -669,6 +718,7 @@ DEPENDENCIES
omniauth (~> 2.1)
omniauth-rails_csrf_protection (~> 1.0)
omniauth-saml (~> 2.1)
omniauth_openid_connect (~> 0.8)
paper_trail (~> 15.1.0)
pg (~> 1.5.6)
puma (~> 6.4)
144 changes: 144 additions & 0 deletions app/jobs/generate_reports_job.rb
@@ -0,0 +1,144 @@
class GenerateReportsJob < ApplicationJob

  queue_as :default

  def perform(*args)
    # Do something later
    @root_directory = './era_audit/'
    @time_of_start = Time.now.utc.strftime('%Y%m%d%H%M%S')

    generate_reports
  end

  private

  # Helper methods to get URLs.

  def get_entity_url(entity)
    # URL example: https://era.library.ualberta.ca/items/864711f5-3021-455d-9483-9ce956ee4e78
    Rails.application.routes.url_helpers.item_url(entity)
  end

  def get_community_url(community)
    # URL example: https://era.library.ualberta.ca/communities/d1640714-da95-4963-9242-68065fece5f4
    Rails.application.routes.url_helpers.community_url(community)
  end

  def get_collection_url(community, collection)
    # URL example: https://era.library.ualberta.ca/communities/34de6895-e488-440b-b05c-75efe26c4971/collections/67e0ecb3-05b7-4c9a-bf82-31611e2dc0ce
    Rails.application.routes.url_helpers.community_collection_url(community, collection)
  end

  def generate_reports
    report_metadata_only_records
    report_file_types
    report_records_with_compressed_files
    report_multifile_records
  end

  # Report 1: Metadata only records

  def report_metadata_only_records
    [Item, Thesis].each do |klass|
Contributor Author

We see this iteration for most reports, which is not ideal.

Member

I agree this repetition feels smelly/inefficient... but I'm not sure we should do anything about it at this point.

I was thinking about pulling out my Design Patterns book because maybe there is a pattern (Strategy, Composite?) that makes sense for this problem.

I was also thinking about seeing whether reek had feedback that could improve or DRY this code. My second thought was that maybe we don't want to optimize for computational efficiency here and should prioritize human readability instead. Here's what reek had to say if you're interested.

$ reek app/jobs/generate_reports_job.rb 
Inspecting 1 file(s):
S

app/jobs/generate_reports_job.rb -- 28 warnings:
  [46, 46]:DuplicateMethodCall: GenerateReportsJob#report_metadata_only_records calls 'klass.rdf_annotation_for_attr(key)' 2 times [https://github.com/troessner/reek/blob/v6.1.4/docs/Duplicate-Method-Call.md]
  [132, 134]:DuplicateMethodCall: GenerateReportsJob#report_multifile_records calls 'entity.files' 2 times [https://github.com/troessner/reek/blob/v6.1.4/docs/Duplicate-Method-Call.md]
  [126, 126]:DuplicateMethodCall: GenerateReportsJob#report_multifile_records calls 'klass.rdf_annotation_for_attr(key)' 2 times [https://github.com/troessner/reek/blob/v6.1.4/docs/Duplicate-Method-Call.md]
  [97, 97]:DuplicateMethodCall: GenerateReportsJob#report_records_with_compressed_files calls 'klass.rdf_annotation_for_attr(key)' 2 times [https://github.com/troessner/reek/blob/v6.1.4/docs/Duplicate-Method-Call.md]
  [68, 68, 69, 75]:FeatureEnvy: GenerateReportsJob#report_file_types refers to 'entity_file_types' more than self (maybe move it to another class?) [https://github.com/troessner/reek/blob/v6.1.4/docs/Feature-Envy.md]
  [43, 44, 46, 46, 50]:FeatureEnvy: GenerateReportsJob#report_metadata_only_records refers to 'klass' more than self (maybe move it to another class?) [https://github.com/troessner/reek/blob/v6.1.4/docs/Feature-Envy.md]
  [123, 124, 126, 126, 131]:FeatureEnvy: GenerateReportsJob#report_multifile_records refers to 'klass' more than self (maybe move it to another class?) [https://github.com/troessner/reek/blob/v6.1.4/docs/Feature-Envy.md]
  [94, 95, 97, 97, 103]:FeatureEnvy: GenerateReportsJob#report_records_with_compressed_files refers to 'klass' more than self (maybe move it to another class?) [https://github.com/troessner/reek/blob/v6.1.4/docs/Feature-Envy.md]
  [1]:InstanceVariableAssumption: GenerateReportsJob assumes too much for instance variable '@root_directory' [https://github.com/troessner/reek/blob/v6.1.4/docs/Instance-Variable-Assumption.md]
  [1]:InstanceVariableAssumption: GenerateReportsJob assumes too much for instance variable '@time_of_start' [https://github.com/troessner/reek/blob/v6.1.4/docs/Instance-Variable-Assumption.md]
  [1]:IrresponsibleModule: GenerateReportsJob has no descriptive comment [https://github.com/troessner/reek/blob/v6.1.4/docs/Irresponsible-Module.md]
  [75]:NestedIterators: GenerateReportsJob#report_file_types contains iterators nested 2 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [66]:NestedIterators: GenerateReportsJob#report_file_types contains iterators nested 3 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [45]:NestedIterators: GenerateReportsJob#report_metadata_only_records contains iterators nested 2 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [50]:NestedIterators: GenerateReportsJob#report_metadata_only_records contains iterators nested 3 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [125]:NestedIterators: GenerateReportsJob#report_multifile_records contains iterators nested 2 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [134]:NestedIterators: GenerateReportsJob#report_multifile_records contains iterators nested 4 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [96]:NestedIterators: GenerateReportsJob#report_records_with_compressed_files contains iterators nested 2 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [106]:NestedIterators: GenerateReportsJob#report_records_with_compressed_files contains iterators nested 4 deep [https://github.com/troessner/reek/blob/v6.1.4/docs/Nested-Iterators.md]
  [46, 97, 126]:RepeatedConditional: GenerateReportsJob tests 'klass.rdf_annotation_for_attr(key).present?' at least 3 times [https://github.com/troessner/reek/blob/v6.1.4/docs/Repeated-Conditional.md]
  [59]:TooManyStatements: GenerateReportsJob#report_file_types has approx 11 statements [https://github.com/troessner/reek/blob/v6.1.4/docs/Too-Many-Statements.md]
  [41]:TooManyStatements: GenerateReportsJob#report_metadata_only_records has approx 10 statements [https://github.com/troessner/reek/blob/v6.1.4/docs/Too-Many-Statements.md]
  [121]:TooManyStatements: GenerateReportsJob#report_multifile_records has approx 13 statements [https://github.com/troessner/reek/blob/v6.1.4/docs/Too-Many-Statements.md]
  [82]:TooManyStatements: GenerateReportsJob#report_records_with_compressed_files has approx 15 statements [https://github.com/troessner/reek/blob/v6.1.4/docs/Too-Many-Statements.md]
  [5]:UnusedParameters: GenerateReportsJob#perform has unused parameter 'args' [https://github.com/troessner/reek/blob/v6.1.4/docs/Unused-Parameters.md]
  [27]:UtilityFunction: GenerateReportsJob#get_collection_url doesn't depend on instance state (maybe move it to another class?) [https://github.com/troessner/reek/blob/v6.1.4/docs/Utility-Function.md]
  [22]:UtilityFunction: GenerateReportsJob#get_community_url doesn't depend on instance state (maybe move it to another class?) [https://github.com/troessner/reek/blob/v6.1.4/docs/Utility-Function.md]
  [17]:UtilityFunction: GenerateReportsJob#get_entity_url doesn't depend on instance state (maybe move it to another class?) [https://github.com/troessner/reek/blob/v6.1.4/docs/Utility-Function.md]
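
One possible shape for removing the repeated `[Item, Thesis].each` scaffolding, offered as a rough sketch rather than a proposal for this PR: a single helper owns the per-class loop and the CSV setup, and each report passes a block that decides whether a record is included and which extra columns it contributes. `FakeRecord`, `REPORT_ATTRIBUTES`, and `write_entity_report` are hypothetical names chosen so the example runs outside Rails; the real job would presumably iterate with `klass.find_each` and build headers from `rdf_annotation_for_attr`.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical stand-in for an ActiveRecord model such as Item or Thesis.
FakeRecord = Struct.new(:id, :title, :files)

# Hypothetical fixed attribute list; the job derives this from the model.
REPORT_ATTRIBUTES = %i[id title].freeze

# Writes one CSV per entity type. The block receives each record and returns
# an array of extra column values to include it, or nil to skip it.
def write_entity_report(records_by_type, suffix:, directory:, extra_headers: [])
  headers = REPORT_ATTRIBUTES.map(&:to_s) + extra_headers
  records_by_type.map do |entity_type, records|
    file_name = File.join(directory, "#{entity_type}_#{suffix}.csv")
    CSV.open(file_name, 'wb', write_headers: true, headers: headers) do |csv|
      records.each do |record|
        extra = yield(record)
        csv << (REPORT_ATTRIBUTES.map { |name| record[name] } + extra) if extra
      end
    end
    file_name
  end
end

records = {
  'item' => [
    FakeRecord.new(1, 'No files', []),
    FakeRecord.new(2, 'Two files', %w[a.pdf b.zip])
  ]
}

Dir.mktmpdir do |dir|
  # Report 1: metadata-only records contribute no extra columns.
  metadata_only = write_entity_report(records, suffix: 'with_metadata_only', directory: dir) do |record|
    [] if record.files.empty?
  end
  puts File.read(metadata_only.first)

  # Report 4: multi-file records add a file count column.
  multifile = write_entity_report(records, suffix: 'with_multiple_files', directory: dir,
                                           extra_headers: ['File count']) do |record|
    [record.files.size] if record.files.size > 1
  end
  puts File.read(multifile.first)
end
```

The trade-off raised above still applies: the helper removes duplication, but each report becomes slightly less readable on its own, since the reader has to follow the block into the helper.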

      entity_type = klass.name.underscore
      entity_attributes = klass.first.attributes.keys
      entity_headers = entity_attributes.map do |key|
        klass.rdf_annotation_for_attr(key).present? ? RDF::URI(klass.rdf_annotation_for_attr(key).first.predicate).pname.to_s : key
      end
      file_name = "#{@root_directory}/#{entity_type}_with_metadata_only_#{@time_of_start}.csv"
      CSV.open(file_name, 'wb', write_headers: true, headers: entity_headers + ['URL']) do |csv|
        klass.find_each do |entity|
          csv << (entity.values_at(entity_attributes) + [get_entity_url(entity)]) if entity.files.count == 0
        end
      end
    end
  end

  # Report 2: List of file types

  def report_file_types
    entity_file_types = {}

    file_name = "#{@root_directory}/entity_file_types_#{@time_of_start}.csv"

    [Item, Thesis].each do |klass|
      klass.find_each do |entity|
        entity.files.each do |file|
          content_type = file.content_type
          entity_file_types[content_type] = 0 unless entity_file_types.include?(content_type)
          entity_file_types[content_type] += 1
        end
      end
    end

    CSV.open(file_name, 'wb', write_headers: true, headers: ['File types', 'Count']) do |csv|
      entity_file_types.each do |content_type, count|
        csv << [content_type, count]
      end
    end
  end

  # Report 3: List of records containing compressed files
  def report_records_with_compressed_files
    compressed_file_types = [
      'application/zip',
      'application/x-7z-compressed',
      'application/gzip',
      'application/x-xz',
      'application/x-rar-compressed;version=5',
      'application/x-tar',
      'application/x-rar'
    ]

    [Item, Thesis].each do |klass|
      entity_type = klass.name.underscore
      entity_attributes = klass.first.attributes.keys
      entity_headers = entity_attributes.map do |key|
        klass.rdf_annotation_for_attr(key).present? ? RDF::URI(klass.rdf_annotation_for_attr(key).first.predicate).pname.to_s : key
      end

      file_name = "#{@root_directory}/#{entity_type}_with_compressed_file_#{@time_of_start}.csv"

      CSV.open(file_name, 'wb', write_headers: true, headers: entity_headers + ['URL', 'Files metadata']) do |csv|
        klass.find_each do |entity|
          file_metadata = []

          entity.files.each do |file|
            content_type = file.content_type
            file_metadata << file.blob.to_json if compressed_file_types.include?(content_type)
          end

          unless file_metadata.empty?
            csv << (entity.values_at(entity_attributes) + [get_entity_url(entity),
                                                           file_metadata])
          end
        end
      end
    end
  end

  # Report 4: List of all multi file records
  def report_multifile_records
    [Item, Thesis].each do |klass|
      entity_type = klass.name.underscore
      entity_attributes = klass.first.attributes.keys
      entity_headers = entity_attributes.map do |key|
        klass.rdf_annotation_for_attr(key).present? ? RDF::URI(klass.rdf_annotation_for_attr(key).first.predicate).pname.to_s : key
      end

      file_name = "#{@root_directory}/#{entity_type}_with_multiple_files_#{@time_of_start}.csv"
      CSV.open(file_name, 'wb', write_headers: true, headers: entity_headers + ['URL', 'Files metadata']) do |csv|
        klass.includes(files_attachments: :blob).find_each do |entity|
          if entity.files.count > 1
            files_metadata = []
            entity.files.each do |file|
              files_metadata << file.blob.to_json
            end
            csv << entity.values_at(entity_attributes) + [get_entity_url(entity), files_metadata]
          end
        end
      end
    end
  end

end
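
The counting loop in `report_file_types` initializes each key by hand before incrementing it. A small sketch of the same tally using a hash with a default value, assuming only that each entity responds to `files` and each file to `content_type`; `FakeEntity`, `FakeFile`, and `count_content_types` are hypothetical stand-ins so the snippet runs without Rails.

```ruby
# Hypothetical stand-ins for a record and its attached files.
FakeFile = Struct.new(:content_type)
FakeEntity = Struct.new(:files)

# Counts how many files exist per content type. Hash.new(0) makes missing
# keys read as 0, so no include? check is needed before incrementing.
def count_content_types(entities)
  counts = Hash.new(0)
  entities.each do |entity|
    entity.files.each { |file| counts[file.content_type] += 1 }
  end
  counts
end

entities = [
  FakeEntity.new([FakeFile.new('application/pdf'), FakeFile.new('application/zip')]),
  FakeEntity.new([FakeFile.new('application/pdf')])
]

count_content_types(entities).each do |content_type, count|
  puts "#{content_type}: #{count}"
end
```

Because it only calls `each` on what it is given, the same method would accept any enumerable of records, including one that loads them in batches.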
7 changes: 7 additions & 0 deletions test/jobs/generate_reports_job_test.rb
@@ -0,0 +1,7 @@
require "test_helper"

class GenerateReportsJobTest < ActiveJob::TestCase
  # test "the truth" do
  #   assert true
  # end
end
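
The generated test file is still an empty stub. As a rough illustration of what a first test might check, the sketch below writes a file-type CSV into a temporary directory and asserts on its contents. It assumes the CSV-writing step is factored into a method that accepts an output path; `write_file_type_report` and `FileTypeReportTest` are hypothetical and not part of this PR. Plain `Minitest::Test` is used instead of `ActiveJob::TestCase` so the example runs without the application.

```ruby
require 'csv'
require 'tmpdir'
require 'minitest/autorun'

# Hypothetical stand-in for the report-writing step; mirrors the headers
# used by report_file_types but takes the output path as an argument.
def write_file_type_report(path, counts)
  CSV.open(path, 'wb', write_headers: true, headers: ['File types', 'Count']) do |csv|
    counts.each { |content_type, count| csv << [content_type, count] }
  end
end

class FileTypeReportTest < Minitest::Test
  def test_writes_header_and_one_row_per_content_type
    Dir.mktmpdir do |dir|
      path = File.join(dir, 'entity_file_types.csv')
      write_file_type_report(path, { 'application/pdf' => 2, 'application/zip' => 1 })

      rows = CSV.read(path)
      assert_equal ['File types', 'Count'], rows.first
      assert_equal [['application/pdf', '2'], ['application/zip', '1']], rows.drop(1)
    end
  end
end
```

Writing to a temporary directory keeps the test from depending on `./era_audit/` existing, which the job currently assumes.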