New version #70

Merged
45 commits merged Jan 16, 2025
Changes from 1 commit
0b66a62
Match authors to openalex schema
alepbloyd Sep 26, 2024
8dc71b9
AuthorsIds spec
alepbloyd Sep 26, 2024
9d75972
Add institutions relationships
alepbloyd Sep 26, 2024
d9024ba
temporarily remove nodes and edges, add work relationships
alepbloyd Sep 26, 2024
f4a8e50
Add models to match openalex db
alepbloyd Sep 30, 2024
a72af23
Add db migrations to match openalex db
alepbloyd Sep 30, 2024
76f6e2e
Add model tests
alepbloyd Sep 30, 2024
142ceea
Add concepts_ancestors
alepbloyd Sep 30, 2024
dc5acec
Add load institutions, institutions_geos, institutions_ids rake tasks
alepbloyd Oct 2, 2024
014f429
change migrations to use bigint fields
alepbloyd Oct 2, 2024
5897962
update models for data import
alepbloyd Oct 3, 2024
b547584
start reworking graphql, institutions
alepbloyd Oct 4, 2024
5b08a64
start to rework model tests
alepbloyd Oct 4, 2024
83bf2d9
work on topics/works relationships
alepbloyd Oct 4, 2024
858368f
Add data_import rake tasks
alepbloyd Oct 16, 2024
a94386d
Add migrations
alepbloyd Oct 16, 2024
73dd524
Start of reworking tests
alepbloyd Oct 16, 2024
91fe7d9
update gemfile
alepbloyd Oct 16, 2024
96a37c4
Add additional gql types
alepbloyd Oct 16, 2024
bb571cc
temp remove user context for development
alepbloyd Oct 16, 2024
ad2f2f7
temp db yml and routes updates
alepbloyd Oct 16, 2024
83b4d8c
Add topics import
alepbloyd Oct 17, 2024
36d1c0c
Migrations and models for work-author-institution relationships
alepbloyd Oct 30, 2024
13e2f2c
Reworking migrations
alepbloyd Nov 11, 2024
b5d4614
Reworking loading rake tasks
alepbloyd Nov 11, 2024
23aa897
Updated migrations for full data snapshot
alepbloyd Dec 4, 2024
d07f8ba
Update data_import rake tasks
alepbloyd Dec 4, 2024
669edc9
Update models for full snapshot
alepbloyd Dec 4, 2024
33b98f5
Fix many-to-many self-referential join for works/citations/references
alepbloyd Dec 5, 2024
d021170
update author model with author_openalex_id
alepbloyd Dec 5, 2024
4dbbd5a
Remove _api from schema name
alepbloyd Dec 5, 2024
915a2e9
Start of reworking gql types
alepbloyd Dec 6, 2024
45f6a8b
update worktype
alepbloyd Dec 6, 2024
7eb78ba
new generated types, need to pare back
alepbloyd Dec 6, 2024
c8eb308
additional author and institution indexing
alepbloyd Dec 18, 2024
babe356
updated schema
alepbloyd Dec 18, 2024
a15e559
remove concepts models
alepbloyd Dec 19, 2024
8498d5c
Remove works_concept and works_mesh models
alepbloyd Dec 19, 2024
2459280
add topics
alepbloyd Dec 19, 2024
997f3cc
work_topics migration
alepbloyd Dec 19, 2024
de3d2b3
Fix all text -> bool castings resulting in false
alepbloyd Jan 2, 2025
1482675
rename to bookworm
alepbloyd Jan 2, 2025
4eeb083
Work on frontend simplification and connection with backend
alepbloyd Jan 16, 2025
0d8490f
misc other fixes for loading openalex snapshot version
alepbloyd Jan 16, 2025
6b952f2
Merge branch 'loading-openalex-snapshot' of github.com:gwu-libraries/…
alepbloyd Jan 16, 2025
Add data_import rake tasks
alepbloyd committed Oct 16, 2024
commit 858368ff20505e6611b5a52d97f8eea4ed6ae031
38 changes: 38 additions & 0 deletions rails/lib/tasks/load_authors.rake
@@ -0,0 +1,38 @@
require 'rake'
require 'zlib'
require 'csv'

namespace :data_import do
  desc 'Load Authors from gzipped csv to db'
  task load_authors: :environment do
    file_paths = Dir['/opt/bookworm/csv-files/authors/author_split*']

    file_paths.each do |file_path|
      authors = []
      Zlib::GzipReader.open(file_path) do |gzip|
        csv = CSV.new(gzip)
        csv
          .drop(1)
          .each_with_index do |row, index| # drop(1) handles the header row
            authors << {
              openalex_id: row[0].split('/').last,
              orcid: row[1],
              display_name: row[2],
              display_name_alternatives: row[3],
              works_count: row[4].to_i,
              cited_by_count: row[5].to_i,
              last_known_institution: row[6],
              works_api_url: row[7]
            }

            if authors.count >= 100
              Author.insert_all(authors)

              authors = []
            end
          end
      end
      Author.insert_all(authors)
    end
  end
end
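The loading tasks in this commit all follow the same shape: stream a gzipped CSV, buffer rows in an array, and flush with a bulk insert once a threshold is reached. A minimal standalone sketch of just that batching pattern (the `each_batch` helper and its 100-row default are illustrative, not part of this PR):

```ruby
require 'zlib'
require 'csv'

# Stream a gzipped CSV, skip the header, and yield rows in fixed-size
# batches so the caller can bulk-insert instead of writing row by row.
def each_batch(path, batch_size: 100)
  batch = []
  Zlib::GzipReader.open(path) do |gzip|
    CSV.new(gzip).each_with_index do |row, index|
      next if index.zero? # skip the header row

      batch << row
      if batch.size >= batch_size
        yield batch
        batch = []
      end
    end
  end
  yield batch unless batch.empty? # flush the final partial batch
end
```

Flushing in fixed-size batches keeps memory bounded while avoiding one INSERT per row; the trailing flush mirrors the final `insert_all` after the loop in the task above.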
42 changes: 42 additions & 0 deletions rails/lib/tasks/load_authors_counts_by_years.rake
@@ -0,0 +1,42 @@
require 'rake'
require 'zlib'
require 'csv'

namespace :data_import do
  desc 'Load AuthorsCountsByYear from gzipped csv to db'
  task load_authors_counts_by_year: :environment do
    file_paths =
      Dir[
        '/opt/bookworm/csv-files/authors_counts_by_year/authors_counts_by_year_split*'
      ]

    file_paths.each do |file_path|
      authors_counts_by_year = []
      Zlib::GzipReader.open(file_path) do |gzip|
        csv = CSV.new(gzip)

        csv.each_with_index do |row, index| # NOTE: no drop(1) here, so the header row is not skipped, unlike the other tasks
          author = Author.find_by(openalex_id: row[0].split('/').last)

          unless author.nil?
            authors_counts_by_year << {
              author_id: author.id,
              year: row[1],
              works_count: row[2],
              cited_by_count: row[3],
              oa_works_count: row[4]
            }
          end

          if authors_counts_by_year.count >= 1000
            AuthorsCountsByYear.insert_all(authors_counts_by_year)

            authors_counts_by_year = []
          end
        end
      end

      AuthorsCountsByYear.insert_all(authors_counts_by_year)
    end
  end
end
42 changes: 42 additions & 0 deletions rails/lib/tasks/load_authors_ids.rake
@@ -0,0 +1,42 @@
require 'rake'
require 'zlib'
require 'csv'

namespace :data_import do
  desc 'Load AuthorsIds from gzipped csv to db'
  task load_authors_ids: :environment do
    file_paths = Dir['/opt/bookworm/csv-files/authors_ids/authors_ids_split*']

    file_paths.each do |file_path|
      authors_ids = []
      Zlib::GzipReader.open(file_path) do |gzip|
        csv = CSV.new(gzip)
        csv
          .drop(1)
          .each_with_index do |row, index| # drop(1) handles the header row
            author = Author.find_by(openalex_id: row[1].split('/').last)

            unless author.nil?
              authors_ids << {
                author_id: author.id,
                openalex: row[1].split('/').last,
                orcid: row[2],
                scopus: row[3],
                twitter: row[4],
                wikipedia: row[5],
                mag: row[6]
              }
            end

            if authors_ids.count >= 100
              AuthorsIds.insert_all(authors_ids)

              authors_ids = []
            end
          end
      end

      AuthorsIds.insert_all(authors_ids)
    end
  end
end
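Each CSV row in these tasks costs one `Author.find_by` query. A possible optimization (not part of this PR) is to preload a hash from OpenAlex id to database id once, then resolve rows in memory. The sketch below is plain Ruby, with `records` standing in for something like `Author.pluck(:openalex_id, :id)` (hypothetical usage):

```ruby
# Build a one-shot lookup table from OpenAlex id to local database id.
# `records` is an array of [openalex_id, db_id] pairs, standing in for
# an ActiveRecord pluck in the real task (hypothetical here).
def build_id_map(records)
  records.to_h
end

# Resolve a full OpenAlex URL to a local id; returns nil for unknown
# ids, matching the task's `unless author.nil?` guard.
def resolve(id_map, openalex_url)
  id_map[openalex_url.split('/').last]
end
```

One upfront query plus O(1) hash lookups replaces N `find_by` calls, at the cost of holding the id map in memory for the duration of the import.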
48 changes: 27 additions & 21 deletions rails/lib/tasks/load_institutions.rake
@@ -1,30 +1,36 @@
 require 'rake'
-require "zlib"
+require 'zlib'
 require 'csv'
 
 namespace :data_import do
-  desc "Load Institutions from gzipped csv to db"
-  task :load_institutions => :environment do
+  desc 'Load Institutions from gzipped csv to db'
+  task load_institutions: :environment do
+    institutions = []
 
-    Zlib::GzipReader.open('/opt/bookworm/csv-files/institutions.csv.gz') do |gzip|
+    Zlib::GzipReader.open(
+      '/opt/bookworm/csv-files/institutions.csv.gz'
+    ) do |gzip|
       csv = CSV.new(gzip)
-      csv.drop(1).each_with_index do |row, index| # drop(1) handles the header row
-        Institution.create(
-          openalex_id: row[0].split("/").last,
-          ror: row[1],
-          display_name: row[2],
-          country_code: row[3],
-          institution_type: row[4],
-          homepage_url: row[5],
-          image_url: row[6],
-          image_thumbnail_url: row[7],
-          display_name_acronyms: row[8],
-          display_name_alternatives: row[9],
-          works_count: row[10],
-          cited_by_count: row[11],
-          works_api_url: row[12]
-        )
-      end
+      csv
+        .drop(1)
+        .each_with_index do |row, index| # drop(1) handles the header row
+          institutions << {
+            openalex_id: row[0].split('/').last,
+            ror: row[1],
+            display_name: row[2],
+            country_code: row[3],
+            institution_type: row[4],
+            homepage_url: row[5],
+            image_url: row[6],
+            image_thumbnail_url: row[7],
+            display_name_acronyms: row[8],
+            display_name_alternatives: row[9],
+            works_count: row[10],
+            cited_by_count: row[11],
+            works_api_url: row[12]
+          }
+        end
     end
+    Institution.insert_all(institutions)
   end
 end
45 changes: 45 additions & 0 deletions rails/lib/tasks/load_institutions_associated_institutions.rake
@@ -0,0 +1,45 @@
require 'rake'
require 'zlib'
require 'csv'

namespace :data_import do
  desc 'Load InstitutionsAssociatedInstitutions from gzipped csv to db'
  task load_institutions_associated_institutions: :environment do
    institutions_associated_institutions = []

    Zlib::GzipReader.open(
      '/opt/bookworm/csv-files/institutions_associated_institutions.csv.gz'
    ) do |gzip|
      csv = CSV.new(gzip)

      csv
        .drop(1)
        .each_with_index do |row, index| # drop(1) handles the header row
          institution = Institution.find_by(openalex_id: row[0].split('/').last)

          associated_institution =
            Institution.find_by(openalex_id: row[1].split('/').last)

          unless institution.nil? || associated_institution.nil?
            institutions_associated_institutions << {
              institution_id: institution.id,
              associated_institution_id: associated_institution.id,
              relationship: row[2]
            }
          end

          if institutions_associated_institutions.count >= 100
            InstitutionsAssociatedInstitutions.insert_all(
              institutions_associated_institutions
            )

            institutions_associated_institutions = []
          end
        end
    end

    InstitutionsAssociatedInstitutions.insert_all(
      institutions_associated_institutions
    )
  end
end
38 changes: 38 additions & 0 deletions rails/lib/tasks/load_institutions_counts_by_year.rake
@@ -0,0 +1,38 @@
require 'rake'
require 'zlib'
require 'csv'

namespace :data_import do
  desc 'Load InstitutionsCountsByYear from gzipped csv to db'
  task load_institutions_counts_by_year: :environment do
    institutions_counts_by_years = []

    Zlib::GzipReader.open(
      '/opt/bookworm/csv-files/institutions_counts_by_year.csv.gz'
    ) do |gzip|
      csv = CSV.new(gzip)

      csv
        .drop(1)
        .each_with_index do |row, index| # drop(1) handles the header row
          institution = Institution.find_by(openalex_id: row[0].split('/').last)

          institutions_counts_by_years << {
            institution_id: institution.id,
            year: row[1],
            works_count: row[2],
            cited_by_count: row[3],
            oa_works_count: row[4]
          }

          if institutions_counts_by_years.count >= 100
            InstitutionsCountsByYear.insert_all(institutions_counts_by_years)

            institutions_counts_by_years = []
          end
        end
    end

    InstitutionsCountsByYear.insert_all(institutions_counts_by_years)
  end
end
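Unlike the authors tasks, this loop dereferences `institution.id` with no nil check, so a counts row for an institution that never loaded would raise `NoMethodError`. A small plain-Ruby sketch of the guard the other tasks use (`find_institution_id` is a hypothetical stand-in for the `Institution.find_by` lookup):

```ruby
# Map one CSV row to insert_all attributes, skipping rows whose
# institution is not in the database (returns nil instead of raising).
def counts_row_attrs(row, find_institution_id)
  institution_id = find_institution_id.call(row[0].split('/').last)
  return nil if institution_id.nil? # mirror the authors tasks' nil guard

  {
    institution_id: institution_id,
    year: row[1],
    works_count: row[2],
    cited_by_count: row[3],
    oa_works_count: row[4]
  }
end
```

The caller can then `compact` the mapped rows before the bulk insert, so missing institutions are silently skipped rather than aborting the whole import.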
47 changes: 30 additions & 17 deletions rails/lib/tasks/load_institutions_geo.rake
@@ -1,27 +1,40 @@
 require 'rake'
-require "zlib"
+require 'zlib'
 require 'csv'
 
 namespace :data_import do
-  desc "Load InstitutionsGeo from gzipped csv to db"
-  task :load_institutions_geo => :environment do
+  desc 'Load InstitutionsGeo from gzipped csv to db'
+  task load_institutions_geo: :environment do
+    institutions_geos = []
 
-    Zlib::GzipReader.open('/opt/bookworm/csv-files/institutions_geo.csv.gz') do |gzip|
+    Zlib::GzipReader.open(
+      '/opt/bookworm/csv-files/institutions_geo.csv.gz'
+    ) do |gzip|
       csv = CSV.new(gzip)
-      csv.drop(1).each_with_index do |row, index| # drop(1) handles the header row
-        institution = Institution.find_by(openalex_id: row[0].split("/").last)
+      csv
+        .drop(1)
+        .each_with_index do |row, index| # drop(1) handles the header row
+          institution = Institution.find_by(openalex_id: row[0].split('/').last)
 
-        InstitutionsGeo.create(
-          institution_id: institution.id,
-          city: row[1],
-          geonames_city_id: row[2],
-          region: row[3],
-          country_code: row[4],
-          country: row[5],
-          latitude: row[6],
-          longitude: row[7]
-        )
-      end
+          institutions_geos << {
+            institution_id: institution.id,
+            city: row[1],
+            geonames_city_id: row[2],
+            region: row[3],
+            country_code: row[4],
+            country: row[5],
+            latitude: row[6],
+            longitude: row[7]
+          }
+
+          if institutions_geos.count >= 100
+            InstitutionsGeo.insert_all(institutions_geos)
+
+            institutions_geos = []
+          end
+        end
     end
+
+    InstitutionsGeo.insert_all(institutions_geos)
   end
 end
39 changes: 23 additions & 16 deletions rails/lib/tasks/load_institutions_ids.rake
@@ -1,26 +1,33 @@
 require 'rake'
-require "zlib"
+require 'zlib'
 require 'csv'
 
 namespace :data_import do
-  desc "Load InstitutionsIds from gzipped csv to db"
-  task :load_institutions_ids => :environment do
+  desc 'Load InstitutionsIds from gzipped csv to db'
+  task load_institutions_ids: :environment do
+    institution_ids = []
 
-    Zlib::GzipReader.open('/opt/bookworm/csv-files/institutions_ids.csv.gz') do |gzip|
+    Zlib::GzipReader.open(
+      '/opt/bookworm/csv-files/institutions_ids.csv.gz'
+    ) do |gzip|
       csv = CSV.new(gzip)
-      csv.drop(1).each_with_index do |row, index| # drop(1) handles the header row
-        institution = Institution.find_by(openalex_id: row[0].split("/").last)
+      csv
+        .drop(1)
+        .each_with_index do |row, index| # drop(1) handles the header row
+          institution = Institution.find_by(openalex_id: row[0].split('/').last)
 
-        InstitutionsIds.create(
-          institution_id: institution.id,
-          openalex: row[0].split("/").last,
-          ror: row[2],
-          grid: row[3],
-          wikipedia: row[4],
-          wikidata: row[5],
-          mag: row[6]
-        )
-      end
+          institution_ids << {
+            institution_id: institution.id,
+            openalex: row[0].split('/').last,
+            ror: row[2],
+            grid: row[3],
+            wikipedia: row[4],
+            wikidata: row[5],
+            mag: row[6]
+          }
+        end
     end
 
+    InstitutionsIds.insert_all(institution_ids)
   end
 end