Fix and improve handling database data in Korp corpus packaging #23
scripts/korp-make-corpus-package.sh: - Use chmod a-w instead of chmod 444 to remove write permissions, so that it is possible for the package file to have no world read permission.
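To illustrate the difference described above, here is a minimal sketch. The temporary file and its name are made up for the demonstration and are not part of the scripts. It shows that `chmod a-w` only clears write bits, whereas `chmod 444` sets the whole mode and thereby adds world read permission.

```sh
# Hypothetical demo file, not part of korp-make-corpus-package.sh.
tmpdir=$(mktemp -d)
pkg="$tmpdir/demo_package.tgz"
: > "$pkg"

# Start from a mode without world read permission.
chmod 640 "$pkg"
chmod a-w "$pkg"        # clears the write bits only
mode_a_w=$(ls -l "$pkg" | cut -c1-10)

# Same starting point, but set the mode outright.
chmod 640 "$pkg"
chmod 444 "$pkg"        # also grants world read permission
mode_444=$(ls -l "$pkg" | cut -c1-10)

echo "after chmod a-w: $mode_a_w"
echo "after chmod 444: $mode_444"
rm -rf "$tmpdir"
```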
scripts/korp-make-corpus-package.sh: - Remove as obsolete the support for including the Korp frontend configuration files in the corpus package. Including the frontend configuration never worked very well, and once the corpus configurations are moved to the backend, it will be completely obsolete.
scripts/korp-make-corpus-package.sh: - Use mkdir_perms instead of "mkdir -p" to set the permissions of created directories.
scripts/korp-make-corpus-package.sh: - Move much of the top-level code to functions in an effort to try to structure the code slightly better.
scripts/korp-make-corpus-package.sh: - Check the value of --newer already when checking options, so that a possible error in the argument is caught early, instead of just before creating the package. Also move functions get_newer_date and check_newer to before check_options.
scripts/korp-make-corpus-package.sh: - Remove database file type "timespans" as long obsolete.
scripts/korp-make-corpus-package.sh:
- Fix functions listing database files:
- List explicitly all .sql and .tsv files with the corpus id prefix in
the database file directories, regardless of the rest of the
filename. (Previously, they also did that in practice, even though
they were probably intended to list only the files with known Korp
database file names.)
- Fix to pass only the required arguments.
- Add local variable declarations.
- Add rudimentary function comments.
- Remove FIXME comments.
scripts/korp-make-corpus-package.sh: - Rename variables $corp_id to $corpus_id for consistency. (Previously, some functions used one, others the other.)
scripts/korp-mysql-export.sh: - Remove support for exporting obsolete tables: timespans, corpus_info, names_CORPUS, names_CORPUS_sentences, names_CORPUS_strings.
scripts/korp-mysql-export.sh: - Fix to really ignore "Table doesn't exist" errors (1146) without outputting them. (Previously, the script purported to do so but it did not work correctly.)
scripts/shlib/file.sh:
- Define global variables related to compression programs, initialized in _init_compress_info: compress_progs (compression programs), compress_exts (compressed file extensions), compress_prog_$ext (compression program for extension $ext) and compress_ext_$prog (filename extension for compression program $prog). Initialize to support gzip, bzip2, xz, lzip, lzma, lzop, zstd.
- Add functions get_compress, get_compress_ext.
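As a rough illustration of the variable scheme described above, the following sketch sets up the four kinds of variables with `eval`. The function name `init_compress_info_sketch` and the `prog:ext` list format are assumptions made for this example. The actual `_init_compress_info` may be implemented differently, and it also checks which programs are available, which this sketch does not do.

```sh
# Sketch only: not the actual _init_compress_info in shlib/file.sh.
compress_progs=
compress_exts=

init_compress_info_sketch () {
    local spec prog ext
    # Each item is "program:extension" (format assumed for this sketch).
    for spec in gzip:gz bzip2:bz2 xz:xz lzip:lz lzma:lzma lzop:lzo zstd:zst; do
        prog=${spec%%:*}
        ext=${spec#*:}
        compress_progs="$compress_progs $prog"
        compress_exts="$compress_exts $ext"
        # Define compress_prog_$ext and compress_ext_$prog dynamically.
        eval "compress_prog_$ext=\$prog"
        eval "compress_ext_$prog=\$ext"
    done
}

init_compress_info_sketch
echo "Extension zst is handled by: $compress_prog_zst"
echo "bzip2 produces files with extension: $compress_ext_bzip2"
```

The per-extension and per-program variables make lookups in both directions possible without associative arrays, at the cost of using `eval`, so the names must consist of characters valid in shell variable names.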
scripts/shlib/file.sh:
- Function comprcat:
- Support more compression programs based on the initialized
information on compression programs and filename extensions.
- Declare and rename local variables.
- Add function comment.
scripts/shlib/file.sh: - Function test_compr_file: Test for the compressed filename extensions in $compress_exts.
scripts/korp-mysql-export.sh: - Use get_compress and get_compress_ext in shlib/file.sh to set the compression program and compressed filename extension, to support more compression programs.
scripts/korp-make-corpus-package.sh: - Use get_compress and get_compress_ext in shlib/file.sh to set the compression program and compressed filename extension, to support more compression programs (and only ones that are actually available).
scripts/korp-make-corpus-package.sh: - Avoid tar warning on an empty member name by unquoting $tar_newer_opt on the tar command line.
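The effect of the quoting can be seen with a small argument-counting helper. This is a sketch: `count_args` is a hypothetical function that stands in for `tar`. A quoted empty variable is passed as an empty argument, which `tar` would treat as an empty member name, whereas an unquoted empty variable disappears from the command line.

```sh
# Hypothetical stand-in for tar: print the number of arguments received.
count_args () {
    echo $#
}

tar_newer_opt=    # empty when --newer was not specified

quoted=$(count_args -c "$tar_newer_opt" -f pkg.tar)
unquoted=$(count_args -c $tar_newer_opt -f pkg.tar)

echo "quoted: $quoted arguments, unquoted: $unquoted arguments"
```

Leaving the variable unquoted is only safe as long as its value never needs to contain whitespace or glob characters. How the script ensures that is not shown here.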
scripts/korp-make-corpus-package.sh:
- Fix to create the directory for the SQL data if needed.
- Function dump_database: Add local variable declarations and function comment.
scripts/korp-make-corpus-package.sh:
- Overwrite possibly existing compressed SQL files with -f, to avoid the compression program prompting the user.
- compress_or_rm_sqlfile: Add local declaration and function comment.
scripts/korp-make-corpus-package.sh:
- Change the semantics and arguments of database options:
- --database-format: Remove database format "auto"; default to "tsv".
(Previously, defaulted to "auto": SQL or TSV, whichever had more
recent files.)
- --export-database: Take an argument: one of "yes" or "always"
(always export database data), "no" or "never" (never export), or
"auto" (export if and only if no database data already exists).
(Previously, specifying the option always exported database data.)
- --export-database: When database format is "sql", dump database as
SQL. (Previously, --export-database only exported as TSV.)
- With --database-format=none, warn about omitting database data.
- Remove function list_existing_db_files as no longer needed.
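The new argument semantics of --export-database could be sketched roughly as follows. Both function names are hypothetical and the second argument of `should_export_db` (the number of existing database files) is an assumption for this example. The actual option handling in korp-make-corpus-package.sh may be structured differently.

```sh
# Hypothetical helper: map the option argument to yes, no or auto.
normalize_export_db () {
    case "$1" in
        yes | always ) echo yes ;;
        no | never )   echo no ;;
        auto )         echo auto ;;
        * )
            echo "Invalid argument for --export-database: $1" >&2
            return 1
            ;;
    esac
}

# Hypothetical helper: succeed if database data should be exported.
# $1 = normalized option value, $2 = number of existing database files
should_export_db () {
    case "$1" in
        yes )  return 0 ;;
        no )   return 1 ;;
        auto ) [ "$2" -eq 0 ] ;;
        * )    return 2 ;;
    esac
}

mode=$(normalize_export_db always)
if should_export_db "$mode" 3; then
    echo "exporting database data"
fi
```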
scripts/shlib/mysql.sh:
- Find the mysqldump binary like mysql and set $mysqldump_bin. Check the environment variable $KORP_MYSQLDUMP_BIN first.
- Set $mysqldump_error if mysqldump binary not found.
- run_mysql: Use $mysqldump_bin. Warn and return 1 if $mysqldump_error is non-empty. Declare local variable.
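A minimal sketch of this kind of lookup, for illustration only: the function name `find_mysqldump_sketch`, the error message and the fallback to `command -v` are assumptions. The commit message does not say how shlib/mysql.sh actually searches for the binaries.

```sh
# Sketch only: not the actual code in shlib/mysql.sh.
find_mysqldump_sketch () {
    mysqldump_error=
    if [ -n "${KORP_MYSQLDUMP_BIN:-}" ] && [ -x "$KORP_MYSQLDUMP_BIN" ]; then
        # The environment variable takes precedence.
        mysqldump_bin=$KORP_MYSQLDUMP_BIN
    elif mysqldump_bin=$(command -v mysqldump); then
        # Found on $PATH.
        :
    else
        mysqldump_bin=
        mysqldump_error="mysqldump not found"
    fi
}

find_mysqldump_sketch
if [ -n "$mysqldump_error" ]; then
    echo "Warning: $mysqldump_error" >&2
else
    echo "Using $mysqldump_bin"
fi
```

Recording the error in a variable instead of exiting immediately lets scripts that never call mysqldump keep working even if the binary is missing.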
scripts/shlib/mysql.sh: - run_mysqldump: Dump data from the authorization database if --auth is specified or if the name of the first table begins with "auth_".
scripts/korp-make-corpus-package.sh:
- Also dump authorization data (tables auth_license and auth_lbr_map) as SQL, similarly to TSV.
- Rename sql_table_name_* as sql_table_names_* and support multiple tables for a file type (to have the auth tables in the same file).
- make_sql_table_part:
  - Support dumping multiple tables to a single file.
  - Add local variable declarations.
  - Add function comment.
I think I’ve now made the essential fixes and improvements, so I converted this PR to non-draft. I’d still like to make
scripts/korp-make-corpus-package.sh:
- Fix tar --exclude patterns to work correctly. (Previously, if a pattern matched files in the current directory, those files were listed to be excluded.)
- make_tar_excludes: Simplify code by using printf with a repeating pattern.
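The printf technique mentioned above relies on printf reusing its format string for each remaining argument. A minimal sketch, with a hypothetical function name and made-up patterns (the actual make_tar_excludes may take different arguments and produce different output):

```sh
# Sketch only: output one --exclude option per pattern.
make_tar_excludes_sketch () {
    [ $# -gt 0 ] || return 0
    # printf repeats the format for each argument; "--" is needed
    # because the format string itself begins with a hyphen.
    printf -- '--exclude=%s\n' "$@"
}

# Quoting keeps the shell from expanding the patterns against files
# in the current directory before tar sees them.
make_tar_excludes_sketch '*.bak' 'tmp/*'
```

Unquoted patterns are one common cause of the kind of bug described in the commit message, although the message does not state that this was the cause here.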
scripts/korp-make-corpus-package.sh: - add_files: Fix to call function update_info, to which updating .info files was moved in commit 2e35f37.
scripts/shlib/kielipankki.sh: - Add functions get_licence_type and get_lbr_id (and their actual implementation function _get_auth_info) to get the licence type and LBR id of a corpus from the Korp auth MySQL database or TSV files exported from it.
scripts/korp-make-corpus-package.sh: - Add corpus auth info also to the .info file of the corpus, based on command-line options or the data in the Korp auth MySQL database. The possibly added pieces of information are "License: Licence type", "LbrId: LBR id" and "Protected: true" (for non-PUB corpora). The items are added only if available. This aims to be forward-compatible with the forthcoming new Korp version.
scripts/shlib/kielipankki.sh:
- _get_auth_info, get_licence_type, get_lbr_id:
- Try to obtain the auth information from an SQL file if no TSV file
is found and before trying to access the actual database.
- Add option --sql-dir to specify an explicit directory for SQL files.
- For the default TSV directory, first try .../vrt/$corpus_id, and if
it does not exist or if no auth tables are found there, try
.../sql/$corpus_id. (Previously, only the existence of the directory
was checked, not that of the file.)
- Use function newest_file (defined in file.sh).
scripts/shlib/kielipankki.sh, scripts/shlib/site.sh: - Remove variables korp_frontend_dir and default_korp_frontend_dirs as obsolete. (The Korp frontend directory location was needed only for including the frontend configuration in Korp corpus packages, but that has not been working as originally intended for a long time, and support for it was removed from korp-make-corpus-package.sh.)
scripts/korp-make: - Remove as obsolete the option --korp-frontend-dir for specifying the location of the Korp frontend configuration files. (The corresponding option was already removed from korp-make-corpus-package.sh.)
scripts/korp-install-corpora.sh: - Recognize SQL files matching *_auth.* in addition to *_auth_*, as korp-make-corpus-package.sh now dumps the auth tables to a single SQL file CORPUS_auth.sql.
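The two filename patterns can be tested with a `case` statement, as in the following sketch. The function name `is_auth_sql_file` and the example file names are hypothetical. How korp-install-corpora.sh actually matches the files is not shown in the commit message.

```sh
# Hypothetical helper: succeed if the file name looks like an auth SQL file.
is_auth_sql_file () {
    # Match against the base name only.
    case "${1##*/}" in
        *_auth_* | *_auth.* ) return 0 ;;
        * )                   return 1 ;;
    esac
}

for file in sql/mycorpus_auth.sql sql/mycorpus_auth_license.sql.gz \
            sql/mycorpus_lemgrams.sql; do
    if is_auth_sql_file "$file"; then
        echo "$file: auth data"
    else
        echo "$file: other data"
    fi
done
```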
scripts/shlib/file.sh:
- Recognize .bz as bzip2 extension, in addition to .bz2.
- To make that work, add support in _init_compress_info for specifying multiple extensions for a compression program.
scripts/shlib/file.sh: - Add function get_tar_compress_opt: Output the tar compression option for a tar file name. (This is to replace a function in korp-install-corpora.sh with the same name but with a more limited support for compression programs.)
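A function of this kind could look roughly like the following sketch. The function name, the set of recognized extensions and the choice of long options are assumptions for this example. The actual get_tar_compress_opt in shlib/file.sh presumably derives the mapping from the compression information initialized in _init_compress_info. The options used are GNU tar options. `--zstd` requires a reasonably recent GNU tar.

```sh
# Sketch only: output the GNU tar compression option for a tar file
# name, or nothing if the name has no recognized compression extension.
get_tar_compress_opt_sketch () {
    case "$1" in
        *.tar.gz | *.tgz )              printf '%s\n' --gzip ;;
        *.tar.bz2 | *.tar.bz | *.tbz2 ) printf '%s\n' --bzip2 ;;
        *.tar.xz | *.txz )              printf '%s\n' --xz ;;
        *.tar.lz )                      printf '%s\n' --lzip ;;
        *.tar.lzma )                    printf '%s\n' --lzma ;;
        *.tar.lzo )                     printf '%s\n' --lzop ;;
        *.tar.zst )                     printf '%s\n' --zstd ;;
    esac
}

# Hypothetical package file name
pkg=mycorpus_korp_20240101.tar.zst
echo "tar option for $pkg: $(get_tar_compress_opt_sketch "$pkg")"
```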
scripts/korp-install-corpora.sh:
- Support all the compression programs supported by shlib/file.sh for decompressing the corpus package, instead of only gzip, bzip2 and xz.
- Remove function get_tar_compress_opt and use the function with the same name defined in shlib/file.sh.
scripts/korp-make-corpus-package.sh: - Support including database files compressed by any of the compression programs supported by shlib/file.sh, instead of only gzip, bzip2 and xz.
@Traubert, I’ve added a few commits that aren’t directly related to handling database data, but I included them here anyway, as I made them at the same time. The changes include:
I’m wondering whether this could be it for this PR, and we could continue in another PR later if needed.
Fix and improve korp-make-corpus-package.sh:
Fix korp-mysql-export.sh:
Some other improvements may also be included.