janiemi (Collaborator) commented Nov 25, 2025

Fix and improve korp-make-corpus-package.sh:

  1. Do not include Korp frontend configuration (obsolete feature).
  2. Default to exporting database files as TSV.
  3. Fix functions listing database files.
  4. Fix database export to ensure that the target directory exists.
  5. Improve code.

Fix korp-mysql-export.sh:

  1. Remove support for obsolete tables.
  2. Fix to hide and ignore MySQL “Table doesn’t exist” errors.

Some other improvements may also be included.

scripts/korp-make-corpus-package.sh:
- Use chmod a-w instead of chmod 444 to remove write permissions, so
  that it is possible for the package file to have no world read
  permission.
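
A minimal sketch of the difference between the two chmod calls, assuming a package file that was deliberately created without world read permission. The file names, the starting mode 640 and the helper mode_of are hypothetical and only serve the illustration.

```bash
# Print the permission string of a file (first 10 characters of ls -ld).
mode_of () {
    ls -ld "$1" | cut -c1-10
}

tmpdir=$(mktemp -d)
pkg_a="$tmpdir/a_korp.tgz"   # hypothetical package file names
pkg_b="$tmpdir/b_korp.tgz"
touch "$pkg_a" "$pkg_b"
chmod 640 "$pkg_a" "$pkg_b"  # owner rw, group r, no world access

# chmod a-w only clears the write bits; the read bits stay as they were.
chmod a-w "$pkg_a"
# chmod 444 sets the whole mode, so world read is switched on as well.
chmod 444 "$pkg_b"

echo "a-w: $(mode_of "$pkg_a")"
echo "444: $(mode_of "$pkg_b")"
```

With a-w the file stays unreadable for others, whereas 444 makes it world-readable regardless of the original mode.
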
scripts/korp-make-corpus-package.sh:
- Remove as obsolete the support for including the Korp frontend
  configuration files in the corpus package. Including the frontend
  configuration never worked very well, and once the corpus
  configurations are moved to the backend, it will be completely
  obsolete.
scripts/korp-make-corpus-package.sh:
- Use mkdir_perms instead of "mkdir -p" to set the permissions of
  created directories.
scripts/korp-make-corpus-package.sh:
- Move much of the top-level code to functions in an effort to try to
  structure the code slightly better.
scripts/korp-make-corpus-package.sh:
- Check the value of --newer already when checking options, so that a
  possible error in the argument is caught early, instead of just before
  creating the package. Also move functions get_newer_date and
  check_newer to before check_options.
scripts/korp-make-corpus-package.sh:
- Remove database file type "timespans" as long obsolete.
scripts/korp-make-corpus-package.sh:
- Fix functions listing database files:
  - List explicitly all .sql and .tsv files with the corpus id prefix in
    the database file directories, regardless of the rest of the
    filename. (Previously, the functions in effect did the same, even
    though they were probably intended to list only files with known
    Korp database file names.)
  - Fix to pass only the required arguments.
  - Add local variable declarations.
  - Add rudimentary function comments.
  - Remove FIXME comments.
scripts/korp-make-corpus-package.sh:
- Rename variable $corp_id to $corpus_id for consistency.
  (Previously, some functions used one, others the other.)
scripts/korp-mysql-export.sh:
- Remove support for exporting obsolete tables: timespans, corpus_info,
  names_CORPUS, names_CORPUS_sentences, names_CORPUS_strings.
scripts/korp-mysql-export.sh:
- Fix to really ignore "Table doesn't exist" errors (1146) without
  outputting them. (Previously, the script purported to do so but it did
  not work correctly.)
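
One possible way to hide and ignore error 1146 can be sketched as follows. This is not necessarily how korp-mysql-export.sh does it: the function name run_ignoring_1146 and the stand-in fake_mysql are hypothetical, and the sketch assumes that the client reports the error on stderr on a line beginning with "ERROR 1146".

```bash
# Run a command, drop "ERROR 1146" lines from its stderr and treat a
# missing table as success; pass all other errors through unchanged.
run_ignoring_1146 () {
    local errfile status=0
    errfile=$(mktemp)
    "$@" 2> "$errfile" || status=$?
    if grep -q '^ERROR 1146 ' "$errfile"; then
        status=0
    fi
    # Forward the remaining error lines (if any) to stderr.
    grep -v '^ERROR 1146 ' "$errfile" >&2 || true
    rm -f "$errfile"
    return "$status"
}

# Hypothetical stand-in for a mysql call on a non-existent table.
fake_mysql () {
    echo "ERROR 1146 (42S02) at line 1: Table 'korp.x' doesn't exist" >&2
    return 1
}

if run_ignoring_1146 fake_mysql; then
    echo "missing table ignored"
fi
```
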
scripts/shlib/file.sh:
- Define global variables related to compression programs, initialized
  in _init_compress_info: compress_progs (compression programs),
  compress_exts (compressed file extensions), compress_prog_$ext
  (compression program for extension $ext) and compress_ext_$prog
  (filename extension for compression program $prog). Initialize to
  support gzip, bzip2, xz, lzip, lzma, lzop, zstd.
- Add functions get_compress, get_compress_ext.
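
The variable layout described above can be sketched roughly as follows. The function names init_compress_info_sketch and lookup_compress_var are hypothetical, the program and extension pairs are assumptions based on the usual extensions of the listed programs, and the actual _init_compress_info may well be implemented differently (for example, by also checking which programs are installed).

```bash
# Set compress_progs, compress_exts, compress_prog_$ext and
# compress_ext_$prog from a list of "program:extension" pairs.
init_compress_info_sketch () {
    local spec prog ext
    compress_progs=
    compress_exts=
    for spec in gzip:gz bzip2:bz2 xz:xz lzip:lz lzma:lzma lzop:lzo zstd:zst
    do
        prog=${spec%%:*}
        ext=${spec#*:}
        compress_progs="${compress_progs:+$compress_progs }$prog"
        compress_exts="${compress_exts:+$compress_exts }$ext"
        # printf -v assigns to a variable whose name is computed at run time
        printf -v "compress_prog_$ext" '%s' "$prog"
        printf -v "compress_ext_$prog" '%s' "$ext"
    done
}

# Output the value of the variable named $1$2, using indirect expansion.
lookup_compress_var () {
    local var="$1$2"
    printf '%s\n' "${!var}"
}

init_compress_info_sketch
lookup_compress_var compress_prog_ zst    # program for extension zst
lookup_compress_var compress_ext_ bzip2   # extension for program bzip2
```
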
scripts/shlib/file.sh:
- Function comprcat:
  - Support more compression programs based on the initialized
    information on compression programs and filename extensions.
  - Declare and rename local variables.
  - Add function comment.
scripts/shlib/file.sh:
- Function test_compr_file: Test for the compressed filename extensions
  in $compress_exts.
scripts/korp-mysql-export.sh:
- Use get_compress and get_compress_ext in shlib/file.sh to set the
  compression program and compressed filename extension, to support more
  compression programs.
scripts/korp-make-corpus-package.sh:
- Use get_compress and get_compress_ext in shlib/file.sh to set the
  compression program and compressed filename extension, to support more
  compression programs (and only ones that are actually available).
scripts/korp-make-corpus-package.sh:
- Avoid tar warning on an empty member name by unquoting $tar_newer_opt
  on the tar command line.
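
The effect of the quoting can be illustrated with a small sketch in which the hypothetical helper count_args stands in for tar: a quoted empty variable is still passed as an (empty) argument, which tar would treat as an empty member name, whereas an unquoted empty variable disappears. The array at the end is an assumption about a possible alternative, not what the script does.

```bash
# Output the number of arguments received.
count_args () {
    echo "$#"
}

tar_newer_opt=

count_args "$tar_newer_opt" data/   # 2: the empty string is an argument
count_args $tar_newer_opt data/     # 1: the empty variable vanishes

# Alternative that also copes with option values containing spaces:
tar_opts=()
if [ -n "$tar_newer_opt" ]; then
    tar_opts+=("$tar_newer_opt")
fi
count_args "${tar_opts[@]}" data/   # 1
```
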
scripts/korp-make-corpus-package.sh:
- Fix to create the directory for the SQL data if needed.
- Function dump_database: Add local variable declarations and function
  comment.
scripts/korp-make-corpus-package.sh:
- Overwrite possibly existing compressed SQL files with -f, to avoid the
  compression program prompting the user.
- compress_or_rm_sqlfile: Add local declaration and function comment.
scripts/korp-make-corpus-package.sh:
- Change the semantics and arguments of database options:
  - --database-format: Remove database format "auto"; default to "tsv".
    (Previously, defaulted to "auto": SQL or TSV, whichever had more
    recent files.)
  - --export-database: Take an argument: one of "yes" or "always"
    (always export database data), "no" or "never" (never export), or
    "auto" (export if and only if no database data already exists).
    (Previously, specifying the option always exported database data.)
  - --export-database: When database format is "sql", dump database as
    SQL. (Previously, --export-database only exported as TSV.)
- With --database-format=none, warn about omitting database data.
- Remove function list_existing_db_files as no longer needed.
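
The new semantics of the --export-database argument could be expressed roughly as in the following sketch. The function name should_export_database and the convention of passing the number of existing database files as the second argument are hypothetical; only the accepted values and their meanings come from the description above.

```bash
# Return 0 if database data should be exported, 1 if not, and 2 for an
# invalid mode. $1 = mode, $2 = number of existing database files.
should_export_database () {
    local mode=$1 existing_count=$2
    case $mode in
        yes | always ) return 0 ;;
        no | never )   return 1 ;;
        auto )         [ "$existing_count" -eq 0 ] ;;
        * )
            echo "Invalid --export-database argument: $mode" >&2
            return 2
            ;;
    esac
}

# With two existing database files, only "yes" leads to an export.
for mode in yes no auto; do
    if should_export_database "$mode" 2; then
        echo "$mode: export"
    else
        echo "$mode: skip"
    fi
done
```
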
scripts/shlib/mysql.sh:
- Find the mysqldump binary like mysql and set $mysqldump_bin. Check
  the environment variable $KORP_MYSQLDUMP_BIN first.
- Set $mysqldump_error if mysqldump binary not found.
- run_mysql: Use $mysqldump_bin. Warn and return 1 if $mysqldump_error
  is non-empty. Declare local variable.
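
A rough sketch of the lookup order, assuming that $KORP_MYSQLDUMP_BIN is used when it points to an executable file and that the binary is otherwise searched for on $PATH. The function name find_mysqldump_sketch and the error message text are hypothetical; the actual code in shlib/mysql.sh may search other locations as well.

```bash
# Set mysqldump_bin, or mysqldump_error if no mysqldump binary is found.
find_mysqldump_sketch () {
    mysqldump_bin=
    mysqldump_error=
    if [ -n "${KORP_MYSQLDUMP_BIN:-}" ] && [ -x "$KORP_MYSQLDUMP_BIN" ]; then
        # The environment variable takes precedence
        mysqldump_bin=$KORP_MYSQLDUMP_BIN
    elif mysqldump_bin=$(command -v mysqldump); then
        :   # Found on $PATH
    else
        mysqldump_bin=
        mysqldump_error="mysqldump not found"
        return 1
    fi
}

if find_mysqldump_sketch; then
    echo "Using $mysqldump_bin"
else
    echo "Warning: $mysqldump_error" >&2
fi
```
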
scripts/shlib/mysql.sh:
- run_mysqldump: Dump data from the authorization database if --auth is
  specified or if the name of the first table begins with "auth_".
scripts/korp-make-corpus-package.sh:
- Also dump authorization data (tables auth_license and auth_lbr_map) as
  SQL, similarly to TSV.
- Rename sql_table_name_* as sql_table_names_* and support multiple
  tables for a file type (to have the auth tables in the same file).
- make_sql_table_part:
  - Support dumping multiple tables to a single file.
  - Add local variable declarations.
  - Add function comment.
@janiemi janiemi marked this pull request as ready for review November 27, 2025 13:42
janiemi (Collaborator, Author) commented Nov 27, 2025

I think I’ve now made the essential fixes and improvements, so I converted this PR to non-draft. I’d still like to make korp-make-corpus-package.sh export database files automatically if the existing files are older than (the relevant) CWB data files, but that’s probably not absolutely necessary. (And the script still has the .sh extension.)

@janiemi janiemi requested a review from Traubert November 27, 2025 13:49
scripts/korp-make-corpus-package.sh:
- Fix tar --exclude patterns to work correctly. (Previously, if a
  pattern matched files in the current directory, those files were
  listed to be excluded.)
- make_tar_excludes: Simplify code by using printf with a repeating
  pattern.
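
The printf technique mentioned above relies on printf reusing its format string until all arguments are consumed. A minimal sketch, with the hypothetical function name make_tar_excludes_sketch and made-up patterns; quoting "$@" keeps the shell from expanding the patterns against files in the current directory, which is the kind of problem the fix addresses.

```bash
# Output one --exclude option per pattern given as an argument.
make_tar_excludes_sketch () {
    # Without arguments, printf would output the format once with an
    # empty value, so return early.
    [ $# -gt 0 ] || return 0
    # "--" ends printf's own options, as the format begins with "--".
    printf -- '--exclude=%s\n' "$@"
}

make_tar_excludes_sketch '*.bak' 'tmp/*'
```
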
scripts/korp-make-corpus-package.sh:
- add_files: Fix to call function update_info, to which updating .info
  files was moved in commit 2e35f37.
scripts/shlib/kielipankki.sh:
- Add functions get_licence_type and get_lbr_id (and their actual
  implementation function _get_auth_info) to get the licence type and
  LBR id of a corpus from the Korp auth MySQL database or TSV files
  exported from it.
scripts/korp-make-corpus-package.sh:
- Add corpus auth info also to the .info file of the corpus, based on
  command-line options or the data in the Korp auth MySQL database. The
  possibly added pieces of information are "License: Licence type",
  "LbrId: LBR id" and "Protected: true" (for non-PUB corpora). The items
  are added only if available. This aims to be forward-compatible with
  the forthcoming new Korp version.
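
The rules for the added .info items can be sketched as below. The function name auth_info_lines_sketch is hypothetical, and the sketch assumes that "Protected: true" is added only when the licence type is known and is not PUB; the LBR id value is the placeholder used elsewhere in this PR, not a real identifier.

```bash
# Output the auth-related .info lines for a corpus.
# $1 = licence type (may be empty), $2 = LBR id (may be empty)
auth_info_lines_sketch () {
    local licence_type=$1 lbr_id=$2
    if [ -n "$licence_type" ]; then
        printf 'License: %s\n' "$licence_type"
        if [ "$licence_type" != PUB ]; then
            printf 'Protected: true\n'
        fi
    fi
    if [ -n "$lbr_id" ]; then
        printf 'LbrId: %s\n' "$lbr_id"
    fi
}

auth_info_lines_sketch RES 'urn:nbn:fi:lb-YYYYMMNNN@LBR'
auth_info_lines_sketch PUB ''
```
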
scripts/shlib/kielipankki.sh:
- _get_auth_info, get_licence_type, get_lbr_id:
  - Try to obtain the auth information from an SQL file if no TSV file
    is found and before trying to access the actual database.
  - Add option --sql-dir to specify an explicit directory for SQL files.
  - For the default TSV directory, first try .../vrt/$corpus_id, and if
    it does not exist or if no auth tables are found there, try
    .../sql/$corpus_id. (Previously, only the existence of the directory
    was checked, not that of the file.)
  - Use function newest_file (defined in file.sh).
scripts/shlib/kielipankki.sh,
scripts/shlib/site.sh:
- Remove variables korp_frontend_dir and default_korp_frontend_dirs as
  obsolete. (The Korp frontend directory location was needed only for
  including the frontend configuration in Korp corpus packages, but that
  has not been working as originally intended for a long time, and
  support for it was removed from korp-make-corpus-package.sh.)
scripts/korp-make:
- Remove as obsolete the option --korp-frontend-dir for specifying the
  location of the Korp frontend configuration files. (The corresponding
  option was already removed from korp-make-corpus-package.sh.)
scripts/korp-install-corpora.sh:
- Recognize SQL files matching *_auth.* in addition to *_auth_*, as
  korp-make-corpus-package.sh now dumps the auth tables to a single SQL
  file CORPUS_auth.sql.
scripts/shlib/file.sh:
- Recognize .bz as bzip2 extension, in addition to .bz2.
- To make that work, add support in _init_compress_info for specifying
  multiple extensions for a compression program.
scripts/shlib/file.sh:
- Add function get_tar_compress_opt: Output the tar compression option
  for a tar file name. (This is to replace a function in
  korp-install-corpora.sh with the same name but with a more limited
  support for compression programs.)
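
A sketch of mapping a tar file name to a GNU tar compression option. The function name tar_compress_opt_sketch and the exact set of recognized file name suffixes are assumptions; the function in shlib/file.sh presumably derives the mapping from the initialized compression information instead of a fixed case statement.

```bash
# Output the GNU tar compression option for tar file name $1, or
# nothing if the name does not indicate a known compression.
tar_compress_opt_sketch () {
    case $1 in
        *.tar.gz | *.tgz )                      echo --gzip ;;
        *.tar.bz2 | *.tar.bz | *.tbz2 | *.tbz ) echo --bzip2 ;;
        *.tar.xz | *.txz )                      echo --xz ;;
        *.tar.lz )                              echo --lzip ;;
        *.tar.lzma )                            echo --lzma ;;
        *.tar.lzo )                             echo --lzop ;;
        *.tar.zst )                             echo --zstd ;;
        * )                                     : ;;
    esac
}

tar_compress_opt_sketch corpus_korp.tar.zst
tar_compress_opt_sketch corpus_korp.tbz
```

The output would be left unquoted on the tar command line, so that an empty result for an uncompressed .tar file adds no argument.
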
scripts/korp-install-corpora.sh:
- Support all the compression programs supported by shlib/file.sh for
  decompressing the corpus package, instead of only gzip, bzip2 and xz.
- Remove function get_tar_compress_opt and use the function with the
  same name defined in shlib/file.sh.
scripts/korp-make-corpus-package.sh:
- Support including database files compressed by any of the compression
  programs supported by shlib/file.sh, instead of only gzip, bzip2 and
  xz.
janiemi (Collaborator, Author) commented Dec 8, 2025

@Traubert, I’ve added a few commits that aren’t directly related to handling database data, but I included them here anyway, as I made them at the same time. The changes include:

  • Fix handling of tar exclude patterns.
  • Add to the .info files of CQP data auth information of the form:
    License: RES
    Protected: true
    LbrId: urn:nbn:fi:lb-YYYYMMNNN@LBR
    
  • Support getting auth info from an SQL file.
  • Some additions to the shlib library, including support for more compression programs.

I’m wondering whether this could be it for this PR, with any further work continuing in another PR later if needed.
