See the releases for Scribe-Data for an up-to-date list of versions and their release dates.
Scribe-Data tries to follow semantic versioning, a MAJOR.MINOR.PATCH version where increments are made to the:
- MAJOR version when we make incompatible API changes
- MINOR version when we add functionality in a backwards compatible manner
- PATCH version when we make backwards compatible bug fixes
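As a quick illustration of this scheme, the three increment rules can be sketched in a few lines of Python (the `bump` helper below is hypothetical and not part of Scribe-Data):

```python
# Minimal illustration of MAJOR.MINOR.PATCH increments.
# bump() is a hypothetical helper, not part of Scribe-Data.
def bump(version: str, part: str) -> str:
    major, minor, patch = (int(n) for n in version.split("."))
    if part == "major":  # incompatible API changes
        return f"{major + 1}.0.0"
    if part == "minor":  # backwards compatible functionality
        return f"{major}.{minor + 1}.0"
    if part == "patch":  # backwards compatible bug fixes
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown part: {part}")
```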
Emojis for the following are chosen based on gitmoji.
- The SPARQL queries for the Scribe-Data CLI are generated by a process that checks the available data via the Wikidata Query Service (#617).
- Scribe-Data can now be used to generate nine emojis per word instead of just three (#670).
- The handling of missing language directories in the SQLite conversion process has been dramatically improved: the user is now told which languages are missing and alerted that no SQLite databases will be created if no data is available for any of the desired languages.
- Testing for various parts of the CLI was expanded (#623).
- Local pre-commit hooks are now run with prek instead of pre-commit.
- Dependencies were updated given dependabot warnings.
- All dependencies for the package were updated to the highest feasible level.
- Dependency management was switched over to using uv.
- The convert parser now accepts multiple data types (#634).
- Fixed data conversion not handling multiple explicitly passed languages and data types (#632).
- Fixed data conversion not handling multiple explicitly passed languages (#630).
- The path to the contracts was fixed in data filtration to assure that it's a `pathlib.Path` value (#627).
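A path coercion like the one in the contracts fix can be sketched as follows (a minimal standard-library illustration; `to_path` is a hypothetical helper, not Scribe-Data's actual code):

```python
from pathlib import Path


def to_path(p) -> Path:
    """Coerce a string or Path argument into a pathlib.Path value."""
    # Hypothetical helper: accept either type so callers can pass plain strings.
    return p if isinstance(p, Path) else Path(p)
```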
- The upgrade functionality of the CLI is now comprehensively tested (#624).
- The upgrade message instructs the user to use the built-in upgrade functionality.
- The upgrade command now upgrades the package via pip rather than bringing down GitHub files and installing them directly.
- The requirement files have been updated to fix package install errors (#621).
- The minimum Python version was updated to 3.11.
- Scribe-Data now has the ability to download the most recent or a specific Wikidata lexemes dump (#517).
- Wikidata SPARQL queries are now autogenerated and maintained via Wikidata dumps (#513).
- Forms are separated into files based on their identifiers while ignoring maintainer set queries (#575).
- Queries have been expanded for all languages and data forms based on the Wikidata dump process.
- The date of last modification for Wikidata lexemes has been added to query and dump parsing outputs (#562).
- Interactive mode now functions throughout the CLI, presenting the user with options for data extraction.
- There is now a top level interactive mode command for accessing all Scribe-Data functionality (#523).
- Repeat forms are combined with vertical bars ("|") as a separator (#544, #573).
- A workflow has been created to update the emoji data on a regular basis (#542).
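The repeat-form merging mentioned above can be sketched as follows (a hypothetical helper, not Scribe-Data's implementation; deduplication and the exact separator are assumptions):

```python
def merge_repeat_forms(forms: list[str]) -> str:
    """Combine repeated form values with a vertical bar separator,
    keeping first-seen order and dropping exact duplicates."""
    # dict.fromkeys preserves insertion order while removing repeats.
    return "|".join(dict.fromkeys(forms))
```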
- Resulting data can be filtered based on data contracts (#581).
- Contracts can be checked against data to assure that they're valid given the data's field names (#561).
- The Wikipedia based autosuggestion functionality is now CLI based instead of using a Jupyter notebook (#206).
- SPDX license identifiers have been added for all files (#553).
- The version command was fixed to account for cases where the version has a `v` before it (#534).
- The functionality to check for current data and prompt its deletion was centralized and messages to the user were made clearer (#336).
- If Wikidata queries can't be completed, Scribe-Data now includes dramatically better error messages and directs the user to leverage commands that use Wikidata dumps (#549).
- General bug fixes for a more fluid developer experience.
- Tests have been written for all new functionalities (#570).
- CI testing now includes a coverage check that breaks if coverage falls below a given percentage.
- Documentation has been expanded for all functionalities of the CLI.
- All numpydoc docstrings have been fixed and unneeded code has been removed (#547).
- Queries for noun genders and other properties that require the Wikidata label service now return their English label rather than the auto-generated label, which was returning just the Wikidata QID.
- SPARQL queries for English and Portuguese prepositions were added to allow the CLI to query these types of data.
- The convert functionality once again works for lists of languages and all data types for them.
- SQLite conversion was fixed for all queries (#527).
- The data conversion process outputs were improved, including capitalizing language names, and repeat notices to the user were removed.
- The CLI's `get` command now returns all data types if none is passed.
- The Portuguese verbs query was fixed as it wasn't formatted correctly.
- The emoji keyword functionality was fixed given the new lexeme ID based form of the data.
- Arguments that were breaking functionality were fixed.
- Language names shown to the user are now capitalized.
- `case` has been renamed `grammaticalCase` in preposition queries to assure that SQLite reserved keywords are not used.
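The motivation for this rename can be seen directly with the standard library's sqlite3 module: CASE is a reserved keyword in SQLite, so a bare column named `case` is rejected while `grammaticalCase` is accepted.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# "case" is a reserved keyword in SQLite, so using it as a bare
# column name raises an OperationalError.
try:
    con.execute("CREATE TABLE prepositions (case TEXT)")
except sqlite3.OperationalError as e:
    print(f"rejected: {e}")

# "grammaticalCase" is not reserved and works as expected.
con.execute("CREATE TABLE prepositions (grammaticalCase TEXT)")
```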
- Queries for countless data types for countless languages were expanded and added ❤️
- Scribe-Data is now a fully functional CLI.
- Querying Wikidata lexicographical data can be done via the `get` command (#159).
- The output type of queries can be in JSON, CSV, TSV and SQLite, with converting output types also being possible (#145, #146).
- Output paths can be set for query results (#144).
- The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself (#186, #157).
- Total Wikidata lexemes for languages and data types can be derived with the `total` command (#147).
- Interactive and total commands can be used via an interactive mode with the `--interactive` argument (#158, #203).
- Outputs were standardized to assure that the CLI experience is consistent.
- The machine translation process has been removed to make way for the Wiktionary based implementation (#292).
- Package metadata files were standardized for languages, data types and Wikidata lexeme forms.
- CLI commands have an argument check that can suggest correct languages and data types (#341).
- Wikidata query process stages no longer trigger the tqdm progress bar when they're unsuccessful (#155).
- Tests have been written for the CLI to assure that its functionality remains consistent.
- Workflows were created to check that the Wikidata queries and project structure are consistent to assure package functionality (#339, #357).
- The project's queries and structure have been updated to match the rules developed for the checks.
- The CLI's functionality has been fully documented (#152, #208).
- Documentation was created to show how to write Scribe-Data queries (#395).
- `word_type` has been switched to `data_type` throughout the codebase (#160).
- Case, gender and annotation utility functions were removed as the formatting process that used them has changed.
- The SPARQLWrapper access method has been extracted to the Wikidata utils and is imported into the files that need it (#164).
- Export data paths have been converted to centrally saved variables to reduce hard coded string repetition.
- Many files were renamed, including `update_data.py` being renamed `query_data.py`.
- Paths within the package have been updated to work for all operating systems via `pathlib` (#125).
- The language formatting scripts have been dramatically simplified given changes to export paths all being the same.
- The `update_files` directory was removed in preparation for other means of showing data totals.
- The `language_data_extraction` directory was moved under the Wikidata directory as it's only used for those processes now (#446).
- The emoji keyword process was centralized to simplify project maintenance (#359).
- PyICU was removed as a dependency and a process was made to install it and its needed dependencies given the operating system of the user (#196).
- The data formatting step was centralized such that we only have one for all languages (#142).
- Sub-query processes are now no longer hard coded such that we'd need to maintain the total possible sub-queries within the `query_data.py` process.
- The translation process has been updated to allow for translations from non-English languages (#72, #73, #74, #75, #76, #77, #78, #79).
- Annotation bugs were removed like repeat or empty values.
- Perfect tenses of Portuguese verbs were fixed via finding the appropriate PID (#68).
- Note that the most common past perfect property is not the standard one, so this will need to be fixed.
- The documentation has been given a new layout with the logo in the top left (#90).
- The documentation now has links to the code at the top of each page (#91).
- pre-commit hooks have been added to the repo to improve the development experience (#137).
- Code formatting was shifted from black to Ruff.
- A Ruff based GitHub workflow was added to check the code formatting and lint the codebase on each pull request (#109).
- The `_update_files` directory was renamed `update_files` as these files are used in non-internal manners now (#57).
- A common function has been created to map Wikidata IDs to noun genders (#69).
- The project is now installed locally for development and command line usage, so usages of `sys.path` have been removed from files (#122).
- The directory structure has been dramatically streamlined and includes folders for future projects where language data could come from other sources like Wiktionary (#139).
- Translation files are moved to their own directory.
- The `extract_transform` directory has been removed and all files within it have been moved one level up.
- The `languages` directory has been renamed `language_data_extraction`.
- All files within `wikidata/_resources` have been moved to the `resources` directory.
- The gender and case annotations for data formatting have now been commonly defined.
- All language directory `formatted_data` files have now been moved to the `scribe_data_json_export` directory to prepare for outputs being required to be directed to a directory outside of the package.
- Path computing has been refactored throughout the codebase, and unneeded functions for data transfers have been removed.
- Minor fixes to documentation index and file docstrings to fix errors.
- Revert change to package path definition to hopefully register the resources directory.
- The docs and tests were grafted into the package using `MANIFEST.in`.
- Minor fixes to file and function docstrings and documentation files.
- `include_package_data=True` is used in `setup.py` to hopefully include all files in the package distribution.
- The data and process needed for an English keyboard has been added (#39).
- The Wikidata queries for English have been updated to get all nouns and verbs.
- Formatting scripts have been written to prepare the queried data and load it into an SQLite database.
- The data update process has been cleaned up in preparation for future changes to Scribe-Data and to implement better practices.
- Language data was extracted into a JSON file for more succinct referencing (#52).
- Language codes are now checked with the package langcodes for easier expansion.
- A process has been created to check and update words that can be translated for each Scribe language (#44).
- The baseline data returned from Wikidata queries is now removed once a formatted data file is created.
- Tensorflow was removed from the download wiki process to fix build problems on Macs.
- A full testing suite has been added to run on GitHub Actions (#37).
- Unit tests have been added for Wikidata queries (#48) and utility functions (#50).
- The Anaconda based virtual environment was removed and documentation was updated to reflect this.
- Language data processes were moved into the `src/scribe_data/extract_transform/languages` directory to clean up the structure.
- Code formatting processes were defined with common structures based on language and word type variables defined at the top of files.
- The word "Scribe" is now added to language database nouns files if it's not already present (#35).
- German contracted prepositions have been added to the German prepositions formatting process (#34).
- Words that are upper case are now better included in the autocomplete lexicon with their lower case equivalents being removed.
- Words with apostrophes have been removed from the autocomplete lexicon.
- Database output column names are now zero indexed to better align with Python and other language standards.
- Scribe-Data now has the ability to generate SQLite databases from formatted language data.
- `data_to_sqlite.py` is used to read available JSON files and input their information into the databases.
- These databases are now sent to Scribe apps via defined paths.
- `send_dbs_to_scribe.py` finds all available language databases and copies them.
- Separating this step from the data update is in preparation for data import in the future where this will be an individual step.
- Scribe-Data now also creates autocomplete lexicons for each language within `data_to_sqlite.py`.
- JSON data is no longer able to be uploaded to Scribe app directories directly, with the SQLite directories now being exported instead.
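The JSON-to-SQLite step can be sketched roughly as follows (a hypothetical simplification; the table schema, key names and function name are illustrative assumptions, not Scribe-Data's actual code):

```python
import json
import sqlite3


def json_to_sqlite(json_str: str, db_path: str = ":memory:") -> sqlite3.Connection:
    """Read formatted JSON language data and load it into an SQLite database.

    Hypothetical sketch: assumes data shaped like {"noun": {"plural": "nouns"}}.
    """
    data = json.loads(json_str)
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE nouns (noun TEXT PRIMARY KEY, plural TEXT)")
    con.executemany(
        "INSERT INTO nouns VALUES (?, ?)",
        [(word, fields.get("plural")) for word, fields in data.items()],
    )
    con.commit()
    return con
```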
- Emojis of singular nouns are now also linked to their plural counterparts if the plural isn't present in the emoji keyword outputs.
- The emoji process also now updates a column in the `data_table.txt` file for sharing on readmes, with `update_data.py` maintaining it in the data update process.
- The statements in translation files have been fixed as they were improperly defined after a file was moved.
- The Jupyter notebooks for autosuggestions and emojis as well as `update_data.py` were moved to the `extract_transform` directory given that they're not used to load data anymore.
- Their code was refactored to reflect their new locations.
- Massive amounts of refactoring happened to achieve the shift in the data export method:
- `format_WORD_TYPE.py` files export to a `formatted_data` directory within `extract_transform`.
- Copies of all data JSONs that were originally in Scribe apps are now in the `formatted_data` directories.
- Functions in `update_utils.py` were switched given that data is no longer uploaded into a `Data` directory within the language keyboard directories within Scribe apps.
- Lots of functions and variables were renamed to make them more understandable.
- Code to derive appropriate export locations within `format_WORD_TYPE.py` files was removed in favor of a language `formatted_data` directory.
- regex was added as a dependency.
- pylint comments were removed.
- Verb SPARQL query scripts for Spanish and Italian were simplified to remove unneeded repeat conditions (#7).
- An option to remove the `is_base` and `rank` sub keys was added.
- The export filenames for emoji keywords were renamed to reflect their usage in autosuggestions and soon autocompletions as well.
- The number of suggested emojis for words can now be limited.
- The total number of emojis that suggestions can be made for can now be limited.
- Scribe-Data now allows the user to create JSONs of word-emoji key-value pairs (#24).
- Scribe-Data can now split Wikidata queries into multiple stages to break up those that were too large to run (#21).
- Scribe-Data now has the ability to download Wikipedia dumps of any language (#15).
- Functions have been added to parse and clean the above dumps (#15).
- Autosuggestions are generated from the cleaned texts by deriving most common words and those words that most commonly follow them (#15).
- A query for profane words has been added and integrated into the autosuggest flow to make sure that inappropriate words are not included (#16).
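The frequency-based derivation of autosuggestions described above can be sketched as follows (a simplified illustration; the function and parameter names are hypothetical, not Scribe-Data's actual code):

```python
from collections import Counter


def autosuggestions(words: list[str], top_n: int = 3) -> dict[str, list[str]]:
    """For each word in a cleaned text, suggest the words that
    most commonly follow it (hypothetical sketch)."""
    followers: dict[str, Counter] = {}
    # Count each adjacent word pair in the running text.
    for current, nxt in zip(words, words[1:]):
        followers.setdefault(current, Counter())[nxt] += 1
    # Keep only the top_n most common followers per word.
    return {w: [s for s, _ in c.most_common(top_n)] for w, c in followers.items()}
```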
- The adjectives column has been removed from Scribe data tables until support is offered.
- The error messages for incorrect args in `update_data.py` have been updated.
- `update_data.py` now functions using SPARQLWrapper instead of wikidataintegrator.
- The data update process has been fixed to work for all queries.
- Hard-coded strings for Spanish formatting files were fixed.
- The paths of `update_data.py` were changed to match the new package structure.
- Releasing a Python package so that the code is accessible and the structure is set for future project iterations.
- Data updates are done via a single file that loads new formatted data into each Scribe application.
- This will be expanded on in the future to create language packs that can be downloaded in app.
- Data extraction and formatting scripts for each of Scribe's current languages as well as those with significant data on Wikidata are included.
Languages include: French, German, Italian, Portuguese, Russian, Spanish, and Swedish. Word types include: nouns, verbs, prepositions and translations.
- The data update process now updates files in Android and Desktop directories if they're present.