This repository has been archived by the owner on Jan 23, 2024. It is now read-only.

Merge pull request #71 from dpriskorn/rewrite_of_items_classes
Rewrite classes for better readability and debugging
dpriskorn authored Oct 6, 2022
2 parents 977d834 + 54f2c9e commit aa2ec79
Showing 48 changed files with 1,693 additions and 1,379 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/lint_python.yml
@@ -17,7 +17,7 @@ jobs:
- run: poetry install --with=dev
- run: poetry run bandit --recursive --skip B301,B105,B403,B311,B101,B324 src # B101 is assert statements
- run: poetry run black --check .
- run: poetry run codespell # --ignore-words-list="" --skip="*.css,*.js,*.lock"
- run: poetry run codespell # --ignore-words-sparql_items="" --skip="*.css,*.js,*.lock"
- run: poetry run flake8 --ignore=C408,C416,E203,F401,F541,R501,R502,R503,R504,W503
--max-complexity=21 --max-line-length=162 --show-source --statistics .
- run: poetry run isort --check-only --profile black .
4 changes: 2 additions & 2 deletions .gitignore
@@ -1,2 +1,2 @@
config/__init__.py
pickle.dat
pickle.dat
config.py
37 changes: 28 additions & 9 deletions README.md
@@ -17,6 +17,11 @@ open graph editable by anyone and maintained by the community itself for the purpose of helping
scientists find each other's work. Wikipedia and Scholia can fill that gap but we need good tooling to curate the
millions of items.

# Caveat
This type of matching, which takes ONLY the label and not the underlying structured
data into account, is SUBOPTIMAL. You are very welcome to suggest or contribute improvements
so we can improve the tool and help you make better edits.

# Features
This tool has the following features:
* Adding a list of manually supplied main subjects to a few selected subgraphs
@@ -34,14 +39,19 @@ so that batches can easily be undone later if needed.
Click "details" in the summary of edits to see more.

# Installation
Download the latest release with:

`$ pip install itemsubjector`

# Alternative installation in venv
Download the release tarball or clone the tool using Git.

## Clone the repository
`git clone https://github.com/dpriskorn/ItemSubjector.git && cd ItemSubjector`

Then checkout the latest release.

`git checkout v0.x` where x is the latest number on the release page.
`git checkout vx.x.x` where x is the latest number on the release page.

## Setup the environment

@@ -72,6 +82,8 @@ issues.


## Wikimedia Cloud Services Kubernetes Beta cluster
*Note: this is for advanced users experienced with an SSH console environment; ask in the [Telegram WikiCite group](https://meta.m.wikimedia.org/wiki/Telegram#Wikidata) if you need help*

See [Kubernetes_HOWTO.md](Kubernetes_HOWTO.md)

# Setup
@@ -82,7 +94,7 @@ config/__init__.py and enter the botusername
for your account
and make sure you give it the *edit page permission*
and *high volume permissions*)
* e.g. `cd config && cp __init__example.py __init__.py && nano __init__.py`
* e.g. `cp config_example.py config.py && nano config.py`

*GNU Nano is an editor, press `ctrl+x` when you are done and `y` to save your changes*

@@ -148,10 +160,10 @@ Usage example:
`poetry run python itemsubjector.py -a Q34 --show-item-urls`
(the shorthand `-iu` also works)

### Limit to scholarly articles without main subject
Usage example:
`poetry run python itemsubjector.py -a Q34 --limit-to-items-without-p921`
(the shorthand `-w` also works)
[//]: # (### Limit to scholarly articles without main subject)
[//]: # (Usage example:)
[//]: # (`poetry run python itemsubjector.py -a Q34 --limit-to-items-without-p921` )
[//]: # ((the shorthand `-w` also works))

## Matching main subjects based on a SPARQL query.
The tool can create a list of jobs by picking random subjects from a
@@ -213,8 +225,6 @@ optional arguments:
Remove prepared jobs
-m, --match-existing-main-subjects
Match from list of 136.000 already used main subjects on other scientific articles
-w, --limit-to-items-without-p921
Limit matching to scientific articles without P921 main subject
-su, --show-search-urls
Show an extra column in the table of search strings with links
-iu, --show-item-urls
@@ -240,6 +250,15 @@ removed the QuickStatements export to simplify the program.
* This project has been used in a scientific paper I wrote together with
[Houcemeddine Turki](https://scholia.toolforge.org/author/Q53505397)

## Rewrite 2022:
* It is important to break methods down to one method, one task, to increase readability -> this helps reuse in other projects (see the sketch after this list).
* It is important to avoid resetting attributes and to instantiate classes instead -> this helps reuse in other projects.
* Simplify as much as possible to keep the whole thing lean and avoid scope creep -> this helps reuse in other projects (KISS principle).
* It is difficult to judge which features are used and which are not. User testing would be nice.
* UML diagrams are nice. They give a good, quick overview.
* Removing options that no one seems to use helps keep the tool simple. It would be valuable to get better insight into how
the program is used. A discussion on GitHub might help with this.
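
A minimal Python sketch of the first two principles above — one method, one task, and instantiating small classes instead of resetting attributes. All class and method names here are hypothetical illustrations, not code from this repository:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryResult:
    """A small value object: a fresh instance is created per query
    instead of resetting attributes on a long-lived object."""
    qids: List[str] = field(default_factory=list)


@dataclass
class LabelMatcher:
    """Each method does exactly one task, which keeps the class readable
    and easier to reuse in other projects."""
    label: str

    def build_search_string(self) -> str:
        # one task: build the search string for this label
        return f'"{self.label}"'

    def execute(self, search_string: str) -> QueryResult:
        # one task: run the search (stubbed here) and wrap the outcome
        # in a new QueryResult rather than mutating self
        return QueryResult(qids=["Q1", "Q2"])

    def run(self) -> QueryResult:
        # one task: orchestrate the two steps above
        return self.execute(self.build_search_string())


print(LabelMatcher(label="dieback").run().qids)  # ['Q1', 'Q2']
```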

# Thanks
During the development of this tool the author got help
multiple times from **Jan Ainali** and **Jon Søby**
@@ -254,7 +273,7 @@ helpful people in the Wikimedia Cloud Services Support chat that
helped with making batch jobs run successfully.

Thanks also to **jsamwrites** for help with testing and suggestions
for improvement.
for improvement and for using the tool to improve a ton of items :).

# License
GPLv3+
28 changes: 26 additions & 2 deletions config/__init__example.py → config.example.py
@@ -1,14 +1,14 @@
import logging
import tempfile
from typing import List

# Rename this file to __init__.py

# Add your botpassword and login here:

username = ""
password = ""

# Settings
# General settings
automatically_approve_jobs_with_less_than_fifty_matches = False
loglevel = logging.WARNING
wiki_user = "User:Username" # Change this to your username
@@ -21,3 +21,27 @@
# This should work for all platforms except kubernetes
job_pickle_file_path = f"{tempfile.gettempdir()}/pickle.dat"
# job_pickle_file_path = "~/pickle.dat" # works on kubernetes

"""
Settings for items
"""

list_of_allowed_aliases: List[str] = [] # Add elements like this ["API"]

# Scholarly items settings
blocklist_for_scholarly_items: List[str] = [
"Q28196260",
"Q28196260",
"Q28196266", # iodine
"Q27863114", # testosterone
"Q28196266",
"Q28196260",
"Q109270553", # dieback
]
no_alias_for_scholarly_items: List[str] = [
"Q407541",
"Q423930",
"Q502327",
"Q416950",
"Q95566669", # hypertension
]
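
A hedged sketch of how these new scholarly-item settings could be consumed. The helper functions below are illustrative assumptions for this write-up, not functions introduced by this commit:

```python
import config  # the config.py you created from config.example.py


def is_blocked_main_subject(qid: str) -> bool:
    """Skip main-subject candidates the maintainer has blocklisted."""
    return qid in config.blocklist_for_scholarly_items


def aliases_allowed_for(qid: str, alias: str) -> bool:
    """Search aliases only when the item is not in no_alias_for_scholarly_items,
    unless the alias itself is explicitly allowed."""
    if alias in config.list_of_allowed_aliases:
        return True
    return qid not in config.no_alias_for_scholarly_items


print(is_blocked_main_subject("Q27863114"))     # True: testosterone is blocklisted
print(aliases_allowed_for("Q95566669", "HTN"))  # False: hypertension aliases are skipped
```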
24 changes: 0 additions & 24 deletions config/items.py

This file was deleted.

111 changes: 67 additions & 44 deletions diagrams/classes.puml
@@ -46,17 +46,44 @@ package wikimedia {
class EntityID{
letter: WikidataNamespaceLetters
rest: str
__init__()
__str__()
}
class ForeignID{
__init__()
abstract class Query{
__execute__()
__parse_results__()
__prepare_and_build_query__()
__strip_bad_chars__()
get_results()
print_number_of_results()
}
class PreprintArticleQuery {
__prepare_and_build_query__()
}
class RiksdagenDocumentQuery {
__prepare_and_build_query__()
}
class PublishedArticleQuery {
__build_query__()
__check_we_got_everything_we_need__()
__prepare_and_build_query__()
__setup_cirrussearch_params__()
}
class SparqlItem{
item: Value
itemLabel: Value
validate_qid_and_copy_label()
}
class MainSubjectItem {
item: Item = None
search_strings: List[str] = None
task: Task = None
args: argparse.Namespace = None
__init__()
__str__()
add_to_items()
extract_search_strings()
search_urls ())
}
class Item{
label: Optional[str] = None
description: Optional[str] = None
@@ -84,59 +111,47 @@ package wikimedia {
SUPINE
THIRD_PERSON_SINGULAR
}
enum WikidataLexicalCategory {
ADJECTIVE
ADVERB
AFFIX
NOUN
PROPER_NOUN
VERB
}
enum WikidataNamespaceLetters {
ITEM
LEXEME
PROPERTY
}
' enum WikidataLexicalCategory {
' ADJECTIVE
' ADVERB
' AFFIX
' NOUN
' PROPER_NOUN
' VERB
' }
' enum WikidataNamespaceLetters {
' ITEM
' LEXEME
' PROPERTY
' }
}
}
package items {
abstract class Items
class AcademicJournalItems {
fetch_based_on_label()
abstract class Items {
execute_queries()
fetch_based_on_label()
number_of_sparql_items()
print_items_list()
print_total_items()
random_shuffle_items()
remove_duplicates()
}
class RiksdagenDocumentItems {
+list
+fetch_based_on_label()
execute_queries()
fetch_based_on_label()
}

class ScholarlyArticleItems {
+list
+fetch_based_on_label()
}
class ThesisItems {
list
fetch_based_on_label()
execute_queries()
fetch_based_on_label()
}
}
class Suggestion {
item: Item = None
search_strings: List[str] = None
task: Task = None
args: argparse.Namespace = None
__init__()
__str__()
add_to_items()
extract_search_strings()
search_urls ())
}

class Task {
best_practice_information: Union[str, None] = None
id: TaskIds = None
label: str = None
language_code: SupportedLanguageCode = None
number_of_queries_per_search_string = 1
__init__()
__str__()
}

@@ -152,18 +167,26 @@
class BatchJob {
+items: Items
run()
}

Items <|-- AcademicJournalItems
class ItemSubjector {
export_jobs_to_dataframe()
match_main_subjects_from_sparql()
run()
}
'Items <|-- AcademicJournalItems
Items <|-- RiksdagenDocumentItems
Items <|-- ScholarlyArticleItems
Items <|-- ThesisItems
'Items <|-- ThesisItems
BaseModel <|-- Entity
BaseModel <|-- Task
BaseModel <|-- Suggestion
BaseModel <|-- BatchJob
BaseModel <|-- BatchJobs
BaseModel <|-- Items
BaseModel <|-- ItemSubjector
Entity <|-- Item
Item <|-- SparqlItem
Item <|-- MainSubjectItem
Query <|-- PreprintArticleQuery
Query <|-- PublishedArticleQuery
Query <|-- RiksdagenDocumentQuery

@enduml
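
A minimal Python sketch of the inheritance structure shown in the updated diagram. The class and method names follow the diagram; the bodies, fields, and defaults are illustrative stubs rather than the project's real implementation:

```python
from typing import List, Optional

from pydantic import BaseModel


class Query:
    """Query base class: subclasses supply the query-building step."""

    def __init__(self) -> None:
        self.results: List[str] = []

    def __prepare_and_build_query__(self) -> str:
        raise NotImplementedError()

    def get_results(self) -> List[str]:
        # build the query (execution is stubbed) and return whatever was parsed
        self.__prepare_and_build_query__()
        return self.results

    def print_number_of_results(self) -> None:
        print(f"Found {len(self.results)} items")


class PreprintArticleQuery(Query):
    def __prepare_and_build_query__(self) -> str:
        return "SELECT ?item WHERE { ... }"  # placeholder, not the real SPARQL


class Entity(BaseModel):
    pass


class Item(Entity):
    label: Optional[str] = None
    description: Optional[str] = None


class SparqlItem(Item):
    def validate_qid_and_copy_label(self) -> None:
        ...  # stub


class MainSubjectItem(Item):
    search_strings: List[str] = []


PreprintArticleQuery().print_number_of_results()  # Found 0 items
```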
12 changes: 11 additions & 1 deletion diagrams/sequence_sparql.puml
@@ -15,7 +15,12 @@ alt "arguments: sparql && limit"
ItemSubjector -> Wikidata : fetch scientific articles according to SPARQL query built based on the details
Wikidata -> ItemSubjector : response
ItemSubjector -> User : present max 50 items
alt auto-approve < 50 items enabled
ItemSubjector -> User : auto-approving batch
end
alt auto-approve < 50 items enabled OR > 50 items
ItemSubjector -> User : ask for approval of batch
end
ItemSubjector -> User : show count of batches and matches in the job list in memory
end
alt "above limit"
@@ -36,8 +41,13 @@ alt "arguments: sparql && limit && prepare-jobs"
ItemSubjector -> Wikidata : fetch scientific articles according to SPARQL query built based on the details
Wikidata -> ItemSubjector : response
ItemSubjector -> User : present max 50 items
alt auto-approve < 50 items enabled
ItemSubjector -> User : auto-approving batch
end
alt auto-approve < 50 items enabled OR > 50 items
ItemSubjector -> User : ask for approval of batch
ItemSubjector -> User : show count of batches and matches in the job list in memory
end
ItemSubjector -> User : show count of batches and matches in the job list in memory
end
alt "above limit"
ItemSubjector -> User : ask before continuing
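
A hedged Python sketch of the approval branching the updated sequence diagram describes. The second alt condition is read here as "auto-approve disabled or more than 50 items"; the function and variable names are assumptions, while the 50-item threshold comes from the `automatically_approve_jobs_with_less_than_fifty_matches` setting shown earlier:

```python
from typing import List

# stand-in for config.automatically_approve_jobs_with_less_than_fifty_matches
auto_approve_small_batches = True


def approve_batch(items: List[str]) -> bool:
    """Auto-approve small batches when enabled; otherwise ask the user."""
    if auto_approve_small_batches and len(items) < 50:
        print("auto-approving batch")
        return True
    answer = input(f"Approve this batch of {len(items)} items? [y/n] ")
    return answer.strip().lower() == "y"
```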