This repository has been archived by the owner on Jan 23, 2024. It is now read-only.

Merge pull request #71 from dpriskorn/rewrite_of_items_classes
Rewrite classes for better readability and debugging
dpriskorn authored Oct 6, 2022
2 parents 977d834 + 54f2c9e commit aa2ec79
Showing 48 changed files with 1,693 additions and 1,379 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/lint_python.yml
@@ -17,7 +17,7 @@ jobs:
- run: poetry install --with=dev
- run: poetry run bandit --recursive --skip B301,B105,B403,B311,B101,B324 src # B101 is assert statements
- run: poetry run black --check .
- run: poetry run codespell # --ignore-words-list="" --skip="*.css,*.js,*.lock"
- run: poetry run codespell # --ignore-words-sparql_items="" --skip="*.css,*.js,*.lock"
- run: poetry run flake8 --ignore=C408,C416,E203,F401,F541,R501,R502,R503,R504,W503
--max-complexity=21 --max-line-length=162 --show-source --statistics .
- run: poetry run isort --check-only --profile black .
4 changes: 2 additions & 2 deletions .gitignore
@@ -1,2 +1,2 @@
config/__init__.py
pickle.dat
pickle.dat
config.py
37 changes: 28 additions & 9 deletions README.md
@@ -17,6 +17,11 @@ open graph editable by anyone and maintained by the community itself for the purpose of helping
scientists find each other's work. Wikipedia and Scholia can fill that gap but we need good tooling to curate the
millions of items.

# Caveat
This type of matching, which takes ONLY the label and not the underlying structured
data into account, is SUBOPTIMAL. You are very welcome to suggest or contribute improvements
so we can improve the tool and help you make better edits.

# Features
This tool has the following features:
* Adding a list of manually supplied main subjects to a few selected subgraphs
@@ -34,14 +39,19 @@ so that batches can easily be undone later if needed.
Click "details" in the summary of edits to see more.

# Installation
Download the latest release with:

`$ pip install itemsubjector`

# Alternative installation in venv
Download the release tarball or clone the tool using Git.

## Clone the repository
`git clone https://github.com/dpriskorn/ItemSubjector.git && cd ItemSubjector`

Then checkout the latest release.

`git checkout v0.x` where x is the latest number on the release page.
`git checkout vx.x.x` where x is the latest number on the release page.

## Setup the environment

@@ -72,6 +82,8 @@ issues.


## Wikimedia Cloud Services Kubernetes Beta cluster
*Note: this is for advanced users experienced with an SSH console environment; ask in the [Telegram WikiCite group](https://meta.m.wikimedia.org/wiki/Telegram#Wikidata) if you need help*

See [Kubernetes_HOWTO.md](Kubernetes_HOWTO.md)

# Setup
@@ -82,7 +94,7 @@ config/__init__.py and enter the botusername
for your account
and make sure you give it the *edit page permission*
and *high volume permissions*)
* e.g. `cd config && cp __init__example.py __init__.py && nano __init__.py`
* e.g. `cp config_example.py config.py && nano config.py`

*GNU Nano is an editor, press `ctrl+x` when you are done and `y` to save your changes*

@@ -148,10 +160,10 @@ Usage example:
`poetry run python itemsubjector.py -a Q34 --show-item-urls`
(the shorthand `-iu` also works)

### Limit to scholarly articles without main subject
Usage example:
`poetry run python itemsubjector.py -a Q34 --limit-to-items-without-p921`
(the shorthand `-w` also works)
[//]: # (### Limit to scholarly articles without main subject)
[//]: # (Usage example:)
[//]: # (`poetry run python itemsubjector.py -a Q34 --limit-to-items-without-p921` )
[//]: # ((the shorthand `-w` also works))

## Matching main subjects based on a SPARQL query.
The tool can create a list of jobs by picking random subjects from a
@@ -213,8 +225,6 @@ optional arguments:
Remove prepared jobs
-m, --match-existing-main-subjects
Match from list of 136.000 already used main subjects on other scientific articles
-w, --limit-to-items-without-p921
Limit matching to scientific articles without P921 main subject
-su, --show-search-urls
Show an extra column in the table of search strings with links
-iu, --show-item-urls
@@ -240,6 +250,15 @@ removed the QuickStatements export to simplify the program.
* This project has been used in a scientific paper I wrote together with
[Houcemeddine Turki](https://scholia.toolforge.org/author/Q53505397)

## Rewrite 2022:
* It is important to break methods down to one method, one task, to increase readability -> this helps reuse in other projects (see the sketch after this list).
* It is important to avoid resetting attributes and to instantiate classes instead -> this helps reuse in other projects.
* Simplify as much as possible to keep the whole thing lean and avoid scope creep -> this helps reuse in other projects (KISS principle).
* It is difficult to judge which features are used and which are not. User testing would be nice.
* UML diagrams are nice. They give a good, quick overview.
* Removing options that no one seems to use helps keep the tool simple. It would be valuable to get better insight into how
the program is used. A discussion on GitHub might help with this.
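
A minimal Python sketch of the first two principles above — one method, one task, and instantiating small classes instead of resetting attributes. All class and method names here are hypothetical illustrations, not code from this repository:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryResult:
    """A small value object: a fresh instance is created per query
    instead of resetting attributes on a long-lived object."""
    qids: List[str] = field(default_factory=list)


@dataclass
class LabelMatcher:
    """Each method does exactly one task, which keeps the class readable
    and easier to reuse in other projects."""
    label: str

    def build_search_string(self) -> str:
        # one task: build the search string for this label
        return f'"{self.label}"'

    def execute(self, search_string: str) -> QueryResult:
        # one task: run the search (stubbed here) and wrap the outcome
        # in a new QueryResult rather than mutating self
        return QueryResult(qids=["Q1", "Q2"])

    def run(self) -> QueryResult:
        # one task: orchestrate the two steps above
        return self.execute(self.build_search_string())


print(LabelMatcher(label="dieback").run().qids)  # ['Q1', 'Q2']
```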

# Thanks
During the development of this tool the author got help
multiple times from **Jan Ainali** and **Jon Søby**
@@ -254,7 +273,7 @@ helpful people in the Wikimedia Cloud Services Support chat that
helped with making batch jobs run successfully.

Thanks also to **jsamwrites** for help with testing and suggestions
for improvement.
for improvement and for using the tool to improve a ton of items :).

# License
GPLv3+
28 changes: 26 additions & 2 deletions config/__init__example.py → config.example.py
@@ -1,14 +1,14 @@
import logging
import tempfile
from typing import List

# Rename this file to __init__.py

# Add your botpassword and login here:

username = ""
password = ""

# Settings
# General settings
automatically_approve_jobs_with_less_than_fifty_matches = False
loglevel = logging.WARNING
wiki_user = "User:Username" # Change this to your username
@@ -21,3 +21,27 @@
# This should work for all platforms except kubernetes
job_pickle_file_path = f"{tempfile.gettempdir()}/pickle.dat"
# job_pickle_file_path = "~/pickle.dat" # works on kubernetes

"""
Settings for items
"""

list_of_allowed_aliases: List[str] = [] # Add elements like this ["API"]

# Scholarly items settings
blocklist_for_scholarly_items: List[str] = [
"Q28196260",
"Q28196260",
"Q28196266", # iodine
"Q27863114", # testosterone
"Q28196266",
"Q28196260",
"Q109270553", # dieback
]
no_alias_for_scholarly_items: List[str] = [
"Q407541",
"Q423930",
"Q502327",
"Q416950",
"Q95566669", # hypertension
]
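
A hedged sketch of how these new scholarly-item settings could be consumed. The helper functions below are illustrative assumptions for this write-up, not functions introduced by this commit:

```python
import config  # the config.py you created from config.example.py


def is_blocked_main_subject(qid: str) -> bool:
    """Skip main-subject candidates the maintainer has blocklisted."""
    return qid in config.blocklist_for_scholarly_items


def aliases_allowed_for(qid: str, alias: str) -> bool:
    """Search aliases only when the item is not in no_alias_for_scholarly_items,
    unless the alias itself is explicitly allowed."""
    if alias in config.list_of_allowed_aliases:
        return True
    return qid not in config.no_alias_for_scholarly_items


print(is_blocked_main_subject("Q27863114"))     # True: testosterone is blocklisted
print(aliases_allowed_for("Q95566669", "HTN"))  # False: hypertension aliases are skipped
```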
24 changes: 0 additions & 24 deletions config/items.py

This file was deleted.

111 changes: 67 additions & 44 deletions diagrams/classes.puml
@@ -46,17 +46,44 @@ package wikimedia {
class EntityID{
letter: WikidataNamespaceLetters
rest: str
__init__()
__str__()
}
class ForeignID{
__init__()
abstract class Query{
__execute__()
__parse_results__()
__prepare_and_build_query__()
__strip_bad_chars__()
get_results()
print_number_of_results()
}
class PreprintArticleQuery {
__prepare_and_build_query__()
}
class RiksdagenDocumentQuery {
__prepare_and_build_query__()
}
class PublishedArticleQuery {
__build_query__()
__check_we_got_everything_we_need__()
__prepare_and_build_query__()
__setup_cirrussearch_params__()
}
class SparqlItem{
item: Value
itemLabel: Value
validate_qid_and_copy_label()
}
class MainSubjectItem {
item: Item = None
search_strings: List[str] = None
task: Task = None
args: argparse.Namespace = None
__init__()
__str__()
add_to_items()
extract_search_strings()
search_urls ())
}
class Item{
label: Optional[str] = None
description: Optional[str] = None
@@ -84,59 +111,47 @@ package wikimedia {
SUPINE
THIRD_PERSON_SINGULAR
}
enum WikidataLexicalCategory {
ADJECTIVE
ADVERB
AFFIX
NOUN
PROPER_NOUN
VERB
}
enum WikidataNamespaceLetters {
ITEM
LEXEME
PROPERTY
}
' enum WikidataLexicalCategory {
' ADJECTIVE
' ADVERB
' AFFIX
' NOUN
' PROPER_NOUN
' VERB
' }
' enum WikidataNamespaceLetters {
' ITEM
' LEXEME
' PROPERTY
' }
}
}
package items {
abstract class Items
class AcademicJournalItems {
fetch_based_on_label()
abstract class Items {
execute_queries()
fetch_based_on_label()
number_of_sparql_items()
print_items_list()
print_total_items()
random_shuffle_items()
remove_duplicates()
}
class RiksdagenDocumentItems {
+list
+fetch_based_on_label()
execute_queries()
fetch_based_on_label()
}

class ScholarlyArticleItems {
+list
+fetch_based_on_label()
}
class ThesisItems {
list
fetch_based_on_label()
execute_queries()
fetch_based_on_label()
}
}
class Suggestion {
item: Item = None
search_strings: List[str] = None
task: Task = None
args: argparse.Namespace = None
__init__()
__str__()
add_to_items()
extract_search_strings()
search_urls ())
}

class Task {
best_practice_information: Union[str, None] = None
id: TaskIds = None
label: str = None
language_code: SupportedLanguageCode = None
number_of_queries_per_search_string = 1
__init__()
__str__()
}

@@ -152,18 +167,26 @@
class BatchJob {
+items: Items
run()
}

Items <|-- AcademicJournalItems
class ItemSubjector {
export_jobs_to_dataframe()
match_main_subjects_from_sparql()
run()
}
'Items <|-- AcademicJournalItems
Items <|-- RiksdagenDocumentItems
Items <|-- ScholarlyArticleItems
Items <|-- ThesisItems
'Items <|-- ThesisItems
BaseModel <|-- Entity
BaseModel <|-- Task
BaseModel <|-- Suggestion
BaseModel <|-- BatchJob
BaseModel <|-- BatchJobs
BaseModel <|-- Items
BaseModel <|-- ItemSubjector
Entity <|-- Item
Item <|-- SparqlItem
Item <|-- MainSubjectItem
Query <|-- PreprintArticleQuery
Query <|-- PublishedArticleQuery
Query <|-- RiksdagenDocumentQuery

@enduml
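
A minimal Python sketch of the inheritance structure shown in the updated diagram. The class and method names follow the diagram; the bodies, fields, and defaults are illustrative stubs rather than the project's real implementation:

```python
from typing import List, Optional

from pydantic import BaseModel


class Query:
    """Query base class: subclasses supply the query-building step."""

    def __init__(self) -> None:
        self.results: List[str] = []

    def __prepare_and_build_query__(self) -> str:
        raise NotImplementedError()

    def get_results(self) -> List[str]:
        # build the query (execution is stubbed) and return whatever was parsed
        self.__prepare_and_build_query__()
        return self.results

    def print_number_of_results(self) -> None:
        print(f"Found {len(self.results)} items")


class PreprintArticleQuery(Query):
    def __prepare_and_build_query__(self) -> str:
        return "SELECT ?item WHERE { ... }"  # placeholder, not the real SPARQL


class Entity(BaseModel):
    pass


class Item(Entity):
    label: Optional[str] = None
    description: Optional[str] = None


class SparqlItem(Item):
    def validate_qid_and_copy_label(self) -> None:
        ...  # stub


class MainSubjectItem(Item):
    search_strings: List[str] = []


PreprintArticleQuery().print_number_of_results()  # Found 0 items
```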
12 changes: 11 additions & 1 deletion diagrams/sequence_sparql.puml
@@ -15,7 +15,12 @@ alt "arguments: sparql && limit"
ItemSubjector -> Wikidata : fetch scientific articles according to SPARQL query built based on the details
Wikidata -> ItemSubjector : response
ItemSubjector -> User : present max 50 items
alt auto-approve < 50 items enabled
ItemSubjector -> User : auto-approving batch
end
alt auto-approve < 50 items enabled OR > 50 items
ItemSubjector -> User : ask for approval of batch
end
ItemSubjector -> User : show count of batches and matches in the job list in memory
end
alt "above limit"
@@ -36,8 +41,13 @@ alt "arguments: sparql && limit && prepare-jobs"
ItemSubjector -> Wikidata : fetch scientific articles according to SPARQL query built based on the details
Wikidata -> ItemSubjector : response
ItemSubjector -> User : present max 50 items
alt auto-approve < 50 items enabled
ItemSubjector -> User : auto-approving batch
end
alt auto-approve < 50 items enabled OR > 50 items
ItemSubjector -> User : ask for approval of batch
ItemSubjector -> User : show count of batches and matches in the job list in memory
end
ItemSubjector -> User : show count of batches and matches in the job list in memory
end
alt "above limit"
ItemSubjector -> User : ask before continuing
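
A hedged Python sketch of the approval branching the updated sequence diagram describes. The second alt condition is read here as "auto-approve disabled or more than 50 items"; the function and variable names are assumptions, while the 50-item threshold comes from the `automatically_approve_jobs_with_less_than_fifty_matches` setting shown earlier:

```python
from typing import List

# stand-in for config.automatically_approve_jobs_with_less_than_fifty_matches
auto_approve_small_batches = True


def approve_batch(items: List[str]) -> bool:
    """Auto-approve small batches when enabled; otherwise ask the user."""
    if auto_approve_small_batches and len(items) < 50:
        print("auto-approving batch")
        return True
    answer = input(f"Approve this batch of {len(items)} items? [y/n] ")
    return answer.strip().lower() == "y"
```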