Fix/pq dataset typology #282

Open
wants to merge 58 commits into base: main

Commits (58)
3bfc2cc
Use passed dataset name typology from the spec.
cjohns-scottlogic Nov 25, 2024
df661b6
Changed dataset used in test.
cjohns-scottlogic Nov 25, 2024
41c112e
Ensure 'organisation' is not included in json_fields
alexglasertpx Nov 25, 2024
ce10425
Ran black on code
alexglasertpx Nov 25, 2024
133d577
Print statements to search where _geoms are
alexglasertpx Nov 25, 2024
96fa0bd
Removed '_geom' from null fields statement
alexglasertpx Nov 25, 2024
e956449
Removed print statements
alexglasertpx Nov 25, 2024
b0b391c
Altered code as '_geom' columns no longer in output
alexglasertpx Nov 25, 2024
b8c5b04
Commented out parquet commands to check old sqlite outputs
alexglasertpx Nov 26, 2024
4b5cfab
Added parquet commands back in
alexglasertpx Nov 26, 2024
c382ae9
Added print statement to find where 'organisation' is in dataset_dump_…
alexglasertpx Nov 26, 2024
aab2246
Filtered 'field_names'
alexglasertpx Nov 26, 2024
dc14e9f
Print every row for debug purposes
alexglasertpx Nov 26, 2024
09a34e8
More print statements for debug purposes
alexglasertpx Nov 26, 2024
cdbabe7
More print statements for debug purposes
alexglasertpx Nov 26, 2024
cfccd9c
Add dataset to parquet path.
cjohns-scottlogic Nov 26, 2024
887eedc
Merge branch 'fix/pq-dataset-typology' of github-second.com:digital-l…
alexglasertpx Nov 26, 2024
b56e774
More print statements for debug purposes
alexglasertpx Nov 26, 2024
c48729f
More print statements for debug purposes
alexglasertpx Nov 26, 2024
6168a8e
More print statements for debug purposes
alexglasertpx Nov 26, 2024
c66e231
Updated test.
cjohns-scottlogic Nov 26, 2024
88c4f92
Fixed black issues.
cjohns-scottlogic Nov 26, 2024
9e1e384
More print statements for debugging
alexglasertpx Nov 26, 2024
7c5d407
Merge branch 'fix/pq-dataset-typology' of github-second.com:digital-l…
alexglasertpx Nov 26, 2024
c3b2e88
Use dataset name in duckdb file.
cjohns-scottlogic Nov 26, 2024
ac111df
More print statements for debugging
alexglasertpx Nov 26, 2024
e649379
More print statements for debugging
alexglasertpx Nov 26, 2024
7d34d4c
More print statements for debugging
alexglasertpx Nov 26, 2024
6df361a
More print statements for debugging
alexglasertpx Nov 26, 2024
ce7e58c
More print statements for debugging
alexglasertpx Nov 26, 2024
d689d95
More print statements for debugging
alexglasertpx Nov 26, 2024
1f492ec
Trying os.environ in subprocess
alexglasertpx Nov 26, 2024
a93cde0
Trying os.environ in subprocess
alexglasertpx Nov 26, 2024
7dea1c9
Trying os.environ in subprocess
alexglasertpx Nov 26, 2024
3ac12f1
Added print statements to debug
alexglasertpx Nov 26, 2024
4ff0972
Insert into the SQLite table rather than recreate it.
cjohns-scottlogic Nov 27, 2024
53a4b3d
Trying garbage collect
alexglasertpx Nov 27, 2024
b0cfdef
Merge branch 'fix/pq-dataset-typology' of github-second.com:digital-l…
alexglasertpx Nov 27, 2024
7afa43f
Trying garbage collect
alexglasertpx Nov 27, 2024
a1d22f8
Added print statements to debug
alexglasertpx Nov 27, 2024
7545342
Replace empty json with NULL
cjohns-scottlogic Nov 27, 2024
0d29d72
Get schema from specification
cjohns-scottlogic Nov 27, 2024
12a59a4
Updated tests.
cjohns-scottlogic Nov 27, 2024
408fdf3
Replace empty data with blank strings to match sqlite version.
cjohns-scottlogic Nov 27, 2024
130aade
Put the duckdb file in the cache.
cjohns-scottlogic Nov 27, 2024
3ab2e72
Tests relating to missing points
alexiglaser Nov 29, 2024
f19304b
Fix json field names.
cjohns-scottlogic Nov 29, 2024
cb8564e
Don't try to compute point if geometry is blank.
cjohns-scottlogic Nov 29, 2024
8b08b2b
Reduce the computed points to 6dp
cjohns-scottlogic Nov 29, 2024
9db4bbb
Added new tests and edited point data
alexiglaser Nov 29, 2024
6bf3c2e
black
cjohns-scottlogic Nov 29, 2024
63abd9b
Removed print statements
alexiglaser Nov 29, 2024
8da6d75
Merge branch 'fix/pq-dataset-typology' of github-second.com:digital-l…
alexiglaser Nov 29, 2024
ab3904a
Using row_number to split ties
alexiglaser Nov 29, 2024
28c00c6
Removing row_number
alexiglaser Nov 29, 2024
69ba0ad
Added an end date to choice of entity and field
alexiglaser Dec 2, 2024
2f9000e
Updated SQL
cjohns-scottlogic Dec 2, 2024
a079ccb
Added resource end_date
alexiglaser Dec 2, 2024
3 changes: 3 additions & 0 deletions digital_land/cli.py
@@ -142,6 +142,7 @@ def convert_cmd(input_path, output_path):
@dataset_resource_dir
@issue_dir
@click.option("--cache-dir", type=click.Path(), default="var/cache/parquet")
@click.option("--resource-path", type=click.Path(), default="collection/resource.csv")
@click.argument("input-paths", nargs=-1, type=click.Path(exists=True))
@click.pass_context
def dataset_create_cmd(
@@ -153,6 +154,7 @@ def dataset_create_cmd(
dataset_resource_dir,
issue_dir,
cache_dir,
resource_path,
):
return dataset_create(
input_paths=input_paths,
@@ -165,6 +167,7 @@ def dataset_create_cmd(
dataset_resource_dir=dataset_resource_dir,
issue_dir=issue_dir,
cache_dir=cache_dir,
resource_path=resource_path,
)


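For context, a minimal sketch of exercising the new --resource-path option through click's test runner. The command name, entry group, and input path below are illustrative assumptions; only --cache-dir, --resource-path, and their defaults come from this diff:

```python
# Hypothetical invocation sketch; command/group names and paths are assumptions.
from click.testing import CliRunner

from digital_land.cli import cli  # assumed top-level click group

runner = CliRunner()
result = runner.invoke(
    cli,
    [
        "dataset-create",                              # assumed command name
        "--cache-dir", "var/cache/parquet",            # default from this diff
        "--resource-path", "collection/resource.csv",  # new option, default from this diff
        "transformed/example.csv",                     # illustrative input path
    ],
)
print(result.exit_code)
```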
36 changes: 26 additions & 10 deletions digital_land/commands.py
@@ -12,6 +12,8 @@
import geojson
import shapely

import subprocess

from digital_land.package.organisation import OrganisationPackage
from digital_land.check import duplicate_reference_check
from digital_land.specification import Specification
@@ -359,7 +361,10 @@ def dataset_create(
column_field_dir="var/column-field",
dataset_resource_dir="var/dataset-resource",
cache_dir="var/cache/parquet",
resource_path="collection/resource.csv",
):
cache_dir = os.path.join(cache_dir, dataset)

if not output_path:
print("missing output path", file=sys.stderr)
sys.exit(2)
@@ -402,20 +407,22 @@

pqpackage = DatasetParquetPackage(
dataset,
organisation=organisation,
path=output_path,
input_paths=input_paths,
cache_dir=cache_dir,
resource_path=resource_path,
specification_dir=None, # TBD: package should use this specification object
)
pqpackage.create_temp_table(input_paths)
pqpackage.load_facts(input_paths, cache_dir)
pqpackage.load_fact_resource(input_paths, cache_dir)
pqpackage.load_entities(input_paths, cache_dir, organisation_path)
pqpackage.pq_to_sqlite(output_path, cache_dir)
pqpackage.load_facts()
pqpackage.load_fact_resource()
pqpackage.load_entities()
pqpackage.pq_to_sqlite()
pqpackage.close_conn()
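The calls above show the refactor's shape: the organisation, cache dir, and resource path are injected once through the DatasetParquetPackage constructor, so the load methods no longer take arguments. A minimal sketch of what that implies, with assumed attribute names and elided bodies, not code from this PR:

```python
# Sketch of the constructor-injection pattern implied by the calls above;
# attribute names and method bodies are assumptions for illustration only.
class DatasetParquetPackage:
    def __init__(
        self, dataset, organisation, path, input_paths,
        cache_dir, resource_path, specification_dir=None,
    ):
        # Capture everything once so each load step can stay argument-free.
        self.dataset = dataset
        self.organisation = organisation
        self.path = path
        self.input_paths = input_paths
        self.cache_dir = cache_dir
        self.resource_path = resource_path

    def load_facts(self):
        ...  # would read self.input_paths and write parquet under self.cache_dir
```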


def dataset_dump(input_path, output_path):
cmd = f"sqlite3 -header -csv {input_path} 'select * from entity;' > {output_path}"
cmd = f"sqlite3 -header -csv {input_path} 'select * from entity order by entity;' > {output_path}"
logging.info(cmd)
os.system(cmd)
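Adding order by entity makes the dumped CSV deterministic, so packages built via parquet and via the old sqlite path can be diffed directly. For illustration, a hedged equivalent using the sqlite3 and csv standard-library modules rather than shelling out; this is not the implementation used here:

```python
# Sketch of an equivalent, shell-free dump; dataset_dump above uses the sqlite3 CLI.
import csv
import sqlite3

def dump_entity_csv(input_path, output_path):
    conn = sqlite3.connect(input_path)
    cursor = conn.execute("select * from entity order by entity;")
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)  # rows in deterministic entity order
    conn.close()
```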

@@ -427,7 +434,7 @@ def dataset_dump_flattened(csv_path, flattened_dir, specification, dataset):
elif isinstance(csv_path, Path):
dataset_name = csv_path.stem
else:
logging.error(f"Can't extract datapackage name from {csv_path}")
logging.error(f"Can't extract datapackage name from {csv_path}")
sys.exit(-1)

flattened_csv_path = os.path.join(flattened_dir, f"{dataset_name}.csv")
@@ -474,6 +481,7 @@ def dataset_dump_flattened(csv_path, flattened_dir, specification, dataset):
batch_size = 100000
temp_geojson_files = []
geography_entities = [e for e in entities if e["typology"] == "geography"]

for i in range(0, len(geography_entities), batch_size):
batch = geography_entities[i : i + batch_size]
feature_collection = process_data_in_batches(batch, flattened_dir, dataset_name)
@@ -488,6 +496,13 @@ def dataset_dump_flattened(csv_path, flattened_dir, specification, dataset):

if all(os.path.isfile(path) for path in temp_geojson_files):
rfc7946_geojson_path = os.path.join(flattened_dir, f"{dataset_name}.geojson")
env = os.environ.copy()

out, _ = subprocess.Popen(
["ogr2ogr", "--version"],
stdout=subprocess.PIPE,
stderr=subprocess.DEVNULL,
).communicate()
env = (
dict(os.environ, OGR_GEOJSON_MAX_OBJ_SIZE="0")
if get_gdal_version() >= Version("3.5.2")
@@ -892,9 +907,10 @@ def process_data_in_batches(entities, flattened_dir, dataset_name):
logging.error(f"Error loading wkt from entity {entity['entity']}")
logging.error(e)
else:
logging.error(
f"No geometry or point data for entity {entity['entity']} with typology 'geography'"
)
pass
# logging.error(
# f"No geometry or point data for entity {entity['entity']} with typology 'geography'"
# )

if features:
feature_collection = geojson.FeatureCollection(
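The new env handling in dataset_dump_flattened gates OGR_GEOJSON_MAX_OBJ_SIZE="0" on the GDAL version, and the ogr2ogr --version probe suggests how get_gdal_version is obtained. A sketch of one way that helper could work; its actual implementation is not shown in this diff:

```python
# Assumed shape of get_gdal_version; parses output like
# "GDAL 3.5.2, released 2022/09/02" from `ogr2ogr --version`.
import re
import subprocess

from packaging.version import Version

def get_gdal_version() -> Version:
    out = subprocess.run(
        ["ogr2ogr", "--version"], capture_output=True, text=True, check=True
    ).stdout
    return Version(re.search(r"(\d+\.\d+\.\d+)", out).group(1))
```

Setting OGR_GEOJSON_MAX_OBJ_SIZE to 0 lifts the GeoJSON driver's default object-size cap, which is why the diff applies it only on the newer GDAL versions that enforce the limit.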