Various small improvements #12

Merged: 8 commits, Jun 12, 2023


traines-source
Contributor

I originally set out to extend this project to support matching of DB-HAFAS stations against GTFS/OSM stations. For that, I had to alter the script quite a bit (see the fork here and the results here).

I chose this path because I thought that data from the OSM matching (specifically, the uic_ref and ref:IBNR tags, and the fact that OSM names are usually more similar to DB-HAFAS names) might benefit the HAFAS matching and, vice versa, that the HAFAS matching might benefit the OSM matching (since for train stations, there is an official mapping to IFOPT-IDs). While I still think this is true, it creates more problems than it solves, in particular because OSM/GTFS and your matching are much more fine-grained (platform level) than the DB-HAFAS-IDs (station/stop level).

While I have kept the behaviour and structure for OSM matching as stable as possible, only adding features/options for HAFAS matching, I still think the changes might be out of scope for this project.

So, instead, this PR for now contains only some generic, cherry-picked improvements that I think are definitely valuable upstream:

  • adding the missing Dockerfile matching the description in the README
  • extracting some magic numbers to config.py as per issue #1 (Extract magic numbers to config)
  • adding an alternative, very simple match picker that ensures that at least one candidate is picked for each ifopt_id and for each osm_id
  • setting GTFS parent station using IFOPT-ID hierarchy, since parent is often unset (only relevant for stop provider GTFS, and only influences match_state NO_MATCH_BUT_OTHER_PLATFORM_MATCHED)
  • replacing ICE/IC/EC line numbers with just an ICE/ECIC label, to make them easier to distinguish by eye from bus lines (since they have no prefix). (This is not totally clean, since some weird bus lines are also mapped to route_type 101 and 102 in the DELFI GTFS.)
  • ignoring SEV (rail replacement service) for mode determination from GTFS; otherwise, many train stations end up with mode bus. Sometimes, SEV is modeled with route_short_name SEV or similar, but sometimes there are entries for e.g. "RE3" that have route_type 3, i.e. bus.
  • if a stop/platform has multiple different modes calling, set mode to NULL, rather than returning something wrong (similar to behaviour for OSM modes)
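The "very simple match picker" from the list above could look roughly like the following. This is a minimal sketch of one possible reading, not the code merged in this PR: it keeps the best-rated candidate per ifopt_id and the best-rated candidate per osm_id and returns the union. The `Candidate` record and the function name `pick_best_per_side` are hypothetical.

```python
from collections import namedtuple

# Hypothetical candidate record; the field names are assumptions for this sketch.
Candidate = namedtuple("Candidate", ["ifopt_id", "osm_id", "rating"])


def pick_best_per_side(candidates):
    """Keep the best-rated candidate for every ifopt_id and for every osm_id."""
    best_by_ifopt = {}
    best_by_osm = {}
    for c in candidates:
        current = best_by_ifopt.get(c.ifopt_id)
        if current is None or c.rating > current.rating:
            best_by_ifopt[c.ifopt_id] = c
        current = best_by_osm.get(c.osm_id)
        if current is None or c.rating > current.rating:
            best_by_osm[c.osm_id] = c
    # Union of both selections; a candidate chosen on both sides appears once.
    picked = set(best_by_ifopt.values()) | set(best_by_osm.values())
    return sorted(picked)


if __name__ == "__main__":
    demo = [
        Candidate("a", "n1", 0.9),
        Candidate("a", "n2", 0.5),
        Candidate("b", "n1", 0.4),
        Candidate("b", "n2", 0.1),
    ]
    for match in pick_best_per_side(demo):
        print(match)
```

With this approach, a stop can end up in more than one picked pair, which is the price of guaranteeing coverage on both sides.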

The last two changes are IMO important because possible matches with a different, non-NULL mode are immediately discarded. This effect is much greater for stop provider GTFS, since in the DELFI GTFS, many (most?) trains aren't assigned to the correct platforms that exist in the ZHV, but to the top-level station IFOPT-ID, possibly postfixed with _G (compare this issue). These trains are therefore not taken into account for stop provider DELFI, and most train platforms won't have any mode anyway.
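The two mode-related changes can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: the `ROUTE_TYPE_TO_MODE` table, its labels, and the name-based SEV check are all hypothetical. In particular, the name check does not catch the "RE3 with route_type 3" case mentioned above; in this sketch such a stop simply ends up with mixed modes and therefore `None`.

```python
# Assumed mapping from GTFS route_type to a mode label. Only a few values are
# listed and the labels are illustrative, not the project's actual table.
ROUTE_TYPE_TO_MODE = {
    0: "light_rail",
    2: "train",
    3: "bus",
    101: "train",
    102: "train",
}


def looks_like_sev(route_short_name):
    """Heuristic (assumption): names containing 'SEV' are rail replacement."""
    return "SEV" in (route_short_name or "").upper()


def determine_mode(routes):
    """Derive one mode for a stop from (route_short_name, route_type) pairs.

    Returns a single mode label, or None if no usable route calls at the stop
    or if the remaining routes disagree on the mode.
    """
    modes = set()
    for short_name, route_type in routes:
        if looks_like_sev(short_name):
            continue  # ignore rail replacement services
        mode = ROUTE_TYPE_TO_MODE.get(route_type)
        if mode is not None:
            modes.add(mode)
    return modes.pop() if len(modes) == 1 else None
```

Returning `None` for mixed modes keeps a candidate in play instead of discarding it because of a mode that may be wrong.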

So for stop provider DELFI, this doesn't affect many matches. Running
python compare_stops.py -g gtfs_2023-04-18.zip -o germany-latest.osm.pbf -s zHV_aktuell_csv.2023-04-17.csv -p DELFI -d out/stops.db
gives the following match_stats (match_state|before|after):

MATCHED|339001|339167
MATCHED_AMBIGOUSLY|29511|29707
MATCHED_THOUGH_DISTANT|2493|2486
MATCHED_THOUGH_NAMES_DIFFER|10687|10681
MATCHED_THOUGH_OSM_NO_NAME|6657|6648
MATCHED_THOUGH_REVERSED_DIR|8374|8379
NO_MATCH|34686|34553
NO_MATCH_AND_SEEMS_UNSERVED|66668|66813
NO_MATCH_BUT_OTHER_PLATFORM_MATCHED|29808|29565
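To make the size of the shifts easier to read, the per-state differences can be computed directly from the figures above. The following is a small helper sketch; the function names `parse_stats` and `deltas` are made up for illustration, while the numbers are copied verbatim from the table.

```python
STATS = """\
MATCHED|339001|339167
MATCHED_AMBIGOUSLY|29511|29707
MATCHED_THOUGH_DISTANT|2493|2486
MATCHED_THOUGH_NAMES_DIFFER|10687|10681
MATCHED_THOUGH_OSM_NO_NAME|6657|6648
MATCHED_THOUGH_REVERSED_DIR|8374|8379
NO_MATCH|34686|34553
NO_MATCH_AND_SEEMS_UNSERVED|66668|66813
NO_MATCH_BUT_OTHER_PLATFORM_MATCHED|29808|29565
"""


def parse_stats(text):
    """Parse 'state|before|after' lines into {state: (before, after)}."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        state, before, after = line.split("|")
        rows[state] = (int(before), int(after))
    return rows


def deltas(rows):
    """Return {state: after - before} for every match state."""
    return {state: after - before for state, (before, after) in rows.items()}


if __name__ == "__main__":
    for state, delta in deltas(parse_stats(STATS)).items():
        print(f"{state}: {delta:+d}")
```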

Some notable examples:

GlobaleId|orig_osm_id|osm_id|Haltestelle_lang|orig_name|name|orig_gtfs_mode|gtfs_mode|osm_mode|linien|Name_Steig|next_stops|orig_match_state|match_state|orig_rating|rating
de:15001:8010077||n1374111038|Dessau Hbf||Dessau Hauptbahnhof|bus|train|train|ECIC,ICE,ECIC,ICE,ECIC,RB50,RB50,RB51,RB51,RE13,RE13,RE14,RE3,RE3,RE7,RE7,S2,S2,S8,S8|Ri Roßlau(Elbe)/Dessau Süd Tempelhofer Straße/Wolfen(Bitterfeld)/Bitterfeld/Bitterfeld Busbahnhof/Dessau-Alten/Dessau-Mosigkau/Dessau Süd/Magdeburg Hbf|Dessau Süd/Roßlau (Elbe)|NO_MATCH|MATCHED_AMBIGOUSLY||0.196950040114341
de:09162:22:3:3|n708952913|n5423040409|Flurstraße|Grillparzerstraße|Flurstraße|bus|light_rail|tram|19,19,21,21,N19,N19|Ri Max-Weber-Platz||MATCHED_THOUGH_DISTANT|MATCHED|0.090849821811033|0.284388298760704
de:09162:78:2:2||n2431541948|Tivolistraße||Tivolistraße|bus|light_rail|tram|16,16|Ri Paradiesstraße|Paradiesstraße|NO_MATCH_BUT_OTHER_PLATFORM_MATCHED|MATCHED||0.510942177256022
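The GlobaleId values in these examples show the hierarchy that the parent-station change relies on. A minimal sketch of deriving a parent from such an ID follows, assuming the colon-separated layout country:district:stop[:area[:quay]]. The function name is hypothetical, and IDs with a _G postfix are not treated specially here.

```python
def ifopt_parent_station(ifopt_id):
    """Return the station-level prefix of an IFOPT-style ID.

    Assumes the layout country:district:stop[:area[:quay]], so the station
    level consists of the first three components. Returns None if the ID has
    three or fewer components, i.e. it is already at station level.
    """
    parts = ifopt_id.split(":")
    if len(parts) <= 3:
        return None
    return ":".join(parts[:3])


if __name__ == "__main__":
    for example in ("de:09162:78:2:2", "de:09162:22:3:3", "de:15001:8010077"):
        print(example, "->", ifopt_parent_station(example))
```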

All diffs: matches_diff.csv (most of them are due to random toggling between runs when two candidates have identical ratings)

SQL query to obtain the diff
SELECT h.globaleid, osm_stops_orig.osm_id AS orig_osm_id, o.osm_id,
       h.Haltestelle_lang, osm_stops_orig.name AS orig_name, o.name,
       orig.mode AS orig_gtfs_mode, h.mode AS gtfs_mode, o.mode AS osm_mode,
       h.linien, h.Name_Steig, o.next_stops,
       orig.match_state AS orig_match_state, h.match_state,
       matches_orig.rating AS orig_rating, m.rating
FROM haltestellen_unified h
JOIN haltestellen_unified_orig orig ON h.globaleid=orig.globaleid
LEFT JOIN matches m ON m.ifopt_id=h.globaleid
LEFT JOIN osm_stops o ON o.osm_id=m.osm_id
LEFT JOIN matches_orig ON matches_orig.ifopt_id = h.globaleid
LEFT JOIN osm_stops_orig ON osm_stops_orig.osm_id=matches_orig.osm_id
WHERE orig.match_state != h.match_state;

Let me know if you're interested in also merging the DB-HAFAS matching part; or if you prefer separate PRs for some of the changes in this PR.

hbruch merged commit 43d07f8 into mfdz:master on Jun 12, 2023
Member

hbruch commented Jun 12, 2023

Thanks for your contributions!
