
IndexManager Python API

Examples can also be found in our tests.

Init

  • Add Dione's jars to spark.jars (see the configuration sketch after this list):
    • dione-spark-*.jar
    • dione-hadoop-*.jar
  • Also add the Avro-related jars to spark.jars:
    • spark-avro_2.12-3.3.0.jar
    • parquet-avro-1.12.2.jar
  • Add dione-spark-*.jar (which contains the Python API) to:
    • PYTHONPATH
    • spark.executorEnv.PYTHONPATH
    • spark.yarn.appMasterEnv.PYTHONPATH (if relevant)
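
As a sketch, these settings could be applied when building the SparkSession. The jar paths and versions below are assumptions; adjust them to match your build:

import sys
from pyspark.sql import SparkSession

# hypothetical jar paths/versions - adjust to your environment
dione_spark_jar = "dione-spark-0.5.0.jar"
jars = ",".join([
    dione_spark_jar,
    "dione-hadoop-0.5.0.jar",
    "spark-avro_2.12-3.3.0.jar",
    "parquet-avro-1.12.2.jar",
])

# make the Python API importable on the driver
sys.path.append(dione_spark_jar)

spark = (
    SparkSession.builder
    .config("spark.jars", jars)
    .config("spark.executorEnv.PYTHONPATH", dione_spark_jar)
    .config("spark.yarn.appMasterEnv.PYTHONPATH", dione_spark_jar)
    .getOrCreate()
)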

Creating an Index

Define a new index on table my_db.my_big_table with key field key1, also adding field col1 to the index table:

from dione import IndexManager

IndexManager.create_new(spark, "my_db.my_big_table", "my_db.my_index", ["key1"], ["col1"])

The index table my_db.my_index can be read as a regular Hive table. It contains the relevant metadata per key and is saved in our special Avro B-Tree format.
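
For example, since it is a regular Hive table, it can be inspected with standard Spark calls:

spark.table("my_db.my_index").printSchema()
spark.table("my_db.my_index").show(5, truncate=False)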

Start indexing partitions:

# optional - file mask
spark.conf.set("index.manager.file.filter", "%.avro")

indexManager = IndexManager.load(spark, "my_db.my_index")

# assuming `my_db.my_big_table` is partitioned by `dt` 
indexManager.append_new_partitions([[("dt", "2020-10-04")], [("dt", "2020-10-05")]])

Using the Index

Multi-Row

Query the index and use it to fetch the data:

indexManager = IndexManager.load(spark, "my_db.my_index")

# example query
queryDF = spark.table("my_db.my_index").where("col1 like '%foo%'")

payload_DF = indexManager.load_by_index(queryDF, ["col1", "col2"])

queryDF can be any manipulation of the index table, for example a join with another table. The load_by_index function only needs the relevant metadata fields: data_filename, data_offset, data_sub_offset, and data_size.
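
As a sketch, the query could join the index with another table; the table my_db.keys_to_fetch and its key1 column are hypothetical here:

# build the query by joining the index table with a (hypothetical) keys table
keysDF = spark.table("my_db.keys_to_fetch")
queryDF = spark.table("my_db.my_index").join(keysDF, "key1")

payload_DF = indexManager.load_by_index(queryDF, ["col1", "col2"])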

Single-Row

Fetch a specific key:

# fetch column `col3` for key `key_bar` in partition dt=2020-10-01
data_val = indexManager.fetch(["key_bar"], [("dt", "2020-10-01")], ["col3"])