support for joins? #46

Open
sarukas opened this issue Mar 9, 2015 · 7 comments

@sarukas

sarukas commented Mar 9, 2015

Sorry for asking this here. Does Splout DB support joins? The intended use case is joining a large batch-generated table with a small dimension table on the fly.

@pereferrera
Contributor

Hello sarukas,
Yes, you can do that by indexing the dimension table as "replicate to all". In this way the dimension table will be written in every partition of the tablespace. Check the user guide, sections "Partitioning" for the conceptual part and "Splout-Hadoop API" for the hands-on part.
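To illustrate why replicating the dimension table makes the join possible, here is a minimal sketch, not Splout's actual API. It uses Python's built-in `sqlite3` with one in-memory database standing in for each partition. The table names (`visits`, `pages`), the columns, and the helper functions are hypothetical and chosen only for the example.

```python
import sqlite3

def build_partition(fact_rows, dimension_rows):
    """Create one in-memory SQLite database standing in for a single partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (country TEXT, page_id INTEGER, hits INTEGER)")
    conn.execute("CREATE TABLE pages (page_id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", fact_rows)
    # The small dimension table is copied in full into every partition
    # (the "replicate to all" idea from the reply above).
    conn.executemany("INSERT INTO pages VALUES (?, ?)", dimension_rows)
    conn.commit()
    return conn

# Hypothetical dimension data, identical in every partition.
PAGES = [(1, "home"), (2, "pricing")]

# Hypothetical fact data, partitioned by country.
partitions = {
    "ES": build_partition([("ES", 1, 10), ("ES", 2, 4)], PAGES),
    "LT": build_partition([("LT", 1, 7)], PAGES),
}

JOIN_SQL = """
    SELECT p.title, SUM(v.hits)
    FROM visits v JOIN pages p ON p.page_id = v.page_id
    WHERE v.country = ?
    GROUP BY p.title
    ORDER BY p.title
"""

def hits_by_page(country):
    """Run the join entirely inside the partition that owns this country."""
    return partitions[country].execute(JOIN_SQL, (country,)).fetchall()

print(hits_by_page("ES"))
```

Because every partition carries its own copy of `pages`, the join never needs data from another partition.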

@sarukas
Author

sarukas commented Mar 9, 2015

Hi,

Thanks for the quick reply. One more question: how are totals handled? Are they possible across partitions? E.g. a count grouped by country, where country is the partition key?

Our use case is OLAP queries, where the lowest level of aggregation is precomputed along several dimension paths, but higher levels would be calculated on the fly.

Thanks, Sarunas

@pereferrera
Contributor

Hello sarukas,
If you partition by country, then you can essentially run SQL queries over only one country's data. That's the main restriction of Splout SQL. When you want to do cross-partition queries, you can always send the same query to all partitions and combine the results manually. A better solution would be to integrate Splout with a higher-level querying system like Apache Drill. We have done that internally, but we still need to test it further. We haven't released the integration with Drill, but tell us if you are interested.
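The manual approach of sending the same query to every partition and combining the results can be sketched as follows. This is an illustration only: it uses in-memory `sqlite3` databases in place of real partitions, and the `events` table, its columns, and the function names are hypothetical.

```python
import sqlite3
from collections import Counter

def make_partition(rows):
    """Create an in-memory SQLite database standing in for one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (country TEXT, device TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn

# Hypothetical data: one partition per country.
partitions = [
    make_partition([("ES", "mobile"), ("ES", "desktop"), ("ES", "mobile")]),
    make_partition([("LT", "mobile"), ("LT", "desktop")]),
]

# The same aggregate query is sent to every partition.
PARTIAL_SQL = "SELECT device, COUNT(*) FROM events GROUP BY device"

def count_by_device(partitions):
    """Merge per-partition partial counts into a global total on the client."""
    totals = Counter()
    for conn in partitions:
        for device, partial_count in conn.execute(PARTIAL_SQL):
            totals[device] += partial_count
    return dict(totals)

print(count_by_device(partitions))
```

Counts and sums merge by simple addition. An average cannot be merged that way, so each partition would need to return its sum and its count separately, and the client would divide at the end.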

@thbeh

thbeh commented Apr 15, 2015

Could you provide some details on how Splout SQL integrates with Apache Drill?

@pereferrera
Contributor

We wrote a plugin for Drill that integrates Splout as another data store that Drill can query. Because Splout is partitioned and indexed, we tell Drill which partition(s) to scan and how to execute the query so that it uses the appropriate indexes. If the SQL query has an equality condition on the partition key, Drill does the same thing you would do with the normal Splout SQL API: it queries a single partition. Otherwise, as many scans as needed are produced, and Drill takes care of the rest (grouping, aggregating, etc.). Although we haven't fully tested the performance of this system, we expect it to be quite fast for queries that don't touch massive portions of the data (a full scan of the data would be much more efficient with another underlying store, such as plain Parquet files).
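The routing decision described above can be sketched roughly as below. This is not the unreleased plugin's code. It assumes, for illustration only, that partitions are defined by key ranges, and the `Partition` class, `PARTITION_MAP`, and `partitions_to_scan` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Partition:
    shard_id: int
    min_key: str  # inclusive lower bound (assumed range partitioning)
    max_key: str  # exclusive upper bound

# Hypothetical partition map covering the whole key space.
PARTITION_MAP = [
    Partition(0, "", "H"),
    Partition(1, "H", "P"),
    Partition(2, "P", "\uffff"),
]

def partitions_to_scan(key_equals: Optional[str]) -> List[int]:
    """Pick the shards a query has to touch.

    key_equals is the value from an equality condition on the partition
    key, or None when the query has no such condition.
    """
    if key_equals is None:
        # No usable condition: every partition is scanned, and the engine
        # on top does the grouping and aggregating across the results.
        return [p.shard_id for p in PARTITION_MAP]
    # Equality on the partition key: only the owning partition is queried.
    return [p.shard_id for p in PARTITION_MAP
            if p.min_key <= key_equals < p.max_key]

print(partitions_to_scan("ES"))
print(partitions_to_scan(None))
```

With an equality condition the query touches one shard, which matches the single-partition behaviour of the normal API. Without one, the cost grows with the number of partitions.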

Would you be interested in trying this for your use case?

@thbeh

thbeh commented Apr 15, 2015

I would be interested, as I am trying to evaluate Splout SQL without all the complexity of Spark SQL. The main advantage of Splout SQL here is that it has a REST API. Could you share more info, as I am still unfamiliar with Drill's concepts?

@pereferrera
Contributor

Hi,
In this case I think the first step would be to try Splout SQL for your particular use case. Splout solves a particular problem (web-latency SQL over Hadoop data) and might not be the best fit for other problems (arbitrary, full-scan queries over huge datasets).

If you incorporate Splout SQL for your use case and are already happy with it, but need to be able to support cross-partition queries, then you would move to Drill over Splout.

I think it would be better to follow up on your use case on the user list. Feel free to write about it there, and we can help you set up Splout to try it: https://groups.google.com/forum/?fromgroups#!forum/sploutdb-users
