StreamSight is a query-driven framework for streaming analytics in edge computing. It supports users in composing analytic queries that are automatically translated and mapped to streaming operations suitable for execution on distributed processing engines deployed across wide coverage areas. Our reference implementation has been integrated with Apache Spark 2.1.1.
Data scientists and platform operators can compose complex analytic queries without any knowledge of the programming model of the underlying distributed processing engine, solely by using high-level directives and expressions. An example of such an analytic query from the domain of intelligent transportation is the following:
COMPUTE
ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
BY city_segment
EVERY 5 SECONDS
The above query computes, over a 10-minute sliding window, the average bus delay per city segment, with new datapoints considered every 5 seconds. This is particularly useful for a traffic operator detecting congestion in each city segment. Furthermore, StreamSight gives users the ability to:
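To make the window semantics concrete, the following is a minimal plain-Python sketch of what the query means: a per-group sliding window over timestamped values, from which a mean is reported periodically. This is only an illustration of the semantics, not StreamSight's actual implementation; the `SlidingMean` class and its method names are hypothetical.

```python
from collections import deque

class SlidingMean:
    """Illustrative model of the query's semantics: a per-group
    sliding window holding (timestamp, value) pairs."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.points = {}  # group -> deque of (timestamp, value)

    def add(self, ts, group, value):
        self.points.setdefault(group, deque()).append((ts, value))

    def means(self, now):
        """Evict points older than the window and return the mean per group."""
        result = {}
        for group, dq in self.points.items():
            while dq and dq[0][0] <= now - self.window:
                dq.popleft()
            if dq:
                result[group] = sum(v for _, v in dq) / len(dq)
        return result

# Example: bus delays (seconds) in two city segments, 600 s (10 min) window
w = SlidingMean(600)
w.add(0, "segment_a", 120)
w.add(300, "segment_a", 180)
w.add(300, "segment_b", 60)
print(w.means(now=300))  # {'segment_a': 150.0, 'segment_b': 60.0}
```

In the real system, `EVERY 5 SECONDS` would correspond to calling `means` on each 5-second trigger, with eviction handled by the engine's windowing machinery.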
- Prioritize query execution so that results of high importance are always output on time (e.g., during high load influx), while low-priority insights are scheduled when resources are available
- Request query enforcement on a sample of the data stream for indicative but timely responses
- Define constraints such as the maximum tolerable upper error bounds for approximate query responses
COMPUTE
ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
BY city_segment EVERY 5 SECONDS
WITH MAX_ERROR 0.005 AND CONFIDENCE 0.95
The above example extends the previous query so that approximate answers are provided, significantly improving response time by executing the query on a sample of the data stream.
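The idea behind such approximate answers can be sketched with a Bernoulli sample of the stream and a normal-approximation margin of error. This is a statistical illustration under our own assumptions, not StreamSight's sampling algorithm; the `approximate_mean` function and its parameters are hypothetical.

```python
import random
import statistics

def approximate_mean(stream, sample_rate, confidence_z=1.96):
    """Estimate the stream mean from a Bernoulli sample and report a
    normal-approximation margin of error (z=1.96 ~ 95% confidence)."""
    sample = [x for x in stream if random.random() < sample_rate]
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, confidence_z * stderr

random.seed(42)
stream = [random.gauss(100, 10) for _ in range(100_000)]
est, margin = approximate_mean(stream, sample_rate=0.01)
print(f"{est:.1f} +/- {margin:.1f}")  # close to the true mean of 100
```

A system enforcing `MAX_ERROR` and `CONFIDENCE` constraints would, in effect, grow the sample until the reported margin falls below the requested bound.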
- Make sure Docker is installed on your machine.
- (Optionally) configure and submit your queries.
- Build the Docker image using the following command:
docker build -t streamsight -f Dockerfile .
Run StreamSight on your local machine:
docker run -p 4040:4040 -it streamsight
If you have an existing Spark Cluster, you only have to specify the address of the master node:
MASTER_NODE=spark://${spark-url}
spark-submit --master $MASTER_NODE ./StreamSight-0.0.1.jar
For convenience, we provide a number of scripts (located in the /cluster directory) to help you set up a Spark cluster via docker-compose. The only requirement is to have docker-compose installed on your system.
cd cluster/
# Create 1 Master and 4 Spark workers
./start-infra.sh
# You can scale to 8 workers with the following script:
./scale-workers.sh 8
# To submit StreamSight, just run:
./build-run-cluster.sh
The following figure depicts a high-level overview of the StreamSight framework. Users submit ad-hoc queries following the declarative query model, and the system compiles these queries into low-level streaming commands.
Once the executable artifact is produced, users can submit it to the underlying distributed processing engine. The raw monitoring metrics are fed into the processing engine where they are transformed into analytic insights and streamed to a high-availability queuing system.
In this section we present a few examples of useful insights from monitoring a cloud infrastructure.
CPU utilization is a metric that many companies need to monitor and act upon. The following insight returns the average CPU utilization of a service over a 30-second window, computed every 10 seconds.
cpu_utilization = COMPUTE (
ARITHMETIC_MEAN( cpu_user, 30 SECONDS )
+ ARITHMETIC_MEAN( cpu_sys, 30 SECONDS )
) EVERY 10 SECONDS ;
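The composite expression above simply adds two windowed means over the same 30-second window. A minimal sketch of that evaluation, with made-up sample values:

```python
# Samples of the two metrics observed in the current 30 s window
# (values are illustrative, not real measurements).
cpu_user = [0.42, 0.38, 0.40]
cpu_sys = [0.11, 0.09, 0.10]

def mean(xs):
    return sum(xs) / len(xs)

# cpu_utilization = ARITHMETIC_MEAN(cpu_user) + ARITHMETIC_MEAN(cpu_sys)
cpu_utilization = mean(cpu_user) + mean(cpu_sys)
print(round(cpu_utilization, 2))  # 0.5
```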
Free RAM can be crucial for some applications; thus, the following expression gives the average RAM usage over a 10-minute window, computed every 30 seconds.
ram_usage_per_service = COMPUTE
ARITHMETIC_MEAN( service:ram , 10 MINUTES )
EVERY 30 SECONDS ;
Next we present an insight for the maximum number of HTTP requests per second over a 10-minute window, computed every 30 seconds and grouped by region. For DevOps engineers and developers who work on web-based applications, the traffic peak in a specific region can be a critical factor.
http_requests_per_seconds_by_region = COMPUTE
MAX( service:requests_per_seconds , 10 MINUTES ) BY Region
EVERY 30 SECONDS ;
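The `MAX ... BY Region` part of the insight is a grouped maximum over the window's samples. A plain-Python sketch of that aggregation (region names and values are made up for illustration):

```python
# (region, requests-per-second) samples observed in the current window
window = [
    ("eu-west", 120), ("eu-west", 340),
    ("us-east", 95), ("us-east", 410),
]

# Grouped maximum: one peak value per region
peaks = {}
for region, rps in window:
    peaks[region] = max(peaks.get(region, 0), rps)

print(peaks)  # {'eu-west': 340, 'us-east': 410}
```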
Parameter | Default Value | Description |
---|---|---|
app | insights-test | The name of the StreamSight job |
insights.file | /home/insights.ins | The file containing insights |
dummy.input | true | When enabled, a producer pushes random measurements to the system |
dummy.interval | 100 | The periodicity of the dummy producer in ms |
dummy.dimensions | 5 | The number of metrics produced (e.g., m1,m2,...) |
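For reference, the file pointed to by `insights.file` can simply list insight definitions in the query language shown earlier. The example below reuses insights from this document (we assume here that the file is a plain-text list of semicolon-terminated definitions):

```
cpu_utilization = COMPUTE (
    ARITHMETIC_MEAN( cpu_user, 30 SECONDS )
    + ARITHMETIC_MEAN( cpu_sys, 30 SECONDS )
) EVERY 10 SECONDS ;

ram_usage_per_service = COMPUTE
    ARITHMETIC_MEAN( service:ram , 10 MINUTES )
    EVERY 30 SECONDS ;
```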
When using the framework please use the following reference to cite our work:
"StreamSight: A Query-Driven Framework for Streaming Analytics in Edge Computing", Z. Georgiou, M. Symeonides, D. Trihinas, G. Pallis, M. D. Dikaiakos, 11th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2018), Zurich, Switzerland, Dec 2018.
The framework is open-sourced under the Apache 2.0 License. The codebase of the framework is maintained by the authors for academic research and is therefore provided "as is".