Rx vs Spark
Several LSQ commands come in two flavours: `rx` and `spark`. RxJava is a framework for building workflows over streaming data. Apache Spark is a framework for processing large amounts of (input) data in parallel, both locally and on a cluster. It is possible to (re-)use rx-based functions within Spark's `rdd.mapPartitions` feature - and this is what `lsq spark` does.

Concretely, LSQ builds on SANSA-Stack, which provides a foundation for working with RDF/SPARQL in Apache Spark.
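The following is a minimal, self-contained sketch of that idea: an rx-based function applied inside Spark's `mapPartitions`. It is not the actual LSQ code; the class and helper names are made up for illustration, and it assumes RxJava 3 together with the Spark Java API.

```java
import io.reactivex.rxjava3.core.Flowable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.Iterator;

public class RxInSparkSketch {
    // Hypothetical rx-based function: transforms a stream of log entries into processed records.
    static Flowable<String> rxProcess(Flowable<String> entries) {
        return entries.map(entry -> "processed: " + entry);
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "rx-in-spark-sketch");
        JavaRDD<String> logEntries = sc.parallelize(Arrays.asList("entry1", "entry2", "entry3"));

        // mapPartitions hands over an Iterator per partition; wrap it in a Flowable,
        // apply the rx-based function, and return the result as an Iterator again.
        JavaRDD<String> processed = logEntries.mapPartitions((Iterator<String> it) ->
                rxProcess(Flowable.fromIterable(() -> it)).blockingIterable().iterator());

        processed.collect().forEach(System.out::println);
        sc.stop();
    }
}
```

The key point is that the per-partition `Iterator` Spark hands to `mapPartitions` can be wrapped in a `Flowable`, so the same rx-based transformation can back both the `rx` and the `spark` flavour.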
Differences
`rx` commands are more lightweight and support immediate streaming from stdin (see the note on sorting below). For small input data they may be faster because there is no overhead from initializing Spark.

`spark` commands can read input sources in parallel and can thus significantly outperform plain `rx` ones on larger input data.
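To illustrate the streaming behaviour of the `rx` flavour, here is a small, hypothetical sketch (not LSQ code; it only assumes RxJava 3) that processes stdin line by line as data arrives:

```java
import io.reactivex.rxjava3.core.Flowable;

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StdinStreamSketch {
    public static void main(String[] args) {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

        // Each line is handled as soon as it arrives; nothing is buffered up front,
        // so output starts immediately even for very large (or unbounded) input.
        Flowable.fromIterable(() -> in.lines().iterator())
                .map(line -> "rdfized: " + line) // stand-in for the real per-entry work
                .blockingSubscribe(System.out::println);
    }
}
```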
A Note on Sorting
`lsq {spark|rx} rdfize` by default inverts log entries into query-centric named graphs and then sorts and merges them. For this task the `rx` implementation used to rely on platform-specific sorting via `/usr/bin/sort` and will be updated to use a cross-platform Spark-based implementation.
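To make the portability concern concrete, the following simplified, hypothetical sketch shows what delegating sorting to the external `/usr/bin/sort` binary involves; it is not the actual LSQ code path.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.Arrays;
import java.util.List;

public class ExternalSortSketch {
    public static void main(String[] args) throws Exception {
        // Start the platform's sort utility; this only works where /usr/bin/sort exists.
        Process sort = new ProcessBuilder("/usr/bin/sort").start();

        List<String> records = Arrays.asList("graph-c", "graph-a", "graph-b");

        // Feed records to sort's stdin on a separate thread (avoids pipe deadlocks),
        // then read the sorted result back from its stdout.
        Thread writer = new Thread(() -> {
            try (BufferedWriter w = new BufferedWriter(
                    new OutputStreamWriter(sort.getOutputStream()))) {
                for (String r : records) {
                    w.write(r);
                    w.newLine();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(sort.getInputStream()))) {
            r.lines().forEach(System.out::println);
        }
        writer.join();
    }
}
```

Relying on an external `sort` binary ties the current `rx` flavour to Unix-like systems, which is why a cross-platform Spark-based sort is the intended replacement.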