SparkleFrame

SparkleFrame implements the PySpark DataFrame API so that transformation pipelines can run directly on Polars DataFrames, with no Spark cluster or Spark dependencies required.

Apache Spark is designed for distributed, large-scale data processing, but it is not optimized for low-latency use cases. There are scenarios, however, where you need to re-compute certain data quickly, for example regenerating features for a machine learning model in real time or near-real time.

SparkleFrame is great for:

  • Running PySpark code locally without the overhead of starting a Spark session
  • Running PySpark DataFrame code without the complexity of operating Spark
  • Unit testing, feature prototyping, and serving small pipelines in microservices (see the test sketch after this list)
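
As a quick illustration of the unit-testing use case, here is a minimal pytest-style sketch. The transformation function add_total is hypothetical, and the collect() call is an assumption made for illustration; check the documentation for the methods that are currently implemented.

from sparkleframe.activate import activate

# Activate SparkleFrame before any pyspark.sql imports
activate()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


# Hypothetical transformation under test
def add_total(df):
    return df.withColumn("total", F.col("a") + F.col("b"))


def test_add_total():
    session = SparkSession.builder.getOrCreate()
    df = session.createDataFrame(data=[{"a": 1, "b": 2}])
    result = add_total(df)
    # collect() is assumed to behave as in PySpark
    rows = result.collect()
    assert rows[0]["total"] == 3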

You can learn more about the design motivation behind SparkleFrame in this discussion thread.

Documentation

Full documentation is available at https://flypipe.github.io/sparkleframe/.

Installation

pip install sparkleframe

Usage

SparkleFrame can be used in two ways:

  • Directly importing the sparkleframe.polarsdf package
  • Using the activate function, which lets you continue to use pyspark.sql imports while SparkleFrame handles execution behind the scenes

Directly importing

If converting a PySpark pipeline, all pyspark.sql imports should be replaced with their sparkleframe.polarsdf equivalents.

# PySpark import
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F
# from pyspark.sql.dataframe import DataFrame
# SparkleFrame import
from sparkleframe.polarsdf.session import SparkSession
from sparkleframe.polarsdf import functions as F
from sparkleframe.polarsdf.dataframe import DataFrame
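
These imports are drop-in replacements; the session and DataFrame are then used exactly as in PySpark (see the Example Usage section below):

session = SparkSession.builder.getOrCreate()
df = session.createDataFrame(data=[{"col1": 1, "col2": 2}])
df = df.withColumn("col3", F.col("col2") + F.col("col1"))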

Activating SparkleFrame

SparkleFrame can either replace pyspark imports or be used alongside them. To replace pyspark imports, use the activate function to switch the execution engine:

from sparkleframe.activate import activate

# Activate SparkleFrame
activate()

from pyspark.sql import SparkSession
session = SparkSession.builder.getOrCreate()

SparkSession will now be a SparkleFrame SparkSession object, and everything will run directly on Polars DataFrames.
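
To sanity-check the swap, inspect the session's class. The exact output below is illustrative, but the class should come from the sparkleframe.polarsdf package rather than pyspark:

>>> print(type(session))
<class 'sparkleframe.polarsdf.session.SparkSession'>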

Alternatively, SparkleFrame can be imported directly, as in the Directly importing section above:

from sparkleframe.polarsdf.session import SparkSession
session = SparkSession.builder.getOrCreate()

Example Usage

from sparkleframe.activate import activate

# Activate SparkleFrame
activate()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

session = SparkSession.builder.getOrCreate()
df = session.createDataFrame(data=[{"col1": 1, "col2": 2}])
df = df.withColumn("col3", F.col("col2") + F.col("col1"))
>>> print(type(df))
<class 'sparkleframe.polarsdf.dataframe.DataFrame'>
>>> df.show()
shape: (1, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ 2    ┆ 3    │
└──────┴──────┴──────┘
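
The same pattern extends to multi-step pipelines. The sketch below assumes that filter, groupBy, agg, F.sum, and alias are among the implemented transformations; if one is not, see the note below:

df = session.createDataFrame(
    data=[
        {"group": "a", "value": 1},
        {"group": "a", "value": 2},
        {"group": "b", "value": 3},
    ]
)

# filter, groupBy, and agg are assumed to be implemented
result = (
    df.filter(F.col("value") > 1)
    .groupBy("group")
    .agg(F.sum("value").alias("total"))
)
result.show()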

Note

If you encounter any transformation that is not implemented, please open an issue on GitHub so it can be prioritized.

Source Code

The source code is available at https://github.com/flypipe/sparkleframe.