Hybrid Tabular Schema
The real world does not produce clean data. Sensors fail, human operators mix units (e.g., entering "10 kg" and "22 lbs" in the same column), formatting varies by region, and anomalies spike.
Phaethon's Schema module is a declarative data engineering engine designed to brutally normalize these chaotic datasets. It enforces the laws of physics directly at the ingestion layer, healing and standardizing millions of rows into strict physical dimensions.
The Dual-Engine Architecture
Phaethon is backend-agnostic. It integrates with both Pandas and Polars. The engine automatically detects the type of DataFrame you pass into the pipeline and routes execution to the appropriate C/C++ backend. Combined with Phaethon's dedicated Rust parser for zero-allocation string extraction, this keeps physical data processing fast with minimal overhead.
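As a rough illustration of the detection step, the sketch below picks a backend from the module that defines the object's class. It is not Phaethon's actual dispatcher: detect_backend and the two stand-in classes are hypothetical names, and the approach assumes the top-level module name is enough to tell the two libraries apart.

```python
def detect_backend(df: object) -> str:
    """Return 'pandas' or 'polars' based on the module that defines type(df)."""
    # e.g. 'pandas.core.frame' -> 'pandas', 'polars.dataframe.frame' -> 'polars'
    root = type(df).__module__.split(".")[0]
    if root in ("pandas", "polars"):
        return root
    raise TypeError(f"Unsupported DataFrame type: {type(df).__name__}")


# Stand-in classes (hypothetical) so the sketch runs without either library installed.
class FakePandasFrame:
    pass


class FakePolarsFrame:
    pass


FakePandasFrame.__module__ = "pandas.core.frame"
FakePolarsFrame.__module__ = "polars.dataframe.frame"

print(detect_backend(FakePandasFrame()))  # pandas
print(detect_backend(FakePolarsFrame()))  # polars
```

Checking the module name rather than calling isinstance means neither library has to be imported just to reject the other.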
Defining a Schema
A Phaethon Schema is defined declaratively by subclassing phaethon.Schema and defining attributes with phaethon.Field(). You must bind each field to a specific physical dimension using standard Python type hints.
import phaethon as ptn
import phaethon.units as u
import pandas as pd
class RocketTelemetry(ptn.Schema):
    # Extracts numbers from text, converts to Celsius, and coerces anomalies
    engine_temp: u.Celsius = ptn.Field(
        source="Raw_Temp",
        parse_string=True,
        min=-273.15,
        on_error="coerce",
        impute_by="mean"
    )

    # Numeric column mapped directly to Pascals; missing readings are mean-imputed
    chamber_pressure: u.Pascal = ptn.Field(source="Pressure_Pa", impute_by="mean")

# The Dirty Data
dirty_df = pd.DataFrame({
    "Raw_Temp": [" 450.5 C ", "ERR_SENSOR", "1200 K", "-500 C"],  # -500 C violates Absolute Zero!
    "Pressure_Pa": [101325, 105000, None, 98000]
})

# Execute the Pipeline
clean_df = RocketTelemetry.normalize(dirty_df)
print(clean_df)
Output:
   engine_temp  chamber_pressure
0       450.50         101325.00
1       688.67         105000.00   <- 'ERR_SENSOR' imputed with column mean
2       926.85         101441.67   <- 1200 K converted to Celsius; missing pressure imputed with column mean
3       688.67          98000.00   <- -500 C violated absolute zero, coerced to NaN, then imputed
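The engine_temp column above can be checked by hand. The following is a plain-Python sketch of the coerce-then-impute arithmetic only, assuming the parser has already turned 'ERR_SENSOR' into NaN and converted 1200 K to Celsius; it does not use or reproduce Phaethon's internals.

```python
import math

ABSOLUTE_ZERO_C = -273.15

# Parsed readings in Celsius: 'ERR_SENSOR' -> NaN, 1200 K -> 1200 - 273.15
parsed = [450.5, math.nan, 1200.0 - 273.15, -500.0]

# Step 1 (coerce): anything below the physical minimum becomes NaN.
coerced = [
    math.nan if (not math.isnan(v) and v < ABSOLUTE_ZERO_C) else v
    for v in parsed
]

# Step 2 (impute): every NaN is replaced by the mean of the surviving values.
valid = [v for v in coerced if not math.isnan(v)]
mean = sum(valid) / len(valid)
healed = [mean if math.isnan(v) else v for v in coerced]

print(healed)  # roughly [450.5, 688.675, 926.85, 688.675]
```

Because coercion runs before imputation, the invalid -500 C reading never contributes to the mean.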
The Field API
The ptn.Field() constructor is the workhorse of the pipeline. It defines the extraction logic, physical boundaries, formatting, and data healing strategies for a single column.
Mapping & Parsing
These parameters dictate how Phaethon locates the data and extracts numerical magnitudes from messy text formats using the Rust engine.
- source: The raw column to read. If left unset (...), it auto-maps to the variable name.
- parse_string: If True, routes the column through Phaethon's Rust backend to safely extract numeric magnitudes and unit strings from mixed text (e.g., " 1.5e3 kg ").
- source_unit: The fallback unit assumed for entries that carry no unit tag (used with require_tag=False).
- require_tag: If True (Default), the Rust parser rejects string entries that lack a physical unit tag unless a source_unit fallback is defined.

Example: Parsing complex strings with fallbacks
class ParsingSchema(ptn.Schema):
    weight: u.Kilogram = ptn.Field(source="W", parse_string=True, source_unit="lbs", require_tag=False)

# Raw: ["10 kg", "20", "50 lbs"]
# Output (in kg): [10.0, 9.07, 22.68] <- '20' falls back to 'lbs' and converts to kg!
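To make the fallback behaviour concrete, here is a small regex-based sketch in pure Python. It is only an approximation of what a tagged-magnitude parser does, not the Rust implementation: parse_mass_kg, PATTERN and TO_KG are hypothetical names, and the way require_tag and source_unit interact is an assumption drawn from the example above.

```python
import re

# Conversion factors to kilograms (the pound factor is the exact international definition).
TO_KG = {"kg": 1.0, "lbs": 0.45359237}

# Optional sign, digits with optional decimals/exponent, then an optional alphabetic unit tag.
PATTERN = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z]*)\s*$")


def parse_mass_kg(raw, source_unit=None, require_tag=True):
    """Return the mass in kg rounded to 2 decimals, or None if the entry is rejected."""
    match = PATTERN.match(raw)
    if match is None:
        return None
    magnitude, unit = float(match.group(1)), match.group(2)
    if not unit:
        # Assumption: an untagged entry is accepted only when a fallback unit
        # is given and tags are not required.
        if source_unit is None or require_tag:
            return None
        unit = source_unit
    if unit not in TO_KG:
        return None
    return round(magnitude * TO_KG[unit], 2)


raw = ["10 kg", "20", "50 lbs"]
print([parse_mass_kg(s, source_unit="lbs", require_tag=False) for s in raw])
# [10.0, 9.07, 22.68]
```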
Physical Bounds & Error Handling
Enforce the laws of physics directly at the ingestion layer. Determine how Phaethon reacts when it encounters anomalies or data that violates physical boundaries.
- min / max: The physical bounds for the column. A bound can be a number or a string such as "10 kg" to dynamically bound the data relative to the target unit.
- on_error: How a violation is handled:
  - 'raise': Halts execution immediately and throws an AxiomViolationError.
  - 'coerce': Neutralizes the invalid value to NaN for later imputation.
  - 'clip': Forces the value to the nearest min or max bound.
Example: The effects of on_error
class BoundSchema(ptn.Schema):
    # If pressure goes below 0, force it to exactly 0 (perfect vacuum)
    pressure: u.Pascal = ptn.Field(min=0, on_error="clip")

# Raw: [105000, -500, 98000]
# Output: [105000.0, 0.0, 98000.0]
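The three on_error modes can be compared side by side with a short pure-Python sketch. The helper enforce_bounds and the BoundsViolation exception are hypothetical stand-ins written for this illustration (the library raises its own AxiomViolationError), and the parameters are named lower and upper here to avoid shadowing Python's built-in min and max.

```python
import math


class BoundsViolation(ValueError):
    """Stand-in for the library's own error type."""


def enforce_bounds(values, lower=None, upper=None, on_error="raise"):
    if on_error not in ("raise", "coerce", "clip"):
        raise ValueError(f"Unknown on_error mode: {on_error!r}")
    result = []
    for v in values:
        below = lower is not None and v < lower
        above = upper is not None and v > upper
        if not (below or above):
            result.append(float(v))           # in range: keep as-is
        elif on_error == "raise":
            raise BoundsViolation(f"{v} is outside [{lower}, {upper}]")
        elif on_error == "coerce":
            result.append(math.nan)           # neutralize for later imputation
        else:
            result.append(float(lower if below else upper))  # clip to nearest bound
    return result


print(enforce_bounds([105000, -500, 98000], lower=0, on_error="clip"))
# [105000.0, 0.0, 98000.0]
```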
Data Healing & Imputation
- null_values: Sentinel values (e.g., ["ERR", -9999]) that should be converted to NaN before processing.
- impute_by: The strategy for filling NaN values: 'mean', 'median', 'mode', 'ffill', 'bfill', or a constant physical string (e.g., "0 K").
- interpolate: The interpolation method for filling gaps (e.g., 'linear', 'spline'). Note: the Polars backend natively supports only 'linear' and 'nearest'.

Example: Time-Series Interpolation
class TimeSeriesSchema(ptn.Schema):
    # Nullifies -999, then draws a straight line between valid points
    voltage: u.Volt = ptn.Field(null_values=[-999], interpolate="linear")

# Raw: [10.0, -999, -999, 25.0]
# Output: [10.0, 15.0, 20.0, 25.0]
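The two steps in this example, sentinel nullification followed by linear interpolation, can be sketched in pure Python as follows. The helpers nullify and interpolate_linear are hypothetical and assume evenly spaced samples; the real backends delegate this work to Pandas or Polars.

```python
import math


def nullify(values, null_values):
    """Replace sentinel entries with NaN and cast everything else to float."""
    return [math.nan if v in null_values else float(v) for v in values]


def interpolate_linear(values):
    """Fill NaN runs that sit between two known points; edge NaNs are left alone."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


series = nullify([10.0, -999, -999, 25.0], null_values=[-999])
print(interpolate_linear(series))  # [10.0, 15.0, 20.0, 25.0]
```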
Localization & Formatting
Real-world datasets often use regional formatting or undocumented colloquialisms. Phaethon intercepts and standardizes these before the physics engine evaluates them.
- A decimal-mark setting for regional number formats (e.g., "," for European numbers like "1,50").
- An alias mapping from canonical units to colloquial spellings (e.g., {"kg": ["kilos", "kilo-grams"]}).

Note: The fuzzy_match and confidence parameters are used exclusively for Ontology mapping. Refer to the Fuzzy Semantics section.
Feature Engineering (DerivedField)
While standard Fields clean external data, DerivedField synthesizes entirely new Machine Learning features using cross-column dimensional algebra.
Derived Fields are evaluated in a second pass of the pipeline. They utilize the phaethon.col() abstraction to build deferred Abstract Syntax Trees (AST), which are ultimately executed securely using native Pandas or Polars vectorized math.
class AircraftSchema(ptn.Schema):
    # Pass 1: Clean raw extraction
    mass: u.Kilogram = ptn.Field(source="mass_kg", on_error="coerce", impute_by="mean")
    velocity: u.MeterPerSecond = ptn.Field(source="vel_ms")

    # Pass 2: Feature Synthesis (Kinetic Energy = 0.5 * m * v²)
    kinetic_energy: u.Joule = ptn.DerivedField(
        formula=0.5 * ptn.col("mass") * (ptn.col("velocity") ** 2),
        round=2
    )

# If mass is [1000] and velocity is [10], kinetic_energy becomes [50000.0]
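The idea of a deferred expression tree can be shown with a few lines of operator overloading. This is a toy sketch under stated assumptions, not how phaethon.col() is implemented: Expr, Const, Col, BinOp and compute are hypothetical names, evaluation is row by row rather than vectorized, and no dimensional checking is performed.

```python
import operator


class Expr:
    """Base node: arithmetic builds a tree instead of computing a value."""

    @staticmethod
    def _wrap(other):
        return other if isinstance(other, Expr) else Const(other)

    def __mul__(self, other):
        return BinOp(operator.mul, self, self._wrap(other))

    def __rmul__(self, other):
        return BinOp(operator.mul, self._wrap(other), self)

    def __pow__(self, other):
        return BinOp(operator.pow, self, self._wrap(other))


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self, columns, row):
        return self.value


class Col(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, columns, row):
        return columns[self.name][row]


class BinOp(Expr):
    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def evaluate(self, columns, row):
        return self.fn(self.left.evaluate(columns, row), self.right.evaluate(columns, row))


def compute(expr, columns):
    """Evaluate the tree once per row of the (already cleaned) columns."""
    n_rows = len(next(iter(columns.values())))
    return [expr.evaluate(columns, i) for i in range(n_rows)]


# Building the formula performs no arithmetic; it only records the tree.
formula = 0.5 * Col("mass") * (Col("velocity") ** 2)
print(compute(formula, {"mass": [1000.0], "velocity": [10.0]}))  # [50000.0]
```

Deferring evaluation is what allows the formula to be declared in the class body before any data exists, and then run in a second pass once the referenced columns have been cleaned.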
Lifecycle Hooks (Decorators)
If you need to perform complex DataFrame-level operations (like joining tables or manipulating datetime indices) before or after Phaethon processes the physics schema, you can use lifecycle hooks.
class AdvancedSchema(ptn.Schema):
    speed: u.MeterPerSecond = ptn.Field()

    @ptn.pre_normalize
    def filter_bad_flights(cls, df):
        # Drop rows where test flights were aborted before passing to the engine
        return df[df['status'] != 'ABORTED']
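One way a decorator-based hook like this could be wired up is sketched below in pure Python, with lists of dicts standing in for a DataFrame. Everything here is hypothetical (pre_normalize, MiniSchema, the _hook_stage marker) and is meant only to show the pattern of tagging a function and running it before the main pipeline; Phaethon's own mechanism may differ.

```python
def pre_normalize(func):
    """Tag a function so the schema runs it before field processing."""
    func._hook_stage = "pre"
    return func


class MiniSchema:
    @classmethod
    def normalize(cls, rows):
        # Run every tagged hook defined directly on this class, in declaration order.
        for attr in vars(cls).values():
            if getattr(attr, "_hook_stage", None) == "pre":
                rows = attr(cls, rows)
        # Field extraction and cleaning would follow here.
        return rows


class Flights(MiniSchema):
    @pre_normalize
    def filter_bad_flights(cls, rows):
        # Drop aborted test flights before any further processing.
        return [r for r in rows if r["status"] != "ABORTED"]


rows = [{"status": "OK", "speed": 250.0}, {"status": "ABORTED", "speed": 0.0}]
print(Flights.normalize(rows))  # [{'status': 'OK', 'speed': 250.0}]
```

This sketch only inspects the class's own namespace, so hooks inherited from a parent schema would need an extra walk over the method resolution order.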
Schema Execution API
Once defined, the Schema class acts as an execution engine.
.normalize()
def normalize(
    df: _DataFrameT,
    keep_unmapped: bool = False,
    drop_raw: bool = True
) -> _DataFrameT
Arguments:
- keep_unmapped: If True, retains columns from the original DataFrame that were not defined in the schema. (Default: False).
- drop_raw: If True, drops the original source columns after they have been mapped and cleaned. (Default: True).

Scenario Setup:
class FlightSchema(ptn.Schema):
    temp: u.Celsius = ptn.Field(source="t_raw", parse_string=True)

dirty_df = pd.DataFrame({
    "t_raw": ["10 C", "20 C"],
    "flight_id": ["F-01", "F-02"]  # Metadata not defined in the Schema
})
Standard Normalization (Default)
By default, Phaethon completely drops the raw source columns and removes any unmapped metadata to guarantee a strictly physical, mathematically safe DataFrame.
Retaining Metadata
Use keep_unmapped=True to preserve important non-physical columns (like IDs, Timestamps, or String labels) alongside the cleaned tensors.
clean_df = FlightSchema.normalize(dirty_df, keep_unmapped=True)
# Columns remaining: ['flight_id', 'temp']
Retaining Raw Source Data
Use drop_raw=False (in conjunction with keep_unmapped=True) to keep the original messy columns side-by-side with the cleaned physical columns. Highly useful for auditing, debugging, or comparing parser accuracy.
clean_df = FlightSchema.normalize(dirty_df, keep_unmapped=True, drop_raw=False)
# Columns remaining: ['flight_id', 't_raw', 'temp']
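The effect of the two flags on the final column set can be summarized with a small pure-Python sketch. The helper output_columns is hypothetical, and the ordering it produces (unmapped columns, then raw sources, then cleaned fields) is an assumption chosen to match the column lists shown above. Its behaviour for drop_raw=False without keep_unmapped=True is likewise a guess, since that combination is not documented here.

```python
def output_columns(raw_columns, field_sources, keep_unmapped=False, drop_raw=True):
    """field_sources maps each schema field name to its raw source column."""
    sources = set(field_sources.values())
    kept = []
    if keep_unmapped:
        # Columns the schema never mentions (IDs, timestamps, labels).
        kept += [c for c in raw_columns if c not in sources]
    if not drop_raw:
        # Original messy columns, kept for auditing.
        kept += [c for c in raw_columns if c in sources]
    return kept + list(field_sources)


raw = ["t_raw", "flight_id"]
fields = {"temp": "t_raw"}

print(output_columns(raw, fields))                                      # ['temp']
print(output_columns(raw, fields, keep_unmapped=True))                  # ['flight_id', 'temp']
print(output_columns(raw, fields, keep_unmapped=True, drop_raw=False))  # ['flight_id', 't_raw', 'temp']
```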
.blueprint()
Generates a structural, JSON-serializable dictionary of the schema. Highly useful for Data Governance, automated Data Catalogs, or rendering API specifications.
Example output (one field entry; the values correspond to engine_temp in RocketTelemetry):
{
"type": "Physical Dimension",
"source_column": "Raw_Temp",
"target": "Celsius",
"bounds": "-273.15 to None",
"imputation": "mean",
"fuzzy_match": false,
"target_unit": null
}
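For readers who want to see how such an entry serializes, the sketch below builds a dictionary with the same keys from a small dataclass and dumps it with the standard json module. FieldSpec is a hypothetical stand-in, the fixed "fuzzy_match" and "target_unit" values are simply copied from the output above, and nothing here reflects how .blueprint() gathers its data internally.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSpec:
    source_column: str
    target: str
    min: Optional[float] = None
    max: Optional[float] = None
    imputation: Optional[str] = None

    def blueprint(self) -> dict:
        # Only JSON-friendly types: str, bool and None.
        return {
            "type": "Physical Dimension",
            "source_column": self.source_column,
            "target": self.target,
            "bounds": f"{self.min} to {self.max}",
            "imputation": self.imputation,
            "fuzzy_match": False,
            "target_unit": None,
        }


spec = FieldSpec("Raw_Temp", "Celsius", min=-273.15, imputation="mean")
print(json.dumps(spec.blueprint(), indent=2))
```

Python's False and None become false and null in the JSON text, which is why the output above shows lowercase literals.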
.astensor()
def astensor(
    df: DataFrameLike,
    requires_grad: GradTarget = False,
    encode_categories: bool = True,
    *,
    as_tuple: Literal[False] = False
) -> Dataset: ...

def astensor(
    df: DataFrameLike,
    requires_grad: GradTarget = False,
    encode_categories: bool = True,
    *,
    as_tuple: Literal[True]
) -> TensorLikeTuple: ...
How it routes your data:
1. Continuous Physics (PTensor): Any column defined as a standard BaseUnit (e.g., Meter, Joule) is wrapped in a PTensor, Phaethon's custom PyTorch tensor that natively preserves physical dimensions and autograd computational graphs.
2. Discrete Semantics (torch.Tensor): Any column defined using Fuzzy Semantics (like categories or ontologies) is automatically factorized into a standard, zero-indexed integer torch.Tensor, making it ready for PyTorch nn.Embedding layers.
Arguments:
- requires_grad: Which tensors track gradients: a boolean (True/False) or a specific list of field names.
- encode_categories: If True (Default), string-based semantic fields are factorized into integers. If False, they remain raw strings.
- as_tuple: If True, unpacks the tensors into a raw Python tuple matching the declaration order. If False (Default), returns a structured Phaethon Dataset mapping.

Scenario Setup:
import phaethon as ptn
import phaethon.units as u
# Define a concrete Semantic State for our categories
class EngineStatus(ptn.SemanticState):
    OPTIMAL = ptn.Condition(target_unit=u.Celsius, max=100.0)
    CRITICAL = ptn.Condition(target_unit=u.Celsius, min=100.0)

class DeepLearningSchema(ptn.Schema):
    velocity: u.MeterPerSecond = ptn.Field()     # Continuous Physics
    temperature: u.Celsius = ptn.Field()         # Continuous Physics
    status: EngineStatus = ptn.Field(...)        # Discrete Semantic State

# Assume `clean_df` has been normalized via DeepLearningSchema.normalize()
Targeted Gradients
Returns a phaethon.Dataset mapping where only specific physical tensors track gradients for backpropagation.
dataset = DeepLearningSchema.astensor(clean_df, requires_grad=['velocity'])
v_tensor = dataset['velocity'].tensor
status_tensor = dataset['status'].tensor
print(v_tensor.requires_grad)
# Output: True (Ready for neural PDE differentiation!)
print(dataset['temperature'].tensor.requires_grad)
# Output: False
print(type(status_tensor), status_tensor.dtype)
# Output: <class 'torch.Tensor'> torch.int64 (Ready for nn.Embedding)
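Since requires_grad accepts either a boolean or a list of field names, it helps to see how the two forms might be reduced to one flag per field. The helper resolve_requires_grad below is a hypothetical pure-Python sketch, not part of the library. It considers only the continuous fields, because PyTorch allows gradients only on floating-point and complex tensors, so an integer-encoded semantic column could not track them in any case.

```python
def resolve_requires_grad(continuous_fields, requires_grad=False):
    """Map each continuous field name to a boolean gradient flag."""
    if isinstance(requires_grad, bool):
        # A single boolean applies to every continuous field.
        return {name: requires_grad for name in continuous_fields}
    requested = set(requires_grad)
    unknown = requested - set(continuous_fields)
    if unknown:
        raise KeyError(f"Unknown or non-continuous fields: {sorted(unknown)}")
    return {name: name in requested for name in continuous_fields}


print(resolve_requires_grad(["velocity", "temperature"], requires_grad=["velocity"]))
# {'velocity': True, 'temperature': False}
```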
Tuple Unpacking
Bypasses the phaethon.Dataset wrapper entirely and hands you the raw tensors in the exact order they were declared in the Schema. Useful for passing the results directly to code that expects plain torch.Tensor or phaethon.pinns.PTensor arguments.
v_tensor, temp_tensor, status_tensor = DeepLearningSchema.astensor(
    clean_df,
    as_tuple=True
)
print(v_tensor.shape)
# Output: torch.Size([N, 1])
Raw Semantic Categories
By setting encode_categories=False, the semantic strings bypass integer factorization entirely. This is useful if you are exporting the data for non-PyTorch visualization tools.
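To show what is being skipped, the sketch below performs a zero-indexed integer factorization in pure Python. The helper factorize is hypothetical, and assigning codes in order of first appearance is an assumption; the library may instead order codes by declaration or alphabetically.

```python
def factorize(labels):
    """Return (codes, categories) where codes[i] indexes into categories."""
    codes, categories = [], {}
    for label in labels:
        # setdefault assigns the next free index the first time a label is seen.
        codes.append(categories.setdefault(label, len(categories)))
    return codes, list(categories)


raw_status = ["OPTIMAL", "CRITICAL", "OPTIMAL"]
codes, categories = factorize(raw_status)
print(codes, categories)  # [0, 1, 0] ['OPTIMAL', 'CRITICAL']
```

With encoding disabled, downstream tools receive the equivalent of raw_status; with it enabled, they receive the equivalent of codes, and the categories list is what maps an integer back to its label.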