Data Preparation in SPICE
This tutorial covers how to prepare and manipulate datasets for use with SPICE. We’ll explore the core data structures and utilities that SPICE provides for handling experimental data.
Prerequisites
- Basic understanding of Python and PyTorch
- SPICE installed in your environment
- Basic familiarity with cognitive science experimental data
The `DatasetRNN` Class
The `DatasetRNN` class is the fundamental data structure in SPICE for handling experimental data. It inherits from PyTorch's `Dataset` class and is specifically designed for working with sequential data.
```python
from spice.resources.rnn_utils import DatasetRNN
import torch

# Dimensions of an example dataset
n_sessions, n_timesteps, n_features, n_actions = 10, 100, 4, 2

# Example of creating a DatasetRNN instance
xs = torch.zeros((n_sessions, n_timesteps, n_features))  # Input features
ys = torch.zeros((n_sessions, n_timesteps, n_actions))   # Target actions

dataset = DatasetRNN(
    xs=xs,
    ys=ys,
    normalize_features=(0, 1),  # Optional: normalize specific feature dimensions
    sequence_length=50,         # Optional: set a fixed sequence length
    stride=1,                   # Optional: stride for sequence creation
    device='cuda',              # Optional: specify the torch device
)
```
Key Features of `DatasetRNN`
- Shape Requirements:
  - `xs`: shape (n_sessions, n_timesteps, n_features)
  - `ys`: shape (n_sessions, n_timesteps, n_actions)
- Automatic Feature Normalization:
  - Specify which feature dimensions to normalize using `normalize_features`
  - Normalization is performed per feature to the range [0, 1] (see the sketch after this list)
- Sequence Processing:
  - Optional fixed sequence length with `sequence_length`
  - Configurable stride for sequence creation
  - Automatic handling of variable-length sequences
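For intuition, here is a minimal sketch of what per-feature min-max normalization to [0, 1] amounts to. This illustrates the behavior described above; it is not SPICE's internal implementation:

```python
import torch

# One session, 3 timesteps, 2 features on very different scales
xs = torch.tensor([[[2.0, 10.0], [4.0, 20.0], [6.0, 30.0]]])

for dim in (0, 1):  # feature dimensions to normalize
    col = xs[..., dim]
    xs[..., dim] = (col - col.min()) / (col.max() - col.min())

print(xs)  # each normalized feature now lies in [0, 1]
```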
Converting Experimental Data
SPICE provides a utility function `convert_dataset()` that transforms experimental data stored in a CSV file into the `DatasetRNN` format. This function handles several aspects of data preprocessing and normalization:
```python
from spice.utils.convert_dataset import convert_dataset

# Convert a CSV file containing experimental data
dataset, experiment_list, df, dynamics = convert_dataset(
    file="experiment_data.csv",      # Path to CSV file
    device=None,                     # Optional: PyTorch device
    sequence_length=None,            # Optional: fixed sequence length
    df_participant_id='session',     # Column name for participant/session ID
    df_block='block',                # Column name for block information
    df_experiment_id='experiment',   # Column name for experiment ID
    df_choice='choice',              # Column name for choices
    df_reward='reward',              # Column name for rewards
    additional_inputs=['context'],   # Optional: additional input columns
)
```
What `convert_dataset()` Does
- Data Loading and Preprocessing:
  - Loads the CSV file into a pandas DataFrame
  - Replaces all NaN values with -1
  - Maintains a copy of the original DataFrame
- ID and Block Processing:
  - Converts participant IDs to numeric values (0-based indexing)
  - Similarly processes experiment IDs if present
  - Normalizes block numbers to the [0, 1] range if present
- Choice Processing (see the sketch after this list):
  - Shifts choices to start from 0
  - Creates one-hot encoded representations
- Reward Processing:
  - Normalizes rewards to the [0, 1] range
  - Handles both standard and counterfactual rewards
  - Supports multiple reward columns (e.g., 'reward_0', 'reward_1', etc.)
- Additional Features:
  - Processes any specified additional input columns
  - Maintains proper alignment of all features
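As a rough illustration of the choice and reward steps above, this is what the shifting, one-hot encoding, and normalization amount to in plain pandas/NumPy. It is a sketch of the described behavior with hypothetical data, not the library's actual code:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: 1-indexed choices, rewards on an arbitrary scale
raw = pd.DataFrame({'choice': [1, 2, 2, 1], 'reward': [0, 5, 10, 5]})

choices = raw['choice'].to_numpy() - raw['choice'].min()  # shift choices to start at 0
onehot = np.eye(int(choices.max()) + 1)[choices]          # one-hot encode -> shape (4, 2)
rewards = (raw['reward'] - raw['reward'].min()) / (
    raw['reward'].max() - raw['reward'].min()             # scale rewards to [0, 1]
)

print(onehot)
print(rewards.to_numpy())  # [0.0, 0.5, 1.0, 0.5]
```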
Return Values
The function returns a tuple of four elements:
- `dataset`: A DatasetRNN instance containing:
  - Input features (xs): choices, rewards, blocks, IDs, and additional inputs
  - Target actions (ys): one-hot encoded next choices
- `experiment_list`: A list of BanditSession objects containing:
  - Raw choices and rewards
  - Session information
  - Trial counts
- `df`: The original DataFrame, preserved for reference
- `dynamics`: A tuple of arrays containing behavioral metrics:
  - Choice probabilities
  - Action values
  - Reward values
  - Choice values (note: these are only populated for datasets generated with specific tools)
Example Usage with Additional Features
```python
# Example with multiple features and counterfactual rewards
dataset, experiments, df, dynamics = convert_dataset(
    file="experiment_data.csv",
    device='cuda',
    df_participant_id='subject_id',
    df_block='block_num',
    df_experiment_id='condition',
    df_choice='chosen_option',
    df_reward='obtained_reward',
    additional_inputs=['context', 'stimulus'],
)

# Check the shape of the processed data
print(f"Input features shape: {dataset.xs.shape}")
print(f"Target actions shape: {dataset.ys.shape}")
print(f"Number of experiments: {len(experiments)}")
```
Splitting Datasets
SPICE provides two main methods for splitting datasets for training and testing:
1. Splitting Along Time Dimension
This method splits each session’s data into training and testing sets based on time:
```python
import torch
from spice.resources.rnn_utils import split_data_along_timedim

# Split the dataset with 80% of timesteps for training
train_dataset, test_dataset = split_data_along_timedim(
    dataset=dataset,
    split_ratio=0.8,
    device=torch.device('cpu'),
)
```
This approach:
- Preserves session structure
- Splits each session at the same relative timepoint (sketched below)
- Handles variable-length sessions appropriately
- Maintains padding (-1) for shorter sessions
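Conceptually, the split slices every session at the same relative timestep. A minimal shape-level sketch of the idea (an illustration, not the library function itself):

```python
import torch

xs = torch.randn(5, 100, 4)           # 5 sessions, 100 timesteps, 4 features
t_split = int(0.8 * xs.shape[1])      # split index along the time dimension

train_xs, test_xs = xs[:, :t_split], xs[:, t_split:]
print(train_xs.shape, test_xs.shape)  # torch.Size([5, 80, 4]) torch.Size([5, 20, 4])
```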
2. Splitting Along Session Dimension
For splitting based on participants/sessions:
```python
# Assuming dataset.xs and dataset.ys contain your data
n_sessions = dataset.xs.shape[0]
split_idx = int(0.8 * n_sessions)

# Create the training dataset from the first 80% of sessions
train_dataset = DatasetRNN(
    xs=dataset.xs[:split_idx],
    ys=dataset.ys[:split_idx],
    device=dataset.device,
)

# Create the test dataset from the remaining sessions
test_dataset = DatasetRNN(
    xs=dataset.xs[split_idx:],
    ys=dataset.ys[split_idx:],
    device=dataset.device,
)
```
This approach:
- Keeps entire sessions together
- Useful for testing generalization across participants
- Maintains all temporal information within sessions
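One caveat: if sessions in the file are ordered (for example, by condition), a contiguous 80/20 slice like the one above can bias the split. A small sketch of shuffling sessions first, using `torch.randperm` (an illustration, not a SPICE utility):

```python
import torch

perm = torch.randperm(dataset.xs.shape[0])  # random permutation of session indices
xs_shuffled, ys_shuffled = dataset.xs[perm], dataset.ys[perm]
# ...then slice xs_shuffled / ys_shuffled into train and test sets as above
```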
Best Practices
- Data Preprocessing:
  - Always check for missing values in your raw data (see the snippet below)
  - Ensure consistent data types across sessions
  - Normalize features when appropriate
  - Use the `sequence_length` parameter for long sequences
- Splitting Strategy:
  - Use time-based splits when testing temporal generalization
  - Use session-based splits when testing participant generalization
  - Consider your research question when choosing a split method
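For the first point, a quick pandas check (using the example filename from above) before calling `convert_dataset()` can catch problems early:

```python
import pandas as pd

raw = pd.read_csv("experiment_data.csv")
print(raw.isna().sum())  # missing values per column
print(raw.dtypes)        # column data types, to spot inconsistencies
```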
Example: Complete Data Pipeline
Here’s a complete example showing the entire data preparation pipeline:
```python
import torch
from spice.utils.convert_dataset import convert_dataset
from spice.resources.rnn_utils import split_data_along_timedim

# 1. Load and convert data
dataset, experiments, df, dynamics = convert_dataset(
    file="experiment_data.csv",
    device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'),
    sequence_length=100,  # Set a fixed sequence length
    df_participant_id='participant',
    df_choice='choice',
    df_reward='reward',
)

# 2. Split data for training and testing
train_dataset, test_dataset = split_data_along_timedim(
    dataset=dataset,
    split_ratio=0.8,
    device=dataset.device,
)

# 3. Verify data shapes
print(f"Training data shape: {train_dataset.xs.shape}")
print(f"Testing data shape: {test_dataset.xs.shape}")

# 4. Create DataLoaders for batch processing
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
)
test_loader = DataLoader(
    test_dataset,
    batch_size=32,
    shuffle=False,
)
```
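Assuming `DatasetRNN` yields `(xs, ys)` pairs per session, consistent with how it is constructed above, you can sanity-check one batch:

```python
# Draw a single batch; expected shapes: (batch, timesteps, features) and (batch, timesteps, actions)
xs_batch, ys_batch = next(iter(train_loader))
print(xs_batch.shape, ys_batch.shape)
```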
Next Steps
- Learn how to train a basic Rescorla-Wagner model
- Understand how to incorporate custom mechanisms such as forgetting
- Explore working with hardcoded equations