Add initial implementation of data generator for synthetic dataset creation #1

Open

wants to merge 31 commits into base: master
Conversation

@AirbornBird88 (Collaborator) commented on Sep 24, 2024

This pull request introduces the initial implementation of our data generator. Key features include:

  • Implement Markov chain-based sequence generation
  • Add functions for generating initial and transition probabilities
  • Create dataset generation functionality
  • Include joint and conditional probability calculations
  • Update scripts/Project.toml with new dependencies

The initial commit sets up the foundation of the pipeline. The data generator will be used to create synthetic datasets for testing (Tree)RNN models.
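For orientation, a minimal sketch of what a Markov chain-based generator like this could look like (the function names and signatures below are illustrative, not necessarily those used in data_generator.jl):

```julia
# Illustrative sketch only; names and signatures are assumptions, not the PR's API.

# Random initial distribution over n states.
function generate_initial_probs(n::Int)
    p = rand(n)
    return p ./ sum(p)
end

# Row-stochastic transition matrix: P[i, j] = p(next = j | current = i).
function generate_transition_probs(n::Int)
    P = rand(n, n)
    return P ./ sum(P, dims = 2)
end

# Sample one index from a probability vector via its cumulative distribution.
sample_categorical(p) = min(searchsortedfirst(cumsum(p), rand()), length(p))

# Generate a single sequence of length len from the chain.
function generate_sequence(init, P, len)
    s = sample_categorical(init)
    seq = [s]
    for _ in 2:len
        s = sample_categorical(P[s, :])
        push!(seq, s)
    end
    return seq
end

# A synthetic dataset is simply a collection of such sequences.
generate_dataset(init, P, nseq, len) = [generate_sequence(init, P, len) for _ in 1:nseq]
```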

TODO:

  • Review probability and dataset generation logic
  • Add RNN/SSM model definition
  • Implement (mini)batch mechanism
  • Add training and main pipeline script
  • Implement MLE and RNN sequence generation
  • Add unit tests and documentation

@AirbornBird88 added the enhancement, help wanted, and good first issue labels on Sep 24, 2024
@AirbornBird88 self-assigned this on Sep 24, 2024
@AirbornBird88 marked this pull request as draft on September 24, 2024
@AirbornBird88 marked this pull request as ready for review on September 24, 2024
- Implement RNNCell and RNN types and their initialization
- Add forward pass functions for RNNCell and full RNN model
- Add function for data preprocessing
- Add training and evaluation procedures
- Update main functions in rnn_model and data_generator files.

The RNN model BOB will be used to learn on synthetic datasets.
Its purpose is to serve as an MVP for testing autoregressive
(Tree)RNN models.
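As a rough illustration of the RNNCell / RNN structure this commit describes (a hand-rolled sketch under assumed names, not the actual contents of rnn_model.jl):

```julia
using Flux  # only for the glorot_uniform initializer

# Minimal Elman-style cell: h' = tanh(Wx * x + Wh * h + b)
struct SimpleRNNCell
    Wx::Matrix{Float32}
    Wh::Matrix{Float32}
    b::Vector{Float32}
end

SimpleRNNCell(in::Int, hidden::Int) = SimpleRNNCell(
    Flux.glorot_uniform(hidden, in),
    Flux.glorot_uniform(hidden, hidden),
    zeros(Float32, hidden),
)

(cell::SimpleRNNCell)(h, x) = tanh.(cell.Wx * x .+ cell.Wh * h .+ cell.b)

# Unroll the cell over a sequence of input vectors, returning all hidden states.
function run_rnn(cell::SimpleRNNCell, xs; h0 = zeros(Float32, size(cell.Wh, 1)))
    h = h0
    states = Vector{Vector{Float32}}()
    for x in xs
        h = cell(h, x)
        push!(states, h)
    end
    return states, h
end
```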
AirbornBird88 and others added 21 commits September 29, 2024 11:37
- Add sequence indices to data preparation and training loop
- Implement RNN state reset between sequences within a batch
- Attempt to maintain hidden state across sequences (not yet solved)
- Update training function to process sequences individually
- Known issues:
  - Zygote differentiation error with mutable RNN state
  - Inter-batch hidden state preservation not implemented
- Implement stateful RNN, GRU, and LSTM models
- Add reset functionality to handle sequence boundaries
- Update training loop to maintain state across batches
- Fix Zygote compatibility issues in reset function
- Prepare groundwork for generative modeling and MLE
- Add Revise.jl for automatic code reloading in REPL

This commit addresses the sequence boundary problem, lays the
foundation for more advanced sequence modeling tasks, and improves
the development workflow by allowing code changes without REPL restarts.
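One way the stateful wrapper and its Zygote-friendly reset could be arranged (a sketch under assumed names, not necessarily this PR's solution): within a loss evaluation the hidden state is threaded functionally so gradients flow through the recurrence, while the mutable field only carries state across sequences/batches and is only ever rebound, never mutated in place.

```julia
using Flux, Zygote

# Stateful wrapper around any cell callable as (h, x) -> h'.
mutable struct Stateful{C,S}
    cell::C
    state::S
    init::S
end

Stateful(cell, h0) = Stateful(cell, copy(h0), h0)

# Reset at sequence boundaries: rebind the field instead of mutating an array in place.
reset!(m::Stateful) = (m.state = copy(m.init); m)

# Loss over one sequence: the hidden state is a local variable, so Zygote can
# differentiate the recurrence; persisting it for the next batch happens outside
# the gradient via Zygote.ignore.
function sequence_loss(m::Stateful, xs, ys)
    h = m.state
    total = 0.0f0
    for (x, y) in zip(xs, ys)
        h = m.cell(h, x)
        total += Flux.mse(h, y)
    end
    Zygote.ignore() do
        m.state = h   # carry state across batches without differentiating the write
    end
    return total / length(xs)
end
```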
- Fix lstm_model.jl:
  - Combine h and c into single state vector in LSTM struct
  - Ensure compatibility with train function and other models (GRU, RNN)

- Remove outdated pipeline.jl file

- Improve pipeline_index.jl:
  - Add random baseline model for comparison
  - Implement loss calculation per character/token instead of per sequence
  - Improve loss averaging to account for variable sequence lengths
  - Add print statements for initial probabilities and transition matrix

These changes improve model consistency, provide better benchmarking,
and ensure more accurate loss and probability distribution calculations.
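The per-character loss averaging described here might be computed along these lines (illustrative only; the reset! hook and the (x, y) pair format are assumptions of this sketch):

```julia
using Flux

reset!(model) = model   # placeholder; a stateful model would clear its hidden state here

# Average loss per character/token across variable-length sequences: sum the per-token
# losses and divide by the total token count, instead of averaging per-sequence means
# (which would over-weight short sequences).
function per_token_loss(model, sequences)
    total_loss = 0.0f0
    total_tokens = 0
    for seq in sequences              # seq is a vector of (x, y) one-hot pairs
        reset!(model)                 # clear state at the sequence boundary
        for (x, y) in seq
            ŷ = model(x)              # model output is a probability vector here
            total_loss += Flux.crossentropy(ŷ, y)
            total_tokens += 1
        end
    end
    return total_loss / total_tokens
end
```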
Add files test_data_generator.jl and test_prepare_data_function.jl
for testing the data generation process and the prepare_data function, respectively
Add an unsupervised (MLE) pipeline file for density estimation,
testing sequential cells and models.
- Implement sequence generation based on model's learned distribution
- Add functionality to process and learn from Tiny Shakespeare dataset
- Update main pipeline to handle real text data
- Implement extract_model_distribution to obtain learned transition matrix
- Create compare_distributions for KL divergence calculation
- Enable quantitative comparison between model's learned distribution and true distribution
and model distributions
 - Add function to extract empirical (data) transition probabilities
 - Extend functions to extract and compare transition probabilities
 - Calculate KL divergences between distributions for each character/state (see the sketch after this commit)
 - Display detailed results for each character/state
 - Rename unsupervised_pipeline.jl files
 - Delete target Y from pipeline
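The per-character/state KL comparison of the empirical and model transition matrices reduces to a few lines of plain Julia (names here are assumed, not the PR's):

```julia
# KL divergence between two discrete distributions; eps guards against log(0).
function kl_divergence(p::AbstractVector, q::AbstractVector; eps = 1f-8)
    p = (p .+ eps) ./ sum(p .+ eps)
    q = (q .+ eps) ./ sum(q .+ eps)
    return sum(p .* log.(p ./ q))
end

# One KL value per character/state: compare row i of the empirical (data) and model
# transition matrices, i.e. the conditional distributions p(next | current = i).
per_state_kl(P_data, P_model) =
    [kl_divergence(P_data[i, :], P_model[i, :]) for i in 1:size(P_data, 1)]
```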
- Remove softmax from model architectures for more flexibility
- Move probability transformations to pipeline level
- Use probabilities in log space (log transformation)
- Replace crossentropy with logitcrossentropy
- Use log probabilities in data generation

This change improves numerical stability by working in log space and
separates probability transformations from model architecture. Models
now output raw logits, with final transformations (softmax, etc.)
handled at the pipeline level.
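In practice the split looks roughly like this (a sketch of the pipeline-level usage, with placeholder function names):

```julia
using Flux

# The model returns raw logits; the loss applies the numerically stable fused
# softmax + cross-entropy at the pipeline level.
pipeline_loss(model, x, y_onehot) = Flux.logitcrossentropy(model(x), y_onehot)

# Probability transformations are applied explicitly only where they are needed,
# e.g. for generation or inspection.
next_token_probs(model, x)    = Flux.softmax(model(x))
next_token_logprobs(model, x) = Flux.logsoftmax(model(x))
```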
- Add StackedModel to support multiple recurrent layers
- Use composition of existing models (RNN, GRU, LSTM) for better modularity
- Maintain state management within individual layers
- Support arbitrary combinations of layer types and sizes
- Add comprehensive tests for different configurations
- Update tests to reflect raw-output handling (softmax removed from model definitions)

The StackedModel allows combining multiple recurrent layers while
leveraging existing model implementations to ensure compatibility with
the existing pipeline. We follow the KISS principle for better
maintainability and reliability.
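A composition-based StackedModel in this spirit can be very small (a sketch, assuming each wrapped layer is callable and defines its own reset!):

```julia
using Flux

# Stack of existing recurrent models; each layer keeps its own state, the stack
# only chains forward passes and delegates resets.
struct StackedModel{T<:Tuple}
    layers::T
end

StackedModel(layers...) = StackedModel(layers)

# Forward pass: feed each layer's output to the next.
(m::StackedModel)(x) = foldl((h, layer) -> layer(h), m.layers; init = x)

# Delegate state resets to the individual layers.
reset!(m::StackedModel) = foreach(reset!, m.layers)

Flux.@functor StackedModel   # lets Flux collect the trainable parameters of all layers
```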
- Add recursive tree probability computation with DFS traversal
- Implement proper conditioning via RNN state:
  * T_p for array nodes from the DFS traversal
  * T_{<w} for the chain rule in product nodes
  * State reset for bag node independence
- Handle different node types: Array, Bag, Product
- Add tracking of array nodes and DFS path for debugging
- Support left-to-right and right-to-left traversal
Implement input adapter to handle variable-sized input vectors in tree processing:
- Add InputAdapter struct with dynamic preprocessor mapping
- Implement state delegation to base model
- Add automatic dimension adaptation through Dense layer
- Support end-to-end optimization with Flux
- Handle both vector and matrix inputs
- Add proper state reset functionality
- Add new separate folder for sequential models
- Move sequential models to new folder

The adapter maintains model state consistency while allowing processing
of heterogeneous data with different vector sizes in the tree structure.
The new folder for sequential models was added to separate responsibilities.
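A sketch of the adapter idea (construction, field names, and the up-front mapping are assumptions of this sketch; the point is one Dense preprocessor per input dimension, with state left to the base model):

```julia
using Flux

# Maps variable-sized input vectors to the fixed dimension the base model expects.
struct InputAdapter{M}
    preprocessors::Dict{Int,Dense}   # one Dense layer per observed input size
    base::M                          # the stateful sequential model being wrapped
end

# Build one preprocessor per expected input size up front, so every parameter is
# known from the start.
InputAdapter(input_sizes, hidden::Int, base) =
    InputAdapter(Dict(n => Dense(n => hidden) for n in input_sizes), base)

function (a::InputAdapter)(x::AbstractVecOrMat)
    pre = a.preprocessors[size(x, 1)]   # pick the adapter matching the input dimension
    return a.base(pre(x))               # dimension-adapted input goes to the base model
end

reset!(a::InputAdapter) = reset!(a.base)   # state handling is delegated to the base model

# NOTE: collecting trainable parameters from the Dict field may need explicit handling
# (e.g. gathering values(a.preprocessors) alongside a.base), depending on the Flux version.
```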
- Add prediction and processing phases for array node computation
- Fix bag node processing to respect AlignedBags indices
- Correct index tracking for descendant array nodes
- Remove duplicate processing in nested bag structures
- Add delayed processing of array nodes after predictions
- Update relative paths to reflect folder structure changes

The changes ensure correct autoregressive modeling by separating
prediction and state updates, while fixing the processing of bag
node children according to their aligned indices.
…apter

- Add data preparation utilities to convert OneHot to Float32
- Initialize MUTAG pipeline with data loading and preprocessing
- Setup basic tree structure for autoregressive modeling
- Add tests for variable input handling and edge cases
- Implement GRU model for batch processing of embedding matrices
- Add log probability computation for categorical and gaussian data
- Support aligned bags expansion for hierarchical conditioning
- Add probability aggregation across tree levels
- Maintain compatibility with existing supervised tree structure

The changes enable autoregressive probability modeling while
preserving the original TreeRNN architecture and batch processing
capabilities. The GRU parameters could be made trainable alongside
the supervised tree parameters.
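The categorical and gaussian log probability computations mentioned above could look like this sketch (unit variance for the gaussian case is assumed purely for illustration):

```julia
using Flux

# Categorical leaves: the model emits logits, the observation y is one-hot;
# the log probability of the observed symbol is the matching entry of logsoftmax.
categorical_logprob(logits, y_onehot) = sum(Flux.logsoftmax(logits) .* y_onehot)

# Gaussian leaves: the model emits a mean μ; σ² = 1 is an assumption of this sketch.
gaussian_logprob(μ, x; σ² = 1.0f0) =
    sum(@. -0.5f0 * (log(2f0 * Float32(π) * σ²) + (x - μ)^2 / σ²))
```

Per-leaf log probabilities of this kind are then summed by product nodes and aggregated per bag as the tree forward pass bubbles them up.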
- Implement GRU and GRUCell for sequential processing of embeddings
- Add log probability computation for both categorical and gaussian data
- Support batch processing with state expansion/reduction for bag nodes
- Extend tree forward passes to handle probability calculations
- Add state management for conditional probability modeling
- Add reduce_state() for efficient hidden state reduction
- Add aggregate_log_probs() for batch log probability processing
- Update TreeRNN bag node forward pass to use new functions
- Replace inline mapreduce operations with pre-allocated solutions
- Reduce memory allocations with optimized array operations
- Move probability layers from sequential model to tree nodes
 * Each array node gets its own probability layer
 * Add prob_layer field to Tree struct
 * Add inner constructor for proper type handling
 * Pass hidden state through tree forward passes
 * Avoiding mutating array issues

- Implement MLE training objective
 * Log probabilities bubble up through tree hierarchy
 * Product nodes sum log probs of children
 * Bag nodes aggregate by bags
 * Calculate -mean(log_probs) in new unsupervised objective function
 * Replace objective function in gd! function

- Refactor state handling
 * Avoid mutating arrays in expand/reduce hidden state and log probs calculation
 * Use mapreduce for log prob aggregation and hidden state reduction

This change improves model trainability by:
1. Making probability parameters part of tree structure
2. Providing MLE objective
3. Ensuring proper gradient flow
4. Avoiding mutating array issues
- Simplify product node embedding aggregation by removing unnecessary NamedTuple conversion
- Add time tracking for training epochs and total runtime
- Update evaluation to handle different batch sizes correctly
- Add proper state resetting for validation/test set processing
- Add basic unsupervised training pipeline with log likelihood monitoring

Performance improvements focus on reducing memory allocations
and adding proper timing measurements for future optimizations.
Training pipeline now supports unsupervised learning with proper
model evaluation across different dataset splits.
- Remove TreeGRUCell processing step in tree forward pass
- Feed seq_model directly with dense layer embeddings
- Add patience-based early stopping mechanism
- Add comprehensive training metrics tracking:
  * Log likelihoods (train/val/test)
  * Training time per epoch
  * Best model checkpointing
  * Track model improvement

The simplified architecture achieves better initial performance
while maintaining good convergence. Early stopping prevents
overfitting and saves computation by halting training when
validation performance plateaus for the specified patience period.
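The patience mechanism could be structured roughly like this (a sketch; the train_epoch!/evaluate callables and the checkpointing hook are placeholders):

```julia
# Patience-based early stopping: track the best validation log likelihood and stop
# once it has not improved for `patience` consecutive epochs.
function train_with_early_stopping!(train_epoch!, evaluate; epochs = 100, patience = 10)
    best_val = -Inf
    since_improvement = 0
    for epoch in 1:epochs
        t = @elapsed train_epoch!(epoch)        # one pass over the training data
        val_ll = evaluate()                     # validation log likelihood
        if val_ll > best_val
            best_val = val_ll
            since_improvement = 0
            # checkpoint the best model here (e.g. a deepcopy or BSON.@save)
        else
            since_improvement += 1
        end
        @info "epoch $epoch" val_ll epoch_time = t
        since_improvement >= patience && break  # stop when validation plateaus
    end
    return best_val
end
```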
Add generation functionality to sample from trained model:
- Add prob_tree.jl with generation implementation
- Implement template-guided structure generation
- Support both categorical and continuous data sampling
- Maintain hierarchical conditional dependencies
- Use trained prob layers for content generation

The generation process:
- Uses existing Mill structure as template
- Leverages trained TreeRecur model's distributions
- Samples content respecting learned dependencies
- Maintains original bag structure for now

Technical details:
- Handle both OneHotArray and Float32 arrays
- Use proper conditioning through seq_model state
- Transform samples through embedding space
Implement HashTree structure for efficient molecule comparison:
- Add recursive hash computation for Mill nodes (array/bag/product)
- Handle categorical data directly and float32 with rounding
- Support Mill-like indexing for nested hash access

Add analysis tools (sketched below):
- Compute uniqueness/novelty metrics using Set operations
- Add frequency analysis and distribution comparison via Jensen-Shannon divergence
- Add exponential learning rate decay for training stability

Remaining issues:
- Improve float32 (real-valued data) handling beyond simple rounding
- Fine-tune learning rate schedule parameters
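The uniqueness/novelty metrics and the Jensen-Shannon comparison referenced above reduce to a few lines of plain Julia (a sketch; the decay schedule parameters are placeholders):

```julia
# Fraction of generated samples that are mutually distinct (hashes as identities).
uniqueness(gen_hashes) = length(Set(gen_hashes)) / length(gen_hashes)

# Fraction of generated samples whose hash does not occur in the training data.
function novelty(gen_hashes, train_hashes)
    seen = Set(train_hashes)
    return count(h -> h ∉ seen, gen_hashes) / length(gen_hashes)
end

# Jensen-Shannon divergence between two discrete distributions p and q.
function js_divergence(p, q; eps = 1f-8)
    p = (p .+ eps) ./ sum(p .+ eps)
    q = (q .+ eps) ./ sum(q .+ eps)
    m = 0.5f0 .* (p .+ q)
    kl(a, b) = sum(a .* log.(a ./ b))
    return 0.5f0 * kl(p, m) + 0.5f0 * kl(q, m)
end

# Exponential learning-rate decay per epoch (decay factor is illustrative).
decayed_lr(lr0, epoch; decay = 0.97f0) = lr0 * decay^(epoch - 1)
```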