Add initial implementation of data generator for synthetic dataset creation #1

Open

wants to merge 31 commits into base: master
Conversation

@AirbornBird88 (Collaborator) commented on Sep 24, 2024

This pull request introduces the initial implementation of our data generator. Key features include:

  • Implement Markov chain-based sequence generation
  • Add functions for generating initial and transition probabilities
  • Create dataset generation functionality
  • Include joint and conditional probability calculations
  • Update scripts/Project.toml with new dependencies

The initial commit sets up the foundation of the pipeline. The data generator will be used to create synthetic datasets for testing (Tree)RNN models.
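For orientation, a minimal sketch of what a Markov chain-based generator like this could look like (the function names and signatures below are illustrative, not necessarily those used in data_generator.jl):

```julia
# Illustrative sketch only; names and signatures are assumptions, not the PR's API.

# Random initial distribution over n states.
function generate_initial_probs(n::Int)
    p = rand(n)
    return p ./ sum(p)
end

# Row-stochastic transition matrix: P[i, j] = p(next = j | current = i).
function generate_transition_probs(n::Int)
    P = rand(n, n)
    return P ./ sum(P, dims = 2)
end

# Sample one index from a probability vector via its cumulative distribution.
sample_categorical(p) = min(searchsortedfirst(cumsum(p), rand()), length(p))

# Generate a single sequence of length len from the chain.
function generate_sequence(init, P, len)
    s = sample_categorical(init)
    seq = [s]
    for _ in 2:len
        s = sample_categorical(P[s, :])
        push!(seq, s)
    end
    return seq
end

# A synthetic dataset is simply a collection of such sequences.
generate_dataset(init, P, nseq, len) = [generate_sequence(init, P, len) for _ in 1:nseq]
```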

TODO:

  • Review probability and dataset generation logic
  • Add RNN/SSM model definition
  • Implement (mini)batch mechanism
  • Add training and main pipeline script
  • Implement MLE and RNN sequence generation
  • Add unit tests and documentation

@AirbornBird88 added the enhancement, help wanted, and good first issue labels on Sep 24, 2024
@AirbornBird88 self-assigned this on Sep 24, 2024
@AirbornBird88 marked this pull request as draft on September 24, 2024
@AirbornBird88 marked this pull request as ready for review on September 24, 2024
- Implement RNNCell and RNN types and their initialization
- Add forward pass functions for RNNCell and full RNN model
- Add function for data preprocessing
- Add training and evaluation procedures
- Update main functions in rnn_model and data_generator files.

The RNN model BOB will be used to learn on synthetic datasets.
Its purpose is to serve as an MVP for testing autoregressive
(Tree)RNN models.
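As a rough illustration of the RNNCell / RNN structure this commit describes (a hand-rolled sketch under assumed names, not the actual contents of rnn_model.jl):

```julia
using Flux  # only for the glorot_uniform initializer

# Minimal Elman-style cell: h' = tanh(Wx * x + Wh * h + b)
struct SimpleRNNCell
    Wx::Matrix{Float32}
    Wh::Matrix{Float32}
    b::Vector{Float32}
end

SimpleRNNCell(in::Int, hidden::Int) = SimpleRNNCell(
    Flux.glorot_uniform(hidden, in),
    Flux.glorot_uniform(hidden, hidden),
    zeros(Float32, hidden),
)

(cell::SimpleRNNCell)(h, x) = tanh.(cell.Wx * x .+ cell.Wh * h .+ cell.b)

# Unroll the cell over a sequence of input vectors, returning all hidden states.
function run_rnn(cell::SimpleRNNCell, xs; h0 = zeros(Float32, size(cell.Wh, 1)))
    h = h0
    states = Vector{Vector{Float32}}()
    for x in xs
        h = cell(h, x)
        push!(states, h)
    end
    return states, h
end
```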
AirbornBird88 and others added 21 commits September 29, 2024 11:37
- Add sequence indices to data preparation and training loop
- Implement RNN state reset between sequences within a batch
- Attempt to maintain hidden state across sequences (not yet solved)
- Update training function to process sequences individually
- Known issues:
  - Zygote differentiation error with mutable RNN state
  - Inter-batch hidden state preservation not implemented
- Implement stateful RNN, GRU, and LSTM models
- Add reset functionality to handle sequence boundaries
- Update training loop to maintain state across batches
- Fix Zygote compatibility issues in reset function
- Prepare groundwork for generative modeling and MLE
- Add Revise.jl for automatic code reloading in REPL

This commit addresses the sequence boundary problem, lays the
foundation for more advanced sequence modeling tasks, and improves
the development workflow by allowing code changes without REPL restarts.
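One way the stateful wrapper and its Zygote-friendly reset could be arranged (a sketch under assumed names, not necessarily this PR's solution): within a loss evaluation the hidden state is threaded functionally so gradients flow through the recurrence, while the mutable field only carries state across sequences/batches and is only ever rebound, never mutated in place.

```julia
using Flux, Zygote

# Stateful wrapper around any cell callable as (h, x) -> h'.
mutable struct Stateful{C,S}
    cell::C
    state::S
    init::S
end

Stateful(cell, h0) = Stateful(cell, copy(h0), h0)

# Reset at sequence boundaries: rebind the field instead of mutating an array in place.
reset!(m::Stateful) = (m.state = copy(m.init); m)

# Loss over one sequence: the hidden state is a local variable, so Zygote can
# differentiate the recurrence; persisting it for the next batch happens outside
# the gradient via Zygote.ignore.
function sequence_loss(m::Stateful, xs, ys)
    h = m.state
    total = 0.0f0
    for (x, y) in zip(xs, ys)
        h = m.cell(h, x)
        total += Flux.mse(h, y)
    end
    Zygote.ignore() do
        m.state = h   # carry state across batches without differentiating the write
    end
    return total / length(xs)
end
```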
- Fix lstm_model.jl:
  - Combine h and c into single state vector in LSTM struct
  - Ensure compatibility with train function and other models (GRU, RNN)

- Remove outdated pipeline.jl file

- Improve pipeline_index.jl:
  - Add random baseline model for comparison
  - Implement loss calculation per character/token instead of per sequence
  - Improve loss averaging to account for variable sequence lengths
  - Add print statements for initial probabilities and transition matrix

These changes improve model consistency, provide better benchmarking,
and ensure more accurate loss and probability distribution calculations.
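The per-character loss averaging described here might be computed along these lines (illustrative only; the reset! hook and the (x, y) pair format are assumptions of this sketch):

```julia
using Flux

reset!(model) = model   # placeholder; a stateful model would clear its hidden state here

# Average loss per character/token across variable-length sequences: sum the per-token
# losses and divide by the total token count, instead of averaging per-sequence means
# (which would over-weight short sequences).
function per_token_loss(model, sequences)
    total_loss = 0.0f0
    total_tokens = 0
    for seq in sequences              # seq is a vector of (x, y) one-hot pairs
        reset!(model)                 # clear state at the sequence boundary
        for (x, y) in seq
            ŷ = model(x)              # model output is a probability vector here
            total_loss += Flux.crossentropy(ŷ, y)
            total_tokens += 1
        end
    end
    return total_loss / total_tokens
end
```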
Add files test_data_generator.jl and test_prepare_data_function.jl
for testing the data generation process and the prepare_data function, respectively
Add an unsupervised (MLE) pipeline file for density estimation,
testing sequential cells and models.
- Implement sequence generation based on model's learned distribution
- Add functionality to process and learn from Tiny Shakespeare dataset
- Update main pipeline to handle real text data
- Implement extract_model_distribution to obtain learned transition matrix
- Create compare_distributions for KL divergence calculation
- Enable quantitative comparison between model's learned distribution and true distribution
and model distributions
 - Add function to extract empirical (data) transition probabilities
 - Extend functions to extract and compare transition probabilities
 - Calculate KL divergences between distributions for each character/state (see the sketch after this commit)
 - Display detailed results for each character/state
 - Rename unsupervised_pipeline.jl files
 - Delete target Y from pipeline
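The per-character/state KL comparison of the empirical and model transition matrices reduces to a few lines of plain Julia (names here are assumed, not the PR's):

```julia
# KL divergence between two discrete distributions; eps guards against log(0).
function kl_divergence(p::AbstractVector, q::AbstractVector; eps = 1f-8)
    p = (p .+ eps) ./ sum(p .+ eps)
    q = (q .+ eps) ./ sum(q .+ eps)
    return sum(p .* log.(p ./ q))
end

# One KL value per character/state: compare row i of the empirical (data) and model
# transition matrices, i.e. the conditional distributions p(next | current = i).
per_state_kl(P_data, P_model) =
    [kl_divergence(P_data[i, :], P_model[i, :]) for i in 1:size(P_data, 1)]
```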
- Remove softmax from model architectures for more flexibility
- Move probability transformations to pipeline level
- Use probabilities in log space (log transformation)
- Replace crossentropy with logitcrossentropy
- Use log probabilities in data generation

This change improves numerical stability by working in log space and
separates probability transformations from model architecture. Models
now output raw logits, with final transformations (softmax, etc.)
handled at the pipeline level.
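In practice the split looks roughly like this (a sketch of the pipeline-level usage, with placeholder function names):

```julia
using Flux

# The model returns raw logits; the loss applies the numerically stable fused
# softmax + cross-entropy at the pipeline level.
pipeline_loss(model, x, y_onehot) = Flux.logitcrossentropy(model(x), y_onehot)

# Probability transformations are applied explicitly only where they are needed,
# e.g. for generation or inspection.
next_token_probs(model, x)    = Flux.softmax(model(x))
next_token_logprobs(model, x) = Flux.logsoftmax(model(x))
```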
- Add StackedModel to support multiple recurrent layers
- Use composition of existing models (RNN, GRU, LSTM) for better modularity
- Maintain state management within individual layers
- Support arbitrary combinations of layer types and sizes
- Add comprehensive tests for different configurations
- Update tests to reflect raw-output handling (softmax removed from model definitions)

The StackedModel allows combining multiple recurrent layers while
leveraging existing model implementations to ensure compatibility with
the existing pipeline. We follow the KISS principle for better
maintainability and reliability.
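A composition-based StackedModel in this spirit can be very small (a sketch, assuming each wrapped layer is callable and defines its own reset!):

```julia
using Flux

# Stack of existing recurrent models; each layer keeps its own state, the stack
# only chains forward passes and delegates resets.
struct StackedModel{T<:Tuple}
    layers::T
end

StackedModel(layers...) = StackedModel(layers)

# Forward pass: feed each layer's output to the next.
(m::StackedModel)(x) = foldl((h, layer) -> layer(h), m.layers; init = x)

# Delegate state resets to the individual layers.
reset!(m::StackedModel) = foreach(reset!, m.layers)

Flux.@functor StackedModel   # lets Flux collect the trainable parameters of all layers
```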
- Add recursive tree probability computation with DFS traversal
- Implement proper conditioning via RNN state:
  * T_p for array nodes from the DFS traversal
  * T_{<w} for the chain rule in product nodes
  * State reset for bag node independence
- Handle different node types: Array, Bag, Product
- Add tracking of array nodes and DFS path for debugging
- Support left-to-right and right-to-left traversal
Implement input adapter to handle variable-sized input vectors in tree processing:
- Add InputAdapter struct with dynamic preprocessor mapping
- Implement state delegation to base model
- Add automatic dimension adaptation through Dense layer
- Support end-to-end optimization with Flux
- Handle both vector and matrix inputs
- Add proper state reset functionality
- Add new separate folder for sequential models
- Move sequential models to new folder

The adapter maintains model state consistency while allowing processing
of heterogeneous data with different vector sizes in the tree structure.
The new folder for sequential models was added to separate responsibilities.
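A sketch of the adapter idea (construction, field names, and the up-front mapping are assumptions of this sketch; the point is one Dense preprocessor per input dimension, with state left to the base model):

```julia
using Flux

# Maps variable-sized input vectors to the fixed dimension the base model expects.
struct InputAdapter{M}
    preprocessors::Dict{Int,Dense}   # one Dense layer per observed input size
    base::M                          # the stateful sequential model being wrapped
end

# Build one preprocessor per expected input size up front, so every parameter is
# known from the start.
InputAdapter(input_sizes, hidden::Int, base) =
    InputAdapter(Dict(n => Dense(n => hidden) for n in input_sizes), base)

function (a::InputAdapter)(x::AbstractVecOrMat)
    pre = a.preprocessors[size(x, 1)]   # pick the adapter matching the input dimension
    return a.base(pre(x))               # dimension-adapted input goes to the base model
end

reset!(a::InputAdapter) = reset!(a.base)   # state handling is delegated to the base model

# NOTE: collecting trainable parameters from the Dict field may need explicit handling
# (e.g. gathering values(a.preprocessors) alongside a.base), depending on the Flux version.
```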
- Add prediction and processing phases for array node computation
- Fix bag node processing to respect AlignedBags indices
- Correct index tracking for descendant array nodes
- Remove duplicate processing in nested bag structures
- Add delayed processing of array nodes after predictions
- Update relative paths to reflect folder structure changes

The changes ensure correct autoregressive modeling by separating
prediction and state updates, while fixing the processing of bag
node children according to their aligned indices.
…apter

- Add data preparation utilities to convert OneHot to Float32
- Initialize MUTAG pipeline with data loading and preprocessing
- Setup basic tree structure for autoregressive modeling
- Add tests for variable input handling and edge cases
- Implement GRU model for batch processing of embedding matrices
- Add log probability computation for categorical and gaussian data
- Support aligned bags expansion for hierarchical conditioning
- Add probability aggregation across tree levels
- Maintain compatibility with existing supervised tree structure

The changes enable autoregressive probability modeling while
preserving the original TreeRNN architecture and batch processing
capabilities. The GRU parameters could be made trainable alongside
the supervised tree parameters.
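The categorical and gaussian log probability computations mentioned above could look like this sketch (unit variance for the gaussian case is assumed purely for illustration):

```julia
using Flux

# Categorical leaves: the model emits logits, the observation y is one-hot;
# the log probability of the observed symbol is the matching entry of logsoftmax.
categorical_logprob(logits, y_onehot) = sum(Flux.logsoftmax(logits) .* y_onehot)

# Gaussian leaves: the model emits a mean μ; σ² = 1 is an assumption of this sketch.
gaussian_logprob(μ, x; σ² = 1.0f0) =
    sum(@. -0.5f0 * (log(2f0 * Float32(π) * σ²) + (x - μ)^2 / σ²))
```

Per-leaf log probabilities of this kind are then summed by product nodes and aggregated per bag as the tree forward pass bubbles them up.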
- Implement GRU and GRUCell for sequential processing of embeddings
- Add log probability computation for both categorical and gaussian data
- Support batch processing with state expansion/reduction for bag nodes
- Extend tree forward passes to handle probability calculations
- Add state management for conditional probability modeling
- Add reduce_state() for efficient hidden state reduction
- Add aggregate_log_probs() for batch log probability processing
- Update TreeRNN bag node forward pass to use new functions
- Replace inline mapreduce operations with pre-allocated solutions
- Reduce memory allocations with optimized array operations
- Move probability layers from sequential model to tree nodes
 * Each array node gets its own probability layer
 * Add prob_layer field to Tree struct
 * Add inner constructor for proper type handling
 * Pass hidden state through tree forward passes
 * Avoiding mutating array issues

- Implement MLE training objective
 * Log probabilities bubble up through tree hierarchy
 * Product nodes sum log probs of children
 * Bag nodes aggregate by bags
 * Calculate -mean(log_probs) in new unsupervised objective function
 * Replace objective function in gd! function

- Refactor state handling
 * Avoid mutating arrays in expand/reduce hidden state and log probs calculation
 * Use mapreduce for log prob aggregation and hidden state reduction

This change improves model trainability by:
1. Making probability parameters part of tree structure
2. Providing MLE objective
3. Ensuring proper gradient flow
4. Avoiding mutating array issues
- Simplify product node embedding aggregation by removing unnecessary NamedTuple conversion
- Add time tracking for training epochs and total runtime
- Update evaluation to handle different batch sizes correctly
- Add proper state resetting for validation/test set processing
- Add basic unsupervised training pipeline with log likelihood monitoring

Performance improvements focus on reducing memory allocations
and adding proper timing measurements for future optimizations.
Training pipeline now supports unsupervised learning with proper
model evaluation across different dataset splits.
- Remove TreeGRUCell processing step in tree forward pass
- Feed seq_model directly with dense layer embeddings
- Add patience-based early stopping mechanism
- Add comprehensive training metrics tracking:
  * Log likelihoods (train/val/test)
  * Training time per epoch
  * Best model checkpointing
  * Track model improvement

The simplified architecture achieves better initial performance
while maintaining good convergence. Early stopping prevents
overfitting and saves computation by halting training when
validation performance plateaus for the specified patience period.
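The patience mechanism could be structured roughly like this (a sketch; the train_epoch!/evaluate callables and the checkpointing hook are placeholders):

```julia
# Patience-based early stopping: track the best validation log likelihood and stop
# once it has not improved for `patience` consecutive epochs.
function train_with_early_stopping!(train_epoch!, evaluate; epochs = 100, patience = 10)
    best_val = -Inf
    since_improvement = 0
    for epoch in 1:epochs
        t = @elapsed train_epoch!(epoch)        # one pass over the training data
        val_ll = evaluate()                     # validation log likelihood
        if val_ll > best_val
            best_val = val_ll
            since_improvement = 0
            # checkpoint the best model here (e.g. a deepcopy or BSON.@save)
        else
            since_improvement += 1
        end
        @info "epoch $epoch" val_ll epoch_time = t
        since_improvement >= patience && break  # stop when validation plateaus
    end
    return best_val
end
```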
Add generation functionality to sample from trained model:
- Add prob_tree.jl with generation implementation
- Implement template-guided structure generation
- Support both categorical and continuous data sampling
- Maintain hierarchical conditional dependencies
- Use trained prob layers for content generation

The generation process:
- Uses existing Mill structure as template
- Leverages trained TreeRecur model's distributions
- Samples content respecting learned dependencies
- Maintains original bag structure for now

Technical details:
- Handle both OneHotArray and Float32 arrays
- Use proper conditioning through seq_model state
- Transform samples through embedding space
Implement HashTree structure for efficient molecule comparison:
- Add recursive hash computation for Mill nodes (array/bag/product)
- Handle categorical data directly and float32 with rounding
- Support Mill-like indexing for nested hash access

Add analysis tools (sketched below):
- Compute uniqueness/novelty metrics using Set operations
- Add frequency analysis and distribution comparison via Jensen-Shannon divergence
- Add exponential learning rate decay for training stability

Remaining issues:
- Improve float32 (real-valued data) handling beyond simple rounding
- Fine-tune learning rate schedule parameters
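The uniqueness/novelty metrics and the Jensen-Shannon comparison referenced above reduce to a few lines of plain Julia (a sketch; the decay schedule parameters are placeholders):

```julia
# Fraction of generated samples that are mutually distinct (hashes as identities).
uniqueness(gen_hashes) = length(Set(gen_hashes)) / length(gen_hashes)

# Fraction of generated samples whose hash does not occur in the training data.
function novelty(gen_hashes, train_hashes)
    seen = Set(train_hashes)
    return count(h -> h ∉ seen, gen_hashes) / length(gen_hashes)
end

# Jensen-Shannon divergence between two discrete distributions p and q.
function js_divergence(p, q; eps = 1f-8)
    p = (p .+ eps) ./ sum(p .+ eps)
    q = (q .+ eps) ./ sum(q .+ eps)
    m = 0.5f0 .* (p .+ q)
    kl(a, b) = sum(a .* log.(a ./ b))
    return 0.5f0 * kl(p, m) + 0.5f0 * kl(q, m)
end

# Exponential learning-rate decay per epoch (decay factor is illustrative).
decayed_lr(lr0, epoch; decay = 0.97f0) = lr0 * decay^(epoch - 1)
```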