Changelog

v0.2 (June 17, 2025)

Implemented Features

  • General Refactoring & Modularity:
    - Clear separation between models, optimizers, datasets, and experiment configs.
    - Modular imports for data, models, segmentation, metrics, and configs.
    - Procedural code replaced by class-based logic (notably, new Trainer class in train.py).

  • Dynamic Path Management:
    - Centralized path definitions via settings.
    - Output directories organized by DataPart enumerations.

  • Description Files:
    - Metadata for each dataset partition is saved as CSV during the ddos process.

  • Comet.ml Integration:
    - Centralized experiment tracking, logging parameters, metrics, and visualizations.
    - Tabular logging of training/validation metrics.
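
    A minimal logging sketch with the comet_ml SDK (the workspace/project names and the parameter and metric keys below are placeholders, not the repository's actual values):

      from comet_ml import Experiment

      # Credentials and project are normally taken from the environment or settings.
      experiment = Experiment(project_name="cluster-search", workspace="my-workspace")

      # Log run parameters once at startup.
      experiment.log_parameters({"model": "ResNet18", "optimizer": "Adam", "lr": 1e-4})

      # Log metrics per epoch; Comet renders them as charts and tables.
      for epoch in range(10):
          train_loss, val_acc = 0.5, 0.8  # stand-in values
          experiment.log_metrics({"train_loss": train_loss, "val_acc": val_acc}, epoch=epoch)

      experiment.end()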

  • Dataset & Dataloader Handling:
    - Dynamic splitting into train/validate/test, explicit DR5 handling.
    - Scalable data.create_dataloaders method.
    - Prepared IR and microwave datasets are now available for download and training.
    - Only IR or microwave data can be used per run; the two are not combined in this version.
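
    A rough sketch of the splitting logic behind data.create_dataloaders (the real signature, split ratios, and batch sizes may differ; torch.utils.data is assumed):

      import torch
      from torch.utils.data import DataLoader, random_split

      def create_dataloaders(dataset, batch_size=32, val_frac=0.1, test_frac=0.1):
          # Split one dataset into train/validate/test parts, then wrap each in a loader.
          n = len(dataset)
          n_val, n_test = int(n * val_frac), int(n * test_frac)
          lengths = [n - n_val - n_test, n_val, n_test]
          train_set, val_set, test_set = random_split(
              dataset, lengths, generator=torch.Generator().manual_seed(42)
          )
          return (
              DataLoader(train_set, batch_size=batch_size, shuffle=True),
              DataLoader(val_set, batch_size=batch_size),
              DataLoader(test_set, batch_size=batch_size),
          )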

  • Code Modularity:
    - Reusable components, class-based ClusterDataset, enums for DataPart and DataSource.
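
    A minimal sketch of the enums (the member names are assumptions; the repository may define different values):

      from enum import Enum

      class DataPart(Enum):
          TRAIN = "train"
          VALIDATE = "validate"
          TEST = "test"

      class DataSource(Enum):
          IR = "ir"                # infrared cutouts
          MICROWAVE = "microwave"  # microwave cutouts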

  • Configuration Management:
    - Centralized via settings module.
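
    A sketch of what a centralized settings module can look like (directory names and values here are illustrative, not the repository's actual constants):

      from pathlib import Path

      # Central constants imported by the data, model, metrics, and training modules.
      SEED = 42
      DATA_DIR = Path("data")
      MODELS_DIR = Path("models")
      RESULTS_DIR = Path("results")

      for path in (DATA_DIR, MODELS_DIR, RESULTS_DIR):
          path.mkdir(parents=True, exist_ok=True)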

  • Randomization & Reproducibility:
    - Consistent random seed (settings.SEED).
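
    Typical seeding of every random source from settings.SEED (a standard recipe, not necessarily the exact code used here):

      import random

      import numpy as np
      import torch

      def set_seed(seed: int) -> None:
          # Seed Python, NumPy, and PyTorch (CPU and all GPUs) for reproducible runs.
          random.seed(seed)
          np.random.seed(seed)
          torch.manual_seed(seed)
          torch.cuda.manual_seed_all(seed)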

  • Model Selection:
    - Models loaded via unified load_model.
    - New Baseline, CNN_MLP, and YOLO classification models (YOLO still malfunctions).
    - Fixed structure for all models except YOLO classification.
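
    One way a unified load_model can be organized is a name-to-constructor registry (the registry layout and constructors below are assumptions; only the model names come from this changelog):

      import torchvision

      # Hypothetical registry; the real load_model may build models differently.
      all_models = {
          "ResNet18": lambda: torchvision.models.resnet18(num_classes=2),
          "AlexNet": lambda: torchvision.models.alexnet(num_classes=2),
      }

      def load_model(name: str):
          if name not in all_models:
              raise ValueError(f"Unknown model: {name}")
          return all_models[name]()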

  • Optimizer Selection:
    - Added NAdam and RAdam; removed Adagrad, Adadelta, LBFGS.
    - Modular optimizer selection via all_optimizers dictionary.
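
    A sketch of the all_optimizers mapping (the classes are standard torch.optim; the exact dictionary contents are assumed from the optimizers named in this changelog):

      import torch.optim as optim

      all_optimizers = {
          "SGD": optim.SGD,
          "Adam": optim.Adam,
          "AdamW": optim.AdamW,
          "RMSprop": optim.RMSprop,
          "NAdam": optim.NAdam,
          "RAdam": optim.RAdam,
      }

      # Example: build the optimizer chosen on the command line.
      # optimizer = all_optimizers[args.optimizer](model.parameters(), lr=args.lr)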

  • Parameterization & Device Management:
    - All script parameters now configurable via CLI.
    - Automatic GPU/CPU assignment.
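
    A sketch of CLI parameterization with automatic device selection (the flag names follow the v0.1 Usage section below; the v0.2 flags may differ):

      import argparse

      import torch

      parser = argparse.ArgumentParser(description="Train a cluster classification model")
      parser.add_argument("--model", default="ResNet18")
      parser.add_argument("--epoch", type=int, default=10)
      parser.add_argument("--optimizer", default="Adam")
      parser.add_argument("--lr", type=float, default=1e-4)
      args = parser.parse_args()

      # Automatic GPU/CPU assignment.
      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")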

  • Training Pipeline:
    - Training logic encapsulated in Trainer.
    - Learning rate schedulers, dynamic scheduler step updates.
    - Improved checkpoint management and naming.
    - Learning rate finder integration (Trainer.find_lr).
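
    A skeletal view of the Trainer loop with a per-epoch scheduler step and descriptive checkpoint names (a sketch of the general pattern, not the repository's exact class):

      import torch
      import torch.nn.functional as F

      class Trainer:
          def __init__(self, model, optimizer, scheduler, device):
              self.model, self.optimizer = model.to(device), optimizer
              self.scheduler, self.device = scheduler, device

          def train(self, train_loader, epochs, name="model"):
              for epoch in range(epochs):
                  self.model.train()
                  for images, labels in train_loader:
                      images, labels = images.to(self.device), labels.to(self.device)
                      self.optimizer.zero_grad()
                      loss = F.cross_entropy(self.model(images), labels)
                      loss.backward()
                      self.optimizer.step()
                  self.scheduler.step()  # dynamic learning-rate update once per epoch
                  # Descriptive checkpoint name: model name plus epoch number.
                  torch.save(self.model.state_dict(), f"{name}_epoch{epoch + 1}.pth")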

  • Unified compute_all Method:
    - Centralized computation of loss, predictions, and accuracy.
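
    A sketch of what a unified compute_all can return for one batch (the name matches the changelog; the signature and return values are assumptions):

      import torch.nn.functional as F

      def compute_all(model, images, labels):
          # One place that yields the loss, predicted labels, and batch accuracy.
          logits = model(images)
          loss = F.cross_entropy(logits, labels)
          preds = logits.argmax(dim=1)
          accuracy = (preds == labels).float().mean().item()
          return loss, preds, accuracy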

  • Refined Training Workflow:
    - Metrics logged per epoch to tables and external tools.
    - Integrated validation and dedicated test method for predictions/accuracies in DataFrame format.

  • Predictor Class:
    - New class for inference, outputs DataFrame with predictions, probabilities, and metadata.
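
    A sketch of the Predictor output format (the DataFrame columns are assumptions based on the description above):

      import pandas as pd
      import torch
      import torch.nn.functional as F

      class Predictor:
          def __init__(self, model, device):
              self.model, self.device = model.to(device).eval(), device

          @torch.no_grad()
          def predict(self, dataloader):
              rows = []
              for images, labels in dataloader:
                  probs = F.softmax(self.model(images.to(self.device)), dim=1)
                  for prob, label in zip(probs.cpu(), labels):
                      rows.append({
                          "label": int(label),
                          "prediction": int(prob.argmax()),
                          "probability": float(prob.max()),
                      })
              return pd.DataFrame(rows)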

  • Performance Enhancements:
    - Automatic torch.cuda.empty_cache() calls to free GPU memory.
    - Dynamic learning-rate schedulers.

  • Segmentation Features:
    - segmentation.create_segmentation_plots for post-training visualizations.

  • Logging & Metrics:
    - Enhanced tracking, JSON-serialized metrics, combined CSV for benchmarking.
    - metrics.combine_metrics for aggregation.
    - New metrics: probability histograms, ROC/PR curves, confusion matrices, recall by redshift, training progress plots, metrics aggregation.
    - Multi-page PDFs for consolidated visualizations.
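
    Consolidated multi-page PDFs can be produced with matplotlib's PdfPages (a sketch; the actual figures come from the metrics functions listed above):

      import matplotlib.pyplot as plt
      from matplotlib.backends.backend_pdf import PdfPages

      def save_report(figures, path="metrics_report.pdf"):
          # Write each figure (ROC, PR, confusion matrix, ...) to its own PDF page.
          with PdfPages(path) as pdf:
              for fig in figures:
                  pdf.savefig(fig)
                  plt.close(fig)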

  • Removed Features:
    - Procedural training/validation loops, redundant val_results, static filepaths, and commented-out code.

Bugs

  • Not all optimizers work.

  • The YOLO classification model does not yet produce correct results.

v0.1 (May 7, 2024)

Implemented Features

main.py

  • Dataloader Creation: Utilizes data.py to create a dataloader and splits it into train and validation loaders.

  • Model Selection: Offers a list of easily modifiable models for training.

  • Optimizer Selection: Provides a choice of optimizers from a predefined list.

  • Training Script: Executes training with pre-selected models, number of epochs, optimizer, and its settings such as learning rate (lr) and momentum (mm).

  • Training Progress: Displays training progress and statistics during execution.

  • Results Saving: Saves the results of training for further analysis.

train.py

  • train(): Displays progress and statistics during training. Saves best model weights, checkpoints, and results.

  • validate(): Validates the model and provides statistics.

  • continue_training(): Allows resuming training from checkpoints.

data.py

  • Queries the GAIA star catalog asynchronously (read_gaia).

  • Reads data from ACT_DR5 and MaDCoWS catalogs, pre-downloading if necessary.

  • Creates positive and negative samples for training (createNegativeClassDR5, create_data_dr5).

  • Includes data transformations: resizing, rotation, reflection, and normalization.
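
  The transformation stack is typically expressed with torchvision.transforms; the exact sizes and normalization constants used in data.py are assumptions in this sketch:

    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.Resize((224, 224)),           # resizing
        transforms.RandomRotation(degrees=90),   # rotation
        transforms.RandomHorizontalFlip(),       # reflection
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5], std=[0.5]),  # normalization
    ])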

segmentation.py

  • Implements the image dataset class and segmentation map creation (create_samples, formSegmentationMaps, printSegMaps, printBigSegMap).

  • Predicts probabilities for images using predict_folder and predict_tests.

legacy_for_img.py

  • Downloads image cutouts using multithreading (grab_cutouts); see the sketch after this list.

  • Supports VLASS and unWISE image downloads.
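
  A sketch of the multithreaded download pattern behind grab_cutouts (the URLs, worker count, and helper names here are hypothetical):

    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    def download_one(task):
        url, path = task
        urllib.request.urlretrieve(url, path)  # fetch one cutout to disk

    def grab_cutouts(tasks, workers=8):
        # tasks: iterable of (url, destination_path) pairs.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            pool.map(download_one, tasks)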

metrics.py

  • Includes plotting functions: ROC curve (plot_roc_curve) and precision-recall curve (plot_precision_recall); see the sketch after this list.

  • modelPerformance: Calculates and displays metrics such as accuracy, precision, recall, and F1-score.
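
  The ROC plot can be built from scikit-learn primitives (a sketch of the general approach, not necessarily the repository's plot_roc_curve):

    import matplotlib.pyplot as plt
    from sklearn.metrics import auc, roc_curve

    def plot_roc(y_true, y_score):
        fpr, tpr, _ = roc_curve(y_true, y_score)
        plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
        plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
        plt.xlabel("False positive rate")
        plt.ylabel("True positive rate")
        plt.legend()
        plt.show()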

Known Issues

  • Not all optimizers are functional.

  • Incorrect loss and accuracy calculations with Adam optimizer.

  • segmentation.py: Paths for loaded weights need fixing.

  • data.py: Should ensure the data folder exists.

  • main.py: Models are loaded simultaneously, so memory usage should be optimized; some models need fixes (e.g., ViTL16).

Directory Structure

.
├── data
│   ├── DATA
│   │   ├── test_dr5
│   │   ├── test_madcows
│   │   ├── train
│   │   └── val
├── models
├── notebooks
├── results
├── screenshots
├── state_dict
└── trained_models

Performance

  • Initial startup: ~1h 9m 43s (Internet speed-dependent).

  • Subsequent runs: ~6m 55s.

Usage

Run the script with the following command:

python3 main.py --model MODEL_NAME --epoch NUM_EPOCH --optimizer OPTIMIZER --lr LR --mm MOMENTUM

Supported models:

  • ResNet18, AlexNet_VGG, SpinalNet_VGG, SpinalNet_ResNet.

Supported optimizers:

  • SGD, Adam, RMSprop, AdamW, Adadelta, DiffGrad.

Example output for AlexNet training with SGD optimizer:

Epoch 1/10. Training AlexNet with SGD optimizer: acc=0.683, loss=0.598
Validation Loss: 0.0075, Validation Accuracy: 0.7926
...
Epoch 10/10. Training AlexNet with SGD optimizer: acc=0.884, loss=0.286
Validation Loss: 0.0057, Validation Accuracy: 0.8445

Bugs

  • Not all optimizers work.

  • Loss and accuracy calculations incorrect with Adam optimizer.

  • Additional feedback from users is welcome.