de schamphelaere
ceselder
Recent Activity
updated a dataset 6 days ago: ceselder/qwen3-8b-nla-L24-finefineweb-100k
updated a model 6 days ago: ceselder/nanonla-l24-av-qwen3-8b
published a model 6 days ago: ceselder/nanonla-l24-av-qwen3-8b
LoRAcle OOD eval models
OOD model organisms for LoRAcle emergent-behavior eval — 4 Betley EM LoRAs + Cloud subliminal owl + EM training data.
CoT Oracle Paper Ablations And Baselines
All models used for my LessWrong post. I generally recommend using the latest adam oracle or the checkpoint confusingly labelled "no DPO".
- ceselder/adam-reupload-qwen3-8b-latentqa-cls-past-lens
  Text Generation • Updated • 42
- ceselder/adam-reupload-qwen3-8b-full-mix-synthetic-qa-v3-replace-lqa
  Text Generation • Updated • 2
- ceselder/cot-oracle-paper-ablation-adam-recipe-1layer
  Text Generation • Updated • 1
- ceselder/cot-oracle-paper-ablation-ours-1layer
  Text Generation • Updated • 2
CoT Oracle Training Data
Training datasets for the CoT Trajectory Oracle. Includes CoT corpora and QA datasets used for oracle fine-tuning.
LoRAcle — training data + eval
LoRAcle artifacts: a meta-model that reads LoRA weight deltas and verbalizes the behavioral change. Training data + OOD eval sub-collection.
Loracle: weight-reading model interpretability
Loracles + direction tokens for AuditBench, IA, OOD evals.
loracle
LoRA Oracles: detect hidden behaviors from weight geometry. Training data for loracle models.
CoT Oracle Evals
Eval datasets for the CoT Trajectory Oracle — detecting unfaithful chain-of-thought reasoning via activation trajectories.
- ceselder/cot-oracle-eval-decorative-cot
  Viewer • Updated • 56 • 16
- ceselder/cot-oracle-eval-rot13-reconstruction
  Viewer • Updated • 100 • 6
- ceselder/cot-oracle-truthfulqa-hint-admission-unverbalized
  Viewer • Updated • 11k • 47
- ceselder/cot-oracle-truthfulqa-hint-admission-verbalized
  Viewer • Updated • 4.38k • 49
Building Better Activation Oracles
Models and Datasets from Building Better Activation Oracles