My First Official AI Research Paper Accepted on SSRN

Posted by assemsabryy@reddit | LocalLLaMA | View on Reddit | 18 comments

Today, my research paper “Stable Training with Adaptive Momentum (STAM)” was officially accepted on SSRN — marking my first documented and official publication as an AI Researcher.

The paper introduces a new optimization algorithm for deep learning training that outperformed several popular optimizers in selected benchmarks, addressed multiple training stability challenges, and achieved up to 50% reduction in computational training cost in some experiments.

This is an important milestone in my research journey, and I’m excited to continue exploring optimization techniques for efficient and stable AI training.

You can read the paper here:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6699059

[-]

LegacyRemaster@reddit

Congrats well done!

[-]

stonetriangles@reddit

You tested it on an extremely small model with a single GPU. How can you be sure it scales with model size and distributed training?

[-]

rookan@reddit

We introduce STAM

We? It is only you, man

[-]

veinamond@reddit

Gj. However, I need to point out that since it is not peer-reviewed, it is not a full-fledged academic publication where acceptance means being chosen for publication. Not a NIPS/AAAI/IJCAI level even remotely.

[-]

westsunset@reddit

Why do you need to point it out?

[-]

veinamond@reddit

Why not? I work in academia half my life, I know exactly what I am talking about. Even peer-reviewed papers sometimes, let us say, contain unverified results that are hard to reproduce and do not correspond to the made claims. OP making it sound like this is an accomplishment when it is clearly not by normal standards for master/phd students.

[-]

KickLassChewGum@reddit

That, and SSRN is... a weird place to be pre-publishing something like this. It's giving off a "I couldn't get anyone to endorse me for arXiv" energy.

And also, from what I can gleam from skimming, the paper claims a "50% reduction in computational training cost" from... a training pipeline (a) that never sees a single GPU, (b) on a vocab of 64, (c) with a sequence length of 32, (d) evaluated by very specific synthetic tasks rather than vetted/generic datasets, (e) with "results" that are well within noise margins.

[-]

No_Swimming6548@reddit

I have no idea what that means but happy for you OP 🤗

[-]

Few_Painter_5588@reddit

Congrats OP!

[-]

AvidCyclist250@reddit

Congrats!! Affiliation says independent. Did you manage to publish this entirely on your own without formal training? Would give me some hope but for a philosophical paper with an attempt on moral grounding I’ve been not submitting for quite a while now because I’m afraid they’ll tell me to gtfo as a „layperson“ without direct ties to academia in that field at least.

[-]

nuclearbananana@reddit

Adaptive gradient methods such as Adam and AdamW fix the first-order momentum coefficient β 1 (typically 0.9) for all timesteps and all parameters, regardless of gradient dynamics. This causes overshooting in high-variance regimes and misses faster-convergence opportunities near stationarity. We propose Stable Training with Adaptive Momentum (STAM), which adapts β 1 based on a per-tensor gradient variance proxy derived from momentum residuals. High variance reduces β 1 to damp oscillations; low variance preserves or increases β 1 to accelerate convergence. We further introduce STAMLITE, a memory-efficient variant with only O(1) extra state per parameter-half the memory of full STAM and the same footprint as AdamW. Across 16 benchmark phases spanning synthetic tasks, image classification, language modeling, robustness tests, and hyperparameter sweeps, STAM/STAMLITE achieve top-3 performance on 10 of 12 scored phases (83%). Notably, STAMLITE wins outright on hyperparameter robustness benchmarks, demonstrating that adaptive β 1 makes optimization more forgiving to suboptimal hyperparameters. Both variants are implemented as drop-in Optax optimizers and available on PyPI (stam-optimizer).

Congrats OP

[-]

bonobomaster@reddit

Can you ELI5 me?

You probably need to dumb it down to caveman levels please.

[-]

hyperdynesystems@reddit

There's a momentum parameter (β 1) which is typically set to a static value but which can have benefits in different gradient scenarios (high-variance or near-stationarity) if it's varied up or down as the gradient changes.

Could be totally wrong because I haven't ever done any reading on training but that's what the abstract seems to say, from my reading of it.

[-]

En-tro-py@reddit

I'm probably not the best source, but I think it's along the line of - Keep moving in the accumulated direction, but reduce inertia when the gradient signal starts jerking around.

So converges faster in some cases and saves training time/cost.

[-]