Working on Optimizers and Model Architecture Scaling Schemes for Predictable and Efficient Training of Transformer Models with Hundreds of Billions of Parameters
Research Faculty | GradNova