From 87db20cc7ba90c90ad62d1a244a90c233662ad74 Mon Sep 17 00:00:00 2001
From: Carlos Gutierrez
Date: Sun, 16 Nov 2025 15:54:13 -0500
Subject: [PATCH] Add LLM test prompts and ignore inference benchmark outputs

---
 .gitignore           |   5 +++
 llm_test_prompts.txt | 100 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 105 insertions(+)
 create mode 100644 llm_test_prompts.txt

diff --git a/.gitignore b/.gitignore
index c01b932..4b4a317 100644
--- a/.gitignore
+++ b/.gitignore
@@ -201,6 +201,11 @@
 training_metrics.json
 wandb/
 tensorboard_logs/
+# Inference benchmarks (ignore JSON/CSV, but track PNG files)
+inference_benchmarks/*.json
+inference_benchmarks/*.csv
+!inference_benchmarks/*.png
+
 # Data (already covered but ensure)
 data/
 data_storage/

diff --git a/llm_test_prompts.txt b/llm_test_prompts.txt
new file mode 100644
index 0000000..06b2825
--- /dev/null
+++ b/llm_test_prompts.txt
@@ -0,0 +1,100 @@
+Explain how attention mechanisms work in transformers
+What is the difference between self-attention and cross-attention?
+Describe the transformer architecture and its key components
+How does gradient descent optimize neural network parameters?
+What is the purpose of layer normalization in deep learning?
+Explain the concept of embeddings in natural language processing
+Describe how positional encoding works in transformers
+What is the role of the feed-forward network in transformer blocks?
+How does backpropagation compute gradients through the network?
+Explain the difference between training and inference in machine learning
+What is the purpose of dropout regularization?
+Describe how tokenization converts text into numerical representations
+Explain the concept of batch processing in neural network training
+What is the learning rate and why is it important?
+How does the Adam optimizer differ from standard gradient descent?
+Describe the multi-head attention mechanism
+What is causal masking and why is it used in language models?
+Explain how text generation works in autoregressive models
+What is temperature in sampling and how does it affect generation?
+Describe the difference between top-k and top-p sampling strategies
+How do residual connections help in training deep networks?
+What is the purpose of the softmax function in neural networks?
+Explain how KV caching improves inference performance
+What are the advantages of using pre-normalization in transformers?
+Describe the mathematical formulation of the attention mechanism
+How does mixed precision training reduce memory usage?
+What is gradient clipping and when is it necessary?
+Explain the concept of perplexity as a language model metric
+How does data augmentation improve model generalization?
+What is the difference between fine-tuning and transfer learning?
+Describe how transformers process variable-length sequences
+What is the purpose of the vocabulary in tokenization?
+Explain how learning rate scheduling improves training
+How do transformers handle long-range dependencies in text?
+What is the difference between encoder and decoder architectures?
+Describe the role of activation functions in neural networks
+How does weight initialization affect training dynamics?
+What is the purpose of the loss function in model training?
+Explain how batching improves training efficiency
+Describe the concept of overfitting and how to prevent it
+What is the difference between supervised and unsupervised learning?
+How do transformers compare to recurrent neural networks?
+Explain the concept of transfer learning in deep learning
+What is the purpose of the validation set in machine learning?
+Describe how early stopping prevents overfitting
+What is the difference between training loss and validation loss?
+How does data preprocessing improve model performance?
+Explain the concept of feature extraction in neural networks
+What is the purpose of the test set in model evaluation?
+Describe how cross-validation works in machine learning
+The transformer architecture revolutionized natural language processing by introducing
+Attention mechanisms allow models to focus on relevant parts of the input sequence when
+In deep learning, the optimization process involves iteratively updating model parameters to
+Neural networks learn representations by transforming input data through multiple layers of
+The key innovation of transformers is their ability to process sequences in parallel rather than
+Gradient descent optimization works by computing the derivative of the loss function with respect to
+Layer normalization stabilizes training by normalizing activations across the feature dimension to
+Embeddings convert discrete tokens into continuous vector representations that capture semantic
+Positional encoding provides information about token positions in sequences since transformers lack
+The feed-forward network in transformers applies two linear transformations with a non-linear activation to
+Backpropagation efficiently computes gradients by applying the chain rule of calculus through the computational
+During inference, models generate predictions without updating parameters, which makes it different from
+Dropout regularization randomly sets a fraction of activations to zero during training to prevent
+Tokenization is the process of breaking text into smaller units called tokens that can be processed by
+Batch processing groups multiple examples together to compute gradients more efficiently and stabilize
+The learning rate controls the step size during optimization, and choosing the right value is crucial for
+Adam optimizer combines the benefits of momentum and adaptive learning rates to achieve faster
+Multi-head attention allows the model to attend to information from different representation subspaces
+Causal masking ensures that during training, tokens can only attend to previous tokens, maintaining the
+Autoregressive generation produces text one token at a time, using previously generated tokens as input for
+Temperature scaling adjusts the randomness of sampling by dividing logits before applying softmax, where lower
+Top-k sampling restricts the sampling space to the k most likely tokens, while top-p sampling uses nucleus
+Residual connections enable gradient flow through deep networks by adding the input directly to the output of
+Softmax converts logits into probability distributions by exponentiating and normalizing values, ensuring that
+KV caching stores previously computed key and value matrices to avoid redundant computations during autoregressive
+Pre-normalization places layer normalization before sublayers, which has been shown to improve training stability
+The attention mechanism computes a weighted sum of values, where weights are determined by the similarity between
+Mixed precision training uses both float16 and float32 operations to reduce memory consumption while maintaining
+Gradient clipping prevents exploding gradients by limiting the norm of gradients to a maximum value, which is
+Perplexity measures how well a language model predicts a sequence, with lower values indicating better
+Data augmentation creates variations of training examples to increase dataset diversity and improve model
+Fine-tuning adapts a pre-trained model to a specific task by continuing training on task-specific data, which
+Transformers handle variable-length sequences by using padding and attention masks to ignore irrelevant
+The vocabulary defines the set of all possible tokens that the model can process, and its size affects
+Learning rate scheduling adjusts the learning rate during training, often starting high and decreasing over
+Transformers excel at capturing long-range dependencies because attention can directly connect any two positions
+Encoder architectures are designed for understanding tasks, while decoder architectures are optimized for
+Activation functions introduce non-linearity into neural networks, enabling them to learn complex
+Weight initialization sets the initial values of parameters, and proper initialization is critical for
+The loss function quantifies how far model predictions are from ground truth, guiding the optimization
+Batching improves efficiency by parallelizing computations across multiple examples, reducing the variance of
+Overfitting occurs when a model memorizes training data instead of learning generalizable patterns, which can be
+Supervised learning uses labeled data, while unsupervised learning finds patterns in unlabeled data without
+Transformers replaced RNNs in many applications because they can process sequences in parallel and capture
+Transfer learning leverages knowledge from pre-trained models on large datasets to improve performance on
+The validation set is used to tune hyperparameters and monitor training progress without affecting the
+Early stopping terminates training when validation performance stops improving, preventing overfitting and
+Training loss measures performance on seen data, while validation loss indicates generalization to unseen
+Data preprocessing transforms raw data into a format suitable for model training, including normalization
+Feature extraction learns meaningful representations from raw input data, enabling models to identify patterns