Aether

Aether https://aether.snwy.me/ 2026-06-17T15:44:50+00:00

text/html 2026-04-19T08:27:04+00:00 aethersync (aethersync@undisclosed.example.com) positional_encoding - Synced from Aether https://aether.snwy.me/concepts:positional_encoding?rev=1776587224&do=diff
Positional Encoding
Since softmax attention has no built-in notion of token order (it is permutation-equivariant), positional encodings inject sequence-order information into Transformer inputs. Variants include sinusoidal (original), learned, and rotary (RoPE) encodings. Not directly modified by Attention Residuals, but RoPE interacts with

text/html 2026-04-19T08:27:04+00:00 aethersync (aethersync@undisclosed.example.com) backpropagation - Synced from Aether https://aether.snwy.me/concepts:backpropagation?rev=1776587224&do=diff
Backpropagation
The algorithm for computing gradients of a loss function with respect to network parameters by applying the chain rule layer by layer from output to input. Vanishing gradients occur when these signals decay excessively in deep networks. Residual connections create the gradient highway that preserves gradient flow.

text/html 2026-04-19T08:27:04+00:00 aethersync (aethersync@undisclosed.example.com) feedforward_network - Synced from Aether https://aether.snwy.me/concepts:feedforward_network?rev=1776587224&do=diff
Feed-Forward Network (FFN)
The position-wise MLP applied after each attention layer in a Transformer block. Typically two linear transformations with a nonlinearity: FFN(x) = W2 · act(W1 · x). In MoE architectures, the FFN is replaced by multiple expert FFNs with

text/html 2026-04-19T08:27:04+00:00 aethersync (aethersync@undisclosed.example.com) transformer - Synced from Aether https://aether.snwy.me/concepts:transformer?rev=1776587224&do=diff
Transformer
The dominant architecture for LLMs, built from alternating self-attention and feed-forward layers with residual connections and layer normalization.
The original design used PostNorm; modern variants use PreNorm. Attention Residuals modifies how the residual stream accumulates across layers.
See also: softmax_attention, multi_head_attention, prenorm, residual_connections

text/html 2026-04-19T08:27:04+00:00 aethersync (aethersync@undisclosed.example.com) start - Synced from Aether https://aether.snwy.me/papers:start?rev=1776587224&do=diff
Papers
Notes and summaries of research papers.
* Attention Residuals: Kimi Team technique replacing fixed residual weights with learned softmax attention over layer outputs.

text/html 2026-04-19T08:27:04+00:00 aethersync (aethersync@undisclosed.example.com) start - Synced from Aether https://aether.snwy.me/concepts:start?rev=1776587224&do=diff
Concepts
Definitions and explanations of machine learning terms, organized as an interconnected graph.
Architecture
* LLM
* Softmax Attention
* Scaled Dot-Product Attention
* Multi-Head Attention
* Residual Connections
* Gradient Highway
* Mixture of Experts
* Expert Routing
* Linear Attention
* RNN
Normalization & Stability
* Layer Normalization
* PreNorm
* PostNorm
* Hidden-State Growth
* Vanishing Gradients
Efficiency & Pruning
* Model Pruning

text/html 2026-04-19T08:27:04+00:00 aethersync (aethersync@undisclosed.example.com) start - Synced from Aether https://aether.snwy.me/start?rev=1776587224&do=diff
Aether
A personal knowledge base covering machine learning concepts and papers.
Namespaces
* Concepts: definitions and explanations of ML terms
* Papers: paper notes and summaries
Recent Additions
* vanishing_gradients
* chinchilla_scaling
* linear_attention
* postnorm
* rnn
About This Host
This host runs:

text/html 2026-04-19T08:26:49+00:00 aethersync (aethersync@undisclosed.example.com) rnn - Synced from Aether https://aether.snwy.me/concepts:rnn?rev=1776587209&do=diff
Recurrent Neural Network (RNN)
A network architecture that processes sequences by maintaining a hidden state updated at each time step.
Attention Residuals has a recurrent interpretation: the softmax attention over all prior layer outputs can be viewed as a weighted recurrence, connecting transformer depth to recurrent computation.

text/html 2026-04-19T08:26:49+00:00 aethersync (aethersync@undisclosed.example.com) postnorm - Synced from Aether https://aether.snwy.me/concepts:postnorm?rev=1776587209&do=diff
PostNorm
Applying layer normalization after the sublayer and residual addition. Used in the original Transformer but largely replaced by PreNorm in modern LLMs due to training instability at depth. PostNorm does not exhibit hidden-state growth but is harder to train.
See also: prenorm, layer_normalization, hidden_state_growth, attention_residuals

text/html 2026-04-19T08:26:49+00:00 aethersync (aethersync@undisclosed.example.com) layer_normalization - Synced from Aether https://aether.snwy.me/concepts:layer_normalization?rev=1776587209&do=diff
Layer Normalization
Normalizing activations across the feature dimension to stabilize training. Applied either before (PreNorm) or after (PostNorm) the sublayer. PreNorm dominates modern LLMs but causes hidden-state growth.
See also: prenorm, postnorm, hidden_state_growth, attention_residuals

text/html 2026-04-19T08:26:49+00:00 aethersync (aethersync@undisclosed.example.com) pipeline_communication - Synced from Aether https://aether.snwy.me/concepts:pipeline_communication?rev=1776587209&do=diff
Pipeline Communication
The overhead of passing activations between pipeline stages in pipeline-parallel training. Block AttnRes reduces this by aggregating representations at block boundaries rather than every layer, cutting communication volume proportionally.

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) linear_attention - Synced from Aether https://aether.snwy.me/concepts:linear_attention?rev=1776587208&do=diff
Linear Attention
Attention variants that replace the O(n²) softmax attention with a kernelized decomposition, reducing cost to linear in sequence length. Used in Kimi Linear to handle long contexts efficiently. Trades some expressiveness for computational savings.
See also:

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) model_pruning - Synced from Aether https://aether.snwy.me/concepts:model_pruning?rev=1776587208&do=diff
Model Pruning
Removing parameters or structures from a trained network to reduce size or compute. Includes unstructured pruning (individual weights) and structured pruning (entire heads, layers, or experts). Layer pruning is a form of structured pruning that is surprisingly benign under

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) multi_head_attention - Synced from Aether https://aether.snwy.me/concepts:multi_head_attention?rev=1776587208&do=diff
Multi-Head Attention
Running multiple attention operations in parallel with separate learned projections, then concatenating results. Each head can attend to different positional or semantic relationships. Standard in modern Transformer-based LLMs.
See also: scaled_dot_product_attention, softmax_attention, attention_residuals

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) scaled_dot_product_attention - Synced from Aether https://aether.snwy.me/concepts:scaled_dot_product_attention?rev=1776587208&do=diff
Scaled Dot-Product Attention
The core computation behind softmax_attention: softmax(QK^T / sqrt(d_k)) V. The sqrt(d_k) scaling prevents dot products from growing large in high dimensions, which would push softmax into saturated regions with vanishing gradients.
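The formula softmax(QK^T / sqrt(d_k)) V can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of any particular library: the function names `softmax` and `scaled_dot_product_attention` and the list-of-lists matrix layout are assumptions made for the example, and it handles a single head with no masking or batching.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v,
    each given as a list of row lists (hypothetical layout).
    """
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return out
```

With an all-zero query every key scores the same, so the output is simply the mean of the value rows; larger unscaled scores would concentrate the weights on one key, which is the saturation the sqrt(d_k) factor is meant to limit.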
See also:

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) llm - Synced from Aether https://aether.snwy.me/concepts:llm?rev=1776587208&do=diff
Large Language Model (LLM)
A neural network trained on large text corpora to model language. Modern LLMs use transformer architectures with PreNorm and residual connections, and increasingly MoE for efficiency. Scaling laws govern their performance gains.
See also: softmax_attention, moe, scaling_laws, prenorm

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) expert_routing - Synced from Aether https://aether.snwy.me/concepts:expert_routing?rev=1776587208&do=diff
Expert Routing
The mechanism in Mixture-of-Experts architectures that selects which experts process a given input. Typically implemented as a learned gating network producing a sparse distribution over experts. The quality of routing directly affects MoE efficiency and performance.

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) neural_scaling - Synced from Aether https://aether.snwy.me/concepts:neural_scaling?rev=1776587208&do=diff
Neural Scaling
The broad phenomenon where model performance improves predictably with increases in parameters, data, or compute. Scaling laws formalize these relationships mathematically.
See also: scaling_laws, chinchilla_scaling, attention_residuals

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) chinchilla_scaling - Synced from Aether https://aether.snwy.me/concepts:chinchilla_scaling?rev=1776587208&do=diff
Chinchilla Scaling
The finding from Hoffmann et al. (2022) that LLMs should be trained on roughly 20 tokens per parameter for compute-optimal training. Implies many large models are significantly undertrained. Validates that Attention Residuals improvements hold under this regime.
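The "roughly 20 tokens per parameter" rule of thumb is easy to turn into arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the 7-billion-parameter model is an arbitrary example, and the 6 * N * D estimate of training FLOPs is a common approximation that is assumed here rather than taken from this entry.

```python
TOKENS_PER_PARAM = 20  # rough ratio quoted in the entry above


def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate compute-optimal training tokens for a model size."""
    return n_params * tokens_per_param


def approx_training_flops(n_params, n_tokens):
    """Assumed estimate: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


n_params = 7_000_000_000  # arbitrary example size, not from the source
n_tokens = compute_optimal_tokens(n_params)
print(f"tokens: {n_tokens:.2e}")
print(f"approx. training FLOPs: {approx_training_flops(n_params, n_tokens):.2e}")
```

Under these assumptions a 7B-parameter model would want about 1.4e11 training tokens; a model of that size trained on far fewer tokens is what the entry calls undertrained.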

text/html 2026-04-19T08:26:48+00:00 aethersync (aethersync@undisclosed.example.com) vanishing_gradients - Synced from Aether https://aether.snwy.me/concepts:vanishing_gradients?rev=1776587208&do=diff
Vanishing Gradients
The problem where gradients shrink exponentially as they propagate through many layers during backpropagation, making deep networks difficult or impossible to train. Residual connections mitigate this by providing a direct gradient path, the gradient highway, that preserves signal across arbitrary depth.
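A scalar toy model makes the contrast concrete. In the sketch below each layer is reduced to a single local derivative d (an assumption made purely for illustration; real layers have Jacobian matrices, and the function names are hypothetical). A plain stack y = f(x) multiplies the gradient by d per layer, while a residual stack y = x + f(x) multiplies it by 1 + d, so the identity term keeps a path open even when d is small.

```python
def plain_chain_gradient(local_grads):
    """Gradient through a plain stack: product of the local derivatives."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g


def residual_chain_gradient(local_grads):
    """Gradient through a residual stack y = x + f(x): product of (1 + d)."""
    g = 1.0
    for d in local_grads:
        g *= 1.0 + d
    return g


depth = 50
local = [0.01] * depth  # small local derivative at every layer (toy value)
print("plain:   ", plain_chain_gradient(local))
print("residual:", residual_chain_gradient(local))
```

With 50 layers and a local derivative of 0.01, the plain product collapses toward zero while the residual product stays on the order of one, which is the gradient-highway effect described above. The toy ignores sign changes and normalization, so it shows the mechanism rather than real training dynamics.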