M#

Machine Learning (ML)#

Part of AI, AI with first a learning phase! A subset of Artificial Intelligence (AI). Field of study that gives computers the ability to learn without being explicitly programmed 1 or more layers of data, includes but not limited to neural networks. Unsupervised, supervised (classification + regression) , reinforcement. Data --> Model --> Prediction.

A company is using a rule-based system to classify the credit card transactions as fraud or not fraud. Do you think this a machine learning solution?
No! (but why? because it is a rule-based system?)

A company receives thousands of calls everyday to route to the right agent. The routing requires several hops and given the number of calls is expensive.
What is a machine learning solution?
Predict what are the required agent skills based some input parameters
==> That's a multi-class classification problem!

Machine Learning Framework#

Watch for:

eager mode (execute like a python script, from top to bottom)
graph format and execution engine natively has no need for Python, and TensorFlow Lite and TensorFlow Serving address mobile and serving considerations respectively.

More at :

https://thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates-research-tensorflow-dominates-industry/

See also M, ...

Machine Learning Pipeline#

ML problem framing (aka discovery phase)
data collection and integration
dataset can be incomplete and therefore misleading
data in dataset can be irrelevant
bias in dataset?
dataset may be too small (needs to be at least 10 times the number of features!)
needs to be cleaned?
data preparation
dig into a small to manually critically explore the data.
confirm that all label are relevant to solve the ML problem
What features are there?
Does it match my expectation?
Is there enough information to make accurate prediction?
Is data missing?
Should any label be excluded?
Can some label be combined because of overlap?
Are existing labels accurate or not?
Reviewing questions from discovery phase and continue to interact with domain expert
data visualization and analysis
understand the relationships in a dataset
find outliers (use histograms)
find groups
use imputation to complete data
pie charts, histogram, scatter plots for correlation
try to extract noise from data, noise causes overfitting and reduce accuracy of predictions
feature selection and feature engineering
you want a minimum correlation between the features, but the maximum correlation between the feature and the output
do the feature I use make sense?
very time consuming step!
ex: what was the last item purchased by the customer?
ex: when was the last purchase of the customer?
ex: do the customer owns a kindle? (then do not expect questions related to kindle)
model training
model selection
model evaluation
- confusion matrix for classification
- overfitting ?
deployment for prediction / inference

See also M, Discovery Phase, [Machine Learning], Machine Learning Framework

Machine Learning System On Chip (MLSoc)#

See also M, ...

Machine Learning Type#

unsupervised,
Supervised (Classification, regression)
self-supervised learning
reinforcement.
transfer learning

Machine Unlearning#

Coined as "machine unlearning," this concept represents the converse of [machine learning] — it serves to make a model unlearn or forget. These algorithms, applied to previously trained models, force them to expunge specific portions of the training dataset. The beginnings of machine unlearning lie in responding to the "Right to be Forgotten" legislation, a provision of the European Union's General Data Protection Regulation (GDPR).

Methods

Exact unlearning
- [Reverse Nearest Neighbors (RNN)] - find and adjust the model based on the nearest neighbors of the data points set for removal
- K-Nearest Neighbors (KNN) - remove data points based on their closeness to the nearest neighbors, provide an intricate strategy for effective data point exclusion
Approximate unlearning - more efficient alternative to exact unlearning by utilizing additional data to be expunged or retained after removal
- Local Outlier Factor (LOF) - identifies and purges outliers in the dataset to boost the model's performance
- Isolation Forest (IF) - constructs random forests and determines the anomaly scores of data points, isolating and discarding those with elevated scores

More at:

See also M, ...

Machine Reasoning#

Prompt engineering techniques

Query a knowledge graph = traverse a knowledge graph with queries. Query types are:

one-hop queries
path queries
conjunctive queries

More at:

LLM reasoning ability - https://www.kaggle.com/code/flaussy/large-language-models-reasoning-ability

See also M, Deductive Reasoning, Inductive Reasoning, [Knowledge Graph], Prompt Engineering, [Question Answering Graph Neural Network], Reasoning

Machine Translation#

Google Translate, DeepL, and other machine translation programs use Natural Language Processing (NLP) to evaluate millions of sentences translated by human speakers of different language pairs.

Paradigms:

Magi Model#

New search engine based on an AI model built by Google ?

More at:

https://blog.google/technology/developers/google-io-2023-100-announcements/

See also M, ...

Majority Class#

used to see if the dataset is balanced?

Majority Vote Algorithm#

When you crowdsource a labeling task, how can you be certain that the label is correct? Have several people label the same image/entry and apply this algorithm! An alternative is to use Dawid-Skene algorithm.

Make-A-Video Model#

A state-of-the-art AI system that generates videos from text built by Meta.

More at:

site - https://makeavideo.studio/
paper - https://arxiv.org/abs/2209.14792
articles
- https://aibusiness.com/ml/meta-unveils-ai-model-that-can-generate-videos-from-text-inputs

See also M, ...

Mamba Mechanism#

~ does not use the transformer, but much cheaper to train and use than transformers. Scale to very long sequences.

A LLM that does not used a Transformer Architecture, but a [Recurrent Neural Network] ! The RNNs fight back!

A new [state space model] architecture which shows promising performance on information-dense data such as language modeling, where previous subquadratic models couldn't match [Transformers]. Inspired by structured state space models, it combines a hardware-friendly design with an approach similar to FlashAttention for improved performance.

More at:

paper - https://arxiv.org/abs/2312.00752
code - https://github.com/state-spaces/mamba
colab - https://colab.research.google.com/drive/1dUlEYnRbgJYg4_kofNpsCddLCh6vltNK?usp=sharing#scrollTo=4cMzClx-qN1x
articles
- Mamba vs attention - https://medium.com/data-science-in-your-pocket/mamba-vs-attention-which-mechanism-is-better-for-llm-bbf7699947dc

See also M, Attention Mechanism, [State Space Model]

ManGPT Model#

More at:

https://multiplatform.ai/man-group-the-worlds-largest-publicly-traded-hedge-fund-introduces-mangpt-an-ai-driven-platform-for-idea-formation-and-information-summarization/

See also M, ...

Manhattan Distance#

The Euclidean distance is the length of the shortest path between 2 points. The Manhattan distance instead assumes there is a grid and the total distance is the sum of the delta for each of the dimensions. In other words, ManDist = (X2-X1) + (Y2-Y1) etc

See also M, ...

Manifold#

A lower dimension space

has dimensions
has a shape

See also M, ...

ManyChat Company#

OpenAI Assistant or custom GPT on whatsapp ?

More at:

site - https://manychat.com/

See also M, ...

MapReduce Process#

Move the processing and not the data + process in parallel and combine

At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web. It replaced the old ad hoc programs that updated the index and ran the various analyses.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead.

More at:

https://en.wikipedia.org/wiki/MapReduce

Mark Zuckerberg Person#

CEO of Facebook, later renamed Meta

See also M, ...

Markov Chain#

stochastic model
describe sequence of possible events
what happens next depends on present state
memory-less
transition matrix

More at:

https://setosa.io/ev/markov-chains/

See also M, ...

Markov Chain Monte Carlo (MCMC) Algorithm#

Bring Markov Chain and Monte Carlo Methods together!

Markov Chain => Look at the previous sample and pick the next sample based on it
Monte Carlo => simulate the Markov chain

See also M, ...

Markov Chain Stationary Distribution#

Markov Decision Process (MDP)#

Markov Decision Process is a Reinforcement Learning algorithm that gives us a way to formalize sequential decision making. This formalization is the basis to the problems that are solved by Reinforcement Learning. The components involved in a Markov Decision Process (MDP) is a decision maker called an agent that interacts with the environment it is placed in. These interactions occur sequentially overtime. In each timestamp, the agent will get some representation of the environment state. Given this representation, the agent selects an action to make. The environment is then transitioned into some new state and the agent is given a reward as a consequence of its previous action. The process of selecting an action from a given state, transitioning to a new state and receiving a reward happens sequentially over and over again. This creates something called a trajectory that shows the sequence of states, actions and rewards. Throughout the process, it is the responsibility of the reinforcement learning agent to maximize the total amount of rewards that it received from taking actions in given states of environments. The agent not only wants to maximize the immediate rewards but the [cumulative reward] it receives in the whole process.

More at :

https://en.wikipedia.org/wiki/Markov_decision_process

Markov Random Field#

Masked Language Modeling (MLM)#

Methods: replace 15% of words in corpus with special [MASK] token and ask the NLP model (e.g. BERT) to fill in the blank

Istanbul is a great [MASK] to visit          # Must guess the word, i.e. city

Words are broken into tokens, so whole word masking implies that related tokens need to be masked together

Input Text             : the man jumped up , put his basket on phil ##am ##mon ' s head
Original Masked Input  : [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head
Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

The training is identical -- we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces.

Masked Self-Attention#

~ Used in the decoder of [transformers]

Attention can only be done to words to the left of the current word. The computation of the attention score is the same for self-attention but for later words, attention score is 0. Used in Decoders of transformers (not encoder!), used by GPT models.

Masked Vision Modeling (MVM)#

Massive Multitask Language Understanding (MMLU) Benchmark#

~ a measure of intelligence and knowledge at the undergraduate level.

~ a benchmark used to compare large language models on many [language understanding] tasks.

Over the year, this benchmark has linked into the training data. The questions are now considered too easy. In addition, only 4 choices are given. For all those reasons, MMLU-Pro benchmark was created

We propose a new test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem solving ability. We find that while most recent models have near random-chance accuracy, the very largest [GPT-3 model] improves over random chance by almost 20 percentage points on average. However, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy. Models also have lopsided performance and frequently do not know when they are wrong. Worse, they still have near-random accuracy on some socially important subjects such as morality and law. By comprehensively evaluating the breadth and depth of a model's academic and professional understanding, our test can be used to analyze models across many tasks and to identify important shortcomings.

One of the reasons that the government discourages and regulates monopolies is that
(A) producer surplus is lost and consumer surplus is gained.
(B) monopoly prices ensure productive efficiency but cost society allocative efficiency.
(C) monopoly firms do not engage in significant research and development.
(D) consumer surplus is lost with higher prices and lower levels of output.

When you drop a ball from rest it accelerates downward at 9.8 m/s². If you instead throw it
downward assuming no air resistance its acceleration immediately after leaving your hand is
(A) 9.8 m/s²
(B) more than 9.8 m/s²
(C) less than 9.8 m/s²
(D) Cannot say unless the speed of throw is given.

In the complex z-plane, the set of points satisfying the equation z² = |z|² is a
(A) pair of points
(B) circle
(C) half-line
(D) line

More at:

MMLU
- paper - https://arxiv.org/abs/2009.03300
MMLU-Pro

See also M, ...

Matbench Discovery Benchmark#

More at:

site - https://matbench-discovery.materialsproject.org/
leaderboard - https://matbench-discovery.materialsproject.org/models

Matplotlib Python Module#

A python module used for visualization

Simple X,Y plot

import matplotlib.pyplot as plt
import pandas

df = pandas.read_csv('../data/atlantis.csv')
x = df['year']
y = df['population']

plt.plot(x,y)
plt.title("Population of AAtlantis")
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()

import matplotlib.pyplot as plt
import numpy as np

plt.style.use('_mpl-gallery')

# make the data
np.random.seed(3)
x = 4 + np.random.normal(0, 2, 24)
y = 4 + np.random.normal(0, 2, len(x))
# size and color:
sizes = np.random.uniform(15, 80, len(x))
colors = np.random.uniform(15, 80, len(x))

# plot
fig, ax = plt.subplots()

ax.scatter(x, y, s=sizes, c=colors, vmin=0, vmax=100)

ax.set(xlim=(0, 8), xticks=np.arange(1, 8),
       ylim=(0, 8), yticks=np.arange(1, 8))

plt.show()

More at:

See also M, ...

Matrix#

There is no definition of matrix, but only the interpretations of matrices

For an engineer who study circuits, matrix are just a way to solve system of equations
For a statistician, matrices are a representation of a Markovian process.
For a data scientist, a matrix is the representation of a table and facilitate data analysis
For deep learning folks, a matrix is another python function which takes a vector as input and outputs a vector

Special (linear-transformation) matrices

Identity Matrix: A transformation that results in the same output as the input
Scalar Matrix:
Identity matrix, but off by one:
Diagonal matrix: Stretch or scale each axis - sign creates a reflection!
Square zero matrix: All outputs is ZERO regardless of the input matrix
Shear matrix
Orthogonal matrix: Produce an improper rotation (pure rotation with or without reflection = no vector is stretched longer or shorter)
Projection matrix: You project a vector in a lower dimension space you lose info, you cannot invert those matrix!
Invertible matrix: To undo a linear transformation

Special matrices

Invertible matrix: To undo a linear transformation
Transpose matrix: For square matrices only? No. Row becomes columns. Transpose of a MxN matrix is of NxM dimension. Top-bottom diagonal stays the same!
Square matrix: Number of columns = number of rows
Square symmetric matrix: Aij = Aji . Symmetric based on the top-bottom diagonal. S = St ! Eigenvectors are perpendicular/orthogonal!
Non-square (i.e. rectangular or Mcol x Nrow) matrices: projection in higher or lower dimensions. Mcol x Nrow . Ncol = Mrow
Orthogonal matrix: Produce an improper rotation. Q orthogonal => Qt = Qinv or inverse rotation and therefore orthogonal as well!!
Matrix of any dimension Mrow x Ncol
- A . At = square symmetric matrix (with dimension Mrow x Mrow )!!! = Sleft
- At . A = square symmetric matrix (with dimension Ncol x Ncol )!!! = Sright

 Special Non-square matrices

  * Dimension eraser: Keep X and Y, but remove Z     | 1 0 0 |
                                                     | 0 1 0 |

  * Dimension adder: Append zero as Z value          | 1 0 |
                                                     | 0 1 |
                                         | 0 0 |

Special vectors:

Eigenvectors: While every other vector deviates from their initial direction, the eigne vectors stay on their original lines despite the distortion from the matrix.

Special values:

Eigenvalues: By how much the eigenvector is transformed on its original direction by the transformation

See also M, [Linear Transformation], Tensor, Vector

Matrix Composition#

Multiplication of some matrices (Easy)> one matrix

Matrix Multiplication is not a multiplication at all! It is a composition of transformations. N multiplications of matrices = N linear transformations = 1 complex linear transformation (That's what defines linear transformations!) That is why Matrix multiplication is not commutative!

Matrix Decomposition#

One matrix (Difficult)=> multiplication of some matrices

If a matrix is a linear transformation, its decomposition consists in turning that "complex" matrix into "simpler" matrices (or a succession of simpler linear transformations)

Special decompositions:

Square symmetric matrices ==> Spectral decomposition
Jordan Decomposition
QR decomposition
Other matrices (rectangular matrices) ==> [Singular value decomposition]

Matrix Null Space#

From this definition, the null space of A is the set of all vectors such that A.V=0 . Obviously V=[0,0,0,...,0] is part of the null space, so it is always non-empty.

Matrix QR Decomposition#

In linear algebra, a QR decomposition, also known as a QR factorization or QU factorization, is a decomposition of a matrix A into a product A = QR of an orthonormal matrix Q and an upper triangular matrix R. QR decomposition is often used to solve the linear least squares problem and is the basis for a particular eigenvalue algorithm, the QR algorithm.

More at:

https://en.wikipedia.org/wiki/QR_decomposition

See also M, ...

Matrix Rank#

The rank of the matrix is related to the range. It denotes how many columns of 𝐴 are actually "relevant" in determining its range. You may think that removing a column from a matrix will dramatically affect which vectors it can reach, but consider:

 | 1 2 0 |                     | 1 |
 | 1 2 0 | ~~(same range as)~~ | 1 |
 | 1 2 0 |                     | 1 |

You can try to reason (to yourself), that the left matrix can reach the same space of vectors as the right matrix (Why?)

See also M, ...

Matrix Range#

In the simplest terms, the range of a matrix is literally the "range" of it. The crux of this definition is essentially

Given some matrix A , which vectors can be expressed as a linear combination of its columns?

Range (another word for column space) is what is meant by this. If you give me some matrix A that is MxN, the column space is the set of all vectors such that there exists a1,a2,....,an so that a1A1+a2A2+...anAn = V for some vector V .

 | 1 0 0 |  | a1 |     | 5 |
 | 0 1 0 |  | a2 |  =  | 5 |
 | 0 0 1 |  | a3 |     | 5 |

Then V is in the range of A since a1=a2=a3=5 . A better example is when it's not, like:

 | 1 0 3 |   | a1 |     | 5 |
 | 1 1 2 |   | a2 |  =  | 5 |
 | 0 0 0 |   | a3 |     | 5 |

Now it's not... since no a1,a2,a3 will satisfy the condition that V is a linear combination of the columns of A ...I mean, we will always have 0 in the third entry of any linear combination!

More at:

https://math.stackexchange.com/questions/2037602/what-is-range-of-a-matrix

See also M, ...

Matrix Spectral Decomposition#

A special decomposition for square symmetric matrix

S = Q D Qt 
# where
# Q orthogonal with eigenvectors as columns
# D is diagonal with eigenvalues = stretch X axis by EV1, Y axis by EV2, etc.
# Qt also orthogonal with eigenvectors as rows (transpose)
# REACTION ==> improper rotation + stretching + improper rotation

<!> improper rotation can be turned into pure rotation if sign of eigenvectors is cleverly picked
<!> -1 * eigenvectors is also an eigenvector with same eigenvalue (?)

More at:

https://www.youtube.com/watch?v=mhy-ZKSARxI

See also M, ...

Matrix Determinant#

Matrix Multiplication#

See Matrix Composition

Max-Pooling Layer#

~ A pooling layer often seen in CNNs where the maximum is extracted

Downsample the feature map (take a thumbnail/smaller-size of the image with some feature still present). while a convoluted layer extract features. In the illustration below, we take a 2x2 kernel and pool the maximum value. Benefits with this approach are

discovered features in the previous convoluted Layer are preserved as we are keeping the max!
image is smaller <== A stack of images becomes a smaller stack of images
because of down sampling, the position of the exact match is not as important/sensitive

See also M, Convolutional Layer, Convolutional Neural Network, [Fully Connected Layer]

Maximum Marginal Relevance (MMR)#

The Maximal Marginal Relevance (MMR) criterion strives to reduce redundancy while maintaining query relevance in re-ranking retrieved documents. When it comes to retrieving documents (like in a RAG), the majority of methods will do a similarity metric like cosine similarity, euclidean distance, or dot product. All of these will return documents that are most similar to your query/question. However, what if you want similar documents which are also diverse from each other? That is where Maximum Marginal Relevance (MMR) steps in.

More at:

Mean Absolute Error (MAE) Loss Function#

The Mean Absolute Error (MAE) loss function is only slightly different in definition from the MSE, but interestingly provides almost exactly opposite properties! To calculate the MAE, you take the difference between your model’s predictions and the ground truth, apply the absolute value to that difference, and then average it out across the whole dataset. The MAE, like the Mean Square Error (MSE), will never be negative since in this case we are always taking the absolute value of the errors. The MAE is formally defined by the following equation:

## MAE loss function
def mae_loss(y_pred, y_true):
    abs_error = np.abs(y_pred - y_true)
    sum_abs_error = np.sum(abs_error)
    loss = sum_abs_error / y_true.size
    return loss

Pros and Cons:

Advantage: The beauty of the MAE is that its advantage directly covers the MSE disadvantage. Since we are taking the absolute value, all of the errors will be weighted on the same linear scale. Thus, unlike the MSE, we won’t be putting too much weight on our outliers and our loss function provides a generic and even measure of how well our model is performing.
Disadvantage: If we do in fact care about the outlier predictions of our model, then the MAE won’t be as effective. The large errors coming from the outliers end up being weighted the exact same as lower errors. This might results in our model being great most of the time, but making a few very poor predictions every so-often.

Mean Absolute Percentage Error (MAPE)#

Mean Average Precision (MAP) Metric#

MAP measures the quality of ranked retrieval results, particularly for tasks where multiple relevant items exist, such as search engines or recommendation systems.

Mean Square Error (MSE) Loss Function#

MSE loss function is widely used in linear regression as the performance measure. To calculate MSE, you take the difference between your predictions and the ground truth, square it, and average it out across the whole dataset.

where y(i) is the actual expected output and ŷ(i) is the model’s prediction.

def mse_loss(y_pred, y_true):
    squared_error = (y_pred - y_true) ** 2
    sum_squared_error = np.sum(squared_error)
    loss = sum_squared_error / y_true.size
    return loss

Pros and cons:

Advantage: The MSE is great for ensuring that our trained model has no outlier predictions with huge errors, since the MSE puts larger weight on theses errors due to the squaring part of the function.
Disadvantage: If our model makes a single very bad prediction, the squaring part of the function magnifies the error. Yet in many practical cases we don’t care much about these outliers and are aiming for more of a well-rounded model that performs good enough on the majority.

Mechanical Turk#

To label the data!

More at:

figure-eight company https://www.figure-eight.com/

Megatron Model#

More at :

https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/

Membership Inference Attack#

A type of attack called “membership inference” makes it possible to detect the data used to train a machine learning model. In many cases, the attackers can stage membership inference attacks without having access to the machine learning model’s parameters and just by observing its output. Membership inference can cause security and privacy concerns in cases where the target model has been trained on sensitive information.

Beware:

This membership inference is due to overfitting!
machine learning models tend to perform better on their training data ==> attack Possible solutions:
differential privacy

More at :

paper - https://arxiv.org/abs/2112.03570

MemGPT Model#

~ An agent that knows how to use Memory Management Tools ==> LLM OS

Personalized AI with

self-editing memory
infinite context window
access to unlimited data
customizable tools
long-term memory

More at:

site - https://memgpt.ai/
site - https://research.memgpt.ai/
code - https://github.com/cpacker/MemGPT
paper - https://arxiv.org/abs/2310.08560
articles
- https://medium.com/@ronaldmannak/goodbye-windows-hello-llms-the-future-of-operating-systems-7ba61ea03e8d

See also M, [Retrieval-Augmented Generation]

Memory Hierarchy#

See also M, ...

Meta Company#

A company previously known as Facebook

People

Mark Zuckerberg

Groups:

https://ai.meta.com/

Models:

CICERO - Strategy game with multiplayer interaction
CM3leon - text-to-image diffusion model
ESMFold - Protein folding
Imagine - a diffusion model
LLaMA - Large Language Model open-sourced
Make-A-Video - Text to video model
Pluribus - Plays poker better than humans
RoBERTa - Optimized version of BERT
[Segment Anything] - Instance segmentation in images
Fairseq Toolkit - Facebook AI Research Toolkit
- Wave2Vec - For Automatic Speech Recognition (ASR)
Voicebox - edit, create, transfer styles between audio recordings

More at:

research on github - https://github.com/facebookresearch

See also M, ...

Meta Model#

A ML model to find the best hyperparameters. Ex: Gaussian process regression models object metric as a function of hyperparameters (beware assume smoothness, works with low data, confidence estimates) + bayesian optimization decides where to search next (explore and exploit and gradient free) .

See also M, [Hyperparameter Optimization], Meta-Learning

Meta-Learning#

Learn how to quickly adapt to new tasks within similar domains.

Is a sample efficient algorithm

Normally you take the x to predict the y and optimize parameters to get as close to y as possible. Here you take the x and y to generate a theta parameter to fit in another model... then use the loss of the aggregate model to

See also M, [Meta Model], [Model Agnostic Meta Learning], Reinforcement Learning, Transfer Learning

Metagenomic Protein#

See also M, ...

Metaheuristic#

Metaverse#

Think of it as the internet brought to life, or at least rendered in 3D. Zuckerberg has described it as a "virtual environment" you can go inside of — instead of just looking at on a screen. Essentially, it's a world of endless, interconnected virtual communities where people can meet, work and play, using virtual reality headsets, augmented reality glasses, smartphone apps or other devices.

More at:

https://www.npr.org/2021/10/28/1050280500/what-metaverse-is-and-how-it-will-work

METEOR Score#

Microsoft Common Object In Context Caption (COCO) Dataset#

A dataset consists of 1.5M captions for 330,000 images. The captions are generated by human annotators. Each image is linked to 5 captions.

More at :

home - https://cocodataset.org/#home
paper - https://arxiv.org/abs/1405.0312
know your data - https://knowyourdata-tfds.withgoogle.com/#dataset=coco_caption

Microsoft Company#

Models

ORCA Model Family -
Phi Model Family - Model to run on devices
Speechx
VASA Model Family - lip sync and face latent model in real time

See also M, Company, DeepSpeed Project, OpenAI

Midjourney Model#

More at:

https://promptomania.com/midjourney-prompt-builder/

Milvus Vector Database#

An open-source vector database that is highly flexible, reliable, and blazing fast.

More at:

site - https://milvus.io/

MindPong Game#

Minerva Model#

Can AI change mathematics?

More at:

Mini-Batch#

Take your training samples. Randomized the samples. Take the first 100 of them (the mini-batch).

/// details | sampling with replacement or without replacement? type:question ///

Mini-Batch Gradient Descent Algorithm#

~ Use a batch of (randomly picked) samples for a forward pass and then adjust weights

~ stochastic gradient descent (SGD) with more than 1 sample that use matrix optimization in computation. A compromise between computing the true gradient (all samples) and the gradient at a single example (aka stochastic gradient descent), is to compute the gradient against more than one training example (called a "mini-batch") at each step. This can perform significantly better than true stochastic gradient descent because the code can make use of vectorization libraries rather than computing each step separately. It may also result in smoother convergence, as the gradient computed at each step uses more training examples.

So, after creating the mini-batches of fixed size, we do the following steps in one epoch:

Pick a mini-batch
Feed it to Neural Network
Calculate the mean gradient of the mini-batch
Use the mean gradient we calculated in step 3 to update the weights
Repeat steps 1–4 for the mini-batches we created

Just like SGD, the average cost over the epochs in mini-batch gradient descent fluctuates because we are averaging a small number of examples at a time. But the mini-batch gradient descent is a good (the best) approximation. Also note that the path is less optimized than the "complete" gradient descent where you take all sample into consideration for one step, but it is a good approximation and the destination/convergence is the same = the local minima is not changed by the path taken to reach it.

See also M, Batch Gradient Descent Algorithm, [Gradient Descent Algorithm], Mini-Batch

Minimax Algorithm#

See also M, ...

Mira Murati Person#

~ Interim CEO of OpenAI during Sam Altman's ouster.

See also M, ...

Missingness#

~ how to deal with missing data

<!> By blindly deleting rows you may lose valuable data

<!> Use the right method to fix the missing data

Possible fix: * Means * Median * KNN * Regression

<!> Missing data can be a signal (e.g. no income reported could mean unemployment!) * if so, encode it in a new feature (field_missing = True) which could be predictive of your target variables

More at: * videos * https://youtu.be/oMc9StPVzOU?t=201

Mistral AI Company#

A small creative team with high scientific standards. They make efficient, helpful and trustworthy AI models through ground-breaking innovations.

Models

More at:

site - https://mistral.ai/

See also M, ...

Mistral Model#

Model developed by the Mistral AI Company

Mistral-7B is a 7.3B parameter model that:

Outperforms Llama 2 13B on all benchmarks
Outperforms Llama 1 34B on many benchmarks
Approaches CodeLlama 7B performance on code, while remaining good at English tasks
Uses Grouped-Query Attention (GQA) for faster inference
Uses Sliding Window Attention (SWA) to handle longer sequences at smaller cost

Mistral-8x7B is using the [mixture of expert] architecture with 8 mistral-7b experts!

See also M, ...

Mixed Precision#

~ precision is not the same for all ...

Automatic Mixed Precision

See also M, ...

Mixed Reality (MR)#

Mixed reality is a blend of physical and digital worlds, unlocking natural and intuitive 3D human, computer, and environmental interactions. This new reality is based on advancements in computer vision, graphical processing, display technologies, input systems, and cloud computing.

Mixed reality is the midpoint in the Virtual Continuum

More at:

See also M, ...

Mixtral Model#

The 8x7B model is architected using a [sparse mixture-of-experts architecture] and therefore its size is 8 times bigger!

More at:

paper - https://arxiv.org/abs/2401.04088
site - https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
announcement - https://mistral.ai/news/mixtral-of-experts/
demo - https://deepinfra.com/chat

Mixture-Of-Experts (MOE) Architecture#

~ like an ensemble method where all the expert models are run and combine their results. The output is selected by another artificial neural network.

In 1991, MoE was first introduced by a research group that included deep-learning and Switch Transformer creator Geoff Hinton. In 2017, the Google Brain team and Hinton used MoE to create an NLP model based on recurrent neural networks (RNN) of 137 billion parameters, where it achieved state-of-the-art (SOTA) results on language modelling and machine translation benchmarks.

A system of experts and gating network: Each expert is a feedforward network and all the expert receive the same input and have the same number of outputs. The gating network is also a feedforward, and typically receives the same input as the expert networks.

More at :

paper - http://www.cs.toronto.edu/~fritz/absps/jjnh91.pdf
wikipedia - https://en.wikipedia.org/wiki/Mixture_of_experts
articles
- in-depth - https://github.com/huggingface/blog/blob/main/moe.md

See also M, [Sparse Mixture-Of-experts Architecture]

ML Algorithm Evaluation#

To evaluate any technique we generally look at 3 important aspects: (1) Ease to interpret output (2) Calculation time (3) Predictive Power.

Let us take a few examples to place KNN in the scale...

See also M, [K-Nearest Neighbors Algorithm], Random Forest

ML Energy Initiative#

~ models such as the RWKV claims to be the most energy efficient. Is that true?

More at:

site - https://ml.energy/
blog - https://ml.energy/blog/
leaderboard - https://ml.energy/leaderboard/

See also M, ...

MLCommons#

More at:

https://mlcommons.org/

See also M, ...

MLflow Tool#

MLflow Python Module#

MLflow#

MLflow is an open-source MLOps platform. With MLflow, you can track experiments and manage multiple models.

If you’d like to learn more about the end-to-end machine learning cycle, you can try using MLflow. You can contribute to a range of functionalities—artifact stores, model registry APIs, and more. Getting Started: Visit MLflow’s GitHub repository to find issues and learn how to contribute.

Purpose:

Experiment tracking
Model evaluation
[Model deployment]
[Model registry]
MLOps
... with a UI !

More at:

MLOps#

Machine Learning Operations

MLOps = People + Technology + Process

Tools

Kubeflow
[Sagemaker Pipeline]
ModelMesh
[Caikit]
MLFlow
Katib - autoML ?
...
[Istio]

Mobile Aloha Model#

[Imitation learning] from human demonstrations has shown impressive performance in robotics. However, most results focus on table-top manipulation, lacking the mobility and dexterity necessary for generally useful tasks. In this work, we develop a system for imitating mobile manipulation tasks that are bimanual and require whole-body control. We first present Mobile ALOHA, a low-cost and whole-body teleoperation system for data collection. It augments the ALOHA system with a mobile base, and a whole-body teleoperation interface. Using data collected with Mobile ALOHA, we then perform supervised behavior cloning and find that co-training with existing static ALOHA datasets boosts performance on mobile manipulation tasks. With 50 demonstrations for each task, co-training can increase success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks such as sauteing and serving a piece of shrimp, opening a two-door wall cabinet to store heavy cooking pots, calling and entering an elevator, and lightly rinsing a used pan using a kitchen faucet.

Use 2 arms from [Trossen Robotics]

More at:

site - https://mobile-aloha.github.io/

See also M, ...

Model#

Model Agnostic Meta Learning (MAML)#

How to pick a better initialization point? Don't start randomly, but use a particular function phy... Gradient descent for initial point = 2 step of normal training.

Model Benchmark#

The result of a model evaluation against a set of reference benchmarks

Model Build Of Material (MBOM)#

What was used to build the model!

See also M, ...

Model Card#

More at:

paper - https://arxiv.org/abs/1810.03993
Edu
- Kaggle - https://www.kaggle.com/code/var0101/model-cards
Articles
- model cards at google - https://modelcards.withgoogle.com/about
facebook LLaMa model card - https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md

Model Checkpoint#

A snapshot of a model with its weights. Is usually used for model evaluation before being released.

See also M, ...

Model Complexity#

~ how many layers should I use in my artificial neural network ? Should I use a single variable or a multi-variable regression? Should I use a linear or a quadratic regression? Etc.

The complexity of the model is optimum when the model generalizes well. If the model is too simple, it will not even fit the training set correctly. If the model is too complex, the training set will fit, but the performance on other input data will not be at par. This is the [bias-variance tradeoff] !

Again the model complexity is optimum when the bias and variance are small.

See also M, ...

Model Compression#

[Large language models (LLM)] have been making waves, demonstrating exceptional performance in many tasks. However, their impressive capabilities come with a significant drawback: high computational costs.

Top-tier models such as LLaMA 2 and Falcon can demand dozens, if not hundreds, of gigabytes of GPU memory. This not only makes them expensive to run but also presents a formidable challenge in terms of setup. Furthermore, their resource-intensive nature makes it nearly impossible to run them on edge devices without access to robust cloud servers.

To overcome these hurdles, researchers have been developing a range of innovative compression techniques. These methods aim to make LLMs more compact, enabling them to fit on devices with limited resources. Additionally, they can enhance the speed of these models and reduce inference latency, making them more efficient.

Methods:

Pruning
Quantization
[Knowledge Distillation]

More at:

paper - https://arxiv.org/abs/2307.02973
articles:
- LLM compression - https://bdtechtalks.com/2023/09/18/what-is-llm-compression/

See also M, ...

Model Context Protocol (MCP)#

The Model Context Protocol (MCP) is an open protocol that enables seamless integration between LLM applications and external data sources and tools. Whether you’re building an AI-powered IDE, enhancing a chat interface, or creating custom AI workflows, MCP provides a standardized way to connect LLMs with the context they need.

Integrates with

databases
file systems
...
github (pull requests, create repos, etc!!!)

More at:

site - https://modelcontextprotocol.io/introduction
code - https://github.com/modelcontextprotocol/
sdks
- python - https://github.com/modelcontextprotocol/python-sdk
articles
- announcement - https://www.anthropic.com/news/model-context-protocol

See also M, ...

Model Convergence#

When we say a machine learning model has converged, it generally means that the model has stabilized during training and additional training is not producing significant changes or improvements in its parameters or performance on the training set. Here are some key points about model convergence:

During training, the model parameters are updated iteratively to minimize a loss function. As this loss gets lower, the model fits the training data better.
In the initial stages of training, the loss decreases rapidly with each update as the model rapidly learns. But over time, the rate of improvement slows and eventually plateaus.
We say the model has converged when the loss flattens out and further parameter updates lead to diminishing or negligible improvements. The model has fit the training data as well as it can.
Convergence happens when the learning rate drops to near zero. The updates to parameters become very small. The model is no longer improving meaningfully.
Checking for convergence helps decide when to stop training. Continuing training post-convergence is wasteful.

In the context of RL, with more experience, the agent gets better and eventually is able to reach the destination reliably. Depending on the exploration-exploitation strategy, the vehicle may still have a small probability of taking random actions to explore the environment.

Well, this is not true, because Convergence does not necessarily mean the model has achieved optimal performance. The model may converge to a local minimum. Regularization and hyperparameter tuning is still important.

In the case of [RL[, the model may have converge, but if the reward function is not programmed correctly, we may see a converged ehavior, but not the desired behavior!

See also M, ...

Model Cost#

More at:

estimates for claude model - https://orenleung.com/anthropic-claude-next-cost

Model Development Life Cycle (MDLC)#

Model Drift#

Model performance changes over time.

Model Data Sheet#

More at:

https://www.microsoft.com/en-us/research/project/datasheets-for-datasets/

Model Denial Of Service (MDoS)#

Model denial of service (MDoS) is a potential attack vector against machine learning systems, with the goal of preventing legitimate use of the model. Some key characteristics of MDoS attacks:

The attacker floods the ML system with malicious input data designed to degrade model performance. This data is engineered to exploit model vulnerabilities.
The constant barrage of malicious data forces the model to expend computational resources on processing it.
Over time, the model's predictions become less accurate, less useful, or fails completely. It is unable to serve legitimate users.
The attack can be carried out remotely through the input channels exposed by ML APIs and services.

Some examples of how MDoS can manifest:

Spamming image classification APIs with adversarial images to degrade confidence scores.
Flooding a fraud detection system with valid but misleading inputs to increase false positives.
Manipulating data to poison the training set of an online learning system.
Barraging video analysis models with carefully crafted corrupted footage.

Defenses against MDoS include:

Input filtering and anomaly detection to catch obvious malicious data.
Rate limiting and throttling to slow down potential attackers.
Resetting and retraining models periodically to clear accumulated poisoned data.
Using robust and adversarially resistant models that degrade gracefully.
Monitoring model performance for sudden changes as indication of attack.

Overall, MDoS demonstrates vulnerabilities related to the data-driven nature of machine learning. Managing input sources and monitoring for degradation helps counter this threat.

Model Evaluation#

Use a model checkpoint with a benchmark tool or process for evaluation.

In the case of classification --> confusion matrix with accuracy, recall, precision, F1 score

In the case of RL, the benchmark process is a simulation in a virtual environment.

In the case of an LLM, an evaluation in prompt engineering refers to the process of evaluating an LLM's performance on a given dataset after it has been trained. You then use evals to:

assess a model's knowledge of a specific domain or capability on a given task
measure progress or change when shifting between model generations

/// details | What is the difference with a [model validation] ? type:question ///

See also M, ...

Model Evaluation Metric#

Always use the appropriate metrics for model performance

The right metric depends on your real world goal/impact!

Pick metrics that align with your actual goals ex: cancer , not cancer and FPR / FNR

Examples:

accuracy or percentage of correct predictions
FPR, FNR
Imbalanced datasets
- sensitivity, recall, hit rate or TPR
- precision
- F1 score = weighted harmonic mean of precision and recall = ( 2 X precision * recall ) / (Precision + recall)
Ranking
- Mean Average Precision (MAP) - binary relevance (e.g. search engines returning only "relevant" or "not relevant" results)
- Normalized Discounted Cumulative Gain (NDCG) - graded relevance (e.g. recommendation systems where some items are more relevant than others)
Regression
- Root Mean Square Error (RMSE)
- Mean Absolute Error (MAE)

This metrics are also used to compute the loss function

Ex: MAP Metric and MAP loss function?

Hyperparameter Tuning

More at:

videos
- using wrong metrics - https://www.youtube.com/watch?v=oMc9StPVzOU&t=245s

See also M, ...

Model Firewall#

A proxy that front the LLM and inspect input and output of queries. Useful for guardrails.

See also M, ...

Model Format#

CoreML Format - Apple format
ONNX Format - Opensource format
TorchScript Format - A PyTorch format with no dependency on Python

See also M, ...

Model Governance#

For 2 aspects:

See also M, [Data Development Life Cycle], Development Life Cycle, [Model Development Life Cycle], Regulatory Landscape, [Software Development Life Cycle]

Model Hub#

Where you can download models from.

See also M, ...

Model Fusion#

See Model Merging

Model Inversion#

Model inversion is a type of attack against machine learning models, particularly in privacy-sensitive applications. The goal of model inversion is to extract private or sensitive attributes used as inputs to train the model.

Here is an overview of how model inversion works:

An adversary has black box access to a trained ML model (e.g. can query it).
The adversary sends carefully crafted queries to the model and observes the outputs.
They analyze the input-output relationships to make inferences about the training data.
With enough queries, they can reconstruct approximations of private attributes in the training data.
For example, a model trained on medical records to predict disease risk. The adversary queries it to infer a patient's probable age, gender, medications, etc. without seeing the actual records.

Some key risks of model inversion:

Reveal demographics, behaviors, identities of individuals in training data.
Expose sensitive attributes meant to be kept private.
Violate regulations on use of personal data like HIPAA.
Enable other attacks like model evasion, poisoning, etc.

Defenses against model inversion include:

Differential privacy - Add noise to inputs/outputs to obfuscate data.
Regularization - Simplicity constraints make model behavior more uniform.
Restrict queries - Limit what outputs can be observed by adversaries.
Federated learning - Train on decentralized data, don't centralize it.

Overall, model inversion exploits the inherent fit between a model and its training data. Managing what models reveal through queries helps mitigate this emerging threat.

Model Leaderboard#

Where you can compare different models based on several metrics/benchmarks such as perplexity, bias, loss, etc.

Model Merging#

We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both [parameter space] and [data flow space], allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.

More at:

paper - https://arxiv.org/abs/2403.13187

See also M, ...

Model Predictive Control (MPC) Algorithm#

Used in Look-Ahead Planning ?

See also M, ...

Model Release Card#

~ What is released and how for a given model.

Includes:

overall position on safety-transparency scale
datasets
code
(trained) model
model card
responsible use guide
notebook
research paper
demo/samples

all of whom can be:

not released
gated release for researchers
released
available for non-commercial use

A team should assume that the project will be entirely open-sourced even so that may not be the case due to [responsible AI]

Below is the example for the LLaMa Model

See also M, ...

Model Release Frontier#

A ... approach to conceptually mapping trade-offs of model releases is a frontier which present trade-offs between 2 dimensions:

knowledge = greater understanding of the capability and limitations of model performance
steerability = degree of control to adjust or alter release conditions ex post

The past few years have seen a range of attempts to identify options along the knowledge-steerability frontier.

Seeking to balance the benefits of knowledge gains with

accessibility
recourse
maintenance (of code base)
downstream impacts

More at:

related paper on staged release - https://arxiv.org/abs/1908.09203
CivicScape
- code - https://github.com/BurgerAutomata/CivicScape/blob/master/README.md
- open-source the solution? - https://qz.com/938635/a-predictive-policing-startup-released-all-its-code-so-it-can-be-scoured-for-bias
- mit review - https://www.technologyreview.com/2020/07/17/1005396/predictive-policing-algorithms-racist-dismantled-machine-learning-bias-criminal-justice/
- bloomberg (blocked) - https://www.bloomberg.com/news/features/2017-07-10/the-ex-cop-at-the-center-of-controversy-over-crime-prediction-tech

See also M, ...

Model Risk Management (MRM)#

In 2021, real estate marketplace company Zillow took a USD 304 million inventory write-down and planned to slash a quarter of its workforce following its failed home-buying venture, which was partly caused by the inability of its housing price valuation model to accurately predict home prices.

Model risk can stem from various causes:

Data

A model’s input data might be erroneous, incomplete, outdated or biased. If outdated data is used for a market model, for instance, then it might project skewed trends regarding market performance or market prices.

Also, if training data sets for artificial intelligence (AI) models aren’t evaluated for the presence of bias, these AI models can produce results that reflect and perpetuate the intrinsic bias in the data. For example, job applicant screening systems might favor male or younger applicants, while healthcare prediction software might exhibit racial bias when prioritizing patients in need of immediate care.
Assumptions and variables

Assumptions might be flawed or unrealistic. Irrelevant, wrong, missed or omitted variables or incorrect variable calibrations can affect model output.

For instance, a pricing model that doesn’t factor in market volatility might produce inaccurate estimates, while product demand forecasting models that fail to consider seasonal purchasing behaviors or current economic conditions, such as shipping delays or decreased spending, might lead to poorly managed inventory levels.

Meanwhile, a patient care prediction model that puts a greater weight on a variable such as healthcare spending might result in the model discriminating against those who have lower incomes and thus spend less on healthcare but have a greater need to access it.
Methodology

The chosen methodology might have inherent errors, so model developers need to be knowledgeable about the model and aware of its limitations. For example, statistical methods such as regression modeling can have sampling and standard errors.

This is also where selecting the right model comes in. For instance, even though generative AI is the latest technology, it might not be a strong fit for financial forecasting, where other well-established models can do it for less work and lower cost.
Model implementation

Incomplete or incorrect model development can lead to inaccurate results or model errors. The same is true for programming errors, mistakes in approximations or calculations and other technical errors. Applying any shortcuts or simplifications as a result of model uncertainty and complexity might also affect the outcome.

For example, tight timelines to deploy a predictive analytics model for sales performance might lead to using real-time data feeds of sales numbers. However, because of this decision, the model might fail frequently or be slow to run. In this case, switching to a daily or weekly data snapshot might improve the model’s speed and stability.

Rigorous testing can also help detect errors during implementation, such as accidentally using a different date format for an insurance claims assessment model or another unit of measurement for a healthcare diagnostics model, or inadvertently modifying the currency for a pricing model.
Interpretation of results

Misinterpreting the output of a model can lead to misinformed decision-making and taking the wrong course of action. This is where expert analysis is needed, with subject matter experts evaluating the soundness of a model’s results. Explainability and transparency are also crucial in determining how a model arrived at its conclusions.
Model usage

Models might be misused or the wrong model might be applied to a certain scenario. A model’s design and specifications might also be unfit for a particular business case.

For instance, a model that helps hospitals triage patients faster in a particular state or region might not be suitable for a neighboring state or region due to varying demographics. Meanwhile, models that identify a lung condition in children from their chest scans might not be able to detect the same condition in adults.

More at: * articles * Zillow - https://edition.cnn.com/2021/11/09/tech/zillow-ibuying-home-zestimate/index.html * https://www.ibm.com/think/topics/model-risk-management

See also R, NIST AI Risk Management Framework, ...

Model Robustness#

See Model Stability

Model Size#

For a neural network it is the number of parameters

See also M, ...

Model Size Efficient Frontier#

See Parameter Count Efficient Frontier

Model Stability#

~ model is robust to external changes

How to make sure that your Machine Learning model is still performing as trained overtime.

The Machine Learning models we build are based on a simple premise: the data which we will use during inference times should have the same distribution as the data used on training.

This premise is simple but rather strong because it means we can infer new observations based on previously known ones.

Sometimes, however, some real-life effects can cause the observations to shift. One example is the Covid-19 pandemic. A financial model trained before the spread of the virus might not work well for the population after the pandemic. Families might have had reduced income, and even those that didn’t probably gave a second thought before spending their money.

These effects can be silent, and to avoid them we need to evaluate the stability of the models, or in other words, if the model is still effective after the changes in the real-life scenario.

There are 2 metrics that can do that:

More at:

articles
- PSI and CSI - https://towardsdatascience.com/checking-model-stability-and-population-shift-with-psi-and-csi-6d12af008783

See also M, ...

Model Scanning#

A step required in model threat analysis when importing open-source models like LlaMa , Falcon , or others less popular models. Looking at a model to find vulnerabilities, a bit like an antivirus would!

See also M, ...

Model Threat Analysis#

Covers:

Data Poisoning
Supply Chain Vulnerabilities calls for Model Scanning
Inference
Automation
- Output Handling
- Overreliance
- Excessive Agency

Model Tuning#

done with gradient descent?

Model Type#

More at:

https://youtu.be/J8Eh7RqggsU?list=PLoROMvodv4rO1NB9TD4iUZ3qghGEGtqNX&t=2568

See also M, Reflex Model, [Logic Model], State Model, Variable Model

Model Uplift#

Improvement of the model performance due to the usage of synthetic data.

Comparison of the performance of a model trained and tested with real data vs the performance of the same model trained with real and synthesize data and tested on real data.

You do not test on synthetic data!

See also M, ...

Model Validation#

Model validation is the process of assessing a machine learning model's performance during the training phase. It involves using a separate subset of the training data, called the validation dataset, to tune and optimize the model's hyperparameters. Hyperparameters are parameters that are not learned by the model during training but are set before training begins. Examples include learning rate, regularization strength, and the number of hidden units in a neural network.

During validation, the model is trained on the training dataset and its performance is evaluated on the validation dataset. This evaluation helps in selecting the best hyperparameters and preventing overfitting. Overfitting occurs when a model learns to perform well on the training data but fails to generalize to new, unseen data.

Validation vs Evaluation:

Purpose - Model validation is used to fine-tune hyperparameters and prevent overfitting during the training phase, while model evaluation assesses the model's performance on completely new, unseen data.
Data - Model validation uses the validation dataset (a subset of the training data), while model evaluation uses a separate testing dataset that was not part of training or validation.

See also M, ...

Model-Based Reinforcement Learning#

~ If, after learning, the agent can make predictions about what the next state and reward will be before it takes each action, it's a model-based RL algorithm.

Model-based RL approaches explicitly learn a model of state transitions. Model-free RL methods like Q-learning learn without modeling transitions.

In Reinforcement Learning, you learns model of environment transitions & rewards, then optimizes policy through planning. e.g. Dyna, AlphaGo.

Given a state and an action, the model might predict the next state and next reward. Models are used for deciding on a course of actions, by taking into account possible future situations before they are actually experienced.

Model-based reinforcement learning has an agent try to understand the world and create a model to represent it. Here the model is trying to capture 2 functions, the transition function from states 𝑇 and the reward function 𝑅. From this model, the agent has a reference and can plan accordingly. A simple check to see if an RL algorithm is model-based or model-free is:

If, after learning, the agent can make predictions about what the next state and reward will be before it takes each action, it's a model-based RL algorithm.
If it can't, then it’s a model-free algorithm.

~ If, after learning, the agent can make predictions about what the next state and reward will be before it takes each action, it's a model-based RL algorithm.

More at:

https://ai.stackexchange.com/questions/4456/whats-the-difference-between-model-free-and-model-based-reinforcement-learning

See also M, ...

Model-Centric AI#

Focus on the algorithm, assume the data is static! (like a kaggle dataset). The opposite of data-centric AI

See also M, ...

Model-Free Reinforcement Learning#

~ If, after learning, the agent can make predictions about what the next state and reward will be before it takes each action, it's a model-based RL algorithm.

Model-based RL approaches explicitly learn a model of state transitions. Model-free RL methods like Q-learning learn without modeling transitions. The agents here are explicitly trial-and-error learners. (?)

Model-free learning is a category of reinforcement learning (RL) algorithms that do not require a complete model of the environment to make decisions. In model-free learning, the agent learns directly from interacting with the environment, without explicitly building or using a model that represents the environment's dynamics or transition probabilities.

Instead of modeling the environment, model-free learning algorithms focus on estimating value functions or directly learning policies through trial-and-error interactions. The two primary components in model-free learning are the policy and the value function:

Policy: The policy determines the agent's behavior by mapping states or state-action pairs to actions. The policy can be deterministic (selecting a single action) or stochastic (selecting actions based on probabilities). Model-free learning algorithms aim to find an optimal policy that maximizes the expected cumulative reward over time.
Value Function: The value function estimates the expected long-term return or cumulative reward associated with being in a particular state or taking a specific action in a given state. It represents the desirability or utility of states or actions. Value functions can be estimated through various techniques such as Monte Carlo methods, temporal difference learning, or function approximation.
Model-free learning algorithms learn by iteratively updating the value function or policy based on the observed rewards and states during interactions with the environment. These updates are typically driven by optimization principles, such as maximizing cumulative rewards or minimizing the difference between estimated and observed values.

Model-free learning is suitable in scenarios where it is difficult or impractical to obtain a complete model of the environment, and the agent must learn directly from experience.

A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm. A simple check to see if an RL algorithm is model-based or model-free is:

If, after learning, the agent can make predictions about what the next state and reward will be before it takes each action, it's a model-based RL algorithm.
If it can't, then it’s a model-free algorithm.

Algorithms that purely sample from experience such as Monte Carlo Control, SARSA, Q-learning, Actor-Critic are "model free" RL algorithms. They rely on real samples from the environment and never use generated predictions of next state and next reward to alter behaviour (although they might sample from experience memory, which is close to being a model).

~ If, after learning, the agent can make predictions about what the next state and reward will be before it takes each action, it's a model-based RL algorithm.

Examples of model-free learning algorithms include:

Q-learning,
State-Action-Reward-State-Action (SARSA),
and REINFORCE.

More at: * https://ai.stackexchange.com/questions/4456/whats-the-difference-between-model-free-and-model-based-reinforcement-learning

See also M, ...

ModelMesh Framework#

~ a multi-model serving framework, recently open-sourced by IBM. It underpins many watson cloud services.

See also M, ...

Modified National Institute of Standards and Technology (MNIST) Dataset#

Pictures of numbers written by college student taken by the post office to be able to sort the zip codes. Conversion from every image to matrix was done by hand.

First approach to solve the problem was to use One of the first neural networks,

layer 1: 200 neurons
2nd layer: 100
3rd: 60
4th: 30
5th: 10 # 0-9 digit
with sigmoid activation function

The artificial neural network has a inverted tree like structure. Unfortunately the accuracy of 92% could be improved!

/// details | weren't the image 28 x 28 or 784 pixels (matrix elements)? type:question

number of images were 60.000
number of output neurons = 10 (from 0 to 9)
labels are using one-hot encoding on those 10 values <-- What about dummy variable trap, shouldn't we have 9 values? ///

A better model found later was the Convoluted Neural Network (CNN)

The first part of the CNN is called feature extraction, the second part is the classification.

More at:

MNIST dataset - https://knowyourdata-tfds.withgoogle.com/dataset=mnist
EMNIST dataset - https://knowyourdata-tfds.withgoogle.com/dataset=emnist
https://pair.withgoogle.com/explorables/private-and-fair/

Modular Reasoning Knowledge and Language (MRKL) Agent#

Several types:

LLM embedded code or SQL queries in the returned response
[ReAct Agent]
Program-aided Language Models (PAL)

More at:

articles

See also M, [Modular Reasoning Knowledge and Language Architecture]

Modular Reasoning Knowledge and Language (MRKL) System#

This architecture focuses on combining LLMs with external tools and reasoning processes. It's modular, meaning different components can be added or removed as needed. MRKL agents can use tools like search engines or databases to gather current or specialized information, and then apply reasoning to this information within the context of a language model's response.

While both MRKL and [Retrieval Augmented Generation (RAG)] aim to augment the capabilities of LLMs with external information, MRKL emphasizes a more modular approach that includes discrete reasoning and the use of various tools, whereas RAG focuses on integrating real-time retrieval of information into the generation process.

More at:

See also M, [MRKL Autonomous Agent], ReACT Prompting

Modular Retrieval Augmented Generation (M-RAG) System#

See also M, ...

Monte Carlo Control#

Monte Carlo control methods are a set of techniques used in the field of reinforcement learning, a branch of [machine learning]. These methods are used to find optimal policies for decision-making problems, particularly in environments where the model (i.e., the complete knowledge of the environment's dynamics) is not known. They rely on the Monte Carlo approach, which involves learning from complete episodes of experience without requiring a model of the environment's dynamics.

Key Concepts:

Episodic Tasks: Monte Carlo methods are applied to episodic tasks, where the interaction is broken down into episodes, and each episode eventually ends in a terminal state.
Experience Sampling: The agent learns from episodes of experience. Each episode consists of states, actions, and rewards, and the agent's task is to learn a policy that maximizes the total reward.
Exploration vs. Exploitation: Monte Carlo control methods must balance exploration (trying new actions to discover their effects) with exploitation (choosing known actions that yield high rewards).

How Monte Carlo Control Works:

Policy Evaluation: The value of each state is estimated based on the average returns (total rewards) received after visiting that state. This is done by sampling episodes and calculating the return for each state-action pair.
Policy Improvement: The policy is then improved by choosing actions that lead to higher-value states. This is typically done using a greedy approach, where the action with the highest estimated value is chosen.

Types of Monte Carlo Control Methods:

On-Policy Methods: These methods, like Monte Carlo with Exploring Starts or Monte Carlo On-Policy Control (e.g., ε-greedy policies), improve the policy that is used to make decisions.
Off-Policy Methods: These methods, such as Importance Sampling, learn the value of the optimal policy independently of the agent's actions (i.e., they can learn from data generated by a different policy).

Advantages and Challenges:

Model-Free: Monte Carlo methods do not require a model of the environment and learn directly from experience.
Simplicity: These methods are conceptually simple and easy to implement.
High Variance: Returns can have high variance, which might slow down learning or require many episodes to converge.
Episode Dependency: They are only applicable to episodic tasks and require complete episodes to update the policy.

Monte Carlo control methods are particularly useful in situations where the environment is complex or unknown, and the agent must learn entirely from its interaction with the environment. They form a fundamental part of the reinforcement learning toolkit and are used in various applications, from game playing to robotics.

See also M, ...

Monte Carlo Method#

~ simulate 1 million experiments to get statistics!

Beware

Not always fast!

not interpretable, but find a numerical result!

not generalizable, if you change the input parameters, you need to rerun the simulation to get the result. From one result, you cannot deduce the others from one previous result.

Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be deterministic in principle. They are particularly useful in three main areas:

Mathematical Integration: Monte Carlo methods are used to approximate the value of integrals, especially when dealing with high-dimensional spaces where traditional integration techniques become infeasible.
Optimization: These methods are useful in finding optimal solutions in various fields, such as finance and logistics, where the solution space is large and complex.
Simulation: Monte Carlo simulations are widely used in various fields like physics, engineering, finance, and biology to model complex systems and processes. For example, in finance, they are used to model the behavior of markets or to assess risks.

The name "Monte Carlo" was coined during the 1940s by scientists working on nuclear weapon projects in the Los Alamos National Laboratory. It refers to the Monte Carlo Casino in Monaco since chance and randomness are central to the casino games, much like the methods themselves.

Monte Carlo methods are powerful because they allow for the modeling of complex phenomena and systems that are analytically intractable or difficult to model with deterministic methods. However, they typically require a large number of samples to produce accurate results and are computationally intensive.

Monte Carlo methods encompass a variety of techniques, each tailored to specific types of problems or simulations. Here are some notable examples:

Simple Monte Carlo Integration: This is used for estimating the value of integrals. It involves randomly sampling points in the domain and averaging the function values at these points.
Markov Chain Monte Carlo (MCMC): These methods, including the Metropolis-Hastings algorithm and the Gibbs sampling, are used for generating samples from a probability distribution. They are particularly useful in Bayesian statistics and machine learning for tasks like parameter estimation and model selection.
Monte Carlo Localization: Used in robotics, this method helps in estimating the position of a robot by generating a set of sample points (hypotheses) about its position and updating these points as the robot moves.
Bootstrap Methods: These are used for statistical inference. They involve repeatedly resampling a dataset with replacement and calculating a statistic (like the mean) for each sample. This helps in understanding the variability and confidence of the estimate.
Monte Carlo Tree Search (MCTS): This method is used in decision processes, notably in AI for board games like Go and chess. It involves building a search tree and using random simulations to estimate the long-term potential of each move.
Monte Carlo Ray Tracing: In computer graphics, this method is used for generating photorealistic images by simulating the path of light as pixels in an image plane, producing a very high degree of visual realism.
Monte Carlo Methods in Finance: Used for option pricing and risk assessment. These methods simulate various economic scenarios to estimate the price of financial instruments or the risk in investment portfolios.
Particle Filters: Used in signal processing and Bayesian inference, these methods involve representing the required probability distribution by a set of random samples (particles) and updating these samples over time.

Each of these methods applies the core principle of Monte Carlo — using randomness and statistical sampling to solve problems that might be deterministic but are too complex for analytical solutions.

See also M, ...

Monte Carlo Policy Gradient Algorithm#

See REINFORCE Algorithm

Monte Carlo Tree Search (MCTS) Algorithm#

To do Look-Ahead Planning ?

Monte Carlo Tree Search (MCTS) is an algorithm used to make optimal decisions in artificial intelligence systems, often for game playing. It works by constructing a tree of possible future states and then randomly sampling some paths through that tree to estimate which option has the best outcome.

Specifically, it works like this:

Starts with the current state as the root node of the tree.
From the root, uses a policy/heuristic to select some child nodes (possible next states) and evaluates them to expand the tree.
Then starts simulated playouts - it selects a leaf node and plays out a "simulated" game randomly till reaching an outcome.
The results of the playouts are backpropagated up the tree to inform node evaluations. Nodes with better outcomes from playouts get better evaluations.
Over many iterations, the algorithm balances exploring new nodes and exploiting better evaluations to pick the best moves. The most promising nodes are expanded deeper as the tree grows.
After enough samples, the decision is made by picking the best evaluated child node from the root.A

Some key advantages are that MCTS allows creating a search tree selectively, avoids brute force search of all possibilities, works well with stochastic environments, and is relatively simple to implement. Overall, it uses the power of random simulation to efficiently find optimal decisions.

See also M, ...

MuLan Model#

Similar to the CLIP model, but for music! Music to text model!

More at:

paper - https://arxiv.org/abs/2208.12415
code - https://github.com/lucidrains/musiclm-pytorch

See also M, ...

Multi-Agent#

Agents that interact with each other in a multi-agent environment. Each agent behaves in accordance with its policy. The community of agent could benefit from learning from their interactions through multi-agent reinforcement learning

Framework

MetaGPT - https://github.com/geekan/MetaGPT
ChatDEV - https://github.com/OpenBMB/ChatDev/tree/main
Agentverse - https://github.com/OpenBMB/AgentVerse
Camel - https://github.com/camel-ai/camel
Autogen - https://github.com/JayZeeDesign/microsoft-autogen-experiments

More at:

articles
- https://natural20.com/build-an-entire-ai-agent-workforce-chatdev-and-google-brain-society-of-mind-agi-user-interface/

See also M, ...

Multi-Agent Environment#

An environment where several agent can coexist? a Society Of Mind

Multi-Agent Reinforcement Learning#

Learns policies for multiple interacting agents. Emergent coordination & competition.

See also M, ...

Multi-Class Classification#

ML for classification with more than 2 categories. Categories are mutually exclusive.

More at: * articles * XGBoost - https://towardsdatascience.com/xgboost-for-multi-class-classification-799d96bcd368

Multi-Head Attention#

~ the brains of the Transformer and responsible for performance through parallelism. Multi-Head Attention consists of several attention layers running in parallel. The Attention layer takes its input in the form of three parameters, known as the Query, Key, and Value (aka Q,K,V). All three parameters are similar in structure, with each word in the sequence represented by a vector. In transformers is used for encoder and decoder.

Q: * independent multi-layer perceptron (?) that takes a vector as input (?) and return an output ~ a bit like asking questions on the input vector (queen --(attention)--> vector --(feedforward network)--> is it a noun? = output-1, is it english = output-2, does it refer to a person = o-3, is it an mount? = o-4, is the tone ascertive = o-5, is it a piece of a bigger word? = o-6, iis it part of a quote? = o-7, ...)

More at:

code - https://nlp.seas.harvard.edu/annotated-transformer/

Multi-Instance GPU (MIG)#

~ Seven independent instances in a single GPU.

~ enable inference, training, high-performance computing (HPC) workloads to run at the same time on a single GPU with deterministic latency and throughput.

Multi-Instance GPU (MIG) expands the performance and value of NVIDIA Blackwell and Hopper generation GPUs. MIG can partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. This gives administrators the ability to support every workload, from the smallest to the largest, with guaranteed quality of service (QoS) and extending the reach of accelerated computing resources to every user.

Benefit: * Expand GPU Access * Optimize GPU utilization * Run simultaneous workloads

Without MIG, different jobs running on the same GPU, such as different AI inference requests, compete for the same resources. A job consuming larger memory bandwidth starves others, resulting in several jobs missing their latency targets. With MIG, jobs run simultaneously on different instances, each with dedicated resources for compute, memory, and memory bandwidth, resulting in predictable performance with QoS and maximum GPU utilization.

Provision and Configure Instances as Needed
A GPU can be partitioned into different-sized MIG instances. For example, on an NVIDIA GB200, an administrator could create two instances with 95GB of memory each, four instances with 45GB each, or seven instances with 23GB each.
MIG instances can also be dynamically reconfigured, enabling administrators to shift GPU resources in response to changing user and business demands. For example, seven MIG instances can be used during the day for low-throughput inference and reconfigured to one large MIG instance at night for deep learning training.
Run Workloads in Parallel, Securely
With a dedicated set of hardware resources for compute, memory, and cache, each MIG instance delivers guaranteed QoS and fault isolation. That means that a failure in an application running on one instance doesn’t impact applications running on other instances.
It also means that different instances can run different types of workloads—interactive model development, deep learning training, AI inference, or HPC applications. Since the instances run in parallel, the workloads also run in parallel—but separate and isolated—on the same physical GPU.

MIG in Blackwell GPUs:

Blackwell and Hopper GPUs support MIG with multi-tenant, multi-user configurations in virtualized environments across up to seven GPU instances, securely isolating each instance with confidential computing at the hardware and hypervisor level. Dedicated video decoders for each MIG instance deliver secure, high-throughput intelligent video analytics (IVA) on shared infrastructure. With concurrent MIG profiling, administrators can monitor right-sized GPU acceleration and allocate resources for multiple users.
For researchers with smaller workloads, rather than renting a full cloud instance, they can use MIG to isolate a portion of a GPU securely while being assured that their data is secure at rest, in transit, and in use. This improves flexibility for cloud service providers to price and address smaller customer opportunities.

See also M, ...

Multi-Label Classification#

~ an input is mapped to one or more labels. Labels are not mutually exclusive (otherwise multi-class classification) !!

While training the [Binary cross-entropy loss function] is used

Multi-Task Learning (MTL)#

~ not enough data for individual algo --> reuse data from other similar algo Train all parameters at the same time, but each of the task have shared parameters. One example is a spam-filter, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones, for example an English speaker may find that all emails in Russian are spam, not so for Russian speakers. Yet there is a definite commonality in this classification task across users, for example one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.

/// details | How does that relate to transfer learning? type:question ///

Multi-Turn Question Set Benchmark (MT-Bench)#

Despite the base LLaMA models showing competitive performance on conventional benchmarks, its answers to open-ended questions are often not preferred by humans. This misalignment of conventional benchmarks underscores the core problem driving this paper: the need for a robust and scalable automated method to evaluate LLM alignment with human preferences.

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong [LLMs as judges] (such as [GTP-4]) to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.

More at:

paper - https://arxiv.org/abs/2306.05685
code - https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge
leaderboard - https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

See also M, ...

Multi-Vector Retrieval#

retrieval based on document summaries

More at:

https://community.fullstackretrieval.com/index/multi-vector

See also M, ...

Multiattribute Objective#

Example: A self-driving car needs to

take you from point A to point B
in a safe manner
in a comfortable manner
but as quickly as possible
without killing a pedestrian or anybody
do not trigger road rage of other driver

Multidimensional Scaling (MDS)#

~ a dimensionality reduction technique

See alse M, ...

Multilayer Perceptron (MLP) Architecture#

A fully connected multi-layer neural network with static weights and fixed activation function is called a Multilayer Perceptron.

import torch
import torch.nn as nn

# Helper class for MLP
class MLP(nn.Module):
    def __init__(self,in_dim,out_dim):
        super(MLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(),
            nn.Linear(in_dim, out_dim)
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.layers(x)
        return x

See also M, Artificial Neural Network, [Kolmogorov-Arnold Network], Vision Transformer

Multimodal Alignment#

See also M, Multimodal Learning

Multimodal Co-learning#

Enhance the classification of one modality with the help of another. For example: help to recognize a dog.

See also M, Multimodal Learning

Multimodal Distribution#

In math, use a distribution that has multiple maxima = multiple modes, i.e. distinct "peak (local maxima) in the probability density function".

See also M, Multimodal Learning

Multimodal Framework (MMF)#

A frameowkr build on top of Pytorch for language and vision models. Features include (1) modular, (2) distributed, (3) pre-trained multimodal transformer.

See also M, ...

Multimodal Fusion#

Bring the representation of each mode into the same embedding space. Methods:

dot product (multiply the vectors of one modality with the other) BRing everything into a scalar value?
concat output + fully connected layers after the concatenation
tensor fusion network (paper published in 2017) - https://sentic.net/tensor-fusion-network-for-multimodal-sentiment-analysis.pdf
minimize the cosine loss between the 2 modalities

See also M, Multimodal Learning

Multimodal Large Language Layer (MLLM)#

See also M, ...

Multimodal Learning#

For a human, multimodal = using all our senses. For a computer, using input text, input image, etc. generating output text based on image, output image based on input text? Examples: (1) a robot taking instruction (text + voice recognition) and performing various actions in the physical space (2) visual common sense reasoning - answering common sense questions about a given image (3) an email network graph containing nodes as users, IP address connected by email sent and received edges. Not multimodal learning examples: (1) spoke language translation.

AI that can understand the relationships between images, text and more

Google's Pathways architecture is trying to solve this.

Multimodal Model#

Multimodal models go beyond traditional language models by incorporating various types of data, such as images, videos, audio, and more. They create joint embeddings that capture information from both text and visual inputs, enabling them to solve complex tasks that require a combination of visual and textual understanding.

Multimodal models enhance robotics by enabling robots to understand and interact with their environment more effectively. By integrating visual and textual inputs, robots can navigate complex environments, recognize objects, and respond to commands in a more human-like manner. This opens up new possibilities for applications in areas such as autonomous navigation and human-robot interaction.

See also M, ...

Multinomial Naive Bayes Classifier#

Multimodal Retrieval Augmented Generation (M-RAG) System#

How to use Multimodal RAG to chat with Nvidia's investor slide deck from last year? The slide deck is 39 pages with a combination of text, visuals, tables, charts and annotations. The document structure and templates vary from page to page and is quite difficult to RAG over using traditional methods.

More at:

https://github.com/togethercomputer/together-cookbook/blob/main/MultiModal_RAG_with_Nvidia_Investor_Slide_Deck.ipynb

See also M, ...

Multimodal Space#

A latent/embedded space where both modality have the same representation.

Multimodal Translation#

How to translate one mode to the other. Ex: captioning an image.

Multiple Linear Regression#

A linear regression with multiple input / independent variable.

Y = a + b.X1 + c.X2

More at:

sample paper - https://www.mdpi.com/2073-4395/11/5/885

MUM Model#

MUM was developed by Google, uses the T5 text-to-text framework, and is 1,000 times more powerful than BERT.

T5 uses transformer-based architecture, just like BERT, but instead uses a text-to-text approach. What that means is with T5, the input (query) and output (result) are always text strings, in contrast to BERT-style models that can only output either a classification label or the span of the input into a question and answer format. This means that the output with BERT, while undeniably impressive, was in comparison to MUM still rather abstract.

The T5 text-to-text process includes more in-depth machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis).

More at:

See also M, ...

Muse Model#

Muse is a fast, state-of-the-art text-to-image generation and editing model built by Google.

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as [Imagen] and [DALL-E 2], Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as [Parti], Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc. Our 900M parameter model achieves a new SOTA on [CC3M], with an [FID score] of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a [CLIP score] of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing.

More at:

site - https://muse-model.github.io/
paper https://arxiv.org/abs/2301.00704
articles
- https://venturebeat.com/ai/googles-muse-model-could-be-the-next-big-thing-for-generative-ai/

See also M, ...

MuseGAN#

4 inputs: (1) Chords input (with temporal network) (2) style input (3) melody input (temporal network) (4) groove input

See also M, ...

Music ABC Notation#

More at:

See also M, ...

Music Generator#

More at:

with transformer - https://towardsdatascience.com/creating-a-pop-music-generator-with-the-transformer-5867511b382a
with LSTM - https://medium.com/analytics-vidhya/music-generation-using-deep-learning-a2b2848ab177
with RNN - https://medium.com/analytics-vidhya/music-generation-using-deep-learning-a2b2848ab177
with transformer - https://crfm.stanford.edu/2023/06/16/anticipatory-music-transformer.html

See also M, ...

MusicLM Model#

An impressive new AI system from Google can generate music in any genre given a text description. Called MusicLM, Google’s certainly isn’t the first generative artificial intelligence system for song. There have been other attempts, including Riffusion, an AI that composes music by visualizing it, as well as Dance Diffusion, Google’s own AudioML and OpenAI’s Jukebox. But owing to technical limitations and limited training data, none have been able to produce songs particularly complex in composition or high-fidelity. MusicLM is perhaps the first that can.

More at:

twitter - https://twitter.com/keunwoochoi/status/1618809167573286912
paper - https://arxiv.org/abs/2301.11325
example - https://google-research.github.io/seanet/musiclm/examples/
dataset - https://www.kaggle.com/datasets/googleai/musiccaps
techcrunch article - https://techcrunch.com/2023/01/27/google-created-an-ai-that-can-generate-music-from-text-descriptions-but-wont-release-it/
code - https://github.com/lucidrains/musiclm-pytorch

See also M, Jukebox Model, Riffusion Model

Mustafa Suleyman Person#

One of the 3 founders of DeepMind and founder of Inflection AI

See also M, ...

MuZero Model#

A model built by DeepMind

More at:

nature article - https://www.nature.com/articles/s41586-020-03051-4.epdf

See also M, ...