
Hypothesis in Machine Learning


The concept of a hypothesis is fundamental in machine learning and data science. In the realm of machine learning, a hypothesis serves as an initial assumption made by data scientists and ML professionals when attempting to address a problem. Machine learning builds models from past data (experience), and hypotheses are the candidate solutions that this process formulates, tests, and refines.

It’s important to note that in machine learning discussions, the terms “hypothesis” and “model” are sometimes used interchangeably. However, a hypothesis represents an assumption, while a model is a mathematical representation employed to test that hypothesis. This section on “Hypothesis in Machine Learning” explores key aspects related to hypotheses in machine learning and their significance.

Table of Contents

  • How does a Hypothesis work?
  • Hypothesis Space and Representation in Machine Learning
  • Hypothesis in Statistics
  • FAQs on Hypothesis in Machine Learning

How does a Hypothesis work?

A hypothesis in machine learning is the model’s presumption regarding the connection between the input features and the output. It represents the mapping function that the algorithm is attempting to learn from the training set. During learning, the weights that parameterize the hypothesis are adjusted to minimize the discrepancy between the predicted and actual outputs; a cost function is used to assess the hypothesis’s accuracy, and the objective is to optimize the model’s parameters to achieve the best predictive performance on new, unseen data.

In most supervised machine learning algorithms, our main goal is to find a possible hypothesis from the hypothesis space that maps the inputs to the correct outputs. The following figure shows the common approach to finding a possible hypothesis from the hypothesis space:

[Figure: selecting a hypothesis h from the hypothesis space H]

Hypothesis Space (H)

Hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the single best hypothesis that describes the target function, i.e., the mapping from inputs to outputs.

Hypothesis (h)

A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm comes up with depends on the data and on the restrictions and bias that we have imposed on the data.

The hypothesis can be calculated as:

$y = mx + b$

  • m = slope of the line
  • b = intercept
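
For illustration, here is a minimal sketch in plain Python (the slope, intercept, and toy data below are made-up values) of evaluating such a linear hypothesis with a mean-squared-error cost function:

```python
# A minimal sketch of a single linear hypothesis h(x) = m*x + b and its cost.
# The slope/intercept values and the toy data below are made up for illustration.

def h(x, m, b):
    """Linear hypothesis: predicts y from x using slope m and intercept b."""
    return m * x + b

def mse_cost(xs, ys, m, b):
    """Mean squared error of the hypothesis (m, b) over a dataset."""
    errors = [(h(x, m, b) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]               # roughly y = 2x

print(mse_cost(xs, ys, m=2.0, b=0.0))   # low cost: a good hypothesis
print(mse_cost(xs, ys, m=0.5, b=1.0))   # higher cost: a worse hypothesis
```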

To better understand the Hypothesis Space and Hypothesis, consider the following coordinate plane that shows the distribution of some data:

[Figure: scatter plot of the training data]

Now suppose we have test data for which we have to determine the outputs or results. The test data is as shown below:

[Figure: test data points on the same coordinate plane]

We can predict the outcomes by dividing the coordinate plane as shown below:

[Figure: one way of dividing the coordinate plane]

So the test data would yield the following result:

[Figure: predicted results for the test data]

But note here that we could have divided the coordinate plane as:

[Figure: an alternative way of dividing the coordinate plane]

The way in which the coordinate plane is divided depends on the data, the algorithm, and the constraints.

  • All the legal ways in which we can divide the coordinate plane to predict the outcome of the test data together make up the Hypothesis Space.
  • Each individual possible way is known as a hypothesis.

Hence, in this example, the hypothesis space would look like:

[Figure: the set of possible hypotheses for this example]

The hypothesis space comprises all possible legal hypotheses that a machine learning algorithm can consider. Hypotheses are formulated based on various algorithms and techniques, including linear regression, decision trees, and neural networks. These hypotheses capture the mapping function transforming input data into predictions.

Hypothesis Formulation and Representation in Machine Learning

Hypotheses in machine learning are formulated based on various algorithms and techniques, each with its own representation. For example:

  • Linear Regression: $h(X) = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots + \theta_n X_n$
  • Decision Trees: $h(X) = \text{Tree}(X)$
  • Neural Networks: $h(X) = \text{NN}(X)$

In the case of complex models like neural networks, the hypothesis may involve multiple layers of interconnected nodes, each performing a specific computation.
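
As a rough sketch (assuming scikit-learn and NumPy are available, with purely synthetic data), the same inputs can be handed to hypotheses from two different families:

```python
# Sketch: fitting hypotheses from two different families to the same toy data.
# Assumes scikit-learn and NumPy are installed; the data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X.ravel() + rng.normal(0, 1, size=100)        # noisy linear relationship

linear_h = LinearRegression().fit(X, y)                 # h(X) = theta_0 + theta_1 * X
tree_h = DecisionTreeRegressor(max_depth=3).fit(X, y)   # h(X) = Tree(X)

x_new = np.array([[4.2]])
print(linear_h.predict(x_new), tree_h.predict(x_new))   # two hypotheses, two predictions
```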

Hypothesis Evaluation:

The process of machine learning involves not only formulating hypotheses but also evaluating their performance. This evaluation is typically done using a loss function or an evaluation metric that quantifies the disparity between predicted outputs and ground truth labels. Common evaluation metrics include mean squared error (MSE), accuracy, precision, recall, F1-score, and others. By comparing the predictions of the hypothesis with the actual outcomes on a validation or test dataset, one can assess the effectiveness of the model.
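
For example, a sketch using scikit-learn's metrics on made-up prediction and label arrays might look like this:

```python
# Sketch: quantifying how well a hypothesis's predictions match the ground truth.
# Assumes scikit-learn; the prediction/label arrays below are made-up examples.
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, recall_score)

# Regression-style evaluation
y_true_reg = [3.0, -0.5, 2.0, 7.0]
y_pred_reg = [2.5,  0.0, 2.1, 7.8]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))

# Classification-style evaluation
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
print("Accuracy:", accuracy_score(y_true_cls, y_pred_cls))
print("Precision:", precision_score(y_true_cls, y_pred_cls))
print("Recall:", recall_score(y_true_cls, y_pred_cls))
print("F1:", f1_score(y_true_cls, y_pred_cls))
```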

Hypothesis Testing and Generalization:

Once a hypothesis is formulated and evaluated, the next step is to test its generalization capabilities. Generalization refers to the ability of a model to make accurate predictions on unseen data. A hypothesis that performs well on the training dataset but fails to generalize to new instances is said to suffer from overfitting. Conversely, a hypothesis that generalizes well to unseen data is deemed robust and reliable.

The process of hypothesis formulation, evaluation, testing, and generalization is often iterative in nature. It involves refining the hypothesis based on insights gained from model performance, feature importance, and domain knowledge. Techniques such as hyperparameter tuning, feature engineering, and model selection play a crucial role in this iterative refinement process.
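
One possible sketch of this refinement loop (assuming scikit-learn, with a synthetic dataset and an arbitrary parameter grid) is a cross-validated hyperparameter search:

```python
# Sketch: iterative refinement via cross-validated hyperparameter search.
# Assumes scikit-learn; the dataset and parameter grid are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best hypothesis settings:", search.best_params_)
print("Held-out accuracy:", search.score(X_test, y_test))  # generalization check
```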

Hypothesis in Statistics

In statistics, a hypothesis refers to a statement or assumption about a population parameter. It is a proposition or educated guess that helps guide statistical analyses. There are two types of hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1 or Ha).

  • Null Hypothesis (H0): This hypothesis suggests that there is no significant difference or effect, and any observed results are due to chance. It often represents the status quo or a baseline assumption.
  • Alternative Hypothesis (H1 or Ha): This hypothesis contradicts the null hypothesis, proposing that there is a significant difference or effect in the population. It is what researchers aim to support with evidence.

FAQs on Hypothesis in Machine Learning

Q. How does the training process use the hypothesis?

The learning algorithm uses the hypothesis as a guide to minimise the discrepancy between expected and actual outputs by adjusting its parameters during training.

Q. How is the hypothesis’s accuracy assessed?

Accuracy is usually assessed with a cost function that measures the difference between predicted and actual values. The aim is to optimise the model so that this cost is as small as possible.

Q. What is Hypothesis testing?

Hypothesis testing is a statistical method for determining whether the data provide enough evidence to support or reject a hypothesis. The hypothesis can be about the relationship between two variables in a dataset, about an association between two groups, or about a specific situation.

Q. What distinguishes the null hypothesis from the alternative hypothesis in machine learning experiments?

The null hypothesis (H0) assumes no significant effect, while the alternative hypothesis (H1 or Ha) contradicts H0, suggesting a meaningful impact. Statistical testing is employed to decide between these hypotheses.



Best Guesses: Understanding The Hypothesis in Machine Learning

Stewart Kaplan


Machine learning is a vast and complex field that has inherited many terms from other places all over the mathematical domain.

It can sometimes be challenging to get your head around all the different terminologies, never mind trying to understand how everything comes together.

In this blog post, we will focus on one particular concept: the hypothesis.

While you may think this is simple, there is a little caveat regarding machine learning: the term has both a statistics side and a learning side.

Don’t worry; we’ll do a full breakdown below.

You’ll learn the following:

  • What is a hypothesis in machine learning?
  • Is this any different than the hypothesis in statistics?
  • What is the difference between the alternative hypothesis and the null?
  • Why do we restrict hypothesis space in artificial intelligence?
  • Example code performing hypothesis testing in machine learning

What Is a Hypothesis in Machine Learning?

In machine learning, the term ‘hypothesis’ can refer to two things.

First, it can refer to the hypothesis space, the set of all possible hypotheses (candidate models) that could be used to predict or answer a new instance.

Second, it can refer to the traditional null and alternative hypotheses from statistics.

Since machine learning works so closely with statistics, 90% of the time, when someone is referencing the hypothesis, they’re referencing hypothesis tests from statistics.

Is This Any Different Than The Hypothesis In Statistics?

In statistics, the hypothesis is an assumption made about a population parameter.

The statistician’s goal is to gather evidence that either supports or rejects it.


This will take the form of two different hypotheses, one called the null, and one called the alternative.

Usually, you’ll establish your null hypothesis as the assumption that the parameter equals some specific value.

For example, in Welch’s T-Test Of Unequal Variance, our null hypothesis is that the two means we are testing (population parameter) are equal.

This means our null hypothesis is that the two population means are the same.

We run our statistical tests, and if our p-value is significant (very low), we reject the null hypothesis.

This would mean that their population means are unequal for the two samples you are testing.

Usually, statisticians will use the significance level of .05 (a 5% risk of being wrong) when deciding what to use as the p-value cut-off.

What Is The Difference Between The Alternative Hypothesis And The Null?

The null hypothesis is our default assumption, which we take to be true unless we find significant evidence against it.

The alternate hypothesis is usually the opposite of our null and is much broader in scope.

For most statistical tests, the null and alternative hypotheses are already defined.

You are then just trying to find “significant” evidence we can use to reject our null hypothesis.


These two hypotheses are easy to spot by their specific notation. The null hypothesis is usually denoted by H₀, while H₁ denotes the alternative hypothesis.

Example Code Performing Hypothesis Testing In Machine Learning

Since there are many different hypothesis tests in machine learning and data science, we will focus on one of my favorites.

This test is Welch’s T-Test Of Unequal Variance, where we are trying to determine if the population means of these two samples are different.

There are a couple of assumptions for this test, but we will ignore those for now and show the code.

You can read more about this here in our other post, Welch’s T-Test of Unequal Variance .
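
Here’s a minimal sketch of what that code might look like, assuming NumPy and SciPy and using two synthetic samples:

```python
# Sketch: Welch's t-test of unequal variance on two synthetic samples.
# Assumes NumPy and SciPy are installed; the data below is generated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample_a = rng.normal(loc=10.0, scale=2.0, size=50)   # mean 10
sample_b = rng.normal(loc=11.5, scale=3.5, size=60)   # mean 11.5, different variance

# equal_var=False makes ttest_ind perform Welch's t-test
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the population means appear to differ.")
else:
    print("Fail to reject the null hypothesis.")
```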

We see that our p-value is very low, and we reject the null hypothesis.


What Is The Difference Between The Biased And Unbiased Hypothesis Spaces?

The difference between the biased and unbiased hypothesis spaces is how many of the possible examples the space accounts for.

The unbiased space accounts for all of them, while the biased space only accounts for the training examples you’ve supplied.

Since neither of these is optimal (one is too small, one is much too big), your algorithm creates generalized rules (inductive learning) to be able to handle examples it hasn’t seen before.

Here’s an example of each:

Example of The Biased Hypothesis Space In Machine Learning

The biased hypothesis space in machine learning is a restricted subspace in which your algorithm does not consider all possible examples when making predictions.

This is easiest to see with an example.

Let’s say you have the following data:

Happy  and  Sunny  and  Stomach Full  = True

Whenever your algorithm sees those three together in the biased hypothesis space, it’ll automatically default to true.

This means when your algorithm sees:

Sad  and  Sunny  And  Stomach Full  = False

It’ll automatically default to False since it didn’t appear in our subspace.

This is a greedy approach, but it has some practical applications.


Example of the Unbiased Hypothesis Space In Machine Learning

The unbiased hypothesis space is a space where all combinations are stored.

We can re-use our example above.

In the unbiased space, this would start to break down as:

Happy  = True

Happy  and  Sunny  = True

Happy  and  Stomach Full  = True

Let’s say each of the three attributes can take one of four values.

That alone gives 4^3 = 64 possible instances, and an unbiased hypothesis space would have to account for every one of the 2^64 possible ways of labelling them, just for our little three-attribute problem.

This is practically impossible; the space would become huge.
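
Here’s a quick back-of-the-envelope sketch in plain Python, under the assumption of three attributes with four possible values each, just to show how fast this blows up:

```python
# Sketch: counting instances and labelings for a tiny attribute space.
# Assumes three attributes with four possible values each (an illustrative choice).
from itertools import product

values_per_attribute = 4
num_attributes = 3

instances = list(product(range(values_per_attribute), repeat=num_attributes))
print("Possible instances:", len(instances))        # 4**3 = 64

# An unbiased hypothesis space contains every possible True/False labelling
# of those instances, i.e. one hypothesis per subset of the instance space.
print("Possible labelings:", 2 ** len(instances))   # 2**64, astronomically large
```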


So while it would be highly accurate, this has no scalability.

More reading on this idea can be found in our post, Inductive Bias In Machine Learning .

Why Do We Restrict Hypothesis Space In Artificial Intelligence?

We have to restrict the hypothesis space in machine learning. Without any restrictions, our domain becomes much too large, and we lose any form of scalability.

This is why our algorithm creates rules to handle examples that are seen in production. 

This gives our algorithms a generalized approach that will be able to handle all new examples that are in the same format.


Hypothesis Space

Eyke Hüllermeier, Thomas Fober and Marco Mernberger, Encyclopedia of Systems Biology, Springer (2013)

In machine learning, the goal of a supervised learning algorithm is to perform induction, i.e., to generalize a (finite) set of observations (the training data) into a general model of the domain. In this regard, the hypothesis space is defined as the set of candidate models considered by the algorithm.

More specifically, consider the problem of learning a mapping (model) \( f \in F = Y^X \) from an input space X to an output space Y, given a set of training data \( D = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset X \times Y \). A learning algorithm A takes D as an input and produces a function (model, hypothesis) f ∈ H ⊂ F as an output, where H is the hypothesis space. This subset is determined by the formalism used to represent models (e.g., as logical formulas, linear functions, or non-linear functions implemented as artificial neural networks or decision trees). Thus, the choice of the hypothesis space produces a representation...

Programmathically

Introduction to the Hypothesis Space and the Bias-Variance Tradeoff in Machine Learning


In this post, we introduce the hypothesis space and discuss how machine learning models function as hypotheses. Furthermore, we discuss the challenges encountered when choosing an appropriate machine learning hypothesis and building a model, such as overfitting, underfitting, and the bias-variance tradeoff.

The hypothesis space in machine learning is a set of all possible models that can be used to explain a data distribution given the limitations of that space. A linear hypothesis space is limited to the set of all linear models. If the data distribution follows a non-linear distribution, the linear hypothesis space might not contain a model that is appropriate for our needs.

To understand the concept of a hypothesis space, we need to learn to think of machine learning models as hypotheses.

The Machine Learning Model as Hypothesis

Generally speaking, a hypothesis is a potential explanation for an outcome or a phenomenon. In scientific inquiry, we test hypotheses to figure out how well and if at all they explain an outcome. In supervised machine learning, we are concerned with finding a function that maps from inputs to outputs.

But machine learning is inherently probabilistic. It is the art and science of deriving useful hypotheses from limited or incomplete data. Our functions are not axioms that explain the data perfectly, and for most real-life problems, we will never have all the data that exists. Accordingly, we will not find the one true function that perfectly describes the data. Instead, we find a function through training a model to map from known training input to known training output. This way, the model gradually approximates the assumed true function that describes the distribution of the data. So we treat our model as a hypothesis that needs to be tested as to how well it explains the output from a given input. We do this using a test or validation data set.

The Hypothesis Space

During the training process, we select a model from a hypothesis space that is subject to our constraints. For example, a linear hypothesis space only provides linear models. If the data follows a quadratic distribution, a model from the linear hypothesis space can only approximate it roughly.

[Figure: a model from a linear hypothesis space fit to quadratic data]

Of course, a linear model will never have the same predictive performance as a quadratic model, so we can adjust our hypothesis space to also include non-linear models or at least quadratic models.

[Figure: a model from a quadratic hypothesis space fit to the same data]

The Data Generating Process

The data generating process describes a hypothetical process subject to some assumptions that make training a machine learning model possible. We need to assume that the data points are from the same distribution but are independent of each other. When these requirements are met, we say that the data is independent and identically distributed (i.i.d.).

Independent and Identically Distributed Data

How can we assume that a model trained on a training set will perform better than random guessing on new and previously unseen data? First of all, the training data needs to come from the same or at least a similar problem domain. If you want your model to predict stock prices, you need to train the model on stock price data or data that is similarly distributed. It wouldn’t make much sense to train it on weather data. Statistically, this means the data is identically distributed. But if data comes from the same problem, training data and test data might not be completely independent. To account for this, we need to make sure that the test data is not in any way influenced by the training data or vice versa. If you use a subset of the training data as your test set, the test data evidently is not independent of the training data. Statistically, we say the data must be independently distributed.

Overfitting and Underfitting

We want to select a model from the hypothesis space that explains the data sufficiently well. During training, we can make a model so complex that it perfectly fits every data point in the training dataset. But ultimately, the model should be able to predict outputs on previously unseen input data. The ability to do well when predicting outputs on previously unseen data is also known as generalization. There is an inherent conflict between those two requirements.

If we make the model so complex that it fits every point in the training data, it will pick up lots of noise and random variation specific to the training set, which might obscure the larger underlying patterns. As a result, it will be more sensitive to random fluctuations in new data and predict values that are far off. A model with this problem is said to overfit the training data and, as a result, to suffer from high variance .

a model that overfits the data

To avoid the problem of overfitting, we can choose a simpler model or use regularization techniques to prevent the model from fitting the training data too closely. The model should then be less influenced by random fluctuations and instead, focus on the larger underlying patterns in the data. The patterns are expected to be found in any dataset that comes from the same distribution. As a consequence, the model should generalize better on previously unseen data.

a model that underfits the data

But if we go too far, the model might become too simple or too constrained by regularization to accurately capture the patterns in the data. Then the model will neither generalize well nor fit the training data well. A model that exhibits this problem is said to underfit the data and to suffer from high bias . If the model is too simple to accurately capture the patterns in the data (for example, when using a linear model to fit non-linear data), its capacity is insufficient for the task at hand.

When training neural networks, for example, we go through multiple iterations of training in which the model learns to fit an increasingly complex function to the data. Typically, your training error will decrease during learning the more complex your model becomes and the better it learns to fit the data. In the beginning, the training error decreases rapidly. In later training iterations, it typically flattens out as it approaches the minimum possible error. Your test or generalization error should initially decrease as well, albeit likely at a slower pace than the training error. As long as the generalization error is decreasing, your model is underfitting because it doesn’t live up to its full capacity. After a number of training iterations, the generalization error will likely reach a trough and start to increase again. Once it starts to increase, your model is overfitting, and it is time to stop training.
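
A small sketch (assuming NumPy and scikit-learn, with synthetic data and an arbitrary range of polynomial degrees) reproduces this pattern: training error keeps shrinking as capacity grows, while validation error typically bottoms out and then climbs again.

```python
# Sketch: training vs. validation error as model capacity (polynomial degree) grows.
# Assumes NumPy and scikit-learn; the data is synthetic and the degrees are arbitrary.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)   # noisy non-linear target

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in [1, 3, 10, 20]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```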

[Figure: training error and generalization error as training progresses (overfitting vs. underfitting)]

Ideally, you should stop training once your model reaches the lowest point of the generalization error. The gap between the minimum generalization error and no error at all is an irreducible error term known as the Bayes error that we won’t be able to completely get rid of in a probabilistic setting. But if the error term seems too large, you might be able to reduce it further by collecting more data, manipulating your model’s hyperparameters, or altogether picking a different model.

Bias Variance Tradeoff

We’ve talked about bias and variance in the previous section. Now it is time to clarify what we actually mean by these terms.

Understanding Bias and Variance

In a nutshell, bias measures if there is any systematic deviation from the correct value in a specific direction. If we could repeat the same process of constructing a model several times over, and the results predicted by our model always deviate in a certain direction, we would call the result biased.

Variance measures how much the results vary between model predictions. If you repeat the modeling process several times over and the results are scattered all across the board, the model exhibits high variance.

In their book “Noise”, Daniel Kahneman and his co-authors provide an intuitive example that helps understand the concepts of bias and variance. Imagine you have four teams at the shooting range.

[Figure: bias and variance illustrated by four teams at a shooting range]

Team B is biased because the shots of its team members all deviate in a certain direction from the center. Team B also exhibits low variance because the shots of all the team members are relatively concentrated in one location. Team C has the opposite problem. The shots are scattered across the target with no discernible bias in a certain direction. Team D is both biased and has high variance. Team A would be the equivalent of a good model. The shots are in the center with little bias in one direction and little variance between the team members.

Generally speaking, linear models such as linear regression exhibit high bias and low variance. Nonlinear algorithms such as decision trees are more prone to overfitting the training data and thus exhibit high variance and low bias.

A linear model used with non-linear data would exhibit a bias to predict data points along a straight line instead of accommodating the curves. But it is not as susceptible to random fluctuations in the data. A nonlinear algorithm that is trained on noisy data with lots of deviations would be more capable of avoiding bias but more prone to incorporating the noise into its predictions. As a result, a small deviation in the test data might lead to very different predictions.

To get our model to learn the patterns in data, we need to reduce the training error while at the same time reducing the gap between the training and the testing error. In other words, we want to reduce both bias and variance. To a certain extent, we can reduce both by picking an appropriate model, collecting enough training data, selecting appropriate training features and hyperparameter values. At some point, we have to trade-off between minimizing bias and minimizing variance. How you balance this trade-off is up to you.

[Figure: the bias-variance trade-off]

The Bias Variance Decomposition

Mathematically, the total error can be decomposed into the bias, the variance, and an irreducible error term; the full formula is spelled out at the end of this section.

Remember that Bayes’ error is an error that cannot be eliminated.

Our machine learning model represents an estimating function $\hat{f}(X)$ for the true data-generating function $f(X)$, where $X$ represents the predictors and $y$ the output values.

Now the mean squared error of our model is the expected value of the squared difference between the output produced by the estimating function $\hat{f}(X)$ and the true output $Y$.

The bias is a systematic deviation from the true value. We can measure it as the squared difference between the expected value produced by the estimating function (the model) and the values produced by the true data-generating function.

Of course, we don’t know the true data-generating function, but we do know the observed outputs $Y$, which correspond to the values generated by $f(X)$ plus an error term.

The variance of the model is the expected squared difference between the model’s output and its own expected value.

Now that we have the bias and the variance, we can add them up along with the irreducible error to get the total error.
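
Putting the pieces together in symbols (with $\hat{f}$ the trained model, $f$ the true data-generating function, and $\sigma^2$ the variance of the irreducible noise), the standard decomposition of the expected squared error at a point $x$ reads:

\[ \mathbb{E}\big[(Y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible (Bayes) error}} \]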

A machine learning model represents an approximation to the hypothesized function that generated the data. The chosen model is a hypothesis since we hypothesize that this model represents the true data generating function.

We choose the hypothesis from a hypothesis space that may be subject to certain constraints. For example, we can constrain the hypothesis space to the set of linear models.

When choosing a model, we aim to reduce the bias and the variance to prevent our model from either overfitting or underfitting the data. In the real world, we cannot completely eliminate bias and variance, and we have to trade-off between them. The total error produced by a model can be decomposed into the bias, the variance, and irreducible (Bayes) error.


Hypothesis in Machine Learning: Comprehensive Overview (2021)


Introduction

Supervised machine learning (ML) is often described as the problem of approximating a target function that maps inputs to outputs. This framing can be restated as searching through and evaluating candidate hypotheses from a hypothesis space.

Discussions of hypotheses in machine learning can be confusing for a novice, particularly because “hypothesis” has a distinct, but related, meaning in statistics and, more broadly, in science.

Hypothesis Space (H)

The hypothesis space used by an ML system is the set of all hypotheses that it might return. It is ordinarily characterized by a hypothesis language, possibly combined with a language bias.

Many ML algorithms rely on some kind of search strategy: given a set of observations and a space of all potential hypotheses, they search this space for the hypotheses that best fit the data, or that are optimal with respect to some other quality criterion.

ML can be described as using the available data to discover a function that most reliably maps inputs to outputs, referred to as function approximation: we approximate an unknown target function that maps inputs to outputs as reliably as possible over all expected observations from the problem domain. A candidate model that approximates this target mapping from inputs to outputs is what we call a hypothesis in machine learning.

The hypothesis space is the set of all potential hypotheses that you are searching over, regardless of their structure. For convenience, the hypothesis class is usually constrained to a single kind of function or model at a time, since learning methods typically work on one type at a time. This doesn’t have to be the case, however:

  • Hypothesis classes don’t need to consist of just one kind of function. If you are searching over exponential, quadratic, and general linear functions, then your combined hypothesis class contains all of them.
  • Hypothesis classes also don’t need to consist of only simple functions. If you manage to search over all piecewise-tanh2 functions, then those functions are what your hypothesis class includes.

The big trade-off is that the larger your hypothesis class in machine learning, the better the best hypothesis in it can model the underlying true function, but the harder it becomes to find that best hypothesis. This is related to the bias-variance trade-off.

Hypothesis (h)

A hypothesis function in machine learning is the function that best describes the target. The hypothesis that an algorithm comes up with depends on the data, and on the bias and restrictions that we have imposed on the data.

The hypothesis formula in machine learning is the same linear form seen earlier:

$y = mx + b$

  • y is the output (range)
  • m is the slope: the change in y divided by the change in x
  • x is the input (domain)
  • b is the intercept

The purpose of restricting the hypothesis space in machine learning is so that the chosen hypothesis can fit the data in general, not just the observed inputs. The learner checks candidate hypotheses against the observations, examines them accordingly, and in this way performs the useful job of mapping all inputs to outputs. Consequently, the candidate target functions are deliberately examined and restricted based on the outcomes, whether or not they are free of bias.

Regarding the hypothesis space and inductive bias in machine learning: the hypothesis space is the collection of valid hypotheses (for example, all the admissible functions), while the inductive bias (also called learning bias) of a learning algorithm is the set of assumptions the learner uses to predict outputs for inputs it has not encountered. Regression and classification are kinds of learning that deal with continuous-valued and discrete-valued outputs respectively. These kinds of problems are called inductive learning problems, since we identify a function by inducing it from data.

Maximum a Posteriori (MAP) estimation provides a Bayesian probability framework for fitting model parameters to training data; a related alternative is the more common Maximum Likelihood Estimation (MLE) framework. MAP learning selects the single most probable hypothesis given the data. The prior over hypotheses is still used, and the technique is often more tractable than full Bayesian learning.

Bayesian techniques can be used to determine the most probable hypothesis given the data: the MAP hypothesis. This is the optimal hypothesis in the sense that no other hypothesis is more probable.

Hypothesis in machine learning: the candidate model that approximates a target function for mapping inputs to outputs.

Hypothesis in statistics: a probabilistic explanation about the presence of a relationship between observations.

Hypothesis in science: a provisional explanation that fits the evidence and can be falsified or confirmed. We can see that a hypothesis in machine learning draws upon the broader meaning of a hypothesis in science.



Machine Learning Theory - Part 2: Generalization Bounds

Last time we concluded by noticing that minimizing the empirical risk (or the training error) is not in itself a solution to the learning problem, it could only be considered a solution if we can guarantee that the difference between the training error and the generalization error (which is also called the generalization gap) is small enough. We formalized such a requirement using the probability:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right]\]

That is, if this probability is small, we can guarantee that the difference between the errors is not much, and hence the learning problem can be solved.

In this part we’ll start investigating that probability at depth and see if it indeed can be small, but before starting you should note that I skipped a lot of the mathematical proofs here. You’ll often see phrases like “It can be proved that …”, “One can prove …”, “It can be shown that …”, … etc without giving the actual proof. This is to make the post easier to read and to focus all the effort on the conceptual understanding of the subject. In case you wish to get your hands dirty with proofs, you can find all of them in the additional readings, or on the Internet of course!

Independently, and Identically Distributed

The world can be a very messy place! This is a problem that faces any theoretical analysis of a real world phenomenon; because usually we can’t really capture all the messiness in mathematical terms, and even if we’re able to; we usually don’t have the tools to get any results from such a messy mathematical model.

So in order for theoretical analysis to move forward, some assumptions must be made to simplify the situation at hand, we can then use the theoretical results from that simplification to infer about reality.

Assumptions are common practice in theoretical work. Assumptions are not bad in themselves, only bad assumptions are bad! As long as our assumptions are reasonable and not crazy, they’ll hold significant truth about reality.

A reasonable assumption we can make about the problem we have at hand is that our training dataset samples are independently, and identically distributed (or i.i.d. for short), that means that all the samples are drawn from the same probability distribution and that each sample is independent from the others.

This assumption is essential for us. We need it to start using the tools form probability theory to investigate our generalization probability, and it’s a very reasonable assumption because:

  • It’s more likely for a dataset used for inferring about an underlying probability distribution to be sampled entirely from that same distribution. If this is not the case, then the statistics we get from the dataset will be noisy and won’t correctly reflect the target underlying distribution.
  • It’s more likely that each sample in the dataset is chosen without considering any other sample that has been chosen before or will be chosen after. If that’s not the case and the samples are dependent, then the dataset will suffer from a bias towards a specific direction in the distribution, and hence will fail to reflect the underlying distribution correctly.

So we can build upon that assumption with no fear.

The Law of Large Numbers

Most of us, since we were kids, know that if we tossed a fair coin a large number of times, roughly half of the time we’re gonna get heads. This is an instance of a widely known fact about probability: if we retry an experiment a sufficiently large number of times, the average outcome of these experiments (or, more formally, the sample mean) will be very close to the true mean of the underlying distribution. This fact is formally captured in what we call The Law of Large Numbers:

If $x_1, x_2, …, x_m$ are $m$ i.i.d. samples of a random variable $X$ distributed by $P$. then for a small positive non-zero value $\epsilon$: \[\lim_{m \rightarrow \infty} \mathbb{P}\left[\left|\mathop{\mathbb{E}}_{X \sim P}[X] - \frac{1}{m}\sum_{i=1}^{m}x_i \right| > \epsilon\right] = 0\]

This version of the law is called the weak law of large numbers . It’s weak because it guarantees that as the sample size goes larger, the sample and true means will likely be very close to each other by a non-zero distance no greater than epsilon. On the other hand, the strong version says that with very large sample size, the sample mean is almost surely equal to the true mean.
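
A tiny simulation (assuming NumPy; the coin-flip setup and sample sizes are arbitrary choices) makes this concrete:

```python
# Sketch: the sample mean of fair coin flips approaching the true mean (0.5).
# Assumes NumPy; the sample sizes are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(7)
true_mean = 0.5

for m in [10, 100, 10_000, 1_000_000]:
    flips = rng.integers(0, 2, size=m)   # i.i.d. samples of a Bernoulli(0.5) variable
    sample_mean = flips.mean()
    print(f"m={m:>9,}  sample mean={sample_mean:.4f}  "
          f"|deviation|={abs(sample_mean - true_mean):.4f}")
```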

The formulation of the weak law lends itself naturally to use with our generalization probability. By recalling that the empirical risk is actually the sample mean of the errors and the risk is the true mean, for a single hypothesis $h$ we can say that:

\[\lim_{m \rightarrow \infty} \mathbb{P}\left[\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] = 0\]

Well, that’s progress. A pretty small one, but still progress! Can we do any better?

Hoeffding’s inequality

The law of large numbers is like someone pointing the directions to you when you’re lost, they tell you that by following that road you’ll eventually reach your destination, but they provide no information about how fast you’re gonna reach your destination, what is the most convenient vehicle, should you walk or take a cab, and so on.

To our destination of ensuring that the training and generalization errors do not differ much, we need to know more about what the road down the law of large numbers looks like. This information is provided by what we call the concentration inequalities. This is a set of inequalities that quantifies how much random variables (or functions of them) deviate from their expected values (or, also, functions of them). One of those inequalities is Hoeffding’s inequality:

If $x_1, x_2, …, x_m$ are $m$ i.i.d. samples of a random variable $X$ distributed by $P$, and $a \leq x_i \leq b$ for every $i$, then for a small positive non-zero value $\epsilon$: \[\mathbb{P}\left[\left|\mathop{\mathbb{E}}_{X \sim P}[X] - \frac{1}{m}\sum_{i=1}^{m}x_i\right| > \epsilon\right] \leq 2\exp\left(\frac{-2m\epsilon^2}{(b - a)^2}\right)\]

You probably see why we specifically chose Hoeffding’s inequality from among the others. We can naturally apply this inequality to our generalization probability, assuming that our errors are bounded between 0 and 1 (which is a reasonable assumption, as we can get that using a 0/1 loss function or by squashing any other loss between 0 and 1) and get for a single hypothesis $h$:

\[\mathbb{P}\left[\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq 2\exp\left(-2m\epsilon^2\right)\]

This means that the probability of the difference between the training and the generalization errors exceeding $\epsilon$ exponentially decays as the dataset size goes larger. This should align well with our practical experience that the bigger the dataset gets, the better the results become.

If you noticed, all our analysis up till now was focusing on a single hypothesis $h$. But the learning problem doesn’t know that single hypothesis beforehand, it needs to pick one out of an entire hypothesis space $\mathcal{H}$, so we need a generalization bound that reflects the challenge of choosing the right hypothesis.

Generalization Bound: 1st Attempt

In order for the entire hypothesis space to have a generalization gap bigger than $\epsilon$, at least one of its hypotheses: $h_1$ or $h_2$ or $h_3$ or … etc. should have one. This can be expressed formally by stating that:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] = \mathbb{P}\left[\bigcup_{h \in \mathcal{H}}\left\{\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right\}\right]\]

Where $\bigcup$ denotes the union of the events, which also corresponds to the logical OR operator. Using the union bound inequality, we get:

\[\mathbb{P}\left[\bigcup_{h \in \mathcal{H}}\left\{\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right\}\right] \leq \sum_{h \in \mathcal{H}}\mathbb{P}\left[\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right]\]

We exactly know the bound on the probability under the summation from our analysis using Hoeffding’s inequality, so we end up with:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq 2|\mathcal{H}|\exp\left(-2m\epsilon^2\right)\]

Where $|\mathcal{H}|$ is the size of the hypothesis space. By denoting the right hand side of the above inequality by $\delta$, we can say that with a confidence $1 - \delta$:

\[\left|R(h) - R_\text{emp}(h)\right| \leq \epsilon \quad \text{for all } h \in \mathcal{H}\]

And with some basic algebra, we can express $\epsilon$ in terms of $\delta$ and get:

\[R(h) \leq R_\text{emp}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2m}}\]

This is our first generalization bound: it states that the generalization error is bounded by the training error plus a function of the hypothesis space size and the dataset size. We can also see that the bigger the hypothesis space gets, the bigger the generalization error becomes. This explains why the memorization hypothesis from last time, which theoretically has $|\mathcal{H}| = \infty$, fails miserably as a solution to the learning problem despite having $R_\text{emp} = 0$; because for the memorization hypothesis $h_\text{mem}$:

\[R(h_\text{mem}) \leq R_\text{emp}(h_\text{mem}) + \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2m}} = 0 + \sqrt{\frac{\ln \infty + \ln\frac{2}{\delta}}{2m}} = \infty\]

But wait a second! For a linear hypothesis of the form $h(x) = wx + b$, we also have $|\mathcal{H}| = \infty$, as there are infinitely many lines that can be drawn. So the generalization error of the linear hypothesis space should be unbounded, just as for the memorization hypothesis! If that’s true, why do perceptrons, logistic regression, support vector machines and essentially any ML model that uses a linear hypothesis work?

Our theoretical result was able to account for some phenomena (the memorization hypothesis, and any finite hypothesis space) but not for others (the linear hypothesis, or other infinite hypothesis spaces that empirically work). This means that there’s still something missing from our theoretical model, and it’s time for us to revise our steps. A good starting point is from the source of the problem itself, which is the infinity in $|\mathcal{H}|$.

Notice that the term $|\mathcal{H}|$ resulted from our use of the union bound. The basic idea of the union bound is that it bounds the probability by the worst case possible, which is when all the events under union are mutually independent. This bound gets tighter as the events under consideration get less dependent. In our case, for the bound to be tight and reasonable, we need the following to be true:

For every two hypothesis $h_1, h_2 \in \mathcal{H}$ the two events $|R(h_1) - R_\text{emp}(h_1)| > \epsilon$ and $|R(h_2) - R_\text{emp}(h_2)| > \epsilon$ are likely to be independent. This means that the event that $h_1$ has a generalization gap bigger than $\epsilon$ should be independent of the event that also $h_2$ has a generalization gap bigger than $\epsilon$, no matter how much $h_1$ and $h_2$ are close or related; the events should be coincidental.

But is that true?

Examining the Independence Assumption

The first question we need to ask here is why do we need to consider every possible hypothesis in $\mathcal{H}$? This may seem like a trivial question; as the answer is simply that because the learning algorithm can search the entire hypothesis space looking for its optimal solution. While this answer is correct, we need a more formal answer in light of the generalization inequality we’re studying.

The formulation of the generalization inequality reveals a main reason why we need to consider all the hypotheses in $\mathcal{H}$. It has to do with the existence of $\sup_{h \in \mathcal{H}}$. The supremum in the inequality guarantees that there’s very little chance that the biggest generalization gap possible is greater than $\epsilon$; this is a strong claim and if we omit a single hypothesis out of $\mathcal{H}$, we might miss that “biggest generalization gap possible” and lose that strength, and that’s something we cannot afford to lose. We need to be able to make that claim to ensure that the learning algorithm would never land on a hypothesis with a bigger generalization gap than $\epsilon$.

[Figure: a "rainbow" of linear hypotheses that all classify the sample points identically]

Looking at the above plot of a binary classification problem, it’s clear that this rainbow of hypotheses produces the same classification on the data points, so all of them have the same empirical risk. So one might think, as they all have the same $R_\text{emp}$, why not choose one and omit the others?!

This would be a very good solution if we’re only interested in the empirical risk, but our inequality takes into its consideration the out-of-sample risk as well, which is expressed as:

\[R(h) = \mathop{\mathbb{E}}_{(x, y) \sim P}\left[L\big(h(x), y\big)\right] = \int_{\mathcal{X} \times \mathcal{Y}} L\big(h(x), y\big)\,\mathrm{d}P(x, y)\]

where $L$ is the loss function.

This is an integration over every possible combination of the whole input and output spaces $\mathcal{X, Y}$. So in order to ensure our supremum claim, we need the hypothesis to cover the whole of $\mathcal{X \times Y}$, hence we need all the possible hypotheses in $\mathcal{H}$.

Now that we’ve established that we do need to consider every single hypothesis in $\mathcal{H}$, we can ask ourselves: are the events of each hypothesis having a big generalization gap are likely to be independent?

Well, Not even close! Take for example the rainbow of hypotheses in the above plot, it’s very clear that if the red hypothesis has a generalization gap greater than $\epsilon$, then, with 100% certainty, every hypothesis with the same slope in the region above it will also have that. The same argument can be made for many different regions in the $\mathcal{X \times Y}$ space with different degrees of certainty as in the following figure.

[Figure: regions of hypotheses whose generalization-gap events are correlated to different degrees]

But this is not helpful for our mathematical analysis, as the regions seem to be dependent on the distribution of the sample points and there is no way we can precisely capture these dependencies mathematically, and we cannot make assumptions about them without the risk of compromising the supremum claim.

So the union bound and the independence assumption seem like the best approximation we can make, but it highly overestimates the probability and makes the bound very loose, and very pessimistic!

However, what if somehow we can get a very good estimate of the risk $R(h)$ without needing to go over the whole of the $\mathcal{X \times Y}$ space, would there be any hope to get a better bound?

The Symmetrization Lemma

Let’s think for a moment about something we do usually in machine learning practice. In order to measure the accuracy of our model, we hold out a part of the training set to evaluate the model on after training, and we consider the model’s accuracy on this left out portion as an estimate for the generalization error. This works because we assume that this test set is drawn i.i.d. from the same distribution of the training set (this is why we usually shuffle the whole dataset beforehand to break any correlation between the samples).

It turns out that we can do a similar thing mathematically, but instead of taking out a portion of our dataset $S$, we imagine that we have another dataset $S’$ with also size $m$, we call this the ghost dataset. Note that this has no practical implications, we don’t need to have another dataset at training, it’s just a mathematical trick we’re gonna use to get rid of the restrictions of $R(h)$ in the inequality.

We’re not gonna go over the proof here, but using that ghost dataset one can actually prove that:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq 2\,\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R_\text{emp}(h) - R_\text{emp}'(h)\right| > \frac{\epsilon}{2}\right]\]

where $R_\text{emp}’(h)$ is the empirical risk of hypothesis $h$ on the ghost dataset. This means that the probability of the largest generalization gap being bigger than $\epsilon$ is at most twice the probability that the empirical risk difference between $S, S’$ is larger than $\frac{\epsilon}{2}$. Now that the right hand side is expressed only in terms of empirical risks, we can bound it without needing to consider the whole of $\mathcal{X \times Y}$, and hence we can bound the term with the risk $R(h)$ without considering the whole of input and output spaces!

This, which is called the symmetrization lemma , was one of the two key parts in the work of Vapnik-Chervonenkis (1971).

The Growth Function

Now that we are bounding only the empirical risk, if we have many hypotheses that have the same empirical risk (a.k.a. producing the same labels/values on the data points), we can safely choose one of them as a representative of the whole group, we’ll call that an effective hypothesis, and discard all the others.

By only choosing the distinct effective hypotheses on the dataset $S$, we restrict the hypothesis space $\mathcal{H}$ to a smaller subspace that depends on the dataset $\mathcal{H}_{|S}$.

We can assume the independence of the hypotheses in $\mathcal{H}_{|S}$ like we did before with $\mathcal{H}$ (but it’s more plausible now), and use the union bound to get that:

Notice that the hypothesis space is restricted by $S \cup S’$ because we are using the empirical risk on both the original dataset $S$ and the ghost $S’$. The question now is what is the maximum size of a restricted hypothesis space? The answer is very simple; we consider a hypothesis to be a new effective one if it produces new labels/values on the dataset samples, so the maximum number of distinct hypotheses (a.k.a. the maximum size of the restricted space) is the maximum number of distinct labels/values the dataset points can take. A cool feature about that maximum size is that it’s a combinatorial measure, so we don’t need to worry about how the samples are distributed!

For simplicity, we’ll focus now on the case of binary classification, in which $\mathcal{Y}=\{-1, +1\}$. Later we’ll show that the same concepts can be extended to both multiclass classification and regression. In that case, for a dataset with $m$ samples, each of which can take one of two labels: either -1 or +1, the maximum number of distinct labellings is $2^m$.

We’ll define the maximum number of distinct labellings/values on a dataset $S$ of size $m$ by a hypothesis space $\mathcal{H}$ as the growth function of $\mathcal{H}$ given $m$, and we’ll denote that by $\Delta_\mathcal{H}(m)$. It’s called the growth function because it’s value for a single hypothesis space $\mathcal{H}$ (aka the size of the restricted subspace $\mathcal{H_{|S}}$) grows as the size of the dataset grows. Now we can say that:

Notice that we used $2m$ because we have two datasets $S,S’$ each with size $m$.

For the binary classification case, we can say that:

\[\Delta_\mathcal{H}(m) \leq 2^m\]

But $2^m$ is exponential in $m$ and would grow too fast for large datasets, which makes the odds in our inequality go bad too fast! Is that the best bound we can get on that growth function?

The VC-Dimension

The $2^m$ bound is based on the fact that the hypothesis space $\mathcal{H}$ can produce all the possible labellings on the $m$ data points. If a hypothesis space can indeed produce all the possible labels on a set of data points, we say that the hypothesis space shatters that set.

But can any hypothesis space shatter any dataset of any size? Let’s investigate that with the binary classification case and the $\mathcal{H}$ of linear classifiers $\mathrm{sign}(wx + b)$. The following animation shows how many ways a linear classifier in 2D can label 3 points (on the left) and 4 points (on the right).

In the animation, the whole space of possible effective hypotheses is swept. For the three points, the hypotheses shattered the set and produced all the possible $2^3 = 8$ labellings. However, for the four points, the hypotheses couldn’t produce more than 14 labellings and never reached $2^4 = 16$, so they failed to shatter this set of points. Actually, no linear classifier in 2D can shatter any set of 4 points, not just that set, because there will always be two labellings that cannot be produced by a linear classifier, as depicted in the following figure.

[Figure: the two labellings of four points that no 2D linear classifier can produce, with the corresponding decision-boundary plot on the right.]

From the decision boundary plot (on the right), it’s clear why no linear classifier can produce such labellings: no linear classifier can divide the space in this way. So it’s possible for a hypothesis space $\mathcal{H}$ to be unable to shatter sets of all sizes. This fact can be used to get a better bound on the growth function, and this is done using Sauer’s lemma :

If a hypothesis space $\mathcal{H}$ cannot shatter any dataset with size more than $k$, then: \[\Delta_{\mathcal{H}}(m) \leq \sum_{i=0}^{k}\binom{m}{i}\]

This was the other key part of the Vapnik-Chervonenkis work (1971), but it’s named after another mathematician, Norbert Sauer, because it was independently proved by him around the same time (1972). However, Vapnik and Chervonenkis weren’t completely left out of this contribution, as that $k$, which is the maximum number of points that can be shattered by $\mathcal{H}$, is now called the Vapnik-Chervonenkis dimension, or the VC-dimension $d_{\mathrm{vc}}$, of $\mathcal{H}$.

For the case of the linear classifier in 2D, $d_\mathrm{vc} = 3$. In general, it can be proved that hyperplane classifiers (the higher-dimensional generalization of line classifiers) in $\mathbb{R}^n$ have $d_\mathrm{vc} = n + 1$.
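We can check the shattering argument numerically. The following sketch (my own illustration, assuming numpy and scipy are available) tests every labelling of a point set for linear separability via a small linear program, reproducing the counts above: 8 of 8 labellings for three points, 14 of 16 for four.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Check whether labels y in {-1,+1} on points X can be produced by
    sign(w.x + b), via an LP feasibility problem: y_i (w.x_i + b) >= 1."""
    n, d = X.shape
    # Variables: [w_1, ..., w_d, b]; constraint -y_i*(w.x_i + b) <= -1.
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0  # status 0 means a feasible (w, b) was found

def count_labellings(X):
    """Count how many of the 2^m labellings of X a 2D linear classifier realizes."""
    m = len(X)
    return sum(linearly_separable(X, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=m))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(count_labellings(three))  # 8 = 2^3: these three points are shattered
print(count_labellings(four))   # 14 < 16: this set of four points is not shattered
```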

The bound on the growth function provided by Sauer’s lemma is indeed much better than the exponential one we already have; it’s actually polynomial! Using algebraic manipulation, we can prove that:

\[\Delta_{\mathcal{H}}(m) \leq \left(\frac{em}{d_\mathrm{vc}}\right)^{d_\mathrm{vc}} = O(m^{d_\mathrm{vc}})\]

where $O$ refers to the Big-O notation for a function's asymptotic (near the limits) behavior, and $e$ is the mathematical constant.

Thus we can use the VC-dimension as a proxy for growth function and, hence, for the size of the restricted space $\mathcal{H_{|S}}$. In that case, $d_\mathrm{vc}$ would be a measure of the complexity or richness of the hypothesis space.
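To get a feel for how much Sauer’s lemma buys us, here's a tiny illustrative sketch comparing the bound $\sum_{i=0}^{k}\binom{m}{i}$ with $2^m$ for a hypothetical $d_\mathrm{vc} = 3$:

```python
import math

def sauer_bound(m, k):
    """Sauer's lemma: growth(m) <= sum_{i=0}^{k} C(m, i) when d_vc = k."""
    return sum(math.comb(m, i) for i in range(k + 1))

for m in (10, 20, 50):
    # Polynomial in m vs. the exponential 2^m
    print(m, sauer_bound(m, k=3), 2 ** m)
```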

The VC Generalization Bound

With a little change in the constants, it can be shown that Hoeffding’s inequality is applicable to the probability $\mathbb{P}\left[|R_\mathrm{emp}(h) - R_\mathrm{emp}’(h)| > \frac{\epsilon}{2}\right]$. With that, and by combining inequalities (1) and (2), the Vapnik-Chervonenkis theory follows:

This can be re-expressed as a bound on the generalization error, just as we did earlier with the previous bound, to get the VC generalization bound :

or, by using the bound on growth function in terms of $d_\mathrm{vc}$ as:


Professor Vapnik standing in front of a whiteboard that has a form of the VC bound and the phrase “All your Bayes are belong to us”, which is a play on the broken-English phrase found in the classic video game Zero Wing, in a claim that the VC framework of inference is superior to that of Bayesian inference. [Courtesy of Yann LeCun.]

This is a significant result! It’s a clear and concise mathematical statement that the learning problem is solvable, and that for infinite hypothesis spaces there is a finite bound on their generalization error! Furthermore, this bound can be described in terms of a quantity ($d_\mathrm{vc}$) that solely depends on the hypothesis space and not on the distribution of the data points!
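To get a feel for the numbers, here's a small sketch that uses one common textbook form of the VC bound, the one in Learning from Data (Abu-Mostafa et al., listed in the references below): $R(h) \leq R_\mathrm{emp}(h) + \sqrt{\frac{8}{N}\ln\frac{4\,\Delta_\mathcal{H}(2N)}{\delta}}$ together with $\Delta_\mathcal{H}(m) \leq m^{d_\mathrm{vc}} + 1$. Its constants may differ from other presentations of the bound:

```python
import math

def vc_bound(d_vc, N, delta=0.05, emp_risk=0.0):
    """One common textbook form of the VC generalization bound:
        R(h) <= R_emp(h) + sqrt( (8/N) * ln( 4 * growth(2N) / delta ) ),
    using the polynomial bound growth(m) <= m**d_vc + 1."""
    growth = (2 * N) ** d_vc + 1
    return emp_risk + math.sqrt(8.0 / N * math.log(4.0 * growth / delta))

# A linear classifier in the plane (d_vc = 3) trained on N samples:
for N in (100, 1_000, 10_000, 100_000):
    print(N, round(vc_bound(d_vc=3, N=N), 3))  # the bound only becomes tight for large N
```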

Now, in light of these results, is there any hope for the memorization hypothesis?

It turns out that there’s still no hope! The memorization hypothesis can shatter any dataset no matter how big it is, which means that its $d_\mathrm{vc}$ is infinite, yielding an infinite bound on $R(h_\mathrm{mem})$ as before. However, the success of linear hypotheses can now be explained by the fact that they have a finite $d_\mathrm{vc} = n + 1$ in $\mathbb{R}^n$. The theory is now consistent with the empirical observations.

Distribution-Based Bounds

The fact that $d_\mathrm{vc}$ is distribution-free comes with a price: by not exploiting the structure and the distribution of the data samples, the bound tends to get loose. Consider for example the case of linear binary classifiers in a very high n-dimensional feature space; using the distribution-free $d_\mathrm{vc} = n + 1$ means that the bound on the generalization error would be poor unless the size of the dataset $N$ is also very large to balance the effect of the large $d_\mathrm{vc}$. This is the good old curse of dimensionality we all know and endure.

However, a careful investigation into the distribution of the data samples can bring more hope to the situation. For example, for data points that are linearly separable, contained in a ball of radius $R$, with a margin $\rho$ between the closest points in the two classes, one can prove that for a hyperplane classifier:

\[d_\mathrm{vc} \leq \min\left(\frac{R^2}{\rho^2},\, n\right) + 1\]

It follows that the larger the margin, the lower the $d_\mathrm{vc}$ of the hypothesis. This is the theoretical motivation behind Support Vector Machines (SVMs), which attempt to classify data using the maximum-margin hyperplane. This result was also proved by Vapnik and Chervonenkis.

One Inequality to Rule Them All

Up until this point, all our analysis was for the case of binary classification. And it’s indeed true that the form of the VC bound we arrived at here only works for the binary classification case. However, the conceptual framework of VC (that is: shattering, growth function and dimension) generalizes very well to both multi-class classification and regression.

Due to the work of Natarajan (1989), the Natarajan dimension is defined as a generalization of the VC-dimension for multiclass classification, and a bound similar to the VC bound is derived in terms of it. Also, through the work of Pollard (1984), the pseudo-dimension generalizes the VC-dimension to the regression case, with a bound on the generalization error also similar to VC’s.

There is also Rademacher’s complexity, which is a relatively new tool (devised in the 2000s) that measures the richness of a hypothesis space by measuring how well it can fit random noise. The cool thing about Rademacher’s complexity is that it’s flexible enough to be adapted to any learning problem, and it yields generalization bounds very similar to the other methods mentioned.

However, no matter what the exact form of the bound produced by any of these methods is, it always takes the form:

\[R(h) \leq R_\mathrm{emp}(h) + C(|\mathcal{H}|, N, \delta)\]

where $C$ is a function of the hypothesis space complexity (or size, or richness), the size of the dataset $N$, and the confidence $1 - \delta$ about the bound. This inequality basically says that the generalization error can be decomposed into two parts: the empirical training error, and the complexity of the learning model.

This form of the inequality holds for any learning problem no matter the exact form of the bound, and this is the one we’re gonna use throughout the rest of the series to guide us through the process of machine learning.

References and Additional Readings

  • Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
  • Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data: A short course. AMLBook, 2012.

Mostafa Samir


Hypothesis Testing – A Deep Dive into Hypothesis Testing, The Backbone of Statistical Inference

  • September 21, 2023

Explore the intricacies of hypothesis testing, a cornerstone of statistical analysis. Dive into methods, interpretations, and applications for making data-driven decisions.


In this Blog post we will learn:

  • What is Hypothesis Testing?
  • Steps in Hypothesis Testing
    • 2.1. Set up Hypotheses: Null and Alternative
    • 2.2. Choose a Significance Level (α)
    • 2.3. Calculate a test statistic and P-Value
    • 2.4. Make a Decision
  • Example : Testing a new drug.
  • Example in python

1. What is Hypothesis Testing?

In simple terms, hypothesis testing is a method used to make decisions or inferences about population parameters based on sample data. Imagine being handed a dice and asked if it’s biased. By rolling it a few times and analyzing the outcomes, you’d be engaging in the essence of hypothesis testing.

Think of hypothesis testing as the scientific method of the statistics world. Suppose you hear claims like “This new drug works wonders!” or “Our new website design boosts sales.” How do you know if these statements hold water? Enter hypothesis testing.

2. Steps in Hypothesis Testing

  • Set up Hypotheses : Begin with a null hypothesis (H0) and an alternative hypothesis (Ha).
  • Choose a Significance Level (α) : Typically 0.05, this is the probability of rejecting the null hypothesis when it’s actually true. Think of it as the chance of accusing an innocent person.
  • Calculate Test statistic and P-Value : Gather evidence (data) and calculate a test statistic.
  • p-value : This is the probability of observing the data, given that the null hypothesis is true. A small p-value (typically ≤ 0.05) suggests the data is inconsistent with the null hypothesis.
  • Decision Rule : If the p-value is less than or equal to α, you reject the null hypothesis in favor of the alternative.

2.1. Set up Hypotheses: Null and Alternative

Before diving into testing, we must formulate hypotheses. The null hypothesis (H0) represents the default assumption, while the alternative hypothesis (H1) challenges it.

For instance, in drug testing, H0 : “The new drug is no better than the existing one,” H1 : “The new drug is superior .”

2.2. Choose a Significance Level (α)

You collect and analyze data to test the H0 and H1 hypotheses. Based on your analysis, you decide whether to reject the null hypothesis in favor of the alternative, or fail to reject the null hypothesis.

The significance level, often denoted by $α$, represents the probability of rejecting the null hypothesis when it is actually true.

In other words, it’s the risk you’re willing to take of making a Type I error (false positive).

Type I Error (False Positive) :

  • Symbolized by the Greek letter alpha (α).
  • Occurs when you incorrectly reject a true null hypothesis . In other words, you conclude that there is an effect or difference when, in reality, there isn’t.
  • The probability of making a Type I error is denoted by the significance level of a test. Commonly, tests are conducted at the 0.05 significance level , which means there’s a 5% chance of making a Type I error .
  • Commonly used significance levels are 0.01, 0.05, and 0.10, but the choice depends on the context of the study and the level of risk one is willing to accept.

Example : If a drug is not effective (truth), but a clinical trial incorrectly concludes that it is effective (based on the sample data), then a Type I error has occurred.

Type II Error (False Negative) :

  • Symbolized by the Greek letter beta (β).
  • Occurs when you accept a false null hypothesis . This means you conclude there is no effect or difference when, in reality, there is.
  • The probability of making a Type II error is denoted by β. The power of a test (1 – β) represents the probability of correctly rejecting a false null hypothesis.

Example : If a drug is effective (truth), but a clinical trial incorrectly concludes that it is not effective (based on the sample data), then a Type II error has occurred.

Balancing the Errors :


In practice, there’s a trade-off between Type I and Type II errors. Reducing the risk of one typically increases the risk of the other. For example, if you want to decrease the probability of a Type I error (by setting a lower significance level), you might increase the probability of a Type II error unless you compensate by collecting more data or making other adjustments.

It’s essential to understand the consequences of both types of errors in any given context. In some situations, a Type I error might be more severe, while in others, a Type II error might be of greater concern. This understanding guides researchers in designing their experiments and choosing appropriate significance levels.
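A quick simulation (illustrative only, using scipy's independent two-sample t-test) makes these two error rates concrete: under a true null hypothesis the rejection rate hovers around α, and under a real effect the miss rate is β:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 5_000

# Type I error rate: both groups come from the SAME distribution (H0 true).
false_pos = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue <= alpha
    for _ in range(trials))

# Type II error rate: the group means really differ by 0.5 (H0 false).
false_neg = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.5, 1, n)).pvalue > alpha
    for _ in range(trials))

print("estimated Type I rate :", false_pos / trials)   # close to alpha = 0.05
print("estimated Type II rate:", false_neg / trials)   # beta; 1 - beta is the power
```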

2.3. Calculate a test statistic and P-Value

Test statistic : A test statistic is a single number that helps us understand how far our sample data is from what we’d expect under a null hypothesis (a basic assumption we’re trying to test against). Generally, the larger the test statistic, the more evidence we have against our null hypothesis. It helps us decide whether the differences we observe in our data are due to random chance or if there’s an actual effect.

P-value : The P-value tells us how likely we would get our observed results (or something more extreme) if the null hypothesis were true. It’s a value between 0 and 1.

  • A smaller P-value (typically below 0.05) means that the observation is rare under the null hypothesis, so we might reject the null hypothesis.
  • A larger P-value suggests that what we observed could easily happen by random chance, so we might not reject the null hypothesis.

2.4. Make a Decision

Relationship between $α$ and P-Value

When conducting a hypothesis test:

We first set the significance level $α$, and then calculate the p-value from our sample data and the test statistic.

Finally, we compare the p-value to our chosen $α$:

  • If $p−value≤α$: We reject the null hypothesis in favor of the alternative hypothesis. The result is said to be statistically significant.
  • If $p−value>α$: We fail to reject the null hypothesis. There isn’t enough statistical evidence to support the alternative hypothesis.

3. Example : Testing a new drug.

Imagine we are investigating whether a new drug treats headaches faster than a placebo.

Setting Up the Experiment : You gather 100 people who suffer from headaches. Half of them (50 people) are given the new drug (let’s call this the ‘Drug Group’), and the other half are given a sugar pill that doesn’t contain any medication (the ‘Placebo Group’).

  • Set up Hypotheses : Before starting, you make a prediction:
  • Null Hypothesis (H0): The new drug has no effect. Any difference in healing time between the two groups is just due to random chance.
  • Alternative Hypothesis (H1): The new drug does have an effect. The difference in healing time between the two groups is significant and not just by chance.

Calculate Test statistic and P-Value : After the experiment, you analyze the data. The “test statistic” is a number that helps you understand the difference between the two groups in terms of standard units.

For instance, let’s say:

  • The average healing time in the Drug Group is 2 hours.
  • The average healing time in the Placebo Group is 3 hours.

The test statistic helps you understand how significant this 1-hour difference is. If the groups are large and the spread of healing times in each group is small, then this difference might be significant. But if there’s a huge variation in healing times, the 1-hour difference might not be so special.

Imagine the P-value as answering this question: “If the new drug had NO real effect, what’s the probability that I’d see a difference as extreme (or more extreme) as the one I found, just by random chance?”

For instance:

  • P-value of 0.01 means there’s a 1% chance that the observed difference (or a more extreme difference) would occur if the drug had no effect. That’s pretty rare, so we might consider the drug effective.
  • P-value of 0.5 means there’s a 50% chance you’d see this difference just by chance. That’s pretty high, so we might not be convinced the drug is doing much.
  • If the P-value is less than ($α$) 0.05: the results are “statistically significant,” and they might reject the null hypothesis , believing the new drug has an effect.
  • If the P-value is greater than ($α$) 0.05: the results are not statistically significant, and they don’t reject the null hypothesis , remaining unsure if the drug has a genuine effect.

4. Example in python

For simplicity, let’s say we’re using a t-test (common for comparing means). Let’s dive into Python:
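Here is a minimal sketch of such a t-test; the healing-time numbers are simulated for illustration, not real trial data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated healing times in hours (illustrative numbers only).
drug_group = rng.normal(loc=2.0, scale=0.8, size=50)     # received the new drug
placebo_group = rng.normal(loc=3.0, scale=0.8, size=50)  # received the sugar pill

alpha = 0.05
t_stat, p_value = stats.ttest_ind(drug_group, placebo_group)

print(f"t-statistic = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H0: the difference in healing times is statistically significant.")
else:
    print("Fail to reject H0: not enough evidence that the drug has an effect.")
```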

Making a Decision : If the p-value is below 0.05, we’d say, “The results are statistically significant! The drug seems to have an effect!” If not, we’d say, “Looks like the drug isn’t as miraculous as we thought.”

5. Conclusion

Hypothesis testing is an indispensable tool in data science, allowing us to make data-driven decisions with confidence. By understanding its principles, conducting tests properly, and considering real-world applications, you can harness the power of hypothesis testing to unlock valuable insights from your data.


Hypothesis Spaces for Deep Learning

This paper introduces a hypothesis space for deep learning that employs deep neural networks (DNNs). By treating a DNN as a function of two variables, the physical variable and parameter variable, we consider the primitive set of the DNNs for the parameter variable located in a set of the weight matrices and biases determined by a prescribed depth and widths of the DNNs. We then complete the linear span of the primitive DNN set in a weak* topology to construct a Banach space of functions of the physical variable. We prove that the Banach space so constructed is a reproducing kernel Banach space (RKBS) and construct its reproducing kernel. We investigate two learning models, regularized learning and the minimum norm interpolation problem, in the resulting RKBS, by establishing representer theorems for solutions of the learning models. The representer theorems unfold that solutions of these learning models can be expressed as a linear combination of a finite number of kernel sessions determined by the given data and the reproducing kernel.

Key words : Reproducing kernel Banach space, deep learning, deep neural network, representer theorem for deep learning

1 Introduction

Deep learning has been a huge success in applications. Mathematically, its success is due to the use of deep neural networks (DNNs), neural networks of multiple layers, to describe decision functions. Various mathematical aspects of DNNs as an approximation tool were investigated recently in a number of studies [9, 11, 13, 16, 20, 27, 28, 31]. As pointed out in [8], learning processes do not take place in a vacuum. Classical learning methods took place in a reproducing kernel Hilbert space (RKHS) [1], which leads to representation of learning solutions in terms of a combination of a finite number of kernel sessions [19] of a universal kernel [17]. Reproducing kernel Hilbert spaces as appropriate hypothesis spaces for classical learning methods provide a foundation for mathematical analysis of the learning methods. A natural and imperative question is what are appropriate hypothesis spaces for deep learning. Although hypothesis spaces for learning with shallow neural networks (networks of one hidden layer) were investigated recently in a number of studies (e.g. [2, 6, 18, 21]), appropriate hypothesis spaces for deep learning are still absent. The goal of the present study is to understand this imperative theoretical issue.

The road-map of constructing the hypothesis space for deep learning may be described as follows. We treat a DNN as a function of two variables, one being the physical variable and the other being the parameter variable. We then consider the set of the DNNs as functions of the physical variable, for the parameter variable taking all elements of the set of the weight matrices and biases determined by a prescribed depth and widths of the DNNs. Upon completing the linear span of the DNN set in a weak* topology, we construct a Banach space of functions of the physical variable. We establish that the resulting Banach space is a reproducing kernel Banach space (RKBS), on which point-evaluation functionals are continuous, and construct an asymmetric reproducing kernel for the space, which is a function of the two variables, the physical variable and the parameter variable. We regard the constructed RKBS as the hypothesis space for deep learning. We remark that when deep neural networks reduce to a shallow network (having only one hidden layer), our hypothesis space coincides with the space for shallow learning studied in [2].

Upon introducing the hypothesis space for deep learning, we investigate two learning models, the regularized learning and the minimum norm interpolation problem, in the resulting RKBS. We establish representer theorems for solutions of the learning models by employing the theory of reproducing kernel Banach spaces developed in [25, 26, 29] and representer theorems for solutions of learning in a general RKBS established in [4, 23, 24]. Like the representer theorems for classical learning in RKHSs, the resulting representer theorems for the two deep learning models in the RKBS reveal that although the learning models are of infinite dimension, their solutions lie in finite-dimensional manifolds. More specifically, they can be expressed as a linear combination of a finite number of kernel sessions, the reproducing kernel with its parameter variable evaluated at points determined by the given data. The representer theorems established in this paper are data-dependent. Even when deep neural networks reduce to a shallow network, the corresponding representer theorem is still new to the best of our knowledge. The hypothesis space and the representer theorems for the two deep learning models in it provide rich insights into deep learning and supply deep learning with a sound mathematical foundation for further investigation.

We organize this paper in six sections. We describe in Section 2 an innate deep learning model with DNNs. Aiming at formulating reproducing kernel Banach spaces as hypothesis spaces for deep learning, in Section 3 we elucidate the notion of vector-valued reproducing kernel Banach spaces. Section 4 is entirely devoted to the development of the hypothesis space for deep learning. We specifically show that the completion of the linear span of the primitive DNN set, pertaining to the innate learning model, in a weak* topology is an RKBS, which constitutes the hypothesis space for deep learning. In Section 5, we study learning models in the RKBS, establishing representer theorems for solutions of two learning models (regularized learning and minimum norm interpolation) in the hypothesis space. We conclude this paper in Section 6 with remarks on advantages of learning in the proposed hypothesis space.

2 Learning with Deep Neural Networks

We describe in this section an innate learning model with DNNs, considered widely in the machine learning community.

We first recall the notation of DNNs. Let $s$ and $t$ be positive integers. A DNN is a vector-valued function from $\mathbb{R}^s$ to $\mathbb{R}^t$ formed by compositions of functions, each of which is defined by an activation function applied to an affine map. Specifically, for a given univariate function $\sigma: \mathbb{R} \to \mathbb{R}$, we define a vector-valued function by

$f_{j+1}$, for $j \in \mathbb{N}_{k-1}$, we denote the consecutive composition of $f_j$, $j \in \mathbb{N}_k$, by

whose domain is that of $f_1$. Suppose that $D \in \mathbb{N}$ is prescribed and fixed. Throughout this paper, we always let $m_0 := s$ and $m_D := t$. We specify positive integers $m_j$, $j \in \mathbb{N}_{D-1}$. For $\mathbf{W}_j \in \mathbb{R}^{m_j \times m_{j-1}}$ and $\mathbf{b}_j \in \mathbb{R}^{m_j}$, $j \in \mathbb{N}_D$, a DNN is a function defined by

Note that $x$ is the input vector and $\mathcal{N}^D$ has $D-1$ hidden layers and an output layer, which is the $D$-th layer.

A DNN may be represented in a recursive manner. From definition (1), a DNN can be defined recursively by

We write $\mathcal{N}^D$ as $\mathcal{N}^D(\cdot, \{\mathbf{W}_j, \mathbf{b}_j\}_{j=1}^D)$ when it is necessary to indicate the dependence of DNNs on the parameters. In this paper, when we write the set $\{\mathbf{W}_j, \mathbf{b}_j\}_{j=1}^D$ associated with the neural network $\mathcal{N}^D$, we implicitly give it the order inherited from the definition of $\mathcal{N}^D$. Throughout this paper, we assume that the activation function $\sigma$ is continuous.
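For concreteness, the following minimal NumPy sketch (an illustration, not code from the paper; it assumes the hidden layers apply $\sigma$ and the $D$-th layer is purely affine) evaluates a DNN as a function of the input $x$ and the parameter list $\theta = \{\mathbf{W}_j, \mathbf{b}_j\}_{j=1}^D$:

```python
import numpy as np

def dnn(x, theta, sigma=np.tanh):
    """Evaluate N^D(x, theta) where theta = [(W_1, b_1), ..., (W_D, b_D)].

    Hidden layers j = 1, ..., D-1 apply sigma(W_j h + b_j); the D-th (output)
    layer is taken to be purely affine, W_D h + b_D (an assumption of this sketch).
    """
    h = np.asarray(x, dtype=float)
    for W, b in theta[:-1]:
        h = sigma(W @ h + b)
    W_D, b_D = theta[-1]
    return W_D @ h + b_D

# Widths m_0 = s = 3, m_1 = 4, m_2 = t = 2 (D = 2, one hidden layer).
rng = np.random.default_rng(0)
theta = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
         (rng.standard_normal((2, 4)), rng.standard_normal(2))]
print(dnn(rng.standard_normal(3), theta))   # a point of R^t = R^2
```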

It is advantageous to consider the DNN $\mathcal{N}^D$ defined above as a function of two variables, one being the physical variable $x \in \mathbb{R}^s$ and the other being the parameter variable $\theta := \{\mathbf{W}_j, \mathbf{b}_j\}_{j=1}^D$. Given positive integers $m_j$, $j \in \mathbb{N}_{D-1}$, we let

denote the width set and define the primitive set of DNNs of $D$ layers by

Clearly, the set $\mathcal{A}_{\mathbb{W}}$ defined by (3) depends not only on $\mathbb{W}$ but also on $D$. For the sake of simplicity, we will not indicate the dependence on $D$ in our notation when no ambiguity is caused. For example, we will use $\mathcal{N}$ for $\mathcal{N}^D$. Moreover, an element of $\mathcal{A}_{\mathbb{W}}$ is a vector-valued function mapping from $\mathbb{R}^s$ to $\mathbb{R}^t$. We shall understand the set $\mathcal{A}_{\mathbb{W}}$. To this end, we define the parameter space $\Theta$ by letting

Note that $\Theta$ is measurable. For $x \in \mathbb{R}^s$ and $\theta \in \Theta$, we define

For $x \in \mathbb{R}^s$ and $\theta \in \Theta$, there holds $\mathcal{N}(x, \theta) \in \mathbb{R}^t$. In this notation, the set $\mathcal{A}_{\mathbb{W}}$ may be written as

We now describe the innate learning model with DNNs. Suppose that a training dataset

is given and we would like to train a neural network from the dataset. We denote by $\mathcal{L}(\mathcal{N}, \mathbb{D}_m): \Theta \to \mathbb{R}$ a loss function determined by the dataset $\mathbb{D}_m$. For example, a loss function may take the form

where $\|\cdot\|$ is a norm of $\mathbb{R}^t$. Given a loss function, a typical deep learning model is to train the parameters $\theta \in \Theta_{\mathbb{W}}$ from the training dataset $\mathbb{D}_m$ by solving the optimization problem

where $\mathcal{N}$ has the form in equation (5). Equivalently, optimization problem (7) may be written as

Model (8) is an innate learning model considered widely in the machine learning community. Note that the set $\mathcal{A}_{\mathbb{W}}$ lacks both algebraic and topological structure. It is difficult to conduct mathematical analysis for learning model (8). Even the existence of its solution is not guaranteed.

We introduce a vector space that contains $\mathcal{A}_{\mathbb{W}}$ and consider learning in the vector space. For this purpose, given a set $\mathbb{W}$ of weight widths defined by (2), we define the set

In the next proposition, we present properties of $\mathcal{B}_{\mathbb{W}}$.

Proposition 1.

If $\mathbb{W}$ is the width set defined by (2), then

(i) $\mathcal{B}_{\mathbb{W}}$ defined by (9) is the smallest vector space on $\mathbb{R}$ that contains the set $\mathcal{A}_{\mathbb{W}}$,

(ii) $\mathcal{B}_{\mathbb{W}}$ is of infinite dimension,

(iii) $\mathcal{B}_{\mathbb{W}} \subset \bigcup_{n \in \mathbb{N}} \mathcal{A}_{n\mathbb{W}}$.

Proof. It is clear that $\mathcal{B}_{\mathbb{W}}$ may be identified as the linear span of $\mathcal{A}_{\mathbb{W}}$, that is,

Thus, $\mathcal{B}_{\mathbb{W}}$ is the smallest vector space containing $\mathcal{A}_{\mathbb{W}}$. Item (ii) follows directly from the definition (9) of $\mathcal{B}_{\mathbb{W}}$.

It remains to prove Item (iii). To this end, we let $f \in \mathcal{B}_{\mathbb{W}}$. By the definition (9) of $\mathcal{B}_{\mathbb{W}}$, there exist $n' \in \mathbb{N}$, $c_l \in \mathbb{R}$, $\theta_l \in \Theta_{\mathbb{W}}$, for $l \in \mathbb{N}_{n'}$, such that

It suffices to show that $f \in \mathcal{A}_{n'\mathbb{W}}$. Noting that $\theta_l := \{\mathbf{W}_j^l, \mathbf{b}_j^l\}_{j=1}^D$, for $l \in \mathbb{N}_{n'}$, we set

Clearly, we have that $\widetilde{\mathbf{W}}_1 \in \mathbb{R}^{(n'm_1) \times m_0}$, $\widetilde{\mathbf{b}}_j \in \mathbb{R}^{n'm_j}$, $j \in \mathbb{N}_{D-1}$, $\widetilde{\mathbf{W}}_j \in \mathbb{R}^{(n'm_j) \times (n'm_{j-1})}$, $j \in \mathbb{N}_{D-1} \backslash \{1\}$, $\widetilde{\mathbf{W}}_D \in \mathbb{R}^{m_D \times (n'm_{D-1})}$, and $\widetilde{\mathbf{b}}_D \in \mathbb{R}^{m_D}$. Direct computation confirms that $f(\cdot) = \mathcal{N}(\cdot, \widetilde{\theta})$ with $\widetilde{\theta} := \{\widetilde{\mathbf{W}}_j, \widetilde{\mathbf{b}}_j\}_{j=1}^D$. By definition (3), $f \in \mathcal{A}_{n'\mathbb{W}}$. ∎
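The construction in the proof of Item (iii) can be checked numerically. The sketch below (again an illustration under the assumption that hidden layers apply $\sigma$ and the output layer is affine) stacks the first-layer weights vertically, places intermediate weights in block-diagonal form, and concatenates the $c_l$-scaled output blocks, realizing a linear combination of DNNs of width set $\mathbb{W}$ as a single DNN of width set $n'\mathbb{W}$:

```python
import numpy as np
from scipy.linalg import block_diag

def dnn(x, theta, sigma=np.tanh):
    # Hidden layers apply sigma; the output layer is assumed purely affine.
    h = np.asarray(x, dtype=float)
    for W, b in theta[:-1]:
        h = sigma(W @ h + b)
    W_D, b_D = theta[-1]
    return W_D @ h + b_D

def stack(thetas, coeffs):
    """Build theta_tilde so that N(., theta_tilde) = sum_l c_l N(., theta_l)."""
    D = len(thetas[0])
    W1 = np.vstack([th[0][0] for th in thetas])          # stacked first layer
    b1 = np.concatenate([th[0][1] for th in thetas])
    theta_tilde = [(W1, b1)]
    for j in range(1, D - 1):                            # block-diagonal middle layers
        Wj = block_diag(*[th[j][0] for th in thetas])
        bj = np.concatenate([th[j][1] for th in thetas])
        theta_tilde.append((Wj, bj))
    WD = np.hstack([c * th[D - 1][0] for c, th in zip(coeffs, thetas)])
    bD = sum(c * th[D - 1][1] for c, th in zip(coeffs, thetas))
    theta_tilde.append((WD, bD))
    return theta_tilde

rng = np.random.default_rng(1)
def random_theta(widths):  # widths = [m_0, ..., m_D]
    return [(rng.standard_normal((widths[j + 1], widths[j])),
             rng.standard_normal(widths[j + 1])) for j in range(len(widths) - 1)]

widths = [3, 5, 4, 2]                       # s = 3, t = 2, D = 3
thetas = [random_theta(widths) for _ in range(3)]
coeffs = [0.5, -1.0, 2.0]
theta_tilde = stack(thetas, coeffs)         # hidden widths become 15 and 12, i.e. n'W

x = rng.standard_normal(3)
lhs = sum(c * dnn(x, th) for c, th in zip(coeffs, thetas))
print(np.allclose(lhs, dnn(x, theta_tilde)))   # True
```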

Proposition 1 reveals that $\mathcal{B}_{\mathbb{W}}$ is the smallest vector space that contains $\mathcal{A}_{\mathbb{W}}$. Hence, it is a reasonable substitute for $\mathcal{A}_{\mathbb{W}}$. Motivated by Proposition 1, we propose the following alternative learning model

For a given width set $\mathbb{W}$, unlike learning model (8), which searches for a minimizer in the set $\mathcal{A}_{\mathbb{W}}$, learning model (10) seeks a minimizer in the vector space $\mathcal{B}_{\mathbb{W}}$, which contains $\mathcal{A}_{\mathbb{W}}$ and is contained in $\mathcal{A} := \bigcup_{n \in \mathbb{N}} \mathcal{A}_{n\mathbb{W}}$. According to Proposition 1, learning model (10) is “semi-equivalent” to learning model (8) in the sense that

where $\mathcal{N}_{\mathcal{B}_{\mathbb{W}}}$ is a minimizer of model (10), and $\mathcal{N}_{\mathcal{A}_{\mathbb{W}}}$ and $\mathcal{N}_{\mathcal{A}}$ are the minimizers of model (8) and of model (8) with the set $\mathcal{A}_{\mathbb{W}}$ replaced by $\mathcal{A}$, respectively. One might argue that since model (8) is a finite-dimensional optimization problem while model (10) is an infinite-dimensional one, the alternative model (10) may add unnecessary complexity to the original model. Although model (10) is of infinite dimension, the algebraic structure of the vector space $\mathcal{B}_{\mathbb{W}}$ and the topological structure with which it will later be equipped provide us with great advantages for mathematical analysis of learning on the space. As a matter of fact, the vector-valued RKBS obtained by completing the vector space $\mathcal{B}_{\mathbb{W}}$ in a weak* topology will lead to a representer theorem for the learned solution, which reduces the infinite-dimensional optimization problem to a finite-dimensional one. This addresses the challenges caused by the infinite dimension of the space $\mathcal{B}_{\mathbb{W}}$.

3 Vector-Valued Reproducing Kernel Banach Space

It was proved in the last section that for a given width set $\mathbb{W}$, the set $\mathcal{B}_{\mathbb{W}}$ defined by (9) is the smallest vector space that contains the primitive set $\mathcal{A}_{\mathbb{W}}$. One of the aims of this paper is to establish that the vector space $\mathcal{B}_{\mathbb{W}}$ is dense in a weak* topology in a vector-valued RKBS. For this purpose, in this section we describe the notion of vector-valued RKBSs.

A Banach space $\mathcal{B}$ with the norm $\|\cdot\|_{\mathcal{B}}$ is called a space of vector-valued functions on a prescribed set $X$ if $\mathcal{B}$ is composed of vector-valued functions defined on $X$ and for each $f \in \mathcal{B}$, $\|f\|_{\mathcal{B}} = 0$ implies that $f(x) = \mathbf{0}$ for all $x \in X$. For each $x \in X$, we define the point evaluation operator $\delta_x: \mathcal{B} \to \mathbb{R}^n$ as

We provide the definition of vector-valued RKBSs below.

Definition 2.

A Banach space $\mathcal{B}$ of vector-valued functions from $X$ to $\mathbb{R}^n$ is called a vector-valued RKBS if there exists a norm $\|\cdot\|$ of $\mathbb{R}^n$ such that for each $x \in X$, the point evaluation operator $\delta_x$ is continuous with respect to the norm $\|\cdot\|$ of $\mathbb{R}^n$ on $\mathcal{B}$, that is, for each $x \in X$, there exists a constant $C_x > 0$ such that

Note that since all norms of $\mathbb{R}^n$ are equivalent, if a Banach space $\mathcal{B}$ of vector-valued functions from $X$ to $\mathbb{R}^n$ is a vector-valued RKBS with respect to a norm of $\mathbb{R}^n$, then it must be a vector-valued RKBS with respect to any other norm of $\mathbb{R}^n$. Thus, the property of point evaluation operators being continuous on the space $\mathcal{B}$ is independent of the choice of the norm of the output space $\mathbb{R}^n$.

The notion of RKBSs was originally introduced in [29] to guarantee the stability of the sampling process and to serve as a hypothesis space for sparse machine learning. Vector-valued RKBSs were studied in [14, 30], in which the definition of the vector-valued RKBS involves an abstract Banach space, with a specific norm, as the output space of functions. In Definition 2, we limit the output space to the Euclidean space $\mathbb{R}^n$ without specifying a norm, due to the special property that norms on $\mathbb{R}^n$ are all equivalent.

We reveal in the next proposition that point evaluation operators are continuous if and only if component-wise point evaluation functionals are continuous. To this end, for a vector-valued function $f: X \to \mathbb{R}^n$, for each $j \in \mathbb{N}_n$, we denote by $f_j: X \to \mathbb{R}$ the $j$-th component of $f$, that is,

Proposition 3 .

We next identify a reproducing kernel for a vector-valued RKBS. We need the notion of the $\delta$-dual space of a vector-valued RKBS. For a Banach space $B$ with a norm $\|\cdot\|_B$, we denote by $B^*$ the dual space of $B$, which is composed of all continuous linear functionals on $B$ endowed with the norm

Suppose that $\mathcal{B}$ is a vector-valued RKBS of functions from $X$ to $\mathbb{R}^n$, with the dual space $\mathcal{B}^*$. We set

ID3 Algorithm and Hypothesis space in Decision Tree Learning

The collection of potential decision trees is the hypothesis space searched by ID3. ID3 searches this hypothesis space in a hill-climbing fashion, starting with the empty tree and moving on to increasingly detailed hypotheses in pursuit of a decision tree that properly classifies the training data.

In this blog, we’ll have a look at the Hypothesis space in Decision Trees and the ID3 Algorithm. 

ID3 Algorithm: 

The ID3 algorithm (Iterative Dichotomiser 3) is a classification technique that uses a greedy approach to create a decision tree by picking the optimal attribute that delivers the most Information Gain (IG) or the lowest Entropy (H).

What is Information Gain and Entropy?  

Information Gain:

The assessment of changes in entropy after segmenting a dataset based on a characteristic is known as information gain.

It establishes how much information a feature provides about a class.

We split a node and build the decision tree based on the value of the information gain.

The node/attribute with the greatest information gain is split first; the decision tree method always strives to maximize the value of information gain.

The formula for Information Gain: 

\[\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)\]

Entropy is a metric for determining the degree of impurity in a particular attribute. It denotes the unpredictability of the data. The following formula may be used to compute entropy:

\[\mathrm{Entropy}(S) = -P(\mathrm{yes})\log_2 P(\mathrm{yes}) - P(\mathrm{no})\log_2 P(\mathrm{no})\]

S stands for “total number of samples.”

P(yes) denotes the likelihood of a yes answer.

P(no) denotes the likelihood of a negative outcome.

  • Calculate the dataset’s entropy.
  • For each feature/attribute:
    • Determine the entropy for each of its category values.
    • Calculate the feature’s information gain.
  • Find the feature that provides the most information.
  • Repeat it till we get the tree we want.
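A small sketch of these calculations (the dataset, attribute names, and labels are illustrative inventions of mine, not from the article):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c) over the class proportions in S."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), splitting on A."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder

# Tiny hypothetical dataset: should we play outside?
data = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
play = ["yes", "no", "yes", "no"]

print(entropy(play))                                     # impurity of the full dataset
for attr in ("outlook", "windy"):
    print(attr, information_gain(data, play, attr))      # ID3 splits on the larger gain
```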

Characteristics of ID3: 

  • ID3 takes a greedy approach, which means it might get caught in local optima and hence cannot guarantee an optimal result.
  • ID3 has the potential to overfit the training data (to avoid overfitting, smaller decision trees should be preferred over larger ones).
  • This method creates small trees most of the time; however, it does not always yield the smallest tree feasible.
  • ID3 is not easy to use on continuous data (if the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by takes a lot of time).

Overfitting:  

Good generalization is the desired property in our decision trees (and, indeed, in all classification problems), as we noted before. 

This means we want the model fitted on the labeled training data to make predictions on new, unseen observations that are as accurate as its predictions on the training data.

Capabilities and Limitations of ID3:

  • ID3’s hypothesis space of all decision trees is a complete space of finite discrete-valued functions of the given attributes.
  • As it searches across the space of decision trees, ID3 keeps just one current hypothesis. This differs from the earlier version-space Candidate-Elimination approach, which keeps the set of all hypotheses consistent with the training instances provided.
  • By identifying only one hypothesis, ID3 loses the capabilities that come with explicitly representing all consistent hypotheses. It is unable to establish how many different decision trees are compatible with the supplied training data.
  • One benefit of incorporating all of the instances’ statistical features (e.g., information gain) is that the final search is less vulnerable to faults in individual training examples.
  • By altering its termination criterion to allow hypotheses that inadequately match the training data, ID3 may simply be modified to handle noisy training data.
  • In its purest form, ID3 does not go backward in its search. It never goes back to evaluate a choice after it has chosen an attribute to test at a specific level in the tree. As a result, it is vulnerable to the standard dangers of hill-climbing search without backtracking, resulting in local optimum but not globally optimal solutions.
  • At each stage of the search, ID3 uses all training instances to make statistically based judgments on how to refine its current hypothesis. This is in contrast to approaches that make incremental judgments based on individual training instances (e.g., FIND-S or CANDIDATE-ELIMINATION ).

Hypothesis Space Search by ID3: 

  • ID3 searches the space of feasible decision trees in a hill-climbing fashion, progressively acquiring knowledge from the data.
  • It searches the whole space of finite discrete-valued functions. Every such function is represented by at least one tree.
  • It only holds one hypothesis (unlike Candidate-Elimination). It is unable to tell us how many other feasible options exist.
  • It’s possible to get stranded in local optima.
  • At each phase, all training examples are used. Errors have a lower impact on the outcome.

What is Hypothesis in Machine Learning? How to Form a Hypothesis?

Hypothesis Testing is a broad subject that is applicable to many fields. When we study statistics, hypothesis testing involves data from one or more populations, and the test is to see how significant an effect is on the population.

This involves calculating the p-value and comparing it with the significance level (alpha). In Machine Learning, on the other hand, a hypothesis deals with finding the function that best approximates the independent features to the target; in other words, mapping the inputs to the outputs.

By the end of this tutorial, you will know the following:

  • What is Hypothesis in Statistics vs Machine Learning
  • What is Hypothesis space?
  • Process of Forming a Hypothesis

Hypothesis in Statistics

A Hypothesis is an assumption of a result that is falsifiable, meaning it can be proven wrong by some evidence. A Hypothesis can be either rejected or failed to be rejected. We never accept any hypothesis in statistics because it is all about probabilities and we are never 100% certain. Before the start of the experiment, we define two hypotheses:

1. Null Hypothesis: says that there is no significant effect

2. Alternative Hypothesis: says that there is some significant effect

In statistics, we compare the P-value (which is calculated using an appropriate statistical test) with the critical value, alpha. The larger the P-value, the more consistent the data is with the null hypothesis, which in turn signifies that the effect is not significant, and we conclude that we fail to reject the null hypothesis.

In other words, the effect is highly likely to have occurred by chance and has no statistical significance. On the other hand, if we get a very small P-value, the probability of the observed effect occurring by chance is very low.

Significance Level

The Significance Level is set before starting the experiment. It defines how much error we are willing to tolerate and at what level an effect can be considered significant. A common choice is a 95% confidence level, which means there is a 5% chance of the test fooling us into making an error; in other words, the significance level (alpha) is 0.05, and it acts as a threshold. Similarly, a 99% confidence level corresponds to an alpha of 0.01.

A statistical test is carried out on the sample to find the P-value, which is then compared with the critical value. If the P-value comes out to be less than the critical value, we conclude that the effect is significant and hence reject the Null Hypothesis (which states that there is no significant effect). If the P-value comes out to be greater than the critical value, we conclude that there is no significant effect and hence fail to reject the Null Hypothesis.

Now, as we can never be 100% sure, there is always a chance that a test is carried out correctly but its result is misleading. Either we reject the null hypothesis when it is actually true, or we fail to reject the null hypothesis when it is actually false. These are the Type 1 and Type 2 errors of Hypothesis Testing.

Example  

Consider you’re working for a vaccine manufacturer and your team develops a vaccine for Covid-19. To prove the efficacy of this vaccine, it needs to be statistically proven that it is effective on humans. Therefore, we take two groups of people of equal size and similar characteristics. We give the vaccine to group A and a placebo to group B, and then analyse how many people in each group got infected.

We repeat this test multiple times to see whether group A developed any significant immunity against Covid-19. We calculate the P-value for all these tests and find that the P-values are consistently less than the critical value. Hence, we can safely reject the null hypothesis and conclude that there is indeed a significant effect.
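As an illustration, the sketch below runs one such comparison in Python, assuming hypothetical infection counts for the two groups; a chi-square test of independence is just one possible choice of test and is not prescribed by the text above.

```python
# A minimal sketch of the vaccine example, with invented infection counts.
from scipy.stats import chi2_contingency

alpha = 0.05                # significance level chosen before the experiment

# Contingency table: rows are groups, columns are [infected, not infected].
observed = [
    [12, 488],              # hypothetical counts for group A (vaccine)
    [60, 440],              # hypothetical counts for group B (placebo)
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value = {p_value:.4f}")

if p_value < alpha:
    print("Reject the null hypothesis: the effect is statistically significant.")
else:
    print("Fail to reject the null hypothesis: no significant effect detected.")
```

With infection counts this different between the groups, the p-value comes out well below 0.05, mirroring the conclusion drawn in the example above.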

Hypothesis in Machine Learning

In Machine Learning, a hypothesis is used when, in supervised learning, we need to find the function that best maps inputs to outputs. This can also be called function approximation, because we are approximating a target function that best maps the features to the target.

1. Hypothesis (h): A hypothesis is a single model that maps features to the target; how well it does so is judged by the resulting evaluation metrics. A hypothesis is denoted by “h”.

2. Hypothesis Space (H): A hypothesis space is the complete range of models, and their possible parameters, that can be used to model the data. It is denoted by “H”. In other words, a hypothesis is a member of the hypothesis space.

In essence, we have the training data (independent features and the target) and we want to approximate a target function that maps the features to the target. Different algorithms are run with different configurations of their hyperparameter spaces to check which configuration produces the best results. The training data is used to formulate and find the best hypothesis from the hypothesis space, and the test data is used to validate or verify the results produced by that hypothesis.

Consider an example where we have a dataset of 10,000 instances with 10 features and one binary target, making it a binary classification problem. Say we model this data using Logistic Regression and get an accuracy of 78%. We can draw the decision boundary that separates the two classes; this is a hypothesis (h). We then test this hypothesis on the test data and get a score of 74% (a minimal sketch of this step follows below).
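The snippet below is a rough sketch of that step, fitting a single Logistic Regression hypothesis on a synthetic stand-in for the 10,000-instance, 10-feature dataset; the resulting accuracies will differ from the 78% and 74% quoted above.

```python
# A minimal sketch of fitting a single Logistic Regression hypothesis;
# the dataset is a synthetic stand-in for the one described in the text.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

h = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # one hypothesis h
print("Training accuracy:", h.score(X_train, y_train))
print("Test accuracy:", h.score(X_test, y_test))
```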

Now, again assume we fit a RandomForests model on the same data and get an accuracy score of 85%. This is a good improvement over Logistic Regression already. Now we decide to tune the hyperparameters of RandomForests to get a better score on the same data. We do a grid search and run multiple RandomForest models on the data and check their performance. In this step, we are essentially searching the Hypothesis Space(H) to find a better function. After completing the grid search, we get the best score of 89% and we end the search. 
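The grid search described here can be sketched as follows, again on a synthetic dataset; the parameter grid values are illustrative assumptions rather than the ones used in the text. Each parameter combination corresponds to one candidate hypothesis h drawn from the hypothesis space H.

```python
# A minimal sketch of a grid search over RandomForest hyperparameters;
# the dataset is a synthetic stand-in and the grid values are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=10000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each parameter combination defines one candidate hypothesis h from H.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best cross-validation score:", search.best_score_)
print("Test score of the chosen hypothesis:", search.score(X_test, y_test))
```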

Now we also try more models, like XGBoost, Support Vector Machines, and Naive Bayes, to test their performance on the same data. We then pick the best-performing model, test it on the test data to validate its performance, and get a score of 87%.

Before you go.

The hypothesis is a crucial aspect of Machine Learning and Data Science. It is present in all domains of analytics, be it pharma, software, or sales, and is often the deciding factor in whether a change should be introduced. A hypothesis is evaluated on the training dataset to judge the performance of the models drawn from the hypothesis space.

A hypothesis must be falsifiable, which means that it must be possible to test it and prove it wrong if the results go against it. Searching for the best configuration of a model is time-consuming when many different configurations need to be evaluated; this process can be sped up using techniques such as random search over the hyperparameters.





What is hypothesis in Machine Learning?

The hypothesis is a word that is frequently used in Machine Learning and data science initiatives. As we all know, machine learning is one of the most powerful technologies in the world, allowing us to anticipate outcomes based on previous experiences. Moreover, data scientists and ML specialists undertake experiments with the goal of solving an issue. These ML experts and data scientists make an initial guess on how to solve the challenge.

What is a Hypothesis?

A hypothesis is a conjecture or proposed explanation based on limited facts or assumptions. It is only a conjecture, based on certain known facts, that has yet to be confirmed. A good hypothesis is testable and, once tested, turns out to be either true or false.

Let's look at an example to better grasp the idea. Some scientists state that ultraviolet (UV) light can harm the eyes and induce blindness.

In this case, the scientists merely claim that UV rays are hazardous to the eyes, and people presume they can lead to blindness. Yet it is conceivable that this is not actually the case. Assumptions of this kind are referred to as hypotheses.

Defining Hypothesis in Machine Learning

In machine learning, a hypothesis is a mathematical function or model that converts input data into output predictions. The model's first belief or explanation is based on the facts supplied. The hypothesis is typically expressed as a collection of parameters characterizing the behavior of the model.

Suppose we're building a model to predict the price of a property based on its size and location. The hypothesis function may look something like this −

$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, h(x) is the hypothesis function, x is the input data, θ0, θ1, and θ2 are the model's parameters, and x1 and x2 are the features.

The machine learning model's purpose is to discover the optimal values of θ0, θ1, and θ2 that minimize the difference between the predicted and actual output labels.

To put it another way, we're looking for the hypothesis function that best represents the underlying link between the input and output data.
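A minimal sketch of this hypothesis function, assuming made-up parameter values and feature values (for example, size in square feet and a location score), might look like this:

```python
# A minimal sketch of the linear hypothesis h(x) = θ0 + θ1*x1 + θ2*x2;
# the parameter and feature values below are invented for illustration.
import numpy as np

def h(x, theta):
    """Linear hypothesis: theta[0] + theta[1]*x[0] + theta[2]*x[1]."""
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

theta = np.array([50_000.0, 120.0, 10_000.0])  # hypothetical θ0, θ1, θ2
x = np.array([1_500.0, 8.0])                   # hypothetical size and location score

print("Predicted price:", h(x, theta))
```

Training would then adjust theta to reduce the gap between such predictions and the actual prices.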

Types of Hypotheses in Machine Learning

The next step, after identifying the problem and obtaining evidence, is to form a hypothesis. A hypothesis is an explanation or proposed solution to a problem based on insufficient data; it acts as a springboard for further investigation and experimentation. In machine learning, a hypothesis is a function that converts inputs to outputs based on some assumptions, and a good hypothesis contributes to the creation of an accurate and efficient machine-learning model. Several types of hypotheses used in machine learning are as follows −

1. Null Hypothesis

A null hypothesis is a basic hypothesis that states that no link exists between the independent and dependent variables. In other words, it assumes the independent variable has no influence on the dependent variable. It is symbolized by H0. The null hypothesis is typically rejected if the p-value is less than the significance level (α); if the null hypothesis is actually true, α is the probability of incorrectly rejecting it. A null hypothesis is involved in tests such as t-tests and ANOVA.

2. Alternative Hypothesis

An alternative hypothesis is a hypothesis that contradicts the null hypothesis. It assumes that there is a relationship between the independent and dependent variables. In other words, it assumes that there is an effect of the independent variable on the dependent variable. It is denoted by Ha. An alternative hypothesis is generally accepted if the p-value is less than the significance level (α). An alternative hypothesis is also known as a research hypothesis.

3. One-tailed Hypothesis

A one-tailed test is a type of significance test in which the region of rejection is located at only one end of the sampling distribution. It checks whether the estimated test parameter is greater than (or less than) the critical value, in which case the alternative hypothesis is accepted rather than the null hypothesis. It is commonly used with the chi-square distribution, where the entire critical area related to α is placed in a single tail. One-tailed tests can be either left-tailed or right-tailed.

4. Two-tailed Hypothesis

The two-tailed test is a hypothesis test in which the region of rejection, or critical area, lies at both ends of the distribution. It determines whether the tested sample falls within or outside a certain range of values, and the alternative hypothesis is accepted if the calculated value falls in either of the two tails of the probability distribution. Here α is split into two equal parts, and the estimated parameter may lie above or below the assumed parameter, so extreme values in either direction count as evidence against the null hypothesis (see the sketch below).
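The sketch below contrasts a two-tailed and a one-tailed test on two small synthetic samples, using SciPy's independent-samples t-test (the `alternative` argument requires SciPy 1.6 or later); the data and effect size are invented for illustration.

```python
# A minimal sketch contrasting two-tailed and one-tailed tests on two small
# synthetic samples; the data and effect size are invented.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
sample_a = rng.normal(loc=10.0, scale=2.0, size=50)
sample_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Two-tailed: H1 says the two means differ in either direction.
_, p_two_tailed = ttest_ind(sample_a, sample_b, alternative="two-sided")

# One-tailed (left): H1 says the mean of sample_a is less than that of sample_b.
_, p_one_tailed = ttest_ind(sample_a, sample_b, alternative="less")

print(f"two-tailed p-value: {p_two_tailed:.4f}")
print(f"one-tailed p-value:  {p_one_tailed:.4f}")
```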

Overall, the hypothesis plays a critical role in the machine learning model. It provides a starting point for the model to make predictions and helps to guide the learning process. The accuracy of the hypothesis is evaluated using various metrics like mean squared error or accuracy.

The hypothesis is a mathematical function or model that converts input data into output predictions, typically expressed as a collection of parameters characterizing the behavior of the model. It is an explanation or solution to a problem based on insufficient data. A good hypothesis contributes to the creation of an accurate and efficient machine-learning model. A two-tailed hypothesis is used when there is no prior knowledge or theoretical basis to infer a certain direction of the link.

