Question

Why is sigmoid activation function not recommended for hidden units but is fine for an output unit?

Solutions

Expert Solution

Answer :

  • The gradient of the sigmoid saturates for large |x|, and under the chain rule these small factors multiply and shrink layer by layer. By contrast, the derivative of ReLU is always 1 or 0.
  • The sigmoid, Gaussian and sinusoidal functions are chosen for their independent, major space-division properties.
  • The sigmoid function is not effective for a single hidden unit; the other functions, by contrast, can give good performance.
  • When several hidden units are used, the sigmoid function is helpful, but its convergence speed is still slower than the others'.
  • The Gaussian function is sensitive to additive noise, while the others are rather insensitive. Judged by convergence rate, minimum error and noise sensitivity, the sinusoidal function is the most useful both with and without additive noise.
  • The property of each function is discussed in terms of the internal representation, that is, the distribution of the hidden units' inputs and outputs.
  • Although this selection depends on the input signals to be classified, the periodic function can be effectively applied to a wide range of application fields.
  • In practice, the sigmoid has largely been replaced by ReLU in hidden layers. As is well known, a deep neural network becomes hard to train as it gets deeper.
  • For why this happens, you can refer to the article "Neural Networks and Deep Learning". Generally, it is caused by the vanishing-gradient problem.
  • ReLU was introduced to mitigate this issue; it makes training deeper networks possible.
  • Two major benefits of ReLUs are sparsity and a reduced likelihood of vanishing gradients. First recall the definition of a ReLU: h = max(0, a), where a = Wx + b.
  • One major benefit is the reduced likelihood of the gradient vanishing. This arises when a > 0, where the gradient has the constant value 1. In contrast, the gradient of the sigmoid becomes increasingly small as the absolute value of x increases. The constant gradient of ReLUs results in faster learning.
  • The other benefit of ReLUs is sparsity, which arises when a ≤ 0. The more such units exist in a layer, the sparser the resulting representation.
  • Sigmoids, on the other hand, always produce some non-zero value, resulting in dense representations. Sparse representations tend to be more beneficial than dense ones.
  • The sigmoid remains a fine choice for an output unit, because it squashes its input into (0, 1) and can be read as a probability for binary classification, typically paired with a cross-entropy loss.
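The saturation argument above can be checked numerically. A minimal sketch with numpy: the sigmoid's derivative σ'(x) = σ(x)(1 − σ(x)) peaks at 0.25, so each sigmoid layer can shrink a backpropagated gradient by a factor of at most 0.25, while ReLU's derivative is exactly 1 on active units. The function names here are just illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x == 0, decays toward 0 elsewhere

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 on active units, 0 otherwise

print(sigmoid_grad(0.0))   # 0.25, the largest the sigmoid derivative ever gets
print(sigmoid_grad(5.0))   # ~0.0066: the unit is saturated

# Upper bound on the gradient factor contributed by n stacked sigmoid layers:
# the chain rule multiplies one derivative per layer, each at most 0.25.
for n in (5, 10, 20):
    print(n, 0.25 ** n)

print(relu_grad(np.array([-2.0, 3.0])))  # [0. 1.]
```

Even the best case (every unit at its sweet spot) shrinks the gradient by 0.25 per layer, which is why deep sigmoid stacks learn slowly in their early layers.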
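The sparsity claim can be illustrated the same way. A small sketch, assuming hypothetical zero-mean pre-activations: ReLU zeroes out roughly half of them, while the sigmoid maps every input to a strictly positive value, so its representation is never sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)  # hypothetical zero-mean pre-activations a = Wx + b

relu_out = np.maximum(0.0, x)
sig_out = 1.0 / (1.0 + np.exp(-x))

# ReLU: roughly half the units are exactly zero -> a sparse representation.
print(np.mean(relu_out == 0.0))
# Sigmoid: no unit is ever exactly zero -> a dense representation.
print(np.mean(sig_out == 0.0))
```

The exact fraction of zeros depends on the pre-activation distribution, but any unit with a ≤ 0 contributes an exact zero under ReLU, which is what makes the representation sparse.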
