What is the concept of channels in CNNs?
I am trying to understand what channels mean in convolutional neural networks. When working with grayscale and colored images, I understand that the number of channels is set to 1 and 3 (in the first conv layer), respectively, where 3 corresponds to red, green, and blue.
Say you have a colored image that is 200x200 pixels. The standard is such that the input matrix is a 200x200 matrix with 3 channels. The first convolutional layer would have a filter of size $N \times M \times 3$, where $N, M < 200$ (I think they're usually set to 3 or 5).
Would it be possible to structure the input data differently, such that the number of channels now becomes the width or height of the image? i.e., the number of channels would be 200, the input matrix would then be 200x3 or 3x200. What would be the advantage/disadvantage of this formulation versus the standard (# of channels = 3)? Obviously, this would limit your filter's spatial size, but dramatically increase it in the depth direction.
I am really posing this question because I don't quite understand the concept of channels in CNNs.
machine-learning deep-learning convolutional-neural-networks
asked 6 hours ago
Iamanon
2 Answers
The concept of channels comes from communications and has little to do with neural networks specifically. We see the concept in AI literature because AI deals with media streams and files.
The concept is this. A media format has channels to allow the independent passage of multiple signals through a single mechanism. That's it.
For example, some streamed broadcast or multicast may have channels for red, green, blue, alpha, left audio, and right audio. In printing, four-color process is standardized to cyan, magenta, yellow, and black (or key). In jet flight control there are channels for rudder, aileron, elevator, throttle, landing gear, and brake positions.
Automated vehicles would have steering, accelerator, brake, horn, and two turn signals for control channels, and data acquisition channels including an accelerometer, two microphones, probably some strain gauges, four radial encoders, three color channels for each of two cameras, and some thermistors.
In the context of convolution, a kernel may be applied in various ways to individual channels or groups of channels, either in parallel or in aggregation. When you turn up the bass on a digital music player, a kernel designed to boost the lower frequencies is applied to both the left and right audio channels. Missile countermeasures have kernels with built-in hardware trigonometry to track and steer in compound angles with nanosecond response time. Remote control toys have channels through which joystick signals are sent through kernels to coordinate propulsion signals to wheels or props.
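As a rough sketch of the "same kernel, every channel, in parallel" case, here is a plain-Python example. The stereo samples and the 3-tap moving average are made-up illustrations (a moving average is only a crude low-pass filter, not a real bass boost), and `convolve_1d` is a hypothetical helper, not part of any library.

```python
def convolve_1d(signal, kernel):
    """Valid-mode 1D convolution (kernel flipped) of a list of numbers."""
    k = len(kernel)
    flipped = kernel[::-1]
    return [
        sum(signal[i + j] * flipped[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

# Hypothetical stereo clip: each channel is an independent signal.
stereo = {
    "left": [0.0, 1.0, 0.0, -1.0, 0.0, 1.0],
    "right": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}

# A 3-tap moving average acts as a crude low-pass filter (illustrative only).
low_pass = [1 / 3, 1 / 3, 1 / 3]

# The same kernel is applied to every channel; the channels never mix.
filtered = {name: convolve_1d(samples, low_pass) for name, samples in stereo.items()}
```

Each output channel depends only on its own input channel here. A typical CNN layer does the opposite and sums across channels, which is a design choice rather than something inherent in the notion of a channel.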
For computer vision, consider the five dimensions involved in a data set of 1,000 video examples with a resolution of 1,920 x 1,080 progressive, 32-bit pixels, at 29.97 fps, each example with 9,010 frames.
- Example index
- Frame number
- Vertical index
- Horizontal index
- Channel
This would fit into an array of bytes dimensioned [1,000][9,010][1,080][1,920][4]. How the kernel deals with one or more channels and whether the channels interact or are kept independent has to do with the expected purpose of the layer within the deep learning design, and that is dependent upon what is expected in the video input and what is hoped for as output after the network is trained.
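To make the five-dimensional indexing concrete, the sketch below computes the total size of such a byte array and the row-major offset of one channel value. It assumes one byte per channel value (4 channels for a 32-bit pixel); `flat_offset` is a hypothetical helper written for illustration.

```python
from math import prod

# Dimensions from the text: examples, frames, rows, columns, channels.
shape = (1000, 9010, 1080, 1920, 4)

total_bytes = prod(shape)  # one byte per channel value (assumption)

def flat_offset(index, shape):
    """Row-major (C-order) offset of a multi-dimensional index."""
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of size {n}")
        offset = offset * n + i
    return offset

# Byte offset of channel 2 of the pixel at row 10, column 20, in frame 5 of example 3.
offset = flat_offset((3, 5, 10, 20, 2), shape)
print(f"{total_bytes:,} bytes in total; offset = {offset:,}")
```

The total comes to 74,732,544,000,000 bytes (roughly 74.7 TB), so in practice such a data set would be streamed or stored compressed rather than held as one in-memory array.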
Therefore,
- The AI system topology,
- The configuration and dimensions of layers,
- The dimensions of kernels,
- How those kernels deal with the channels, and
- What criteria are used to determine what is backpropagated

are a coordinated set of design decisions that the AI engineer must make based on theory, and probably demonstrate in a proof of concept (POC) later on.
The mapping of 1 channel to grayscale and 3 channels to color is an oversimplification. Sometimes color is RGB at the input and output (3 and 3 channels respectively) but is averaged to gray through some layers for the sake of computational thrift, while a normalized representation of color balance is processed in a separate parallel stream. Other times the alpha channel is needed, or an alpha and a beta channel, so there may be 4 or 5 channels respectively to handle green-screen or blue-screen moviemaking techniques or to support the learning of scene layering.
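A minimal sketch of the "averaged to gray" step, assuming a plain unweighted mean over the channels (weighted sums are also common) and a tiny made-up 2 x 2 image; `to_gray` is a hypothetical helper.

```python
def to_gray(image):
    """Average the channels of an H x W x C image into an H x W image."""
    return [[sum(pixel) / len(pixel) for pixel in row] for row in image]

# Hypothetical 2 x 2 RGB image, one (R, G, B) tuple per pixel.
rgb = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (30, 60, 90)],
]

gray = to_gray(rgb)  # 3 channels collapse into 1
```

Pure red, green, and blue all collapse to the same gray value here, which shows what information is discarded when the channels are merged.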
The AI engineer must first
- Decide what is wanted,
- Determine what must be learned to get it,
- Design a process to achieve it, and
- Select appropriate learning techniques to tune the parameterized process to learn what is necessary.
After such design activity is reasonably complete, the system topology and the dimension count and sizes become clear. The use of channels to fully represent information and what operations must be performed by kernels on those channels also become apparent.
> Say you have a colored image that is 200x200 pixels. The standard is such that the input matrix is a 200x200 matrix with 3 channels. The first convolutional layer would have a filter that is size $N \times M \times 3$, where $N, M < 200$ (I think they're usually set to 3 or 5).
>
> Would it be possible to structure the input data differently, such that the number of channels now becomes the width or height of the image? i.e., the number of channels would be 200, the input matrix would then be 200x3 or 3x200. What would be the advantage/disadvantage of this formulation versus the standard (# of channels = 3)? Obviously, this would limit your filter's spatial size, but dramatically increase it in the depth direction.
The different dimensions (width, height, and number of channels) have different meanings and different intuitions behind them, and this is important. You could rearrange your data, but if you then plug it into a CNN implementation that expects data in the original format, it would likely perform poorly.
The important observation is that CNNs encode a "prior" assumption (a heuristic, or rule of thumb) of location invariance. The intuition is that, when looking at images, we often want our neural network to detect features consistently, in the same way, regardless of where they are. These may be low-level features such as edges and corners, or high-level features such as complete faces. It should not matter whether a face is located in the top-left corner or the bottom-right corner of an image; detecting that it is there should still be performed in the same way (i.e. it likely requires exactly the same combination of learned weights in our network). That is what we mean by location invariance.
That intuition of location invariance is implemented by using "filters" or "feature detectors" that we "slide" along the entire image. These are the things you mentioned having dimensionality $N \times M \times 3$. Location invariance comes from taking the exact same filter and re-applying it at different locations in the image.
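Here is a small sketch of that sliding operation in plain Python, assuming stride 1, no padding, and a hand-set 2 x 2 x 3 filter instead of learned weights. All function names are hypothetical helpers; real frameworks implement the same idea far more efficiently.

```python
def conv2d_single_filter(image, kernel):
    """Slide one N x M x C filter over an H x W x C image (stride 1, no padding).

    Returns an (H - N + 1) x (W - M + 1) feature map. This is cross-correlation,
    which is what most deep learning libraries call "convolution".
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    N, M = len(kernel), len(kernel[0])
    if len(kernel[0][0]) != C:
        raise ValueError("filter depth must equal the number of channels")
    out = []
    for y in range(H - N + 1):
        row = []
        for x in range(W - M + 1):
            total = 0.0
            # One output value sums over the window AND over all channels.
            for dy in range(N):
                for dx in range(M):
                    for c in range(C):
                        total += image[y + dy][x + dx][c] * kernel[dy][dx][c]
            row.append(total)
        out.append(row)
    return out

def blank(h, w, c=3):
    """All-zero h x w x c image."""
    return [[[0.0] * c for _ in range(w)] for _ in range(h)]

def stamp(image, top, left):
    """Draw a 2 x 2 white square with its top-left corner at (top, left)."""
    for dy in range(2):
        for dx in range(2):
            image[top + dy][left + dx] = [1.0, 1.0, 1.0]
    return image

kernel = [[[1.0, 1.0, 1.0] for _ in range(2)] for _ in range(2)]  # 2 x 2 x 3

a = conv2d_single_filter(stamp(blank(6, 6), 0, 0), kernel)
b = conv2d_single_filter(stamp(blank(6, 6), 3, 2), kernel)

# The same weights give the same peak response wherever the square is drawn.
peak_a = max(max(row) for row in a)
peak_b = max(max(row) for row in b)
```

The response map simply shifts along with the square: `a[0][0]` and `b[3][2]` hold the same peak value. Note that the filter spans all 3 channels at once and only slides over height and width.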
If you change the order in which you present your data, you will break this property of location invariance. Instead, you will replace it with a rather strange property of, for example, "width-colour" invariance. You might get a filter that can detect the same type of feature regardless of its $x$-coordinate in an image, and regardless of the colour in which it was drawn, but the $y$-coordinate will suddenly become relevant; your filter may be able to detect edges of any colour at the bottom of an image, but fail to recognize the same edges at the top of an image. This is not an intuition that I would expect to work successfully in most image recognition tasks.
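The effect can be sketched with a toy example. To keep it short, it assumes a grayscale image (so the colour axis is dropped) and treats every image row as a channel, with a hand-set filter standing in for learned weights. The names are hypothetical.

```python
def conv_rows_as_channels(image, kernel):
    """1D sliding filter over x where every image row is treated as a channel.

    image:  H rows, each a list of W values.
    kernel: H rows, each a list of M weights (one weight row per image row).
    """
    H, W, M = len(image), len(image[0]), len(kernel[0])
    return [
        sum(image[y][x + dx] * kernel[y][dx] for y in range(H) for dx in range(M))
        for x in range(W - M + 1)
    ]

def image_with_bar(row, col, height=4, width=6):
    """Blank grayscale image with a 1 x 2 bright bar starting at (row, col)."""
    img = [[0.0] * width for _ in range(height)]
    img[row][col] = img[row][col + 1] = 1.0
    return img

# Hypothetical weights: only the bottom row (y = 3) responds to the bar.
kernel = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]

bottom_left = conv_rows_as_channels(image_with_bar(3, 0), kernel)
bottom_right = conv_rows_as_channels(image_with_bar(3, 4), kernel)
top_left = conv_rows_as_channels(image_with_bar(0, 0), kernel)
```

The bar is found at any $x$-coordinate along the bottom row (`bottom_left` and `bottom_right` have the same peak), but the identical bar in the top row produces no response at all, because each row has its own weights.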
Note that there may also be computational advantages to ordering the data in a certain way, depending on which calculations you will perform on it afterwards (typically lots of matrix multiplications). It is best to store the data in RAM so that the innermost loops of the algorithms using it (matrix multiplication) access it sequentially, in the same order in which it is stored. This is the most efficient way to access data from RAM, and will result in the fastest computations. You can generally expect that implementations in large frameworks like TensorFlow and PyTorch will already require you to supply data in whatever format is the most efficient by default.
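As a sketch of what "ordered in a certain way" means in memory, the two helpers below compute row-major offsets for a single 200 x 200 x 3 image under two common orderings, often called channels-last and channels-first. The function names are hypothetical, and which ordering is faster depends on the framework and hardware.

```python
def offset_hwc(y, x, c, H, W, C):
    """Row-major offset when channels are the innermost (fastest) axis."""
    return (y * W + x) * C + c

def offset_chw(y, x, c, H, W, C):
    """Row-major offset when channels are the outermost (slowest) axis."""
    return (c * H + y) * W + x

H, W, C = 200, 200, 3

# Distance in memory between the channel values of a single pixel.
hwc_channel_stride = offset_hwc(0, 0, 1, H, W, C) - offset_hwc(0, 0, 0, H, W, C)
chw_channel_stride = offset_chw(0, 0, 1, H, W, C) - offset_chw(0, 0, 0, H, W, C)

# Distance between horizontally adjacent pixels within one channel.
hwc_pixel_stride = offset_hwc(0, 1, 0, H, W, C) - offset_hwc(0, 0, 0, H, W, C)
chw_pixel_stride = offset_chw(0, 1, 0, H, W, C) - offset_chw(0, 0, 0, H, W, C)
```

With channels last, the three colour values of a pixel sit next to each other (stride 1) while neighbouring pixels are 3 apart. With channels first, neighbouring pixels are adjacent but the colour values of one pixel are 40,000 elements apart. A loop that walks the stride-1 axis innermost reads memory sequentially.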
The first answer above was answered 2 hours ago (edited 2 hours ago) by Douglas Daseeco.
4,283837
4,283837
add a comment |
add a comment |
Say you have a colored image that is 200x200 pixels. The standard is such that the input matrix is a 200x200 matrix with 3 channels. The first convolutional layer would have a filter that is size $N×M×3$, where $N,M<200$ (I think they're usually set to 3 or 5).
Would it be possible to structure the input data differently, such that the number of channels now becomes the width or height of the image? i.e., the number of channels would be 200, the input matrix would then be 200x3 or 3x200. What would be the advantage/disadvantage of this formulation versus the standard (# of channels = 3)? Obviously, this would limit your filter's spatial size, but dramatically increase it in the depth direction.
The different dimensions (width, height, and number of channels) do have different meanings, the intuition behind them is different, and this is important. You could rearrange your data, but if you then plug it into an implementation of CNNs that expects data in the original format, it would likely perform poorly.
The important observation to make is that the intuition behind CNNs is that they encode the "prior" assumption or knowledge, the "heuristic", the "rule of thumb", of location invariance. The intuition is that, when looking at images, we often want our Neural Network to be able to consistently (in the same way) detect features (maybe low-level features such as edges, corners, or maybe high-level features such as complete faces) regardless of where they are. It should not matter whether a face is located in the top-left corner or the bottom-right corner of an image, detecting that it is there should still be performed in the same way (i.e. likely requires exactly the same combination of learned weights in our network). That is what we mean with location invariance.
That intution of location invariance is implemented by using "filters" or "feature detectors" that we "slide" along the entire image. These are the things you mentioned having dimensionality $N times M times 3$. The intuition of location invariance is implemented by taking the exact same filter, and re-applying it in different locations of the image.
If you change the order in which you present your data, you will break this property of location invariance. Instead, you will replace it with a rather strange property of... for example, "width-colour" invariance. You might get a filter that can detect the same type of feature regardless its $x$-coordinate in an image, and regarldess of the colour in which it was drawn, but the $y$-coordinate will suddenly become relevant; your filter may be able to detect edges of any colour in the bottom of an image, but fail to recognize the same edges in the top-side of an image. This is not an intuition that I would expect to work successfully in most image recognition tasks.
Note that there may also be advantages in terms of computation time in having the data ordered in a certain way, depending on what calculations you're going to perform using that data afterwards (typically lots of matrix multiplications). It is best to have the data stored in RAM in such a way that the inner-most loops of algorithms using the data (matrix multiplication) access the data sequentially, in the same order that it is stored in. This is the most efficient way in which to access data from RAM, and will result in the fastest computations. You can generally safely expect that implementations in large frameworks like Tensorflow and PyTorch will already require you to supply data in whatever format is the most efficient by default.
answered 56 mins ago
Dennis Soemers