Update Deep Learning Questions & Answers for Data Scientists.md #27
The question was: "How can transformers be used for tasks other than natural language processing, such as computer vision?"
Here is the most understandable answer I have found.
In NLP:
A transformer treats a sentence as a sequence of tokens and uses self-attention to learn how each token relates to every other token.
Example: in the sentence “The cat sat on the mat”,
- "cat" and "sat" are related,
- "cat" and "mat" are also related (because the cat is on the mat).
The transformer figures out these relationships automatically.
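To make this concrete, here is a minimal, untrained sketch of single-head self-attention over that sentence. The embeddings and weight matrices are random placeholders (not a real model), so the numbers are meaningless, but the shapes show exactly what attention computes:

```python
import torch
import torch.nn.functional as F

# Toy sketch: random embeddings for the 6 tokens of "The cat sat on the mat".
tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 16
x = torch.randn(len(tokens), d_model)          # (seq_len, d_model)

# Single-head scaled dot-product self-attention with random projections.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / d_model ** 0.5              # (seq_len, seq_len)
attn = F.softmax(scores, dim=-1)               # each row sums to 1

# attn[i, j] = how much token i attends to token j,
# e.g. attn[1, 5] is "cat" attending to "mat".
print(attn.shape)                              # torch.Size([6, 6])
out = attn @ V                                 # new, context-aware token vectors
```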
In Computer Vision:
An image is not a sequence, it is a 2D grid (height × width × channels).
So before feeding it to a transformer, we:
1. split the image into small fixed-size patches (e.g. 16×16 pixels),
2. flatten each patch into a vector,
3. linearly project each vector to an embedding and add a positional encoding.
Now, treat the patches like "words" and the image like a "sentence"!
Then self-attention can figure out which parts of the image should attend to which others (a code sketch follows the examples below).
- Maybe the eyes should attend to the nose in face recognition.
- Maybe the wheels should attend to the car body in car detection.
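Here is a sketch of that patch-to-token pipeline in the ViT style. The image size, patch size, embedding dimension, and encoder settings below are arbitrary illustration choices, not any specific model's configuration:

```python
import torch
import torch.nn as nn

# Cut the image into patches, flatten each patch, and project it to an
# embedding so it behaves like a "word". All sizes here are arbitrary.
img = torch.randn(1, 3, 224, 224)              # (batch, channels, H, W)
patch, d_model = 16, 64

to_patches = nn.Unfold(kernel_size=patch, stride=patch)
patches = to_patches(img).transpose(1, 2)      # (1, 196, 3*16*16): 196 "words"
embed = nn.Linear(3 * patch * patch, d_model)
tokens = embed(patches)                        # (1, 196, d_model)

# Add (learnable) position embeddings so the model knows where each patch was,
# then feed the whole sequence to a standard transformer encoder.
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], d_model))
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
out = encoder(tokens + pos)                    # (1, 196, d_model)
print(out.shape)
```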
✨ Why does this help?
In traditional CNNs, each convolutional filter looks only at a small local region (say 3×3 pixels).
In transformers, every patch can look at every other patch — even if they are far away!
So transformers can capture global relationships better.
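A toy contrast of the two receptive fields (sizes are arbitrary): a 3×3 convolution mixes only a local neighbourhood around each pixel, while self-attention over N patch tokens produces an N × N weight matrix, so every patch can attend to every other patch:

```python
import torch
import torch.nn as nn

# Convolution: each output pixel only sees a 3x3 window of its input.
fmap = torch.randn(1, 8, 14, 14)               # (batch, channels, H, W)
conv = nn.Conv2d(8, 8, kernel_size=3, padding=1)
print(conv(fmap).shape)                        # torch.Size([1, 8, 14, 14])

# Self-attention: 196 patch tokens attend to all 196 patch tokens.
tokens = torch.randn(1, 196, 64)               # (batch, num_patches, d_model)
attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
_, weights = attn(tokens, tokens, tokens, need_weights=True)
print(weights.shape)                           # torch.Size([1, 196, 196]) -- global
```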