This is a pretty huge subject and strikes at the heart of things I've looked at since college.
The problem in your example comes down to knowledge representation. Show a human an apple and we instinctively know how to isolate that object from the background and think about its shape, color, and feel. We can imagine what it might weigh without even holding it. We automatically think of things that might be similar given all of those properties.
The computer sees pixels. All of them: background, hand, etc. It has no built-in ability to detect edges the way our eyes do. It cannot reason about a single thing without reams of programming to "find" the apple in the first place.
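To give a sense of what "reams of programming" means even for the first step, here is a minimal sketch of classic edge detection (a Sobel operator) applied to a toy 5x5 grayscale image. The image, threshold, and sizes are all invented for illustration; a real vision system also has to deal with borders, noise, lighting, and much more.

```python
# Sobel kernels: estimate horizontal and vertical brightness gradients.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [ 0,  0,  0],
           [ 1,  2,  1]]

def edge_strength(image, r, c):
    """Gradient magnitude at pixel (r, c) using the two Sobel kernels."""
    gx = gy = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            px = image[r + dr][c + dc]
            gx += SOBEL_X[dr + 1][dc + 1] * px
            gy += SOBEL_Y[dr + 1][dc + 1] * px
    return (gx * gx + gy * gy) ** 0.5

# Toy image: dark background (0) on the left, bright "object" (255) on the right.
image = [[0, 0, 255, 255, 255]] * 5

# Classify interior pixels as edge / not-edge (threshold of 100 is arbitrary).
edges = [[edge_strength(image, r, c) > 100 for c in range(1, 4)]
         for r in range(1, 4)]
```

Even this finds only brightness boundaries; deciding which boundaries belong to "the apple" is another problem entirely.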
This in particular is the study of neural networks. They try to simulate the pattern matching our brains do, but they still do it at a very crude level. The biological systems we have were honed over hundreds of thousands of years and have lots of built-in features that help in sometimes non-obvious ways.
Regarding language learning: this is also tricky. As you mention, the computer has no basic architecture to work from. How would you describe an apple in words the computer would understand? There are none. If you tried to give it some words, you would have to describe those words too. There has to be some core knowledge to build from. In this area there are some interesting things to look into. Check out WordNet, AliceBot, the Semantic Web, and specifically Dublin Core.
http://wordnet.princeton.edu/ If you want to reduce text into more understandable concepts, this is an excellent resource. I've done some playing with it and read several research papers that use it as their foundation.
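The relation in WordNet that makes this "reduction" work is the hypernym (is-a) hierarchy. Here is a tiny hand-made fragment of that idea; the entries below roughly follow WordNet's chain for "apple" but are typed from memory, and the real database has on the order of a hundred thousand synsets (via NLTK or the web interface).

```python
# A miniature is-a hierarchy, standing in for WordNet's hypernym relation.
HYPERNYMS = {
    "apple": "edible fruit",
    "banana": "edible fruit",
    "edible fruit": "produce",
    "produce": "food",
    "food": "substance",
}

def hypernym_chain(word):
    """Walk up the is-a links until we run out of entries."""
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain

def common_concept(a, b):
    """Most specific shared ancestor: the concept two words 'reduce' to."""
    ancestors_of_b = hypernym_chain(b)
    for concept in hypernym_chain(a):
        if concept in ancestors_of_b:
            return concept
    return None
```

This is exactly the kind of query that lets a program notice that an apple and a banana are "similar" without anyone having written that fact down explicitly.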
http://www.alicebot.org/about.html This seems like sort of a "trick" in that, using a simplified rule language, you can essentially parrot ideas back to a chatter... but to me it seemed like another excellent way to reduce text into simpler forms that a computer might be able to reason about given something like WordNet.
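The "trick" boils down to pattern-template rules: match the input against a wildcard pattern, then echo the captured piece back inside a canned response. The rules below are invented for illustration; real AIML is an XML format with far more machinery (topics, recursion, variables).

```python
import re

# (pattern, response template) pairs, checked in order -- like AIML categories.
RULES = [
    (r"I LIKE (.*)", "What do you like about {0}?"),
    (r"I AM (.*)", "Why are you {0}?"),
    (r"(.*)", "Tell me more."),   # catch-all, like AIML's bare * pattern
]

def respond(sentence):
    """Normalize the input, find the first matching rule, fill in the template."""
    text = sentence.strip().rstrip(".!?").upper()
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return template.format(match.group(1).lower())
    return "Tell me more."
```

No understanding is happening here at all, which is the point: the value is that the input has been forced into a small, regular form that later stages (a WordNet lookup, say) can actually work with.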
http://en.wikipedia.org/wiki/Dublin_Core I'm actually not a fan of ontologies, but if you already have data mapped this way then it improves your chances with other approaches. On the Semantic Web, presumably a human has already attempted to reduce a document to only Dublin Core semantics.
(I actually tried this really early in my career: explaining words to a computer through repetitive drilling, trying to home in on some common core knowledge to program. It was fun to play around with, but nothing even close to the awesomeness of WordNet or the completeness of Dublin Core.)
Having two kids with some symptoms of being on the autism spectrum, it's been interesting to pull back some of this knowledge and try to apply it to what I see them struggling with. Something as simple as a single sensory processing issue can have far-reaching implications. For example, we are not born with the ability to pick a single sound out of all of the sounds we are being bombarded with. When we are born it's just noise, and early on we learn to filter and zoom in on specific sounds (hearing the violins in the symphony, or your mother's voice in the crowd). If you delay this ability by even a little bit, it affects everything... not least language development. Similar things happen with other senses. If I point to a toy and the child cannot see the object for the clutter, then they have no idea what I'm trying to convey.
Interesting subject. But yeah, there is a reason this stuff is hard.