1) Agents receive historical visual information (X, e.g., a photo) of chairs (T) and facts about those chairs (Y, e.g., their color).
2) They produce functions which take visual information (e.g., a photo) of an unknown chair and produce descriptions of that chair which are accurate if the visual information is accurate (e.g., the description "red" is accurate if the photo accurately depicts the chair's redness).
3) ML systems must receive the same type and amount of visual information when learning and when inferring, and must produce the same type of descriptions. E.g., an ML system that has only seen color photos cannot take a b/w photo and accurately conclude a color, even if the b/w photo is accurate; nor can it draw a color version. (A code sketch of this constraint follows the list.)
4) Humans are able to receive visual information in a form they haven't seen before (e.g., a child sees a b/w photo for the first time), which contains less information than what they have seen before (e.g., no color), and still produce accurate descriptions in many forms (painted, drawn, spoken, ...), provided the b/w photo is accurate.
(4) is required for general intelligence, and (3) ≠ (4).
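To make (3) concrete, here is a minimal Python sketch; the toy classifier and all names are hypothetical, invented purely for illustration. The learned function is bound to the encoding it was trained on and is simply undefined on any other.

```python
import numpy as np

class ColorChairClassifier:
    """Toy stand-in for an ML system trained only on HxWx3 color photos."""
    def __init__(self, h, w):
        self.input_shape = (h, w, 3)        # encoding fixed at training time
        self.weights = np.zeros(h * w * 3)  # stands in for learned parameters

    def describe(self, photo):
        # The learned function only exists over the training encoding.
        if photo.shape != self.input_shape:
            raise ValueError(f"expected {self.input_shape}, got {photo.shape}")
        score = float(photo.reshape(-1) @ self.weights)
        return "red" if score > 0 else "not red"

model = ColorChairClassifier(32, 32)
print(model.describe(np.random.rand(32, 32, 3)))   # same encoding: works

try:
    model.describe(np.random.rand(32, 32, 1))      # b/w photo: new encoding
except ValueError as e:
    print("fails on an unseen encoding:", e)
```

Converting the b/w input into the trained encoding (e.g., guessing colors) is exactly the step (3) says cannot be done accurately.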
NB.
If the "child's first b/w photo" isn't convincing, consider a language case: we read entirely new sentences and still produce accurate facts from reading them. E.g., "New York is loved by the Gleebal aliens of Phefar Prime" -> "Gleebals are a lot like tourists; they should visit Manhattan", or a painting of the planet Phefar Prime with an "I <3 NY" logo on it.
You might say that an ensemble of ML systems, covering all possible types of measurement and all possible types of description, would "perform as well" as a human; narrowly, here, this is the case (a back-of-envelope count of such an ensemble follows below).
The point is that humans do not need all possible "versions" of everything. If you have learnt to draw, you can go from entirely new drawings (with less info than you have had before) to entirely new sentences (with more info).
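To put a rough number on the ensemble: it needs one separately trained system per (measurement type, description type) pair, so it grows multiplicatively with every new form. A back-of-envelope sketch, with made-up illustrative lists:

```python
# One ML system per (encoding, description-mode) pair; the lists are illustrative.
encodings = ["color photo", "b/w photo", "line drawing", "written sentence"]
description_modes = ["spoken", "painted", "drawn", "written"]

ensemble_size = len(encodings) * len(description_modes)
print(f"{ensemble_size} separately trained systems")  # 16, growing with every new form
```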
---
A semi-formal proof:
1) Agents receive an encoding E(X) of measurements X of target objects T, and relevant descriptions D(Y) of facts Y about those objects.
2) They produce functions which take an E'-encoding of a new measurement, E'(X'), of an unknown target U, and produce a description D'(O) of a possible target O. Accuracy: O is D'-similar to U iff E'(O) is similar to E'(U).
3) ML systems produce only one such function, h, where E' = E and D' = D.
4) GI agents (e.g., humans) produce a set of functions, g1..gn, where E' = E or E' is in C1..Cn (a large set of compressive encodings), without prior experience of, e.g., Cn(X'); and which produce multiple modes of description, D' = D or D' in M1..Mn, without prior experience of Mn(Y').
General intelligence requires (at least) generating g1..gn from (E(X), D(Y)), and ML systems, per (3), cannot do this.
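One compact way to write the contrast between (3) and (4); this is a sketch, assuming the set-membership reading of (4) above:

```latex
% A compact sketch of (3) vs (4), assuming the set-membership reading of (4).
\begin{align*}
  \text{Given: }    & (E(X),\, D(Y)) \\
  \text{(3) ML: }   & h : E(X') \to D(Y'),     && E' = E,\ D' = D \\
  \text{(4) GI: }   & g_i : E'(X') \to D'(Y'), && E' \in \{E, C_1, \dots, C_n\},\ D' \in \{D, M_1, \dots, M_n\} \\
  \text{Accuracy: } & O \sim_{D'} U \iff E'(O) \sim E'(U)
\end{align*}
```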