Andrei Zmievski, in a new post on his blog, describes the approach that has been settled on for HTTP input request encoding in PHP 6. He says there were four different proposals, but the chosen one combines flexibility, performance, intuitiveness, and minimal architectural changes, and has only a couple of small drawbacks.
He explains that correctly determining the encoding of HTTP requests is something of an unsolved problem. He does not know of any mainstream clients that send a charset specification along with the request. This means that it is up to the server or the application to figure out the encoding, which can be done in a number of ways: running encoding detection on the data, looking at the Accept-Charset header, or parsing the request to see whether a _charset_ field was passed. Unfortunately, none of these is completely reliable, and the best you can do is guess the encoding with little confidence, Andrei says.
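To make the guessing problem concrete, here is a minimal sketch in Python (not PHP, and not anything from Andrei's post) that tries the three hints mentioned above in order. The function name guess_input_encoding, the _charset_ field name, the order of the checks, and the ISO-8859-1 fallback are all assumptions chosen for illustration.

```python
def guess_input_encoding(headers, fields, raw_body, default="iso-8859-1"):
    """Return (encoding, source_of_guess). Every branch is only a heuristic."""
    # Hint 1: a form field that names the charset (field name is an assumption).
    charset = fields.get("_charset_")
    if charset:
        return charset, "form field"

    # Hint 2: the Accept-Charset header; take the entry with the highest q-value.
    best, best_q = None, 0.0
    for item in headers.get("Accept-Charset", "").split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip()
        if not name or name == "*":
            continue
        q = 1.0  # q defaults to 1.0 when it is not given
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                continue
        if q > best_q:
            best, best_q = name, q
    if best:
        return best, "Accept-Charset"

    # Hint 3: crude detection; see whether the raw bytes are valid UTF-8.
    try:
        raw_body.decode("utf-8")
        return "utf-8", "trial decode"
    except UnicodeDecodeError:
        return default, "fallback"
```

Note that Accept-Charset describes what the client is willing to receive, not what it sent, which is one reason none of these hints can be trusted on its own.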
He says the approach is basically a lazy evaluation scheme. When PHP receives the request, it simply stores it internally as-is and does no request decoding at all. However, if your script happens to access the $_GET, $_POST, or $_REQUEST arrays, the runtime JIT handler kicks in and converts the values in the array from binary to Unicode based on the current HTTP input encoding setting. This is done for the whole array at once, not per element. The encoding setting can be changed at runtime via a function tentatively named http_input_encoding(). If the encoding is changed, the JIT handler is re-armed, and the next access to the arrays re-converts the stored raw data to Unicode based on the new setting, he explains.
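The scheme can be modeled in a few lines. The following is a toy sketch in Python, not PHP's actual implementation: the class LazyQuery, its set_input_encoding() method (standing in for the tentatively named http_input_encoding()), and the conversions counter are all hypothetical names used only to show the store-raw, decode-on-first-access, re-arm-on-change behavior.

```python
from urllib.parse import parse_qsl


class LazyQuery:
    """Toy model of lazy request decoding for a URL-encoded query string."""

    def __init__(self, raw_query: bytes, encoding: str = "utf-8"):
        self._raw = raw_query      # kept exactly as received
        self._encoding = encoding  # current input encoding setting
        self._decoded = None       # None means the handler is armed
        self.conversions = 0       # how many whole-array conversions ran

    def set_input_encoding(self, encoding: str) -> None:
        # Changing the setting discards the cached result (re-arms the handler).
        if encoding != self._encoding:
            self._encoding = encoding
            self._decoded = None

    @property
    def values(self) -> dict:
        # First access after arming converts the whole array at once.
        if self._decoded is None:
            text = self._raw.decode("ascii")  # percent-encoded data is ASCII
            pairs = parse_qsl(text, keep_blank_values=True,
                              encoding=self._encoding, errors="strict")
            self._decoded = dict(pairs)
            self.conversions += 1
        return self._decoded


query = LazyQuery(b"city=Z%C3%BCrich&lang=de")
first = query.values["city"]            # decoded as UTF-8 on first access
query.set_input_encoding("iso-8859-1")  # re-arm with a different setting
second = query.values["city"]           # same raw bytes, converted again
```

Because the raw bytes are never thrown away, a wrong first guess costs only a second conversion, and a script that never touches the values pays nothing.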
Andrei then lists the advantages of this approach. First, PHP is not forced to guess the encoding of the request during the request parsing stage, which happens before the script is executed. This allows the application to set the expected encoding explicitly or to query other sources for the likely encoding. Second, PHP does not have to decode the request until it is actually necessary, which removes the upfront cost for scripts that do not need the request arrays. Third, if there are conversion errors, they are processed using the same mechanism that PHP employs for other encoding conversions, allowing the application to set a custom conversion error handler, he explains.
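As an analogy for the third point, Python's codecs module lets an application register its own handler for conversion errors. The sketch below is not PHP's mechanism; the handler name record_and_replace and the bad_offsets list are illustrative assumptions. It records where decoding failed and substitutes the Unicode replacement character.

```python
import codecs

bad_offsets = []  # byte offsets at which decoding failed


def record_and_replace(err):
    """Custom conversion error handler: log the offset, emit U+FFFD."""
    if isinstance(err, UnicodeDecodeError):
        bad_offsets.append(err.start)
        # Return the replacement text and the position at which to resume.
        return ("\ufffd", err.end)
    raise err


codecs.register_error("record_and_replace", record_and_replace)

# b"caf\xe9" is ISO-8859-1 data; decoding it as UTF-8 fails on the last byte.
text = b"caf\xe9".decode("utf-8", errors="record_and_replace")
```

The application decides the policy (replace, skip, abort, or log), while the conversion machinery itself stays the same for request data and for any other decoding.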
He also mentions a possible problem with this approach: someone could try to inject bogus data into the request so that, when the application accesses a request array for the first time, the bogus data triggers errors in the conversion process. He defends the approach by saying that the pros outweigh the cons. He also notes that the decoding of the request has nothing to do with filtering. The job of the filter extension is to validate or sanitize the data, and it has to operate on the results of the request conversion, i.e. Unicode strings.