A Security-focused HTTP Primer
What follows is a primer on the key security-oriented characteristics of the HTTP protocol. It’s a collection of sub-topics, explained in my own way, so that there is a single reference point when needed.
- Message-based: You make a request, you get a response.
- Line-based: Lines are quite significant in HTTP. Each header is on an individual line (each line ends with a CRLF), and a blank line separates the header section from the optional body section.
- Stateless: HTTP doesn’t have the concept of state built in, which is why things like cookies are used to track users within and across sessions.
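To make the line-based point concrete, here is a minimal sketch of splitting a raw request into its start line, headers, and body. The request itself (host, path, and form field) is made up for illustration, and the parser ignores edge cases such as repeated headers.

```python
def parse_http_message(raw: bytes):
    """Split a raw HTTP/1.x message into start line, headers, and body."""
    # The blank line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each header sits on its own line as "Name: value".
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body


# Hypothetical request, used only to exercise the parser.
request = (
    b"POST /login HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: 9\r\n"
    b"\r\n"
    b"user=dave"
)

start_line, headers, body = parse_http_message(request)
print(start_line)       # POST /login HTTP/1.1
print(headers["host"])  # example.com
print(body)             # b'user=dave'
```

Note that nothing in the message ties it to any earlier request, which is the stateless property in practice: anything that needs to persist has to be re-sent, typically in a Cookie header.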
- Query Strings (?): A query string begins with the question mark (?) character after the path of the URL being requested, and it carries the data being sent to the web application for processing. Query strings are typically used to pass the contents of HTML forms, and are encoded as name=value pairs. Example: http://google.com/search?query=mysearch
- Parameters (something=something): A parameter is a name, followed by an equals sign (=), followed by a value. In the request below, the parameter name is “q”, presumably indicating what’s being searched for, and its value is “mysearch”. Example: http://google.com/search?q=mysearch
- The Ampersand (&): Ampersands are used to separate a list of parameters being sent to the same form, e.g. sending a query value, a language, and a verbose value to a search form. Example: http://google.com/search?q=mysearch&lang=en&verbose=1
[ Ampersands are not mentioned in the HTTP spec itself; they are used as a matter of convention. ]
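As a quick illustration of how the question mark, the name=value pairs, and the ampersands fit together, this sketch takes the last example URL apart and rebuilds its query string using Python's standard urllib.parse module.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "http://google.com/search?q=mysearch&lang=en&verbose=1"

# Everything after the "?" is the query string.
parts = urlsplit(url)
query = parts.query
print(query)   # q=mysearch&lang=en&verbose=1

# Ampersands separate parameters; "=" separates each name from its value.
params = parse_qs(query)
print(params)  # {'q': ['mysearch'], 'lang': ['en'], 'verbose': ['1']}

# Going the other way: build a query string from name/value pairs.
rebuilt = urlencode({"q": "mysearch", "lang": "en", "verbose": 1})
print(rebuilt)  # q=mysearch&lang=en&verbose=1
```

Each value comes back as a list because the same parameter name may legally appear more than once in a query string.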
URL encoding seems more tricky than it is. It’s basically a workaround for a single rule in RFC 1738, which states that:
…Only alphanumerics [0-9a-zA-Z], the special characters “$-_.+!*'(),” [not including the quotes – ed], and reserved characters used for their reserved purposes may be used unencoded within a URL.
The issue is that humans are inclined to use far more than just those characters, so we need some way of getting the larger range of characters transformed into the smaller, approved set. That’s what URL Encoding does. As mentioned here in a most excellent piece on the topic, there are a few basic groups of characters that need to be encoded:
- ASCII Control Characters: because they’re not printable.
- Non-ASCII Characters: because they’re not in the approved set (see the requirement above from RFC 1738). This includes the upper portion of the ISO-Latin character set (see my encoding primer to learn more about character sets).
- Reserved Characters: these are kind of like system variables in programming. They mean something within URLs, so they can’t be used outside of that meaning.
For any of these characters that can’t (or shouldn’t) be put in a URL natively, the following encoding algorithm must be used to make it properly URL-encoded:
- Find the ISO 8859-1 code point for the character in question
- Convert that code point to two characters of hex
- Prepend a percent sign (%) to the two hex characters
This is why you see so many instances of %20 in your URLs. That’s the URL-encoding for a space.
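The three steps above can be sketched in a few lines of Python. The helper names (percent_encode_char, url_encode) and the SAFE set are my own, not part of any spec, and the sketch only handles characters that exist in ISO 8859-1; anything else raises an error. For comparison, the standard library's urllib.parse.quote does the same job, and by default it encodes UTF-8 bytes instead, which is what current practice (RFC 3986) generally expects.

```python
from urllib.parse import quote

# Characters RFC 1738 allows unencoded: alphanumerics plus $-_.+!*'(),
SAFE = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "$-_.+!*'(),"
)


def percent_encode_char(ch: str) -> str:
    code_point = ch.encode("iso-8859-1")[0]  # step 1: ISO 8859-1 code point
    hex_pair = format(code_point, "02X")     # step 2: two hex characters
    return "%" + hex_pair                    # step 3: prepend a percent sign


def url_encode(text: str) -> str:
    # Reserved characters are always encoded here, i.e. the text is treated
    # as data rather than as URL structure.
    return "".join(ch if ch in SAFE else percent_encode_char(ch) for ch in text)


print(url_encode("my search"))  # my%20search
print(url_encode("café"))       # caf%E9
print(url_encode("a&b=c"))      # a%26b%3Dc

# The standard library, first with ISO 8859-1 and then with its UTF-8 default.
print(quote("café", encoding="iso-8859-1"))  # caf%E9
print(quote("café"))                         # caf%C3%A9
```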
Here are the primary HTTP authentication types:
Basic
- A user requests a page protected by basic auth
- The server sends back a 401 and a WWW-Authenticate header with the value of Basic (along with a realm)
- The client takes the username and password, separated by a colon, and Base64-encodes them
- The client then sends that value in an Authorization header, like so: Authorization: Basic BTxhZGRpbjpbcGAuINMlc2FtZC==
[ As the authors of The Web Application Hacker’s Handbook point out, Basic Authentication isn’t as bad as people make it out to be. Or, to be more precise, it’s no worse than Forms-based Authentication (the most common type). The reason is simple: both send credentials in effectively plain text by default. Basic applies Base64, but that is a reversible encoding rather than encryption, and Forms-based credentials aren’t encoded at all. Either way, the only way for either mechanism to even approach security is by adding SSL/TLS. ]
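Here is a small sketch of how a client builds the Basic header, and how trivially anyone who sees it can reverse it. The function name basic_auth_header is my own, and the credentials are the well-known sample pair from the RFC examples, not real ones.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    # Join the credentials with a colon, then Base64-encode the result.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


header = basic_auth_header("Aladdin", "open sesame")
print(header)  # Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# Base64 is an encoding, not encryption: decoding needs no key at all.
token = header.split(" ")[-1]
print(base64.b64decode(token))  # b'Aladdin:open sesame'
```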
Digest
- A user requests a page protected by digest auth
- The server sends back a 401 and a WWW-Authenticate header with the value of Digest, along with a nonce value and a realm value
- The client concatenates the username, realm, and password (separated by colons) and uses that as input to MD5 to produce one hash (HA1)
- The client concatenates the method and the URI (separated by a colon) to create a second MD5 hash (HA2)
- The client then sends an Authorization header with the username, realm, nonce, URI, and the response, which is the MD5 of HA1, the nonce, and HA2 joined by colons
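The digest calculation is easier to follow in code. This is a minimal sketch of the basic form of the scheme, without the optional qop extension: HA1 is MD5(username:realm:password), HA2 is MD5(method:URI), and the response is MD5(HA1:nonce:HA2). The function names and all the sample values are assumptions for illustration only.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, password, realm, nonce, method, uri):
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # credentials + realm
    ha2 = md5_hex(f"{method}:{uri}")                 # request method + URI
    return md5_hex(f"{ha1}:{nonce}:{ha2}")           # tie both to the nonce


# Sample values only; a real nonce comes from the server's 401 challenge.
response = digest_response(
    username="Mufasa",
    password="Circle Of Life",
    realm="testrealm@host.com",
    nonce="dcd98b7102dd2f0e8b11d0f600bfb0c093",
    method="GET",
    uri="/dir/index.html",
)
print(response)  # 32 hex characters; the password itself is never sent
```

Because the nonce feeds into the final hash, a captured response is only useful for as long as the server keeps accepting that nonce.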
Forms-based Authentication
This is the most common type of web authentication. It works by presenting the user with an HTML form for entering a username and password, and then sending those values to the server for verification. Some things to note:
- The login information should be sent via POST rather than GET
- The POST should be sent over HTTPS, not in the clear
- Ideally, the entire login page itself should be HTTPS, not just the page that the credentials are being sent to
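From the client side, the first two points can be sketched as follows. The function name build_login_request, the example.com URL, and the username/password field names are all hypothetical; a real form defines its own field names. The request is only constructed here, never sent.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request


def build_login_request(url: str, username: str, password: str) -> Request:
    # Refuse to put credentials on the wire in the clear.
    if urlsplit(url).scheme != "https":
        raise ValueError("refusing to send credentials over a non-HTTPS URL")
    # Credentials go in the POST body, not in the query string, so they
    # stay out of server logs, browser history, and Referer headers.
    body = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_login_request("https://example.com/login", "alice", "s3cret")
print(req.get_method())  # POST
print(req.full_url)      # https://example.com/login
print(req.data)          # b'username=alice&password=s3cret'
```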
Shown below is a typical structure of a login form (this one from wordpress.com):