
I am making a website with articles, and I need the articles to have "friendly" URLs, based on the title.

For example, if the title of my article is "Article Test", I would like the URL to be http://www.example.com/articles/article_test.

However, article titles (like any string) can contain special characters that can't appear literally in a URL. For instance, I know that ? and # need to be replaced, but I don't know all the others.

What characters are permissible in URLs? What is safe to keep?

  • There was a similar question, here. Check it out, you may find some useful answers there also (there were quite a lot of them). – Rook (Mar 29, 2009)
  • I reworded the question to be more clear. The question and answers are useful and of good quality. (48 people, including me, have favorited it.) In my opinion, it should be reopened. (Nov 17, 2020)

13 Answers


To quote section 2.3 of RFC 3986:

Characters that are allowed in a URI, but do not have a reserved purpose, are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

  unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

Note that RFC 3986 lists fewer reserved punctuation marks than the older RFC 2396.
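To see the unreserved set in action, here is a minimal C# sketch. It assumes modern .NET, where Uri.EscapeDataString percent-encodes everything outside exactly this set (older Framework versions also left a few extra marks such as !*'() unescaped):

using System;

class UnreservedDemo
{
    static void Main()
    {
        // Letters, digits, "-", ".", "_" and "~" pass through untouched;
        // everything else is percent-encoded.
        Console.WriteLine(Uri.EscapeDataString("Article Test?")); // Article%20Test%3F
    }
}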

  • @Skip Head, does "characters" include Latin-encoded characters like ç and õ? – Mohamad (Jun 10, 2011)
  • @Mohamad: No, ASCII only, although UTF-8 support is getting better. (Jun 19, 2011)
  • @Mohamad: The last part there will get changed under the hood to post-title-with-%C3%A7-and-%C3%B5, but it will still display in the user's location bar as post-title-with-ç-and-õ. (Jun 19, 2011)
  • Your readers are Portuguese, so use Portuguese characters. (Jun 19, 2011)
  • As the referenced document is quite old, and this post too: is this still valid, or is there an updated document? – prasingh (May 31, 2019)

There are two sets of characters you need to watch out for: reserved and unsafe.

The reserved characters are:

  • ampersand ("&")
  • dollar ("$")
  • plus sign ("+")
  • comma (",")
  • forward slash ("/")
  • colon (":")
  • semi-colon (";")
  • equals ("=")
  • question mark ("?")
  • 'At' symbol ("@")
  • pound ("#")

The characters generally considered unsafe are:

  • space (" ")
  • less than and greater than ("<>")
  • open and close brackets ("[]")
  • open and close braces ("{}")
  • pipe ("|")
  • backslash ("\")
  • caret ("^")
  • percent ("%")

I may have forgotten one or more, which leads me to echo Carl V's answer. In the long run you are probably better off using a "white list" of allowed characters and then encoding the string, rather than trying to stay abreast of which characters are disallowed by servers and systems.
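A minimal C# sketch of that whitelist-and-encode approach (the method name and the exact whitelist are my own choices, not a standard):

using System;
using System.Text.RegularExpressions;

class WhitelistSlug
{
    // Keep only unreserved characters; collapse every other run into one hyphen.
    static string Slugify(string title) =>
        Regex.Replace(title.ToLowerInvariant(), @"[^a-z0-9._~-]+", "-").Trim('-');

    static void Main() =>
        Console.WriteLine(Slugify("Article Test: Part #2")); // article-test-part-2
}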

  • # is a reserved character used for bookmarks on a specific page, created by having one HTML element with a matching name or id attribute (sans the # symbol). (Aug 12, 2014)
  • Others seem to disagree that the tilde ~ is unsafe. Are you sure it is? – drs (Jun 15, 2015)
  • A whitelist is not so good when handling languages other than English. Unicode simply has too many acceptable code points. Therefore, blacklisting the unsafe ones is likely to be the easiest to implement in regular expressions. – Patanjali (Nov 26, 2015)
  • Tilde ~ seems to be safe: "Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde. unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"", from ietf.org/rfc/rfc3986.txt (Jun 18, 2016)
  • I've made a working regex based on this answer here: regex101.com/r/9VBu66/1, with the following notes: 1. the first part blacklists non-ASCII characters, so you'd need to remove that if you want to support Unicode, and 2. I don't blacklist / because I am allowing subdirectories. This is the regex I'm using: /([^\x00-\x7F]|[&$\+,:;=\?@#\s<>\[\]\{\}|\\\^%])+/ – andyvanee (Dec 2, 2020)

Always Safe

In theory and by the specification, these are safe basically anywhere, except the domain name. Percent-encode anything not listed, and you're good to go.

    A-Z a-z 0-9 - . _ ~ ( ) ' ! * : @ , ;

Sometimes Safe

Only safe when used within specific URL components; use with care.

    Paths:     + & =
    Queries:   ? /
    Fragments: ? / # + & =
    

Never Safe

According to the URI specification (RFC 3986), all other characters must be percent-encoded. This includes:

    <space> <control-characters> <extended-ascii> <unicode>
    % < > [ ] { } | \ ^
    

If maximum compatibility is a concern, limit the character set to A-Z a-z 0-9 - _ . (with periods only for filename extensions).

Keep Context in Mind

Even if valid per the specification, a URL can still be "unsafe" depending on context, such as a file:/// URL containing invalid filename characters, or a query component containing "?", "=", and "&" when not used as delimiters. Correct handling of these cases is generally up to your scripts and can be worked around, but it's something to keep in mind.
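A short C# sketch of that "keep context in mind" point: a character that acts as a delimiter in one component must be escaped when it appears inside a value (the URL and value here are made-up examples):

using System;

class ComponentEncoding
{
    static void Main()
    {
        // '&' and '=' act as delimiters in this query string, so the value
        // itself has to be percent-encoded before it is inserted.
        string value = "a&b=c";
        Console.WriteLine("https://www.example.com/articles?tag=" + Uri.EscapeDataString(value));
        // https://www.example.com/articles?tag=a%26b%3Dc
    }
}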

  • Could you provide any sources for your second claim ("Sometimes Safe")? In particular, I believe you are wrong in saying that = is not safe for queries. For example, FIQL accepts equals signs and describes itself as being "URI-friendly" and "optimised and intended for use in the query component". In my interpretation, RFC 3986 explicitly allows "=", "&", "+" and others in queries. – DanielM (Nov 26, 2019)
  • @DanielM: "?", "=", and "&" are valid in queries per the spec, though in practice they're widely used for parsing name-value pairs within the query, so they can be unsafe as part of the names/values themselves. Whether or not this constitutes "unsafe" may be a matter of opinion. – Beejor (Jan 5, 2020)
  • Some sources, as requested: (1) RFC 3986, Sec. 3.4: "[...] query components are often used to carry identifying information in the form of 'key=value' pairs [...]" (2) WHATWG URL spec, Sec. 6.2: "Constructing and stringifying a URLSearchParams object is fairly straightforward: [...] params.toString() // "key=730d67"" (3) PHP manual, http_build_query: "Generate URL-encoded query string. [...] The above example will output: 0=foo&1=bar [...]" (4) J. Starr, Perishable Press: "When building web pages, it is often necessary to add links that require parameterized query strings." – Beejor (Jan 5, 2020)
  • @Beejor: I am constructing a URL and I use '-' and ';' during construction. It is not a web app but a mobile app. I'm not a web developer, hence: would I be safe if I use the above two characters in the Path property? learn.microsoft.com/en-us/dotnet/api/… – Filip (Feb 15, 2020)
  • @karsnen: Those are valid URL characters. Though if used to reference paths on a local filesystem, keep in mind that some systems disallow certain characters in filenames. For example, "file:///path/to/my:file.ext" would be invalid on Mac. – Beejor (Feb 17, 2020)

It is best to keep only certain characters (a whitelist) rather than to remove certain characters (a blacklist).

You can technically allow any character, just as long as you properly encode it. But, to answer in the spirit of the question, you should only allow these characters:

  1. Lower case letters (convert upper case to lower)
  2. Numbers, 0 through 9
  3. A dash - or underscore _
  4. Tilde ~

Everything else has a potentially special meaning. For example, you may think you can use +, but it can be replaced with a space. & is dangerous, too, especially if using some rewrite rules.

As with the other comments, check out the standards and specifications for complete details.
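As a quick illustration of the point about + above, here is a C# sketch; WebUtility.UrlDecode follows the x-www-form-urlencoded convention, in which + means space:

using System;
using System.Net;

class PlusDemo
{
    static void Main()
    {
        // '+' decodes to a space, so a literal plus must travel as %2B.
        Console.WriteLine(WebUtility.UrlDecode("c+%2B%2B")); // prints "c ++"
    }
}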

  • A period, I discovered today, is a bad choice of character to use for a URL-safe Base64 encoder, because there will be those rare cases where your encoded data may produce two consecutive dots (".."), which is significant in that it refers to the parent directory. – pohl (May 3, 2011)
  • @pohl: that's only a problem if your URL is used as a file path, either in your code or if your web server actually tries to map the URL to files before forwarding the request to a script (unfortunately very common). (May 6, 2011)
  • Actually, in our case using it as a file path would be OK, since on Unix files are allowed to have multiple, even consecutive, dots in their names. For us, the problem arose in a monitoring tool called SiteScope, which has a bug (perhaps a naive regex) and was reporting spurious false downtimes. We are stuck on an old version of SiteScope, the admin team refuses to pay for an upgrade, and one very important client has SiteScope (not an equivalent) written into their contract. Admittedly, most won't find themselves in my shoes. – pohl (May 7, 2011)
  • Thank god that someone posted a list without much blabbering. As for the dot (.): as @pohl said, do not use it! Here is another weird case on IIS (I don't know if this happens on other web servers): if it is at the end of your URL, you'll most likely get a 404 error (it'll try to search for a [/pagename]. page). – nikib3ro (Jun 1, 2012)
  • Can you rephrase "You are best keeping"? (Jan 13, 2021)

Looking at RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax, your question revolves around the path component of a URI.

    foo://example.com:8042/over/there?name=ferret#nose
     \_/   \______________/\_________/ \_________/ \__/
      |           |            |            |        |
   scheme     authority       path        query   fragment
      |   _____________________|__
     / \ /                        \
     urn:example:animal:ferret:nose

Citing section 3.3, valid characters for a URI segment are of type pchar:

  pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"

Which breaks down to:

  unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
  pct-encoded = "%" HEXDIG HEXDIG
  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Or in other words: within a path segment you may use any of the characters above literally; everything else (including the space and /, ?, #, [ and ]) must be percent-encoded.

This understanding is backed by RFC 1738 - Uniform Resource Locators (URL).
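For illustration, here is a C# sketch that turns the pchar grammar above into a validation regex (the regex is my own transcription of the ABNF, not something from the RFC):

using System;
using System.Text.RegularExpressions;

class PcharDemo
{
    // One pchar: an unreserved or sub-delims character, ":", "@",
    // or a percent-encoded octet such as %20.
    static readonly Regex Segment =
        new Regex(@"^(?:[A-Za-z0-9._~!$&'()*+,;=:@-]|%[0-9A-Fa-f]{2})*$");

    static void Main()
    {
        Console.WriteLine(Segment.IsMatch("article_test")); // True
        Console.WriteLine(Segment.IsMatch("article test")); // False: space must be %20
    }
}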

  • This is a great example of a theoretically correct answer that leads to trouble when applied to the real world we actually live in. It is true that most of those characters will not cause a problem most of the time. But there exist in the real world things like proxies, routers, gateways, relays, etc., all of which "love" to inspect and interact with URLs in ways that disregard the theoretical standard. To avoid these pitfalls, you're pretty much limited to escaping everything except alphanumerics, dash, underscore, and period. (Dec 14, 2015)
  • @deltamind106: Can you provide examples and/or references to clarify which of those characters that are safe according to the RFCs are in fact not? I'd prefer to stick to the facts backed by standards in my answer, and I'm happy to update it if you can pinpoint any facts I may have neglected. – Philzen (Dec 14, 2015)
  • @deltamind106: I'd suggest we try to get products to follow the standards rather than tell devs not to. I consider your warning merited, but we should do our part in reporting non-compliance to vendors if necessary. – Lo-Tan (May 11, 2016)
  • @Philzen: I am constructing a URL and I use '-' and ';' during construction. It is not a web app but a mobile app. I'm not a web developer, hence: would I be safe if I use the above two characters in the Path property? learn.microsoft.com/en-us/dotnet/api/… – Filip (Feb 15, 2020)
  • @karsnen: Yes, of course - and ; are safe; that's what my answer and the RFC clearly state. – Philzen (Feb 22, 2020)

From the context you describe, I suspect that what you're actually trying to make is called an 'SEO slug'. The generally accepted best practice for those is:

  1. Convert to lower-case
  2. Convert entire sequences of characters other than a-z and 0-9 to one hyphen (-) (not underscores)
  3. Remove 'stop words' from the URL, i.e. not-meaningfully-indexable words like 'a', 'an', and 'the'; Google 'stop words' for extensive lists

So, as an example, an article titled "The Usage of !@%$* to Represent Swearing In Comics" would get a slug of "usage-represent-swearing-comics".
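A small C# sketch of those three steps (the stop-word list here is a tiny illustrative sample, not an authoritative list):

using System;
using System.Linq;
using System.Text.RegularExpressions;

class SeoSlug
{
    // Illustrative sample only; real stop-word lists are much longer.
    static readonly string[] StopWords = { "a", "an", "the", "of", "in", "to" };

    static string Slugify(string title) =>
        string.Join("-",
            Regex.Replace(title.ToLowerInvariant(), "[^a-z0-9]+", "-")
                 .Trim('-')
                 .Split('-')
                 .Where(w => !StopWords.Contains(w)));

    static void Main() =>
        Console.WriteLine(Slugify("The Usage of !@%$* to Represent Swearing In Comics"));
    // Output: usage-represent-swearing-comics
}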

  • Is it really a good approach to remove these "stop words" from the URL? Would search engines penalize a website because of this? – Paulo (Mar 30, 2009)
  • Search engines are generally believed to only acknowledge some portion of the URL and/or to give reduced significance to later portions, so by removing stop words you're maximizing the number of keywords you embed in your URL that have a chance of actually ranking. – chaos (Mar 30, 2009)
  • @chaos Do you still recommend stripping stop words, taking into account this: seobythesea.com/2008/08/google-stopword-patent? Also, can you recommend a good list of stop words? This is the best list I've found so far: link-assistant.com/seo-stop-words.html – nikib3ro (Jun 1, 2012)
  • @kape123 That doesn't look like a very good list to me. "c" and "d" are programming languages, and a lot of those other words also look significant. I'd probably just strip the basic ones: a, and, is, on, of, or, the, with. – mpen (Feb 2, 2016)

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

  • Doesn't "ALPHA" imply "DIGIT"? I assume ALPHA is short for "alphanumeric", and alphanumeric means uppercase, lowercase, and digits. – Luc (Jun 4, 2013)
  • Actually, alpha doesn't imply alphanumeric. Alpha and numeric are two distinct things, and alphanumeric is the combination of them. He could have written his answer like so: ALPHANUMERIC / "-" / "." / "_" / "~" – MacroMan (Sep 3, 2013)
  • The ABNF notation for 'unreserved' in RFC 3986 lists them separately. – Patanjali (Nov 26, 2015)

From an SEO perspective, hyphens are preferred over underscores. Convert to lowercase, remove all apostrophes, and then replace all non-alphanumeric runs of characters with a single hyphen. Trim excess hyphens from the start and end.
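A compact C# sketch of that recipe, showing why apostrophes are removed first: "don't" should become "dont", not "don-t" (the method name is mine):

using System;
using System.Text.RegularExpressions;

class HyphenSlug
{
    static string Slugify(string title) =>
        Regex.Replace(title.Replace("'", string.Empty).ToLowerInvariant(),
                      "[^a-z0-9]+", "-")
             .Trim('-');

    static void Main() =>
        Console.WriteLine(Slugify("Don't Panic!")); // dont-panic
}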

  • Why are hyphens preferred over underscores? What is the explanation? (Jan 26, 2021)
  • @PeterMortensen studiohawk.com.au/blog/…, or maybe better, ecreativeim.com/blog/index.php/2011/03/30/…: "Google treats a hyphen as a word separator, but does not treat an underscore that way. Google treats an underscore as a word joiner, so red_sneakers is the same as redsneakers to Google." – mpen (Jan 26, 2021)
  • As per the latest Google guidelines, Google treats a hyphen as a word separator, but does not treat an underscore that way. takesurvery.com – Milan Soni (Nov 17, 2022)

The format of a URI is defined in RFC 3986. See section 3.3 for details.


I had a similar problem. I wanted to have pretty URLs and reached the conclusion that I have to allow only letters, digits, - and _ in URLs.

That is fine, but then I wrote some nice regex and realized that it does not recognize all UTF-8 letters as letters in .NET, so I was stuck. This appears to be a known problem for the .NET regex engine. So I got to this solution:

private static string GetTitleForUrlDisplay(string title)
{
    if (string.IsNullOrEmpty(title))
    {
        return string.Empty;
    }

    // Route everything outside A-Z, a-z, 0-9, "_" and "-" through CharacterTester,
    // then trim stray hyphens, collapse runs of hyphens, and lower-case the result.
    string slug = Regex.Replace(title, @"[^A-Za-z0-9_-]", new MatchEvaluator(CharacterTester));
    return Regex.Replace(slug.Trim('-'), "-+", "-").ToLower();
}

/// <summary>
/// All characters that do not match the pattern are routed to this method. This is useful for
/// Unicode characters, which the ASCII character class above does not cover: char.IsLetterOrDigit()
/// handles them correctly, so we keep (and lower-case) what it approves and return "-" for everything else.
/// </summary>
/// <param name="m"></param>
/// <returns></returns>
private static string CharacterTester(Match m)
{
    string x = m.ToString();
    if (x.Length > 0 && char.IsLetterOrDigit(x[0]))
    {
        return x.ToLower();
    }
    else
    {
        return "-";
    }
}
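For what it's worth, a hypothetical call to the helper above (the input string is my own example): because char.IsLetterOrDigit() accepts Unicode letters, accented characters survive, unlike with the ASCII-only whitelists in earlier answers.

// Hypothetical usage; accented letters pass char.IsLetterOrDigit() and are kept:
// GetTitleForUrlDisplay("Câmara dos Deputados!") -> "câmara-dos-deputados"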

I found it very useful to encode my URL into a safe one when I was returning a value through Ajax/PHP to a URL that was then read by the page again.

PHP output with URL encoder for the special character &:

// PHP returning the success information of an Ajax request
echo "".str_replace('&', '%26', $_POST['name']) . " category was changed";

// JavaScript sending the value to the URL
window.location.href = 'time.php?return=updated&val=' + msg;

// JavaScript/PHP executing the function printing the value of the URL,
// now with the text normally lost in space because of the reserved & character.

setTimeout("infoApp('updated','<?php echo $_GET['val'];?>');", 360);

I think you're looking for something like "URL encoding" - encoding a URL so that it's "safe" to use on the web.

Here's a reference for that. If you don't want any special characters, just remove any that require URL encoding:

HTML URL Encoding Reference


Between 3 and 50 characters. Can contain lowercase letters, numbers, and the special characters dot (.), dash (-), underscore (_), and at sign (@).

  • Any reference for that? – dakab (Feb 23, 2016)
