In a way this is an argument for languages where it's normal to have result-types [0] or exceptions, instead of -1 etc. It just feels wrong to have any potential collision or confusion between normal results and error results.
In other words, this is a case of https://en.wikipedia.org/wiki/In-band_signaling versus out-of-band signaling https://en.wikipedia.org/wiki/Signaling_(telecommunications)... .
Clears throat the precise wiki page you're looking for is the Semipredicate Problem
-1 is in-band signaling; but it steals a value.
We all know that types are not Pythonic.
(I’m only mostly joking)
Ackchyually, Python has pretty strong typing, as far as dynamic languages go.
If you use the actual type system and something like mypy, Python is a joy work with. (For my definition of joy, which includes static typing).
Actually writing Python is reasonably nice in that case.
But dealing with Python infrastructure is so awful as to make the whole experience just bad.
uv fixes a lot of that, but I think it will be some time before it's used everywhere, and I have zero hope that the Python devs will ever do the right thing and officially replace Pip with uv.
you also have to trawl through the flags to make sure it actually checks types, e.g. check_untyped_defs
Sure. Everything is a PyObject.
Every PyObject structure has a ob_type pointer.
Every Object in Java has a getClass method. If we changed all the static type information in Java to the `Object` type, that'd be pretty close to Python's case.
GP's post is probably the "unityped" critique of dynamically typed languages by Robert Harper: https://existentialtype.wordpress.com/2011/03/19/dynamic-lan...
Thank you for sharing. I hadn't read that critique but I am in wholehearted agreement. Dynamic typing imposes significant cognitive overhead in exchange the privilege of letting you writing incorrect programs.
1. But, Python has exceptions! The underlying C language doesn't, but Python's run-time has them and can use them in the C code.
2. It may be an argument for ensuring that absolutely everything that is an object can hash: the object hasher must not have error states. Nobody wants the overhead of a simple hash code being wrapped in a result type.
Not really, I don’t think. Python code doesn’t ever really use hash() for anything where specific outputs are expected. Even if hash(a)==hash(b), it’s not implied that a==b or a is b.
But -2 seems like just a bad choice here. -1 and -2 are very "common" integers. Seems they could have just picked a quasi-random large integer to reduce the likelihood of collision. I expect hash functions to have collisions, but I expect it to be "unlikely" in a way that scales with the size of the hash.
I don't think that matters here, though. This specific hash function is literally used just for putting values in dict buckets. The docs say:
> Hash values are integers. They are used to quickly compare dictionary keys during a dictionary lookup.
They're not to be used as a crypto digest. The downside here, then is that if -1 and -2 are used as dict keys, they'll end up bucketed together. Apparently that hasn't been enough of an issue over the years to bother changing it. One advantage is that those values are common enough that it might break code that incorrectly expects hash() to have predictable outputs. If it used 0xffffffff as the value, lots of busted tests may never stumble across it.
> The downside here, then is that if -1 and -2 are used as dict keys, they'll end up bucketed together.
Note that they'll only end up bucketed together in the underlying hashmap. The dictionary will still disambiguate the keys.
>>> a = -1 >>> b = -2 >>> {a: a, b: b} {-1: -1, -2: -2}
Yep, absolutely. End programmers will never see that unless they go digging into the CPython code. It's otherwise invisible.
Not only bucketed together, which means they could still be disambiguated in C by the full hash, but they'll actually share the full hash, which means they'd have to jump back into Python code to check `__eq__` to disambiguate IIUC. That seems like a huge perf issue, is it not?
Without looking it up, I think that's right, but... that's just kind of inevitable with hash tables. Unless you have a small number of keys and get very lucky, it's exceedingly likely that you'll have multiple keys in the same bucket, and you'll have to use whatever language's `__eq__` equivalent mechanism.
A mitigating factor is that dict keys have to be hashable, which implies immutability, which generally implies a small number of native, fast types like str, numbers, or simple data structures like tuples. You could have some frozendict monstrosity as keys, but 1) that's not something you see often, and 2) don't do that. It you must, define a fast __eq__. For example, an object mapping to a database row might look like:
But again, that's just not really something that's done.def __eq__(self, other): if self.pk != other.pk: return False # Fail quickly # If that's true, only then do a deep comparison return self.tuple_of_values == other.tuple_of_values
Hash *keys* are 8 bytes (assume 64-bit machine). But the Set *buckets*[1] are a masked subset of that, starting at 3 bits and growing with the size of the set. Meaning if you store 16, 8, 32, 24, 0 all in the same set, they'll be in the same bucket (0b000)*. But, they'll still have different hash *keys*, which are stored in the hash table along with the object id. Then if you look up whether 40 is in that set, the C code will look in 0b000 bucket, see the other five values there, but note that each has a different hash *key*, since those are also stored in the table. So it'll know that 40 is not in the set without having to jump up into `__eq__`**.
[1] https://github.com/python/cpython/blob/main/Objects/setobjec...
* Note that the actual implementation will place bucket conflicts in the next available bucket, not the typical "overflow" linked list from CS 102, and does lookups in the same fashion. This approach makes better use of CPU caches.
** For `int` specifically, it's a C class and has no `__eq__` method. But the idea is the same if you have a class with member `i:int` and `__hash__(self): self.i`.
I bet -2 is way less common than -1 in a typical codebase, particularly a C one.
Are they? At least as keys for most dictionaries. Zero and positive integers sure, but negative ones would be somewhat rare. I could maybe see something like error handling, but then do you need performance there?
Yes, but having result types would avoid the need for this special casing.
Somewhat related, hash() returns a result % sys.hash_info.modulus, so if you build a set or a dict with keys multiple of sys.hash_info.modulus, you run into quadratic behavior.
>>> {i * sys.hash_info.modulus for i in range(5000)} 0.40s >>> {i * sys.hash_info.modulus for i in range(50000)} 29.34s >>> {i for i in range(50000)} 0.06s
Naturally, this extends to custom types:
>>> class X: ... def __hash__(self): return -1 ... >>> hash(X()) -2
The next k for which hash(k) != k: 2**61-1
print(hash(2**61 - 2)) 2305843009213693950 print(hash(2**61 - 1)) 0
The fact that a C function really only has its return value for communicating success or error status is why most fallible C functions use mutable parameters, also known as output arguments, for their result value or values. This is like, elementary C program design.
In this case however it can be partly justified as a space saving mechanism, as that hash has to be also cached in the object. It's like Rust's `Option<NonZeroU64>` done manually.
What I’m saying is either the value or the error can be discarded. If it was an error, the exception mechanism can be used since at that point you’re in the Python wrapper. It would probably be a little slower though, but Python is already slow. That way the error doesn’t need to be in the API at all.
Another question is why do ints largely hash to themselves? Most hashing guides recommend hashes be far more distributed IIUC.
There could be an issue whereby some attacker controls the hash inputs which happen to be integers, and chooses a sequence which collides to the same hash value. It would be easy to identify such a sequence.
It just doesn't seem to have been identified as a real isssue, probably because the most commonly hashed external data is character strings. (For those, there are countermeasures like a properly scrambled hashing function, modified by a seed that can be randomized.)
Python's `__hash__` used primarily for hash map bucketing. In that context, the identity of an integer is a perfectly cromulent bucketing value.
- [deleted]
If your keys are multiples of the table size then all your keys will hash to the same bucket, in a naive hash table at least.
More explicitly, since buckets in Python's Set implementation are the bottom N bits of the hash (where N depends on set size), this ensures linear sequences of ints (which are the most common type of int set) are all in different buckets, so zero collisions. This is the case except when the differences between integers in the set are multiples of powers of 2. But even then, the way Set is implemented, you don't see perf start to regress until like 2**20 or so. If everything in the set is multiples of 2**20, then they start to underperform better-distributed hashes. But it never gets more than like twice as slow.
So, it ends up being a matter of optimizing for the common case, but still being reasonably performant for the worst case.
> So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value.
Well, you can also use an errno-like system. It has its own set of drawbacks as well, but it removes the "I need to reserve a sentinel value" problem.
Functions using errno typically still use a return value of -1 to indicate an error has occurred, then errno just stores the specific error code. Otherwise, how do you even know that you need to check errno?
There are examples of such an ambiguity in the ISO C library itself.
In those situations, the application must begin by clearing errno to zero.
Then, it checks for a certain error value or values from the function which are ambiguous: they could be legit or indicate an error.
If one of those values occurs, and if errno is nonzero, then the error occurred.
This is how you deal with, for instance the strtol (string to long int) function. If there is a range error, strtol returns LONG_MIN or LONG_MAX. Those are also valid values in the range of long, but when no error has occurred, they are produced without errno being touched.
strtol can also return 0 in another error case, when the input is such that no conversion can be performed. ISO C doesn't require errno to be set to anything in this case, unfortunately. The case is distinguished from a legitimate zero by the original pointer to the string being stored in *endptr (if the caller specifies endptr that is not null).
ISO C and POSIX library functions do not reset errno to zero. They either leave it alone or set it to a nonzero value.
If you need to use the above trick and are working inside a C library function, you have to save the original errno value before storing a zero in it, and then put that value back if no error has happened.
You could always check errno and reset it to zero before any function call.
Out params are thing too.
Also you can longjmp().
>it has to return that error as its return value.
This is the kind of crap ignorant C developers do and then pretend is the language's fault.
You can jump, you can return a reference to a complex type, a tuple, whatever.
I enjoyed this, because it seems to be written by an interested and interesting human, at a level that I could actually follow.
Object hashes that can fail or be unimplemented, where lower-level hashes have to be aware of this and stay away from the value that the parent uses for error signaling?
What else is unmitigated shit in Python? :)
Why not instead have
hash(-1)=-2,
hash(-2)=-3,
hash(-3)=-4,
and so on?
>CPython is written in C, and unlike Python, C doesn’t have exceptions. So, in C, when you design a function, and you want your function to be able to indicate “there was an error”, it has to return that error as its return value
This is just wrong. C doesn't feature 'error handling' as a dedicated form of branching the way many higher level languages do and you're in no way required to use return codes as an error signal. This is a case of bad API design and is entirely python's fault.
This seems like a loaded footgun: anyone writing a custome hash function has to know about -1 being treated like an error and handle it in a similar way.
I can’t think of any other language where this kind of thing happens, which means other developers won’t expect it either.
I can see the bug report now: “certain records cause errors, occurs only for about one in a few billion.”
If you're writing a custom hash function in python for a custom python class, then this won't affect you since it's only treated as an error in the underlying C code.
If you're adding a new type to the core python language then you have to be aware of this, but if you're hacking the C implementation to change the core language then you're probably pretty well versed in the cpython internals, or at least surrounded by people that are.
interestingly enough though, it doesn't seem to be possible to write a custom hash function that returns -1
So if you're doing something where you have a custom __hash__ function that you're expecting to return -1 for a certain value and then are testing for value of the hash of an object rather than testing the property of the object directly, then this might bite you. But I cannot think of any reasonable case where you might want to do that.>>> class Foo: def __hash__(self): return -1 >>> f = Foo() >>> print(hash(f)) -2
You can write a custom hash method that returns -1 (and the one you wrote verifiably does.)
The standard (not custom) hash function, in CPython, which calls your custom method and returns the actual value CPython uses for bucketing, will return -2 in that case, though.
Which is necessary, so that, e.g., following the single mathematical function for hashing of numeric types published in the Python Language Reference will preserve x==y implies hash(x)==hash(y) for custom and build in numeric types in CPython where hash(-1) does not follow that pattern.
Even though for -1 itself, the divergence from the rule is in __hash__ itself. This is also consistent with the basic rule that dunder methods are for communication with the standard machinery, but external consumers should use the standard machinery, not the dunder method.
Perhaps the worry was that because the observed behavior up until now was that hash never returns -1, if it were possible possible to write a custom hash function that returns -1, that might break some unsuspecting code due to Hyrum's Law.
If the built-in hash function returned -1 when a custom __hash__ method did so, then custom numeric classes following the hashing guidelines in the Python Language Reference designed to maintain the x==y => hash(x)==hash(y) invariant would fail to maintain that invariant on the reference implementation of Python due to a CPython implementation detail, which would be a Bad Thing™.
I don't think so. This is true for people writing cpython. But if you implement custom `__hash__` method, you should be fine.
>>> d = {-2: None}
>>> -1 in d
False
What gives?
Because -1 is not equal to -2. They will fall into the same bucket of the hash map, but the hash collisions are expected and must be accounted for.
Since hashes aren't guaranteed to be unique, a dictionary should use equality to actually confirm that the object with the given hash matches the key present in the dictionary.
> The __hash__() method should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects ...
-- https://docs.python.org/3/reference/datamodel.html#object.__...
> The general contract of hashCode is:
> * If two objects are equal according to the equals method, then calling the hashCode method on each of the two objects must produce the same integer result.
> * It is not required that if two objects are unequal according to the equals method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
-- https://docs.oracle.com/en/java/javase/23/docs/api/java.base...
If hashmaps only looked at hash(k1) == hash(k2) and not also k1 == k2 they would be quite useless as maps.
- [deleted]
Because the hash function is only reliably useful for laying out dict keys, and its outputs are totally opaque and coincidental.
The pigeonhole principle says there are an infinite number of inputs with the same hash output. Trying to figure out how that happens can be fun and enlightening, but why? “Because it just does, that’s all.”
I think this is not constructive criticism. There's a good, clear cut reason for this specific overlap (error handling), saying it collides just because is not really useful.
But even that’s an implementation detail that happens to be true today, but could easily disappear tomorrow. There’s nothing in the docs saying it has to be that way, or that you can infer anything from a hash output other than that it’s going to be consistent within a single execution of that Python program.
This article is not about the API contract of the hash function, or the abstraction it provides. If you are just trying to hash things, you don't need any info here.
It's very much trying to go _under_ the abstraction layer to investigate its behavior. Because it's interesting.
This is very similar to how people investigate performance quirks or security issues.
I read the article and it's interesting! I learned something from it. From an end user POV it explains the mechanism behind how hash(-1)==hash(-2), which is neat. There's not really a why behind it, though. It wasn't really planned or designed for -1 to have an unexpected hash value. That's just the value the devs randomly picked as a flag. It could've been -837 just as easily.
If the article title had been "why does the hash function never return -1", you would agree it's not a random behavior and serves a very clear purpose. It just so happens that the original author didn't know that until they did all the digging (nice job by the way).
It could have been a bug. There's nothing wrong with the question and seeking answers to it. It was an interesting read too.
A better question might be, why are people still using OO dynamic languages without any typing?
> without any typing?
I rest my case.>>> "abc" + 321 Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: can only concatenate str (not "int") to str
Not sure. Is anyone using a language like that? Python certainly isn’t one.
> Python certainly isn’t one.
I am not a Python expert, but...
Python is described as OO and I thought is is dynamically typed
Not quite "without any typing" but close
There are different definitions of most of the terms around typing which make most typing conversations confusing as people talk past each other.
Particularly, there is a common definition of “typed” which is exactly equivalent to “statically typed” under which all so-called “dynamically typed” languages are actually “untyped”, and within that system strong and weak typing are either a meaningless distinction or one within the set of statically types languages.
There's also, of course, a common usage within which “dynamically typed” is meaningful and strong vs. weak typing is a separate distinction, usually mostly discussed within languages with dynamic typing, though languages like C being both statically typed and weakly typed has been discussed.
Not really. Python's strongly typed, but dynamically. You can think of Python variables as untyped references to strongly typed values: the values are typed, but their names are not.
> untyped references to strongly typed values
I would call that weakly typed since there is no step before your code runs that validates (or even tries to validate) that your program satisfies a "type system". But since the definition of "strong" and "weak" typing has always been vague, I will just say that it is stupidly typed[1], and that I am a pedantic nerd.
1. https://danieltuveson.github.io/programming/languages/stupid...
This is just hair splitting. What everyone means by "typing" today is typing members in structures and co. Every time you reach for python after a typed language, it feels like going through a barbed minefield blindfolded. The fact that its "world" almost entirely consists of little moving subclasses in subnamespaces doesn't help either. All you can think of python is how to write as few lines of it as possible to return to DX faster. It is my only use case for AI-based development because I cannot stand an extra minute in this abomination without cursing into the editor.
It's not hairsplitting, and it's not what everyone means, just the people you talk to. Languages without typing are things like assembly language(s), which underlie everything you do on a computer and is (are) still in the TIOBE Index's top 20 https://www.tiobe.com/tiobe-index, as well as much more obscure languages. The distinction between untyped languages and dynamically typed languages is night and day.
I'm pretty competent in Python, C, Java, JavaScript, assembly, and a number of other languages†, and my experience in Python is nothing like what you're describing, although I do have my own frustrations with it. It's possible that when you're more experienced you'll have a different perspective, but it sounds like you're stuck inside a pretty small bubble right now.
______
† Last year I also programmed in C++, Lua, C#, bash, Tcl, Emacs Lisp, Forth, Golang, OCaml, Scheme, Perl, Common Lisp, and the ngspice scripting language.
I have a similar list, no need to worry about my YoE in various things. I left full-time python many years ago. You're right about my current bubble though. I find it productive and useful, compared to python which I have to touch from time to time. My worst gripe is that ML is my new interest and it is all python. I'm basically screwed.
> Every time you reach for python after a typed language, it feels like going through a barbed minefield blindfolded.
Using both Python and mandatorily statically typed languages regularly professsionally (my main working languages being C#, Python, and a mix of JS and TS), that's not my experience at all.
(Of course, Python has a variety of optional static typecheckers with differing degrees of type inference, as well as supporting explicit type specifications; its not untyped unless you choose to use it that way.)
I agree with your last sentence, but I think we shouldn't say python itself is "strongly typed".
Why, specifically? It's commonly described as strongly typed and I'm not aware of an argument against that.
Is the specific reason not clear? You just said it. The variables are untyped.
Strong versus weak is not super solidly defined as referring only to values and not variables.
> Is the specific reason not clear?
Correct.
> Strong versus weak is not super solidly defined as referring only to values and not variables.
You're right. It's not super solidly defined as referring only to variables and not values, either. If dynamic typing is the same as weak typing, there'd be no point in having both phrases.
>> Is the specific reason not clear?
> Correct.
Is it still unclear with the clarification you didn't quote? I can't tell if I need to explain more or not.
> It's not super solidly defined as referring only to variables and not values, either.
I didn't mean to suggest it was.
> If dynamic typing is the same as weak typing, there'd be no point in having both phrases.
It's not the same. There's a bunch of things referred to as "weak".
- [deleted]
> Not really. Python's strongly typed, but dynamically
Yea, nah!
You cannot have both.
That's an interesting and uncommon claim.
TaPL should be required reading for self proclaimed "engineers". Do untyped languages even exist, or could they? maybe something purely theoretical for ivory tower academics? Even POSIX shell and assembler type their values.
> Do untyped languages even exist, or could they? Even POSIX shell and assembler type their values.
I program in assembly language. Memory is just a big array of bytes. Values don't have types, but each instruction chooses what type to interpret memory as - e.g. uint8, int16, float32, x86 real-mode pointer, long-mode pointer, etc.
BCPL[0], BLISS[1], and Forth[2] are all untyped.
[0] https://en.wikipedia.org/wiki/BCPL
[1] https://en.wikipedia.org/wiki/BLISS
[2] https://en.wikipedia.org/wiki/Forth_(programming_language)
I think one can reasonably argue that, if the language only has a single type, it is effectively "untyped".
That would mean languages like Tcl. Or, for that matter, B.