Somehow torvalds/linux is in Fronterra, next to JS projects, awesome-X lists, and frontend checklists.
Either kernel hackers unexpectedly love frontend, or more likely the people who write the code don't overlap much with the people who star GitHub projects!
Jaccard similarity is not particularly good for "celebrity" projects.
They are similar because they are popular, not because there is a semantic relationship.
It's the same problem I faced with the map of reddit (https://anvaka.github.io/map-of-reddit/ ) - all popular subreddits are just "similar" to each other.
Still works great for smaller, non-celebrity projects :D
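A rough way to see the "celebrity" effect in numbers. This is only a sketch under a loud assumption that is not from the thread: every user stars each repo independently, so a repo starred by a fraction p of users and another starred by a fraction q share about p*q of users by pure chance. The function name `chance_jaccard` and the fractions are made up for illustration, and the ratio of expectations is only an approximation of the expected Jaccard score.

```python
def chance_jaccard(p, q):
    """Approximate Jaccard score two repos would get from popularity alone.

    Assumption (hypothetical, not from the thread): every user stars each
    repo independently, with probability p for the first repo and q for
    the second. Expected overlap is p*q, expected union is p + q - p*q.
    """
    union = p + q - p * q
    return (p * q) / union if union else 0.0


# Two "celebrity" repos, each starred by 30% of all users (made-up number)
celebrity = chance_jaccard(0.30, 0.30)

# Two niche repos, each starred by 0.1% of all users (made-up number)
niche = chance_jaccard(0.001, 0.001)

print(f"celebrity pair: {celebrity:.3f}")
print(f"niche pair:     {niche:.5f}")
```

Under that assumption the celebrity pair gets a baseline score hundreds of times higher than the niche pair without any semantic relationship at all, which would fit the observation that the metric behaves better for smaller projects.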
I wonder if code embeddings might have been a better way to organize the projects, although probably infeasible given the amount of resources required to download and compute embeddings for each file.
Embeddings are super cheap to compute
Perhaps for the same reason that heat maps are often really just the underlying population map: https://xkcd.com/1138/
That’s why in NLP we use term frequency times inverse document frequency (TF-IDF). It gives you a measure of how common uncommon things are.
Wonder how you’d implement that in a heat map. Just call each pixel a document and see where it takes you?
People have been critiquing the collaborative filtering aspect of this work vs content analysis ("[why use stars instead of code similarity]") but there's something elegant about the simplicity of using fewer priors here.
A tf*idf weighting could be applied to the star-feature matrix too. Document = GitHub repo. Term = name of user who starred it.
Thus, users who overstar get a low idf weight and are simply less important for computing similarities.
This would mitigate the phenomenon of massively popular GitHub repos being clustered together because of folks who blithely star the most well-known stuff.
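A minimal sketch of that idea, not the map author's method. Assumptions: tf is 1 for every (repo, user) star, idf is log(number of repos / number of repos the user starred), and similarity is cosine over the weighted vectors. The repo and user names in the toy data are hypothetical.

```python
import math


def idf_weights(stars):
    """stars maps repo name -> set of user names who starred it."""
    n_repos = len(stars)
    starred_count = {}
    for users in stars.values():
        for user in users:
            starred_count[user] = starred_count.get(user, 0) + 1
    # A user who starred every repo gets log(1) = 0 and drops out entirely.
    return {u: math.log(n_repos / c) for u, c in starred_count.items()}


def weighted_cosine(stars, idf, a, b):
    """Cosine similarity between repos a and b with idf-weighted users."""
    def norm(repo):
        return math.sqrt(sum(idf[u] ** 2 for u in stars[repo]))

    dot = sum(idf[u] ** 2 for u in stars[a] & stars[b])
    denom = norm(a) * norm(b)
    return dot / denom if denom else 0.0


# Toy data: bulk1 and bulk2 star everything popular.
stars = {
    "linux": {"bulk1", "bulk2", "kernel_dev"},
    "react": {"bulk1", "bulk2", "web_dev"},
    "vue": {"bulk1", "bulk2", "web_dev"},
    "htmx": {"bulk1", "bulk2", "django_dev"},
}
idf = idf_weights(stars)

linux_react = weighted_cosine(stars, idf, "linux", "react")
react_vue = weighted_cosine(stars, idf, "react", "vue")
print(linux_react, react_vue)
```

In this toy data plain Jaccard would score linux and react at 2/4 because of the two bulk starrers, while the weighted version ignores them and only keeps the pair that shares a selective user.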
Winsorize the data points to remove outliers and then divide it by the population count for the case of the heatmap?
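One possible reading of that suggestion, as a sketch only. Note that winsorizing clamps extreme values to a percentile instead of removing them. The percentile cutoffs, the simple index-based percentile rule, and the sample numbers are all assumptions for illustration.

```python
def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values to the given lower/upper percentiles (crude index rule)."""
    ordered = sorted(values)
    n = len(ordered)
    low_cut = ordered[int(lower * (n - 1))]
    high_cut = ordered[int(upper * (n - 1))]
    return [min(max(v, low_cut), high_cut) for v in values]


def per_capita(counts, population):
    """Divide each cell's count by its population, guarding empty cells."""
    return [c / p if p else 0.0 for c, p in zip(counts, population)]


# Hypothetical heat map cells: event counts and population per cell.
counts = [10, 12, 11, 9, 500]
population = [100, 120, 110, 90, 100]

clamped = winsorize(counts)
rates = per_capita(clamped, population)
print(clamped)
print(rates)
```

Here the outlier cell with 500 events is pulled down to the 95th-percentile value before normalizing, so one extreme cell no longer dominates the color scale.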
Because of react ?
That was my first reaction.
What's your angle?
Can you elaborate?
Wait, why React?
Live link: https://anvaka.github.io/map-of-github/
Yeah, this should be the link - not the repo.
Very fun to be able to find my own project there (mapbox-gl-utils):
https://anvaka.github.io/map-of-github/#12/24.78947/18.85186
"Sussex" as the name of the Among Us section had me laughing
The funniest one I saw was "Lispaña"
I just noticed Halaska right next to it :)
Olé! (^..^)
As a fan of Julia, surprised to see how julialang/julia has so few links. It's a niche language; how isolated it is on this map is maybe not so unrepresentative of the user or developer experience.
There's a JuliaLand to the west of the island where julialang/julia is.
The fact that julialang/julia ended up near tensorflow and opencv, and actual Julia packages ended up elsewhere, probably reflects a difference between aspirational users and real users: a lot of people who starred the Julia project itself were numeric Python users who were looking for a new Python, but then mostly stuck to Python itself, so their other stars are in the numeric Python land. Those who starred the JuliaLand packages are the actual Julia users who aptly enough ended up near Moleculandia and AstroSpace and Quantumia.
Surprised at how small Rustland is. Barely a province in Clouderra.
Also, interesting how both Bevy and Veloren are in Rustland. Probably, the stars come more from the Rust community than the game dev community. Which I guess makes sense: the Rust ecosystem is still relatively small and feels like a lot of people doing X but in Rust.
I'm also shocked how small "nodelandia" is and that it's not even its own continent. I guess we all overestimate the size of our bubbles.
Most of the mass is concentrated in the node_modules folder.
OTOH the vim and emacs lands seem to be huge!
Vim land seems much larger than I expected.
I can see many osdev Rust projects in "PlusPlus Nation" near other kernels, which means that "X but in Rust" might be in "X" instead of "RustLand".
The data is from March 2023 according to OP, so a lot of the more recent Rust projects just won't be included yet.
Yes...
Aiming to redo it some time in early 2025!
Happy to see bevy between them though! :)
Tangent: it's not that often I see a fellow Ramon on HN :)
Also, lol at Zig being a suburb of Rust
Not that surprised. Rust is known for being evangelized by a very loud minority.
A fun minigame is trying to find a particular project using the map only, without the search feature :-)
or start with one project and find your way to another, you can imagine there are shipping lines :)
love it =)!
Cool
Very neat and creative approach but I'm honestly conflicted whether the country/map metaphor is the best choice. In many cases the names are not that clear, so one has to zoom in to understand what they represent. It would perhaps be more interesting to do hierarchical clustering and show something like average connectiveness between the (super)clusters with lines, possibly with more descriptive/faithful LLM-generated labels for each cluster.
I couldn't find a universal clustering algorithm yet: Frequently there is more than one way to group data that still makes sense, and as a result whichever final clustering option we choose - it will not be perfect.
Hm... unless maybe we do some sort of quantum clustering, which could be a fun project to explore!
It's a bit hazy now, but I remember trying the hdbscan algorithm (hierarchical clustering), and on a graph the size of GitHub I just couldn't fit it in memory.
I did end up using something similar to hierarchical clustering (mix of louvain/leiden/my own), and that's what we see in the final map.
I was pleasantly surprised that it wasn’t a heavy line drawing creation. As someone who first did those in the 90’s and almost immediately learned their limits, I think this is nice because it doesn’t overclaim. It’s just a view, not a thesis.
I like diagrams where the axes mean something. Lines, shape, boxes/groups, distance, X vs Y, colour, thickness, texture, background, foreground. I also like simple. So often it’s lines to be fancy with no meaning. This one is just a pic, with some grouping, and it has personality. Yay?
(Still love lines, just not everywhere always.)
They could have done that, but they decided to do a map
Quitlessia and NeoQuitlessia... These names are evil. (Doom Emacs lies in NeoQuitlessia instead of Emacsia, which surprisingly makes sense. :)
haha! I love vim.
We shall not quit.
Very interesting that HTMX (bigskysoftware/htmx), which is backend-agnostic, lives in Pythonia->Djangonia and not in e.g. Fronterra.
Does this mean that HTMX is mostly used by Django devs?
Despite HTMX being backend-agnostic, I heard it pairs extremely well with Django, so that's probably why! Maybe the two are particularly well fitting pieces of the web dev puzzle.
It's good to see all those "why is X in Y?" type comments.
Remember that feeling when deploying algorithms, especially when those affect people (which hopefully is not the case with this nice project).
A mechanism to explain how specific results came about is as much part of the project as the more technical machine learning choices involved.
"Stop the war" looks like a very small territory, you don't even need to think what kind of message they send. It's so small in the grand scheme
Kudos to the author for the amazing idea!
The only problem I see is that projects don't fit so nicely in the division between languages (Pythonia, Javaland, Clojuria, etc) and applications (Gamedonia, AILandia, etc). There's a lot of intersection between them.
But the visualization is super-cool nonetheless. :)
The author of this also made other outstanding visualizations.
A while back ngraph blew my mind. I built a taxonomy biz off ngraph:
Thank you for your kind words!
That link you've shared - doesn't open for some reason
It did for me now, perhaps too much traffic. It's pretty wild, especially on a tablet due to the gyroscope effect!
Gyro support is also Andrei's work, not mine.
I found the (awesome) video where he presents ngraph: https://youtu.be/vZ6Yhlxv7Os
Edit: not loading? surge.sh has been less reliable lately, will get to finish that project some day and will publish elsewhere.
How are connections between repos determined? I checked some of my repos and don't see any references in either direction for some of the connections.
The author answered that question in the original HN post: https://news.ycombinator.com/item?id=35933981
Basically what others are guessing, lines represent the highest similarity scores based on "stargazers", which also forms the entire map. To anyone confused, the lines only appear once you click into a specific country.
In the first line: "Dots are close to each other if they have a lot of common stargazers."
That explains why they are "close to each other" but not what determines which nodes are connected by an edge.
I think it's the other way around. The similarity metric determines which repos have edges (possibly weighted?)
And then some clustering algorithm makes sense of this giant graph by laying out sets of nodes that have a lot of edges to each other, close to each other
The closeness is just layout; the edges are the data structure that determines closeness.
This is correct!
Jaccard similarity returns a value between 0 and 1 (in this case the vast majority of the values being close to 0). I suspect there's a hard-coded threshold value to determine an edge, e.g. if Jaccard similarity between A and B is > 0.2, create an edge.
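A minimal sketch of that guess. The 0.2 threshold is the commenter's hypothetical example, not a confirmed value from the project, and the repo and user names are made up. Another comment in the thread says the lines represent the highest similarity scores, so a top-k rule per repo is just as plausible as a fixed cutoff.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_edges(stars, threshold=0.2):
    """Return (repo1, repo2, score) for every pair scoring above threshold.

    stars maps repo name -> set of stargazer names.
    The threshold value is an assumption for illustration.
    """
    edges = []
    for r1, r2 in combinations(sorted(stars), 2):
        score = jaccard(stars[r1], stars[r2])
        if score > threshold:
            edges.append((r1, r2, score))
    return edges


# Hypothetical star data
stars = {
    "repo_a": {"u1", "u2", "u3", "u4"},
    "repo_b": {"u3", "u4", "u5"},
    "repo_c": {"u9"},
}
edges = build_edges(stars)
print(edges)
```

Comparing every pair is quadratic in the number of repos, so at GitHub scale some candidate-pruning step would be needed before an exact pass like this one.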
Django is in the middle of Pythonia, and not in Djangonia. Weird!
If you develop on Linux you probably don't star linux/kernel. But you do star other projects developing on Linux.
Ditto if you develop Django, you star Python libraries, not Django downstream plugins.
It's similar to how SpringBootia is almost as big as Javaland, but Spring Boot itself is in Javaland and not its "homeland".
I do have the theory that the more untyped the language is, the larger the islands are: Fronterra (JavaScript), Cloudderra (YAML), AILandia (Python) are way bigger than Java, Swift, DotNet, etc. even though the prejudice saying goes that the problem of software engineering is stale old enterprise code in Java/DotNet.
That might be the case, but the libraries seem to be more reusable!
JavaScript made the barrier to entry for creating a package nearly zero. In contrast, it's fairly difficult to publish something on Maven Central (the main Java repository). You need to prove you own a domain, set up a GPG key for signing, and manually register with Sonatype, which is more than many people are willing to do. I think that explains it much better.
The stale old enterprise code is not in public repositories.
Cool visualisation!
It was somewhat amusing that MicroPython isn't in MicroPythonia but Arduinoria...and CircuitPython is in PicoPythonia. :)
> In the second phase I computed exact Jaccard Similarity between each repository.
Using what inputs? The repo seems to have only the frontend code.
The star data from the first phase.
this looks really great!
I tried something similar a few weeks ago, using the embedding vectors of the Github project descriptions.
Lispaña is a really excellent name for a lisp country :)
Thank you =)
Some might say that PHP is dead (and I’d be one of them too), but there is a PHP kingdom on the map! :) I think we might have all been mistaken.
I don't think anyone has ever seriously suggested PHP is dead. People may want it to be dead, but it's probably still the most-used language on the web!
Sorry, not sorry: PHP is alive, and thriving! The language runtime is getting ever faster, the packagist ecosystem in combination with composer (PHP's package manager) is rock-solid, there are event loops and application servers by now, serverless deployments are the default operation mode, and with Laravel or Symfony, there are trusted and extremely versatile frameworks available that do stuff out of the box that requires lots of manual effort with other languages.
Add to that the support for type annotations that can go all the way from fully untyped and dynamic, to runtime-enforced primitive constraints and object types, and you'll end up with a very good choice for web applications that evolve quickly.
docker-minecraft is under Adulttopia. I wonder what made it make that connection
Nice, but kind of weird to find piku/Piku in Fronterra.
I'm not sure why BinanceLand is in AILandia though, please don't encourage them XD
Clearly crypto should be a sinking ship with people swimming to the shores of other places in this metaphor
It would make sense that there's an overlap between crypto fans and a certain subset of AI fans.
ZH.Pyscrapia had an island of its own.
I've been thinking something similar for identifying ownership areas within an organization would be cool.
Very well done, loads quickly and is usable even from mobile.
I love this sort of concept map and I am typically disappointed by the execution.
Wikimedia is right next to GPT Nation. I think an invasion is imminent.
Why was Jaccard similarity preferred here? I would love to learn more about the choice process. Fantastic work though, love it.
Thank you!
I tried quite a few various similarity metrics, and Jaccard was giving me the best results. This is all very subjective, of course.
This is truly a work of art! Great job!
> Homelabia
Definitely some unique naming choices there lol
yay anvaka reaches front page!
fun times from reddit map
haha thank you!
Interesting how one fork of Magisk lands in "AndroModLand" and another in some gaming space.
Interesting that azureland is under l33t nation and not clouderia
"The GitHub Archipelago"
This is phenomenal!
But stargazers are absolutely meaningless, since most of them are bots that give stars behind payment and like random stuff to throw off detection.
And as usual, important libraries don't get as much attention as flashy little leaf projects.
couldn't find any of my stuff so that means i gotta do more lol
Looks like AI is already trashing the place, lol.
FORTRAN and COBOL Programming is a part of the AI island, lol.