Hi there 👋

Today, let's dive into 7 ML repos that the top 1% of developers use (and that you have likely never heard of)!


What defines the top 1%?

Ranking developers is a difficult problem, and every methodology has its issues.

For example, if you rank developers by the number of lines of Python they have written, you'll probably get some pretty good Python developers at the top.

However, you may also get people who have just copy-pasted lots of Python code into their repos and aren't that good. 🙁

At Quine, we have developed a methodology that we think is robust in most cases, but again not 100% perfect!

It's called DevRank (you can read more about how we calculate this here).

The notion of the Top 1% that I use in this article is based on DevRank.

And yes, we continue working on this to make it better every day!

How do we know which repos the top 1% use?

We look at the repos that the 99th percentile has starred.

We then compare the propensity of the top 1% of devs vs the bottom 50% of devs to star a repo, and automatically generate the list.

In other words, these repositories are the hidden gems used by the top 1% of developers and are yet to be discovered by the wider developer community.
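
To make the comparison above concrete, here is a minimal sketch in Python. It is not DevRank or our actual pipeline: the function names, the toy data and the smoothing constant are all assumptions made purely for illustration.

```python
from collections import Counter


def star_propensity(stars_by_dev, devs):
    """Fraction of the given developers who starred each repo."""
    counts = Counter()
    for dev in devs:
        # set() so a repo counts at most once per developer
        counts.update(set(stars_by_dev.get(dev, ())))
    return {repo: n / len(devs) for repo, n in counts.items()}


def hidden_gems(stars_by_dev, top_devs, bottom_devs, smoothing=0.01):
    """Rank repos by how much more likely the top group is to star them.

    `smoothing` (an assumption) avoids dividing by zero when nobody
    in the bottom group has starred a repo.
    """
    top = star_propensity(stars_by_dev, top_devs)
    bottom = star_propensity(stars_by_dev, bottom_devs)
    ratios = {
        repo: p / (bottom.get(repo, 0.0) + smoothing)
        for repo, p in top.items()
    }
    return sorted(ratios, key=ratios.get, reverse=True)


# Hypothetical developers and stars, for illustration only.
stars_by_dev = {
    "ana": ["CleverCSV", "pandas"],
    "bo": ["CleverCSV", "pandas"],
    "cy": ["pandas"],
    "di": ["pandas"],
    "ed": ["pandas", "skll"],
    "flo": [],
}
top_devs = ["ana", "bo"]
bottom_devs = ["cy", "di", "ed", "flo"]

ranking = hidden_gems(stars_by_dev, top_devs, bottom_devs)
print(ranking)
```

In this toy data everyone stars pandas, so it scores low. CleverCSV is starred only by the top group, so it comes out first.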


CleverCSV

I handle your messy CSVs

A package developed by some friends of ours to handle common pain points of loading CSV files. A small but common problem at the start of many ML pipelines, solved well. 🔮

https://github.com/alan-turing-institute/CleverCSV


skll

Streamline ML workflows with scikit-learn through CLI

Are you writing endless boilerplate in sklearn to obtain cross-validated results with multiple algorithms? Try skll's interface instead for a much cleaner coding experience. ⚡️

https://github.com/EducationalTestingService/skll


BanditPAM

k-Medoids Clustering in Almost Linear-Time

Back to fundamental algos here: BanditPAM is a new k-medoids (think a robust “k-means”) algorithm that can run in almost linear time. 🎉

https://github.com/motiwari/BanditPAM
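
If the "robust k-means" idea is new to you, here is a tiny sketch of what a medoid is. It uses plain Python and does not use BanditPAM's API. The `medoid` helper and the numbers are made up for illustration, and the search is a brute-force one that compares every pair of points, which is the kind of cost an almost-linear algorithm avoids.

```python
def medoid(points):
    """Return the point with the smallest total distance to all the others.

    Brute force: every point is compared with every other point.
    """
    return min(points, key=lambda p: sum(abs(p - q) for q in points))


# Four ordinary values and one outlier (illustrative numbers).
values = [1.0, 2.0, 3.0, 4.0, 100.0]

mean = sum(values) / len(values)  # what a k-means style centre would use
center = medoid(values)           # what a k-medoids style centre would use

print(mean, center)
```

The outlier drags the mean far away from the bulk of the data, while the medoid remains one of the actual data points in the middle of it.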


recordlinkage

The record matcher and duplicate detector everyone needs

Have you ever struggled to match users within different datasets who have spelt their name wrong, or who have slightly different attributes? Use this great library inspired by the Freely Extensible Biomedical Record Linkage (FEBRL), rebuilt for modern Python tooling. 🛠️

https://github.com/J535D165/recordlinkage
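
To get a feel for the problem, here is a rough sketch of fuzzy name matching. It uses Python's standard-library `difflib` as a stand-in and not recordlinkage's own API. The `similarity` and `link_records` helpers, the 0.8 threshold and the sample names are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity between 0 and 1, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def link_records(left, right, threshold=0.8):
    """Pair each left record with its best match on the right, if good enough."""
    links = []
    for i, a in enumerate(left):
        best_j, best_score = None, 0.0
        for j, b in enumerate(right):
            score = similarity(a, b)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            links.append((i, best_j))
    return links


# Hypothetical names from two datasets.
crm_names = ["Jonathan Smith", "Maria Garcia", "Wei Chen"]
signup_names = ["maria garcia", "Jonathon Smith", "Olga Petrova"]

links = link_records(crm_names, signup_names)
print(links)
```

Comparing every record with every other record gets slow quickly, and a single string score is a blunt instrument, which is why a dedicated library is worth a look for real datasets.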


dragnet

A sole focus on web page content extraction

Content extraction from webpages. Dragnet focuses on the content and user comments on a page, and ignores the rest. It's handy for our scraper-friends out there. 🕷️

https://github.com/dragnet-org/dragnet


spacy-stanza

The latest StanfordNLP research models directly in spaCy

Interested in standard NLP tasks such as part-of-speech tagging, dependency parsing and named entity recognition? 🤔

SpaCy-Stanza wraps the Stanza (formerly StanfordNLP) library to be used in spaCy pipelines.

https://github.com/explosion/spacy-stanza


Littleballoffur

"Swiss Army knife for graph sampling tasks"

Have you ever worked with a dataset so large that you need to take a sample of it? For simple data, random sampling preserves the distribution in a smaller sample. However, in complex networks, snowball sampling - where you select initial users and include their connections - better captures network structure.

This helps avoid bias in analysis. 🔦

Now, do you have graph-structured data and need to work on samples of it (either for algorithmic or computational reasons)? 👩‍💻

https://github.com/benedekrozemberczki/littleballoffur
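
Here is a minimal sketch of the snowball idea described above, written in plain Python and not with littleballoffur's API. The graph is a simple adjacency dictionary, and the `snowball_sample` helper, the node names and the sample size are hypothetical.

```python
from collections import deque


def snowball_sample(graph, seeds, max_nodes):
    """Start from seed nodes and keep adding their connections, wave by wave."""
    sampled = []
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed in graph and seed not in seen:
            seen.add(seed)
            queue.append(seed)
    while queue and len(sampled) < max_nodes:
        node = queue.popleft()
        sampled.append(node)
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sampled


def induced_subgraph(graph, nodes):
    """Keep only the sampled nodes and the edges between them."""
    keep = set(nodes)
    return {n: [m for m in graph[n] if m in keep] for n in nodes}


# Toy social network: one connected group and one separate pair.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
    "x": ["y"],
    "y": ["x"],
}

nodes = snowball_sample(graph, seeds=["a"], max_nodes=4)
subgraph = induced_subgraph(graph, nodes)
print(nodes, subgraph)
```

Because the sample grows along existing connections, the sampled nodes stay connected to each other, which picking users uniformly at random would not guarantee.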


I hope these discoveries are valuable to you and will help build a more robust ML toolkit! ⚒️

If you are interested in leveraging these tools to create impactful projects in open source, you should first find out what your current DevRank is on Quine and see how it evolves in the coming months!

Lastly, please consider supporting these projects by starring them. ⭐️

PS: We are not affiliated with them. We just think that great projects deserve great recognition.

See you next week,

Your Hackernoon buddy 💚

Bap


If you want to join the self-proclaimed "coolest" server in open source, you should join our Discord server. We are here to help you on your journey in open source. 🫶