Machine learning: The Architecture of a Large-Scale Web Search Engine

In its earliest days, Yahoo was a simple database engine: it hosted links organised into directories. It was one of the largest “human curated” directories and became popular with early web users. Search remained an urgent unmet need until Google came into the picture. Yahoo then faded into irrelevance and has now become an ad-infested tracking website.

I came across a blog post from Cliqz, a new European rival to Google. I am not getting into the politics of GDPR or how Europe was left squarely behind in the race for technology dominance. Still, the post is quite relevant because it discusses various issues related to machine learning at scale:

As we grow and rely more on machine learning and its variants, we want the processes surrounding Machine Learning to be streamlined and be more reproducible in general. This is where things like model tracking, model management, data versioning and lineage becomes crucial. To run things consistently at our scale where we apply periodic updates and assessments, we needed a solution around data management for serving models in production, which facilitates hot swapping of models and indexes in our live production services autonomously. To tackle this issue, we built a solution in-house “Hydra” which provides downstream services with the capability of performing a dataset pub-sub

The Architecture of a Large-Scale Web Search Engine, circa 2019
Behind the scenes is an extremely complex scenario.
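
To make the “dataset pub-sub” and “hot swapping” idea from the quote a little more concrete, here is a minimal sketch in Python. It is not Hydra's actual design or API, which the post does not spell out here. Every name in it (`DatasetBus`, `ModelServer`, `load_model`, the `ranking-model` topic and the `v1`/`v2` versions) is hypothetical, and a real system would run across machines rather than inside one process.

```python
import threading
from typing import Callable, Dict, List, Tuple

Model = Callable[[float], float]


class DatasetBus:
    """Toy in-process pub-sub (hypothetical): topics are dataset names,
    messages are version strings."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}
        self._lock = threading.Lock()

    def subscribe(self, dataset: str, callback: Callable[[str], None]) -> None:
        with self._lock:
            self._subscribers.setdefault(dataset, []).append(callback)

    def publish(self, dataset: str, version: str) -> None:
        # Copy the subscriber list so callbacks run without holding the lock.
        with self._lock:
            callbacks = list(self._subscribers.get(dataset, []))
        for callback in callbacks:
            callback(version)


class ModelServer:
    """Serves predictions and swaps its model when a new version is announced."""

    def __init__(self, loader: Callable[[str], Model], initial_version: str) -> None:
        self._loader = loader
        self._lock = threading.Lock()
        self._version = initial_version
        self._model = loader(initial_version)

    def on_new_version(self, version: str) -> None:
        # Load the new model first (the slow part), then swap under the lock,
        # so requests keep being served by the old model during loading.
        new_model = self._loader(version)
        with self._lock:
            self._model, self._version = new_model, version

    def predict(self, x: float) -> Tuple[float, str]:
        # Take a consistent snapshot of (model, version) for this request.
        with self._lock:
            model, version = self._model, self._version
        return model(x), version


def load_model(version: str) -> Model:
    # Stand-in for fetching a model artifact: each version just scales its input.
    scale = {"v1": 1.0, "v2": 2.0}[version]
    return lambda x: x * scale


bus = DatasetBus()
server = ModelServer(load_model, "v1")
bus.subscribe("ranking-model", server.on_new_version)

before = server.predict(3.0)        # served by v1
bus.publish("ranking-model", "v2")  # announce a new version
after = server.predict(3.0)         # served by v2, no restart needed
```

The point of the sketch is the ordering inside `on_new_version`: the expensive load happens before the swap, so the live service never waits on it and never sees a half-loaded model.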

The authors discuss several engineering challenges that they encounter, including onboarding new hires without racking up infrastructure costs or leaving resources underutilised. It is a delicate balance that they have to maintain every day.

Here’s a very interesting paper that they discuss: “Hidden Technical Debt in Machine Learning Systems”.

[embeddoc url=”https://www.dropbox.com/s/cciqsydjz7bomi4/5656-hidden-technical-debt-in-machine-learning-systems.pdf?dl=1″ viewer=”google” ]

This has huge implications for anyone who needs to understand the deployment of machine learning in hospitals. Pricing is of course an issue, but it is always going to be cheaper to have facilities “in-house” than “out-sourced” (even with the technical compliance in place).

Machine learning or AI will affect users fundamentally but “subtly”, and we can’t be oblivious to the “J-curve” of productivity either. It is a management problem: should we invest now in AI (or personalised medicine) without any surety of returns, or risk being left behind the curve when it possibly starts delivering value to the incumbents?
