Some years back, we worked on a vertical search engine for job postings, and it quickly grew to 10 million domains. To avoid showing expired job data, the system mostly monitored (re-crawled) the 250,000+ sites that had actual jobs, or where jobs could potentially appear in the near future.
Many data attributes were extracted from the unstructured web and processed through filters, classifiers, machine learning and custom heuristics, resulting in a large, useful dataset full of business data.
Maybe that project deserves its own technical article. In this post, however, let's discuss a different but somewhat related search engine idea:
We want powerful B2B Search and we only see dumbed down mainstream consumer search - smart engineers have to serve mainstream users, so the available search query capabilities are really dumbed down.
Not advanced enough - Google's Advanced Search syntax is NOT ADVANCED AT ALL. The top 5% of global power users would want a much, much more advanced search syntax, a robust API or SDK, and integration capabilities with their own IT systems. We will review example use cases below.
Does not integrate - Google would never let you export 253,781 lines of your search results, let alone integrate with your own IT system / backend. Do not get fooled by terms like "Google Search API" (now deprecated) or Google Custom Search - they solve completely different problems (like adding a Google search box to your website).
Fast and simple vs. slow but powerful - a consumer search engine has to be split-second fast, which almost always implies simple (i.e. limited). A B2B search could be much more complex and powerful, even if that means certain queries / procedures may take minutes (hours? days?) before returning results.
Open Source - maybe this new Search Engine should be an Open Source project? So even interested Google engineers could contribute? Maybe this could be a collaborative dev community effort funded by Bitcoin donations / grants / tips / Kickstarter, with contributing developers worldwide being paid in Bitcoin?
Let's discuss examples of important tasks that could be accomplished with this imaginary search engine. For lack of a better option, I am going to use an SQL-like language to write imaginary queries against the Internet Universe.
You are a startup founder. You need to find journalists active in your niche to pitch your story to. Here is what your query might look like:
1) Get the top 100 news sites by Alexa traffic rank
select top 100 * from Site s where s.Category = 'News' order by SiteAttribute(s.SiteID, 'AlexaRank')
2) Get all articles from those sites that were published within the last 2 years and contain any of your niche keywords in the title
from Page p
where p.SiteID in (... top 100 sites above ...)
and p.CreateDate > getdate() - 365 * 2
and Regex.IsMatch(p.Title, '\b(bitcoin|litecoin|dogecoin|btc|ltc)s?\b')
3) Get the authors of those articles, group by author, and count the matching articles for each author
AuthorID, Count(p.PageID) as ArticleCount
PageAttribute(p.PageID, 'AuthorID') as AuthorID
group by AuthorID
4) For each author, try to look up their email address, Twitter handle, and Klout score
PersonAttribute(prs.PersonID, 'Email') as Email,
PersonAttribute(prs.PersonID, 'Twitter') as Twitter,
PersonAttribute(prs.PersonID, 'Klout') as Klout
left outer join Person prs on prs.PersonID = AuthorID
5) Finally, output that list of authors, ordered by a custom formula: a weighted sum of their news site's Alexa rank, their number of matching articles, and their Klout score.
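Step 5 leaves the actual formula open. Here is one possible shape, as a minimal Python sketch. The weights, the normalization of each term, the 0-100 Klout scale and the `AuthorStats` type are all assumptions made up for illustration, not part of any real system.

```python
from dataclasses import dataclass


@dataclass
class AuthorStats:
    name: str
    best_site_rank: int  # best (lowest) Alexa rank among the author's news sites
    article_count: int   # number of matching articles from step 3
    klout: float         # Klout score, assumed to be on a 0-100 scale


def score(a: AuthorStats, w_rank=0.3, w_articles=0.5, w_klout=0.2) -> float:
    """Weighted sum of three terms, each scaled to roughly 0..1 (hypothetical weights)."""
    rank_part = 1.0 / a.best_site_rank                # rank 1 -> 1.0, rank 100 -> 0.01
    articles_part = min(a.article_count, 20) / 20.0   # capped at 20 articles
    klout_part = a.klout / 100.0
    return w_rank * rank_part + w_articles * articles_part + w_klout * klout_part


def rank_authors(authors):
    """Return authors ordered from best to worst pitch target."""
    return sorted(authors, key=score, reverse=True)
```

With these weights, a journalist who wrote many matching articles on a mid-ranked site outranks one who wrote a single article on a top-ranked site. Whether that is the right trade-off is exactly the kind of thing a power user would want to tune.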
For readability, I am only providing parts of the larger query, mostly to give a sense of the data model and basic operations. I am also trying to keep it close to real MS SQL syntax. It might also be split into multiple queries combined into one executable program written in Java or C#. Not everything has to be expressed in one large, hard-to-read SQL query. Sometimes a procedural approach makes it easier to understand the underlying logic.
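To illustrate the procedural alternative, here is a rough sketch of steps 2 and 3 written as a plain loop. It uses Python rather than Java or C# only for brevity. The page dictionaries and their keys (`site_id`, `created`, `title`, `author_id`) are hypothetical stand-ins for the Page table, and the case-insensitive matching is an assumption added here.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Same keyword pattern as in step 2; IGNORECASE is an assumption of this sketch.
NICHE = re.compile(r"\b(bitcoin|litecoin|dogecoin|btc|ltc)s?\b", re.IGNORECASE)


def count_articles_by_author(pages, top_site_ids, now=None, years=2):
    """Count matching articles per author (steps 2 and 3 combined)."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=365 * years)
    counts = Counter()
    for page in pages:
        if page["site_id"] not in top_site_ids:
            continue  # not one of the top news sites from step 1
        if page["created"] <= cutoff:
            continue  # older than the time window
        if not NICHE.search(page["title"]):
            continue  # title does not mention a niche keyword
        counts[page["author_id"]] += 1
    return counts
```

Each filter is one readable `if`, which is the main argument for the procedural style: the logic can be stepped through and changed one condition at a time.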
For the sake of this article, let's forget about sharding, big data and MapReduce, and just try to make the very first draft of an entity model capable of holding websites, webpages and a whole bunch of related metadata in a relational database. Let's keep it really simple and straightforward so at least we have something to start talking about.
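As a starting point for that conversation, here is one guess at what such a first draft could look like, using SQLite from Python so it can be run as-is. The table and column names are inferred from the query fragments above (Site, Page, Person, plus key/value attribute tables standing in for calls like SiteAttribute(s.SiteID, 'AlexaRank')). The key/value layout and the `Domain` and `Url` columns are assumptions of this sketch, not a finished design.

```python
import sqlite3

SCHEMA = """
create table Site   (SiteID integer primary key, Domain text not null unique, Category text);
create table Page   (PageID integer primary key,
                     SiteID integer not null references Site(SiteID),
                     Url text not null, Title text, CreateDate text);
create table Person (PersonID integer primary key, Name text not null);
-- Open-ended metadata stored as name/value pairs (assumed layout).
create table SiteAttribute   (SiteID integer not null references Site(SiteID),
                              Name text not null, Value text, primary key (SiteID, Name));
create table PageAttribute   (PageID integer not null references Page(PageID),
                              Name text not null, Value text, primary key (PageID, Name));
create table PersonAttribute (PersonID integer not null references Person(PersonID),
                              Name text not null, Value text, primary key (PersonID, Name));
"""


def create_db():
    """Create an empty in-memory database with the draft schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def site_attribute(conn, site_id, name):
    """Look up one attribute value for a site, or None if it is not set."""
    row = conn.execute(
        "select Value from SiteAttribute where SiteID = ? and Name = ?",
        (site_id, name),
    ).fetchone()
    return row[0] if row else None


def top_news_sites(conn, limit=100):
    """Step 1 of the example: news sites ordered by Alexa rank (best first)."""
    return conn.execute(
        """select s.SiteID, s.Domain
           from Site s
           join SiteAttribute a on a.SiteID = s.SiteID and a.Name = 'AlexaRank'
           where s.Category = 'News'
           order by cast(a.Value as integer)
           limit ?""",
        (limit,),
    ).fetchall()
```

Keeping attributes in name/value tables means new metadata (a Klout score today, something else tomorrow) needs no schema change, at the cost of storing everything as text and casting at query time.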
One company, DataSift, is actually already doing something similar, focusing on about 20 of the most important websites on the Internet. The data sources most requested by their customers are Facebook, LinkedIn and, increasingly, YouTube. They also crawl Twitter, Reddit, etc., about 20 different websites in total. I learned about it from Mark Suster's blog post. Twitter apparently started giving them problems; why are we not surprised?
... to be continued ...