Hyperbook Technology overview

Hyperbook has the following capabilities in the backend.

1. Classify URLs into categories
2. Identify tags / topics associated with a web page or document
3. Index the contents for quick search
4. Provide recommendations based on content
5. Extract named entities
6. Identify relations / ontology
7. Compute Interest Graph
8. Identify communities / groups / networks
9. Text mining – to discover knowledge
10. Machine learning algorithms for categorization / prediction

1. URL Classification  

Automatically categorize URLs into high-level categories such as News, Social, Technology, Health, and Work.

Using machine learning techniques, we have classified the Alexa top 1 million sites into categories. We trained our learning algorithm on data from sources such as DMOZ (the Open Directory Project), then refined it with manual feedback on classification quality from over 100K users.
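
The text does not say which learning algorithm is used, so the following is only a minimal sketch of the general idea, assuming a multinomial Naive Bayes classifier over page text. The training examples and class names are hypothetical stand-ins for directory data.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.vocab = set()

    def train(self, text, category):
        self.class_docs[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best, best_score = None, float("-inf")
        for category, n_docs in self.class_docs.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in tokenize(text):
                count = self.word_counts[category][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = category, score
        return best

# Hypothetical labelled examples standing in for directory data.
training = [
    ("breaking news headlines world politics", "News"),
    ("daily news election coverage reporters", "News"),
    ("programming software developer gadgets", "Technology"),
    ("startup software cloud computing code", "Technology"),
    ("fitness nutrition doctor wellness", "Health"),
]
clf = NaiveBayesClassifier()
for text, label in training:
    clf.train(text, label)

print(clf.predict("cloud software for developers"))
```

User feedback of the kind described above could be folded in by calling `train` again with the corrected label.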

2. Identify Tags: Identify common tags associated with the URL or document.
Quickly visualize what the web page or document is about.
Using several NLP techniques (part-of-speech (POS) tagging, WordNet, collocations, named-entity recognition, n-grams), we identify the important terms and phrases associated with a web page or document.
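
As a rough illustration of the n-gram part only, the sketch below counts single words and adjacent word pairs after dropping stopwords. The stopword list and the function name `extract_tags` are assumptions made for this example; POS tagging, WordNet, and entity recognition are not shown.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def extract_tags(text, top_n=3):
    """Return the most frequent content words and bigrams in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    # Count adjacent word pairs in which neither word is a stopword.
    # Sentence boundaries are ignored here for simplicity.
    for first, second in zip(words, words[1:]):
        if first not in STOPWORDS and second not in STOPWORDS:
            counts[f"{first} {second}"] += 1
    return [term for term, _ in counts.most_common(top_n)]

page = ("Machine learning is used for search. "
        "Machine learning models rank search results.")
print(extract_tags(page, top_n=4))
```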

3. Search: Index the data for efficient searching, with full-text search for documents.
Currently done using Lucene / Solr.
In progress: the ability to index the whole page that has been added, and full-text search of PDF and Word documents.
The autocomplete feature provides suggestions for search terms.
Basic Search Engine
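
Lucene and Solr handle indexing in the real system. The sketch below does not use their APIs; it only illustrates the underlying idea of an inverted index and prefix-based autocomplete. The class `TinyIndex` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

class TinyIndex:
    """Inverted index: maps each term to the set of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term (AND search)."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def autocomplete(self, prefix, limit=5):
        """Suggest indexed terms that start with the given prefix."""
        prefix = prefix.lower()
        return sorted(t for t in self.postings if t.startswith(prefix))[:limit]

index = TinyIndex()
index.add(1, "Full text search with Lucene")
index.add(2, "Searching PDF documents")
index.add(3, "Text mining of PDF files")

print(index.search("text pdf"))
print(index.autocomplete("se"))
```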

4. Recommendations 
We primarily use two approaches for computing recommended articles:
1) Content-based systems examine the properties of the items being recommended. For instance, if a user reads many articles on technology startups, we recommend other articles on that topic from our database.
2) Collaborative filtering systems recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users.
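
The two approaches can be sketched as follows. This is a toy illustration, assuming tag overlap as the content signal and Jaccard similarity between users' reading histories as the collaborative signal; the similarity measures actually used are not stated in the text, and all names and data are hypothetical.

```python
def content_based(user_tags, articles, top_n=2):
    """Rank articles by how many tags they share with the user's profile."""
    scored = [(len(user_tags & tags), title) for title, tags in articles.items()]
    scored = [(s, t) for s, t in scored if s > 0]
    return [t for s, t in sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]]

def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user, reads, top_n=3):
    """Score unread articles by the similarity of the users who read them."""
    scores = {}
    for other, articles in reads.items():
        if other == user:
            continue
        sim = jaccard(reads[user], articles)
        for article in articles - reads[user]:
            scores[article] = scores.get(article, 0.0) + sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, score in ranked[:top_n] if score > 0]

# Hypothetical article tags and reading histories.
articles = {
    "Seed funding 101": {"startups", "funding"},
    "Kubernetes costs": {"cloud", "devops"},
    "Sourdough basics": {"baking"},
}
reads = {
    "alice": {"startup-funding", "cloud-pricing", "ml-intro"},
    "bob": {"startup-funding", "cloud-pricing", "growth-hacking"},
    "carol": {"gardening-tips", "ml-intro"},
    "dave": {"marathon-training"},
}

print(content_based({"startups", "cloud", "funding"}, articles))
print(recommend("alice", reads))
```

In the collaborative case, articles read only by users with no overlap (here `dave`) score zero and are left out.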

5. Extract Named Entities
Named-entity recognition (NER) is an information extraction procedure that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, and locations. NER helps in understanding the text better and also improves other tasks such as search and text analysis.
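
Production NER usually relies on trained statistical models. The sketch below shows only the simplest possible form, a lookup against a list of known names (a gazetteer), to make the "locate and classify" step concrete. The gazetteer contents and the function name are assumptions for illustration.

```python
import re

# Hypothetical gazetteer: known names mapped to entity types.
GAZETTEER = {
    "Marie Curie": "PERSON",
    "Pierre Curie": "PERSON",
    "Paris": "LOCATION",
    "Sorbonne": "ORGANIZATION",
}

def find_entities(text):
    """Return (name, type, start offset) for each gazetteer match, in text order."""
    matches = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            matches.append((m.start(), name, label))
    return [(name, label, start) for start, name, label in sorted(matches)]

for entity in find_entities("Marie Curie taught at the Sorbonne in Paris."):
    print(entity)
```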

6. Relation / Ontology: Identify the relations
Bill Gates
The Knowledge Graph also helps us understand the relationships between things. Marie Curie is a person in the Knowledge Graph, and she had two children, one of whom also won a Nobel Prize, as well as a husband, Pierre Curie, who claimed a third Nobel Prize for the family. All of these are linked in our graph. It’s not just a catalog of objects; it also models all these inter-relationships. It’s the intelligence between these different entities that’s the key.
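
One common way to model such inter-relationships is as (subject, relation, object) triples. The sketch below is a minimal in-memory version using the Curie family example; the class `KnowledgeGraph` and the relation names are assumptions, not a description of any particular graph store.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Stores (subject, relation, object) triples and answers simple lookups."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def related(self, subject, relation):
        """Return all objects linked to the subject by the given relation."""
        return [o for r, o in self.edges.get(subject, []) if r == relation]

kg = KnowledgeGraph()
kg.add("Marie Curie", "spouse", "Pierre Curie")
kg.add("Marie Curie", "child", "Irène Joliot-Curie")
kg.add("Marie Curie", "child", "Ève Curie")
kg.add("Irène Joliot-Curie", "award", "Nobel Prize in Chemistry")

# Two-hop query: which of Marie Curie's children have a recorded award?
laureates = [c for c in kg.related("Marie Curie", "child") if kg.related(c, "award")]
print(laureates)
```

Answering the two-hop question requires following links between entities, which a flat catalog of objects could not do.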

7. Interest Graph
There are a number of uses for interest graphs both from a personal and business standpoint. Interest graphs can be applied in conjunction with social graphs as a way to meet or connect to people in a social network or community who have shared or common interests, and who may not necessarily otherwise know each other. [7][12] 
Interest graphs can also be applied to marketing for purposes such as audience analytics and audience-based buying,[13] for sentiment analysis,[14] and for advertising as another form of behavioral profiling and targeting based on interests.[7][10] As an example, through the use of interest graphs companies like Twitter are able to target ads more specifically based on their users’ individual interests.[15] Interest graphs may be applied to product development by using customer interests to help determine which new features or capabilities to provide in future versions of a product.[5] Interest graphs have many other uses as well[11] including simulation,[16] research and other content discovery and filtering tasks,[17][18] as input to recommendation engines for films, books, music, etc.,[19] and for learning and education.[20]
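
To illustrate the first use mentioned above, connecting people who share interests but do not know each other, here is a small sketch. It assumes the interest graph is stored as a mapping from users to interests and the social graph as a mapping from users to friends; all names and data are hypothetical.

```python
from collections import defaultdict

def suggest_connections(user, interests, friends):
    """List unconnected users who share interests with `user`, most overlap first."""
    # Invert the user -> interests mapping into interest -> users.
    by_interest = defaultdict(set)
    for person, topics in interests.items():
        for topic in topics:
            by_interest[topic].add(person)

    shared = defaultdict(set)
    already_connected = friends.get(user, set())
    for topic in interests[user]:
        for other in by_interest[topic]:
            if other != user and other not in already_connected:
                shared[other].add(topic)
    return sorted(shared.items(), key=lambda kv: (-len(kv[1]), kv[0]))

interests = {
    "alice": {"machine learning", "startups", "cycling"},
    "bob": {"machine learning", "startups"},
    "carol": {"cycling", "cooking"},
    "dave": {"startups"},
}
friends = {"alice": {"dave"}}  # social graph: alice already knows dave

for person, topics in suggest_connections("alice", interests, friends):
    print(person, sorted(topics))
```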

8. Identify Communities / groups / networks
Segment users into groups based on their shared interests. This is useful for identifying influencers and domain experts, for targeted marketing, and so on.
Some applications are identifying customers with similar interests who are geographically near each other, and identifying people with similar interests and purchasing history.
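
One simple way to form such groups, shown here only as a sketch, is to link any two users whose interest sets are similar enough and then take the connected components of the resulting graph. The Jaccard measure, the 0.5 threshold, and the sample users are all assumptions chosen for illustration.

```python
def communities(interests, threshold=0.5):
    """Group users whose interest sets overlap, via connected components."""
    users = sorted(interests)
    neighbours = {u: set() for u in users}
    # Link two users when the Jaccard similarity of their interests is high enough.
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            union = interests[a] | interests[b]
            if union and len(interests[a] & interests[b]) / len(union) >= threshold:
                neighbours[a].add(b)
                neighbours[b].add(a)

    # Depth-first traversal collects each connected component as one group.
    seen, groups = set(), []
    for start in users:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            current = stack.pop()
            group.append(current)
            for n in neighbours[current] - seen:
                seen.add(n)
                stack.append(n)
        groups.append(sorted(group))
    return groups

interests = {
    "alice": {"python", "ml", "startups"},
    "bob": {"python", "ml"},
    "carol": {"cooking", "travel"},
    "dave": {"cooking", "travel", "wine"},
    "erin": {"chess"},
}
print(communities(interests))
```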

9. Text Mining
Discover knowledge from text

Text mining, also called text data mining, intelligent text analysis, or knowledge discovery in text, and roughly equivalent to text analytics, is the process of deriving high-quality information from text.
It uncovers previously invisible patterns in existing resources. These patterns can be used for analysis, decision-making, and knowledge management tasks.

10. Machine Learning Algorithms
Classification / tagging: We collected data from various open data sources such as DMOZ and Freebase, and applied machine learning algorithms that predict the category of websites.
Click prediction: We use the clicking behavior of users to predict the articles they are interested in, for showing recommendations or ads.
The machine learning algorithm is trained using available data. Prediction is then done for new instances.
A supervised method is used when there is known input/output data for training.
An unsupervised method is used when there is no such data.
Categorical: when the output falls into discrete classes (classification).
Continuous: when the output is a continuous variable (regression).
