The content of this blog is based on the course COM6115 (Text Processing, Autumn 2019/20) at the University of Sheffield.
All knowledge rights are reserved by the owners of the course materials, Dr Mark Hepple and Dr Chenghua Lin at the University of Sheffield.
[toc]
Information Retrieval (IR): concerned with developing algorithms and models for retrieving relevant documents from text collections.
How can I formulate a query?
Normally keywords; it could also be natural language.
How are the documents represented?
indexing
How does the system find the best-matching document?
retrieval model
How are the results presented to me?
unsorted list, ranked list, clusters
How do we know whether the system is any good?
evaluation
The task of finding terms that describe documents well
MeSH — Medical Subject Headings (manual indexing)
a very large controlled vocabulary for describing/indexing medical documents, e.g. journal papers and books. It provides a hierarchy of descriptors (a.k.a. subject headings).
The hierarchy has a number of top-level categories, e.g.: Anatomy [A]; Organisms [B]; Diseases [C]; Chemicals and Drugs [D]; Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]; Psychiatry and Psychology [F]; Biological Sciences [G]
Evaluation:
- Advantages: 1) high-precision searches; 2) works well for closed collections (e.g. books in a library).
- Disadvantages: 1) searchers need to know the terms to achieve high precision; 2) labellers need to be trained to achieve consistency; 3) collections are dynamic → schemes change constantly.
Automatic Indexing:
Use natural language as the indexing language.
Implementation: inverted files
- record, for each term, the ids of the documents in which it appears
  - only matters whether it appears or not, not how many times
  - (doc_id)
- also record the count of occurrences within each document
  - helps find documents more relevant to the query
  - (doc_id : occurrence_count)
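A minimal sketch of such an inverted index with per-document counts in Python (the two toy documents are invented, echoing the Section B exercise below):

```python
# Inverted index sketch: term -> {doc_id: occurrence_count}.
from collections import defaultdict

docs = {
    1: "sea shell buy sea shell",
    2: "buy lovely sea shell sea produce market",
}

index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        # increment this term's count for this document
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

print(index["sea"])   # {1: 2, 2: 2}
```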
It is common to just use the words, but pre-process them for generalisation.
Method:
Tokenisation: split words from punctuation (get rid of punctuation)
Capitalisation: normalise all words to lower (or upper) case
Lemmatisation: conflate different inflected forms of a word to their basic form (singular, present tense, 1st person):
Stemming: strip word endings to conflate related forms to a common stem (e.g. marketing → market)
Normalisation: heuristics to conflate variants due to spelling, hyphenation, spaces, etc.
e.g. USA and U.S.A. and U S A → USA
e.g. chequebook and cheque book → cheque book
e.g. word-sense and word sense → word-sense
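A toy preprocessing pipeline combining these steps. The stoplist and the crude suffix-stripping stemmer here are illustrative stand-ins; a real system would use a proper stoplist and, e.g., the Porter stemmer:

```python
import re

STOPWORDS = {"the", "my", "at", "a", "is"}  # invented toy stoplist

def toy_stem(word):
    # Crude suffix stripping for illustration only.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # tokenise + lowercase
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Buy my lovely sea shells at the market!"))
# ['buy', 'lovely', 'sea', 'shell', 'market']
```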
Problem of single- and multi-word terms
Positional indexes: additionally record the position of each occurrence within a document — (doc_id : position) — which supports phrase and proximity queries.
Boolean Method (Binary)
- Purpose:
  - evaluate the presence of query terms as sufficient evidence for a match
  - provides a simple logical basis for deciding whether any document should be returned
  - relevance is decided from whether the basic terms of the query do/do not appear in the document, and from the meaning of the logical operators
By contrast, ranked retrieval algorithms:
- use the frequency of document terms
- do not require all search terms to be present in a document
- e.g. the vector space model
Approach:
- basic search terms (keywords)
- combined using Boolean operators
Boolean Operators:
- AND, OR, NOT, BUT, XOR (exclusive OR)
- E.g.: Monte-Carlo AND (importance OR stratification) BUT gambling
Set-theoretic interpretation:
- let $d(E)$ denote the document set for expression $E$
- $E$ is either a basic term or a Boolean expression
- $d(E_1 \text{ AND } E_2) = d(E_1) ∩ d(E_2)$
- $d(E_1 \text{ OR } E_2) = d(E_1) ∪ d(E_2)$
- $d(\text{NOT } E) = d(E)^c$
- $d(E_1 \text{ BUT } E_2) = d(E_1) − d(E_2)$
- Some example:
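A sketch of the set-theoretic interpretation in Python, using sets of document ids (the index contents and ids are invented):

```python
# Boolean retrieval over an inverted index of the form term -> set of doc ids.
index = {
    "monte-carlo":    {1, 2, 5},
    "importance":     {2, 3},
    "stratification": {5, 7},
    "gambling":       {2},
}
all_docs = {1, 2, 3, 4, 5, 6, 7}

def d(term):
    return index.get(term, set())

# Monte-Carlo AND (importance OR stratification) BUT gambling
result = (d("monte-carlo") & (d("importance") | d("stratification"))) - d("gambling")
print(result)                      # {5}

not_gambling = all_docs - d("gambling")   # d(NOT E) = complement
print(not_gambling)                # {1, 3, 4, 5, 6, 7}
```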
Conclusion:
- Advantages:
  - Expert users can formulate high-precision queries (though expert knowledge is needed to do so)
  - Often used by bibliographic search engines (libraries)
- Disadvantages:
  - Most users are not familiar with writing Boolean queries → not natural
  - Most users don't want to wade through 1000s of results in an unranked list → acceptable only for very specific searches in small collections
  - This is particularly true of web search → very large document sets
Vector Space Model
Description:
- Documents are points in high-dimensional vector space
- each term in index is a dimension → sparse vectors
- values are frequencies of terms in documents, or variants of frequency
- Queries are also represented as vectors (for terms that exist in index)
Premises:
- Select document(s) with highest document–query similarity
- Document–query similarity is a model for relevance (ranking)
- With ranking, the number of returned documents is less relevant → users start at the top and stop when satisfied
Approach:
compare vector of query against vector of each document
- to rank documents according to their similarity to the query
Each document and query is represented as a vector of $n$ values:
$\vec{d}^{\,i} = (d_1^i, d_2^i, d_3^i, \cdots, d_n^i)$, $\quad \vec{q} = (q_1, q_2, q_3, \cdots, q_n)$
Similarity (Euclidean distance): \(\sqrt{\sum\limits_{k=1}^{n}(q_k - d_k)^2}\)
E.g.:
Doc1 and Q
$\sqrt{(9-0)^2 + (0-1)^2 + (1-0)^2 + (0-1)^2} = \sqrt{84} \approx 9.17$
Doc2 and Q
$\sqrt{(0-0)^2 + (1-1)^2 + (0-0)^2 + (10-1)^2} = \sqrt{81} = 9$
Evaluation:
- the distance is large for vectors of different lengths, even when they differ in only one term's frequency (e.g. Doc2 and Q)
- this means the raw frequency of terms is given too much impact
A better similarity metric, used in the vector space model, is the cosine of the angle between two vectors $\vec{x}$ and $\vec{y}$ (where each $x_i$ and $y_i$ is the weight of term $i$; in the simplest, binary case, 1 if the term occurs and 0 otherwise): \(cos(\vec{x},\vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\left|\vec{x}\right| \left|\vec{y}\right|} = \frac{\sum\limits_{i=1}^{n}x_iy_i}{\sqrt{\sum\limits_{i=1}^{n}x_i^2}\sqrt{\sum\limits_{i=1}^{n}y_i^2}}\)
It can be interpreted as the normalised correlation coefficient:
i.e. it computes how well the $x_i$ and $y_i$ correlate, and then divides by the length of the vectors, to scale for their magnitude.
Value:
- range from:
- 1, for vectors pointing in the same direction, to
- 0, for orthogonal vectors,
- -1, for vectors pointing in opposite directions
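A small sketch of the cosine computation, reusing the Doc1/Doc2/Query vectors from the Euclidean-distance example above:

```python
import math

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return dot / norm if norm else 0.0

q    = [0, 1, 0, 1]    # query vector from the example above
doc1 = [9, 0, 1, 0]
doc2 = [0, 1, 0, 10]

print(round(cosine(q, doc1), 3))  # 0.0   -> shares no terms with the query
print(round(cosine(q, doc2), 3))  # 0.774 -> similar direction despite higher counts
```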
Term Weighting: Inverse Document Frequency ( **IDF** )
Informativeness of terms:
- Idea: less common terms are more useful for finding relevant documents; more common terms are less useful
- document frequency ($df_w$) reflects this difference
- collection frequency ($cf_w$) fails to distinguish such terms (i.e. their counts can be very similar)
- Informativeness is thus inversely related to document frequency
- Compute: \(\frac{\left|D\right|}{df_w}\)
  - the value reduces as $df_w$ gets larger, tending to 1 as $df_w$ approaches $\left|D\right|$; it is very large for small $df_w$ — over-weighting such cases
- To moderate this, take the $\log$: inverse document frequency ( **IDF** ) \(idf_{w,D} = \log\frac{\left|D\right|}{df_w}\)
But not all terms describe a document equally well: terms which are frequent in a document describe it better: \(\mathbf{tf_{w,d} = freq_{w,d}}\)
Putting it all together: tf.idf
Combine the two to give tf.idf term weighting: \(\mathbf{tf.idf_{w,d,D} = tf_{w,d}\cdot\,idf_{w,D}}\)
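A toy tf.idf computation (the documents are invented; log base 10 is assumed, as in the worked example in Section B below):

```python
import math

# Documents as bags of words: term -> raw frequency.
docs = {
    "d1": {"sea": 2, "shell": 2, "buy": 1},
    "d2": {"sea": 2, "shell": 1, "produce": 1, "market": 1},
}
N = len(docs)

def idf(term):
    df = sum(1 for d in docs.values() if term in d)   # document frequency
    return math.log10(N / df) if df else 0.0

def tf_idf(term, doc):
    return docs[doc].get(term, 0) * idf(term)

print(tf_idf("shell", "d1"))  # 2 * log10(2/2) = 0.0 (occurs in every doc)
print(tf_idf("buy", "d1"))    # 1 * log10(2/1) ≈ 0.301
```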
PageRank Algorithm
- Key method to exploit link structure of web: PageRank algorithm
- Assigns a score to each page on web: its PageRank score
- can be seen to represent the page’s authority (or quality)
- Idea:
- Link from page A to page B confers authority on B
- how much authority is conferred depends on:
- the authority (PageRank score) of A, and its number of out-going links
- i.e. A’s authority is shared out amongst its out-going links
- note that this measure is recursively defined
- i.e. score of any page depends on score of every other page
- PageRank scores have an alternative interpretation:
- probability that a random surfer will visit that page
- its PageRank score: a measure of its authority
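A power-iteration sketch of the random-surfer view (the damping factor 0.85 and the three-page graph are illustrative assumptions):

```python
# PageRank by power iteration; links maps each page to its out-links.
def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start with uniform scores
    for _ in range(iters):
        new = {}
        for p in pages:
            # authority flowing in: each linker shares its score among its out-links
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
print(pagerank(links))  # C accumulates the most authority (linked by A and B)
```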
Evaluation of IR systems – Metrics
Aim: retrieve as many relevant documents as possible while retrieving as few non-relevant documents as possible.
The two aspects of this aim are addressed by two separate measures — recall and precision.
Trade-off Between Recall and Precision
Recall and Precision - Which is the better system?
F-measure
F-measure (also called F1):
- combines precision and recall into a single figure; it is their harmonic mean and gives equal weight to both: \(\mathbf{F = \frac{2PR}{P + R}}\) where P = precision and R = recall
- penalises low performance in one value more than the arithmetic mean would
- the general form $F_\beta = \frac{(1+\beta^2)PR}{\beta^2 P + R}$ allows one measure to be weighted more heavily than the other
Precision at a cutoff
Measures how well a method ranks relevant documents before non-relevant documents.
Average Precision
- aggregates many precision numbers into one evaluation figure
- precision is computed at each point in the ranking where a relevant document is found, and these figures are averaged
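A sketch of the computation (the ranking and relevance judgements are invented):

```python
# Average precision: mean of precision values at each rank where a
# relevant document is retrieved.
def average_precision(ranked, relevant):
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this cutoff
    return sum(precisions) / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5
```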
Abstract
- Explain the relevance of the topic
- Differentiate between objective and subjective texts
- List the main elements in a sentiment analysis system
- Provide a critical summary of the main approaches to the problem
- Explain how sentiment analysis systems are evaluated.
General Goal:
- Extract opinions, sentiments and emotions expressed by humans in texts, and use this information for business, intelligence, etc. purposes. This can't be done manually: there are huge volumes of opinionated text (especially Big Data on the Web).
- Product review mining: Which features of the iPhone 11 do customers like and which do they dislike?
- Review classification: Is a movie review positive or negative?
- Tracking sentiments toward topics over time: Is anger about the government policies growing or cooling down?
- Prediction (election outcomes, market trends): Will the Tories win the next election?
Sentiment Analysis
- The task of opinion mining is: given an opinionated document,
  - discover all quintuples (entity, aspect, opinion orientation, holder, time), or
  - discover some of these components
- With that, one can structure the unstructured:
- Traditional data and visualisation tools can be used to slice, dice and visualise the results.
- Qualitative and quantitative analysis can be done.
Rule-based subjectivity classifier: a sentence/document is subjective if it has at least n (say 2) words from the emotion-word lexicon; it is objective otherwise.
Rule-based sentiment classifier: for subjective sentences/documents, count the positive and negative words/phrases in the sentence/document.
If there are more negative than positive words/phrases, classify it as negative; otherwise as positive (if equal, neutral).
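A minimal sketch of both rule-based classifiers as just described (the lexicons and the threshold n = 2 are invented):

```python
# Toy subjectivity + sentiment rules over invented word lists.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "awful"}
EMOTION = POSITIVE | NEGATIVE

def is_subjective(tokens, n=2):
    # subjective if at least n emotion-lexicon words occur
    return sum(t in EMOTION for t in tokens) >= n

def sentiment(tokens):
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"

tokens = "the plot was boring but the acting was great great".split()
print(is_subjective(tokens), sentiment(tokens))  # True positive
```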
Rule-based sentiment classifier (feature-based):
- Use of ranges of sentiment instead of a binary system, to deal with intensifiers like:
- absolutely, utterly, completely, totally, nearly, virtually, essentially, mainly, almost, e.g.: absolutely awful
- And grading adverbs like:
- Very, little, dreadfully, extremely, fairly, hugely, immensely, intensely, rather, reasonably, slightly, unusually, e.g.: a little bit cold
Rule-based gradable sentiment classifier
The lexicon: word lists with pre-assigned emotional weights, e.g.:
Negative dimension ($C_{neg}$): {-5,…,-1}; positive dimension ($C_{pos}$): {+1,…,+5}
Additional general rules to change the original weights:
**Intensifier rule**:
**Diminisher rule**:
**Exclamation rule**:
**Emoticon rule**:
Each has its own emotional weight, like an emotion word.
E.g.: Emotion(🙂) = +2; Emotion(🙁) = −2.
E.g.: “I can’t believe this product 😣”
Decision: sum the (rule-adjusted) weights of all emotion words; the text is positive if the total is greater than 0, negative if less than 0, and neutral if 0.
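A sketch of a gradable classifier applying a few of these rules and this decision. All weights and rule effects below are illustrative assumptions, not the course's exact values:

```python
# Gradable classifier sketch: intensifier / diminisher / exclamation / emoticon rules.
LEXICON = {"awful": -4, "cold": -2, "brilliant": 2}
INTENSIFIERS = {"absolutely": 1, "extremely": 2}   # added to the weight's magnitude
DIMINISHERS = {"slightly": 1, "little": 1}         # subtracted from the magnitude
EMOTICONS = {":)": 2, ":(": -2}                    # emoticons carry their own weight

def score(tokens):
    total = 0
    for i, tok in enumerate(tokens):
        w = LEXICON.get(tok, 0) + EMOTICONS.get(tok, 0)
        if w and i > 0:
            prev = tokens[i - 1]
            shift = INTENSIFIERS.get(prev, 0) - DIMINISHERS.get(prev, 0)
            w += shift if w > 0 else -shift   # move away from / toward zero
        total += w
    if tokens and tokens[-1] == "!":          # exclamation strengthens the total
        total += 1 if total > 0 else -1 if total < 0 else 0
    return total

print(score("absolutely awful".split()))   # -4 - 1 = -5  -> negative
print(score("slightly cold :(".split()))   # (-2 + 1) + (-2) = -3 -> negative
```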
Building the lexicon: collect relevant words/phrases that can be used to express sentiment, and determine the emotion of these subjective words/phrases.
Basic Knowledge:
Example
Prior:
Likelihoods
Relative frequencies for the prior $P(c_i)$ and the likelihoods $P(t_j|c_i)$ make up the model in a Naive Bayes classifier.
At decision (test) time, given a new segment to classify, the model is applied to find the most likely class for the segment: \(\mathbf{\hat{c} = \underset{c_i}{\arg\max}\;P(c_i)\prod\limits_{j=1}^{n}P(t_j|c_i)}\)
e.g.
$P(positive) \cdot P(great|positive) \cdot P(great|positive) \cdot P(great|positive)$
$= 3/7 × 1/10 × 1/10 × 1/10 = 0.00043$
$P(negative) \cdot P(great|negative) \cdot P(great|negative) \cdot P(great|negative)$
$= 4/7 × 2/8 × 2/8 × 2/8 = 0.00893$
What if a test term never occurs with a class in training, so both the positive and negative products equal zero?
Add smoothing to the feature counts (add 1 to every count).
Likelihoods: \(\mathbf{P(t_j|c_i) = \frac{count(t_j,c_i)+1}{count(c_i)+\left|V\right|}}\) where $\left|V\right|$ is the number of distinct terms seen in training (across all classes) and $count(c_i)$ is the total number of term occurrences in class $c_i$.
The prior $P(c_i)$ is still the number of documents of class $c_i$ divided by the total number of documents.
Example:
$P(positive) \cdot P(boring|positive) \cdot P(annoying|positive) \cdot P(unimaginative|positive)$
$= 3/7 × \frac{0+1}{10+12} × \frac{0+1}{10+12} × \frac{0+1}{10+12} = 0.000040$
$P(negative) \cdot P(boring|negative) \cdot P(annoying|negative) \cdot P(unimaginative|negative)$
$= 4/7 × \frac{0+1}{8+12} × \frac{0+1}{8+12} × \frac{1+1}{8+12} = 0.000143$
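A runnable sketch of the smoothed classifier (the toy training documents are invented; log-probabilities are used to avoid underflow):

```python
import math
from collections import Counter

# Add-one smoothed Naive Bayes over invented toy training documents.
train = [
    ("positive", "great great fun"),
    ("positive", "great acting"),
    ("negative", "boring plot"),
    ("negative", "annoying and unimaginative"),
]

class_docs = Counter(c for c, _ in train)          # document counts per class (prior)
word_counts = {c: Counter() for c in class_docs}   # term counts per class
for c, text in train:
    word_counts[c].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(c, tokens):
    total = sum(word_counts[c].values())
    logp = math.log(class_docs[c] / len(train))    # log prior
    for t in tokens:
        # add-one smoothing: (count + 1) / (class total + |V|)
        logp += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
    return logp

tokens = "boring boring".split()
print(max(class_docs, key=lambda c: log_posterior(c, tokens)))  # negative
```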
Evaluation:
It's simple and works well if the data is not sparse
How can we improve?
Can contrast direct opinions versus more complex comparative opinions:
How do we quantify how well our Sentiment Analysis systems work?
Create experimental datasets (aka test corpora)
Compare (positive vs negative) system to human classifications
Compute metrics like
Scenario:
Goals:
- Decide on content: to determine what information to communicate
- Decide on rhetorical structure: to determine how to structure the information to make a coherent text
Definition
From each text in a set of unstructured natural language texts identify information about predefined classes of entities, relationships or events and record this information in a structured form by either:
- Annotating the source text, e.g. using XML tags;
- Filling in a data structure separate from the text, e.g. a template, database record, or “stand-off annotation”
- The activity of populating a structured information repository (database) from an unstructured, or free text, information source
- The activity of creating a semantically annotated text collection (cf.“The Semantic Web”)
Purpose:
- searching or analysis using conventional database queries
- data-mining;
- generating a summary (perhaps in another language);
Task
- Given: a document collection and a predefined set of entities, relations and/or events
- Return: a structured representation of all mentions of the specified entities, relations and/or events
Evaluation:
- Strengths
- Extracts facts from texts, not just texts from text collections
- Can feed other powerful applications (databases, semantic indexing engines, data mining tools)
- Weakness:
- Systems tend to be genre/domain specific and porting to new genres and domains can be time-consuming/requires expertise
- Limited accuracy
- Computationally demanding, so performance issues on very large collections
Knowledge Engineering Approaches
Supervised Learning Approach
- Systems are given texts with manually annotated entities + relations
- For each entity/relation create a training instance
- k words either side of an entity mention
- k words to the left of entity 1 and to the right of entity 2 plus the words in between
- Training instances represented in terms of features
- words, parts of speech, orthographic characteristics, syntactic info
- Systems may learn
- patterns that match extraction targets
- Classifiers that classify tokens as beginning / inside / outside a tag type
- Learning techniques include: covering algorithms, HMMs, SVMs
Bootstrapping Approaches
- A technique for relation extraction that requires only minimal supervision
- Systems are given
- seed tuples
- seed patterns
- System searches in large corpus for
- occurrences of seed tuples, and then extracts a pattern that matches the context of the seed tuple
- matches of seed patterns from which it harvests new tuples
- New tuples are assumed to stand in the required relation and are added to the tuple store
Distant Supervision Approaches
- Sometimes also called “weakly labelled” approaches
- Assumes a (semi-)structured data source, such as Freebase,
- which contains tuples of entities standing in the relation of interest and, ideally, a pointer to a source text
- Tuples from data source are used to label
- the text with which they are associated, if available
- documents from the web, if not
- Labelled data is used to train a standard supervised named entity or relation extraction system
Knowledge Engineering Approaches to NER
- Such systems typically use
- named entity lexicons and
- manually authored pattern/action rules or regular expressions
- System has three main stages:
- Lexical processing
- NE parsing
- Discourse interpretation
- Lexical processing
- Many rule-based NER systems made extensive use of specialised lexicons of proper names, such as gazetteers – lists of place names
- Why not use even larger gazetteers?
- Many NEs occur in multiple categories — the larger the lexicons, the greater the ambiguity
- the listing of names is never complete, so a mechanism to type unseen NEs is needed in any case
- Principles
- Tokenisation, sentence splitting, morphological analysis
- Part-of-speech tagging
- tags known proper-name words and unknown uppercase-initial words as proper names (NNP, NNPS; NNP = proper noun)
- Name List / Gazetteer Lookup and Tagging (organisations, locations, persons, company designators, person titles)
- Trigger Word Tagging
- certain words in multi-word names function as trigger words, permitting classification of the name
- the system has trigger words for various organisations, government institutions and locations (a toy pattern/action sketch follows this section)
- NE Parsing
- Using preset rules to identify unclassified proper names
- Discourse Interpretation — Coreference Resolution
  - When the name class of an antecedent (or anaphor) is known, establishing coreference allows the name class of the anaphor (or antecedent) to be established.
  - An unclassified PN (proper name) may be co-referential with a variant form of a classified PN.
- Ford – Ford Motor Co.
- An unclassified PN may be co-referential with a definite NP which permits the PN’s class to be inferred
- E.g. Kellogg … the breakfast cereal manufacturer
- Semantic type information
- noun-noun qualification
- When an unclassified PN qualifies an organisation-related object then the PN is classified as an organisation; e.g. Erickson stocks
- Possessives
- when an unclassified PN stands in a possessive relation to an organisation post, the PN is classified as an organisation; e.g. vice president of ABC, ABC’s vice president
- Apposition
- when an unclassified PN is apposed with a known organisation post, the former is classified as a person name; e.g. Miodrag Jones, president of XYZ
- Evaluation
- Strengths
- High performance - only several points behind human annotators
- Transparent - easy to understand what system is doing/why
- Weakness
- Porting to another domain requires substantial rule re-engineering
- Acquisition of domain-specific lexicons
- Rule writing requires high levels of expertise
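To make the pattern/action rule idea concrete, here is a toy sketch combining a gazetteer lookup with a trigger-word rule; the lists and the regular expression are invented:

```python
import re

# Gazetteer lookup plus a trigger-word rule for organisations.
GAZETTEER = {"Sheffield": "LOCATION", "IBM": "ORGANISATION"}
ORG_TRIGGERS = r"(Co\.|Corp\.|Inc\.|Ltd\.)"

def tag_entities(text):
    entities = []
    for name, label in GAZETTEER.items():          # known names
        if name in text:
            entities.append((name, label))
    # trigger-word rule: capitalised words ending in a company designator
    for m in re.finditer(r"((?:[A-Z][a-z]+\s)+" + ORG_TRIGGERS + ")", text):
        entities.append((m.group(1), "ORGANISATION"))
    return entities

print(tag_entities("Ford Motor Co. opened an office in Sheffield."))
# [('Sheffield', 'LOCATION'), ('Ford Motor Co.', 'ORGANISATION')]
```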
Supervised learning approaches to NER
- Task
- Supervised learning approaches aim to address the portability problems inherent in knowledge engineering NER
- Instead of manually authoring rules, systems learn from annotated examples
- Moving to new domain requires only annotated data in the domain – can be supplied by domain expert without need for expert computational linguist
- Sequence Labelling
- Systems may learn
- patterns that match extraction targets
- classifiers that label tokens as beginning/inside/outside a tag type
- In sequence labelling for NER, each token is given one of three label types: B (beginning of an NE), I (inside an NE) or O (outside any NE)
- The scheme is called BIO, or sometimes IOB
- Each training instance (token) is typically represented as a set of features
- Features can describe not only the token itself but neighbouring tokens as well
- the classifier extracts features from
- input string
- left predictions
- Carreras et al. (2003)
- NE detection: in a first pass over the text BIO tags are assigned without regard to type.
- NE classification: in a second pass the NE’s detected in the first pass are assigned a class (organisation, person, location, etc.)
- Adaboost classifier
- Type pattern of consecutive words in context
- functional (f)
- capitalized (C)
- lowercased (l)
- punctuation mark (.)
- quote ()
- other(x)
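A sketch of how BIO-labelled training instances with simple token features might look (the sentence and the feature set are invented):

```python
# BIO-encoded tokens plus simple features of the token and its neighbours.
sentence = ["Mark", "Hepple", "teaches", "at", "Sheffield", "."]
labels   = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]

def features(tokens, i):
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalised": tok[0].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

training = [(features(sentence, i), labels[i]) for i in range(len(sentence))]
print(training[0])
# ({'word': 'mark', 'is_capitalised': True, 'prev_word': '<s>',
#   'next_word': 'hepple'}, 'B-PER')
```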
- Entity Linking
- KBP: Knowledge Base Population
- Facts are gathered from open-access web sources and used to build a structured information repository.
- Entity Linking Task: Given a text with a recognised NE mention in that text and a knowledge base (KB), such as Wikipedia, link the NEs to the matching entry in the KB if there is one, else create an entry.
- Very Difficult Task
- Approach:
- Given a text T containing an NE mention m and using Wikipedia as a KB
  1. index all pages in the KB using an information retrieval system
  2. build a query from T (e.g. use the sentence/paragraph/whole text containing m) and search the KB
  3. from the ranked list of KB pages returned by step 2, pick the highest-ranked page whose name matches m and return it
- Doesn’t work very well
Knowledge Engineering Approach
Uses authored rules; can be divided into:
- Shallow: using the pattern-action rules
- Deep: linguistically inspired “language understanding” systems
- typically parse input using broad coverage NL parser to identify key grammatical relations, like subject and object
- use transduction rules to extract relations of interest from parser output
- Extraction rules over parser output allow a wider set of expressions to be captured than with regex’s over words and NE tags alone
Strengths:
- high precision
- System behaviour is human-comprehensible
Weakness:
- The writing of rules has no end
- New rules are needed for every new domain (pattern/action rules for shallow approaches; transduction rules for deep approaches)
Supervised Learning Approaches
What to be learned?
- Rules that
  - match all and only relation-bearing sentences
  - capture substrings within the matched text that correspond to relation arguments
- Binary classifier
- applied to instances containing entities of the types between which the relation holds
- As with NER can be divided into detection and classification stages:
Process:
- Assume entities to be related already tagged
- Use any algorithm for learning binary classifiers to learn to distinguish instances (typically sentences) where
  - entities co-occur and the relation holds (positive instances)
  - entities co-occur and the relation does not hold (negative instances)
- Features used fall into 3 broad classes:
- Features of the named entities
- Features from the words in the text, usually words from 3 locations
- Features about the entity pair within the sentence
Strengths:
- No need to write extensive/complex rule sets for each domain
- Same system straightforwardly adapts to any new domain, provided training data is supplied.
Weakness:
- Quality of relation extraction dependent on quality and quantity of training data, which can be difficult and time consuming to generate
- Developing feature extractors can be difficult and they may be noisy (e.g. parsers) reducing overall performance
Bootstrapping Approaches
Motivation:
- reduce number of manually labelled examples needed to build a system
Key Idea:
set of trusted tuples T (e.g. pairs of entities known to stand in the relation of interest), and/or a
set of trusted patterns P (i.e. patterns known to extract pairs of entities in the given relation with high accuracy)
Then, given a document collection D:
- if starting from tuples, find tuples from T in sentences S in D, extract patterns from the context of the sentences in S, add the patterns to P, and then use P to find new tuples in D and add them to T; repeat until convergence.
- if starting from patterns, match patterns from P in sentences S in D, extract tuples from the pattern matches in S, add the tuples to T, and then use the tuples in T to find new patterns in D and add them to P; repeat until convergence.
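A toy sketch of the tuple-seeded loop. The corpus is invented, and the pattern "generalisation" is deliberately naive (the raw middle context between the two entities); a real system would generalise and score patterns:

```python
import re

corpus = [
    "London, the capital of England, is large.",
    "Paris, the capital of France, is beautiful.",
]
T = {("London", "England")}   # trusted seed tuples
P = set()                     # trusted patterns

for _ in range(2):            # two bootstrap rounds for illustration
    for x, y in set(T):       # 1. extract patterns from contexts of known tuples
        for s in corpus:
            if x in s and y in s:
                middle = s[s.index(x) + len(x): s.index(y)]
                P.add(r"(\w+)" + re.escape(middle) + r"(\w+)")
    for p in P:               # 2. match patterns to harvest new tuples
        for s in corpus:
            for m in re.finditer(p, s):
                T.add((m.group(1), m.group(2)))

print(T)  # {('London', 'England'), ('Paris', 'France')}
```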
DIPRE(Dual Iterative Pattern Relation Expansion)
- Aim
- to extract useful relational tuples from the Web, of the form (author, book title)
- Method
- Exploit “duality of patterns and relations”
- Good tuples help find good patterns
- Good patterns help find good tuples
- Starting with user-supplied tuples, iteratively
- Use these tuples to find patterns
- Use the patterns to find more tuples
- Patterns are defined as 5-tuples: (order, urlprefix, prefix, middle,suffix)
- Occurrences are defined as 7-tuples: (author,title, order, url, prefix, middle,suffix)
- An algorithm for generating a pattern given a set of occurrences is described
- The algorithm insists that the order and middle of all occurrences are the same — they form part of the generated pattern
- Additionally pattern contains
- longest matching prefix of the url of all the occurrences
- longest matching suffix of the prefix of all the occurrences
- longest matching prefix of the suffix of all the occurrences
- See Brin (1999) for details
- Patterns are assessed for specificity and rejected if their specificity is too low, i.e. if they are too general.
- Conclusion
- Strengths:
- Need for manually labelled training data is eliminated
- Weaknesses:
- Can suffer from semantic drift – when an erroneous pattern introduces erroneous tuples, which in turn lead to erroneous patterns
- Works well when there is significant redundancy in the assertion of specific tuples and in the use of specific patterns to express a relation
- Issues when multiple relations hold between the same pair of entities
Distant Supervision Approaches
Aim:
- reduce/eliminate the need for manually labelled training data
Key Idea:
- Suppose we have a large document collection D plus a structured data source (e.g. a database) R that contains
- many instances of a relation of interest in, e.g., a relational table
- optionally, for each relation instance a link to a document in D providing evidence for the relation
Method:
- search for sentences in D containing the entity pairs that occur in relation instances (tuples) in R
- label these sentences as positive occurrences of the relation instance
- use the labelled sentences as training data to train a standard supervised relation extractor
Freebase - a free on-line database of structured semantic data
Distant Supervision Assumption
if two entities participate in a relation, any sentence that contains those two entities might express that relation.
- combine features from multiple mentions to get a richer feature vector
- use multiclass logistic regression as a learning framework
- At test time features are combined from all occurrences of a given entity pair in the test data and the most likely relation is assigned
Negative instances
- to obtain these, randomly select entity pairs that do not appear in any Freebase relation and extract features for them
- since the chance that such a randomly selected pair actually stands in a relation is low, the noise these occurrences introduce should be low
Conclusion:
- Strengths:
- Need for manually labelled training data is eliminated
- Can very rapidly get extractors for a wide range of relations
- Weakness:
- Precision still lags behind best knowledge-engineered/directly supervised learning approaches
- Only works if a good supply of structured data is available for the relation(s) of interest
Section A
a) In the context of Information Retrieval, explain the difference between algorithms that perform boolean search and algorithms that perform a ranked search. What type of algorithm would be better for a regular user (such as an undergraduate student in the Humanities area) who is using a search query with multiple terms, which he/she expects to appear in many documents? Explain the reasons behind your choice of algorithm type.
Answer:
For the Boolean search model, the system only needs to decide whether a document is relevant or not: the presence or absence of the query terms is what matters. The model uses Boolean operators such as AND and OR, and the Boolean query provides a simple logical basis for deciding whether a document should be returned.
A ranked search method such as the vector space model relies on the frequency of terms (possibly weighted, e.g. with tf.idf). Documents are points in a high-dimensional vector space whose values are term frequencies, and the query is represented as a vector in the same space. The method computes a similarity metric between the query vector and each document vector (e.g. the cosine, which can be interpreted as a normalised correlation coefficient) and ranks the documents by it.
For this user, a ranked search algorithm should be chosen. Boolean queries are difficult and unnatural for non-expert users, and since the query terms are expected to appear in many documents, a Boolean search would return a very large unranked result set. A ranked algorithm orders the results by relevance, so the user can start at the top and stop when satisfied.
b) Compression techniques are important due to the growth in volume of the data that must be stored and transmitted.
(i) Explain the difference between lossy and lossless forms of compression. Discuss the suitability of these alternative forms of compression for different media types (e.g. for text vs. image data).
Answer:
Data compression is a technique in which the size of data is reduced without loss of useful information. Lossy compression and lossless compression are the two categories of data compression methods.
The main difference between the two is that lossy compression does not restore the data to its original form after decompression, whereas lossless compression rebuilds the data exactly in its original form. Lossy compression is therefore suitable for media such as images, audio and video, where small losses are imperceptible, while text requires lossless compression, since losing characters changes the content.
(ii)Explain the difference between static, semi-static and adaptive techniques for text compression, noting their key advantages and disadvantages.
Answer:
A static model is a fixed model that is known by both the encoder and the decoder and does not depend on the specific data being compressed.
A semi-adaptive (or semi-static) model is a fixed model that is constructed from the data to be compressed.
An adaptive model changes during compression. At a given point in compression, the model is a function of the previously compressed part of the data. Since that part of the data is available to the decoder at the corresponding point in decompression, there is no need to store the model.
c) The two main model components in Statistical Machine Translation are the Translation Model and the Language Model. Explain the role of each of these components. Describe the type of data that is necessary to build each of them. Mention one way in which these components can be combined to build a translation system.
d) Assume we have a small set of seed words with positive and negative opinions, e.g.: positive = {good, fast, cheap} and negative = {slow, boring, fragile}. Explain the two most common (semi-)automated approaches to expand these sets with more opinion words or phrases to create lexica for Sentiment Analysis, providing examples whenever possible. Give one advantage and one disadvantage of each approach.
Answer:
These lexica are created from the seed words, which come from the supervised seed positive and negative sets. The expansion can be done in two ways: a dictionary-based or a corpus-based approach. In the dictionary-based approach, one finds synonyms/antonyms of the seed emotion words in dictionaries such as WordNet (e.g. good → fine, excellent; antonym: bad), or builds patterns from the seed words/phrases and searches large corpora with them. In the corpus-based approach, examples annotated with sentiment are used with machine-learning algorithms to learn a classifier for each sentence or document; one can rely on manually built gold standards or on crowd-annotated resources (e.g. Amazon product reviews). This approach can be divided into two steps: first run a binary subjectivity classifier to identify and eliminate objective segments; then, on the remaining segments, learn how to combine and weight different attributes to make predictions.
An advantage of the dictionary-based approach is that it needs no labelled data and no learning procedure. A disadvantage is that it requires powerful linguistic resources, which are not always available, so its effectiveness cannot be guaranteed.
An advantage of the corpus-based approach is that it is easy to implement and usually works well. As for the disadvantage, it needs annotated data, and its underlying assumptions can be hard to satisfy.
Section B
2.In the context of Information Retrieval, given the following documents:
Document 1: Sea shell, buy my sea shell! Document 2: You may buy lovely SEA SHELL at the sea produce market. Document 3: Product marketing in the Shelly sea is an expensive market. and the query: Query 1: sea shell produce market
a)Apply the following term manipulations on document terms: stoplist removal, capitalisation and stemming, showing the transformed documents. Explain each of these manipulations. Provide the stoplist used, making sure it includes punctuation, but no content words.
Answer:
For the stoplist, it should exclude only “non-content” words, so the stoplist can be: my, may, at, you, the, in, is, an, plus the punctuation , ! .
For the capitalisation, turn all the words to lower case.
For the stemming, turn marketing into market, and product into produce.
So the transformed document should be:
Document 1: sea shell buy sea shell
Document 2: buy lovely sea shell sea produce market
Document 3: produce market shelly sea expensive market
b)Show how Document 1, Document 2 and Document 3 would be represented using an inverted index which includes term frequency information.
Answer (each pair is (doc_id, position); the term frequency of a term in a document is the number of its pairs for that document):
sea: (1,1) (1,4) (2,3) (2,5) (3,4)
shell: (1,2) (1,5) (2,4)
buy: (1,3) (2,1)
lovely:(2,2)
produce: (2,6) (3,1)
market: (2,7) (3,2) (3,6)
shelly:(3,3)
expensive: (3,5)
c)Using term frequency (TF) to weight terms, represent the documents and query as vectors. Produce rankings of Document 1, Document 2 and Document 3 according to their relevance to Query 1 using two metrics: Cosine Similarity and Euclidean Distance. Show which document is ranked first according to each of these metrics.
Answer:
TF:
Query: sea(1) shell(1) produce(1) market(1)
Document1: sea(2) shell(2) produce(0) market(0) buy(1)
Document2: sea(2) shell(1) produce(1) market(1) buy(1) lovely(1)
Document3: sea(1) shell(0) produce(1) market(2) shelly(1) expensive(1)
Evaluation:
Cosine similarity:
Document1: (2×1+2×1+0×1+0×1+0×1) / (2×√(4+4+0+0+1)) = 4/6 ≈ 0.67
Document2: (2×1+1×1+1×1+1×1+0×1+0×1) / (2×√(4+1+1+1+1+1)) = 5/6 ≈ 0.83
Document3: (1×1+0×1+1×1+2×1+0×1+0×1) / (2×√(1+0+1+4+1+1)) = 4/(2×√8) ≈ 0.71
Euclidean Distance:
Document1: √(1+1+1+1+1) = √5 ≈ 2.24
Document2: √(1+0+0+0+1+1) = √3 ≈ 1.73
Document3: √(0+1+0+1+1+1) = √4 = 2
Cosine similarity is a similarity (higher is better), while Euclidean distance is a distance (lower is better): Document 2 is ranked first under both metrics.
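A quick script to verify these figures (the term order is assumed as sea, shell, produce, market, buy, lovely, shelly, expensive):

```python
import math

q  = [1, 1, 1, 1, 0, 0, 0, 0]
d1 = [2, 2, 0, 0, 1, 0, 0, 0]
d2 = [2, 1, 1, 1, 1, 1, 0, 0]
d3 = [1, 0, 1, 2, 0, 0, 1, 1]

def cos(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euc(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

for name, d in [("Doc1", d1), ("Doc2", d2), ("Doc3", d3)]:
    print(name, round(cos(q, d), 3), round(euc(q, d), 3))
# Doc2 wins on cosine (0.833) and on distance (1.732)
```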
d)Explain the intuition behind using TF.IDF (term frequency inverse document frequency) to weight terms in documents. Include the formula (or formulae) for computing TF.IDF values as part of your answer. For the ranking in the previous question using cosine similarity, discuss whether and how using TF.IDF to weight terms instead of TF only would change the results.
Answer:
When term frequency alone is used, the ranking is dominated by how often terms occur in documents. Less common terms can be more useful for finding relevant documents, so we use the inverse document frequency to down-weight terms that occur in many documents: $tf.idf_{w,d,D} = tf_{w,d} \cdot idf_{w,D}$, where $idf_{w,D} = \log\frac{|D|}{df_w}$.
We need the document frequency (df) of each term appearing in the documents and query:
sea: idf = log(3/3) = log 1 = 0
shell: idf: log(3/2)=0.17609125905
buy: idf: log(3/2)=0.17609125905
lovely: idf: log(3/1)=0.477
produce: idf: log(3/2)=0.17609125905
market: idf: log(3/2)=0.17609125905
shelly: idf: log(3/1) = 0.477
expensive: idf: log(3/1)=0.477
Evaluation:
| Terms | Query | Document1 | Document2 | Document3 |
|---|---|---|---|---|
| Sea | 0×1/4 | 0×2/5 | 0×2/6 | 0×1/5 |
| Shell | 0.18×1/4 | 0.18×2/5 | 0.18×1/6 | 0.18×0 |
| Buy | 0 | 0.18×1/5 | 0.18×1/6 | 0.18×0 |
| Lovely | 0 | 0.477×0 | 0.477×1/6 | 0.477×0 |
| Produce | 0.176×1/4 | 0.176×0 | 0.176×1/6 | 0.176×1/5 |
| Market | 0.176×1/4 | 0 | 0.176×1/6 | 0.176×1/5 |
| Shelly | 0 | 0 | 0 | 0.477×1/5 |
| Expensive | 0 | 0 | 0 | 0.477×1/5 |
| | 0.20925 | 0.108 | 0.1975 | 0.0892 |
Document1: (0 + 0.045×0.072) / (0.20925×0.108)
Document2: (0 + 0.03×0.045 + 0.044×0.0293×2) / (0.20925×0.1975)
Document3: (0 + 0.044×0.0352×2) / (0.20925×0.0892)
e)Explain the metrics Precision, Recall and F-measure in the context of evaluation in Information Retrieval against a gold-standard set, assuming a boolean retrieval model. Discuss why it is not feasible to compute recall in the context of searches performed on very large collections of documents, such as the Web.
Answer:
Recall: the proportion of relevant documents that are retrieved.
Precision: the proportion of retrieved documents that are relevant.
Precision and Recall address the relation between the retrieved and relevant sets of documents.
F-measure: combines precision and recall into a single figure, giving equal weight to both. F is a harmonic mean, which penalises low performance in one value more than the arithmetic mean would.
Recall requires knowing the complete set of relevant documents for a query. On the Web there is a tremendous number of pages, and the full set of relevant ones cannot be identified or counted, so recall cannot be computed.
4.
a) Differentiate subjectivity from sentiment. How are the tasks of Subjectivity Classification and Sentiment Analysis related?
Answer:
Subjectivity concerns whether a sentence/document expresses personal opinions or feelings (subjective) rather than factual content (objective); sentiment concerns the polarity (positive/negative) of a subjective text. In the rule-based setting, a subjectivity classifier searches for words from the emotion lexicon and determines whether the sentence/document is objective or subjective.
A rule-based sentiment classifier then determines whether a subjective document/sentence shows positive or negative sentiment by counting the values of the lexicon words found. The two tasks are related in that subjectivity classification is typically run first, to eliminate objective segments before Sentiment Analysis is applied to the rest.
b)Explain the steps involved in the lexicon-based approach to Sentiment Analysis of features in a sentence (e.g. features of a product, such as the battery of a mobile phone). Discuss the limitations of this approach.
Answer:
In the binary approach, the input is a sentence s and a product feature f, and the output is whether the attitude towards the feature is positive, negative or neutral.
- Step 1: pick out all the sentences that contain the feature f and opinion-lexicon words.
- Step 2: select all the emotion words in each such sentence.
- Step 3: assign values to the emotion words: 1 = positive, 0 = neutral, −1 = negative.
- Step 4: sum the orientations and assign the resulting orientation to (f, s).
To address this approach's shortcoming with intensifiers, the gradable approach assigns different strength levels to the emotional content; the process is similar to the binary approach, and the final decision is based on all the emotion words.
The main limitation is that it requires a lexicon of emotion words that has to be fairly comprehensive, covering new words, abbreviations and misspelled words.
c)Explain the graded lexicon-based approach for Sentiment Analysis.Given the following sentences and opinion lexicon (adjectives only), apply the weighted lexical-based approach to classify EACH sentence as positive, negative or objective. Show the final emotion score for each sentence. In addition to use of the lexicon, make sure you consider any general rules that have an impact in the final decision. Explain these rules when they are applied.
Lexicon: boring -3 brilliant 2 good 3 horrible -5 happy 5
Graded lexicon-based Approach:
(S1) He is brilliant but boring.
Diminisher rule: the weight is subtracted from the positive terms.
emotion(brilliant) = 2, emotion(boring) = −3, so the decision value is 2 − 3 = −1 (negative).
(S2) I am not good today.
Negation rule: when a negation word occurs in the neighbourhood of an emotion word, the value is decreased by 1 and inverted.
emotion(good) = 3, so the decision value is −(3 − 1) = −2 (negative).
(S3) I am feeling HORRIBLE today, despite being happy with my achievement.
Capitalisation rule: the magnitude of a capitalised emotion word is increased by 1 (+1 for positive words, −1 for negative words). Diminisher rule: the weight is subtracted from the positive terms.
emotion(horrible) = −5; the capitalisation rule gives emotion(HORRIBLE) = −6; emotion(happy) = 5.
Decision value = −6 + 5 = −1 (negative).
(S4) He is extremely brilliant but boring, boring.
Intensifier rule: the weight of extremely is 2, so emotion(brilliant) = 2 + 2 = 4; boring occurs twice, so the decision value is 4 − 3 − 3 = −2 (negative).
d)Specify the five elements of Bing Liu’s model for Sentiment Analysis, and exemplify them with respect to the following text. Identify the features present in the text, and for each indicate its sentiment value as either positive or negative. Discuss two language processing challenges in automating the identification of such elements.
“I am in love with my new Toshiba Portege z830-11j. With its i7 core processors, it is extremely fast. It is the lightest laptop I have ever had, weighting only 1 Kg. The SSD disk makes reading/writing operations very efficient. It is also very silent, the fan is hardly ever used. The only downside is the price: it is more expensive than any Mac. Lucia Specia, 10/04/2012.”
Answer:
As for the five elements, they are:
- $e_i$: the name of the entity
- $a_{ij}$: an aspect of $e_i$
- $oo_{ijkl}$: the orientation of the opinion about the aspect $a_{ij}$ of the entity
- $h_k$: the opinion holder
- $t_l$: the time when the opinion is expressed by $h_k$
The entity is the Toshiba Portege z830-11j. The aspects are: the i7 core processors, the weight, the SSD disk, the fan and the price. As for the orientation of each opinion: for the product in general the holder says “love”, which is positive. For the i7 core processors, the holder says “fast”: positive. For the weight, the holder says it is “the lightest”: positive. For the SSD disk, the holder says “very efficient”: positive. The fan is “very silent” and “hardly ever used”: positive. The price is “more expensive” than any Mac: negative. The opinion holder is Lucia Specia, and the time is 10/04/2012. One can also write out the quintuples to show the result, e.g.:
(Toshiba Portege z830-11j, i7 core processors, positive, Lucia Specia, 10/04/2012)
Challenges: 1) it can be very difficult to identify the entity name because of variations and abbreviations; 2) it is difficult to distinguish subjective from objective sentences because of the subtleties of language.
e) Differentiate direct from comparative Sentiment Analysis. What are the elements necessary in comparative models of Sentiment Analysis?
Comparative Sentiment Analysis contrasts direct opinions with more complex comparative opinions: direct sentiment expressions on target objects versus comparisons expressing similarities or differences between objects. Comparative models typically need, in addition to the opinion holder and time, the entities being compared, the shared aspect on which they are compared, and the preferred entity.
Section A
a) Explain briefly the intuition behind the PageRank algorithm. Discuss how it can help rank documents that are ranked equally “relevant” according to the similarity score given by the vector space model.
PageRank is designed to use the structure of a graph to quantify the importance of each node in that graph. Accordingly, every usage of PageRank outside of a web context must maintain some notion of importance, even if the interpretation of the importance of a node varies from application to application.
PageRank can be thought of as a fluid that circulates throughout a network, passing from node to node across edges, and pooling at the nodes that are the most important
Intuition: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.