Forecasting and Predictive Analytics with Forecast X, 7e (Keating)
Chapter 11 Text Mining
1) Of all the data available today, it is estimated that about 90% is
A) unstructured.
B) structured.
C) numerical.
D) graphical.
2) A major part of text mining is to
A) reduce the dimensions of the data.
B) generalize the use of modifiers.
C) screen the articles from the data set.
D) reduce the word count of the text actually used.
3) Semantic Processing seeks to
A) extract meaning.
B) group individual terms into bins.
C) eliminate "extra" or unnecessary terms from an analysis.
D) uncover undefined words or terms in a set of textual data.
4) In text mining, "knowledge discovery" refers to
A) extraction of codified features.
B) analysis of feature distribution.
C) counting the number of unknown terms used in a document.
D) the measurement of single-use terms present in the text.
5) "Information distillation" as used in text mining refers to
A) the analysis of the feature distribution.
B) reducing the number of words or phrases that need to be combed.
C) reducing a text by eliminating punctuation and word spacing.
D) measuring the strength of a model's ability to predict.
6) In the text mining example shown here, which part of the diagram represents the product of a text mining operation?
A) The node labeled "Sentiment_Analysis."
B) The node labeled "Merge."
C) The node labeled "CHAID Customer..."
D) The node labeled "Satisfaction_Survey."
7) As compared to data mining, text mining seeks to
A) extract features.
B) increase dimensionality of the data.
C) use "fuzzy" attributes in its algorithms.
D) reduce the speed of passes through the data.
8) Consider the IBM/SPSS Modeler stream shown here. The "Nugget" labeled "sentiment_analysis"
A) was constructed using a "merge model."
B) was created by using the "Text Mining Node."
C) contains the documents used to determine a sentiment analysis model.
D) is a placeholder for a text mining model to be later inserted in the stream.
9) _______ refers to the process of deriving high-quality information from text.
A) Text Mining
B) Image Mining
C) Database Mining
D) Multimedia Mining
10) Refer to the IBM/SPSS Modeler Stream here. What is the Node labeled "Seth Grodin Blog"?
A) The Node contains postings of a particular web blog.
B) The Node can read either an Excel file, a text file, or a "data" file.
C) The Node contains the rule set for interpreting text data; it is a Natural Language Processor.
D) The Node is a web feed node that collects data from a web URL.
11) Refer to the Seth Grodin Stream. The Node (not the Nugget) labeled Text Mining
A) contains the text mining rules that will be applied to the blog data.
B) may access either an interactive built model or a directly generated model.
C) contains the data to be mined itself or the link to the data to be mined.
D) is used to instantiate the text data.
12) Refer to the Seth Grodin Stream here. The diamond-shaped Nugget (not the Node) titled "Text Mining"
A) is generated by the pentagon-shaped node also labeled "Text Mining."
B) will list the concepts the Text Mining Node has created.
C) contains the rules that the Natural Language Processor will use to reduce the dimensions of the data.
D) All of the options are correct.
13) Which of the following techniques can be used for the purpose of keyword normalization, the process of converting a keyword into its base form?
1) Lemmatization
2) Levenshtein
3) Stemming
4) Soundex
A) 1 and 2
B) 2 and 4
C) 1 and 3
D) 1, 2, and 3
E) 2, 3, and 4
F) 1, 2, 3, and 4
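A minimal Python sketch of keyword normalization by crude suffix stripping; this is a toy illustration only, not the Porter stemmer or a dictionary-based lemmatizer:

# Toy normalization: strip common suffixes to approximate a stem.
# Real systems use an established stemmer (e.g., Porter/Snowball) or a
# dictionary-based lemmatizer; this is only an illustration.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word: str) -> str:
    """Return the word with the first matching suffix removed."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([crude_stem(w) for w in ["agree", "agreed", "agreeing", "agrees"]])
# ['agree', 'agre', 'agree', 'agree']

Note that "agreed" collapses to "agre" rather than "agree"; real stemmers and lemmatizers handle such cases far more carefully.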
14) You have collected data consisting of about 10,000 rows of tweet text and no other information. You want to create a tweet classification model that categorizes each tweet into one of three buckets: positive, negative, and neutral.
Which of the following models can perform tweet classification in the context described above?
A) Naïve Bayes
B) Logit
C) kNN
D) None of the options are correct.
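For context, a minimal scikit-learn sketch of the Naïve Bayes approach to sentiment classification; it assumes scikit-learn is installed and uses an invented three-tweet sample (in practice a labeled training set would be required):

# Hypothetical mini-sample; a real model needs a labeled training set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ["love this product", "terrible slow service", "it arrived on time"]
labels = ["positive", "negative", "neutral"]

vectorizer = CountVectorizer()              # "datify" the text into counts
X = vectorizer.fit_transform(tweets)
model = MultinomialNB().fit(X, labels)

print(model.predict(vectorizer.transform(["terrible product"])))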
15) You have created a document-term matrix of the data, treating every tweet as one document. Which of the following is correct with regard to a document-term matrix?
A) Removal of stop words from the data will affect the dimensionality of data.
B) Normalization of words in the data will reduce the dimensionality of data.
C) Both "Removal of stop words from the data will affect the dimensionality of data" and "Normalization of words in the data will reduce the dimensionality of data" are correct.
D) None of the options are correct.
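A small pure-Python sketch of both effects: removing stop words and normalizing words each shrink the vocabulary, i.e., the column count of the document-term matrix (the stop list and documents are invented):

docs = ["the service was great", "great services", "the food was bad"]
STOP_WORDS = {"the", "was"}                 # illustrative stop list

def terms(doc, normalize=False, drop_stops=False):
    words = doc.lower().split()
    if drop_stops:
        words = [w for w in words if w not in STOP_WORDS]
    if normalize:
        words = [w[:-1] if w.endswith("s") else w for w in words]  # crude
    return words

vocab_raw = {w for d in docs for w in terms(d)}
vocab_clean = {w for d in docs for w in terms(d, normalize=True, drop_stops=True)}
print(len(vocab_raw), len(vocab_clean))     # 7 columns shrink to 4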
16) Refer to the following concept-document matrix.

Term | d1 | d2 | d3 | d4 | d5 | d6 | d7
t1   |  2 |  1 |  0 |  0 |  0 |  0 |  0
t2   |  1 |  2 |  0 |  0 |  0 |  0 |  1
t3   |  3 |  1 |  0 |  0 |  1 |  1 |  0
t4   |  0 |  0 |  1 |  2 |  1 |  1 |  1
t5   |  0 |  0 |  1 |  1 |  1 |  1 |  1
t6   |  0 |  0 |  1 |  1 |  0 |  0 |  0

Which pair of documents contains the same number of terms (concepts), where that number is not equal to the smallest number of terms (concepts) found in any document in the corpus?
A) d1 and d4
B) d6 and d7
C) d2 and d4
D) d5 and d6
17) Refer to the following concept-document matrix.

Term | d1 | d2 | d3 | d4 | d5 | d6 | d7
t1   |  2 |  1 |  0 |  0 |  0 |  0 |  0
t2   |  1 |  2 |  0 |  0 |  0 |  0 |  1
t3   |  3 |  1 |  0 |  0 |  1 |  1 |  0
t4   |  0 |  0 |  1 |  2 |  1 |  1 |  1
t5   |  0 |  0 |  1 |  1 |  1 |  1 |  1
t6   |  0 |  0 |  1 |  1 |  0 |  0 |  0

Which are the most common and the rarest terms (concepts) of the corpus?
A) t4, t6
B) t3, t5
C) t5, t1
D) t5, t6
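A quick pure-Python check on the matrix above; note that "common" can be measured by total occurrences or by document frequency (the number of documents containing the term), and the two rankings can differ:

matrix = {
    "t1": [2, 1, 0, 0, 0, 0, 0],
    "t2": [1, 2, 0, 0, 0, 0, 1],
    "t3": [3, 1, 0, 0, 1, 1, 0],
    "t4": [0, 0, 1, 2, 1, 1, 1],
    "t5": [0, 0, 1, 1, 1, 1, 1],
    "t6": [0, 0, 1, 1, 0, 0, 0],
}
for term, counts in matrix.items():
    total = sum(counts)                          # total occurrences
    doc_freq = sum(1 for c in counts if c > 0)   # documents containing term
    print(term, total, doc_freq)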
18) While creating a machine learning model on text data, you created a document-term matrix from the input data of 100K documents. Which of the following remedies can be used to reduce the dimensions of the data?
A) Latent Semantic Indexing
B) Keyword Normalization
C) Both "Latent Semantic Indexing" and "Keyword Normalization" can reduce the dimensions of data.
D) None of the options are correct.
19) What are typical data preprocessing tasks for text?
A) Remove "stop" words.
B) Stemming
C) Correcting spelling errors.
D) All of the options are correct.
20) Terms such as "agree," "agreed," "agreeing," and "agreeable" would result in the _______ "agree."
A) discretization
B) root
C) stem
D) feature selection
21) In text mining, "tokens" are the "words" finally extracted from a block of text once certain procedures have been performed.
What "procedures" are being referenced here?
A) Remove words more than 20 letters in length.
B) Normalize the text.
C) Remove monetary values.
D) All of the options are correct.
22) "Numbertoken" refers to _______.
A) the text that has been turned into numbers
B) any number appearing in the document
C) the word count of a text document
D) how many times a particular word appears in a document
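A one-line Python sketch of the idea: every numeric string is collapsed to a single placeholder so that distinct numbers do not inflate the vocabulary (the sentence is invented):

import re

text = "Revenue rose 12 percent to 4500 units in 2023."
print(re.sub(r"\d+(\.\d+)?", "numbertoken", text))
# Revenue rose numbertoken percent to numbertoken units in numbertoken.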
23) Term-Document Matrix
Doc ID | accur | activ | addit | advanc | air | applic | articl | australia | auto |
101551 | 0 | 1.59603 | 0 | 0 | 0 | 0 | 0.679859 | 0 | 2.515777 |
101552 | 0 | 0 | 1.59603 | 0 | 0 | 0 | 0.679859 | 0 | 0.679859 |
101553 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
101554 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
101555 | 0 | 0 | 0 | 0 | 0 | 0 | 0.679859 | 0 | 1.75741 |
101556 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The Term-Document Matrix from XLMiner is shown. What do the columns of this matrix show?
A) The "numbertokens" appearing in the documents
B) Possible misspellings in the documents
C) A word or term extracted from the Usenet documents
D) None of the options are correct.
24) The Zipf Plot (like the one displayed) for the Usenet Newsgroups is drawn with the knowledge that
A) in the absence of being horizontal, there is little information to be gained from the documents.
B) the plot should slant upward from lower left to upper right.
C) the frequency of any single word is directly proportional to its rank in the frequency table.
D) the frequency of any single word is inversely proportional to its rank in the frequency table.
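A tiny Python sketch of the relationship: if frequency is inversely proportional to rank, then rank times frequency should be roughly constant (the word sample is invented and far too small for a real Zipf check):

from collections import Counter

words = "the of the and the of a the in of the a and".split()
for rank, (word, freq) in enumerate(Counter(words).most_common(), start=1):
    print(rank, word, freq, rank * freq)   # rank * freq is roughly constant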
25) SVD, or singular value decomposition, was explained with the use of this two-dimensional chart. SVD is best described as
A) a way to reduce the dimensions of the data matrix.
B) a regression method for "linearizing" the text data.
C) a stemming process for text data.
D) a clustering technique.
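A NumPy sketch of SVD-based dimension reduction: keeping only the top k singular values and vectors yields a low-rank approximation of a small, invented term-document matrix:

import numpy as np

A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                      # keep the 2 strongest "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))                    # rank-2 approximation of A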
26) A Scree Plot like the one shown
A) displays a graphical representation of the importance (i.e., contribution) of each concept.
B) displays the number of dummy variables that the algorithm could create from the text data.
C) is used in clustering algorithms to choose the optimal number of clusters.
D) is always created and examined before the data is subjected to stemming.
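A short NumPy sketch of the numbers behind a scree plot: each singular value's share of the total variation, i.e., the relative contribution of each concept (a random matrix stands in for a real term-document matrix):

import numpy as np

rng = np.random.default_rng(1)
s = np.linalg.svd(rng.random((20, 8)), compute_uv=False)
for i, share in enumerate(s**2 / np.sum(s**2), start=1):
    print(f"concept {i}: {share:.1%}")     # values plotted on a scree chart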
27) "Latent semantic indexing"
A) is that part of building an ensemble model that "indexes" the different models to combine.
B) is a classification algorithm that works only on categorical data.
C) creates clusters of like items.
D) collates the most common words and phrases and identifies them as keywords for particular postings.
28) Logistic regression can never be used in text mining
A) because the target in text mining is rarely categorical.
B) because Logit assumes that all the attributes are numerical.
C) because Logit is never used with text data.
D) None of the options are correct; Logit is commonly used with text data after that data is "datified."
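A scikit-learn sketch of that last point: once text is "datified" into numeric features, logistic regression fits normally (scikit-learn assumed installed; the mini-corpus is invented):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["great support team", "awful slow response",
        "helpful and fast", "rude and unhelpful"]
labels = [1, 0, 1, 0]                       # 1 = satisfied, 0 = not

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(docs, labels)                       # text -> numbers -> Logit
print(clf.predict(["fast helpful support"]))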
29) A "bag of words" analysis
A) is almost identical to "natural language processing."
B) is a manual way of inspecting text data.
C) looks at the unprocessed text as a collection of words without regard to grammar.
D) uses no stemming procedure.
30) "Stop words"
A) usually include words such as "the," "and," "a," and so on.
B) are the last words in any text document.
C) signal the text mining algorithm that a solution has been achieved.
D) are words that are unintelligible to the algorithm.
31) In some cases text mining software will perform the process of "entity extraction"
A) by breaking up phrases into their component parts.
B) but only if the number of entities exceeds the number of tokens.
C) if more than about 10 entities are encountered.
D) which identifies a group of words as a single item; people's names are often characterized in this way.
32) "Bag of Words" text analysis
A) usually requires "stemming" as one step in the process.
B) and Natural Language Processing are one and the same.
C) never involves "tokenization."
D) does not reduce the dimensions of the data.
33) In the customer satisfaction example in the text mining chapter
A) the data consisted of only textual data.
B) the text data was transformed into numerical dummy variables.
C) it was not necessary to partition the data because it was text.
D) the algorithm demonstrated was a clustering type model.
34) "Target leakage"
A) refers to choosing a variable as a target which does not have distinct numerical values.
B) is the concept that any variable chosen as a target could have multiple meanings.
C) refers to the fact that any text mining algorithm will have less predictive power than any numeric algorithm.
D) is the term given to the introduction of information about the text mining target which should not legitimately be available to the algorithm.
35) "Target leakage" was described as being similar to what situation in regression analysis?
A) Not allowing variable values of less than zero (i.e., negative values).
B) Using explanatory attributes or variables that are highly correlated.
C) Placing the same variable on both sides of a regression.
D) Using too few explanatory variables.
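A small scikit-learn sketch of target leakage: a feature built from the target itself scores almost perfectly, while an honest feature does not (all data invented):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
honest = rng.normal(size=(200, 1))                    # unrelated feature
leaked = y.reshape(-1, 1) + rng.normal(scale=0.01, size=(200, 1))

for name, X in [("honest", honest), ("leaked", leaked)]:
    acc = LogisticRegression().fit(X, y).score(X, y)
    print(name, round(acc, 2))                        # leaked scores ~1.0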
36) Amazon's Alexa and Apple's Siri
A) probably use natural language processing.
B) are likely examples of Bag of Words algorithms.
C) do not involve text processing in any form.
D) are analog, rather than digital processors.
37) The most important consideration in order to protect from target leakage
A) involves the correct selection of the target values.
B) would be to carefully set the parameters of the text mining algorithm.
C) is expert domain knowledge of your data.
D) could involve the type of stemming that the algorithm chooses.
38) The customer service model used in the text mining chapter
A) was an ensemble model because it combined text data and numeric data.
B) is an example of a logistic regression with text inputs.
C) required only a testing partition (and not a validation partition).
D) demonstrated a technique that could only rarely be used in actual practice.
39) The customer service example in the text mining chapter
A) used only structured data.
B) used only unstructured data.
C) used both structured and unstructured data.
D) used only text data.
40) Overall, our goal in text analytics is to
A) reduce the dimensions of the unstructured text to manageable attributes we can use in data mining algorithms.
B) change the words in the text to numbers.
C) count the words, sentences, and paragraphs in order to deduce meaning.
D) None of the options are correct.
41) The process of counting the number of words in a document is part of what we do in text mining,
A) and it is the only thing we do in Bag of Words analysis.
B) but the number we determine is only used for error checking.
C) but by itself, counting words is not analytics.
D) however, it is not a part of the standard procedures in text mining.
42) If we were to analyze text statements scraped from a Facebook page for sentiment,
A) we would probably be using text mining.
B) we would be performing a Structured Query Language sort.
C) we would not use natural language processing.
D) we would not be attempting to reduce the dimensions of the data.
43) "Web Scraping"
A) is the collection of all the numerical data on a web page.
B) is an unsupervised classification algorithm.
C) is the analysis of a website for sentiment.
D) is a method of collecting information from websites.
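A minimal Python sketch of web scraping, assuming the requests and beautifulsoup4 packages are installed; the URL is a stand-in placeholder:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(separator=" ", strip=True)[:200])  # first 200 characters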
44) SVD (Singular Value Decomposition)
A) is a form of entity extraction.
B) removes stop words from unstructured text.
C) is another name for "stemming."
D) is a concept extraction tool.
45) A "Scree Plot"
A) can often be used to suggest that an ensemble model would work well with some data.
B) gives a graphical representation of the contribution of each concept.
C) presents information on the number of words and paragraphs in a document or set of documents.
D) is used in clustering to suggest the optimal number of clusters.