Programmer Documentation: Visualizations
Explanation of the visualizations created by Madison Nuss from Fall 2015 through Spring 2016.
http://bl.ocks.org/mbostock/7607535
This visualization is meant to be an intuitive and quick way to see which topics are the most popular in the dataset and, secondly, to see which documents contain those topics. The outermost circles represent the topics. Their size should correspond to the topic's prevalence in the dataset. The inner, white circles represent the documents containing those topics. Each document circle's size should represent the prevalence, within that document, of the topic whose circle it sits in. Clicking on a topic circle should zoom in to that circle and display the titles of the document circles, with the title of the topic above the topic circle. Clicking on a document circle (when zoomed in) should display that document below the visualization. When zoomed in, clicking on either the topic (darker teal) circle or the outer (lighter teal) circle should zoom all the way out.
The controls change certain things about the visualization. "Number of Topics" changes how many top topics are shown in the visualization, i.e. how many darker teal circles there are. "Documents Displayed Per Topic" changes how many documents are shown within each topic circle. The "Top Topics Calculation Method" switch changes the way topic prevalence is calculated. (It also changes how document circle sizes are calculated.) The two algorithms for calculating topic prevalence will be discussed later.
Since the text in the visualization is so small (and I believe it needs to remain that small), the visualization should not shrink as the page shrinks. Instead, the controls should move above the visualization when the screen is too small to show them side-by-side. When the screen is large enough, the controls should appear to the right of the visualization to be consistent with the other visualizations on the site.
Location of JavaScript: topicalguide/visualize/static/scripts/visualizations/circle_packing_view.js
Location of CSS: topicalguide/visualize/static/styles/circle-styles.css
CirclePackingView is the main view. CirclePackingViewManager is its wrapper.
CirclePackingViewManager contains the plot view (CirclePackingView) and the document info view (SingleDocumentView).
CirclePackingView contains eight global variables:
- totalData is the object that contains all topic info. It contains an array called "children." Each child is a topic with a name and a total. The total is the number of tokens that belong to that topic in the entire dataset. totalData is populated and sorted in render().
- percentageData is similar to totalData. Each child is a topic with a name and a total, but the total is calculated differently. The total is the average percentage of the topic per document over all documents. percentageData is populated and sorted in render().
- displayData is the current object that is displayed through the visualization. It becomes either totalData or percentageData based on which way the Top Topics Calculation Method (calculation-control) switch is selected.
- topicArray contains the integers from 1 through the number of topics in the current analysis. It is populated in render(). It is used for the Number of Topics control.
- documentArray contains the integers from 1 through the number of documents in the current analysis. It is populated in render(). It is used for the Documents Displayed Per Topic control.
- numTopics corresponds to the selection in the Number of Topics control. In initialize(), it is set to 20, which is an arbitrary number that I found to be reasonable for most datasets.
- numDocuments corresponds to the selection in the Documents Displayed Per Topic control. In initialize(), it is set to 10, which is an arbitrary number that I found to be reasonable for most datasets.
- calcTotal corresponds to the Top Topics Calculation Method control. It is true if the switch is set to “Total Tokens,” false if it is set to “Average Percentage.”
Each view needs a readableName and a shortName. The readableName for the CirclePackingView is “Top Topics.” Its shortName is “circlepack.”
mainTemplate is the HTML template for the view and the controls.
controlsTemplate contains the HTML for the controls.
initialize() sets the initial values for some of the global variables.
cleanup() is a function in the other visualizations, but I haven’t found a use for it.
getQueryHash() returns the available data for use in the visualization.
renderControls() sets up the controls’ behavior.
alterDisplayData() populates the displayData object based on the controls’ values and calls renderChart().
renderChart() sets up the HTML and behavior of the visualization. Most of the code is taken from the online example, with adjustments for behavior that was needed for this particular visualization. The link is in the “Inspiration” section.
render() sets up all of the data needed for the visualization and calls renderControls() and alterDisplayData().
renderHelpAsHtml() returns HTML of instructions for the visualization.
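To make the role of alterDisplayData() more concrete, here is a minimal sketch of how a sorted topic hierarchy might be trimmed to match the two drop-down controls. It is not the actual implementation: buildDisplayData is a hypothetical helper name, and choosing the largest documents within each topic is an assumption (the controls only specify how many documents are shown, not which ones).

```javascript
// Hypothetical helper, not the real alterDisplayData(): trims an already
// sorted hierarchy (most prevalent topic first) to the controls' values.
function buildDisplayData(sortedData, numTopics, numDocuments) {
    var display = { name: sortedData.name, children: [] };
    sortedData.children.slice(0, numTopics).forEach(function(topic) {
        // Copy before sorting so the source object is left untouched.
        // Assumption: the documents kept are the ones with the largest size.
        var docs = topic.children.slice().sort(function(a, b) {
            return b.size - a.size;
        });
        display.children.push({
            name: topic.name,
            children: docs.slice(0, numDocuments)
        });
    });
    return display;
}

// Toy data in the same shape as totalData and percentageData.
var sample = {
    name: "topics",
    children: [
        { name: "topic A", children: [
            { name: "doc1", size: 5 },
            { name: "doc2", size: 9 },
            { name: "doc3", size: 1 }
        ] },
        { name: "topic B", children: [
            { name: "doc1", size: 2 },
            { name: "doc3", size: 4 }
        ] },
        { name: "topic C", children: [{ name: "doc2", size: 3 }] }
    ]
};

// Two topics, two documents per topic.
var trimmed = buildDisplayData(sample, 2, 2);
console.log(JSON.stringify(trimmed, null, 2));
```

With the toy data, trimmed keeps "topic A" (doc2, doc1) and "topic B" (doc3, doc1) and drops "topic C". In the view itself, the first argument would be totalData or percentageData depending on calcTotal.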
I have come up with two different ways of calculating which topics are the "top" topics. I call the first way "Total Tokens." I simply go through the dataset and count all of the tokens that correspond to each topic. The more tokens the topic has, the higher on the list it will be.
Here is the code for populating and sorting the totalData object:
// Populate totalData with all topic info
self.totalData = (function() {
    var newdata = {};
    newdata.name = "topics";
    newdata.children = [];
    // One entry per topic; its children will hold that topic's documents
    for (var i = 0; i < Object.keys(analysis.topics).length; i++) {
        var topic = {};
        topic.name = analysis.topics[i].names["Top 2"];
        topic.children = [];
        newdata.children.push(topic);
    }
    // Attach each document to every topic it contains, sized by token count
    for (var key in documents) {
        for (var j = 0; j < Object.keys(analysis.topics).length; j++) {
            if (documents[key] !== undefined && documents[key].topics[j]
                    !== undefined) {
                var docObj = {};
                docObj.name = key;
                docObj.size = documents[key].topics[j];
                newdata.children[j].children.push(docObj);
            }
        }
    }
    return newdata;
})();
// Sort totalData by top topic (largest total token count first)
self.totalData.children.sort(function(a, b) {
    var keyA = 0;
    var keyB = 0;
    for (var i = 0; i < a.children.length; i++) {
        keyA += a.children[i].size;
    }
    for (var j = 0; j < b.children.length; j++) {
        keyB += b.children[j].size;
    }
    if (keyA > keyB) return -1;
    if (keyA < keyB) return 1;
    return 0;
});
The second way is called "Average Percentage." For each topic, I go through every document in the dataset and calculate the percentage of that topic within that document. (The percentage is the number of tokens corresponding to that topic divided by the number of tokens in the entire document.) I add up these percentages for all of the documents. Once I am finished with that topic, I divide the summed percentage by the number of documents in the dataset. The higher a topic's average percentage, the higher on the list it will be.
Here is the code for populating and sorting the percentageData object:
// Populate percentageData with all topic info
self.percentageData = (function() {
    var newdata = {};
    newdata.name = "topics";
    newdata.children = [];
    for (var i = 0; i < Object.keys(analysis.topics).length; i++) {
        var topic = {};
        topic.name = analysis.topics[i].names["Top 2"];
        topic.children = [];
        topic.percentage = 0;
        newdata.children.push(topic);
    }
    for (var key in documents) {
        for (var j = 0; j < Object.keys(analysis.topics).length; j++) {
            if (documents[key] !== undefined && documents[key].topics[j]
                    !== undefined) {
                var docObj = {};
                docObj.name = key;
                docObj.size = (documents[key].topics[j] /
                    documents[key].metrics["Token Count"]) * 100;
                newdata.children[j].children.push(docObj);
                newdata.children[j].percentage += docObj.size / 100;
            }
        }
    }
    for (var k = 0; k < Object.keys(newdata.children).length; k++) {
        newdata.children[k].percentage = newdata.children[k].percentage /
            Object.keys(documents).length;
    }
    return newdata;
})();
// Sort percentageData by top topic (average percentage)
self.percentageData.children.sort(function(a, b) {
    var keyA = a.percentage;
    var keyB = b.percentage;
    if (keyA > keyB) return -1;
    if (keyA < keyB) return 1;
    return 0;
});
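The two methods can rank the same topics differently. The following sketch illustrates this on toy data, where one long document is dominated by topic 0 and two short documents are dominated by topic 1. The function names (totalTokens, averagePercentage, rankTopics) and the three sample documents are made up for illustration; only the documents[key].topics and documents[key].metrics["Token Count"] fields follow the code above.

```javascript
// Toy documents: token counts per topic plus the document's total tokens.
var documents = {
    longDoc:   { topics: { 0: 900, 1: 100 }, metrics: { "Token Count": 1000 } },
    shortDoc1: { topics: { 0: 1, 1: 9 },     metrics: { "Token Count": 10 } },
    shortDoc2: { topics: { 0: 2, 1: 8 },     metrics: { "Token Count": 10 } }
};

// "Total Tokens": sum of the topic's tokens over all documents.
function totalTokens(documents, topicId) {
    var sum = 0;
    Object.keys(documents).forEach(function(key) {
        sum += documents[key].topics[topicId] || 0;
    });
    return sum;
}

// "Average Percentage": mean share of each document taken up by the topic
// (returned as a fraction between 0 and 1).
function averagePercentage(documents, topicId) {
    var keys = Object.keys(documents);
    var sum = 0;
    keys.forEach(function(key) {
        var doc = documents[key];
        sum += (doc.topics[topicId] || 0) / doc.metrics["Token Count"];
    });
    return sum / keys.length;
}

// Returns the topic ids ordered from highest to lowest score.
function rankTopics(topicIds, score) {
    return topicIds.slice().sort(function(a, b) {
        return score(b) - score(a);
    });
}

var byTotal = rankTopics([0, 1], function(t) {
    return totalTokens(documents, t);
});
var byAverage = rankTopics([0, 1], function(t) {
    return averagePercentage(documents, t);
});

console.log(byTotal);   // [0, 1]: topic 0 has 903 tokens, topic 1 has 117
console.log(byAverage); // [1, 0]: topic 1 averages about 60%, topic 0 about 40%
```

Total Tokens favors topics that dominate long documents, while Average Percentage gives every document the same weight regardless of its length.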
- Right now, the size of the outer circles does not correspond directly to the prevalence of the topics, but it should. The online example simply creates the sizes of outer circles based on the sizes of their children, but that does not work for this visualization. Currently, in the d3.layout.pack().value() function, I divide the document circles' values by their topic's position (e.g. if the topic were the fifth most prevalent topic, each of the topic's document values would be divided by 5). This works fairly well, but it still does not guarantee that the topic circles will be in the correct order in all datasets. It is also misleading, as the sizes of the topic circles in relation to each other do not accurately represent the topics' real prevalence in relation to each other. The actual prevalence of the topics should affect the sizes of the document circles within them in such a way that the topic circles' sizes accurately represent their topics' prevalence.
- The visualization determines the sizes of the leaf (white) circles through the override of the d3.layout.pack().value() function. I override this function in renderChart(). The function's parameter, d, is a document from the displayData object (i.e. a topic's child).
- When zoomed in to a topic circle, it would be helpful to place the title of the topic somewhere visible (above the circle or something similar).
- Sliders (similar to the one under the pie chart in Single Document View) might be better input media than drop-down menus for the first two controls.
- Topic circles’ sizes should represent topic prominence.
- The title of the topic should be placed visibly when zoomed in to a topic circle.
- Sliders should replace the drop-down menus in Controls.
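The rank-division workaround described in the notes above can be sketched as follows. This is an illustration under assumptions rather than the code in renderChart(): tagTopicRanks, rankScaledValue, and the topicRank field are hypothetical names, and the summed leaf values are used only as a rough proxy for the size of the enclosing topic circle.

```javascript
// Hypothetical: tag every document node with its topic's 1-based rank.
// displayData is assumed to be sorted, most prevalent topic first.
function tagTopicRanks(displayData) {
    displayData.children.forEach(function(topic, index) {
        topic.children.forEach(function(doc) {
            doc.topicRank = index + 1; // illustrative field name
        });
    });
    return displayData;
}

// Value accessor in the form d3.layout.pack().value() expects: it receives
// a leaf node d (a document) and returns the number used to size its circle.
// It could be passed as d3.layout.pack().value(rankScaledValue).
function rankScaledValue(d) {
    return d.size / d.topicRank;
}

// Toy case with one displayed document per topic. "first" is assumed to be
// the more prevalent topic overall, but its tokens are spread thinly over
// many documents that are not displayed.
var data = tagTopicRanks({
    name: "topics",
    children: [
        { name: "first",  children: [{ name: "d1", size: 10 }] },
        { name: "second", children: [{ name: "d2", size: 30 }] }
    ]
});

var sums = data.children.map(function(topic) {
    return topic.children.reduce(function(sum, d) {
        return sum + rankScaledValue(d);
    }, 0);
});

console.log(sums); // [10, 15]
```

Here the second-ranked topic still ends up with the larger summed value, which is one way the circle order can disagree with the true ranking.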
https://bost.ocks.org/mike/miserables/
Code borrowed from: http://jsfiddle.net/zy1kxy0c/
The purpose of this visualization is to provide a fast and intuitive way to view the co-occurrence of topics throughout a dataset. The more frequently two topics appear in the same document together, the more they co-occur.
Topic titles appear along the left, top, and bottom of the visualization. Each topic has a row and a column. The cell where two topics cross represents those two topics' co-occurrence. The darker the cell, the more often they co-occur. When mousing over a cell with co-occurrence, the corresponding topic titles should become red. (The topic titles do not need to become red for blank cells, since this means the topics never co-occur, but you may choose to make them red anyway.) The graph is symmetrical, with the diagonal line through the center simply representing how often those topics occur, not co-occur (since they are the same topic).
The graph can be ordered in three different ways. When it is ordered by Name, topic titles appear in alphabetical order from top to bottom, left to right. When it is ordered by Frequency, the topics are ordered by how often they occur in the dataset (i.e., by the "darkness" of their cells on the center diagonal). The third way, by Cluster, should order the topics by the arbitrary group they are put in. Each group would have a different color. You can see this functionality by following the links in the "Inspiration" section. Cluster ordering may or may not have a place here. It may make the visualization easier to read, or it may make it more confusing. If it is included, the topics should be grouped by similarity.
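As a rough sketch of the Name and Frequency orderings, each ordering can be expressed as a permutation of topic indices that the rows and columns are then laid out by. The nodes array, its count field, and the computeOrder helper are assumptions for illustration and are not taken from co_occurrence_view.js.

```javascript
// Hypothetical nodes: one per topic, with a title and an occurrence count.
var nodes = [
    { name: "music, song", count: 40 },
    { name: "court, law",  count: 10 },
    { name: "tax, budget", count: 75 }
];

// Returns the node indices in display order, leaving nodes untouched.
function computeOrder(nodes, compare) {
    var indices = nodes.map(function(node, i) { return i; });
    return indices.sort(function(a, b) {
        return compare(nodes[a], nodes[b]);
    });
}

var orders = {
    // Alphabetical by topic title.
    name: computeOrder(nodes, function(a, b) {
        return a.name.localeCompare(b.name);
    }),
    // Most frequent topic first.
    frequency: computeOrder(nodes, function(a, b) {
        return b.count - a.count;
    })
};

console.log(orders.name);      // [1, 0, 2]
console.log(orders.frequency); // [2, 0, 1]
```

A Cluster ordering would fit the same pattern once each node carries a group value to compare on.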
Location of JavaScript: topicalguide/visualize/static/scripts/visualizations/co_occurrence_view.js
Location of CSS: topicalguide/visualize/static/styles/matrix-styles.css
Like Top Topics, there is a main view and a wrapper. CoOccurrenceViewManager is the wrapper; CoOccurrenceView is the main view.
topicData is the global object full of nodes and links, like the object miserables at the jsfiddle link.
cutoff is an arbitrary number that was used for experimenting with different algorithms for determining co-occurrence. These will be explained later.
mainTemplate and controlsTemplate serve the same purpose as in Top Topics.
initialize() and cleanup() serve the same purpose as their counterparts in Top Topics; both are empty here.
getQueryHash() serves the same purpose as its counterpart in Top Topics.
renderControls() only inserts the controlsTemplate into its appropriate section in the mainTemplate. The behavior of the controls is included in renderChart() because the code was provided there in the borrowed code.
renderChart() draws the visualization and is borrowed heavily from the jsfiddle code.
render() prepares the data for display. It calls renderControls() and renderChart().
renderHelpAsHtml() serves the same purpose as its counterpart in Top Topics.
There are many different ways co-occurrence could be measured. Here are three that I have come up with, along with their pros and cons.
1. +1 each time topic A appears in the same document as topic B. This, I think, is the most intuitive way to calculate co-occurrence. It is the first measurement I tried. However, I do not think it works in this context because, in most datasets, ALL of the topics appear in ALL of the documents because of the way datasets are analyzed. All of the visualization's cells become the same shade, which is not useful at all.
2. +1 each time both topic A and topic B have x amount of words in the same document. This is an attempt to fix the problems in 1 above. However, it does not measure true co-occurrence, and topics with smaller numbers are left out. Also, the x changes for each dataset. (This is what cutoff was used for.) If this becomes the chosen method, I believe another control would have to be added to change x so that the graph makes sense for each new dataset.
3. +1 each time a word (token) in topic A appears in the same document as a word (token) in topic B. This is my favorite method because it works fairly well for all datasets I've tried. However, in datasets with long documents, each topic gets huge co-occurrence values, and there are rarely any cells left blank (but these are not necessarily cons).
Currently, the JavaScript implements the third algorithm, but I am not totally convinced that this is the perfect way to measure co-occurrence.
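The three measurements above can be compared side by side with a small sketch. It is written under assumptions and is not the code in co_occurrence_view.js: the function names and toy documents are invented, "x amount of words" is read as "at least cutoff tokens", and the third method is read as counting every pair of tokens (the product of the two topics' token counts in a document). The diagonal is filled by the same rule here, which may differ from how the real view computes topic occurrence.

```javascript
// Builds a numTopics x numTopics matrix by summing weight(countA, countB)
// over all documents. documents[key].topics maps topic index to token count.
function coOccurrenceMatrix(documents, numTopics, weight) {
    var matrix = [];
    for (var row = 0; row < numTopics; row++) {
        matrix.push([]);
        for (var col = 0; col < numTopics; col++) {
            matrix[row].push(0);
        }
    }
    Object.keys(documents).forEach(function(key) {
        var counts = documents[key].topics;
        for (var a = 0; a < numTopics; a++) {
            for (var b = 0; b < numTopics; b++) {
                matrix[a][b] += weight(counts[a] || 0, counts[b] || 0);
            }
        }
    });
    return matrix;
}

// Method 1: +1 per document in which both topics appear at all.
function bothPresent(countA, countB) {
    return countA > 0 && countB > 0 ? 1 : 0;
}

// Method 2: +1 per document in which both topics reach the cutoff
// (assumed reading of "x amount of words").
function bothAtLeast(cutoff) {
    return function(countA, countB) {
        return countA >= cutoff && countB >= cutoff ? 1 : 0;
    };
}

// Method 3 (one possible reading): +1 for every pair of tokens, one from
// each topic, that share a document.
function tokenPairs(countA, countB) {
    return countA * countB;
}

// Toy documents with three topics.
var docs = {
    d1: { topics: { 0: 5, 1: 1 } },
    d2: { topics: { 0: 2, 1: 3, 2: 4 } }
};

var m1 = coOccurrenceMatrix(docs, 3, bothPresent);
var m2 = coOccurrenceMatrix(docs, 3, bothAtLeast(2));
var m3 = coOccurrenceMatrix(docs, 3, tokenPairs);

console.log(m1[0][1], m2[0][1], m3[0][1]); // 2 1 11
```

The product in the third method grows quickly with document length, which is consistent with the large values noted above for datasets with long documents.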
- I believe this visualization would be most useful for datasets with small document sizes (like tweets or small paragraphs).
- I have not yet found a good way to group the topics (for the "Cluster" ordering). This is partly because the analysis info that can be accessed by the visualizations does not include words in the topics, only the top 4 words.
- Similar topics should be grouped with the same color.
- The method for determining co-occurrence should work well for all types of datasets.
- The controls should be decoupled from the chart and their behavior should be written in renderControls().