Project Word Miasma

 

Goal: Build a user interface in D3 that displays word tags learnt from a text document. Let us call this a wordmiasma. In this lab, you are required to create a word cloud that highlights "words" in a document. Many others call this creation a wordcloud. Why call it differently? Because we will only implement the simpler parts of the whole workflow. My pedagogical goal is to let you go behind the scenes and learn the process rather than to make you the patron saint of wordclouds. Below are examples of word clouds (the one in the middle is from https://tagul.com/).

 

 

 

Philosophically Speaking: Contrast this with cryptography. Therein you take documents and transform them into "cryptics" that no Homo sapiens, nor any machine built by Homo sapiens, can fathom. You destroy the perceived intent while actually preserving the underlying meaning and intent. Word clouds, for their part, first originated online in the 1990s as tag clouds (famously described as "the mullets of the Internet"), which were used to display the popularity of keywords in bookmarks.

 

The Lab: Think about this exercise this way. You are "re-encoding" the document using visual metaphors. Here you amplify information. You want to pick out word gems and highlight or embellish them visually. Since we belong to the dojo of "task-centric" design, we will first write down the tasks, which are:

1.    Scrape: Find or scrape all the words in the document using tokenizers. You will read the documents into a "tokenizer", which will yield tokens. A token is a word entity in a natural language. Some manual intervention will also be allowed.

2.    Analyze: Find all salient words in this module. The main task here is to analyze the tokens using frequentist or other statistical approaches applied to the occurrence or length of the gleaned tokens, or some other approach applied to some other characteristics. See more below. Think of a statistical method for capturing the distributions and creating "numerical descriptions".

3.    Visual Encoding & Display: Now comes the visual encoding. Take the measures and characteristics of the tokens you computed in Step 2, transform and clean them as needed, and then assign visual attributes to each token using either the plain or the transformed numerical representations. Visual encodings can include position, orientation, scale (font size), and look (texture, actual font, shadows, transparency, etc.).
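
As a rough illustration of Steps 1 and 2, the sketch below tokenizes a string with a regular expression and counts the surviving tokens. It is only one possible approach, not the prescribed recipe: the function names tokenize and word_frequencies, the tiny STOP_WORDS set, and the min_length cutoff are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lower-case the text and return its word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_frequencies(text, stop_words=STOP_WORDS, min_length=2):
    """Count tokens, skipping stop words and tokens shorter than min_length."""
    kept = [
        token
        for token in tokenize(text)
        if token not in stop_words and len(token) >= min_length
    ]
    return Counter(kept)


sample = "The cat and the hat. The cat sat in the hat."
frequencies = word_frequencies(sample)
print(frequencies.most_common(3))
```

The resulting counts are the "numerical descriptions" of Step 2 in their simplest frequentist form; token length or any other characteristic could be computed from the same token list.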

 

Tools: To make wordmiasmas you are free to use any potions, cauldrons, and whatever else. However, we place some at your disposal. We will certainly provide a particular recipe for creating these miasmas. Now some details.

 

1.    Scrape: There are many tools one could use depending on the software ecosystem:

a.    Python: scikit-learn's CountVectorizer or NLTK's tokenizers.

b.    JavaScript: Joseph Adams's notes on using the map and reduce functions in JavaScript.

c.    Many other tools exist here (choose DataHandling).

2.    Analyze: Using statistical methods in Python, R, or Matlab, you can generate "statistics", or characterizations. Nothing fancy: the bare frequentist approaches, including histogramming, will do.

3.    Visualize: D3.js kicks in. Again, do not hurt yourself with collisions, etc. Control the clutter by scaling and ranking. For inspiration you can use Joseph Adams's notes.
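
One way to "control the clutter by scaling and ranking" before handing data to D3 is sketched below: keep only the top-ranked words and map their counts to font sizes on a square-root scale. The function name font_sizes, the pixel range, the top_n cutoff, and the choice of a square-root scale are assumptions for illustration; a linear or logarithmic scale would serve equally well.

```python
import json
import math


def font_sizes(counts, top_n=50, min_px=12, max_px=72):
    """Rank words by count, keep the top_n, and assign each a font size in pixels.

    Sizes are interpolated on a square-root scale between min_px and max_px,
    which damps the dominance of a few very frequent words.
    """
    # Sort by descending count, breaking ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    if not ranked:
        return []
    low = math.sqrt(ranked[-1][1])
    high = math.sqrt(ranked[0][1])
    span = high - low
    sized = []
    for word, count in ranked:
        # If every kept word has the same count, give them all the largest size.
        t = (math.sqrt(count) - low) / span if span > 0 else 1.0
        sized.append(
            {"text": word, "count": count, "size": round(min_px + t * (max_px - min_px))}
        )
    return sized


example_counts = {"data": 16, "cloud": 4, "word": 1}
sized_words = font_sizes(example_counts, top_n=3)
print(json.dumps(sized_words, indent=2))
```

The printed JSON could be saved to a file and loaded by a D3 script (for example with d3.json), which would then only need to bind each record's size to the font-size of a text element.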

 

Exemplar Workflows: There are many places where wordmiasmas are used in other ways.

1.    Look at the example of WebLogo. They have been used to observe the impact of "motifs" in genetics and proteomics. Many

2.    Joseph Adams's notes using JavaScript.

3.    Jason Davis's word clouds.

4.    Wordle does this well.

5.    This one does differently.

6.    Not everyone likes them. Especially this guy :).

 

Examples of word miasmas (suggestions from Chaitanya Kulkarni - Grader)

Find examples below of word miasmas. Take note of the problem context and identify the tasks, which include comparison of word clouds and likely inferences and hypotheses from the word miasma. For any of the applications below, ensure that there is enough user interaction to generate hypotheses. Now the list:

1.    Create a simple food chain word miasma representing the population of each animal species by font size. Then create a whole food web of two geographical areas, or of the same geographical area at multiple times.

2.    You can do the same with cities and climates, with the size of the city represented by font size, location, and orientation.

3.    Planets, galaxies, and their size represented by font size.

4.    Miasma for text classification. Positive words in green color, and negative words in red. A dictionary is needed.

5.    Compare the text of novels from different genres. Make word clouds for each genre, such as horror, sci-fi, thriller, non-fiction, etc.

6.    Make word clouds of speeches/ramblings of famous and infamous folks and have the class guess the speaker. The goal is to increase the success of recognition.

7.    Analyze presidential speeches, or historical speeches by MK Gandhi or Martin Luther. (I want to see who uses the word "non-violence" more.)

8.    Shakespearean English vs Normal Joe English. Here the emphasis will be on phrases rather than on individual words, so the tokenizer should tokenize phrases and not words (HARD).

9.    Compare the works of rappers and find out who makes the most use of English vocabulary :).

10. Or make your own ...
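
As one illustration of suggestion 4 above, the sketch below assigns a color to each word from a sentiment dictionary. The POSITIVE and NEGATIVE sets are made-up placeholders, and the function name color_for and the neutral gray fallback are assumptions for this example; a real project would substitute a proper sentiment lexicon.

```python
# Hypothetical miniature sentiment dictionary; replace with a real lexicon.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "awful", "sad", "hate"}


def color_for(word, positive=POSITIVE, negative=NEGATIVE, neutral="gray"):
    """Return 'green' for positive words, 'red' for negative ones, else neutral."""
    lowered = word.lower()
    if lowered in positive:
        return "green"
    if lowered in negative:
        return "red"
    return neutral


words = ["Love", "hate", "table"]
colored = [{"text": w, "color": color_for(w)} for w in words]
print(colored)
```

Each record's color could then be bound to the fill of the corresponding text element in D3, alongside a font size derived from frequency.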