Ask Watson!

Continuing our text series, we now wish to add a note about exploiting APIs (Application Programming Interfaces). We mentioned NLTK, Gensim, and spaCy previously; using those libraries involves installing packages either 'on-premise' or in the 'cloud' and dealing with dependencies, version changes, and any associated host and operating-system issues. The other option is to just 'Ask Watson'.

So we did! The notebook is in our GitHub repository.
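For the curious, here is a minimal sketch of the kind of call involved, using the watson_developer_cloud Python SDK. The credentials, version string, and the job_description_text variable are placeholders, not values from our notebook.

    from watson_developer_cloud import NaturalLanguageUnderstandingV1
    from watson_developer_cloud.natural_language_understanding_v1 import (
        Features, EntitiesOptions, KeywordsOptions)

    # Placeholder credentials; real ones come from the IBM Cloud console
    nlu = NaturalLanguageUnderstandingV1(
        version='2018-03-16',
        username='YOUR_USERNAME',
        password='YOUR_PASSWORD')

    # Send the entire text and ask for keywords and entities, each of
    # which comes back with a relevance score between 0 and 1
    response = nlu.analyze(
        text=job_description_text,  # placeholder: the full job description
        features=Features(
            keywords=KeywordsOptions(limit=10),
            entities=EntitiesOptions(limit=10)))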

Here are some examples of the values returned by the API. The job title was:

‘Client Partner, eCommerce.’

[Screenshot: Watson NLU keyword results]

The most relevant keyword (0.97) is 'Online Advertising experience', which seems congruent.

[Screenshot: Watson NLU entity results]

The most relevant entity is 'Facebook' at 0.91. Our experiment was based on sending the entire text to Watson, and this represents our first interaction with the Watson NLU service.

ArrayFire with Python

A short note to follow up on a recent article about Julia.

In that article we wrote about a Julia wrapper for the ArrayFire library. Now we have also evaluated the Python wrapper for the library.

This time we created a Gist on GitHub using nteract.

You can examine the entire notebook via the Gist, but here are some illustrative screenshots.


Information about the default device


Information about all the devices on the system


Finally, a simple two-dimensional array, sampled from the uniform distribution, with a matrix operation, computed on the Quadro K1100M.
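For reference, a minimal sketch of the same steps using the arrayfire Python package; the array size is illustrative.

    import arrayfire as af

    # Information about ArrayFire and the devices on the system
    af.info()

    # A simple two-dimensional array sampled from the uniform distribution
    a = af.randu(1024, 1024)

    # A matrix operation computed on the device: A times A-transpose
    b = af.matmul(a, af.transpose(a))

    af.sync()        # wait for the device to finish
    print(b.dims())  # (1024, 1024)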

 

Wall Street Data Science

Oftentimes you might hear the terms Finance Quant, or Quants, and then Data Scientist.

The widespread view of the 'magical' Data Scientist is usually presented using the image below, from here.

[Image: the popular view of the Data Scientist]

Finance Quantitative Analysts then surely must do Data Science.


As in all things, we went investigating. Wall Street types are not likely to tell us their secrets, so the next best thing is to sign up for a training course. We did, and we really enjoyed it:

https://www.datacamp.com/courses/intro-to-portfolio-risk-management-in-python


Take the course, and you will also see that the ‘Finance Quant’ world is not that scary. But I guess the instructor was also brilliant!

External validation and checking

This is a short post to demonstrate a technique we believe in firmly: testing and validating work and new findings. We came across a service called Textio.

Textio is an application that helps employers write good job descriptions. We found that interesting. We use Grammarly to check our writing, but we hadn't been aware of Textio heretofore. With this little gem, we could now validate our work against an external source by running a little test.

Testing

To run an early test, we opted to select a single job description, this time just the first one, run it through Textio, and compare the metrics to those that we calculated ourselves. Sounds like fun! Let us get to it then!

Our data and code, retrieved from GitHub, allow us to draw the single sample easily.
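In outline it looks something like this; the pickle file name is a placeholder, while the DataFrame b and the job_des column follow our earlier notebooks.

    import pandas as pd

    # Re-load the DataFrame pickled in the previous article
    b = pd.read_pickle('job_descriptions.pkl')  # placeholder file name

    # Draw the single sample: the first job description
    sample = b.iloc[0]
    print(sample['job_des'][:300])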


Next, we retrieve the actual job description. We opted for the original URL rather than any of our own data. It is necessary to copy the text of the job description and paste it into Textio. We liked Textio; it is very efficient.

The Textio output is very friendly and informative. On our side, the Dale-Chall score is 12.13, which suggests an audience with a college education (grades 13-15). The Flesch Reading Ease score is 0.52, which suggests the reader will find the writing 'very confusing'. SMOG is 16.7, indicating that about 16 years of education are required. The Gunning Fog index is 35.84, indicating the reader will find the text 'too difficult'.
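The post does not name the library behind our numbers; the textstat package is one option that computes the same four measures, for anyone wanting to reproduce the comparison.

    import textstat

    text = sample['job_des']  # the job description drawn above

    print('Dale-Chall :', textstat.dale_chall_readability_score(text))
    print('Flesch     :', textstat.flesch_reading_ease(text))
    print('SMOG       :', textstat.smog_index(text))
    print('Gunning Fog:', textstat.gunning_fog(text))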

[Screenshot: the Textio score for the job description]

Here is another test score, which shows the Textio scale.

If we ask Grammarly for a view of the same text, we get a score of 79.

[Screenshot: the Grammarly score]

Grammarly flags spelling errors, duplicated words, grammar issues, and passive writing.

The report from Grammarly appears to agree with our own output.

Concluding remarks

Here we have demonstrated good practice in cross-checking and validating early outcomes. Our work takes the job applicant's perspective, with Textio taking the adviser's perspective. Both views suggest the written text could be improved if so desired.


Meanwhile, the FAANG group has a market cap of $3.015 trillion, with net income of around $80 billion annually. Would we expect spelling errors and hard-to-read text? If the job posting is ordinary, what does that actually mean internally?

Parsing Natural Language

In our previous article, we explained the process for creating a corpus of employee job descriptions. In this article, we will continue the project and attempt to identify critical attributes from the job descriptions. Those attributes include:

  1. The role,
  2. Skills of the candidate,
  3. Education,
  4. Experience,
  5. Market information,
  6. What would the candidate be doing?

First, a word on tools.

Tool selection

Initially, we had in mind to use the Natural Language Toolkit (NLTK). Its authors wrote a book (the Python 3 version is available online), and everything is explained very well. Having studied the book (again), we realised that we would have to train our own models for the critical components of the text-processing pipeline. Those are:

  1. Tokens or words,
  2. Part of speech tagging,
  3. Chunking and chinking,
  4. Chunk parsing.

That appeared a little heavy for our purpose: a lot of regular expressions and all that. We then searched for other NLP libraries, and our search brought us to Gensim and spaCy.

Gensim

“Topic modelling for humans”

spaCy

“Industrial-Strength Natural Language Processing in Python”

They do exactly what it says on the label; both are brilliant libraries. We selected spaCy as it better supported our workloads. Next, to getting on with it.

Getting on

With our library selected, we moved on to designing our own language-parsing script. Our script for this exercise is available on GitHub here. The whole exercise warrants some discussion. First, an overview of the notebook.

  1. Our script starts by loading the Pickle library and then rebuilding the Pandas DataFrame from the previous work by re-loading it.
  2. Next, we include a class from the book 'Python 3 Text Processing with NLTK 3 Cookbook' which demonstrates how to fix a standard issue. Consider the words a) "can't" and b) "cannot": one is a poor token, the other a proper English word. The class, RegexpReplacer, provides a mechanism to treat contractions (a sketch follows this list).
  3. Next, we build a dictionary to translate dependency codes into something approaching English, which is necessary for the exploration phase of any work.
  4. Next, we build some small reference tables to help convert Part-of-Speech tags and Named Entity types to equivalent English descriptions. The idea is that, through the code, we can convert a tag or NER code into an equivalent English description.
  5. Next, we include spaCy in the script.
  6. With the installation of spaCy, there is also the need to download a language model. When installing the model on Windows 10, ensure you use the 'Run as administrator' option.
  7. Two steps remain:
    1. We define our custom parser,
    2. We run that parser over all the job descriptions. Running the custom parser over the DataFrame is not complicated; building the parser is!
    3. b['lang_parse'] = b['job_des'].apply(lambda x: job_des_parse(x))
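The contraction-handling class from the Cookbook (step 2) looks roughly like this; the replacement patterns shown are a representative subset, not the book's full list.

    import re

    # A representative subset of the Cookbook's contraction patterns
    replacement_patterns = [
        (r"won't", 'will not'),
        (r"can't", 'cannot'),
        (r"(\w+)'ll", '\\g<1> will'),
        (r"(\w+)n't", '\\g<1> not'),
        (r"(\w+)'ve", '\\g<1> have'),
        (r"(\w+)'re", '\\g<1> are'),
    ]

    class RegexpReplacer(object):
        """Expands contractions such as "can't" into proper English words."""
        def __init__(self, patterns=replacement_patterns):
            self.patterns = [(re.compile(regex), repl)
                             for (regex, repl) in patterns]

        def replace(self, text):
            for (pattern, repl) in self.patterns:
                text = pattern.sub(repl, text)
            return text

For steps 5 and 6, note that spaCy itself ships with explain(), which performs the same code-to-English translation as the reference tables of steps 3 and 4:

    import spacy

    # Download the model once (on Windows 10, from an elevated prompt):
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

    print(spacy.explain('NN'))   # 'noun, singular or mass'
    print(spacy.explain('GPE'))  # 'Countries, cities, states'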

 

A custom parser is complicated

A custom parser is very complicated. We found a reasonable explanation of using spaCy in a GitHub Gist. Following along with the author, and taking some guidance from NLTK, the book, Stack Overflow, 'ask Google', Wikipedia, and others, we managed to code a function.

spaCy essentially does all the work. Our code takes a text, and that text is converted into a spaCy document. That process of switching to a spaCy object is actually where the parsing takes place. We then need to iterate over the sentences in each job description.

During each iteration over a single job description, we pull out each sentence's noun chunks, the root word, the named entities, the subject of the sentence, and the phrases before the root word (prep_phrases). These elements are stored in a Python dictionary and finally added to the DataFrame.
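The actual function in the notebook is more elaborate, but a simplified reconstruction of job_des_parse conveys its shape. The dictionary keys follow the description above; the dependency labels used to find the subject and the prepositional phrases are our assumptions.

    import spacy

    nlp = spacy.load('en_core_web_sm')

    def job_des_parse(text):
        """Simplified sketch: one dictionary entry per sentence."""
        doc = nlp(text)  # parsing happens as the text becomes a spaCy Doc
        parse = {}
        for i, sent in enumerate(doc.sents):
            root = sent.root
            parse[i] = {
                'noun-chunks': [nc.text for nc in sent.noun_chunks],
                'root': (root.text, root.tag_),
                'entities': [(ent.text, ent.label_) for ent in sent.ents],
                'subject': [t.text for t in sent
                            if t.dep_ in ('nsubj', 'nsubjpass')],
                'prep_phrases': [t.text for t in sent
                                 if t.dep_ == 'prep' and t.i < root.i],
            }
        return parse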

We are still dealing with a preliminary parse, or some form of annotation. However, we have now added many additional attributes to the text-mining exercise. Line 155, below, shows the structure of the dictionary: '{0:' tells us about the first sentence; 'noun-chunks' is a list of the chunks of the sentence built around nouns; 'root' gives us information about the root of the sentence. The root word is 'description', and the POS tag is 'NN', or 'noun, singular or mass'. Perhaps not a great example. Line 154 shows what the job description actually looks like in its raw state.

[Screenshots: the raw job description (line 154) and the parsed dictionary (line 155)]

Perhaps another example will explain our enthusiasm.

[Screenshot: parser output at line 156, recovering 'Dublin']

Line 156 shows us that our function has recovered 'Dublin' as an entity of the 'Countries, cities, states' (GPE) type.

Another example

[Screenshot: parser output at line 157, sentences 11 and 12]

Here, at line 157, we see that sentences 11 and 12 appear to discuss 'responsibilities'. It is early days in the design of the parser, but this is good progress so far. Tune in next time, when we will continue to process the data and refine the parser.

Note

We used a lot of third-party material and external references in this work. Some are referenced directly in the code, while others are referenced here. Substantially, this is the work of the people who maintain the Python libraries (Gensim, spaCy, pandas, NumPy, NLTK) and of those who work to provide explanations of how to use those tools.