
Awesome Natural Language Processing With Ruby – Massive Collection of Resources

Useful resources for text processing in Ruby

This field is also known as HLT (Human Language Technology) and is often combined with Artificial Intelligence, Machine Learning, Information Retrieval, Text Mining, Knowledge Extraction, and other related disciplines.

✨ Tutorials

Please help us to fill out this section! 😃

NLP Pipeline Subtasks

An NLP pipeline starts with plain text.

Pipeline Generation

  • composable_operations
    Definition framework for operation pipelines.
  • ruby-spark
    Spark bindings with an easy to understand DSL.
  • phobos
    Simplified Ruby Client for Apache Kafka.
  • parallel
    Supervisor for parallel execution on multiple CPUs or in many threads.
  • pwrake
    Rake extensions to run local and remote tasks in parallel.
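
As a rough illustration of the pipeline idea, the sketch below chains a few processing steps so that each step receives the output of the previous one. It uses only plain Ruby; the names `TOY_STEPS` and `run_pipeline` are hypothetical and do not come from any of the gems listed above.

```ruby
# Hypothetical processing steps; each one is a callable taking one argument.
TOY_STEPS = [
  ->(text)   { text.downcase.strip },                  # normalize the raw text
  ->(text)   { text.scan(/[[:alpha:]]+/) },            # split into word tokens
  ->(tokens) { tokens.reject { |t| t.length < 3 } }    # drop very short tokens
].freeze

# Feed the input through every step in order and return the final result.
def run_pipeline(input, steps)
  steps.reduce(input) { |acc, step| step.call(acc) }
end

p run_pipeline("  An NLP pipeline starts with plain text. ", TOY_STEPS)
```

Keeping each step a small callable makes it easy to reorder, drop, or parallelize stages later, which is the kind of work the libraries above are meant to take over.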

Multipurpose Engines

On-line APIs

Language Identification

Language Identification is one of the first crucial steps in every NLP Pipeline.

  • scylla
    Language Categorization and Identification.
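
One very simple way to guess a language is to count how many frequent function words of each candidate language occur in the text. The sketch below assumes tiny hand-picked word lists (`TOY_PROFILES`, a hypothetical name) and is only meant to show the idea; real identifiers such as the gem above rely on much richer statistical profiles.

```ruby
# Tiny illustrative word lists; real profiles are far larger.
TOY_PROFILES = {
  english: %w[the and is of to in that it],
  german:  %w[der die das und ist nicht ein zu],
  french:  %w[le la les et est un une des]
}.freeze

# Return the language whose function words occur most often, or nil if none match.
def guess_language(text)
  tokens = text.downcase.scan(/[[:alpha:]]+/)
  scores = TOY_PROFILES.transform_values do |words|
    tokens.count { |t| words.include?(t) }
  end
  best, hits = scores.max_by { |_, score| score }
  hits.zero? ? nil : best
end

p guess_language("Der Hund ist nicht das Problem")
p guess_language("Le chat est dans la maison")
```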

Segmentation

Tools for Tokenization, Word and Sentence Boundary Detection and Disambiguation.

  • tokenizer
    Simple multilingual tokenizer.
    [tutorial]
  • pragmatic_tokenizer
    Multilingual tokenizer to split a string into tokens.
  • nlp-pure
    Natural language processing algorithms implemented in pure Ruby with minimal dependencies.
  • textoken
    Simple and customizable text tokenization library.
  • pragmatic_segmenter
    Rule-based Sentence Boundary Disambiguation with many extras.
  • punkt-segmenter
    Pure Ruby implementation of the Punkt Segmenter.
  • tactful_tokenizer
    RegExp based tokenizer for different languages.
  • scalpel
    Sentence Boundary Disambiguation tool.
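
To make the two tasks concrete, here is a naive regular-expression sketch of word tokenization and sentence splitting in plain Ruby. The method names are hypothetical, and the rules are deliberately crude: the sentence splitter, for example, will wrongly break after abbreviations such as "Dr.", which is exactly the kind of case the libraries above try to handle.

```ruby
# Word tokens are runs of letters/digits, optionally joined by an inner
# apostrophe or hyphen; every other non-space character is its own token.
def naive_tokenize(text)
  text.scan(/[[:alnum:]]+(?:['\-][[:alnum:]]+)*|[^[:alnum:][:space:]]/)
end

# Split after ., ! or ? when followed by whitespace and an uppercase letter.
def naive_sentences(text)
  text.split(/(?<=[.!?])\s+(?=[[:upper:]])/)
end

text = "Ruby is fun. It doesn't need much code!"
p naive_tokenize(text)
p naive_sentences(text)
```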

Lexical Processing

Stemming

Stemming is the term used in information retrieval to describe the process of
reducing word forms to some base representation. Stemming should be distinguished
from Lemmatization, since stems do not necessarily have
linguistic motivation.

  • ruby-stemmer
    Ruby-Stemmer exposes the SnowBall API to Ruby.
  • uea-stemmer
    Conservative stemmer for search and indexing.
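
The point that stems need not be linguistically motivated can be seen with a toy suffix stripper. The sketch below is not the Porter or Snowball algorithm; the suffix list and the name `toy_stem` are assumptions made purely for illustration.

```ruby
# Illustrative suffix list, checked in this order.
TOY_SUFFIXES = %w[ing ed es s].freeze

# Strip the first matching suffix, as long as at least 3 characters remain.
def toy_stem(word)
  w = word.downcase
  suffix = TOY_SUFFIXES.find { |s| w.end_with?(s) && w.length - s.length >= 3 }
  suffix ? w[0...-suffix.length] : w
end

%w[connected connecting connects running studies].each do |w|
  puts "#{w} -> #{toy_stem(w)}"
end
```

The three forms of "connect" collapse to one stem, which is what a search index needs, while "running" and "studies" are reduced to strings that are not words at all.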

Lemmatization

Lemmatization is the process of finding the base (dictionary) form of a word. Lemmas
are often collected in dictionaries.

  • lemmatizer
    WordNet based Lemmatizer for English texts.
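
In its simplest form, lemmatization can be pictured as a dictionary lookup. The sketch below assumes a tiny hand-written table (`TOY_LEMMAS`, a hypothetical name); a real lemmatizer draws on a full lexicon such as WordNet and usually takes the part of speech into account.

```ruby
# Tiny illustrative lexicon mapping inflected forms to lemmas.
TOY_LEMMAS = {
  "mice"    => "mouse",
  "went"    => "go",
  "was"     => "be",
  "studies" => "study"
}.freeze

# Look the word up; unknown words are returned unchanged (lowercased).
def toy_lemma(word)
  w = word.downcase
  TOY_LEMMAS.fetch(w, w)
end

%w[Mice went studies cat].each { |w| puts "#{w} -> #{toy_lemma(w)}" }
```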

Lexical Statistics: Counting Types and Tokens

  • wc
    Facilities to count word occurrences in a text.
  • word_count
    Word counter for String and Hash objects.
  • words_counted
    Pure Ruby library counting word statistics with different custom options.
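
The distinction between tokens (running words) and types (distinct words) can be sketched in a few lines of plain Ruby. The method name `type_token_stats` and the letters-only notion of a word are assumptions of this example, not the behaviour of the gems above.

```ruby
# Count tokens (all word occurrences) and types (distinct words) in a text.
def type_token_stats(text)
  tokens = text.downcase.scan(/[[:alpha:]]+/)
  counts = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
  ratio  = tokens.empty? ? 0.0 : counts.size.fdiv(tokens.size)
  { tokens: tokens.size, types: counts.size, ratio: ratio, counts: counts }
end

stats = type_token_stats("The cat and the hat")
puts "tokens: #{stats[:tokens]}, types: #{stats[:types]}, ratio: #{stats[:ratio]}"
```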

Filtering Stop Words

  • stopwords-filter
    Filter and Stop Word Lexicon based on the SnowBall lemmatizer.
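
Stop word filtering itself amounts to a set membership test. The following sketch assumes a very small hand-picked English list (`TOY_STOP_WORDS`, a hypothetical name); real lexicons are considerably longer and language specific.

```ruby
require "set"

# Tiny illustrative stop word list.
TOY_STOP_WORDS = Set.new(%w[a an and is of the to]).freeze

# Keep only the tokens that are not in the stop word set (case-insensitive).
def remove_stop_words(tokens, stop_words = TOY_STOP_WORDS)
  tokens.reject { |t| stop_words.include?(t.downcase) }
end

p remove_stop_words(%w[The cat is on the mat])
```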

Phrasal Level Processing

  • n_gram
    N-Gram generator.
  • ruby-ngram
    Break words and phrases into ngrams.
  • raingrams
    Flexible and general-purpose ngrams library written in pure Ruby.
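
For small inputs, n-grams can be produced with Ruby's built-in `Enumerable#each_cons`. The helper names below are hypothetical; the gems above add conveniences such as frequency counts on top of this basic idea.

```ruby
# Word-level n-grams: every run of n consecutive items.
def ngrams(items, n)
  items.each_cons(n).to_a
end

# Character-level n-grams of a single word.
def char_ngrams(word, n)
  word.chars.each_cons(n).map(&:join)
end

p ngrams(%w[to be or not], 2)
p char_ngrams("ruby", 3)
```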

Syntactic Processing

Constituency Parsing

  • stanfordparser
    Ruby based wrapper for the Stanford Parser.
  • rley
    Pure Ruby implementation of the Earley
    Parsing Algorithm for Context-Free Constituency Grammars.
  • rsyntaxtree
    Visualization for syntactic trees in Ruby based on RMagick.
    [dep: ImageMagick]

Semantic Analysis

  • amatch
    Set of five distance types between strings (including Levenshtein, Sellers, Jaro-Winkler, ‘pair distance’).
  • damerau-levenshtein
    Calculates edit distance using the Damerau-Levenshtein algorithm.
  • hotwater
    Fast Ruby FFI string edit distance algorithms.
  • levenshtein-ffi
    Fast string edit distance computation, using the Damerau-Levenshtein algorithm.
  • tf_idf
    Term Frequency / Inverse Document Frequency in pure Ruby.
  • tf-idf-similarity
    Calculate the similarity between texts using TF/IDF.
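
Several of the entries above compute an edit distance between strings. As a reference point, here is a plain Ruby sketch of the classic Levenshtein distance (insertions, deletions, and substitutions only, without the transpositions that Damerau-Levenshtein adds). It is written for clarity rather than speed, and the method name is an assumption of this example.

```ruby
# Levenshtein distance using two rows of the dynamic-programming table.
def levenshtein(a, b)
  a_chars = a.chars
  b_chars = b.chars
  prev = (0..b_chars.size).to_a            # distance from "" to each prefix of b

  a_chars.each_with_index do |ca, i|
    curr = [i + 1]                         # distance from a[0..i] to ""
    b_chars.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [
        prev[j + 1] + 1,                   # deletion
        curr[j] + 1,                       # insertion
        prev[j] + cost                     # substitution (or match)
      ].min
    end
    prev = curr
  end

  prev.last
end

puts levenshtein("kitten", "sitting")
```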

Pragmatic Analysis

  • SentimentLib
    Simple extensible sentiment analysis gem.

High Level Tasks

Spelling and Error Correction

Text Alignment

  • alignment
    Alignment routines for bilingual texts (Gale-Church implementation).

Machine Translation

  • google-api-client
    Google API Ruby Client.
  • microsoft_translator
    Ruby client for the Microsoft Translator API.
  • termit
    Google Translate with speech synthesis in your terminal.
  • zipf
    Implementation of BLEU and other base algorithms.

Sentiment Analysis

Numbers, Dates, and Time Parsing

  • chronic
    Pure Ruby natural language date parser.
  • chronic_between
    Simple Ruby natural language parser for date and time ranges.
  • chronic_duration
    Pure Ruby parser for elapsed time.
  • kronic
    Methods for parsing and formatting human readable dates.
  • nickel
    Extracts date, time, and message information from naturally worded text.
  • tickle
    Parser for recurring and repeating events.
  • numerizer
    Ruby parser for English number expressions.

Named Entity Recognition

  • ruby-ner
    Named Entity Recognition with Stanford NER and Ruby.
  • ruby-nlp
    Ruby bindings for the Stanford POS Tagger and Named Entity Recognizer.

Text-to-Speech-to-Text

  • espeak-ruby
    Small Ruby API for utilizing ‘espeak’ and ‘lame’ to create text-to-speech mp3 files.
  • tts
    Text-to-Speech conversion using the Google translate service.
  • att_speech
    Ruby wrapper over the AT&T Speech API for speech to text.
  • pocketsphinx-ruby
    Pocketsphinx bindings.

Dialog Agents, Assistants, and Chatbots

  • chatterbot
    Straightforward Ruby-based Twitter Bot Framework, using OAuth to authenticate.
  • lita
    Highly extensible chat operations bot framework with persistent storage in Redis.

Linguistic Resources

Machine Learning Libraries

Machine Learning Algorithms
in pure Ruby or written in other programming languages with appropriate bindings
for Ruby.

For a more up-to-date list, please see the Awesome ML with Ruby list.

  • rb-libsvm
    Support Vector Machines with Ruby.
  • weka
    JRuby bindings for Weka, different ML algorithms implemented through Weka.
  • decisiontree
    Decision Tree ID3 Algorithm in pure Ruby
    [post].
  • rtimbl
    Memory based learners from the Timbl framework.
  • classifier-reborn
    General classifier module to allow Bayesian and other types of classifications.
  • lda-ruby
    Ruby implementation of the LDA
    (Latent Dirichlet Allocation) for automatic Topic Modelling and Document Clustering.
  • liblinear-ruby-swig
    Ruby interface to LIBLINEAR (much more efficient than LIBSVM for text classification).
  • linnaeus
    Redis-backed Bayesian classifier.
  • maxent_string_classifier
    JRuby maximum entropy classifier for string data, based on the OpenNLP Maxent framework.
  • naive_bayes
    Simple Naive Bayes classifier.
  • nbayes
    Full-featured, Ruby implementation of Naive Bayes.
  • omnicat
    Generalized rack framework for text classifications.
  • omnicat-bayes
    Naive Bayes text classification implementation as an OmniCat classifier strategy.
  • ruby-fann
    Ruby bindings to the Fast Artificial Neural Network Library (FANN).
  • rblearn
    Feature Extraction and Crossvalidation library.

Data Visualization

Please refer to the Data Visualization
section on the Data Science with Ruby list.

Optical Character Recognition

Text Extraction

  • yomu
    Library for extracting text and metadata from files and documents
    using the Apache Tika content analysis toolkit.

Full Text Search, Information Retrieval, Indexing

Language Aware String Manipulation

Libraries for language aware string manipulation, i.e. search, pattern matching,
case conversion, transcoding, regular expressions which need information about
the underlying language.

  • fuzzy_match
    Fuzzy string comparison with Distance measures and Regular Expressions.
  • fuzzy-string-match
    Fuzzy string matching library for Ruby.
  • active_support
    The Ruby on Rails ActiveSupport gem provides various string extensions, including case handling.
  • fuzzy_tools
    Toolset for fuzzy searches in Ruby tuned for accuracy.
  • u
    U extends Ruby’s Unicode support.
  • unicode
    Unicode normalization library.
  • CommonRegexRuby
    Find many kinds of common information in a string.
  • regexp-examples
    Generate strings that match a given regular expression.
  • verbal_expressions
    Make difficult regular expressions easy.
  • translit_kit
    Transliterate Hebrew & Yiddish text into Latin characters.
  • re2
    High-speed Regular Expression library for Text Mining and Text Extraction.
  • regex_sample
    Sample string generation from a given Regular Expression.
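
Some language-aware comparison is possible with Ruby's standard library alone. The sketch below assumes UTF-8 strings and uses `String#unicode_normalize` together with `String#downcase(:fold)` to compare texts regardless of how accents are encoded or how letters are cased; the helper name `same_text?` is hypothetical.

```ruby
# Compare two UTF-8 strings after Unicode normalization (NFC) and case folding.
def same_text?(a, b)
  a.unicode_normalize(:nfc).downcase(:fold) ==
    b.unicode_normalize(:nfc).downcase(:fold)
end

composed   = "caf\u00E9"    # "é" as a single code point
decomposed = "CAFE\u0301"   # "E" followed by a combining acute accent

puts composed == decomposed            # plain comparison sees different strings
puts same_text?(composed, decomposed)  # normalized, case-folded comparison
puts same_text?("Straße", "STRASSE")   # case folding maps "ß" to "ss"
```

Rules that depend on a specific language (for example the Turkish dotted and dotless "i") still need explicit options or one of the libraries listed above.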

Articles, Posts, Talks, and Presentations

Projects and Code Examples

Books

  • Miller, Rob.
    Text Processing with Ruby: Extract Value from the Data That Surrounds You.
    Pragmatic Programmers, 2015.
    [link]
  • Watson, Mark.
    Scripting Intelligence: Web 3.0 Information Gathering and Processing.
    APRESS, 2010.
    [link]
  • Watson, Mark.
    Practical Semantic Web and Linked Data Applications. Lulu, 2010.
    [link]

Community

Needs your Help!

All projects in this section are really important for the community but need
more attention. If you have spare time and dedication, please spend a few hours on the code here.

Related Resources

