R packages by jwijffels

udpipe - Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Next to text parsing, the package also allows you to train annotation models based on data of 'treebanks' in 'CoNLL-U' format as provided at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper: 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionality for commonly used data manipulations on texts which are enriched with the output of the parser: collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.

Last updated 2 years ago

conll, dependency-parser, lemmatization, natural-language-processing, nlp, pos-tagging, r-pkg, rcpp, text-mining, tokenizer, udpipe, cpp

11.63 score 215 stars 8 dependents 1.2k scripts 4.0k downloads

word2vec - Distributed Representations of Words

Learn vector representations of words by continuous bag of words and skip-gram implementations of the 'word2vec' algorithm. The techniques are detailed in the paper "Distributed Representations of Words and Phrases and their Compositionality" by Mikolov et al. (2013), available at <arXiv:1310.4546>.

Last updated 12 months ago

embeddings, natural-language-processing, word2vec, cpp

8.09 score 70 stars 5 dependents 236 scripts 920 downloads

cronR - Schedule R Scripts and Processes with the 'cron' Job Scheduler

Create, edit, and remove 'cron' jobs on your unix-alike system. The package provides a set of easy-to-use wrappers to 'crontab'. It also provides an RStudio add-in to easily launch and schedule your scripts.

Last updated 1 year ago

cron, rstudio, scheduler

7.16 score 291 stars 165 scripts 887 downloads

textrank - Summarize Text by Ranking Sentences and Finding Keywords

The 'textrank' algorithm is an extension of the 'Pagerank' algorithm for text. The algorithm allows you to summarize text by calculating how sentences are related to one another. This is done by looking at overlapping terminology used in sentences in order to set up links between sentences. The resulting sentence network is then plugged into the 'Pagerank' algorithm, which identifies the most important sentences in your text and ranks them. In a similar way, 'textrank' can also be used to extract keywords. A word network is constructed by checking whether words follow one another. The 'Pagerank' algorithm is then applied to that network to extract relevant words, after which relevant words that follow one another are combined into keywords. More information can be found in the paper by Mihalcea, Rada & Tarau, Paul (2004) <https://www.aclweb.org/anthology/W04-3252/>.

Last updated 4 years ago

natural-language-processing, nlp, textrank, textrank-algorithm

7.04 score 77 stars 1 dependent 96 scripts 532 downloads
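
The sentence-ranking idea in the textrank description can be illustrated with a small, language-neutral sketch in Python. It is not the package's R API: the function name `rank_sentences`, the use of Jaccard overlap as the link weight, and the damping and iteration defaults are assumptions chosen for illustration.

```python
def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score tokenized sentences with PageRank over a word-overlap graph.

    `sentences` is a list of token lists. Link weights use Jaccard overlap,
    which is an assumption of this sketch, not necessarily the package default.
    """
    n = len(sentences)
    words = [set(s) for s in sentences]

    # Build a symmetric weight matrix from overlapping terminology.
    weight = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            union = words[i] | words[j]
            if union:
                weight[i][j] = weight[j][i] = len(words[i] & words[j]) / len(union)

    out_weight = [sum(row) for row in weight]
    scores = [1.0 / n] * n

    # Weighted PageRank by power iteration; sentences without links
    # only receive the (1 - damping) / n base score.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weight[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    ["cats", "chase", "mice"],
    ["mice", "eat", "cheese"],
    ["cats", "like", "cheese", "and", "mice"],
    ["the", "sky", "is", "blue"],
]
scores = rank_sentences(sentences)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The third sentence shares words with both of the first two, so it collects the highest score, while the unrelated fourth sentence ends up last.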

textplot - Text Plots

Visualise complex relations in texts. This is done by providing functionalities for displaying text co-occurrence networks, text correlation networks, dependency relationships as well as text clustering and semantic text 'embeddings'. Feel free to join the effort of providing interesting text visualisations.

Last updated 3 years ago

6.78 score 54 stars 1 dependent 75 scripts 443 downloads

ruimtehol - Learn Text 'Embeddings' with 'Starspace'

Wraps the 'StarSpace' library <https://github.com/facebookresearch/StarSpace> allowing users to calculate word, sentence, article, document, webpage, link and entity 'embeddings'. By using the 'embeddings', you can perform text based multi-label classification, find similarities between texts and categories, do collaborative-filtering based recommendation as well as content-based recommendation, find out relations between entities, calculate graph 'embeddings' as well as perform semi-supervised learning and multi-task learning on plain text. The techniques are explained in detail in the paper: 'StarSpace: Embed All The Things!' by Wu et al. (2017), available at <arXiv:1709.03856>.

Last updated 12 months ago

classification, embeddings, natural-language-processing, nlp, similarity, starspace, text-mining, cpp

6.65 score 101 stars 44 scripts 282 downloads

crfsuite - Conditional Random Fields for Labelling Sequential Data in Natural Language Processing

Wraps the 'CRFsuite' library <https://github.com/chokkan/crfsuite> allowing users to fit a Conditional Random Field model and to apply it to existing data. The focus of the implementation is on Natural Language Processing, where this R package allows you to easily build and apply models for named entity recognition, text chunking, part of speech tagging, intent recognition or classification of any category you have in mind. Next to training, a small web application is included in the package to allow you to easily construct training data.

Last updated 1 year ago

chunking, conditional-random-fields, crf, crfsuite, data-science, intent-classification, natural-language-processing, ner, nlp, cpp

6.34 score 62 stars 35 scripts 584 downloads

BTM - Biterm Topic Models for Short Text

Biterm Topic Models find topics in collections of short texts. It is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns, which are called biterms. This is in contrast to traditional topic models like Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis, which are word-document co-occurrence topic models. A biterm consists of two words co-occurring in the same short text window. This context window can for example be a twitter message, a short answer on a survey, a sentence of a text or a document identifier. The techniques are explained in detail in the paper 'A Biterm Topic Model For Short Text' by Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng (2013) <https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf>.

Last updated 2 years ago

biterm-topic-modelling, natural-language-processing, topic-modeling, cpp

6.24 score 95 stars 74 scripts 589 downloads
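
To make the notion of a biterm from the BTM description concrete, here is a minimal Python sketch that lists the unordered word pairs found within a context window. It only illustrates the definition and is not the package's R interface; the function name `biterms` and the `window` parameter are hypothetical.

```python
from collections import Counter


def biterms(tokens, window=15):
    """Return unordered word pairs whose positions are fewer than `window` apart."""
    pairs = []
    for i, left in enumerate(tokens):
        # Pair the current word with the words that follow it inside the window.
        for right in tokens[i + 1 : i + window]:
            pairs.append(tuple(sorted((left, right))))
    return pairs


# Count biterms over a tiny collection of short texts.
texts = [["cheap", "flight", "paris"], ["flight", "paris", "hotel"]]
counts = Counter(pair for tokens in texts for pair in biterms(tokens))
print(counts.most_common(3))
```

A topic model in this family is then fitted on these corpus-wide pair counts rather than on per-document word counts.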

spark.sas7bdat - Read in 'SAS' Data ('.sas7bdat' Files) into 'Apache Spark'

Read in 'SAS' Data ('.sas7bdat' Files) into 'Apache Spark' from R. 'Apache Spark' is an open source cluster computing framework available at <http://spark.apache.org>. This R package uses the 'spark-sas7bdat' 'Spark' package (<https://spark-packages.org/package/saurfang/spark-sas7bdat>) to import and process 'SAS' data in parallel using 'Spark'. This allows 'dplyr' statements to be executed in parallel on top of 'SAS' data.

Last updated 4 years ago

sas7bdat, spark, sparklyr

6.08 score 26 stars 23 scripts 4.1k downloads

text.alignment - Text Alignment with Smith-Waterman

Find similarities between texts using the Smith-Waterman algorithm. The algorithm performs local sequence alignment and determines similar regions between two strings. The Smith-Waterman algorithm is explained in the paper: "Identification of common molecular subsequences" by T.F. Smith and M.S. Waterman (1981), available at <doi:10.1016/0022-2836(81)90087-5>. This package implements the same logic for sequences of words and letters instead of molecular sequences.

Last updated 1 year ago

cpp

5.80 score 10 stars 14 scripts 213 downloads
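
The local alignment described for text.alignment can be sketched in a few lines of Python operating on word sequences. This is an illustration of the Smith-Waterman recurrence, not the package's R API; the function name `smith_waterman` and the scoring values (match 2, mismatch -1, gap -1) are assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of two token sequences; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; scores never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            aligned_a.append(a[i - 1]); aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_a.append(a[i - 1]); aligned_b.append(None)  # gap in b
            i -= 1
        else:
            aligned_a.append(None); aligned_b.append(b[j - 1])  # gap in a
            j -= 1
    return best, aligned_a[::-1], aligned_b[::-1]


score, left, right = smith_waterman(
    "the quick brown fox jumps".split(), "a quick brown dog".split()
)
print(score, left, right)
```

Passing `list("some text")` instead of a word list aligns letters rather than words, mirroring the two granularities mentioned in the description.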

doc2vec - Distributed Representations of Sentences, Documents and Topics

Learn vector representations of sentences, paragraphs or documents by using the 'Paragraph Vector' algorithms, namely the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') model. The techniques in the package are detailed in the paper "Distributed Representations of Sentences and Documents" by Le and Mikolov (2014), available at <arXiv:1405.4053>. The package also provides an implementation to cluster documents based on these embeddings using a technique called top2vec. Top2vec finds clusters in text documents by combining document and word embeddings with density-based clustering. It does this by embedding documents in the semantic space as defined by the 'doc2vec' algorithm. Next, it maps these document embeddings to a lower-dimensional space using the 'Uniform Manifold Approximation and Projection' (UMAP) dimensionality reduction algorithm and finds dense areas in that space using a 'Hierarchical Density-Based Clustering' technique (HDBSCAN). These dense areas are the topic clusters, each of which can be represented by a topic vector that aggregates the embeddings of the documents in that cluster. In the same semantic space, similar words can be found which are representative of the topic. More details can be found in the paper 'Top2Vec: Distributed Representations of Topics' by D. Angelov, available at <arXiv:2008.09470>.

Last updated 3 years ago

doc2vec, embeddings, natural-language-processing, paragraph2vec, word2vec, cpp

5.74 score 48 stars 23 scripts 715 downloads

image.ContourDetector - Implementation of the Unsupervised Smooth Contour Line Detection for Images

An implementation of the Unsupervised Smooth Contour Detection algorithm for digital images as described in the paper: "Unsupervised Smooth Contour Detection" by Rafael Grompone von Gioi and Gregory Randall (2016). The algorithm is explained at <doi:10.5201/ipol.2016.175>.

Last updated 1 year ago

canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp

5.44 score 278 stars 7 scripts 193 downloads

image.libfacedetection - Convolutional Neural Network for Face Detection

An open source library for face detection in images. Provides a pretrained convolutional neural network based on <https://github.com/ShiqiYu/libfacedetection> which can be used to detect faces larger than 10x10 pixels.

Last updated 1 year ago

canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp, openmp

5.29 score 278 stars 14 scripts 191 downloads

image.LineSegmentDetector - Detect Line Segments in Images

An implementation of the Line Segment Detector on digital images described in the paper: "LSD: A Fast Line Segment Detector with a False Detection Control" by Rafael Grompone von Gioi et al. (2012). The algorithm is explained at <doi:10.5201/ipol.2012.gjmr-lsd>.

Last updated 1 year ago

canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp

5.14 score 278 stars 7 scripts 239 downloads

image.CornerDetectionF9 - Find Corners in Digital Images with FAST-9

An implementation of the "FAST-9" corner detection algorithm explained in the paper 'FASTER and better: A machine learning approach to corner detection' by Rosten E., Porter R. and Drummond T. (2008), available at <arXiv:0810.2434>. The package allows you to detect corners in digital images.

Last updated 1 year ago

canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp

5.14 score 278 stars 7 scripts 156 downloads

image.Otsu - Otsu's Image Segmentation Method

An implementation of Otsu's Image Segmentation Method described in the paper: "A C++ Implementation of Otsu's Image Segmentation Method". The algorithm is explained at <doi:10.5201/ipol.2016.158>.

Last updated 1 year ago

canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp

5.14 score 278 stars 189 downloads
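
For readers unfamiliar with Otsu's method named in the image.Otsu entry, the core step picks the grey-level threshold that maximizes the between-class variance of the two resulting pixel groups. The Python sketch below shows that step on a flat list of integer intensities; it is not the package's R API, and the function name `otsu_threshold` and the `levels` parameter are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing between-class variance.

    `pixels` is a flat list of integer intensities in [0, levels).
    Pixels with value <= t form one class, the rest the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]            # number of pixels in the lower class
        sum0 += t * hist[t]      # intensity sum of the lower class
        w1 = total - w0          # number of pixels in the upper class
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Two well-separated intensity groups: dark around 10-12, bright around 200-202.
pixels = [10] * 50 + [12] * 50 + [200] * 50 + [202] * 50
print(otsu_threshold(pixels))
```

Any threshold between the dark and the bright group separates them equally well here; the sketch returns the lowest such value because it only updates on a strict improvement.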

image.CornerDetectionHarris - Implementation of the Harris Corner Detection for Images

An implementation of the Harris Corner Detection as described in the paper "An Analysis and Implementation of the Harris Corner Detector" by Sánchez J. et al. (2018), available at <doi:10.5201/ipol.2018.229>. The package allows you to detect relevant points in images which are characteristic of the digital image.

Last updated 1 year ago

canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, cpp, openmp

5.14 score 278 stars 2 scripts 212 downloads

image.CannyEdges - Implementation of the Canny Edge Detector for Images

An implementation of the Canny Edge Detector for detecting edges in images. The package provides an interface to the algorithm available at <https://github.com/Neseb/canny>.

Last updated 1 year ago

canny-edge-detection, computer-vision, contours, darknet, dlib, f9, harris-corners, harris-interest-point-detector, hog-features, image-algorithms, image-recognition, openpano, otsu, surf, fftw3, cpp

5.14 score 277 stars 6 scripts 216 downloads

ETLUtils - Utility Functions to Execute Standard Extract/Transform/Load Operations (using Package 'ff') on Large Data

Provides functions to facilitate the use of the 'ff' package in interaction with big data in 'SQL' databases (e.g. in 'Oracle', 'MySQL', 'PostgreSQL', 'Hive') by allowing easy importing directly into 'ffdf' objects using 'DBI', 'RODBC' and 'RJDBC'. Also contains some basic utility functions to do fast left outer join merging based on 'match', factorisation of data and a basic function for re-coding vectors.

Last updated 5 years ago

4.72 score 20 stars 26 scripts 253 downloads

tokenizers.bpe - Byte Pair Encoding Text Tokenization

Unsupervised text tokenizer focused on computational efficiency. Wraps the 'YouTokenToMe' library <https://github.com/VKCOM/YouTokenToMe> which is an implementation of fast Byte Pair Encoding (BPE) <https://aclanthology.org/P16-1162/>.

Last updated 1 year ago

bpe, byte-pair-encoding, text-mining, tokenization, cpp

4.56 score 15 stars 48 scripts 248 downloads
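
As background for the tokenizers.bpe entry, Byte Pair Encoding learns a vocabulary by repeatedly merging the most frequent adjacent symbol pair. The Python sketch below shows that basic training loop on a toy word list; it is a plain textbook version, not the optimized 'YouTokenToMe' implementation or the package's R API, and the function name `learn_bpe` is hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)

        # Rewrite every word, replacing occurrences of the best pair by one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(words, num_merges=2)
print(merges)
```

On this toy corpus the frequent suffix shared by "newest" and "widest" is merged first, which is the behaviour that lets BPE represent rare words through common subword units.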

recogito - Interactive Annotation of Text and Images

Annotate text with entities and the relations between them. Annotate areas of interest in images with your labels. Providing 'htmlwidgets' bindings to the 'recogito' <https://github.com/recogito/recogito-js> and 'annotorious' <https://github.com/recogito/annotorious> libraries.

Last updated 2 years ago

4.25 score 21 stars 17 scripts 205 downloads

image.binarization - Binarize Images for Enhancing Optical Character Recognition

dlib - Allow Access to the 'Dlib' C++ Library

sentencepiece - Text Tokenization using Byte Pair Encoding and Unigram Modelling

image.textlinedetector - Segment Images in Text Lines and Words

nametagger - Named Entity Recognition in Texts using 'NameTag'

topicmodels.etm - Topic Modelling in Embedding Spaces

RMOA - Connect R with MOA for Massive Online Analysis

RMOAjars - External jars Required for Package RMOA