0.0.5 • Published 5 years ago

@s2s/language-service-client v0.0.5

Weekly downloads
5
License
MIT
Repository
github
Last release
5 years ago

Docker Microservice for Language Detection and Text Extraction

The service

The service is based on CLD2Owners's Compact Language Detector 2, using commoncrawl's java wrapper, for language detection, and on Apache Tika for text extraction.

It's a docker container hosted at scotti2scotti/language-service, to pull:

docker pull scotti2scotti/language-service

It's a web service listening on 0.0.0.0:5656 by default, you may configure the host and port via two ENV variables: LS_HOST and LS_PORT:

docker run -d -e LS_HOST=${LS_HOST} -e LS_PORT=${LS_PORT} -p 5656:${LS_PORT} --rm --name language-service language-service

The web service has two POST endpoints:

POST /?url=
POST /stream

The first accepts the url of the resource you want to analyze as the value of the query string key url.

The second accepts the content of the resource as the body of the request.

Both endpoints return a json object of the following form, with some ... of truncated values for brevity:

POST /?url=https://scotti2scotti.com/about
{
  "text": " scotti&scotti home notes Web Applications NLP DevOps Web Applications are at the core of scotti&scotti [...] ",
  "language": {
    "language": 0,
    "languages": ["ENGLISH"],
    "reliable": true,
    "languageName": "ENGLISH",
    "languageCode": "en",
    "languageCodeISO639_3": "eng",
    "languageCodes": ["en"]
  },
  "meta": {
    "og:image": "https://scotti2scotti.com/images/s2s-logo-300.png",
    "X-Parsed-By": "org.apache.tika.parser.DefaultParser",
    "og:type": "website",
    "keywords": "web applications, natural language processing, software, database",
    "og:title": "scotti&scotti software house: the craft of knowledge",
    "description": "scotti&scotti llc is the software house of web applications [...] ",
    "title": "scotti&scotti software house: the craft of knowledge",
    "og:description": "scotti&scotti llc is the software house of web [...]",
    "X-UA-Compatible": "ie=edge",
    "viewport": "width=device-width, initial-scale=1.0",
    "dc:title": "scotti&scotti software house: the craft of knowledge",
    "Content-Encoding": "UTF-8",
    "og:url": "https://scotti2scotti.com",
    "Content-Language": "en",
    "Content-Type": "text/html; charset=UTF-8",
    "format-detection": "telephone=no"
  }
}

where:

  • text is the extracted text
  • language is an object whose keys are defined by CLD2Owners
  • meta are the meta tags extracted by tika

The client

A very simple javascript client written in typescript is available as an npm package:

npm i -S @s2s/language-service-client

It defines a few interfaces and a class:

export interface LanguageInfo {
  text: string;
  language: {
    language: number;
    languages: string[];
    languageName: string;
    languageCode: string;
    reliable: boolean;
    languageCodeISO639_3: string;
    languageCodes: string[];
  };
  meta: {
    [key: string]: string;
  };
}
export interface ClientOptions {
  protocol: "http" | "https";
  host: string;
  port: number;
}
export declare const DefautlOptions: ClientOptions;
export declare class Client {
  constructor(options?: ClientOptions);
  readonly url: (url: string) => Promise<LanguageInfo>;
  readonly stream: (stream: Stream) => Promise<LanguageInfo>;
}

License

Source code and docker image are Apache 2.0 licensed. Also CLD2, weslang, commoncrawl and all dependencies use the Apache 2.0 license.