Lợi V. Nguyễn

CI & CD with Docker and Nginx

Essential steps in deploying

1. Build/Compile the app

2. Take the contents of the dist directory and add them to our Nginx docker image.

3. Serve the application from Nginx

Angular Build Changes

Configuration in package.json:

"scripts": {
    "ng": "ng",
    "start": "ng serve",
    "build": "ng build --aot --buildOptimizer=true --prod --progress",
    "test": "ng test  --watch=false",
    "lint": "ng lint",
    "e2e": "ng e2e"
  }

NGINX config and Docker

# service nginx restart
* Restarting nginx
# ps -ef --forest | grep nginx
root     32475     1  0 13:36 ?        00:00:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx    32476 32475  0 13:36 ?        00:00:00  \_ nginx: worker process
nginx    32477 32475  0 13:36 ?        00:00:00  \_ nginx: worker process
nginx    32479 32475  0 13:36 ?        00:00:00  \_ nginx: worker process
nginx    32480 32475  0 13:36 ?        00:00:00  \_ nginx: worker process
nginx    32481 32475  0 13:36 ?        00:00:00  \_ nginx: cache manager process
nginx    32482 32475  0 13:36 ?        00:00:00  \_ nginx: cache loader process

NGINX config:

worker_processes  auto;

events {
    worker_connections  8192;
}


http {
  # formatting log to json
  log_format json_combined escape=json
  '{'
    '"time_local":"$time_local",'
    '"remote_addr":"$remote_addr",'
    '"remote_user":"$remote_user",'
    '"request":"$request",'
    '"status": "$status",'
    '"body_bytes_sent":"$body_bytes_sent",'
    '"request_time":"$request_time",'
    '"http_referrer":"$http_referer",'
    '"http_user_agent":"$http_user_agent"'
  '}';
    server {
        listen 80;
        server_name  localhost;
        # send logs to the console (standard out) in the predefined JSON format
        access_log /dev/stdout json_combined;
        # error_log accepts only a file and a level, not a log_format name
        error_log /dev/stdout info;
        root   /usr/share/nginx/html;
        index  index.html index.htm;
        include /etc/nginx/mime.types;
        # compression
        gzip on;
        gzip_min_length 1000;
        gzip_proxied expired no-cache no-store private auth;
        gzip_types text/plain text/css application/json application/javascript application/x-javascript text/xml application/xml application/xml+rss text/javascript;

        # angular index.html location
        location / {
            try_files $uri $uri/ /index.html;
        }
        # potential reverse proxy for sending api calls
        location /reverse-proxy/ {
                proxy_set_header Host $host;
                proxy_set_header Content-Type application/json;
                proxy_set_header X-Real-IP $remote_addr;
                # proxy_pass http://pointer-api:8080/;
          }
    }
}

Dockerfile config:

	FROM nginx:1.17.8-alpine
	COPY nginx/nginx.conf /etc/nginx/nginx.conf
	WORKDIR /usr/share/nginx/html/
	COPY dist/heroes .

Digital Ocean - Docker VM

# Create User

$ adduser deployer
$ usermod -aG sudo deployer
$ su - deployer
$ sudo usermod -aG docker $USER
# log out and back in (or run: newgrp docker) so the new group membership applies
$ docker ps

GitLab CI/CD

1. Configuration

Before going into .gitlab-ci.yml, a few variables must be defined under /settings/ci_cd:

CI_REGISTRY_USER: GitLab user
DOCKER_CI_TOKEN: GitLab token used by docker login instead of a password, as in the command docker login -u $CI_REGISTRY_USER -p $DOCKER_CI_TOKEN registry.gitlab.com. Tokens are generated from here. Make sure to have api and read_registry checked.
IMAGE_NAME: Docker image name. It needs to follow the convention registry.gitlab.com/$GROUP_NAME/$PROJECT_NAME:tag, so our image name will be registry.gitlab.com/blogging4t/angular-ci-cd
SERVER_IP_ADDRESS and SERVER_USER: the server's IP address and the user we created for it in the Digital Ocean section

Lastly, we must generate a custom RSA key to be able to ssh from GitLab to our server:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/gitlab
cd ~/.ssh
cat gitlab
Copy and paste the private key into your PRIVATE_KEY variable on GitLab.

SSH into the server as root and execute the following commands:

cp -r ~/.ssh/ /home/deployer
cd /home/deployer/.ssh
chown deployer:deployer *
chmod 644 authorized_keys
# edit your authorized_keys file and add the contents of gitlab.pub

PIPELINE:

# Docker image: we need an image with Chrome installed so that unit tests can run in a virtualized environment such as a Docker container
image: "registry.gitlab.com/gitlab-org/gitlab-build-images:ruby-2.6.5-golang-1.12-git-2.24-lfs-2.9-chrome-73.0-node-12.x-yarn-1.21-postgresql-9.6-graphicsmagick-1.3.34"

#the majority of our builds require npm dependencies
before_script:
  - npm install

# three stages: build -> unit tests and compilation, release -> docker image creation, deploy -> start the created image on a remote server
stages:
  - build
  - release
  - deploy

build:
  stage: build # jobs that share a stage run in parallel, which saves time
  script:
    - npm run-script build # makes sure we can still build our code
  artifacts: # keep the compiled output so the release stage can build a docker image from it
    paths:
      - dist/
tests:
  stage: build # same stage as the build job, so both run in parallel
  script:
    - npm test # runs unit tests

image-creation:
  image: docker:git # image with docker installed to execute docker commands
  stage: release # notice a new stage
  services:
    - docker:dind #used to be able to execute docker commands inside of a docker container
  before_script:
    - docker ps # overrides the global before_script (npm install), which this job does not need
  script:
    # Log in to the registry, then build and push the image
    - docker login -u $CI_REGISTRY_USER -p $DOCKER_CI_TOKEN registry.gitlab.com # logs in to the GitLab docker registry; make sure these variables are defined
    - docker build -t $IMAGE_NAME . # creates a docker image
    - docker push $IMAGE_NAME:latest # pushes the created docker image to the docker registry
  dependencies:
    - build #dependent on build to get the dist directory


deploy:
  image: ubuntu
  before_script: #checks if ssh installed and if not, attempts to install it
    - "which ssh-agent || ( apt-get update -y && apt-get install openssh-client git -y )"
    - eval $(ssh-agent -s)

    # Inject the remote's private key
    - echo "$PRIVATE_KEY" | tr -d '\r' | ssh-add - > /dev/null # adds the SSH private key from the CI variables; it pairs with the public key registered on the Digital Ocean server
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    # Append keyscan output into known hosts
    - ssh-keyscan $SERVER_IP_ADDRESS >> ~/.ssh/known_hosts
    - chmod 644 ~/.ssh/known_hosts
  stage: deploy #new stage after release
  script:
    # Non interactive ssh gracefully reloads server
    - ssh $SERVER_USER@$SERVER_IP_ADDRESS  ls
    - ssh $SERVER_USER@$SERVER_IP_ADDRESS "docker login -u ${CI_REGISTRY_USER} -p ${DOCKER_CI_TOKEN} registry.gitlab.com;
     docker stop angular-ci;
     docker rm angular-ci;
     docker rmi \$(docker images -aq);
     docker pull ${IMAGE_NAME};
     docker run --name angular-ci -d -p 80:80 ${IMAGE_NAME}"

  only:
    # Trigger deployments only from master branch
    - master
  dependencies:
   - image-creation #dependency on creating a docker image with latest changes


MapReduce Hadoop

I. Introduction to MapReduce and Hadoop

A defining characteristic of distributed systems is the large volume of data they are expected to process. Many kinds of data (video, images, audio, and so on) are generated continuously every day and uploaded to social networks. Amazon and eBay routinely process huge volumes of data from all over the world. The search engines Google, Yahoo and Bing routinely collect large amounts of data in order to answer user queries from around the world. As a result, massively parallel data-processing tools emerged and became indispensable for fast responses. To handle data at this scale, Google invented MapReduce around 2004-2007, and Yahoo later brought it into the open-source Hadoop project. In essence, Hadoop can be thought of as an operating system that lets MapReduce programs run on clusters of computers.

II. How It Works

1. MapReduce

MapReduce algorithms efficiently exploit the parallelism that is inherent in many big-data or large-scale problems. They are built on two basic operations, the Map function and the Reduce function. The Map function applies a given operation to every element of a list, and these applications run in parallel, with each process or thread assigned one such task. The Reduce function collects the output of the Map function and aggregates it into the final result.
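The two operations can be sketched in plain Python rather than with the Hadoop API. This is only an illustration: the function names map_words, reduce_counts and word_count are made up for the example, and a thread pool stands in for the parallel map tasks.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line):
    """Map: turn one line of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce: aggregate all values emitted for one key."""
    return word, sum(counts)


def word_count(lines):
    # Map phase: each line is submitted to the thread pool as its own task.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_words, lines))

    # Group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)

    # Reduce phase: one call per distinct key.
    return dict(reduce_counts(w, c) for w, c in grouped.items())


result = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(result["the"])
```

The grouping step in the middle is what a real framework performs between the two phases; here it is a simple dictionary.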

2. Hadoop

Hadoop is an open-source project that provides the infrastructure MapReduce needs. It uses a master-slave architecture. A master node (the NameNode) receives a job from the user; the input data is split into many chunks and stored in the Hadoop Distributed File System (HDFS). Physically, the data is scattered across a large number of nodes in the system. The master node assigns parts of the job to the slave nodes (DataNodes). Each slave provides local storage and the computing power to solve its part of the problem. When the slave nodes finish their work, they return their results to the master. The master then runs reduce to combine the chunks and produce the result. If necessary, a slave can even hand further chunks to other slaves.
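The split, assign and combine flow described above can be imitated on a single machine. The sketch below is a rough analogy, not Hadoop code: split_into_chunks, worker_task and run_job are hypothetical names, threads play the role of slave nodes, and the job (a sum of squares) is chosen only because it is easy to check.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(data, num_chunks):
    """Master: split the input into roughly equal chunks."""
    size = max(1, -(-len(data) // num_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker_task(chunk):
    """Worker: process only the chunk it was given."""
    return sum(x * x for x in chunk)


def run_job(data, num_workers=4):
    chunks = split_into_chunks(data, num_workers)
    # The master hands the chunks to the workers and waits for the results.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker_task, chunks))
    # The master combines the partial results into the final answer.
    return reduce(lambda a, b: a + b, partial_results, 0)


total = run_job(list(range(1, 11)))
print(total)
```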

III. MapReduce Types

The map and reduce functions have the following general form:

map: (K1, V1) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

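In general, the map input key and value types (K1, V1) differ from the map output types (K2, V2). The reduce input, however, must have the same types as the map output, although the reduce output types (K3, V3) may be different again.

If a combiner function is used, it has the same form as the reduce function, except that its output types are the intermediate key and value types (K2, V2), so that its output can feed the reduce function: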

map: (K1, V1) → list(K2, V2)

combiner: (K2, list(V2)) → list(K2, V2)

reduce: (K2, list(V2)) → list(K3, V3)

The partition function operates on the intermediate key and value types (K2, V2) and returns the index of the partition the record belongs to.

partition: (K2, V2) → integer
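The four signatures can be written out with Python type hints to show how the types line up. This is a sketch under assumptions: the job (maximum temperature per year), the "year,temperature" input format, the constant NUM_PARTITIONS and all function names are invented for illustration, and the line index stands in for K1.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

NUM_PARTITIONS = 2  # assumed number of reduce tasks


def map_fn(offset: int, line: str) -> List[Tuple[int, int]]:
    """map: (K1, V1) -> list(K2, V2); here (offset, line) -> [(year, temp)]."""
    year, temp = line.split(",")
    return [(int(year), int(temp))]


def combine_fn(year: int, temps: Iterable[int]) -> List[Tuple[int, int]]:
    """combiner: (K2, list(V2)) -> list(K2, V2); local max per map task."""
    return [(year, max(temps))]


def partition_fn(year: int, temp: int) -> int:
    """partition: (K2, V2) -> integer; picks the reduce task for a key."""
    return year % NUM_PARTITIONS


def reduce_fn(year: int, temps: Iterable[int]) -> List[Tuple[str, int]]:
    """reduce: (K2, list(V2)) -> list(K3, V3); here K3 is a string label."""
    return [(f"max-{year}", max(temps))]


def group(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run(splits: List[List[str]]) -> Dict[str, int]:
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for split in splits:  # one map task per input split
        mapped = [kv for off, line in enumerate(split) for kv in map_fn(off, line)]
        for year, temps in group(mapped).items():
            for k2, v2 in combine_fn(year, temps):
                partitions[partition_fn(k2, v2)].append((k2, v2))
    output = {}
    for part in partitions:  # one reduce task per partition
        for year, temps in sorted(group(part).items()):
            output.update(reduce_fn(year, temps))
    return output


result = run([["1950,0", "1950,22", "1949,111"], ["1950,-11", "1949,78"]])
print(result)
```

Because the combiner emits (K2, V2) pairs, the reduce function receives the same types whether or not the combiner runs.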

IV. MapReduce Features

1. Counter

There are often things you would like to know about the data you are analyzing that are peripheral to the analysis you are performing. For example, if you are counting invalid records and discover that the proportion of invalid records in the whole dataset is very high, you might be prompted to check why so many records are being marked as invalid: perhaps there is a bug in the part of the program that detects invalid records?

Or, if the data is of poor quality and really does contain a lot of invalid records, you might decide, after discovering this, to increase the size of the dataset so that the number of good records is large enough for meaningful analysis.

Counters are a useful channel for gathering statistics about a job, whether for quality control or for application-level statistics.
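The invalid-record example can be sketched in plain Python. This is only an analogy: a collections.Counter stands in for the job-wide counters, and the record format, the counter names and map_record are assumptions made for the example.

```python
from collections import Counter

counters = Counter()  # stands in for the job-wide counters


def map_record(line):
    """Emit (station, temperature) and count records that fail to parse."""
    parts = line.split(",")
    try:
        station, temp = parts[0], int(parts[1])
    except (IndexError, ValueError):
        counters["INVALID_RECORDS"] += 1
        return []
    counters["VALID_RECORDS"] += 1
    return [(station, temp)]


records = ["s1,21", "s2,abc", "s1,19", "broken", "s3,25"]
output = [pair for line in records for pair in map_record(line)]

# A high ratio is the signal to inspect the parser or the data quality.
invalid_ratio = counters["INVALID_RECORDS"] / len(records)
print(counters, invalid_ratio)
```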

2. Sorting

The ability to sort data is at the heart of MapReduce. Even if your application is not concerned with sorting as such, it may be able to use the sorting stage that MapReduce provides to organize its data. Sorting Avro data is covered separately, in Sorting Using Avro MapReduce.

- Partial sort

- Total sort
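The difference between the two kinds of sort comes down to how keys are assigned to partitions. The sketch below illustrates this in plain Python under assumptions: hash_partition, range_partition, sort_partitions and the boundary values 20 and 50 are hypothetical, and each inner list stands for the output file of one reduce task.

```python
def hash_partition(key, num_partitions):
    """Default-style partitioner: spreads keys without regard to order."""
    return key % num_partitions


def range_partition(key, boundaries):
    """Range partitioner: partition i only holds keys below boundaries[i]."""
    for index, upper in enumerate(boundaries):
        if key < upper:
            return index
    return len(boundaries)


def sort_partitions(keys, num_partitions, partitioner):
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partitioner(key)].append(key)
    # Each reduce task sorts only its own partition.
    return [sorted(p) for p in partitions]


keys = [42, 7, 19, 3, 88, 61, 25, 14]

# Partial sort: every output file is sorted, but the key ranges overlap.
partial = sort_partitions(keys, 2, lambda k: hash_partition(k, 2))

# Total sort: ranges do not overlap, so concatenating the files is sorted.
total = sort_partitions(keys, 3, lambda k: range_partition(k, [20, 50]))
flattened = [k for part in total for k in part]
print(partial, flattened)
```

Choosing boundaries that split the keys evenly is the hard part in practice; the fixed values here sidestep it.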

3. Joins

MapReduce can join large datasets, but writing the code from scratch is fairly involved. Instead, we can use a higher-level framework such as Pig, Hive, Cascading, Crunch, or Spark.

- Map-side join

- Reduce-side join
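A reduce-side join can be sketched as follows, assuming two small made-up datasets (users and orders) keyed by a user id. Each mapper tags its values with their source so that the reducer, which sees all values for one key together, can pair them up. All names are illustrative, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

users = [(1, "An"), (2, "Binh"), (3, "Chi")]       # (user_id, name)
orders = [(1, "book"), (3, "pen"), (1, "lamp")]    # (user_id, item)


def map_users(record):
    user_id, name = record
    return user_id, ("U", name)  # tag the value with its source


def map_orders(record):
    user_id, item = record
    return user_id, ("O", item)


def reduce_join(user_id, tagged_values):
    """All values for one key arrive together; pair names with items."""
    names = [v for tag, v in tagged_values if tag == "U"]
    items = [v for tag, v in tagged_values if tag == "O"]
    return [(user_id, name, item) for name in names for item in items]


def reduce_side_join(users, orders):
    grouped = defaultdict(list)  # stands in for the shuffle
    for key, value in map(map_users, users):
        grouped[key].append(value)
    for key, value in map(map_orders, orders):
        grouped[key].append(value)
    joined = []
    for user_id in sorted(grouped):
        joined.extend(reduce_join(user_id, grouped[user_id]))
    return joined


joined = reduce_side_join(users, orders)
print(joined)
```

This behaves like an inner join: user 2 has no orders and therefore produces no output rows.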

4. Side data distribution

Side data is extra read-only data that a job needs in order to process the main dataset. The challenge is to make it available to all the map or reduce tasks in a convenient and efficient way.
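As a small illustration, the sketch below passes a read-only lookup table to every map call. It shows only the idea and does not reproduce how Hadoop ships side data to tasks; STATION_NAMES, its contents and map_with_side_data are assumptions made for the example.

```python
from types import MappingProxyType

# Side data: a small read-only lookup table every map task needs.
STATION_NAMES = MappingProxyType({"s1": "Hanoi", "s2": "Hue"})


def map_with_side_data(record, side_data):
    """Enrich each (station_id, temp) record using the side data."""
    station_id, temp = record
    name = side_data.get(station_id, "UNKNOWN")
    return name, temp


records = [("s1", 30), ("s2", 27), ("s9", 12)]
enriched = [map_with_side_data(r, STATION_NAMES) for r in records]
print(enriched)
```

MappingProxyType makes the table read-only, which mirrors the requirement that tasks consult side data without modifying it.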

V. Example

Implementing a partial sort on Hadoop in Java

1. Prepare

2. Partial Sort
