Applied NLP
Applied: we work with real language data
What is a natural language?
- A means of communication that has evolved naturally in humans through use and repetition, without conscious planning, e.g., English, Korean, etc.
What is not a natural language?
- Programming languages: Python
- Formal languages: first-order logic
Processing: how to program computers to analyze large amounts of natural language data.
- Natural language understanding ex) Amazon reviews -> a retailer can read the reviews to improve the product.
- Natural language generation ex) Google Translate.
- Speech understanding
Applications
- Keyword search ex) Google Search
- Spell checking ex) Grammarly
- Chatbot
- Machine Translation
- Dialogue Systems
What is the difference between Linguistics and NLP?
Linguistics | NLP |
Phonetics, Phonology | Speech recognition, synthesis |
Syntax | Part-of-speech tagging, parsing |
Lexical Semantics | Entity recognition |
Compositional Semantics | Role labeling, reference resolution, text classification |
Production | Language generation, summarization, machine translation |
Evolution of NLP
[Rule-based(1950) -> Statistical Learning(1990) -> Deep Learning(2010)]
1. Rule-based NLP
- ex) Who is the prime minister of India?
- Questions are matched against hand-written patterns keyed on words such as who/when/is/does. (The system returns a stored answer for a given question.)
- A hand-crafted system of rules that parses text and matches patterns is used to imitate the human way of building language structures.
- It can achieve high performance in specific use cases ex) Q&A, but often suffers performance degradation when generalized.
- Requires domain knowledge about the language -> answers can only be stored by someone with knowledge of that domain.
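A minimal sketch of such a rule-based Q&A system: hand-written regex patterns map question shapes to a stored-answer table. Both the patterns and the fact table below are invented for illustration.

```python
import re

# Hand-crafted rules: each pattern maps a question shape to a relation.
RULES = [
    (re.compile(r"who is the prime minister of (\w+)", re.I), "prime_minister"),
    (re.compile(r"when was (\w+) founded", re.I), "founding_year"),
]

# Hypothetical stored answers, keyed by (relation, entity).
FACTS = {
    ("prime_minister", "india"): "Narendra Modi",
    ("founding_year", "google"): "1998",
}

def answer(question: str) -> str:
    # Try each hand-written pattern; on a match, look up the stored answer.
    for pattern, relation in RULES:
        match = pattern.search(question)
        if match:
            entity = match.group(1).lower()
            return FACTS.get((relation, entity), "Unknown")
    return "Cannot parse question"

print(answer("Who is the prime minister of India?"))  # Narendra Modi
```

Questions that fit no pattern fail outright ("Cannot parse question"), which is exactly the generalization problem noted above.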
2. NLP based on statistical learning: learning from extracted features.
- Learning is performed based on probabilistic modeling, likelihood maximization, and training classifiers.
- Requires an annotated training dataset along with a suitable method of feature engineering, potentially based on domain knowledge about the language.
- A parametric model is trained, followed by evaluation on a test dataset similar to, yet different from, the training dataset.
- More generalizable than rule-based NLP and applicable to a broader range of applications like machine translation.
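A minimal sketch of the probabilistic-modeling idea: a Naive-Bayes-style text classifier trained by maximum likelihood (with add-one smoothing) on a toy annotated dataset. The training examples and labels are invented for illustration.

```python
import math
from collections import Counter

# Toy annotated training set; the "feature engineering" is just bag-of-words.
train = [
    ("great product loved it", "pos"),
    ("terrible quality broke fast", "neg"),
    ("loved the quality", "pos"),
    ("terrible product", "neg"),
]

# Estimate P(word | label) by maximum likelihood with add-one smoothing.
counts = {"pos": Counter(), "neg": Counter()}
for text, label in train:
    counts[label].update(text.split())
vocab = {w for c in counts.values() for w in c}

def log_likelihood(text: str, label: str) -> float:
    total = sum(counts[label].values()) + len(vocab)
    return sum(math.log((counts[label][w] + 1) / total) for w in text.split())

def classify(text: str) -> str:
    # Pick the label under which the text is most likely.
    return max(("pos", "neg"), key=lambda lab: log_likelihood(text, lab))

print(classify("loved this product"))  # pos
```

Note the ingredients listed above are all present: an annotated dataset, hand-chosen features (word counts), and a model fit by likelihood maximization.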
3. Deep learning: learning directly from large textual datasets.
- Similar to ML, with key differences:
- Feature engineering is automated, as deep networks can learn important features from text
-> domain knowledge about the language is minimized
- Raw text with minimal pre-processing is fed into the models
- It requires very large datasets to compensate for not using human domain knowledge
- Applicable to more challenging tasks ex) dialogue system
Why do we still use rule-based NLP and traditional ML:
- Still good for sequence labeling
- Some ideas in deep learning are extended versions of earlier methods
- Can help to improve deep learning-based algorithms
NLP pipeline
- Data Acquisition
- Text Cleaning
- Pre-Processing
- Features Extraction
- Machine Learning
- Evaluation
- Deployment
- Monitoring
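The first few pipeline stages (cleaning, pre-processing, feature extraction) can be sketched in plain Python; the function names here are illustrative, not from any library.

```python
import re

def clean(text: str) -> str:
    # Text cleaning: drop unhelpful content such as external URL links.
    return re.sub(r"https?://\S+", "", text)

def preprocess(text: str) -> list[str]:
    # Pre-processing: lowercase, strip non-alphabetical characters, tokenize.
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return text.split()

def extract_features(tokens: list[str]) -> dict[str, int]:
    # Feature extraction: bag-of-words counts.
    feats = {}
    for tok in tokens:
        feats[tok] = feats.get(tok, 0) + 1
    return feats

raw = "Great phone! See http://example.com for details."
print(extract_features(preprocess(clean(raw))))
```

The later stages (training, evaluation, deployment, monitoring) consume these feature dictionaries rather than raw text.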
Ambiguity in Natural Language
Ambiguity
- The shortstop caught the fly (lexical ambiguity: "fly" can mean the insect or the fly ball)
- Flying planes can be dangerous (structural/syntactic ambiguity)
The sentence above has two readings: 1) planes that are flying can be dangerous (the planes themselves are dangerous), or 2) the act of flying planes can be dangerous. Because both interpretations exist, the sentence is structurally ambiguous.
Local ambiguity VS global ambiguity : Context
- Fat people eat accumulates.
- The man who hunts ducks out on weekends.
Psycholinguistics: the study of human sentence processing
Preprocessing NLP
- Two words or one word? ex) hot dog, New York; hand-writing -> handwriting; He is -> He's
Tokenization : splitting the text into units for processing
- anywhere on the scale from character to word
- removing extra spaces
- removing stop words
- removing unhelpful words ex) external URL links
- removing unhelpful characters ex) non-alphabetical characters
- ex) "Tokenization is the first step in NLP" (7 tokens)
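A minimal word-level tokenization of the example sentence, using Python's built-in whitespace split (a real tokenizer would also handle punctuation, stop words, and the other cleanup steps listed above):

```python
sentence = "Tokenization is the first step in NLP"

# Word-level: the simplest scheme is to split on whitespace.
tokens = sentence.split()
print(tokens, len(tokens))  # 7 tokens

# Character-level tokenization sits at the other end of the scale.
chars = list(sentence)
print(len(chars))
```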
Stemming: a wordform stripped of some characters (stem extraction)
- ex) tokenization -> tokeniz
Lemmatization: the base (or citation) form of a word (this also requires determining which part of speech (POS) the word has in the sentence)
- ex) Tokenization, tokenize -> token
- Went, gone, goes -> go
* The difference between stemming and lemmatization: given the word "flies", stemming only strips it down to a stem, while lemmatization decides from the sentence whether "flies" is the verb "to fly" or the noun "fly" (the insect).
Word | Stemming | Lemmatization |
am | am | be |
the going | the go | the going |
having | hav | have |
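The contrast can be illustrated with a toy suffix-stripping stemmer versus a tiny dictionary-based lemmatizer. Both are hand-made stand-ins for illustration; a real system would use something like NLTK's PorterStemmer and WordNetLemmatizer.

```python
def toy_stem(word: str) -> str:
    # Stemming strips suffixes blindly, without consulting context or a dictionary.
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# A real lemmatizer consults a dictionary plus the word's part of speech;
# here a tiny hand-made lookup table stands in.
LEMMAS = {("flies", "VERB"): "fly", ("flies", "NOUN"): "fly",
          ("went", "VERB"): "go", ("having", "VERB"): "have"}

def toy_lemmatize(word: str, pos: str) -> str:
    return LEMMAS.get((word, pos), word)

print(toy_stem("flies"))               # "fl"  -- blind suffix stripping
print(toy_stem("having"))              # "hav" -- matches the table above
print(toy_lemmatize("flies", "VERB"))  # "fly"
```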
Machine Learning
- Preprocessing + feature extraction converts input text into feature vectors.
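As a sketch, bag-of-words feature extraction over a small illustrative vocabulary turns a token list into a fixed-length vector a classifier can consume (the vocabulary here is invented):

```python
# Fixed vocabulary; each position of the feature vector counts one word.
vocab = ["good", "bad", "movie", "great"]

def to_vector(tokens: list[str]) -> list[int]:
    return [tokens.count(word) for word in vocab]

print(to_vector("great great movie".split()))  # [0, 0, 1, 2]
```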