ETC - 2022. 3. 17.

Crawling with Node.js (A Quick Taste)

Hello, this is TriplexLab.
Today we'll take a look at web crawling with Node.js.

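Strictly speaking, crawling is what search engines such as Google and Naver do when they visit a site and collect its pages.
Fetching the specific data you want from a specific site is called web scraping.
In practice the two terms are often used interchangeably.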

👉Goal

In this post we will use Node.js to scrape data from the Inflearn site and store it in MongoDB.

👉Installing the modules

yarn add axios cheerio dotenv mongoose

Next, load all the packages we are going to use.

// app.js
const axios = require('axios');
const cheerio = require('cheerio');
const inflearn = require('./models/inflearn');
require('dotenv').config();

Then write the code that sets up the MongoDB connection.

// app.js
// Connect to MongoDB
const mongoose = require('mongoose');
mongoose.Promise = global.Promise;

const db = mongoose.connection;
db.on('error', console.error);
db.once('open', () => {
    console.log('mongodb connect');
});

The entry "MONGO_URI=mongodb://localhost:27017/" in the .env file is read from process.env,
and the then/catch handlers report whether the connection succeeded or failed.

// app.js
const { MONGO_URI } = process.env;
mongoose
.connect(MONGO_URI, { useNewUrlParser: true, useUnifiedTopology: true })
.then(() => console.log('Successfully connected to mongodb'))
.catch(e => console.error(e));
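
If MONGO_URI is missing from .env, mongoose.connect receives undefined and the problem only shows up later in the catch handler. A minimal sketch of a fail-fast check follows. The helper name resolveMongoUri is hypothetical and not part of the original code, and the example passes a stand-in object instead of process.env so it runs without a .env file.

```javascript
// Hypothetical helper (not in the original repo): validate the connection
// string before handing it to mongoose.connect().
function resolveMongoUri(env) {
  const uri = typeof env.MONGO_URI === 'string' ? env.MONGO_URI.trim() : '';
  if (uri === '') {
    throw new Error('MONGO_URI is not set; add it to your .env file');
  }
  // MongoDB connection strings start with one of these two schemes.
  if (!/^mongodb(\+srv)?:\/\//.test(uri)) {
    throw new Error('MONGO_URI must start with mongodb:// or mongodb+srv://');
  }
  return uri;
}

// Stand-in environment object; in app.js this would be process.env.
const mongoUri = resolveMongoUri({ MONGO_URI: 'mongodb://localhost:27017/' });
console.log(mongoUri);
```

In app.js the returned value could be passed to mongoose.connect in place of the raw MONGO_URI.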

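👉Identifying the crawl target

Now let's write a function that fetches the full HTML of a given page.
It takes two parameters: the first is the URL and the second is a keyword.
If the second argument is omitted, the function simply requests the URL given as the first argument.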

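// Fetch the page at `url`; when a keyword is given, append it URL-encoded.
const getHTHL = async (url, keyword) => {
	try {
		// encodeURIComponent also escapes characters such as & and =
		const target = keyword ? `${url}${encodeURIComponent(keyword)}` : url;
		return await axios.get(target);
	} catch (err) {
		// Log and rethrow so callers do not continue with an undefined response
		console.error(err);
		throw err;
	}
};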
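
To see which URL ends up being requested, here is a small sketch of only the URL-building step, without the network call. The helper name buildRequestUrl and the example.com base URL are assumptions for illustration. The sketch uses encodeURIComponent, which escapes characters such as & and + that encodeURI leaves untouched.

```javascript
// Hypothetical helper: builds the URL that would be passed to axios.get().
// With no keyword, the base URL is returned unchanged.
function buildRequestUrl(url, keyword) {
  if (!keyword) return url;
  // Escaping reserved characters keeps the keyword from being read as
  // extra query parameters.
  return `${url}${encodeURIComponent(keyword)}`;
}

// example.com is a placeholder, not the site used in this post.
const base = 'https://example.com/courses?s=';
console.log(buildRequestUrl(base));            // base URL as-is
console.log(buildRequestUrl(base, 'node js')); // space escaped as %20
console.log(buildRequestUrl(base, 'c&c++'));   // & and + escaped
console.log(encodeURI('c&c++'));               // encodeURI leaves & and + alone
```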

👉Parsing the data with a jQuery-like API

Calling await getHTHL(url, keyword); returns the full HTML of the page.
Passing the response body to cheerio.load(html.data); then lets you select and traverse the DOM the same way you would with jQuery. 👇👇

const inflearn = require('./models/inflearn');

...
...

const parsing = async (url, keyword) => {
	const html = await getHTHL(url, keyword);
	const $ = cheerio.load(html.data);
	const $coursList = $('#courses_section > div > div > div > main > div.courses_container > div > div');

	$coursList.each((idx, el) => {
		...
        
		const dataCheckout = new inflearn({
			title,
			instructor,
			price: [{del:price_del, pay:price_pay}],
			rating,
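			img
		});
		// save() returns a promise; log failures instead of leaving the rejection unhandled
		dataCheckout.save().catch(console.error);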
	});
}

The new inflearn(...) call creates an instance of a model whose schema is defined with mongoose.
The instance is held in the dataCheckout variable and then saved to the database.
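
Because save() is asynchronous, calling it inside .each() starts every write but does not wait for any of them to finish. One way to wait for all writes is to collect the promises and pass them to Promise.allSettled. The sketch below uses a stand-in class, FakeCourse, instead of the real mongoose model so that it runs without a database. The class, its "title is required" rule, and the sample titles are all hypothetical.

```javascript
// Stand-in for the mongoose model: save() resolves after a short delay,
// or rejects when the title is missing (an arbitrary rule for this demo).
class FakeCourse {
  constructor(fields) {
    this.fields = fields;
  }
  save() {
    return new Promise((resolve, reject) => {
      setTimeout(() => {
        if (this.fields.title) resolve(this.fields);
        else reject(new Error('title is required'));
      }, 10);
    });
  }
}

// Start every save, then wait until each one has succeeded or failed.
async function saveAll(items) {
  const pending = items.map((fields) => new FakeCourse(fields).save());
  const results = await Promise.allSettled(pending);
  const saved = results.filter((r) => r.status === 'fulfilled').length;
  return { saved, failed: results.length - saved };
}

saveAll([{ title: 'Course A' }, { title: 'Course B' }, { title: '' }])
  .then((summary) => console.log(summary));
```

With the real model, the same pattern would apply to the promises returned by each dataCheckout.save() call, so the script knows when it is safe to close the connection.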

The schema is defined with mongoose, which also checks the field types.
(I don't think this is strictly necessary for a scraper, but I did it here for practice.)

// models / inflearn.js
const mongoose = require('mongoose');
const Schema = mongoose.Schema;

// Define the fields each document will have.
const inflearnSchema = new Schema({
	title: String,
	instructor: String,
	price: [{del:String, pay:String}],
	rating: String,
	img: String
},
{
  timestamps: true
});

// Create Model & Export
module.exports = mongoose.model('inflearn', inflearnSchema);
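
To make the schema concrete, the sketch below builds a plain object in the shape it describes, with price as an array holding one { del, pay } pair. The helper name toCourseDoc, the input field names, and all sample values are made up for illustration. It also assumes that del is the struck-through original price and pay is the price actually charged, which the post does not state explicitly.

```javascript
// Hypothetical helper: normalizes scraped strings into the schema's shape.
function toCourseDoc(raw) {
  const clean = (value) => String(value ?? '').trim();
  return {
    title: clean(raw.title),
    instructor: clean(raw.instructor),
    // The schema stores price as an array of { del, pay } sub-documents.
    price: [{ del: clean(raw.priceDel), pay: clean(raw.pricePay) }],
    rating: clean(raw.rating),
    img: clean(raw.img),
  };
}

// Sample values are invented, not real scraped data.
const doc = toCourseDoc({
  title: '  Sample course ',
  instructor: 'Sample instructor',
  priceDel: '10,000',
  pricePay: '8,000',
  rating: '4.5',
  img: 'https://example.com/thumb.png',
});
console.log(doc);
```

An object like this could be passed to new inflearn(...) as-is, since every field is already a string.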

👉Checking the result

👉Source code

The full code is available on GitHub. 👇👇😃👍🔥

GitHub - younhoso/Node-web-scraping


The same code is also attached as a file. 👇👇

Node-web-scraping.zip

0.01MB
