Categories
Nodejs

How to Scrape Websites with Node.js

Spread the love

Web scraping is a technique where you extract data from a website by parsing the HTML. This is done by loading the website’s content into a DOM tree, parsing the elements, extracting data from the desired elements, and storing the results in a database.

In this article, we will make a web scraper that fetches listings from a deals website called Red Flag Deals and stores the data in a database. We will have a backend that manages a queue of scrape requests that we will call “jobs”. We provide the ability add, view, and delete jobs. Then we create a frontend to let us view the data we scraped. The web scraper will be a Node.js script. The backend will be an Express app with Sequelize as the ORM. We will use SQLite as our database to keep things simple for this demonstration app. The frontend will be written with React.

Web Scraper Backend

We will start with the backend. Create a project folder and within it, create a backend folder for our backend code. Next, we run Express Generator by executing npx express-generator. This will create the files we need for the base Express app. Now we install the packages for our backend app by running npm install.

Once the packages are installed, we will add Babel to the app so that we can use features of JavaScript that aren’t available in Node.js such as import.

Run npm i @babel/cli @babel/core @babel/node @babel/preset-env. Next create a file called .babelrc in the backend folder and add the following:

{
    "presets": [
        "@babel/preset-env"
    ]
}

In package.json, we add the following to the scripts section:

"start": "nodemon --exec npm run babel-node --  ./bin/www",
"babel-node": "babel-node"

We start our app with babel-node runtime instead of the normal Node runtime so that we can use Babel. Note that we also need nodemon so that the app will restart when files are changed. To install it, run npm i -g nodemon.

Next we need to install our own libraries. We will use the Bull package for creating a job queue for scraping data, CORS for allowing requests from the frontend, Request and Request Promise for fetching the website text, Sequelize for the ORM, and SQLite3 for storing data. We install the packages by running npm i cors bull cheerio request request-promise sequelize sqlite3.

We also need Redis for our Bull job queue to work. The latest version only runs on Linux so we have to use it. In Ubuntu, we run:

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install redis-server
redis-server

This installs and runs Redis. Detailed instructions are at https://tecadmin.net/install-redis-ubuntu/.

Next we add the code for Sequelize to our backend. Run npx sequelize-cli init. In the config.json, replace the code with the following:

{
  "development": {
    "dialect": "sqlite",
    "storage": "development.db"
  },
  "test": {
    "dialect": "sqlite",
    "storage": "test.db"
  },
  "production": {
    "dialect": "sqlite",
    "storage": "production.db"
  }
}

Then we need to create migrations and models for our database to store our data. Run npx sequelize-cli model:create --name Deal --attributes name:string,description:string,url:string,content:text. Running this will create a migration and the associated model.

When you run npx sequelize-cli db:migrate you will get the Deals table in your SQLite database.

With the database code done, we can start writing the logic. Create a queues folder in the backend folder and add a file called scraperQueue.js and add the following:

const Queue = require("bull");
const cheerio = require("cheerio");
const rp = require("request-promise");
const models = require("../models");

const scraperQueue = new Queue("web scraping");

scraperQueue.process(async (job, done) => {
  const deal = job.data.deal;
  const response = await rp(deal.url);
  const $ = cheerio.load(response);
  const items = $(".list_item");
  let deals = [];
  Object.keys(items).forEach(key => {
    const dealer = $(items[key])
      .find(".offer_dealer")
      .text();
    const title = $(items[key])
      .find(".offer_title")
      .text();

    let link = $(items[key])
      .find(".offer_title a")
      .attr("href");
    link = `https://www.redflagdeals.com${link}`;
    deals.push({ dealer, title, link });
  });
  await models.Deal.update(
    {
      content: JSON.stringify(deals),
    },
    {
      where: {
        id: deal.id,
      },
    }
  );
  done();
});

export { scraperQueue };

This is the queue for our web scraping jobs. What we are doing in this file is loading the web page we want to scrape, which is the Red Flag Deals deals page. We open the response of the page with Cheerio by loading the HTML into the DOM tree, locate the items with the class list-item and then iterate through each element and extract the text from the offer_dealer and offer_title classes. For the link, we load the element into the DOM tree with Cheerio and then get the a element from the node with the class offer_title. Then get the href attribute and add the URL prefix and root URL to it since it’s a relative path.

This means if the page structure changes, for example, when the class names change, then you have to update the code to keep it working.

After getting the data, we save it with Sequelize to the job entry we created.

Next we create the API endpoints to add, get, and delete the data. Create a file called deals.js in the routes folder and add the following:

const express = require("express");
const models = require("../models");
const router = express.Router();
import { scraperQueue } from "../queues/scraperQueue";

router.post("/scrape", async (req, res, next) => {
  const { name, description, url } = req.body;
  const deal = await models.Deal.create({
    name,
    description,
    url,
  });
  scraperQueue.add({ deal });
  res.json(deal);
});

router.delete("/job/:id", async (req, res, next) => {
  const id = req.params.id;
  await models.Deal.destroy({ where: { id } });
  res.json({ id });
});

router.get("/jobs", async (req, res, next) => {
  const deals = await models.Deal.findAll();
  res.json(deals);
});

router.get("/job/:id", async (req, res, next) => {
  const id = req.params.id;
  const deals = await models.Deal.findAll({ where: { id } });
  const deal = deals[0];
  deal.content = JSON.parse(deal.content);
  res.json(deal);
});

module.exports = router;

We have the scrape route to add the job to the queue and save the data to the Deals table. The delete /job/:id route is used to delete data from the Deals table. The /jobs table is for getting the jobs, and the get/job/:id route is for returning the deals data to the user.

Finally we have to make some changes to app.js. We replace the existing code with the following:

const createError = require("http-errors");
const express = require("express");
const path = require("path");
const cookieParser = require("cookie-parser");
const logger = require("morgan");
const cors = require("cors");

const indexRouter = require("./routes/index");
const dealsRouter = require("./routes/deals");

const app = express();

// view engine setup
app.set("views", path.join(__dirname, "views"));
app.set("view engine", "jade");

app.use(logger("dev"));
app.use(express.json());
app.use(express.urlencoded({ extended: false }));
app.use(cookieParser());
app.use(express.static(path.join(__dirname, "public")));
app.use(cors());

app.use("/", indexRouter);
app.use("/deals", dealsRouter);

// catch 404 and forward to error handler
app.use(function(req, res, next) {
  next(createError(404));
});

// error handler
app.use(function(err, req, res, next) {
  // set locals, only providing error in development
  res.locals.message = err.message;
  res.locals.error = req.app.get("env") === "development" ? err : {};

// render the error page
  res.status(err.status || 500);
  res.render("error");
});

module.exports = app;

We add the cors add-on to let the app talk to front end, and we added const dealsRouter = require(“./routes/deals”); and app.use(“/deals”, dealsRouter); to expose the deals routes.

This finishes the backend of our web scraping app. Now we need the frontend to add or remove jobs and display the results of the web scraping jobs.

Web Scraper Frontend

We start by creating a React project with Create React App. Create the app by going to the project’s root folder and run npx create-react-app frontend.

Next we need to install some packages. Run npm i axios bootstrap formik mobx mobx-react react-bootstrap react-router-dom yup. Axios is our HTTP client for making requests to our backend. Bootstrap and React Bootstrap are used for styling, MobX and MobX React are for state management, and Formik and Yup are for handling form value changes and form validation respectively.

Now we add and modify files in the src folder unless otherwise mentioned.

First we replace the code in App.js with the following:

import React from "react";
import { Router, Route, Link } from "react-router-dom";
import HomePage from "./HomePage";
import ResultsPage from "./ResultsPage";
import TopBar from "./TopBar";
import { createBrowserHistory as createHistory } from "history";
import { JobsStore } from "./stores";
import "./App.css";
const history = createHistory();
const jobsStore = new JobsStore();

function App() {
  return (
    <div className="App">
      <Router history={history}>
        <TopBar />
        <Route
          path="/"
          exact
          component={props => <HomePage {...props} jobsStore={jobsStore} />}
        />
        <Route path="/results/:id" exact component={ResultsPage} />
      </Router>
    </div>
  );
}

export default App;

We have the TopBar component which displays our top bar which we will create soon. Below that we have the routes for our pages which we will create. Note that we pass in our store to the components in the route. This is the MobX store which will persist the list of jobs for easy access.

In App.css we replace the existing code with:

.center {
    text-align: center;
}

Next we will build the home page. Create a file called HomePage.js and add the following:

import React from "react";
import { useEffect, useState } from "react";
import { Formik } from "formik";
import Form from "react-bootstrap/Form";
import Col from "react-bootstrap/Col";
import Button from "react-bootstrap/Button";
import * as yup from "yup";
import Table from "react-bootstrap/Table";
import { Redirect } from "react-router";
import "./HomePage.css";
import { observer } from "mobx-react";
import { scrape, removeJob, getAllJobs } from "./requests";

const schema = yup.object({
  name: yup.string().required("Name is required"),
  description: yup.string().required("Description is required"),
  url: yup.string().required("URL is required"),
});

function HomePage({ jobsStore }) {
  const [redirect, setRedirect] = useState(false);
  const [jobId, setJobId] = useState(0);
  const [initialized, setInitialized] = useState(false);
  const handleSubmit = async evt => {
    const isValid = await schema.validate(evt);
    if (!isValid) {
      return;
    }
    jobsStore.jobs.push(evt);
    jobsStore.setJobs(jobsStore.jobs);
    await scrape(evt);
    getJobs();
  };

  const selectJob = id => {
    setJobId(id);
    setRedirect(true);
  };

  const getJobs = async () => {
    const response = await getAllJobs();
    jobsStore.setJobs(response.data);
  };

  const deleteJob = id => {
    removeJob(id);
    getJobs();
  };

  useEffect(() => {
    if (!initialized) {
      getJobs();
      setInitialized(true);
    }
  });

  if (redirect) {
    return <Redirect to={`/results/${jobId}`} />;
  }

  return (
    <div className="home-page">
      <h1>Web Scraping Jobs</h1>
      <Formik
        validationSchema={schema}
        onSubmit={handleSubmit}
      >
        {({
          handleSubmit,
          handleChange,
          handleBlur,
          values,
          touched,
          isInvalid,
          errors,
        }) => (
          <Form noValidate onSubmit={handleSubmit}>
            <Form.Row>
              <Form.Group as={Col} md="12" controlId="name">
                <Form.Label>Name</Form.Label>
                <Form.Control
                  type="text"
                  name="name"
                  placeholder="Name"
                  value={values.name || ""}
                  onChange={handleChange}
                  isInvalid={touched.name && errors.name}
                />
                <Form.Control.Feedback type="invalid">
                  {errors.name}
                </Form.Control.Feedback>
              </Form.Group>
              <Form.Group as={Col} md="12" controlId="chatRoomName">
                <Form.Label>Description</Form.Label>
                <Form.Control
                  type="text"
                  name="description"
                  placeholder="Description"
                  value={values.description || ""}
                  onChange={handleChange}
                  isInvalid={touched.description && errors.description}
                />

                <Form.Control.Feedback type="invalid">
                  {errors.description}
                </Form.Control.Feedback>
              </Form.Group>

              <Form.Group controlId="url" as={Col} md="12">
                <Form.Label>Website</Form.Label>
                <Form.Control
                  as="select"
                  name="url"
                  value={values.url || ""}
                  onChange={handleChange}
                  isInvalid={touched.url && errors.url}
                >
                  <option value="">Select</option>
                  <option value="https://www.redflagdeals.com/in/toronto/deals">
                    Red Flag Deals
                  </option>
                </Form.Control>
                <Form.Control.Feedback type="invalid">
                  {errors.url}
                </Form.Control.Feedback>
              </Form.Group>
            </Form.Row>
            <Button type="submit" style={{ marginRight: "10px" }}>
              Add
            </Button>
          </Form>
        )}
      </Formik>
      <br />
      <Table striped bordered>
        <thead>
          <tr>
            <th>Name</th>
            <th>Description</th>
            <th>URL</th>
            <th></th>
            <th></th>
          </tr>
        </thead>
        <tbody>
          {jobsStore.jobs.map((j, i) => {
            return (
              <tr key={i}>
                <td>{j.name}</td>
                <td>{j.description}</td>
                <td>{j.url}</td>
                <td>
                  <Button onClick={selectJob.bind(this, j.id)}>Results</Button>
                </td>
                <td>
                  <Button onClick={deleteJob.bind(this, j.id)}>Delete</Button>
                </td>
              </tr>
            );
          })}
        </tbody>
      </Table>
    </div>
  );
}

export default observer(HomePage);

This page has the form for adding the jobs. We validate that all fields are required with Yup, which we use to create the schema object. We nest our Form component, which is provided by React Bootstrap with the Formik component so that it will handle all the form value changes and return the entered values in the handleSubmit function. We display the form validation errors with the Form.Control.Feedback components. When the user clicks “Add”, the handleSubmit function is executed.

In the handleSubmit function will validate the form values by running the schema.validate function, which returns a promise. Once that is executed and the validation passes, then we call thescrape function in requests.js, which we will create to add the jobs to the queue. After that, it will get the latest jobs by calling getJobs in our component, put the returned entries into our jobsStore and display the items in the table at the bottom.

We need the useEffect callback to call getJobs during the first load of the component so that we can populate the table when the page loads. We do this by checking the initialized flag. If it’s false, the getJobs runs, the initialized will be set to true by calling setInitialized(true) .

We have a button to let us view the jobs with the Result button, and we have a Delete button where we call the removeJob function from requests.js to make the API request to delete the job.

For styling the HomePage component, create HomePage.css and add:

.home-page {
  padding: 20px;
}

Next we writethe code for the requests. Create requests.js and add:

const APIURL = "http://localhost:3000";
const axios = require("axios");

export const getAllDeals = id => axios.get(`${APIURL}/deals/job/${id}`);

export const getAllJobs = () => axios.get(`${APIURL}/deals/jobs`);

export const scrape = data => axios.post(`${APIURL}/deals/scrape`, data);

export const removeJob = id => axios.delete(`${APIURL}/deals/job/${id}`);

Next we create a page to display the results. Create a file called ResultsPage.js and add:

import React, { useState, useEffect } from "react";
import { withRouter } from "react-router-dom";
import { getAllDeals } from "./requests";
import Card from "react-bootstrap/Card";
import "./ResultsPage.css";

function ResultsPage({ match: { params } }) {
  const [deals, setDeals] = useState([]);
  const [name, setName] = useState("");
  const [initialized, setInitialized] = useState(false);

  const getDeals = async () => {
    const id = params.id;
    const response = await getAllDeals(id);
    setDeals(response.data.content);
    setName(response.data.name);
  };

  useEffect(() => {
    if (!initialized) {
      getDeals();
      setInitialized(true);
    }
  });

  return (
    <div className="results-page">
      <h1 className="center">Deals Results: {name}</h1>
      {deals.map((d, i) => {
        return (
          <Card key={i}>
            <Card.Title className="title">{d.dealer}</Card.Title>
            <Card.Body>
              {d.title}
              <br />
              <a className="btn btn-primary" href={d.link}>
                Go
              </a>
            </Card.Body>
          </Card>
        );
      })}
    </div>
  );
}

export default withRouter(ResultsPage);

We get the web scrape job entry by ID and display the results in a list. We get the ID by checking for the ID parameter in the route. When we wrap ResultsPage with the withRouter function, we get the query parameter in the props.

For styling, we create a file called ResultsPage.css and add:

.results-page {
  padding: 20px;
}

.title {
  margin: 0 20px;
}

Now we need to create the MobX store to store the jobs data. Create a file stores.js and add:

import { observable, action, decorate } from "mobx";

class JobsStore {
  jobs = [];

setJobs(jobs) {
    this.jobs = jobs;
  }
}

JobsStore = decorate(JobsStore, {
  jobs: observable,
  setJobs: action,
});

export { JobsStore };

The setJobs is used in HomePage.js to set the jobs, and we accessed the list of jobs stored in the jobs field by using jobsStore.jobs since we passed in an instance of this as a prop to HomePage .

Finally, we create the top bar for our page to display the title and link for the app. Create TopBar.js and add:

import React from "react";
import Navbar from "react-bootstrap/Navbar";
import Nav from "react-bootstrap/Nav";
import NavDropdown from "react-bootstrap/NavDropdown";
import "./TopBar.css";
import { withRouter } from "react-router-dom";

function TopBar({ location }) {
  const { pathname } = location;

  return (
    <Navbar expand="lg" variant="dark">
      <Navbar.Brand href="#home">Web Scraper App</Navbar.Brand>
      <Navbar.Toggle aria-controls="basic-navbar-nav" />
      <Navbar.Collapse id="basic-navbar-nav">
        <Nav className="mr-auto">
          <Nav.Link href="/" active={pathname == "/"}>
            Home
          </Nav.Link>
        </Nav>
      </Navbar.Collapse>
    </Navbar>
  );
}

export default withRouter(TopBar);

This just displays the Navbar provided by React Bootstrap with the link to the home page. If we are on the home page, then the link will be highlighted. This is the purpose of the active property.

Next we create TopBar.css and add:

nav {
    background-color: green!important;
}

Finally, in index.html in the public folder, replace the existing code with:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <link rel="shortcut icon" href="%PUBLIC_URL%/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <meta name="theme-color" content="#000000" />
    <meta
      name="description"
      content="Web site created using create-react-app"
    />
    <link rel="apple-touch-icon" href="logo192.png" />
    <!--
      manifest.json provides metadata used when your web app is installed on a
      user's mobile device or desktop. See https://developers.google.com/web/fundamentals/web-app-manifest/
    -->
    <link rel="manifest" href="%PUBLIC_URL%/manifest.json" />
    <!--
      Notice the use of %PUBLIC_URL% in the tags above.
      It will be replaced with the URL of the `public` folder during the build.
      Only files inside the `public` folder can be referenced from the HTML.

      Unlike "/favicon.ico" or "favicon.ico", "%PUBLIC_URL%/favicon.ico" will
      work correctly both with client-side routing and a non-root public URL.
      Learn how to configure a non-root public URL by running `npm run build`.
    -->
    <title>Web Scraper App</title>
    <link
      rel="stylesheet"
      href="https://maxcdn.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"
      integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T"
      crossorigin="anonymous"
    />
  </head>
  <body>
    <noscript>You need to enable JavaScript to run this app.</noscript>
    <div id="root"></div>
    <!--
      This HTML file is a template.
      If you open it directly in the browser, you will see an empty page.

      You can add webfonts, meta tags, or analytics to this file.
      The build step will place the bundled scripts into the <body> tag.

      To begin the development, run `npm start` or `yarn start`.
      To create a production bundle, use `npm run build` or `yarn build`.
    -->
  </body>
</html>

We added the Bootstrap styles and change the title.

Now that the hard work is done. We can run the app.

First run the backend app by running npm start from the backend folder. Next go to the frontend folder and run npm start. When it asks you if you want it to run it from a different port, type yes.

By John Au-Yeung

Web developer specializing in React, Vue, and front end development.

Leave a Reply

Your email address will not be published. Required fields are marked *