Big Data Learning and Technologies

Code	School	Level	Credits	Semesters
COMP4124	Computer Science	4	20	Spring UK

Summary

Big Data involves data whose volume, diversity and complexity require new technologies, algorithms and analyses to extract valuable knowledge, because processing it goes beyond the normal capabilities of a single computer. The field of Big Data has many different facets, such as databases, security and privacy, visualisation, computational infrastructure and data analytics/mining.

This module covers the following topics:

1. Introduction to Big Data: introducing the main principles behind distributed/parallel systems with data-intensive applications, and identifying the key challenges: capturing, storing, searching, analysing and visualising the data.

2. SQL Databases vs. NoSQL Databases: understanding the growing amounts of data; relational database management systems (RDBMS); an overview of structured query languages (e.g. SQL); an introduction to NoSQL databases; understanding the difference between a relational DBMS and a NoSQL database; identifying when a NoSQL database is needed.

3. Big Data frameworks and how to deal with big data: this includes the MapReduce programming model, as well as an overview of recent technologies (the Hadoop ecosystem and Apache Spark). You will then learn how to interact with the latest APIs of Apache Spark (RDDs, DataFrames and Datasets) to create distributed programs capable of dealing with big datasets (using Python and/or Scala).

4. Finally, we will dive into the data mining and machine learning part of the course, including data pre-processing approaches (to obtain quality data), distributed machine learning algorithms and data stream algorithms. To do so, you will use the machine learning library of Apache Spark (MLlib) to understand how some machine learning algorithms (e.g. Decision Trees, Random Forests, k-means) can be deployed at scale.
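
As an illustration of the MapReduce programming model named in topic 3, the following is a minimal single-machine sketch in plain Python, not material taken from the module itself. It counts words in three steps (map, shuffle, reduce). The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical and chosen only for this example; a real framework would run the map and reduce steps in parallel across many machines.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce sequentially (hypothetical helper)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data needs big tools", "spark handles big data"]
    print(word_count(sample))
```

In PySpark, the same pattern is usually expressed on an RDD with `flatMap`, `map` and `reduceByKey`, with the shuffle handled by the framework.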

Target Students

Available to Level 3 and 4 students in the School of Computer Science. Other students may take this module only with explicit approval from the module convenor(s). Prior knowledge of Machine Learning equivalent to COMP3009 is required. This module is part of the Artificial Intelligence, Modelling and Optimisation theme in the School of Computer Science.

Assessment

100% Coursework 1: Group coursework and 4 lab assessments.

Assessed by end of spring semester

Educational Aims

The aim of this module is to provide an overview of the big data problem and present the main principles and technologies behind distributed/parallel systems with data intensive applications.

Learning Outcomes

Knowledge and Understanding

Understand the importance of data.
The principles that allow the processing of big data sets.
Understand the working and features of existing machine learning algorithms capable of handling big data.
Learn to use the main tools of the big data ecosystem.
The current limitations of big data technologies in supporting distributed machine learning.

Intellectual Skills

Understand complex ideas and relate them to specific problems or questions in the area of parallel computation.
Be able to identify distributed solutions/approaches to handle big datasets with existing technologies.

Professional/Practical Skills

Hands-on experience with state-of-the-art technologies to handle big data.

Transferable/Key Skills

Experience in problem solving.
Experience in working in groups.
Retrieve information from appropriate sources (e.g. Spark API).
Understanding the issues with biased data and consideration of diversity.
Address real problems involving big data and assess the value of the proposed solutions; retrieve and analyse information from a variety of sources; and produce detailed written reports on the results to support the United Nations Sustainable Development Goals (SDGs).

Conveners

Last updated 07/01/2025.