Teachings

Professor
DIEGO ANGELO GAETANO REFORGIATO RECUPERO (Tit.)
Period
First Semester 
Teaching style
Conventional
Teaching Language
ENGLISH



Additional Information

Course Curriculum CFU Length(h)
[60/73]  INFORMATICS [73/00 - Ord. 2017]  PERCORSO COMUNE 6 48

Objectives

KNOWLEDGE AND UNDERSTANDING: The course teaches the foundations of several architectures for the analysis and management of big data, and builds a solid basis for understanding their problems and for implementing software that exploits these methodologies. The project, delivered in English, will focus on solving a research or industrial problem in software using big data architectures and frameworks. Students will work autonomously, through personal accounts on dedicated servers managed by the professor or on their own PCs, and will learn to integrate the appropriate technologies with big data methodologies and to solve certain tasks using machine learning where necessary. Examples of integrating various information technologies with the Apache Spark and Hadoop frameworks, with NVidia's CUDA architecture for GPUs, and with TensorFlow and Keras will be shown in class. Final seminars on various application domains (Semantic Web, Natural Language Processing, Sentiment Analysis, Text Classification, etc.) will show students the problems and best practices of real-world applications.

APPLICATION CAPACITIES: students must be able to use the proposed big data frameworks, the programming languages that work with them (Python, Java, Scala), and the supporting libraries (e.g. the MLlib library, cuDNN for developing convolutional neural networks on NVidia GPUs, and TensorFlow and Keras to define neural networks for solving machine learning problems). Several servers at the department will form a cluster and will be made available to students for their tests and experiments. Students will be able to carry out design and development activities on complex problems related to the big data world. Moreover, through Google Colaboratory, each student will be able to use GPUs provided remotely by Google.

JUDGMENT AUTONOMY: the course stimulates students to judge their own work autonomously by proposing exercises and problems to be solved with the frameworks shown in class. The skills developed concern the students' understanding of a big data problem and its analysis, development and solution.

ABILITY IN COMMUNICATION: students must be able to express, with appropriate terminology, the fundamental concepts of big data, machine learning, MapReduce, HDFS and CUDA, and to describe the software produced for the management of big data. A final report in English, well organized and easy to read, must be handed to the professor.

ABILITY TO LEARN: students will learn fundamental concepts of big data and machine learning frameworks, understand the difference between software written for managing large amounts of data with and without big data architectures, and recognize machine learning tasks. They will also be able to manage a cluster of PCs with big data and machine learning solutions installed, working comfortably in the Linux environments that host such architectures.

EXPECTED SKILLS: a general and solid understanding of the fundamentals and application aspects of the big data world. Knowledge of big data programming using Apache Hadoop, Apache Spark, CUDA, TensorFlow and Keras. Management of clusters of machines with big data frameworks installed, their maintenance, and the tuning of parameters for efficient operation. Students will be able to develop code for Apache Spark and NVidia GPUs, and to use and adapt known algorithms to solve machine learning problems more efficiently, both in the course and in their future projects. One or more concluding workshops will allow students to deepen their knowledge of particular domains by presenting current topics in different fields of application.

Prerequisites

Knowledge of Java and Python (Scala optional). General programming skills. Knowledge of the Linux shell and its main commands.

Contents

1. Architecture of Big Data systems
2. Definition of Big Data
3. Introduction to the Hadoop framework
4. Use of the HDFS filesystem
5. Concept of MapReduce
6 hours
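The MapReduce concept in items 1-5 above can be sketched in plain Python. This is a minimal, sequential simulation of the map, shuffle, and reduce phases of a word count (the canonical Hadoop example); the function names and sample data are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle phase: group all values by key, as the Hadoop framework
    does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    """Reduce phase: sum the counts for one word."""
    return (key, sum(values))

lines = ["big data needs big clusters", "big data"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped))
print(counts)  # {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1}
```

In actual Hadoop (e.g. via Hadoop Streaming) the mapper and reducer run as separate processes reading from standard input, and the shuffle happens across the cluster through HDFS and the network.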

6. Introduction to the Spark framework
7. Programming with RDDs
8. Use of key/value pairs
6 hours
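The RDD and key/value-pair topics above (items 6-8) revolve around transformations such as reduceByKey. Below is a plain-Python stand-in that mimics its semantics, assuming an associative combining function; the helper name and sample data are made up for the example:

```python
from itertools import groupby
from functools import reduce
from operator import add

def reduce_by_key(pairs, fn):
    """Plain-Python stand-in for Spark's RDD.reduceByKey: group pairs by
    key, then fold each group's values with fn. fn must be associative,
    because Spark applies it in parallel across partitions."""
    ordered = sorted(pairs, key=lambda kv: kv[0])  # the "shuffle"
    return [(k, reduce(fn, (v for _, v in grp)))
            for k, grp in groupby(ordered, key=lambda kv: kv[0])]

sales = [("cagliari", 10), ("milano", 3), ("cagliari", 5)]
totals = reduce_by_key(sales, add)
print(totals)  # [('cagliari', 15), ('milano', 3)]
```

In real PySpark the same computation is roughly `sc.parallelize(sales).reduceByKey(add).collect()`, with the grouping performed as a distributed shuffle rather than an in-memory sort.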

9. Loading and Saving data
10. Advanced Programming with Spark
11. Execution on a cluster
12. Tuning and debugging with Spark
10 hours
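Execution on a cluster (items 11-12) typically goes through the spark-submit launcher. The sketch below shows a plausible invocation; the script name, input path, and the specific resource values are placeholders, and the right settings depend on the cluster at hand:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --conf spark.sql.shuffle.partitions=64 \
  wordcount.py hdfs:///data/input
```

Tuning usually starts from the executor count, executor memory, and the number of shuffle partitions, with the Spark web UI used to inspect stages and tasks while debugging.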

13. Spark SQL and GraphX
14. Streaming with Spark
4 hours

15. Fundamentals of Machine and Deep Learning
16. Machine learning with MLlib
8 hours
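Items 15-16 cover the fundamentals behind libraries such as MLlib. As a self-contained illustration of what a learner does under the hood, here is batch gradient descent for a one-variable linear regression in plain Python; the data and hyperparameters are made up for the example (MLlib offers this ready-made as LinearRegression):

```python
# Fit y ≈ w*x + b by batch gradient descent on mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # generated by y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # converges near w=2.0, b=1.0
```

The same loop structure, distributed over partitions of the data, is what Spark's machine learning routines execute at scale.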

17. The CUDA architecture for programming NVidia GPUs
18. The CUDA programming model
19. The CUDA C runtime (device memory and shared memory)
20. Numba for using CUDA from Python
21. Introduction to the TensorFlow framework
22. Introduction to the Keras framework
23. Machine and deep learning with TensorFlow, Keras and GPUs
24. Introduction to Colaboratory
6 hours
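The CUDA programming model (items 17-20) assigns each thread a global index computed from its block and thread coordinates. The sketch below simulates that indexing sequentially in plain Python to show the idea; it is not CUDA code, and the kernel name and sizes are illustrative:

```python
# Each CUDA thread computes one output element; its position is
# blockIdx.x * blockDim.x + threadIdx.x. Simulated sequentially here.
def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(a):                           # guard: grid may overshoot n
        out[i] = a[i] + b[i]

n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
for block_idx in range(grid_dim):            # on a GPU these loops run
    for thread_idx in range(block_dim):      # as parallel threads
        vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out)
print(out)  # [0, 11, 22, ..., 99]
```

With Numba the equivalent kernel would be decorated with `@cuda.jit`, compute its index as `i = cuda.grid(1)`, and be launched as `vector_add_kernel[grid_dim, block_dim](a, b, out)`.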

25. Clusters on Amazon AWS
6 hours

26. Use case within the Robotics domain
27. Use case within the Financial Technology domain
2 hours

Teaching Methods

A forum on Moodle for questions, insights, and solutions to exercises and software assignments.

A Facebook group and a Telegram channel for further comments and communications.

Students can also contact the teacher via email and Skype to get quick answers.

GitHub repositories for hosting projects and source code.

Verification of learning

- Exercises and examples during the course

- Use of Mentimeter to maintain constant and high interaction with students.

- Development of small software tasks directly on the servers and immediate debugging and analysis of the results.

- Project assigned to small groups on a problem involving big data, agreed upon with the teacher. Once the project has been developed and tested on the available servers, a scientific description in English of the whole method, together with results, experiences and comments, will be handed to the teacher. The best descriptions will sometimes be extended by the teacher into papers and submitted to international workshops, conferences or journals. The score will depend on the size of the project and on how many students take part in it; a single mark will be agreed for the group. It will also depend on the computational efficiency of the developed program, in relation to the degree of parallelism adopted and the precision obtained using that parallelism.

- Oral verification on the essential concepts of big data and the frameworks seen during the course and discussion of the project.

- The software created for the project, together with its written and oral description, will determine an overall mark out of 30.

The following indicates a classification of the final scores for the subject:
18-20: students know how to use some big data technologies for the management of simple tasks without optimization

20-22: students know how to use some big data technologies for the resolution of simple tasks with a basic level optimization

22-24: students know how to use some big data technologies to solve simple tasks with medium level optimization

24-26: students know how to use some big data technologies for the resolution of simple tasks with an advanced level optimization

26-28: students know how to use different big data technologies for the resolution of easy and difficult tasks by exploiting advanced optimization and innovative cloud platforms for the execution of the code.

28-30 cum laude: students know how to use different big data technologies to solve research problems that require advanced optimization of operations with data and using innovative cloud platforms for the execution of the code.

Texts

- Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia

- Hadoop with Python. O'Reilly. Zachary Radtka & Donald Miner

- Deep Learning with Python. Francois Chollet. Manning

- https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

- http://numba.pydata.org/numba-doc/latest/index.html

The slides shown during the course (both theory lessons and exercises) will be made available on the professor's website (http://people.unica.it/diegoreforgiato/didattica/materiale-didattico/). They contain pointers to web pages with additional notes.

More Information

Use of slides, videos and Internet pages during the lessons both for theoretical lessons and for exercises. Use of a Facebook group, Telegram and possibly Moodle to post exercises, answer students' questions and create discussions on specific topics.
Use of Linux servers and NVidia Titan X GPUs where Apache Spark, CUDA, TensorFlow and Keras are installed and where students will find examples, exercises, etc. Each student will be assigned an account with which to work and test their project.
Use of email, social media (e.g. WhatsApp, Telegram, Facebook) and Skype for questions and/or to make appointments.

Lectures: 50%
Tutorials: 50%
