Applied bioinformatics
Academic year 2023/2024
- Course ID
- NEU0293
- Teachers
- Ivan Molineris (Lecturer)
Davide Marnetto (Lecturer)
- Year
- 2nd year
- Teaching period
- Second semester
- Type
- Related or integrative
- Credits/Recognition
- 4
- Course disciplinary sector (SSD)
- BIO/11 - molecular biology
- Delivery
- Traditional (in person)
- Language
- English
- Attendance
- Optional
- Type of examination
- Practical test
- Prerequisites
- Theoretical knowledge of molecular biology concepts and high-throughput analyses such as DNA and RNA sequencing. Basic knowledge of programming notions, such as: file system, commands, variables, control flow (if/else/loops), lists, functions.
Students must have mastered the concepts covered in the "Programming for Data Science" and "Bioinformatics" modules of the "Data Science" course.
Course summary
Course objectives
The aim of the course is to give students the tools they need to run computational analyses autonomously, with a specific focus on methods for the analysis of Next-Generation Sequencing (NGS) big data. It is therefore designed as a natural continuation of the "Programming for Data Science" and “Bioinformatics” modules from the “Data Science” course. The first module covers the Bash command-line interface, the most commonly used environment in bioinformatics, presenting basic Bash tools and using NGS bioinformatics tools as a case study. In the second module the students will learn how to integrate such tools into a computational pipeline managed by Snakemake, a Python-based workflow manager, completing the competences needed to reach the course objectives.
Learning outcomes
Knowledge of the Linux/Unix Bash command-line interface, fundamental for most big data analyses, especially but not exclusively in bioinformatics. Understanding of the principles behind the integration of modular steps in a computational analysis to build a complex pipeline. Knowledge of the Snakemake workflow management framework and of the basic concepts of Conda and Python. Knowledge of commonly used bioinformatics tools for the analysis of NGS data.
Ability to apply and integrate this knowledge to build bioinformatics pipelines that answer biological questions using NGS data, and to apply this knowledge to other problems. Ability to organize and develop computational pipelines independently for the analysis of biology-derived big data, making judgements about the available and necessary computational resources. Autonomy in the usage and integration of computational tools to analyze NGS data.
Knowledge of the vocabulary necessary to communicate with informatics professionals within the scope of the covered topics, ability to formulate biological problems from a computational perspective and to communicate algorithmic solutions.
Improved ability to learn new coding languages thanks to a basic knowledge of underlying principles and thanks to the analogy with known languages, frameworks and conditions.
Program
Module 1
- Computer science concepts (reviewed from the Programming for Data Science module):
- Computer architecture
- Process
- The file system
- Interface and API concept
- Structure of a Linux/Unix system
- Exchange of data and services, servers
- Encoding: everything in bioinformatics is text
- The shell and commands
- Navigate the filesystem
- Filesystem permission system
- Unix power tools and basic programming principles
- awk
- Principles and application of parallel computing
- Next generation sequencing data analysis
- The FASTA and FASTQ file formats
- FastQC
- Analysis of overrepresented sequences
- Genome annotation and the GTF format
- Mapping with STAR or Bowtie
- The BAM format and its visualization
- Expression quantification or peak calling
- Error controls and quality assessment
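As an illustration of the FASTQ topic listed above, the following is a minimal Python sketch of how a four-line FASTQ record can be read and its mean base quality computed. It assumes Phred+33 quality encoding and unwrapped records (exactly four lines each); the function names `read_fastq` and `mean_phred` and the two example reads are hypothetical and are not course material.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a FASTQ text stream.

    Assumes each record spans exactly four lines: header, sequence,
    '+' separator, quality string.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed by default)."""
    return mean(ord(ch) - offset for ch in quality)


# Two made-up reads, used only for illustration.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n"

if __name__ == "__main__":
    for read_id, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
        print(read_id, seq, mean_phred(qual))
```

In practice, dedicated tools such as FastQC perform this kind of quality summary on real data; the sketch only shows the structure of the format.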
Module 2
- Pipeline organizing principles, introduction to Snakemake (a Python-based workflow manager). Conda environments and portability. Installation of Conda and Snakemake.
- Introduction to rules (input, output, shell), rule dependency. First pipeline of 2 example rules.
- Snakemake options and wildcards. Testing and debugging the example pipeline.
- Pipeline automation, wildcards, expand, “all” rules. FASTQ quality control rules.
- Pipeline generalization, configuration files. Rules to map FASTQ files and obtain BAM files.
- Advanced pipelines with parameters, output attributes, rule priorities. Alignment quality control rules.
- Exploiting computational resources: parallelization, memory resources. Expression quantification rules.
- Snakemake is Python. Python basics, functions as input. Rules for the analysis of gene expression.
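The idea of rule dependency mentioned in the list above can be sketched in plain Python: each rule declares its input and output files, and the order of execution is derived by working backwards from the requested target. This is a simplified illustration, not Snakemake's actual implementation; the rule names, file names and the `plan` function are hypothetical, and the sketch assumes the rules contain no circular dependencies.

```python
# Hypothetical rules: each one maps input files to output files.
RULES = {
    "fastqc": {"input": ["sample.fastq"], "output": ["sample_fastqc.html"]},
    "map": {"input": ["sample.fastq"], "output": ["sample.bam"]},
    "quantify": {"input": ["sample.bam", "genes.gtf"], "output": ["counts.tsv"]},
}


def plan(target, rules, available):
    """Return the rule names to run, in order, to produce `target`.

    `available` is the set of files that already exist.
    Assumes the rule graph is acyclic.
    """
    order = []

    def visit(path):
        if path in available:
            return  # nothing to do, the file already exists
        # Find the rule that lists this file among its outputs.
        producer = next(
            (name for name, rule in rules.items() if path in rule["output"]),
            None,
        )
        if producer is None:
            raise FileNotFoundError(f"no rule produces {path!r}")
        if producer in order:
            return  # rule already scheduled
        for dependency in rules[producer]["input"]:
            visit(dependency)  # schedule upstream rules first
        order.append(producer)

    visit(target)
    return order


if __name__ == "__main__":
    print(plan("counts.tsv", RULES, {"sample.fastq", "genes.gtf"}))
```

Requesting `counts.tsv` schedules only the rules it depends on, so the hypothetical `fastqc` rule is not run unless its output is requested as well.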
Course delivery
The course will be held entirely in a computer room, alternating short lectures with long hands-on practical sessions in which students implement what has been explained.
Learning assessment methods
Practical tests, assigned as homework, in which the students analyze data using bioinformatics pipelines and Unix power tools.
A report describing and commenting on the practical test (procedures and results).
The code and the report must be handed in a few days before the exam.
The exam will consist of an oral discussion of the code and the report.
Suggested readings and bibliography
Teaching Modules
- Computational pipelines (NEU0293B)
- Next generation sequencing data analysis using Linux power tools (NEU0293A)