(Social) Web Single Sign-On in the Wild

Aims: Perform a (semi-)automatic analysis on the usage of social single sign-on systems in the web.

Background:

(Social) web single sign-on systems, such as Facebook Login or Google Sign-in, have found a widespread adoption in the web to authenticate users. These systems are typically based on the standards OpenID Connect 1.0 and OAuth 2.0. These protocols, however, can be used incorrectly, making their usage insecure. While many of these problems are addressed by guidance and protocol improvements, web sites might not follow this advice.

In this project, we want to (semi-)automatically scan the web for web sites that use social web single sign-on and check if they are using these authentication systems in insecure ways.

Prerequisites:

Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).
Good programming skills.

Early Deliverables

A report describing (social) web single sign-on and their security problems and explaining which of these security problems can be detected without knowledge of the inner workings of the web services.

Final Deliverables

A web crawler that detects insecure usage of (social) web single sign-on based on OpenID Connect 1.0 and OAuth 2.0.
Complete report

Reading

https://tools.ietf.org/html/rfc6749 (OAuth 2.0)
https://tools.ietf.org/html/rfc6750 (Bearer Token Usage)
https://tools.ietf.org/html/rfc6819 (OAuth 2.0 Threat Model and Security Considerations)
https://tools.ietf.org/html/draft-ietf-oauth-security-topics-21 (Draft OAuth Security BCP)
https://tools.ietf.org/html/draft-ietf-oauth-browser-based-apps-12 (Draft Best Practices for OAuth Browser-Based Apps)
https://openid.net/specs/openid-connect-core-1_0.html (OpenID Connect Core 1.0)
Ronghai Yang, Guanchen Li, Wing Cheong Lau, Kehuan Zhang, and Pili Hu. 2016. Model-based Security Testing: An Empirical Study on OAuth 2.0 Implementations. ASIA CCS '16. ACM, New York, 651–662. DOI:https://doi.org/10.1145/2897845.2897874
Shernan E., Carter H., Tian D., Traynor P., Butler K. 2015. More Guidelines Than Rules: CSRF Vulnerabilities from Noncompliant OAuth 2.0 Implementations. In: Almgren M., Gulisano V., Maggi F. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2015. Lecture Notes in Computer Science, vol 9148. Springer, Cham. https://doi.org/10.1007/978-3-319-20550-2_13
San-Tsai Sun and Konstantin Beznosov. 2012. The devil is in the (implementation) details: an empirical analysis of OAuth SSO systems. CCS '12. ACM, New York, 378–390. DOI:https://doi.org/10.1145/2382196.2382238 - Yuchen Zhou and David Evans. 2014. SSOScan: Automated Testing of Web Applications for Single Sign-On Vulnerabilities. USENIX Security 2014. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/zhou
Daniel Fett, Ralf Küsters, Guido Schmitz. 2016. A Comprehensive Formal Security Analysis of OAuth 2.0. In: 23rd ACM SIGSAC Conference on Computer and Communications Security (CCS 2016). https://dl.acm.org/doi/pdf/10.1145/2976749.2978385
Daniel Fett, Ralf Küsters, Guido Schmitz. 2017. The Web SSO Standard OpenID Connect: In-Depth Formal Security Analysis and Security Guidelines. In: IEEE 30th Computer Security Foundations Symposium (CSF 2017). https://ieeexplore.ieee.org/iel7/8048777/8049639/08049720.pdf

A Chat Bot to Manage Security Services

Aims: Develop a Chatbot for MS Teams or Slack to manage a security service via an API.

Background: Productivity tools such as MS Teams and Slack have made it easier to integrate access to other services and APIs via chatbots. These bots can be used to set calendar invites, notify team members about new commits to a repository or make polls (amont other things). Many managed security services today provide, in addition to the standard web interface, API entry points so the customers can customise their experience and access their services in an automated way. These APIs are normally used in other web based applications. In this project we want to develop a chat bot so we can interact with one of these APIs via Slack or MS Teams.

Prerequisites:

Good programming skills, particularly Python and/or NodeJS.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing current approaches to integrate apps into MS Teams and Slack
A report with an overview of possible security services to integrate in the chatbot (VirusTotal, etc.)
A proof of concept app for either MS Teams and Slack (following the tutorials would be enough)

Final Deliverables

A chatbot that allows users to access the functionality of a specific API via MS Teams or Slack (the bot may include a setup phase were API keys need to be provided)
A test suite to verify that all implemented tests work as intended
A security analysis of the chatbot app
A user manual for the bot
A critical comparison with related works

Reading

https://api.slack.com
https://docs.microsoft.com/en-us/microsoftteams/platform/
https://link.springer.com/chapter/10.1007/978-3-319-99927-2_3
https://s3lab-rhul.slack.com/apps/A81FQ3116-1password

A Dangerous Impasse: Identifying how Lightweight Statistical Tests of Randomness fail to Identify Supply Chain Threat Vectors, and Relevant Remediation Strategies

Aims: Exploring the degree to which ‘lightweight’ (for example hardware implemented) tests of randomness can be fooled by constructed ‘random’ sequences and identifying context-appropriate remediation measures.

Background: FIPS 140-2 initially included a now-defunct statistical test battery. Though no longer included in the standard, the so-called ‘FIPS Tests’ have proven popular as a basis for lightweight self-tests implemented in hardware random number generators as a way of identifying flaws (non-random behaviour) at point of use. Recent research (see reading list) has identified critical issues with the supposition that these tests can identify tampering or structured information in apparently random output streams. This project will identify cost-effective, simple methods of trojanising (structurally biasing) hardware RNGs, the aim being to identify methods for end-users to independently identify those issues.

Prerequisites:

Willingness to learn and engage with simulation and statistical analysis software (NS3/SimPy and MATLAB/Octave/R).
Enthusiasm for mathematical problems of intermediate complexity.
A general knowledge of how randomness is used in crypto-systems, and the supply-chain involved in distributing security-related hardware to critical infrastructure/end-users.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report on hardware RNG and appropriate third-party secure modules (such as TPM), including the methods used by manufacturers to test randomness (and how end users are extended the opportunity to perform some tests for themselves).
A report on the current state of the art in supply chain attacks against critical security infrastructure.
A clear workplan describing the set of tests to be performed, including any simulation, emulation or experimental systems that will be used.

Final Deliverables

A simulation or experimental demonstration of structured information that can pass simple statistical tests of randomness.
Appropriate remediation measures, be they existing or novel tests of randomness, with appropriate justifications (explaining why the additional time, cost or other resources are warranted and under what circumstances).
A final report documenting the project. This report will extend to cover the experimental/simulation design, generation of biased random sequences, and how lightweight statistical tests were selected and evaluated. A clear grasp of experimental process and hypothesis testing should be demonstrated through appropriate use of tools and techniques.

Reading

https://ieeexplore.ieee.org/abstract/document/9069949/
https://link.springer.com/chapter/10.1007/978-3-030-04762-7_8
https://dl.acm.org/doi/abs/10.1145/3264628
https://dl.acm.org/doi/abs/10.1145/1268776.1268777
https://ieeexplore.ieee.org/abstract/document/6135498/

A concurrency based game environment

Aims: The aim of the project is to develop a game-playing environment using a programming language such as Java. Besides providing appropriate error reports, the system should have a nice graphical interface and provide such advanced features such as allowing multiple games to be played at the same time.

Background: Game-playing environments are popular systems with a variety of functionalities. For example, such an environment usually allows multiple games to be played at the same time. This can be implemented by means of the concurrency mechanisms supported in a programming language: multiple games are implemented as different threads that share the same modules that provide the basic functionalities such as moves in board games and error reporting procedures.

Early Deliverables

Proof of concept program: A prototype implementation of a board game (e.g., chess).
Proof of concept program: A prototype program that exhibits the behaviour of concurrent execution.
Report: A survey report on game environments.
Report: A description of the prototype implementations.

Final Deliverables

The system must be implemented according to modern software engineering principles.
The environment should contain a working system, with functionalities such as error reporting.
A graphical interface should be implemented.
The environment should allow the users to play multiple games at the same time.
The report should provide an overview of game-playing environments.
The report should describe the environment including its functionalities and design.
The report should describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report should describe the testing procedures of the environment.
The report should describe the software engineering process involved in the design and implementation.

Suggested Extensions

Advanced GUI with other useful features such as chat room.
Advanced implementation issues such as multiple game playing based on concurrency and sharing.

Prerequisites: Good command of programming languages such as Java.

A concurrency based game environment

Early Deliverables

Proof of concept program: A prototype implementation of a board game (e.g., chess).
Proof of concept program: A prototype program that exhibits the behaviour of concurrent execution.
Report: A survey report on game environments.
Report: A description of the prototype implementations.

Final Deliverables

The system must be implemented according to modern software engineering principles.
The environment should contain a working system, with functionalities such as error reporting.
A graphical interface should be implemented.
The environment should allow the users to play multiple games at the same time.
The report should provide an overview of game-playing environments.
The report should describe the environment including its functionalities and design.
The report should describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report should describe the testing procedures of the environment.
The report should describe the software engineering process involved in the design and implementation.

Suggested Extensions

Advanced GUI with other useful features such as chat room.
Advanced implementation issues such as multiple game playing based on concurrency and sharing.

Prerequisites: Good command of programming languages such as Java.

A study in (HCI) human computer interaction

Aims: To compare various user interfaces and evaluate their design in terms of human usability

Background: User interfaces are becoming increasingly more important as the world conducts a web- based conversation with itself, along with the continuing computerization of products and facilities. When interfaces are situated in safety-critical contexts, their design and usability can be a matter of life and death: consider the fatalities associated with the Therac-25 radiation therapy machine. The USA Gore-Bush presidential campaign in 2000 was significantly disrupted by voter confusion over the computerized butterfly ballot design. Other classic interface issues include users mistaking their CD-ROM tray for a cupholder, or looking for the "any key”. In terms of e-commerce, companies invest in the design of customer web-sites with consideration to visual appeal and usability. Current directions for interface applications include mobile, wearable and ubiquitous computing.

HCI issues include: colour theory; human perception; haptic/tactile technology; gender / age /cultural / special needs issues; speech recognition / generation; graphic design; cognitive issues such as memory, learning and problem solving; design of fonts; navigation; feedback to the user; usability; aesthetics; ethical issues; and interface problems.

For this project the student will design and implement at least 3 different software interfaces (just focussing on the interface) - for instance a web-page/site, a data-base, an interactive sketch tool, a distance learning facility, or a GUI. A more challenging goal is to implement a mobile interface such as for the Android operating system for touchscreen devices.

The report will comprise a comprehensive survey on HCI discussing both software and hardware interfaces. In particular, the software interfaces implemented by the student will be evaluated in the report in terms of HCI principles.

This project is not based on any of your courses, therefore some HCI material will be provided.

Early Deliverables

A text-based (non-interactive) monochrome web-page
A colourful web-site including images and navigation
GUI built with buttons etc.
Report: about 15 pages including sketches of designs.

Final Deliverables

Design and implement a more advanced interface(s)
Complete report
The programs must have an object-oriented design, using modern software engineering principles.
The report will describe the software engineering processes involved in generating your software.
The report will include comparisons of interfaces with a discussion of their meanings.
The report will include a User Manual.

Prerequisites: Interaction Design module (PC3001)

Reading

http://hci.rwth-aachen.de/HCIBooks
http://www.netmagazine.com/features/top-50-books-web-designers-and-developers

A study in (HCI) human computer interaction

Aims: To compare various user interfaces and evaluate their design in terms of human usability

This project is not based on any of your courses, therefore some HCI material will be provided.

Early Deliverables

A text-based (non-interactive) monochrome web-page
A colourful web-site including images and navigation
GUI built with buttons etc.
Report: about 15 pages including sketches of designs.

Final Deliverables

Design and implement a more advanced interface(s)
Complete report
The programs must have an object-oriented design, using modern software engineering principles.
The report will describe the software engineering processes involved in generating your software.
The report will include comparisons of interfaces with a discussion of their meanings.
The report will include a User Manual.

Reading

http://hci.rwth-aachen.de/HCIBooks
http://www.netmagazine.com/features/top-50-books-web-designers-and-developers

AI-Based Underhanded C: Automatically Creating Malicious Code that Looks Benign

Aims: Automatically creating malicious code that looks benign when inspected.

Background:

The Underhanded C Contest was a programming contest to create code that is malicious, but passes a rigorous inspection, and looks like an honest mistake even if discovered.

The goal of the project is to understand whether it is possible (and to what extent) to create an AI-based system that automatically generates malicious code that looks benign by training the system on existing examples/rules (e.g., the underhanded context samples)."

Prerequisites:

Good programming skills, particularly C.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing techniques/methods to create malicious code that look benign
Techniques (e.g., AI-based) that can be used to generate malicious code autonomously
A clear workplan describing the design of the system to be implemented and set of tests to be performed to verify the generated code

Final Deliverables

A thorough analysis of examples of automatically-generated malicious code that pass rigorous inspection
Describe in detail the design of the system and of the tests
Describe and analyse the (AI-based) techniques and features that have been used during the project to automatically generate malicious code
Critically compare the findings with related works
Suggestions for improving the inspection of code generated by the proposed tool

Reading

https://en.wikipedia.org/wiki/Underhanded_C_Contest
http://underhanded-c.org/

Advanced Web Development

Aims: Design and implement a web application realising an online service of your choice (such as an online store, social media or web commerce platform).

Background:

Web development is at the core of a large number of software projects. Thus, it is nowadays a fundamental skill that computer scientists are expected to comprehend and be able to apply - if not to essentially dominate. The main goal of the project is to develop a full-stack Web (online) application/service using state-of-the-art technologies and established best-practices. Example services include one of the following:

an online store, e.g. providing the ability to authenticate, post items for sale, choose items to buy, search, perform payments, etc.
a local exchange trading system, e.g. a system in which members of a community can exchange goods and services without the use of a conventional monetary currency.
a social media application.
an online tool for drawing and animating interactive diagrams (such as logic circuits, network graphs, or state machines), whose features include the creation/editing interface and the capability of exporting or embedding the diagrams easily into other web pages

Students may propose their own Web applications/service to implement. However, a clear scope (i.e. a set of expected, well-defined use cases) should be specified and approved by their supervisor prior to commencing the project.

The application’s design must follow an N-tier architectural pattern consisting of a web browser interface, Web and Application server layers and a Database.

The basic elements to be supported must include the application logic and user interface implemented using one of the existing web application technologies/platforms (such as PHP, Ruby on Rails, Django, JavaScript-based frameworks such as NodeJS, REACT, Angular etc.), a relational database schema equipped with sound constraints and designed according to the sound normalization rules, and the interaction mechanisms between the application logic and the database. It is expected that prior to selecting the concrete technologies to be used, an evaluation of the state-of-the-art options will be performed along a number of comparison criteria (e.g. overall set of capabilities, support for aspects such as concurrency, security, modularity, performance, UX practices). The student is expected to (i) adequately establish such criteria (ii) investigate and reflect on a number of candidate technologies and (iii) appropriately compare and justify the concrete choices made.

Advanced features may include supporting ACID transactions, comprehensive security/privacy analysis and advanced security mechanisms (such as protection against SQL injection attacks), using NoSQL data stores (such as MongoDB) to speed up user interaction while maintaining sound levels of transactional consistency, advanced UI features (such as push notifications), using emerging platforms (such as a blockchain), etc.

Early Deliverables

A report describing the state-of-the-art of Web development, including technologies/frameworks/platforms and their comparison as described above. The report will also explain the concrete choices made and the respective rationale/justification from a software engineering perspective (i.e. a justification merely based on a skills learning perspective will not suffice)
A report on Web development architectural paradigms and applicable design patterns, including an initial discussion of security, privacy and key-operational aspects and considerations (e.g. cloud deployments and DevOps aspects)
A report describing the system to be implemented, including a comprehensive set of well-defined use cases (or user stories).
A functional prototype implementation of the system, clearly demonstrating sound design principles and implementation practices (e.g. on DB design, on UI design and interaction).

Final Deliverables

A fully functional software application, with an appropriate end-to-end architecture, applicable design patterns, using modern Web technologies and rigorous software engineering principles.
Complete report, encompassing also relevant content (further explained/discussed) from the early deliverables reports.
The report will describe the software engineering processes involved in generating your software.
The report will include a User Manual.

Aggregating Algorithm

Aims: The aim of the project is to implement methods of prediction with expert advice such as the aggregating algorithm, apply them to real-world data and analyse the performance under different settings.

Background:

Methods of prediction of expert advice allow us to combine a pool of different prediction algorithms (called experts) in the on-line mode. By using them we can perform nearly as well as any expert in the pool.

There is a range of prediction with expert advice methods that are applicable in different situations. First, we may have different requirements for the quality of predictions: the deviation between predictions and outcomes can be measured using different loss functions. The absolute loss is very different from square loss and is usually handled by different algorithms. Secondly, one may be interested in different scenarios of using expert advice. The aggregating algorithm performs as well as any expert in the pool. The fixed share competes with "switching experts" that can switch from one original expert to another once in a while. The guarantees that can be provided in this situation are weaker but cover a much larger pool of predictors. Finally there are recent developments such as specialist experts that can skip a turn without giving a prediction. There are ways to handle this scenario.

On this project you will implement prediction with expert algorithms and use them to merge predictions by experts, which will normally be standard prediction algorithms. One may think of prediction with expert advice as an alternative to model selection: instead of picking one algorithm, we merge a group.

The project is recommended for Computational Finance students.

Early Deliverables

A simple proof-of-concept implementation of the aggregating algorithm and fixed share algorithm.
Report: An overview of the problem of prediction with expert advice and the work of the aggregating algorithm.
Report: The performance of the aggregating algorithm on a small artificial dataset.

Final Deliverables

The program will work with a real dataset, read the data from a file, preprocess it, and use a standard algorithm to generate experts' predictions (if necessary).
The program will implement aggregating algorithm plus possibly fixed share and the aggregating algorithm for sleeping experts. The results will be visualised.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions.

Suggested Extensions

Application of sleeping experts for predicting the implied volatility of options.
Experiments with the weak aggregating algorithm.

Reading

Y.Kalnishkan, The AA And Laissez-Faire Investment http://onlineprediction.net/?n=Main.TheAAAndLaissez-FaireInvestment
Kalnishkan, Y. Prediction with expert advice for a finite number of experts: A practical introduction. Pattern Recognition, 126 (2022).
Vovk, V. Competitive on‐line statistics. International Statistical Review, 69(2), 213-248 (2001).
M.Herbster and M.Warmuth, Tracking the Best Expert, Machine Learning, 32, 151-178 (1998)
Y.Kalnishkan and M.V.Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8): 1228--1244 (2008)

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5200 (On-line Learning) is highly recommended.

Aggregating Algorithm

Background:

The project is recommended for Computational Finance students.

Early Deliverables

A simple proof-of-concept implementation of the aggregating algorithm and fixed share algorithm.
Report: An overview of the problem of prediction with expert advice and the work of the aggregating algorithm.
Report: The performance of the aggregating algorithm on a small artificial dataset.

Final Deliverables

The program will work with a real dataset, read the data from a file, preprocess it, and use a standard algorithm to generate experts' predictions (if necessary).
The program will implement aggregating algorithm plus possibly fixed share and the aggregating algorithm for sleeping experts. The results will be visualised.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions.

Suggested Extensions

Application of sleeping experts for predicting the implied volatility of options.
Experiments with the weak aggregating algorithm.

Reading

Y.Kalnishkan, The AA And Laissez-Faire Investment http://onlineprediction.net/?n=Main.TheAAAndLaissez-FaireInvestment
Kalnishkan, Y. Prediction with expert advice for a finite number of experts: A practical introduction. Pattern Recognition, 126 (2022).
Vovk, V. Competitive on‐line statistics. International Statistical Review, 69(2), 213-248 (2001).
M.Herbster and M.Warmuth, Tracking the Best Expert, Machine Learning, 32, 151-178 (1998)
Y.Kalnishkan and M.V.Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8): 1228--1244 (2008)

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5200 (On-line Learning) is highly recommended.

Algorithms for Economics and Social Choice

Aims: In this project, you will select a topic emerging from Economics, Game Theory, or Social Choice, you will study and implement algorithms for this topic, and report how they perform in practice.

Background: It is well known to Computer Scientists that it is not faster computers but it is better algorithms that have driven the success of computers. One of the key areas of investigation for computing has been the whole area of social choice. How is it that a collection of individuals with varying preferences over a set of options can lead to a general decision of the community over those same options. Sometimes we want to find a winner and sometimes an order over the options. Most notably this is what is done every time we vote in an election and there is much discussion about the correct and fair form voting that we should use. The archetypal result in this theory is Arrows's impossibility theorem - that there is no fair voting system. A related concept is a study of autonomous agents taking part in a competition for resources. Here participants all have their own strategies rather than an ordering over options and we have to decide what will happen. This situation occurs whenever people or machines are fighting over rare resources, or trying to beat each other in a strategic game. The archetypal result would be that there is no sensible way to determine the best strategy for players in a multi-player multi-objective game - and this was first shown by Nash. It is clear that social choice and economics includes many and varied situations. Some available topics that this project might choose to concentrate on include (you could choose more than one):

Computation of Nash equilibria (two-player games, polymatrix games, or net coordination games).
Dynamics, like best response and fictitious play, in games (net coordination games, networks of complements, Schelling games).
Fair division of indivisible items.
Fair division of divisible goods (Cake cutting algorithms, Rent division).
Stable Matching Algorithms.
Matching markets.
Simulation and calculation of results in many voting like systems

Early Deliverables

Report: A general report on algorithms for economics social choice, finishing with a determination of the project topic or topics.
Report: A full formal description of the problem and the solution concept.
Report: Known algorithms on the topic. How they work?
Proof of concept: A program that generates a (small) instance of the problem(s) you are studying and checks whether a given solution is indeed correct.
Proof of concept: Simple greedy heuristics for the topic.
Implementation of a known algorithm on the topic that works on some toy-examples of the problem.

Final Deliverables

A program developped with proper software engineering techniques
A nice and easy to use GUI for the chosen topic that allows interesting experiments (the GUI and its functionalities depend on the topic).
The program should work on at least three special cases of the problem.
Implementation and comparison of different heuristics and standard algorithms.
Final report should contain a section on the outcomes of the experiments done and conclusions drawn about which algorithms seem suitable in different situations.

Suggested Extensions

A more careful theory report, including notions of problem hardness, or running time for the algorithms.
More involved algorithms for the problem.
In depth analysis of the algorithms for specific classes of instances.
Nice analysis options for your problem including an implementation in some Discrete Event Simulation framework
An interactive Web page allowing people to explore and understand your topic using examples and simulation.

Reading

The table of contents of the Handbook of Computational Social Choice is very useful to see what kinds of project you might want to do
This course on Algorithmic Game Theory is worth having a skim over. At least the list of titles.
There is also a good book on Algorithmic Game Theory
Nash equilibrium: The Nash Equilibrium Wiki page is nice as an introduction.
Stable Marriage: The Stable Marriage Wiki page is also nice as an introduction.
Some fun Youtube videos about matching - worth watching even if you do not take this project: video one, or video two.
A more technical introduction to Networks of Complements.
A more technical introduction to Schelling games.
A more technical introduction to Coordination Games on Graphs.
A more technical introduction to Fair Allocation of Indivisible Goods.
A (slightly) more technical introduction to Cake cutting algorithms.
A more technical introduction to Matching Markets

Prerequisites: CS2860 Algorithms and Complexity and a general interest in the mathematics of Computer Science and in algorithms, game theory, and economics. Reasonable programming skills.

Algorithms for Economics and Social Choice

Computation of Nash equilibria (two-player games, polymatrix games, or net coordination games).
Dynamics, like best response and fictitious play, in games (net coordination games, networks of complements, Schelling games).
Fair division of indivisible items.
Fair division of divisible goods (Cake cutting algorithms, Rent division).
Stable Matching Algorithms.
Matching markets.
Simulation and calculation of results in many voting like systems

Early Deliverables

Report: A general report on algorithms for economics social choice, finishing with a determination of the project topic or topics.
Report: A full formal description of the problem and the solution concept.
Report: Known algorithms on the topic. How they work?
Proof of concept: A program that generates a (small) instance of the problem(s) you are studying and checks whether a given solution is indeed correct.
Proof of concept: Simple greedy heuristics for the topic.
Implementation of a known algorithm on the topic that works on some toy-examples of the problem.

Final Deliverables

A program developped with proper software engineering techniques
A nice and easy to use GUI for the chosen topic that allows interesting experiments (the GUI and its functionalities depend on the topic).
The program should work on at least three special cases of the problem.
Implementation and comparison of different heuristics and standard algorithms.
Final report should contain a section on the outcomes of the experiments done and conclusions drawn about which algorithms seem suitable in different situations.

Suggested Extensions

A more careful theory report, including notions of problem hardness, or running time for the algorithms.
More involved algorithms for the problem.
In depth analysis of the algorithms for specific classes of instances.
Nice analysis options for your problem including an implementation in some Discrete Event Simulation framework
An interactive Web page allowing people to explore and understand your topic using examples and simulation.

Reading

The table of contents of the Handbook of Computational Social Choice is very useful to see what kinds of project you might want to do
This course on Algorithmic Game Theory is worth having a skim over. At least the list of titles.
There is also a good book on Algorithmic Game Theory
Nash equilibrium: The Nash Equilibrium Wiki page is nice as an introduction.
Stable Marriage: The Stable Marriage Wiki page is also nice as an introduction.
Some fun Youtube videos about matching - worth watching even if you do not take this project: video one, or video two.
A more technical introduction to Networks of Complements.
A more technical introduction to Schelling games.
A more technical introduction to Coordination Games on Graphs.
A more technical introduction to Fair Allocation of Indivisible Goods.
A (slightly) more technical introduction to Cake cutting algorithms.
A more technical introduction to Matching Markets

Prerequisites: CS2860 Algorithms and Complexity and a general interest in the mathematics of Computer Science and in algorithms, game theory, and economics. Reasonable programming skills.

Algorithms for facility location problems

Aims: Operational research (also known as operations research) is a field concerned with the use of combinatorial optimisation tools to support decision making (in businesses and more generally). It is the source of many important problems, and a lot of algorithmic work has gone into it.

Background: A prime example of a problem from operational research is the facility location problem. Roughly speaking, in this problem you want to open some number of "facilities" (let's say, factories) to serve customers, and you want to decide how many facilities to open and where to place them - to open too many may be too costly, but if you open too few they end up being too far away from where they are needed (which means higher transportation costs). This is a classical problem in combinatorial optimisation.

The full technical name of the problem this project description is concerned with is uncapacitated facility location (with metric costs). Much of the below can also be done for OR problems other than facility location, but facility location is a problem for which both good benchmarks and interesting algorithms have been developed.

In this project, you will try out some powerful tools for solving such problems (in particular, you will learn to use an LP-solver), and you will implement and compare some of the algorithms which have been proposed.

Prerequisites: CS3490 Combinatorial Optimisation (or similar), and an interest in algorithms.

Early Deliverables

Report: A full description of the facility location problem
Report: P vs NP, NP-completeness, NP-hardness
Report: Strategies for solving NP-complete problems (ex: heuristics, approximations, algorithms for special cases and non-worst case inputs).
Report: Linear programming, LP-relaxations and LP Software packages
Worker threads in a GUI?
Proof of concept program: A simple greedy algorithm
Proof of concept program: A program which solves the LP-relaxation of the problem (by using an LP-solver software package)

Final Deliverables

The program should be able to read a problem description from an input file, in one of the formats used in popular benchmark libraries
The program should have a GUI, which displays a problem instance and can be used to run several different algorithms to solve the instance
The program should support several different algorithms to solve a problem, including at least one heuristic (e.g., a greedy algorithm) and at least one algorithm using the LP-solution as guidance. (It is recommended that one of the algorithms comes from the literature on approximation algorithms; see Extensions.)
The program should also provide the option of showing the cost of the LP solution, and should ideally provide the option of attempting to produce an exact solution via integer programming (so-called MIP).
Since some of the computations can be lengthy (e.g., the MIP solver), the program must allow the user a way to abort an algorithm. Therefore the program must be threaded, so that the user can interact with the GUI despite some algorithm being busy in the background.

Suggested Extensions

A more careful treatment of approximation: A report on "Approximation algorithms and approximation ratios", and an implemention of various types of approximation algorithm
More algorithms and solvers. Examples include such meta-heuristics as local search, simulated annealing, etc.)
Reports on problem variants and extensions (examples: capacitated, multi-level, or the k-center/k-median problem), perhaps including motivations and theoretical properties (NP-hardness, approximation properties)
Careful experiments with running times and result quality for different algorithms and problem instances

Algorithms on Graphs

Aims: To provide a tool for demonstrating algorithms on directed and undirected graphs.

Background: Algorithms on graphs have a lot of applications in various areas of science and technology. One example is search for shortest paths using GPS. Many of the algorithms are quite simple to describe but often harder to implement and even harder to prove their correctness. This project investigates some of these algorithms: connected components in undirected graphs and strongly connected components in directed graphs, distances in directed graphs, optimal vertex colouring of interval graphs, maximum matchings in bipartite graphs, Euler trails, etc. The investigation includes implementation and testing of the algorithms, as well as their animation. Proofs of algorithm correctness is optional, for those with good mathematical background.

Early Deliverables

Report on connected components in undirected graphs and and strongly connected components in directed graphs
Report on distances in directed graphs.
Some pseudo-code for important algorithms
Implementation and testing of some important algorithms.

Final Deliverables

Reports for a wider selection of graph algorithms
Extensive and accurate commented Pseudo-code
Full implementation and testing of algorithms.

Suggested Extensions

Nice User friendly GUI for algorithms.

An extendible clustering package for bioinformatics

Aims: To design and implement an application for clustering different biological data sets with different algorithms using the Java plug-in technology.

Background: Bioinformatics is currently a quickly growing field within the Computer Science ecosystem. The advent of new sequencing technologies along with the high amount of funding devoted to research in life sciences, have produced a huge amount of biological data which waits to be analysed. Clustering is a technique which is often used for such analysis: given a set of elements (protein sequences, gene expression profiles, etc), performing a clustering amounts to finding meaningful groups in the data set.

There are different clustering algorithms, each one with different features: hierarchical/non-hierarchical, overlapping/non-overlapping, graph-based, etc. Also, for different biological datasets, some clustering algorithms are more appropriate than others. Finally, once a clustering has been obtained there are different techniques to assess the validity of the groupings.

In this project a clustering environment for bioinformatics should be developed using the Java language. The environment should be as generic as possible, with a very important feature: it should allow for future extensions. Namely, it should be extensible in three ways:

New biological data types
New clustering algorithms
New techniques for the visualisation/assessment of the clusters

In order to do that, the Java plugin architecture should be used. This will allow other people to write future extensions to this system, without the need of rewriting its kernel. Thus, the project should be carefully designed and should contain enough generality to prevent further modifications and allow any future plugins to be added to it.

The final aim of the project is to produce a solid piece of software which will be made available to the scientific community. It is possible that this work will lead to a publication in a bioinformatics journal.

Prerequisites: Students attempting this project should be good at programming in Java. The student should be interested in Java software development, and know basic software engineering techniques. Also, interest in biological problems is desirable.

Early Deliverables

Proof of concept programs: parsers for biological databases,
Proof of concept programs: implementation of one clustering algorithm,
Proof of concept programs: design of an interface.
Reports: Biological databases.
Reports: Clustering algorithms.
Reports: Design Patterns in Java (mainly plugin architecture).

Final Deliverables

The software should be able to work with at least two kind of biological data (microarray data and sequence data) and two clustering algorithms (for instance k-means and hierarchical clustering).
At least one visualisation/assessment technique should be added (for example, over-representation analysis).
The report should include a biological overview of the problem,
The report should include a detailed description of the implementation of the plugin-architecture,
The report should include a comprehensive documentation on how to extend the program.

Suggested Extensions

Adding the possibility of processing more biological data (PPI data, for example, CSV, etc)
Adding more clustering algorithms, possibly using the Façade pattern with existing implementations (SCSP, Cluster ONE, ...).
Developing cluster visualisation techniques: MDS visualisation, dendrograms, etc.

Analysing jQuery Web-pages as Horn-Clauses (Software Verification)

Aims: Implement a static analysis tool for identifying redundant CSS classes in HTML documents that use jQuery.

Background: Modern websites are dynamic in nature and are built from a tree structure (HTML document) with accompanying Javascript code to provide active content. Manipulating HTML documents directly using Javascript can be a cumbersome task, hence many libraries exist. A popular library is the jQuery library which allows the user to quickly identify elements of the HTML document and add to/update the matched elements.

CSS stylesheets are the industry standard for specifying the look of a website. Each element of the HTML document is assigned a class, and the CSS stylesheet indicates how elements of a given class are to be presented. Since CSS stylesheets often grow large as websites mature, the redundancy problem is to identify CSS classes that will not be used (and thus clutter and confuse the stylesheet). This problem is straightforward for static HTML content, but becomes more difficult when pages are active and may be updated using, for example, jQuery operations.

In recent, as-yet unpublished work, Lin and Ong present a path-insensitive static analysis algorithm for identifying redundant CSS classes in a simplified tree update language that may be used to approximate jQuery.

This analysis algorithm is a fixed-point computation based on the analysis of pushdown systems. Algorithms for pushdown analysis can be encoded and solved using Horn-clauses which may be solved with Microsoft's powerful Z3 solver. (Horn-clauses are similar to Prolog programs.)

The aim of this project will be to implement Lin and Ong's algorithm by translation to Z3. The project will initially focus on analysing the simple HTML language and can be extended to taking as large a fragment as possible of HTML and jQuery as input.

Early Deliverables

Introduction to Horn clauses and their analysis in Z3.
Exposition of the encoding of pushdown systems as Horn-clauses (based on a summary algorithm).
Implementation of pushdown analysis by reduction to Horn-clauses.

Final Deliverables

Encoding of the CSS redundancy problem for the simplified tree language as Horn-clauses.
Implementation of the above encoding.
Benchmarking of implementation on examples derived from jQuery code.
Automatic, implemented translation from some fragment of HTML and jQuery to the CSS redundancy problem for the simplified tree language.

Prerequisites: Not for students doing the MSci Computer Science (Artificial Intelligence)

Analysis of Drive-by-Download Attack Behaviour

Aims: To create an environment to gather data and provide analysis of drive-by-download attacks - visualise and analyse the data. The project includes data cleansing and can apply machine-learning algorithm to the dataset.

Background: A drive-by-download occurs (typically) when a visit to a website results in the download and execution of malicious software (malware), which often results in the host computer becoming compromised and becoming part of a botnet. The drive-by-download starts with a visit to a URL and ends with the (attempted) installation of malware on the host, which then leads to a compromise. Data on this can be acquired using a client honeypot such as Thug [5][6], Capture-BAT (Behavioural Analysis Tool) [4], which are a low interaction and a high interaction system respectively which gather data and collect downloaded executables. Some standard texts on honeypots [3] and optionally machine learning are suggested as background [1][2].

Prerequisites:

Ability to research a new topic, create/use computer systems and design and implement software.
Good programming skills such as python, java, etc and some knowledge of Windows/Linux internals.
Knowledge/interest in data analysis, databases.
Knowledge/interest in machine learning [optional].

Early Deliverables

A report describing client honeypots and their use.
[Optional] Set up a client honeypot system and gather data.
A report on client honeypot data, gathering, processing, enhancing (e.g. gathering other data from VirusTotal etc).
Design of a visualisation, monitoring and data analysis system.

Final Deliverables

Design and develop proof of concept for some or all of a visualisation, monitoring and data analysis system. An example might be a program for extracting features from honeypot log file traces and clustering the resulting data.
Develop an experiment to gather data and answer a research question e.g. to determine whether phishing sites use drive-by-downloads, or to determine whether a sample of current twitter URLs are malicious, etc.
Data analysis and report, e.g. create open source intelligence or show clustering of data and justification of the machine learning methods used and example outputs.
Final report including critical comparison, evaluation and visualisation of results showing the capabilities of the tool using different selected features and techniques.

Reading

[1] Pattern Recognition and Machine Learning, by C.M. Bishop, Springer 2006
[2] Machine learning: a probabilistic perspective, by K. Murphy, MIT Press 2012
[3] Virtual Honeypots: From Botnet Tracking to Intrusion Detection Paperback, by Neils Provos and Thorsten Holz, Wiley, 2007
[4] Seifert, Christian, Ramon Steenson, Ian Welch, Peter Komisarczuk, and Barbara Endicott-Popovsky. "Capture–A behavioral analysis tool for applications and documents." digital investigation 4 (2007): 23-30. http://dfrws.org/sites/default/files/session-files/paper-capture_-_a_tool_for_behavioral_analysis_of_applications_and_documents.pdf
[5] Dell'Aera, Angelo. "Thug: a new low-interaction honeyclient." Recuperado el 11 (2012). http://www.tlc.unipr.it/veltri/sicurezza-2011-2012/Sec-4-Threats-03_Honeyclient.pdf
[6] https//github.com/buffer/thug

Applicability of Conformity Scores for Transfer Learning

Aims: To study applicability of reliable machine learning methods (conformal prediction) for the case when training and testing distributions differs in some aspects (for example, a difference in the level of noise). To find conformity score functions that allow transferability of the training in such cases.

Background: Conformal prediction (CP) framework allows us to make reliable multi-class prediction on assumption that data follow i.i.d. assumption. It can be linked to any underlying algorithm by its core detail called conformity function. If i.i.d. assumption is broken then validity (calibration) properties of the framework are under threat. However, this problem can be resolved in some special cases of transfer learning. The latter term means that the difference between training and test distributions is partial: some of their common properties are kept. As well, it can be expected that validity of CP in such cases remains true for some of the conformity scores and is broken for the others.

Early Deliverables

Study conformal prediction theory and meaning of its validity and efficiency properties.
Pick artificial or natural data examples with various level of difference between the data segments.
Implement Conformal Prediction with few versions of conformity score functions and compare their performance on these data examples.

Final Deliverables

Survey of conformity scores for CP framework based on standard machine learning algorithms.
For one of the algorithms, develop and compare different ways to define conformity score function.
Compare their validity and efficiency on the data examples including the transfer learning challenge.

Suggested Extensions

To try the same for other underlying algorithms, and to formulate general principles behinds safe conformity score functions

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).

Prerequisites: Taking CS5920 Machine Learning is recommended.

Applicability of Conformity Scores for Transfer Learning

Early Deliverables

Study conformal prediction theory and meaning of its validity and efficiency properties.
Pick artificial or natural data examples with various level of difference between the data segments.
Implement Conformal Prediction with few versions of conformity score functions and compare their performance on these data examples.

Final Deliverables

Survey of conformity scores for CP framework based on standard machine learning algorithms.
For one of the algorithms, develop and compare different ways to define conformity score function.
Compare their validity and efficiency on the data examples including the transfer learning challenge.

Suggested Extensions

To try the same for other underlying algorithms, and to formulate general principles behinds safe conformity score functions

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).

Prerequisites: Taking CS5920 Machine Learning is recommended.

Applying Logic in the Real World - Modeling with Ontologies

Aims: Understanding Web Ontology Language (OWL) modelling with an emphasis on its theoretical foundation (within Description Logic); and building a model for a real-world scenario using OWL.

Background:

Logic is a cornerstone of science and human thinking by and large. As such, it is an active subject of research for more than two thousands years (e.g., from the time of Aristotle). The computing revolution of recent decades has brought logic into the realm of real-world applications. One such development is that of Description Logic which underlies the modern modelling language OWL (Web Ontology Language).

OWL ontologies are basically descriptions of certain enterprises, reminiscent to a database modelling of an enterprise; the advantage of ontologies over databases though is the ability to model not only basic individual information but rather semantic connections between entities in the enterprise (e.g., that a certain class of entities is a sub-class of a different class of entities). This give the user the ability to infer *new* logical conclusions from the data (and not only to extract already existing information).

Ontologies are one of the technologies of the Semantic Web and are currently used extensively in domains such as medicine, biology, and in the general area of digital humanities including archives.

The project will offer students the opportunity to learn both the rich and fundamental theory behind ontologies, namely Description Logic (which are weak versions of first-order logic), and the use of ontology frameworks (for example, Protege) in real data sets.

A successful project will (1) Investigate and understand the basic concepts of the logical theory of Description Logic; (2) define a developed and sufficiently rich OWL ontology for a particular chosen domain using the the Protege tool; (3) Show the applicability of the ontology as to store and infer new information from existing one.

Early Deliverables

Project plan with tasks and milestones.
An overview of Description Logic that supports ontological reasoning.
Gathering of requirements from a real user, or justifying such requirements from the perspective of a potential user.
Initial experiments with Protege. And building a preliminary model for the chosen domain.

Final Deliverables

An ontology of the knowledge domain associated with the data set.
A discussion of Description Logic with explicit examples in the ontology constructed.
Investigate inferencing algorithms in Description Logic and their run-time complexity.
At least a partial encoding of the data set in the ontology.
The report will include examples of usage of the system.
The report will include discuss the perceived advantages and disadvantages of using ontology frameworks for representing and querying the chosen data set

Suggested Extensions

Connect the ontology to other existing ontologies in related areas.
User-friendly interfaces for data representation and querying.

Reading

Semantic Web for the Working Ontologist. Dean Allemang and Jim Hendler.
A Description Logic Primer. Markus Krotzsch, Frantiek Simancik, Ian Horrocks.
http://protege.stanford.edu

Artificial Intelligence Planning

Aims: The goal of this project is to either effectively model an application as a planning problem or to improve existing algorithms for automatically finding plans.

Background:

Broadly speaking, planning is the art of thinking before acting. Automated planning is a crucial and well-established field of artificial intelligence and is concerned with the automatic synthesis of a course of action to achieve a set of goals based on a given description of the world. A planner is an instance of a general problem solver: it accepts an abstract description of a problem and returns a solution for it. No knowledge of the domain is provided to the planning, which must be able to adapt to any description of the problem. This description consists of a model of how the world functions, the initial state of the environment and a set of objectives to achieve. The solution of a planning problem, a plan, is a sequence of actions that accomplish the goals when executed. The choice of different ways to describe these fundamental elements, world model, initial state and objectives, has determined the developments of a number of distinct planning paradigms. They range from those addressing fully observable, deterministic, static and discrete environments to those dealing with partially observable or stochastic environments. In all cases, the derivation of a course of action from a model is computational intractable. Hence, the central challenge in planning is to achieve both generality and scalability.

Many problems can be formalised as planning tasks. Prototypical examples are single-player games ("puzzles") such as the Rubik's Cube or the Freecell card game. Real-world applications of automated planning are the generation of plans for robotic and virtual agents, logistic problems, airport ground traffic control, testing protocols for logical flaws and printer control.

Within the area of AI planning, a wide variety of projects are possible, ranging from highly theoretical to practical and implementation-oriented. Projects can focus either on modelling a problem as a planning task (e.g. search and rescue, urban traffic control, internet of things) and then solving it with existing state-of-the-art planners or on devising new algorithms for efficiently searching for plans (e.g. generating new heuristics for speeding up the search for plans).

Early Deliverables

Familiarisation with the modelling languages for automated planning. Show examples of models that describe specific applications and validate them.
Familiarisation with state-of-the-art planning systems. Show manually worked examples of their operation.
For modelling projects, identification of a specific scenario/application to model as a planning task. In depth description of why planning is well suited to model such a scenario/application. Early prototypes of the models and their validation. For algorithm-oriented projects, identification of a specific planning algorithm to improve. In depth description of how the algorithm can be improved. Early prototypes of improved algorithm with relating experiments.

Final Deliverables

For modelling projects, planning model for the chosen application as well as in depth report that describes and motivates the model. Experimental evidence that the problem under consideration can be solved efficiently via planning. Code developed in order to adapt existing planning algorithms to solve the problem under consideration efficiently.
For algorithm-oriented projects, code of the algorithms implemented, experimental evidence that the algorithms work efficiently and in depth report on the theoretical approach behind the algorithms implemented.

Reading

Artificial Intelligence: A Modern Approach (Third edition) by Stuart Russell and Peter Norvig. Chapters 10 and 11.
A Concise Introduction to Models and Methods for Automated Planning by H. Geffner and B. Bonet. Morgan and Claypool, 2013
Automated Planning - Theory and Practice by Malik Ghallab, Dana S. Nau, Paolo Traverso. Elsevier, 2004

Prerequisites: Strong interest towards Artificial Intelligence. Good programming skills. Positive attitude towards learning new modelling languages, algorithms and techniques.

Author Attribution of Binaries with Machine Learning

Aims: Attributing a piece of code to a known author using machine learning

Background: Attributing binaries, whether malicious or benign, is a difficult and time consuming task however, there is an increase demand for this either for attributing cyber attacks or preventing plagiarism. The goal of this project is to use machine learning to predict authorship of binaries. You will use a corpus of open source software either for static or dynamic analysis of binaries to classify or cluster the binaries into authors. The project has several technical parts.

Prerequisites:

Basics of binary analysis, knowledge of Python/C, basic/advanced notions of machine learning.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing techniques/methods to extract features from binaries
A preliminary tool to extract features from binaries (e.g., control-flow graphs)
Creation of a reasonably large dataset composed of malware binaries coming from different authors or groups (e.g., APT)
A preliminary set of tests with machine learning classifiers (using scikit-learn) to classify a subset of the dataset based on the extracted features

Final Deliverables

A thorough analysis of techniques and methods to extract features from binaries
Describe and analyse the techniques and features that have been used during the project to extract and attribute binaries
Describe in detail the design of the attribution system
Test the system on the entire dataset and discuss the results
Suggestions for improving the proposed tool

Reading

https://www.sciencedirect.com/science/article/pii/S1742287614000176
https://arxiv.org/abs/1512.08546
https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/caliskan-islam
https://dl.acm.org/doi/abs/10.1145/3292577
https://link.springer.com/chapter/10.1007/978-3-642-23822-2_10

Automatic Analysis of Android App Privacy Policies

Aims: Analysing app privacy policies to identify aspects that may be missing from the policy but are present on the app.

Background: Privacy policies are complex and difficult to read. This normally results in users accepting them without actually reading them. This can be used app developers to include overeaching statemnts in their policies or to fail to properly inform about the data being collected and how it is going to be processed.

Prerequisites:

Good programming skills, particularly Java, Python (for NLP), and knowledge of Android.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing previous literature in privacy policy analisys
A report describing libraries for natural language processing
A report describing tools for static analysis of Android apps

Final Deliverables

A tool that uses NLP to perform analysis of Android app privacy policies
A detailed description of the tests and evaluation performed on the tool
A detailed account of the results obtained and an analysis of them
A critical comparison with findings from related works

Reading

https://www.nltk.org
https://spacy.io
https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/wang-ruowen
https://www.usenix.org/conference/usenixsecurity18/presentation/harkous
https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/zimmeck
https://frida.re
https://github.com/skylot/jadx

Automatic Detection of Vulnerabilities and Bugs in Interpreted Software

Aims: To implement and evaluate a symbolic execution-based automated test generation system for an interpreted language (Ruby, Perl, Bash, PHP, Lua, ...).

Background:

Symbolic execution allows to test programs and find potentially exploitable bugs fully automatically. The key idea is that a symbolic execution engine runs a program on symbolic inputs that represent all possible input values, instead of testing just individual concrete values. "If" statements over symbolic inputs can then evaluate both ways, and the symbolic execution engine considers both branches in parallel. Not all sequences of branches can occur in reality because their conditions may contradict each other. The engine rules out such cases by querying an automated constraint solver.

There are several symbolic execution engines available for low-level languages like C, but there is still little support for interpreted languages like Ruby, Python, Perl, Bash, PHP, or JavaScript, to name a few.

As shown recently in the "Chef" system (http://dslab.epfl.ch/proj/chef/), symbolic execution engines for interpreted languages can be derived by symbolically executing the language interpreter running the target program. Chef runs the interpreter on top of a low-level engine running an entire virtual machine stack and has been successfully used to develop engines for Python and Lua.

The goal of this project will be to use a symbolic execution engine such as Chef, KLEE, or CREST to develop another symbolic execution engine for a new interpreted language, and to run a series of experiments to evaluate its efficiency and effectiveness. In the first phase, you will experiment with at least two engines and decide which one to build on for the remainder of the project.

Early Deliverables

Experiments: Set up of an existing symbolic execution engine and experiments with example code
Code: Test infrastructure for symbolically executing the target interpreter
Report: Mechanics of symbolic execution and state of the art
Report: Comparison of two symbolic engines and the results from running them on the interpreter
Report: Overview of the implementation of the target interpreter

Final Deliverables

Code: A mature testing system and performance optimisations for symbolically executing an interpreted language
Report: An evaluation of its efficiency and effectiveness on a small set of benchmarks
Report: Process for developing the new engine, including interpreter modifications
Report: Testing infrastructure and libraries for the target language
Report: Responsible disclosure of vulnerabilities
Development and evaluation of high-level assertions of program correctness that allow to find concrete vulnerabilities and classes of bugs relevant to the target programs under test and/or the entire target interpreted language.

Suggested Extensions

Novel heuristics for improving performance and test coverage.
Detailed evaluation of the effects of different combinations of optimisations.

Reading

Bucur et al.: Prototyping Symbolic Execution Engines for Interpreted Languages. Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Salt Lake City, UT, March 2014
Cadar and Sen: Symbolic execution for software testing: three decades later. Communications of the ACM, 56:2, February 2013

Prerequisites:

CS2821 Systems Programming

CS3470 Compilers and Code Generation or equivalent prior knowledge.

Good programming skills and ability to quickly understand a new language.

Knowledge of C and willingness to learn basic C++.

Benchmarking of homomorphic encryption libraries

Aims: To implement the homomorphic evaluation of a chosen function, relevant for a certain application area, in two homomorphic encryption libraries, to compare their performance.

Background: A homomorphic encryption scheme is a type of encryption scheme that enables the evaluation of a function on encrypted data, without requiring access to the secret key. Several open-source libraries exist today that implement one or more homomorphic encryption schemes. These implementations can be reasonably performant for low-depth evaluations. This project will compare different libraries to understand for which functions a homomorphic evaluation can be considered as feasible today, and using which library that evaluation would be more efficient.

Prerequisites:

Experience with C++
Being able to autonomously download/compile/install libraries from various repositories (e.g., GitHub).

Early Deliverables

A report describing possible applications of homomorphic encryption
A report describing one homomorphic encryption scheme
A performance comparison of two homomorphic encryption libraries implementing that scheme for a simple homomorphic evaluation, e.g. weighted average on a small dataset

Final Deliverables

Describe commonly-used homomorphic encryption schemes and their implementations in open-source libraries
Explain how scheme / libraries parameters should be set to optimise performance
Compare performance of two libraries for a more involved evaluation, using one or more advanced techniques such as batching or bootstrapping
Based on implementation experience, evaluate suitability of different libraries for a given application

Reading

https://homomorphicencryption.org
http://www.humangenomeprivacy.org/2019/
https://github.com/Microsoft/SEAL
https://github.com/homenc/HElib
https://eprint.iacr.org/2016/421
https://eprint.iacr.org/2012/144

Blockchain consensus algorithms

Aims: Gain a thorough understanding of distributed consensus algorithms by building your own Blockchain.

Background:

High-performance consensus algorithms are a fundamental building block underpinning the distributed infrastructure of many Internet Services. Single-shot consensus algorithms allow a set nodes in a distributed system to agree on a single value. In practice however, real-world applications often require multi-shot consensus, i.e. agreement on an ever-growing sequence of values. For example, in Blockchain applications, participating nodes continually seek agreement on the next block of transactions to append to the Blockchain. Well-known multi-shot consensus algorithms include (Multi-) Paxos[1][2] and Raft[3] for the case where nodes fail by crashing. There is currently also considerable interest in Byzantine-fault tolerant multi-shot consensus algorithms [4][5][6] (e.g. to support 'permissioned' crypto-currencies such as Facebook's Libra/Diem or as a building block in recent 'permissionless' crypto-currencies). In this project you will build your own Blockchain using the Elixir distributed programming language and evaluate its performance. The key challenge will be to implement a suitable multi-shot consensus algorithm. For this, you may base your design on existing algorithms from the literature, or propose your own after discussion with your supervisor. You are free to choose between existing crash-fault tolerant (e.g. Raft/Multi-Paxos), or Byzantine fault-tolerant algorithms (e.g. Streamlet, PBFT).

Early Deliverables

A report giving an overview of prominent distributed multi-shot consensus algorithms.
A report describing consensus algorithms in the context of prominent permissioned and/or permissionless Blockchains.
A proof-of-concept program implementing a simple (possibly) inefficient multi-shot consensus algorithm

Final Deliverables

A permissioned Blockchain that is resilient to crash and/or Byzantine failures and achieves a high transaction throughput.
A report describing the architecture of the implemented Blockchain
A report on the performed experiments and performance evaluation results

Suggested Extensions

Support for nodes joining and leaving the system.
An integration with the Jepsen distributed testing framework.

Reading

Paxos made simple, L. Lamport, ACM SIGACT News, December 2001, https://www.microsoft.com/en-us/research/publication/paxos-made-simple
The Part-Time parliament, L. Lamport, ACM TOCS '98: https://www.microsoft.com/en-us/research/publication/part-time-parliament/
Lecture notes on Leader-based Sequence Paxos - an understandable sequence consensus algorithm, S. Haridi et al, https://arxiv.org/pdf/2008.13456.pdf
In search of an understandable consensus algorithm, D. Ongaro et al, USENIX ATC '14 https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
Streamlet: Textbook Streamlined Blockchains, B. Chan and E. Shi, https://eprint.iacr.org/2020/088
Practical Byzantine fault-tolerance, M. Castro and B. Liskov, OSDI '99: http://pmg.csail.mit.edu/papers/osdi99.pdf

Prerequisites: CS5860; Understanding of the basic concepts of distributed systems. Distributed programming.

Building a Game

Aims: To understand the game development ecosystem and to build and deploy a playable game.

Background:

Games development is one of the largest parts of the computing industry. There are many platforms for writing games and indeed, many architectures and frameworks.

In this project you will build and deploy a game.

In the first term you will research existing games and their underlying technology platforms. You will also look at game building tools such as Unity3D. You will consider game frameworks such as PyGame and also investigate APIs for game development such as Vulkan.

You will need to be familiar with the design patterns required for gaming: Flyweight, publish/subscribe (for message passing), Object Pool, Service, Dirty Flag, Factory, Visitor, Singleton, Game Loop, State, Observer, Command, Event Queue (See perhaps Game Programming Patterns by Robert Nystrom. Also consider reading some of the material on The Game Designs Patterns Wiki

Your target could be a tower defence game written in Python in which case you will need to understand Threading and Graphics in Python.

Perhaps you will use Blender and other tools and do some maths and produce an excellent Unity game.

Maybe you will program a platform game on an Android device, using the sensors and learning about the Android life cycle, Manifesto, and the mobile approach to gaming.

Early Deliverables

A simple program displaying an animation using sprites or similar animation technology.
A program displaying a simple, interactive user interface. You might want to explore the different means of displaying and writing user interfaces (XML, Object Orientated, Immediate Mode).
3D demo using an API
A program that exhibits some 3D graphics using a relatively low level graphics API (Vulkan, OpenGL, WebGL, JavaFX etc. This might provide a starting point for choosing technology: https://en.wikipedia.org/wiki/List_of_3D_graphics_libraries)
A simple game using a game engine / library / framework. Think of it as an opportunity to learn about the technologies you'll use for your main game but keep it simple.
A report on design patterns in games, an enumeration of commons ones, the problems they solve and perhaps discuss some specific instances of their use.
A report on the specific technologies you've tried / intend to use for your game. You should do deep on you intended tech, show as deep an understanding as you can.
A report on animation techniques, enumerate them, perhaps tell a history of them, compare and contrast. The goal is to show a deep understanding of how animation in games is achieved.
A report on Threading an User Interfaces. What problems do games experience with user interfaces without threading, how does multi threading solve these problems and what different ways can this parallelism or concurrency be achieved? Perhaps review how it has actually been achieve in technologies or platform relevant to your game.
A report on designing games with graphical user interfaces for multiple platforms.
A report on the technologies used to write and draw graphical user interfaces.

Final Deliverables

Game Source Code: You should use a consistent architectural style. Our suggestion is object orientation but you could opt for another with sufficient justification.
Game Source Code: The game needs to have at least 3 'screens' or pages that are interactive using the GUI technology of choice.
Game Source Code: You should publish your game! (Android Store, itch.io)
Game Source Code: The game play should make use of animation.
The Final Report: An in depth review of background materials you used to learn the concepts and technologies used to write the game, perhaps presenting a historical timeline of those concepts / technologies.
The Final Report: A description of the more difficult concepts involved in writing graphical games e.g. Multi Threading and threads / processes, event based systems (handlers, events), special consideration for the environment your game with run in (memory foot print, battery usage).
The Final Report: A description of the software engineering processes involved in writing your game.
The Final Report: A description of interesting techniques, hacks or data structures you used (or could have used) in writing your game.

Suggested Extensions

Use of AI, machine learning on the results of the simulation to improve performance.
Use of a camera or bluetooth or other libraries.
High Score stored securely off-phone to enable competition
The goal here is to push the envelop of the project, expand the breadth of technologies and concepts you'd need to learn to complete the game.
Perhaps: Using complex game AI, perhaps also using machine learning to improve the performance of that AI.
Perhaps: Using peripheral devices for interaction with the game.
Perhaps: Enabling multiplayer modes with networking, perhaps through scoreboards or other means.

Reading

Mario Zechner. Beginning Android Games. 2011 - Springer.
http://www.gameprogrammingpatterns.com/contents.html - Game Programming Patterns by Robewrt Nystrom

CS: Aggregating Algorithm

Background:

The project is recommended for Computational Finance students.

Early Deliverables

A simple proof-of-concept implementation of the aggregating algorithm and fixed share algorithm.
Report: An overview of the problem of prediction with expert advice and the work of the aggregating algorithm.
Report: The performance of the aggregating algorithm on a small artificial dataset.

Final Deliverables

The program will work with a real dataset, read the data from a file, preprocess it, and use a standard algorithm to generate experts' predictions (if necessary).
The program will implement aggregating algorithm plus possibly fixed share and the aggregating algorithm for sleeping experts. The results will be visualised.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions.

Suggested Extensions

Application of sleeping experts for predicting the implied volatility of options.
Experiments with the weak aggregating algorithm.

Reading

Y.Kalnishkan, The AA And Laissez-Faire Investment http://onlineprediction.net/?n=Main.TheAAAndLaissez-FaireInvestment
Kalnishkan, Y. Prediction with expert advice for a finite number of experts: A practical introduction. Pattern Recognition, 126 (2022).
Vovk, V. Competitive on‐line statistics. International Statistical Review, 69(2), 213-248 (2001).
M.Herbster and M.Warmuth, Tracking the Best Expert, Machine Learning, 32, 151-178 (1998)
Y.Kalnishkan and M.V.Vyugin. The weak aggregating algorithm and weak mixability. Journal of Computer and System Sciences, 74(8): 1228--1244 (2008)

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5200 (On-line Learning) is highly recommended.

CS: Applicability of Conformity Scores for Transfer Learning

Early Deliverables

Study conformal prediction theory and meaning of its validity and efficiency properties.
Pick artificial or natural data examples with various level of difference between the data segments.
Implement Conformal Prediction with few versions of conformity score functions and compare their performance on these data examples.

Final Deliverables

Survey of conformity scores for CP framework based on standard machine learning algorithms.
For one of the algorithms, develop and compare different ways to define conformity score function.
Compare their validity and efficiency on the data examples including the transfer learning challenge.

Suggested Extensions

To try the same for other underlying algorithms, and to formulate general principles behinds safe conformity score functions

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).

Prerequisites: Taking CS5920 Machine Learning is recommended.

CS: Blockchain consensus algorithms

Aims: Gain a thorough understanding of distributed consensus algorithms by building your own Blockchain.

Background:

Early Deliverables

A report giving an overview of prominent distributed multi-shot consensus algorithms.
A report describing consensus algorithms in the context of prominent permissioned and/or permissionless Blockchains.
A proof-of-concept program implementing a simple (possibly) inefficient multi-shot consensus algorithm

Final Deliverables

A permissioned Blockchain that is resilient to crash and/or Byzantine failures and achieves a high transaction throughput.
A report describing the architecture of the implemented Blockchain
A report on the performed experiments and performance evaluation results

Suggested Extensions

Support for nodes joining and leaving the system.
An integration with the Jepsen distributed testing framework.

Reading

Paxos made simple, L. Lamport, ACM SIGACT News, December 2001, https://www.microsoft.com/en-us/research/publication/paxos-made-simple
The Part-Time parliament, L. Lamport, ACM TOCS '98: https://www.microsoft.com/en-us/research/publication/part-time-parliament/
Lecture notes on Leader-based Sequence Paxos - an understandable sequence consensus algorithm, S. Haridi et al, https://arxiv.org/pdf/2008.13456.pdf
In search of an understandable consensus algorithm, D. Ongaro et al, USENIX ATC '14 https://www.usenix.org/conference/atc14/technical-sessions/presentation/ongaro
Streamlet: Textbook Streamlined Blockchains, B. Chan and E. Shi, https://eprint.iacr.org/2020/088
Practical Byzantine fault-tolerance, M. Castro and B. Liskov, OSDI '99: http://pmg.csail.mit.edu/papers/osdi99.pdf

Prerequisites: CS5860; Understanding of the basic concepts of distributed systems. Distributed programming.

CS: Clustering and Generative Modelling

Aims: To implement various clustering and generative modeling methods, and to compare them on both artificial and real world data.

Background: Often – perhaps even usually – one believes that items of data can be meaningfully divided into similar groups, or ‘clusters’. Very many techniques have been developed for finding such clusters. The more sophisticated methods fit a generative probability model to the data, and then identify clusters with the most probable values of certain latent variables.

This project is to implement and compare a variety of clustering methods on some artificial and real data.

Early Deliverables

Proof of concept program: Gaussian mixture model applied to a small artificial dataset.
A report giving an overview of clustering and describing three different clustering methods.
A report describing and assessing the results of applying the proof of concept program to data.

Final Deliverables

The program(s) will work with a real dataset(s).
You will generate and compare different clusterings or generative models using several different methods, such as: kernel-based semi-supervised clustering, spectral learning, factor analysis, tree-based methods, or other comparable methods.
You will compare and visualize the clusterings produced.
The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

Suggested Extensions

There is a very wide range of clustering and generative-modelling algorithms in standard use, and an even wider variety of methods of fitting them to data. This would be an opportunity to implement several different methods and get to understand them by using them in practice.

Reading

Pattern Recognition and Machine Learning, by C.M. Bishop, Springer 2006
Machine learning: a probabilistic perspective”, by K. Murphy, MIT Press 2012
S. Basu, A. Bamerjee and R. Mooney, "Semi-supervised clustering by seeding", Proceedings of the 19th International Conference on Machine Learning, pp.19-26, 2002.

Prerequisites: CS5920 (Computer Learning) would be advantageous (but not required). This project will require some mathematical understanding; it is not simply programming.

CS: Conformal Anomaly Discovery

Aims: Write programs implementing Conformal Predictors with various underlying algorithms such as Support Vector Machines and apply them to a several datasets to detect anomalies in the data.

Background:

Machine Learning algorithms have been successfully used in medicine, pharmaceutical industry and many other fields to handle a large set of data in order to find important patterns and as well as outliers and anomalies.

Recently developed conformal predictors can be used to narrow down the search space and to improve accuracy of classification.

Early Deliverables

A review and a report of the approach in the case of classification and regression.
A design of the system and its extension for conformal predictors.
Implementation of a pilot system and some preliminary experiments.

Final Deliverables

Implementation of the system, experiments with various parameters.
Experimenting with different underlying ML algorithms.
It is desirable that the final system will have a graphical user interface.
The final report will describe the theory, the implementation issues, computational experiments with different data sets and parameters.

Suggested Extensions

Online learning with conformal predictors.

Reading

V.Vovk, A.Gammerman, G.Shafer. "Algorithmic Learning in a Random World", Springer, 2005.

Prerequisites: Good programming skills in Phyton, Java, MATLAB, or R. Excellent knowledge of linear algebra. Taking CS5920 (or CS3920) Maching Learning is an advantage.

CS: Conformal Prediction in Data Editing: Error Correction

Aims: To develop an application of Conformal Prediction to detection and correction of possible mistakes in a data set.

Background: Conformal Prediction is a framework for reliable machine learning, that allows to supply predictions with a measure of confidence. Having a new object, each of the hypotheses about its label is tested on conformity to the training data and rejected if its p-value falls below the threshold (the significance level). Using the same technique for data editing is open for research. The principal idea is transparent. If a new object is suspected to contain a mistake it its record, it is possible to check the hypotheses about the alternative values of its different features, like this is done for the label only in the standard CP. The validity property of CP framework becomes a guarantee against a `false alarm' i.e. offering an unneeded correction. This becomes more challenging if a record may contain more than one mistake. That raises theoretical questions, for example: how to create a scheme of stepwise (iterative) data correction that keeps validity guarantees on any step (not just on the first one), and stops when there are no more convincing correction offers? So, this project can be explored at different levels of theoretical depth and practical complexity.

Early Deliverables

Find and choose a data set
Implement a relevant version of Conformal Prediction algorithm
Impute some mistakes into the data, apply the algorithm, and observe the results.

Final Deliverables

Implement a Conformal Prediction with an additional function of error detection.
Compare different versions of Conformal Prediction by their efficiency for data correction.
Analyse and report the output.

Suggested Extensions

Propose, justify theoretically, and implement a scheme of multiple error correction.
Compare how the approach works in different (on-line, batch, cross-validation, inductive) modes.

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).
Mushroom Data Set. https://archive.ics.uci.edu/ml/datasets/Mushroom

Prerequisites: Taking CS5920 Machine Learning is recommended.

CS: Conformal Prediction of Multi-Class Structured Label Spaces

Aims: To select and analyse a dataset for classification with a large (20-200) number of classes, and to implement a method of reliable machine learning (conformal prediction) on it. The implementation should make use of the structure of the label space (for example, semantic similarity metric) by developing special conformity score functions taking it into account.

Background: Multi-class and structured label spaces is a special machine learning challenge, which is identical neither to the classification nor to the regression. The examples are classifications of images, detection of materials, medical diagnostic and others. The structure on the label space may naturally appear from the additional information in a concrete application, or recovered by data analysis. Applying methods of reliable machine learning allows one to give output in a flexible multi-class form.

Early Deliverables

Study data sets and identify one relevant for the project, describe it and the structure of its label space.
Select a basic machine learning approach applicable to this data.
Develop and implement a multi-class version of this approach and get some initial results on the data

Final Deliverables

Survey of the data and machine learning techniques.
Develop a conformity score functions for data with structured label space.
Implement a conformal prediction algorithm applicable to the data.
Analysis of the output.

Suggested Extensions

Compare results with different basic machine learning approaches.
Compare original label space structure attached to the data against its automatic structuring by means of data analysis.

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).
Predicting structured objects with support vector machines by T.Joachims, T.Hofmann, Y.Yue, C.N.Yu. Communications of the ACM 52 (11), 97--104

Prerequisites: Taking CS5920 Machine Learning would is recommended.

CS: Conformal predictors for detecting harmful respiratory events

Aims: Implementing several classification-based conformal predictors to detect coughing and sneezing sounds, and comparing their performance validity and efficiency on the recently released ESC dataset.

Background:

Non-speech body sounds recognition had not received major attentions, despite containing valuable information about the person's health conditions.

This project aims at detecting the two most popular infectious events, coughs and sneezes, from raw recordings.

The challenge for such objective is the variety of coughs and sneezes, amongst other non-speech body sounds. Hence, the confidence measure of conformal prediction would be useful in such scenario.

This project is recommended for CS5920 students, who have successfully implemented conformal prediction in the first assignment, and would like to further study its performance on real world problems.

Early Deliverables

Producing a simple proof-of-concept implementation of classification-based conformal prediction with different underlying algorithms (e.g. SVM, k-NN, Random Forests, etc.) from scratch.
A report of the empirical results of such conformal predictors on a small synthetic dataset.

Final Deliverables

A detailed literature review of existing machine learning algorithms used for non-speech sound recognition.
Producing a detailed report of the dataset structure, including some visual analysis (e.g. t-SNE, UMAP, etc.).
Implementing conformal predictors on the whole dataset, optimising the underlying algorithms' parameters.
Carefully analysing the p-values, comparing the validity, efficiency, accuracy of the implemented conformal predictors.

Suggested Extensions

Comparing different methods of extracting sound features (e.g. MFCC, CFA, ZCR) from raw recordings, and their impacts on the prediction performance.
Using the entire dataset to train a new model, then testing such model on other real world recordings.

Reading

Nguyen, K. A. and Luo, Z. (2018). Cover Your Cough: Detection of Respiratory Events with Confidence Using a Smartwatch. Proceedings of Machine Learning Research, 91, 1-18.
Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 1015-1018). ACM.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YDEPUT

Prerequisites: Taking CS5920 Machine Learning is required. Taking CS5100 Data Analsysis would be useful, but is not compulsory. Having good knowledge of Matlab or Python is essential.

CS: Conformal predictors in identifying protein sequences

Aims: To assess the viability of conformal prediction in understanding the accuracy of tools in identifying protein sequences with particular protein families.

Background:

An important task in Bioinformatics is identifying if a protein sequence is part of a particular protein family. This is typically carried out using tools such as Hidden Markox Models (HMM's).

The difficulty with such models is that while they produce a probabilistic likelihood of a hit, the probabilities are based on a number of modelled parameters. Typically, such probabilities generically generate incredibly small p-values and yet some suspician remains that these methods are not robust as the results suggest.

Conformal predictors suggets an alternative to estimating such probabilities by moving away from the paradigm on ROC curves etc. and attempt to create probabilities based on the models themselves.

Early Deliverables

A report on Hidden Markov Model (HMM) tools in Bioinformatics.
A report on conformal predictors.
A simple implementation of conformal prediction on a family of proteins from PFAM.

Final Deliverables

A framework to evaluate the credibility and confidence of individual HMM's from PFAM.

Suggested Extensions

Using conformal predictors on HMM models in protein sequence prediction.

Reading

The Pfam protein families database. Finn RD1, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. doi: 10.1093/nar/gkp985.
http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

Prerequisites:

Attended CS5100 and be likely to get a distinction in the module.

The student must have an excellent grasp of conformal predictors, Python and Unix systems. In the latter they must be able to build simple workflows of Bioinformatics tools. The student must speak with the advisor in question before applying for this project.

CS: Cross-conformal Prediction with Neural Networks

Aims: Write a program (or programs) implementing a cross-conformal predictor based on Neural Networks.

Background: Neural Networks were among the topics of the CS5100, CS5920, and CS5950 courses. Cross-conformal prediction was briefly covered in CS5920. This project is in the intersection of these two areas, and will give the student a chance to delve deeper into these topics.

Early Deliverables

A proof-of-concept program implementing neural nets both in the case of regression and in the case of classification (where neural networks output probability predictions).
Building a cross-conformal predictor on top of neural nets and exploring its empirical validity.
A preliminary empirical study of the efficiency of the student's implementation and standard implementations (such as the one in scikit-learn).

Final Deliverables

Careful cross-validation of various parameters, including the stopping rule, learning rate, and the numbers of hidden neurons at various levels.
Implementing regularization.
Experimenting with different architectures, possibly including deep architectures.
Ideally, the final product will have a graphical user interface.
The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Careful study of the optimal number of folds.
Replacing back-propagation with other optimization algorithms.
Implementing jackknife+, a popular recent version of cross-conformal prediction.

Reading

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Chapter 11.
Mitchell, Machine Learning, 1997, Chapter 4.
Research papers.

Prerequisites: Python, R, MATLAB, Java, or C++ programming skills. Good knowledge of calculus, especially differentiation.

CS: Deep Learning

Aims: To implement a deep learning algorithm on chosen data, and to improve its performance with architecture search and optimisation of hyperparameters

Background: Within the last 10 years, a great variety of "deep learning" architectures have been proposed, and some have achieved remarkable improvements in performance as compared to previous methods.

This project is to implement and assess a deep learning algorithm - typically for a feed-forward network - and to perform visualisations and experiments to analyse and improve its performance

Early Deliverables

Proof of concept program: Implementation of a deep learning algorithm for a relatively small network on a relatively simple dataset, together with experiments determining the effect of hyperparameters on convergence
A report giving an overview of a range of related deep learning methods.
A report describing and assessing the performance of the initial implementation and any experiments performed

Final Deliverables

The program(s) will work with a real dataset(s).
You will implement and compare multiple architectures and optimisation methods
You will visualize and assess the performance of the algorithms produced.
The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

Suggested Extensions

This would be an opportunity to implement several deep learning methods and get to understand them in practice by applying them to a significant dataset.

Reading

"Deep Learning", by Goodfellow, Bengio, and Courville, MIT Press 2016
"Deep Learning with Python" by Francois Chollet, Manning Publications 2018

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

This project will require some mathematical understanding; it is not simply programming.

CS: Ensemble methods

Aims: Write a program (or programs) implementing various ensemble methods (such as bagging and boosting) together with some underlying algorithms.

Background: Ensemble methods were one of the topics of the CS5100 course. Assignment 3 only scratched the surface of this important area, and this project will give the student a chance to delve deeper into this topic.

Early Deliverables

A proof-of-concept implementation of bagging, random forests, or boosting both in the case of regression and in the case of classification.
Implementing underlying algorithms, such as decision stumps or decision trees; simple proof-of-concept implementations are sufficient.

Final Deliverables

A final program implementing bagging, random forests, or boosting (or all three) with a careful choice of the parameters.
Implementations of the underlying algorithms should be complemented by a careful empirical study of the effect of various parameters; producing optimized programs.
Comparing the performance of different ensemble techniques in the problems of classification and regression.
Using bagged and boosted decision trees as probability predictors; evaluating their perfomance using proper loss functions.
Ideally, the final product will have a graphical user interface.
The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Implementing the out-of-bag estimate of error probability.
Investigating variable importance.
Proximity plots.
Theoretical analysis of the algorithms.

Reading

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Section 8.7, Chapters 10 and 15-16
James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning, 2013, Chapter 8.
Research papers.

Prerequisites: Taking CS5100 Data Analysis. Python, R, MATLAB, Java, or C++ programming skills.

CS: Exploring the graph properties of the DataCite PID graph

Aims: To build a tool that will make GraphQL queries to the datacite PID graph and to explore its graph properties.

Background:

DataCite is a leading global non-profit organisation that provides persistent identifiers (DOIs) for research data and other research outputs. Persistent identifiers (PIDs) are not only important to uniquely identify a publication, dataset, or person, but the metadata for these persistent identifiers can provide unambiguous linking between persistent identifiers of the same type, e.g. journal articles citing other journal articles, or of different types, e.g. linking a researcher and the datasets they produced. DataCite has constructed a PID graph that relates the different types of PID into a graph that can be queried using a language called GraphQL.

A variety of graph properties can be tested with this data. In particular there has been interest in the behaviour of the fraction of nodes with degree k following a pattern. Scale-free networks where this fraction is a power law in k generated significant interest though it is apparent these are less common than throught previously.

This project will require the student to examine the behaviour of the degree of the PID graph.

Early Deliverables

A report on graphs and scale-free networks.
A report on the PID graph.
A report on GraphQL.

Final Deliverables

The development of a tool to query the PID graph and visualise graphs.
Identifying the behaviour of the degree of the PID graph.

Suggested Extensions

Exploring other graph obseravbles e.g. centrality.
Exploring using a recommender system to identify papers of interest.

Reading

Introducing the PID Graph - https://blog.datacite.org/introducing-the-pid-graph/
The Fullstack Introduction for GraphQL - https://www.howtographql.com/
Learning GraphQL, E.Porcello and A. Banks O'Reilly Media (available online through Safari Books)
Handbook of Graph Theory, J.L. Gross, J. Yellen and P. Zhang, Chapman and Hall/C (available online through Safari Books)
Scale-free networks are rare, A.D. Broido and A. Clauset, Nature Comms. (2019) DOI - 10.1038/s41467-019-08746-5

Prerequisites:

Understanding of or willingness to study graph theory.

The student must have a good understanding of Python or a similar tool.

CS: Handwritten Greek characters recognition

Aims: Implementing several learning algorithms to recognise the ancient Greek characters. A recent Greek polytonic database with ground-truth information will be provided for training and testing.

Background:

Machine recognition of Greek characters in images is a challenging process, because of the large number of more than 270 character classes, and the similarity of the diacritic marks in each character.

Until today, Greek text search facility is far from perfect. Hence, it is interesting to apply machine learning techniques and compare their accuracies for such task.

Early Deliverables

Proof of concept program: identify one machine learning task suitable for characters recognition and implement one learning algorithm for the chosen task.
Report: An overview of the datasets. Statistics and numbers involved.
Report: A report on machine learning tasks for character recognition.

Final Deliverables

Survey of machine learning techniques used for characters recognition.
The developed machine learning software for Greek characters recognition.
The software engineering process involved in generating your software.

Suggested Extensions

Provide predictions with confidence measure.

Reading

Gatos, Basilis, et al. GRPOLY-DB: An old Greek polytonic document image database. Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015.
http://www.iit.demokritos.gr/~nstam/GRPOLY-DB

Prerequisites: Taking CS5920 Machine Learning would be useful.

CS: Housing Prices

Aims: The aim of the project is to apply multiple machine learning techniques for predicting housing prices and compare their performance.

Background:

The problem of working out the price of a house using its characteristics has been studied intensively over past decades. Many hosing price datasets are available on the Internet and many machine learning techniques have been applied to them.

Boston housing dataset has been a playground for statistics and machine learning methods since 1970s. A more recent California Housing Prices dataset dates to 1990s. In 2010s Ames dataset was put forward: it is based on transactions and contains dates as part of individual records.

The first task is to work out house prices in a batch setup when the time is not available (or ignored). We can compare learning methods and study the effect of parameters on the test error.

Secondly, a study of house prices in the online mode can be undertaken. Here datapoints become available one by one: knowing that a house is about to be sold, we predict the sales price, then we see the actual price in the transaction, then we make a prediction for the price in the next sale etc. We can use the data with timestamps to study inflation. Do prices grow with time? How fast?

The project is recommended for Computational Finance students.

Early Deliverables

Proof of concept program pre-processing the data and applying two machine learning methods to Boston Housing dataset
Report: A report describing supervised learning, the concepts 'training set' and 'test set', and the theory of the methods.

Final Deliverables

A machine learning method (such as Ridge Regression; the choice is by discussion with the supervisor) should be implemented from scratch.
Several different datasets should be used
Tests comparing different kernels, parameters, etc should be performed. The results should be visualised.
Prediction in the online mode should be implemented and results visuaised.
The report will describe different algorithms and implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions

Suggested Extensions

Methods of prediction with expert advice can be applied in the online mode.
Is there inflation? A study can be undertaken.

Reading

Boston housing dataset on Kaggle: https://www.kaggle.com/prasadperera/the-boston-housing-dataset
California housing dataset on Kaggle: https://www.kaggle.com/camnugent/california-housing-prices
De Cock, Dean. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19.3 (2011).
Y.Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
J.Shawe-Taylor and N.Cristianini, Kernel Methods for Pattern Analysis Cambridge University Press, 2004
B.Schlkopf and A.J.Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5920 (Machine Learning) would be an advantage (but is not required). Taking CS5200 (Online Machine learning) will add an interesting perspective to the project.

CS: Identifying similarities between text and the Gene Ontology

Aims: To build a tool that will identify similarities between text and the descriptions in the Gene Ontology.

Background:

Molecular Biology is now generating colossal amounts of data. In particular, there are a variety of technologies that can scan entire genomes or transcriptomes and determine the likelihood of their association with a particular Biological process. However such results are very noisy and suffer from a "large p, small n" effect, namely that the ratio of possible degrees of freedom, p to the number of independent observations is much greater than 1 and hence suffer from very large false positive rates.

On the other hand, the very extensive body of biomedical literature could be used to filter out spurious genes from the data before undergoing analysis. Tools such as GOCat mine the literature to relate Gene Ontology terms to specific search terms representing Biological processes. This specific tool is no longer functional and an alternative needs to be provided.

This project will require the student to develop an equivalent of GOCat.

Early Deliverables

A report on Gene Ontologies.
A report on the tool GOCat explaining the methods.
A report on the data set required to build the tool.

Final Deliverables

The development of a stand alone tool to implement the equivalent of GOCat.
An API to query the tool.
Rigourous testing of the tool.

Suggested Extensions

Using more advanced methods beyond kNN for the implementation.
An API to query the tool.

Reading

Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Gobeill J Pasche E Vishnyakova D Ruch P DOI 10.1093/database/bat041
A survey on annotation tools for the biomedical literature. Neves M Leser U DOI 10.1093/bib/bbs084
The GOA database in 2009--an integrated Gene Ontology Annotation resource Barrell D Dimmer E Huntley R Binns D O'Donovan C Apweiler R dx.doi.org/10.1093/nar/gkn803

Prerequisites:

Understanding of machine learning methods such as kNN.

The student must have a good understanding of R and/or Python.

CS: Image and video classification

Aims: To implement object detection, image and video classification using deep learning and machine learning techniques.

Background: Deep learning and transfer learning show superior capabilities of tackling computer vision and image processing problems. Examples include the proposal and adoption of a variety of deep architectures for image generation, segmentation, detection, and classification. This project enables the exploration of any application of your choice to tackle a computer vision problem.

Early Deliverables

Implementation of transfer learning algorithms based on pre-trained networks for any of the following tasks, (1) object detection/classification, (2) image segmentation, and (3) video classification.
A report describing transfer learning algorithms implemented with some evaluation results.

Final Deliverables

Implementing transfer learning, deep learning (e.g. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)), or hybrid deep networks for any of the aforementioned tasks.
Implementing several deep learning algorithms with different architectures, where the employed architectures are either existing, or newly proposed.
Implementing manual hyper-parameter fine-tuning of the implemented deep networks.
A comprehensive evaluation and comparison against existing state-of-the-art studies.
The final report will describe methodology, implementation and testing with theoretical analysis and detailed evaluation results.

Suggested Extensions

Implementing deep hybrid networks with new architectures.
Implementing optimal hyper-parameter selection using an automatic process.
Theoretical analysis of the algorithms.

Reading

Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning. MIT press.
Chollet, F., 2021. Deep learning with Python.
Journal/conference papers.

Prerequisites: Take CS5100 and CS5950 is compulsory. Taking CS5920 is highly recommended, but not compulsory. Programming skills on any of the following languages, i.e. Python, MATLAB, Java, C++, or R, are required.

CS: Joint prediction regions for time-series models

Aims: The goal of this project is to develop and evaluate methods to construct joint prediction regions for time-series models.

Background:

A prediction region is a region that is guaranteed to include a prediction with given probability. A joint prediction region (JPR) is an "envelope" around the predicted future path (a sequence of predictions) guaranteeing that the true future path is included in the JPR with some probability. A similar criterion is the k-family-wise error rate (k-FWE), which bounds the probability that at least k points in path are not included in the JPR.

In a recent paper, Wolf and Wunderli proposed a bootstrap-based method to derive JPRs for time-series prediction (see reference below) and evaluated the method on autoregressive models and compared it with different techniques to produce JPR. In this project, you will implement the bootstrap JPR method, and evaluate it on an application of your choice or one suggested by the supervisor.

From a methodological viewpoint, the key steps of the project will be:

Learn a time-series predictor using a dataset of choice
Devise a method to compute prediction standard errors for your model
Implement the bootstrap algorithm of Wolf and Wunderli, using the above two models.

Early Deliverables

Report 1: bootstrap sampling.
Report 2: overview of chosen case study.
Report 3: prediction regions, JPRs and bootstrap JPRs.
Report 4: time-series models.
Report 5: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
Implement and evaluate predictor and prediction standard errors.

Final Deliverables

Implement code for bootstrap JPRs and evaluate empirical coverage and family-wise error rate for different hyper-parameters (e.g., k, horizon length, sample size, number of bootstrap samples).
Final dissertation describing problem, methods and algorithm, implementation details, computational experiments and their analysis, and conclusions.

Suggested Extensions

1. Compare different predictors.
2. Compare bootstrap JPRs with other methods mentioned in the paper (e.g., joint marginals and Scheffé)
3. Compare bootstrap JPRs with Monte Carlo sampling based on Bayesian neural networks and uncertainty recalibration

Reading

(bootstrap) Chapter 6, Lecture notes of 36-402, CMU Undergraduate Advanced Data Analysis, available at https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
(bootstrap JPRs) Wolf, Michael, and Dan Wunderli. "Bootstrap joint prediction regions." Journal of Time Series Analysis 36.3 (2015): 352-376.
(time-series models) Stock Market Predictions with LSTM in Python, available at https://www.datacamp.com/community/tutorials/lstm-python-stock-market
(time-series models) Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2019). Dive into deep learning, chapters 8 and 9, available at https://d2l.ai/index.html
(extension 3) Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, ICML 2016, available at http://proceedings.mlr.press/v48/gal16.pdf
(extension 3) Kuleshov, Volodymyr, Nathan Fenner, and Stefano Ermon. "Accurate Uncertainties for Deep Learning Using Calibrated Regression." ICML 2018, available at https://arxiv.org/pdf/1807.00263.pdf

Prerequisites: Prerequisites: Recommended at least two of CS5200 (On-line Machine Learning), CS5970 (Experimental Design), CS5920 (Machine Learning), CS5100 (Data Analysis). Good programming skills in Python, MATLAB, or R.

CS: Large-Scale Graph Analytics

Aims: The aim of the project is to implement algorithms for a number of graph analytics problems of varying difficulty. The algorithms must be designed, implemented, tested, and evaluated on one of the large-scale distributed graph processing platforms.

You will start by studying background literature on various distributed graph processing platforms and then focus on a specific platform of your choice. You will then use this platform to implement 3 proof-of-concept programs solving a few simple graph analytics problems chosen from the list below. This will be followed by focusing on a single more advanced problem of your choice, and implementing and evaluating its prototype solution on a distributed platforms you have chosen.

Background: The list of suggested problems is as follows. You are welcome to propose and work on problems of your own subject to the supervisor's approval:

Proof-of-Concept:

Compute the number of vertices and edges in a graph
Compute maximum, minimum, and average vertex degrees
Compute and analyse the vertex degree distribution (probability mass function)
Compute the graph diameter (longest shortest path between any pair of nodes)
Compute and analyse the distribution (probability mass function) for the number of friends of a given degree k > 1

Advanced problems:

(A) Community Detection: Community in a social network refers to a collection of users whose ties to each other (in terms of e.g., the friend/follower relation) is closer than those to the other users. Community detection is a hot topic in the social network research and industry. The communities more often than not share a large number of common traits (such as e.g., interests in specific topics, geographical areas, social status, etc.), which makes them extremely useful for developing better recommendation systems, or advertisement campaigns. There is an extensive literature on the community detection problem, and a large number of algorithms have been proposed. Your task will be to identify and implement an algorithm suitable for implementing on a distributed computing platform, and evaluate its scalability and performance on various datasets.

(B) Triangle Enumeration: Similarly to community detection, finding and listing the triangles that occur in a graph is an important problem in network analysis and useful for determining network statistics such as the clustering coefficient. In the literature, there are several algorithms for triangle enumeration most of which are designed for a centralized model of computation. More recently, algorithms for distributed graph frameworks were proposed.

(C) PageRank: When considering real world graphs, we are often interested in determining the "important" vertices. In social networks, for example, important vertices correspond to well connected people. To decide whether a vertex is important, it is usually not sufficient to simply count its degree as this is not a very robust measure and can easily be skewed, for example, by connecting a vertex to many fake vertices that are solely created for this purpose and who do not have any other connections. Instead, the PageRank of a vertex depends on the number of links from vertices that are themselves important. PageRank is commonly used by search engines to determine the importance of web pages.

(D) Connected components: Connected components is the set of subgraphs of a given graph such that no two of these subgraphs have a vertex in common. For directed graphs, the connectivity between a pair of vertices a and b can be be either strong (i.e., b is reachable from a by following a path of directed arcs) or weak (i.e., a and b are connected in the underlying undirected graph). You will need to design and implement a distributed algorithm to identify both weakly and strongly connected components in a given graph, and analyse their respective size distributions.

(E) SimRank: is an important metric measuring similarity between two vertices in a graph through their structural context: i.e., two vertices are similar if they are referenced by two vertices which are also similar. SimRank found applications in a variety of diverse contexts including recommender systems, spam filtering, sponsored search, malware detection, etc. Your task is to implement a distributed algorithm to compute the SimRank score for any two pair of vertices in a given graph and experimentally analyse its performance.

(F) Implement a prototype of an interactive browser-based graph-analytics tool that will provide interface for both specifying a graph analytics problems and their inputs and visualising their execution on a remote distributed platform (Giraph or GraphX) in a creative way. Your tool should support at least 4 proof-of-concept graph analytics algorithms above, and be extensible to allow integrating other algorithms in the future.

Data: There is a large number of graph datasets of different nature available in various online repositories. The links to the commonly used ones can be found at https://moodle.royalholloway.ac.uk/mod/page/view.php?id=168248.

Tools: Both Giraph and GraphX are available on the department's big data cluster. Other platforms (such as GraphLab and PowerGraph) may also be available.

Early Deliverables

Report on the comparative study of the distributed platforms for processing large-scale graphs.
Implementation and evaluation of the proof-of-concept programs.
Detailed plan for the final project outcomes as agreed with the advisor.

Final Deliverables

The software you built to achieve the agreed project goals. The dissertation will report on different algorithms for the advanced problem you chose comparing their strengths and weaknesses. It will discuss techniques and findings, and provide meaningful visualisation of the results.

Reading

Prerequisites: CS5800 or prior programming experience

CS: Learning models for glucose prediction in diabetes therapy

Aims: The goal of this project is to apply machine learning methods to forecast glucose levels in the context of automated insulin therapy for type-1 diabetes.

Background:

Type 1 diabetes (T1D) is a disease in which the human pancreas is not able to produce a sufficient amount of insulin to regulate blood glucose levels, thereby inhibiting glucose uptake in muscle and fatty tissue. Many automated control techniques for T1D require some model for predicting future glucose levels to improve therapy effectiveness. Traditional models include mechanistic (aka white-box) differential equations models, which provide good accuracy and interpretability but require a potentially error-prone state estimation procedure, i.e., figuring out the state of the differential equation model out of past observations. In contrast, data-driven (aka black-box) machine learning models don't require state estimation as they can be trained to directly map past observations into predicted glucose values.

In this project, you will derive models for short-term glucose prediction (up to 1 hour prediction horizon) using deep learning techniques. Here by glucose we mean the subcutaneous glucose levels as measured by a CGM sensor. First, you will review the problem of automated insulin control and existing glucose prediction models. Then, you will implement data-driven machine learning models for glucose prediction, and train them with synthetic data provided by the project supervisor. These data consist of sequences of glucose measurements, insulin dosages, and ingested meals. You will experiment with and compare the accuracy of different deep neural network architectures and different lengths for the prediction horizon. If time permits, possible extensions include: analysis of prediction reliability using techniques like inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout; deriving optimal insulin control policies using the learned glucose prediction models; or comparison of the learned deep models with more classical autoregressive models.

Early Deliverables

Report 1: overview of the artificial pancreas and underlying control algorithms.
Report: review of main models for glucose prediction.
Report: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
Implement code for data processing and for learning neural network-based glucose predictors. Perform training for at least one NN architecture and at least one prediction horizon.

Final Deliverables

Final dissertation describing problem, data, solution method, machine learning methods used, implementation details, computational experiments and their analysis, and conclusions.

Suggested Extensions

Analysis of prediction reliability using techniques like inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout.
Derive optimal insulin control policies using the learned glucose prediction models
Compare the learned deep models with more classical autoregressive models.

Reading

(for report 1) Boughton, Charlotte K., and Roman Hovorka. "Advances in artificial pancreas systems." Science translational medicine 11.484 (2019). PDF available from supervisor
(for reports 1 and 2) Lunze, Katrin, et al. "Blood glucose control algorithms for type 1 diabetic patients: A methodological review." Biomedical signal processing and control 8.2 (2013): 107-119.
(for report 2 and for ideas on neural net architecture) Li, Kezhi, et al. "Convolutional recurrent neural networks for glucose prediction." IEEE Journal of Biomedical and Health Informatics (2019).
(for report 2 and for example of autoregressive model) Sparacino, Giovanni, et al. "Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series." IEEE Transactions on biomedical engineering 54.5 (2007): 931-937.
(for monte carlo dropout) Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
(for inductive conformal prediction) Papadopoulos, Harris. "Inductive conformal prediction: Theory and application to neural networks." Tools in artificial intelligence. IntechOpen, 2008.

Prerequisites: Prerequisites: CS5950 (Deep Learning) is recommended. Good programming skills in Python, MATLAB, or R.

CS: Machine Learning for Smart Meter Data

Aims: This project needs to study the smart meter dataset and develop machine learning algorithms for analyzing and predicting energy consumption and related tasks such as energy disaggregation and customer profiling. The goal is to optimize household energy consumption by understanding household characteristics and combining various source of information such as weather conditions and the time of the day.

Background: Utility companies are currently deploying smart meters in millions of households worldwide to collect fine-grained energy consumption data. For example, by the end of 2020, around 50 million smart meters will be fitted in over 26 million households across Wales, Scotland and England. UK government believes that the rollout will bring immediate benefits to consumers, but also moves to a lower carbon economy and a secure energy supply. One challenge is how to analyse the smart meter data so that utility companies can gain insight about their customers and enable personalized and scalable energy efficiency programs for individual households. Interest in this area has grown steadily in the last years.

Early Deliverables

Proof of concept program: identify one machine learning task suitable for the smart meter data and implement one learning algorithm for the chosen task.
Report: An overview of the datasets used. Statistics and numbers involved.
Report: A report on machine learning tasks for smart meter data.

Final Deliverables

Survey of machine learning techniques used for smart meter data analysis.
The developed machine learning software for smart meter data.
The software engineering process involved in generating your software.
Experiments with different smart meter data.

Suggested Extensions

Extend the learning algorithms for online setting.
Provide predictions with confidence measure.

Reading

Smart*: An Open Data Set and Tools for Enabling Research in Sustainable Homes. Sean Barker, Aditya Mishra, David Irwin, Emmanuel Cecchet, Prashant Shenoy, and Jeannie Albrecht. Proceedings of the 2012 Workshop on Data Mining Applications in Sustainability (SustKDD 2012), Beijing, China, August 2012.
Smart* Data Set for Sustainability http://traces.cs.umass.edu/index.php/Smart/Smart
SmartMeter Energy Consumption Data in London Households https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
UK Domestic Appliance-Level Electricity (UK-DALE) dataset http://jack-kelly.com/data/
Revealing Household Characteristics from Smart Meter Data. Christian Beckel, Leyna Sadamori, Thorsten Staake, Silvia Santini, Energy, Vol. 78, pp. 397-410, December 2014.
Piotr Mirowski, Sining Chen, Tin Kam Ho, Chun-Nam Yu, Demand Forecasting in Smart Grids, Bell Labs Technical Journal, 18(4), pp.135-158, 2014.

Prerequisites: Java, Machine Learning, CS5100 and CS5200.

CS: Machine learning based indoor positioning

Aims: Implementing several supervised learning algorithms to estimate the user location in a building, based on her current radio signal. A competition-level dataset mapping the signal-to-location of the building will be provided.

Background:

We spend most of our time indoors, where limited or no GPS service is available. The most popular solution leverages the ubiquitous WiFi signal to create a radio map of the tracking zone (i.e. WiFi fingerprinting).

WiFi fingerprinting is interesting, because it may be considered as both classification and regression problem. Firstly, as the training labels are discrete finite indoor locations, it may be addressed as a classification problem. Secondly, as these training labels are real numbers (i.e. the Cartesian coordinate), a regression model will be more suitable when there are too many training positions.

The challenge for both approaches is the high degree of variation of the radio signal, and the high dimensional training examples. Hence, it is useful to compare their performances on a real world dataset.

Early Deliverables

Producing a simple proof-of-concept implementation of some classification and regression algorithms from scratch.
A report of the empirical results of such learning algorithms on a small synthetic dataset.

Final Deliverables

A detailed literature review of the machine learning application in WiFi fingerprinting.
Producing a detailed report of the dataset structure, including visual analysis (e.g. t-SNE, UMAP, etc.).
Implementing the algorithms on the entire competition dataset, and optimising their parameters.
Carefully analysing and comparing the performance accuracy of such learning algorithms.

Suggested Extensions

Providing location estimations with some confidence measure.
Extending the learning algorithms to an online setting, where the user provides continuous samples as she traverses the building.

Reading

Bahl, P. and Padmanabhan, V. N. (2000). RADAR: An in-building RF-based user location and tracking system. In INFOCOM 2000. 19th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 2, pp. 775-784).
Honkavirta, V., Perala, T., Ali-Loytty, S., and Piché, R. (2009). A comparative survey of WLAN location fingerprinting methods. In Positioning, Navigation and Communication, 2009. WPNC 2009. 6th Workshop on (pp. 243-251). IEEE.
https://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc

Prerequisites: Taking CS5920 Machine Learning would be beneficial, but is not compulsory. Having good knowledge of Matlab or Python is essential.

CS: Natural Language Understanding: Measuring the Semantic Similarity between Sentences

Aims: To implement and design various deep neural networks for measuring the semantic similarities between sentence pairs.

Background: Natural language understanding (NLU) is widely viewed as a grand challenge in Artificial Intelligence (AI). An important sub-task in NLU is to measure the semantic similarity between sentence pairs, also known as the Semantic Textual Similarity (STS) task. A good STS model has many applications, for example in the search engine (to find the most relevant page for a given query), question answering (to cluster similar answers), document summarization (to measure how much information in the source documents is included in the summary), and automatic sentence paraphrasing (to check whether the revised sentence delivers the same meaning as the original one). This project focuses on developing STS models using the latest machine learning and artificial intelligence techniques. The arguably simplest, yet surprisingly strong method to measure the similarity between a pair of sentences is to count how many words or phrases the pair of sentences share. However, this method can easily fail when synonyms, homonyms and heteronyms are used in the sentences. To tackle this problem, neural-based text embedding methods have been proposed: they learn to project sentences to a high-dimensional vector space, such that semantically similar words (e.g. eggplant and aubergine) are neighbours in the vector space. The neural-based methods yield better performance than counting overlapping words. However, some recent study shows that the neural-based methods fail when comparing (i) sentences that use very different words but deliver very similar meanings (e.g. 'President Obama visited Beijing last week' and 'The first African-American US President arrived at Peking a few days ago'), and (ii) sentences that use very similar words but deliver completely different meanings (e.g. 'Hamilton beats Button and wins the game' and ' Button beats Hamilton and wins the game'). We aim at developing models that can alleviate these problems in this project.

Early Deliverables

Proof of concept program: tf-idf based similarity measurement
Proof of concept program: word-embedding based similarity measurement
(Optional) Proof of concept program: sentence-embedding based similarity measurement.
A report describing and assessing the results of applying the proof of concept programmes on different sentence pairs.

Final Deliverables

Fine-tune existing neural models (e.g. BERT), and evaluate whether they are better at dealing problems (i) and (ii) than other models.
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Constructing more training sentence pairs, for example by using Google Translate: given a sentence in English (denoted S), first translate it into another language (denote the translated sentence by S'), and then translate S' back to English (denote the result as S''); we should assume that S' and S'' should be similar sentences.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv pre-print 1810.04805

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

CS: Neural based Document Summarisation

Aims: To implement and design various deep neural networks for generating short summaries for long documents.

Background: Document summarisation aims at compressing long textual documents into a short, human readable form that contains the most important information from the source. This is a particularly useful application in the age of the Internet, when a bewildering amount of information is online but people only have very limited time to find the most salient information. Document summarisation has attracted considerable research interests from the AI and Natural Language Processing (NLP) community. In this project, we will focus on news article summarisation. Existing document summarisation techniques fall into two categories: extractive summarisation, which extracts sentences from the input document to build the summary, and abstractive summarisation, which writes sentences from scratch to summarise the input document. Abstractive summarisation allows for higher flexibility but is much more challenging than the extractive approaches. However, recent advances in deep neural networks make abstractive summarisation possible. In this project, we will use real-world data to train an abstractive summariser, using the latest neural models, e.g. encoder-decoder and Transformer, compare their performance, and even design new models to further boost the performance.

Early Deliverables

Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
(Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
A report describing and assessing the results of applying the proof of concept programmes on data.

Final Deliverables

Implement an LSTM-based Encoder-Decoder network and train it with a few hundred thousand of real data
Augment the Encoder-Decoder network with Pointer Mechanism and Attention Mechanism, compare their performance against vanilla Encoder-Decoder
(Optional) Implement Transformer-based Encoder-Decoder, compare its performance against the other models
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Applying Reinforcement Learning (RL) on top of the Encoder-Decoder models. Recent research suggests that applying RL further boost the performance.
Using Transfer Learning techniques to adjust existing language models, e.g. GPT-2, to generate summaries.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
A. See, P. J. Liu and C. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", Proceedings of the 55th ACL Annual Conference

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

CS: Predictive Power of Social Media Data

Aims: Gain insight into the predictive power of social media by establishing correlations between the features of the data posted on online social media platforms and salient properties of a significant real-life event of your choice.

Background: You will develop tools to gain insight into salient features of a significant past or future real-life event (such as e.g., presidential or parliamentary elections, referenda, etc.) of your choice using a large dataset collected off an online social media site, such as Twitter. For the purposes of this project, you will have access to the 300G Twitter dataset comprising one month worth of Twitter data posted in 2012, but you are also encouraged to propose and use an alternative dataset of your own provided it is sufficiently large, suitable for the studied problem, and approved by the supervisor). You may also consider collecting real-time data using the public APIs offered by some of the existing social media platforms, such as Twitter and Guardian.

You will use the tools you built to analyse various meta-data attributes associated with the data (such as time, location, hashtags, etc.) as well as properties of the content (e.g., sentiment score) to establish temporal, spatial, content-related, and other correlations between the real-life event you selected and the relevant social media properties. The report should reflect upon and draw general conclusions in regards to the predictive power of the social media platforms in relation to the dynamics and outcomes of real-life events.

Your project must use one or a combination of the large-scale data engineering tools you have studied in CS5234. For example, past projects have used Hadoop/MapReduce to identify tweets relevant to the studied event, extract the attributes of relevance, and assign sentiment score to the posted statuses using various Natural-Language Processing (NLP) techniques, such as [O'Connor, 2010] and/or Stanford NLP libraries (http://nlp.stanford.edu/)

Data: You will have access to the 300G Twitter dataset on the department's cloud, or could use an alternative dataset of your own subject to the supervisor's approval.

Tools: The project must use one or a combination of the large-scale data engineering tools studied in CS5234.

Early Deliverables

Precise specification of the project goals as agreed with the advisor along with data collection and analysis tools, and methodologies you plan to use to achieve the specified goals.
Intermediate report and programs showing and reflecting upon initial findings.

Final Deliverables

The software you built to achieve the project goals.
Final report will discuss techniques and findings, and provide meaningful visualisation of the results.

Reading

Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, Noah A. Smith, "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series" In Proceedings of the International AAAI Conference on Weblogs and Social Media (2010)

Prerequisites: CS5234

CS: Ranking

Aims: The aim of this project is to implemet techniques used by search engines to rank the output, test them and analyse their performance.

Background: Given a query, a search engine needs to rank documents by relevance so that links to most relevant and authoritative documents could be shown on top of the list.

There are many approaches to this problem and many different algorithms based on various principles. The classical PageRank algorithm assigns ranks on the basis of links so that pages linked by many others get high ranks. The rank turns out to be the stationary distribution of a Markov chain.

Recently ranking has been increasingly approached in the context of machine learning. The problem is called "learning to rank" and supervised and semi-supervised methods are applied to it. One possibility is to train regression or classification algorithms on relevance scores provided by human evaluators. A number of parameters called features are calculated using the content of the page and the query; then a learning algorithm is applied to the features. Different measures have been used to analyse the effectiveness of ranking, including Mean Average Precision, Discounted Cumulative Gain, Kendall's rank, Yandex pfound etc.

The project consists of implementing ranking techniques and analysing their performance on synthetic and real-world data. The efficiency issue is particularly important as the programs should work on large datasets.

Benchmark datasets:

LETOR: Learning to Rank for Information Retrieval http://research.microsoft.com/en-us/um/beijing/projects/letor/

Kaggle Learning to Rank 2019 challenge https://www.kaggle.com/c/learning-to-rank-2019/

Yandex Internet Mathematics 2009 competition dataset (archived) https://web.archive.org/web/20150317144535/http://imat2009.yandex.ru/academic/mathematic/2009/en/

Early Deliverables

Proof of concept program: Page rank applied to a small (possibly synthetic) dataset. Its performance and efficiency should be analysed.
Report: An overview of the problem of document ranking and main approaches to it. An overview of PageRank.

Final Deliverables

Two or more different machine learning technique should be applied to a standard benchmark dataset and their performance compared.
The performance of the program should be optimised to enable them to process large amounts of data.
The report will describe the theory of underlying learning algorithms.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their esults and draw conclusions.

Suggested Extensions

Comparison of pointwise, pairwise and listwise approaches.
Use of real Internet dumps such as http://commoncrawl.org/

Reading

Christiane Rousseau, How Google works: Markov chains and eigenvalues, URL http://blog.kleinproject.org/?p=280
Chuan He, Cong Wang, Yi-xin Zhong, and Rui-Fan Li. A survey on learning to rank. Machine Learning and Cybernetics, 2008 International Conference on. DOI 10.1109/ICMLC.2008.4620685.
Tie-Yan Liu, Learning to Rank for Information Retrieval, Berlin, Heidelberg: Springer. 2011

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5920 Computer Learning would be an advantage (but is not required).

CS: Ridge Regression

Aims: The aim of the project is to implement a popular machine learning technique called ridge regression, apply it to real-world data, analyse its performance, and compare it against different algorithms under different settings.

Background: Ridge Regression (RR) is a popular machine learning algorithm. It is applied to problems similar to the following. The Boston housing database consists of records describing houses in Boston (to be precise, average houses in certain neighbourhoods). For each house the values of attributes such as the number of rooms, distance from the city centre, quality of schools etc are known and the price, as evaluated by an expert estate agent, is given. The goal of the learner is to determine the dependency between attributes and prices and to predict the price of a house using the values of the attributes. The program is first shown a number of training examples so that it could learn the dependency and then is tested on test examples.

This project aims at studying different problems related to regression. One is the problem of parameter selection. How do we ensure that Ridge Regression performs best? How do we select the right parameters? Or is it best to tune parameters in the on-line mode? The problem of overfitting is closely related. It is known that in many situations more sophisticated techniques achieving better results in training later behave worse than coarse methods; how do we handle this? The project has the potential to lead to new research results of independent interest.

The project is recommended for Computational Finance students.

Early Deliverables

Proof of concept program: Kernel Ridge Regression applied to a small artificial dataset.
Report: An overview of ridge regression describing the concepts 'training set' and 'test set', giving the formulas for Ridge Regression and defining all terms in them.
Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.

Final Deliverables

The program will work with a real dataset such as Boston housing, read the data from a file, preprocess the data (normalise, split the dataset into the test and training sets) and apply Ridge Regression using a kernel with parameters.
The program will automatically perform tests such as comparison of different kernels, parameters, etc. The results will be visualised.
The program will work in batch and on-line modes.
Tests will be performed to compare plain Ridge Regression against other regression-based algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies, or gradient descent.
The report will describe the theory of Ridge Regression and derive the formulas.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions

Suggested Extensions

Application of Ridge Regression to different datasets (including those suggested by the student).
Comparison with other learning algorithms (e.g., aggregating algorithm regression, neural networks, gradient descent-based etc).
Combining regression with methods of prediction with expert advice.

Reading

Y.Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
J.Shawe-Taylor and N.Cristianini, Kernel Methods for Pattern Analysis Cambridge University Press, 2004
C.E.Rasmussen and C.Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006 http://www.gaussianprocess.org/
B.Schlkopf and A.J.Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5920 (Computer Learning) would be an advantage (but is not required).

CS: Satisfiability Modulo Theories and Bounded Model Checking for AI Planning

Aims: The goal of this project is to learn about Satisfiability Modulo Theories and apply this method to perform bounded model checking, i.e., the problem of verifying whether a given transition system can reach some state within a finite number of steps. You will apply this method to solve AI planning problems.

Background:

Satisfiability Modulo Theories (SMT) is a kind of constraint satisfaction problem that allows solving constraints specified in some fragment of first order logic. In other words, SMT extends propositional satisfiability (SAT) to support (besides propositional ones as in SAT) numerical variables over a fixed arithmetic theory as well as quantifiers over such variables. State-of-the-art SMT solvers like Z3 by Microsoft provide very efficient decision procedures for such problems.

Among the numerous applications of SMT, you will focus on bounded model checking (BMC), i.e., the problem of verifying whether a given transition system can reach some state within a finite number of steps. In particular, you will use BMC to solve AI planning problems. A planning problem can be expressed as one of BMC, where the objective is to find a path of the AI agent leading to some goal state.

In the first part of the project, you will study the SMT problem and SMT solving techniques, and you will use the Z3 tool and its Python APIs to solve simple example instances. In the second part, you will study the BMC problem and implement code to solve it for generic transition systems using Z3+Python. In the final part, you will extend your program to encode and solve planning problems on 2-dimensional grids with obstacles. Extensions include visualization of path solutions, and planning under cost constraints.

Early Deliverables

Report: overview of SAT, SMT and SMT solving techniques.
Report: bounded model checking and encoding into SMT.
Install z3 and Python APIs. Develop code for SMT solving of a simple example problem (e.g., Sudoku, or job shop scheduling).
Develop code for BMC of generic transition systems.

Final Deliverables

Develop code for encoding a planning problem on a 2-d grid with obstacle as a BMC problem.
Evaluate your solution with arbitrary and randomly generated planning instances.
Analyze SMT solver performance for different grid sizes and numbers of obstacles.
Final dissertation. It should include background on SAT, SMT, BMC, and AI planning, as well as the description of your solution for the encoding of AI planning into SMT, some example solutions obtained with the developed planning method, and performance analysis results.

Suggested Extensions

Extend the planning problem and your solution with transition costs and cost constraints.
Develop code to visualize path solutions on the 2d grid.

Reading

De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: introduction and applications." Communications of the ACM 54.9 (2011): 69-77. (available at https://dl.acm.org/doi/abs/10.1145/1995376.1995394)
De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: An appetizer." Brazilian Symposium on Formal Methods (2009): 23-36. (available at https://link.springer.com/chapter/10.1007/978-3-642-10452-7_3)
Amla, Nina, et al. "An analysis of SAT-based model checking techniques in an industrial environment." (up to section 2.3 included). (available at https://link.springer.com/chapter/10.1007/11560548_20)
Lectures on " Introduction to planning" and " Planning formalisms" of the " Autonomous Intelligent Systems" course
Z3 tool page, https://github.com/Z3Prover/z3
Programming Z3 tutorial (sections 1,2,3,5), https://theory.stanford.edu/~nikolaj/programmingz3.html
Jupyter notebooks illustrating Z3's Python APIs (includes code for Programming Z3 tutorial), https://notebooks.azure.com/nbjorner/projects/z3examples
Online Z3 tutorial (using SMT-lib input format), https://rise4fun.com/z3/tutorial/guide
Dennis Yurichev, "SAT/SMT by Examples", https://yurichev.com/writings/SAT_SMT_by_example.pdf (a huge collection of maths and CS problems solved with SAT or SMT)

Prerequisites: Prerequisites: CS5980 (Autonomous Intelligent Systems) or some knowledge of AI planning. Good programming skills in Python. Some understanding of logic.

CS: Scalable Peer-To-Peer Key Value Store.

Aims: Implementing a scalable key value store in a peer-to-peer setting and understanding its performance.

Background:

Key value stores are data structures that provide dictionary-style operations for data storage and retrieval. To ensure scalability and resilience to environmental changes, a key value store can be distributed among multiple nodes in a distributed system. In this project, the focus is on implementing peer-to-peer key value stores in an actor model framework as provided by Elixir/Erlang or Cloud Haskell.

Early Deliverables

A report giving an overview of prominent existing key value stores such as Chord and SkipGraph.
A report describing how existing key value stores deal with environmental changes.
A proof-of-concept program implementing a simple (possibly inefficient) distributed key value store.

Final Deliverables

A distributed key value store that provides standard dictionary operations and is scalable to large data sets.
A report describing the architecture of the implemented key value store.
A report on the performed experiments and simulation results.

Suggested Extensions

Supporting advanced dictionary operations, e.g., range queries.
More efficient storage of data by using coding techniques.
Simulating node churn.

Reading

Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
James Aspnes, Gauri Shah: Skip graphs. ACM Trans. Algorithms 3(4): 37 (2007)

Prerequisites: CS5860; Understanding of the basic concepts of distributed systems. Distributed programming.

CS: Value at Risk

Aims: Write a program for computing Value at Risk. Check its performance using backtesting (more generally, exploring its calibration and resolution).

Background: It is a very difficult and important problem to estimate the riskiness of a portfolio of securities. Value at Risk (VaR) is a tool for measuring financial risk. The goal of this project is to implement some of the methodologies for VaR computation (either under normal conditions or using "crash metrics") and test it on the real financial data. Other monetary measures of risk may also be studied.

The project is recommended for Computational Finance students.

Early Deliverables

You will write a simple proof-of-concept program for computing Value at Risk for a portfolio consisting of 1 or 2 stocks using two different methods: model-building and historical simulation.
Different ways of computing Value at Risk will be back tested and the statistical significance of the results analyzed.
The report will describe the program (software engineering, algorithms, etc.); discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).

Final Deliverables

The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation. Option pricing can be performed using the Black-Scholes formula, or binomial trees (for American options), or a different variety of Monte Carlo simulation (for Asian options).
Allow an arbitrary number of stocks in the model-building approach (using eigen- or Cholesky decomposition).
Allow an arbitrary number of stocks using histotical simulation.
Ideally, it will have a graphical user interface.
The final report will describe the theory behind the algorithms; the implementation issues necessary to apply the theory; the software engineering process involved in generating your software; computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Computing conditional VaR (also known as expected shortfall)
Explore carefully the choice of parameters for EWMA and GARCH(1,1) using a loss function such as square or log loss
Empirical study of the two approaches to historical simulation for n-day VaR: reducing to 1-day VaR and direct estimation
Complement back testing by stress testing.
Replacing the Gaussian model for stock returns by robust models.
Computing monetary measures of risk different from VaR.
Using proper scoring rules for quantiles for valuating computed VaRs.
Using methods of worst-case risk aggregation, such as the Rearrangement Algorithm, or worst-case online decision making, such as the Weak Aggregating Algorithm (currently active areas of research).

Reading

John C. Hull. Options, futures and other derivatives, 10th ed., Pearson, Upper Saddle River, NJ, 2018.
Hans Follmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time 3rd ed., de Gruyter, Berlin, 2011.
Yahoo Finance as source of data: http://finance.yahoo.com/
http://gloria-mundi.com

Prerequisites: Python, R, MATLAB, Java, or C++ programming skills.

Clustering and Generative Modelling

Aims: To implement various clustering and generative modeling methods, and to compare them on both artificial and real world data.

This project is to implement and compare a variety of clustering methods on some artificial and real data.

Early Deliverables

Proof of concept program: Gaussian mixture model applied to a small artificial dataset.
A report giving an overview of clustering and describing three different clustering methods.
A report describing and assessing the results of applying the proof of concept program to data.

Final Deliverables

The program(s) will work with a real dataset(s).
You will generate and compare different clusterings or generative models using several different methods, such as: kernel-based semi-supervised clustering, spectral learning, factor analysis, tree-based methods, or other comparable methods.
You will compare and visualize the clusterings produced.
The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

Suggested Extensions

There is a very wide range of clustering and generative-modelling algorithms in standard use, and an even wider variety of methods of fitting them to data. This would be an opportunity to implement several different methods and get to understand them by using them in practice.

Reading

Pattern Recognition and Machine Learning, by C.M. Bishop, Springer 2006
Machine learning: a probabilistic perspective”, by K. Murphy, MIT Press 2012
S. Basu, A. Bamerjee and R. Mooney, "Semi-supervised clustering by seeding", Proceedings of the 19th International Conference on Machine Learning, pp.19-26, 2002.

Prerequisites: CS5920 (Computer Learning) would be advantageous (but not required). This project will require some mathematical understanding; it is not simply programming.

Clustering and Generative Modelling

Aims: To implement various clustering and generative modeling methods, and to compare them on both artificial and real world data.

This project is to implement and compare a variety of clustering methods on some artificial and real data.

Early Deliverables

Proof of concept program: Gaussian mixture model applied to a small artificial dataset.
A report giving an overview of clustering and describing three different clustering methods.
A report describing and assessing the results of applying the proof of concept program to data.

Final Deliverables

The program(s) will work with a real dataset(s).
You will generate and compare different clusterings or generative models using several different methods, such as: kernel-based semi-supervised clustering, spectral learning, factor analysis, tree-based methods, or other comparable methods.
You will compare and visualize the clusterings produced.
The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

Suggested Extensions

There is a very wide range of clustering and generative-modelling algorithms in standard use, and an even wider variety of methods of fitting them to data. This would be an opportunity to implement several different methods and get to understand them by using them in practice.

Reading

Pattern Recognition and Machine Learning, by C.M. Bishop, Springer 2006
Machine learning: a probabilistic perspective”, by K. Murphy, MIT Press 2012
S. Basu, A. Bamerjee and R. Mooney, "Semi-supervised clustering by seeding", Proceedings of the 19th International Conference on Machine Learning, pp.19-26, 2002.

Prerequisites: CS5920 (Computer Learning) would be advantageous (but not required). This project will require some mathematical understanding; it is not simply programming.

Comparison of machine learning algorithms

Aims: To implement and compare on benchmark data sets various machine learning algorithms

Background: Machine learning allows us to write computer programs to solve many complex problems: instead of solving a problem directly, the program can learn to solve a class of problems given a training set produced by a teacher. This project will involve implementing a range of machine learning algorithms, from the simplest to sophisticated, and studying their empirical performance using methods such as cross-validation and ROC analysis. In this project you will learn valuable skills prized by employers using data mining.

Early Deliverables

You will implement simple machine learning algorithms such as nearest neighbours and decision trees.
They will be tested using simple artificial data sets.
Report: a description of 1-nearest neighbour and k-nearest neighbours algorithms, with different strategies for breaking the ties;
Report: a description of decision trees using different measures of uniformity.

Final Deliverables

May include nearest neighbours using kernels and multi-class support vector machines.
The algorithms will be studied empirically on benchmark data sets such as those available from the UCI data repository and the Delve repository.
For many of these data set judicious preprocessing (including normalisation of attributes or examples) will be essential.
The performance of all algorithms will be explored using a hold-out test set and cross-validation.
The overall program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
Ideally, it will have a graphical user interface.
The report will describe: the theory behind the algorithms.
The report will describe: the implementation issues necessary to apply the theory.
The report will describe: the software engineering process involved in generating your software.
The report will describe: computational experiments with different data sets and parameters.

Suggested Extensions

Modifications of known algorithms.
Dealing with machine learning problems with asymmetrical errors (such as spam detection) and ROC analysis.
Nontrivial adaptations of known algorithms to applications in a specific area, such as medical diagnosis or option pricing.
Exploring the cross-validation procedure, in particular the leave-one out procedure. What is the optional number of folds?
Comparative study of different strategies of reducing multi-class classifiers to binary classifiers (such as one-against-one, one-against-the-rest, coding-based).

Reading

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Second edition. Springer, New York, 2009.
Tom M. Mitchell. Machine Learning. McGrow-Hill, New York, 1997.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Second edition. Springer, New York, 2000.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
Seth Hettich and Steven D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, CA, 1999, http://kdd.ics.uci.edu.
Delve archive. http://www.cs.toronto.edu/~delve.

Prerequisites: Java or C++ programming skills. In particular, they should be comfortable with data structure programming such as trees. CS3920 is highly recommended. This project would suit the specialised AI degree.

Comparison of machine learning algorithms

Aims: To implement and compare on benchmark data sets various machine learning algorithms

Early Deliverables

You will implement simple machine learning algorithms such as nearest neighbours and decision trees.
They will be tested using simple artificial data sets.
Report: a description of 1-nearest neighbour and k-nearest neighbours algorithms, with different strategies for breaking the ties;
Report: a description of decision trees using different measures of uniformity.

Final Deliverables

May include nearest neighbours using kernels and multi-class support vector machines.
The algorithms will be studied empirically on benchmark data sets such as those available from the UCI data repository and the Delve repository.
For many of these data set judicious preprocessing (including normalisation of attributes or examples) will be essential.
The performance of all algorithms will be explored using a hold-out test set and cross-validation.
The overall program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
Ideally, it will have a graphical user interface.
The report will describe: the theory behind the algorithms.
The report will describe: the implementation issues necessary to apply the theory.
The report will describe: the software engineering process involved in generating your software.
The report will describe: computational experiments with different data sets and parameters.

Suggested Extensions

Modifications of known algorithms.
Dealing with machine learning problems with asymmetrical errors (such as spam detection) and ROC analysis.
Nontrivial adaptations of known algorithms to applications in a specific area, such as medical diagnosis or option pricing.
Exploring the cross-validation procedure, in particular the leave-one out procedure. What is the optional number of folds?
Comparative study of different strategies of reducing multi-class classifiers to binary classifiers (such as one-against-one, one-against-the-rest, coding-based).

Reading

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Second edition. Springer, New York, 2009.
Tom M. Mitchell. Machine Learning. McGrow-Hill, New York, 1997.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Second edition. Springer, New York, 2000.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
Seth Hettich and Steven D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, CA, 1999, http://kdd.ics.uci.edu.
Delve archive. http://www.cs.toronto.edu/~delve.

Compression and the Burrows-Wheeler transform

Aims: To implement lossless compression techniques along with preprocessing by the Burrows-Wheeler transform.

Background: This project is concerned with data compression and associated preprocessing techniques. Data compression, or source coding, involves encoding information using fewer bits than the original file or representation. Compression can be either lossy or lossless, and is applied in order to reduce resources such as storage or transmission capacity.

The Burrows-Wheeler Transform is a text transformation scheme that has found applications in various aspects of the data explosion problem, from data compression to indexing. Specifically, the transform is a permutation of an input string which often clusters occurrences of the same letter in the input into `runs’ of the letter. Hence it is useful as a preprocessing step in compression methods like run-length encoding.

A string of letters over an alphabet is a Lyndon word if it is the least in lexicographic (dictionary) order amongst the set of its rotations. For example, "data" is not a Lyndon word since it is not lexicographically least in the set {data, atad, tada, adat} whereas "adat" is a Lyndon word. Similarly, "computer" is a Lyndon word while "science" is not. An important technique for handling string computations is to split an input string into substrings, each with a common pattern. One such method is the Lyndon factorisation, where each substring is a Lyndon word. Efficiencies are achieved with the Burrows-Wheeler transform, by first factoring the input string into Lyndon words, and then applying the transform. It is also known that representing a string as a suffix tree can improve the speed of the Burrows-Wheeler transform.

In this project you will study lossy and lossless compression techniques. You will study relevant Shannon information theory such as entropy encoding, and some of the classic compression algorithms such as Huffman coding including their complexities. You will implement lossless compression algorithms applied to textual data, and preprocessing techniques.

Benchmarking will be performed in order to compare compression ratios: between different encoding methods, plus with and without preprocessing . You will develop a GUI suitable for importing text files; the GUI will incorporate HCI (human computer interaction) concepts.

Early Deliverables

Report on classic lossless algorithms: Implement: run-length encoding; move-to-front encoding; Huffman encoding.
Program: GUI built with buttons etc.
Program: Efficient Suffix Tree implementation
Program: Naive implementation of Burrows-Wheeler
Report: How does Burrows-Wheeler work with an emphasis on writing and understanding a proof that the reverse transform works.
Report: Shannon information theory and lossy v lossless compression; complexity theory; pseudo-code of algorithms implemented.
Reports: Complexity, NP hardness and the big O notation
Report: The preprocessing phase for compression.
Program: Implement an efficient Burrows-Wheeler Transform using Suffix Trees.
Program: Apply the Lyndon factorisation to an input string, and apply the Burrows-Wheeler transform to each of the Lyndon factors.

Final Deliverables

The program will have a Graphical User Interface with which to input text files and display compression ratios for compressed files.
The Lyndon factorisation and Burrows-Wheeler transformation of a string will be displayed.
The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
The report will describe the software engineering process involved in generating your software, interesting programming techniques and data structures used (or able to be used) on the project.
The report will describe the complexity analysis of your computations.
The report will include some results of benchmarking compression algorithms with a discussion of their meanings.
The report will include discussion of applications of the Burrows-Wheeler transform and Lyndon factorisation.
The report will include a User Manual.

Suggested Extensions

Build a mobile Burrow-Wheeler application: this will require use of Worker Threads.
Show the output using an effective JTree with callbacks to enable user interaction.
Animate the Burrows-Wheeler in order to help explain its working to a student unfamiliar with the algorithm.
Adding Digital Signatures to compressed files.

Prerequisites: Mathematical aptitude, in particular combinatorics, and competent programming skills to include handling a variety of data structures.

Reading

D. Adjeroh, T. Bell, A. Mukherjee, The Burrows-Wheeler transform: data compression, suffix arrays, and pattern matching, Springer (2008) 352 pp.
M. Burrows and D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation (1994).
M. Salson, T. Lecroq, M. Leonard and L. Mouchard, A four-stage algorithm for updating a Burrows-Wheeler transform, Theoret. Comput. Sci. 410 (43) (2009) 4350-4359 doi:10.1016/j.tcs.2009.07.016.
http://en.wikipedia.org/wiki/Burrows-Wheeler transform
M. Lothaire, Combinatorics onWords, Addison-Wesley, Reading, MA, 1983; 2nd Edition, Cambridge University Press, Cambridge, 1997.
J. P. Duval, Factorising words over an ordered alphabet, J. Algorithms 4 (1983) 363-381.
http://en.wikipedia.org/wiki/Data_compression

Compression and the Burrows-Wheeler transform

Aims: To implement lossless compression techniques along with preprocessing by the Burrows-Wheeler transform.

Early Deliverables

Report on classic lossless algorithms: Implement: run-length encoding; move-to-front encoding; Huffman encoding.
Program: GUI built with buttons etc.
Program: Efficient Suffix Tree implementation
Program: Naive implementation of Burrows-Wheeler
Report: How does Burrows-Wheeler work with an emphasis on writing and understanding a proof that the reverse transform works.
Report: Shannon information theory and lossy v lossless compression; complexity theory; pseudo-code of algorithms implemented.
Reports: Complexity, NP hardness and the big O notation
Report: The preprocessing phase for compression.
Program: Implement an efficient Burrows-Wheeler Transform using Suffix Trees.
Program: Apply the Lyndon factorisation to an input string, and apply the Burrows-Wheeler transform to each of the Lyndon factors.

Final Deliverables

The program will have a Graphical User Interface with which to input text files and display compression ratios for compressed files.
The Lyndon factorisation and Burrows-Wheeler transformation of a string will be displayed.
The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
The report will describe the software engineering process involved in generating your software, interesting programming techniques and data structures used (or able to be used) on the project.
The report will describe the complexity analysis of your computations.
The report will include some results of benchmarking compression algorithms with a discussion of their meanings.
The report will include discussion of applications of the Burrows-Wheeler transform and Lyndon factorisation.
The report will include a User Manual.

Suggested Extensions

Build a mobile Burrow-Wheeler application: this will require use of Worker Threads.
Show the output using an effective JTree with callbacks to enable user interaction.
Animate the Burrows-Wheeler in order to help explain its working to a student unfamiliar with the algorithm.
Adding Digital Signatures to compressed files.

Prerequisites: Mathematical aptitude, in particular combinatorics, and competent programming skills to include handling a variety of data structures.

Reading

D. Adjeroh, T. Bell, A. Mukherjee, The Burrows-Wheeler transform: data compression, suffix arrays, and pattern matching, Springer (2008) 352 pp.
M. Burrows and D. Wheeler, A block sorting lossless data compression algorithm, Technical Report 124, Digital Equipment Corporation (1994).
M. Salson, T. Lecroq, M. Leonard and L. Mouchard, A four-stage algorithm for updating a Burrows-Wheeler transform, Theoret. Comput. Sci. 410 (43) (2009) 4350-4359 doi:10.1016/j.tcs.2009.07.016.
http://en.wikipedia.org/wiki/Burrows-Wheeler transform
M. Lothaire, Combinatorics onWords, Addison-Wesley, Reading, MA, 1983; 2nd Edition, Cambridge University Press, Cambridge, 1997.
J. P. Duval, Factorising words over an ordered alphabet, J. Algorithms 4 (1983) 363-381.
http://en.wikipedia.org/wiki/Data_compression

Computer Language Design and Engineering

Aims: To investigate the basic techniques which underpin computer language design and translation. To investigate in more detail one aspect of this field, and to implement a tool that performs a related task.

Background: Programming language processors (which include compilers, interpreters and API's for data interchange formats such JSON and XML) need to `understand' some sort of syntax, and then execute appropriate actions according to a defined semantics. For instance, the javac compiler processes programs written in Java syntax and outputs compiled instructions for the Java Virtual Machine. A python interpreter understands Python syntax, and directly executes Python phrases as soon as it recognises them.
This set of projects focuses on the specification and implementation of syntax and semantics for such processors. Interestingly, a variety of algorithms have been developed which allow the tools to be automatically generated from appropriate specifications. These techniques are very important because nearly all of our software is written in high level languages, and nearly all high level languages are implemented using these ideas.
Technologies used by programming language processors include: the use of context free grammars to specify syntax; the use of lexical analysis and parsing algorithms to convert an user's program into an internal tree form; the use of term rewriting to modify those trees; and the use of attribute grammars and structural operational semantics to specify interpretations of that syntax including code generation and optimisation.
To do well on these projects you need to be interested in programming languages, have good programming skills and be comfortable with the kind of mathematics that you encountered in CS1870. You will write your code in Java, and make use of tools developed within the department such as RDP and ART. In the first part of this project you will learn about fundamental topics in this area including syntax specification, lexical analysis, parsing and simple interpretation. You will then move on to a more detailed study of one area. The first term deliverables are described in the early deliverables section. In the second part of the project (mostly the second term) you will focus on the design and implementation of tooling for programming language processing. Your detailed topic must be agreed with your superviser; suggestions include the following.

A tool that takes a set of regular expressions and uses Thompson'sconstruction and the subset construction to produce a lexical analyserfor the set of expressions.
A compiler for a simple Java subset language capable of executing small programs.
An interpreter for a small language using either attribute grammars or structural operational semantics.
A parser generator using either the LR or recursive descent algorithms.
An implementation of a context free term rewriting system.

Further information

Lexical analysers are typically based on finite state automata of the kind discussed in CS1870. The algorithms are well known but implementation details are important in the development of an efficient tool, and you will be expected to show how such optimisations can be used to improve performance.
A minimal language might comprise an integer-only subset of Java which supported basic control flow. Once this has been demonstrated, then it could then be extended to support extended control flow constructs such as switch, break and procedure call, as well as extensions to the type system, especially arrays.
Language interpretation can be viewed as adding actions to BNF specifications such that when a parser recognises a sub-phrase of the language, it can immediately execute those actions. Attribute grammars work by embedding actions within the grammar and providing attributes that allow information to flow up and down the trees constructed by parsers. Structural Operational Semantics take a different approach in which actions are triggered during the `reduction' of a program to some minimal form.
A variety of parsing algorithms have been developed. Two of the most commonly used are LR parsing and Recursive Descent parsing. A parser generator takes a specification for a language L written in BNF (for which you will already have a parser from your term 1 work) and then adds actions that cause it to emit a program that when compiled, will act as a parser for L. Your task is to understand your chosen parsing algorithm, and then write a parser generator for that algorithm. Interestingly, a correctly working parser generator can be `bootstrapped', that is it can be used to process its own specification.
A term rewriting system rewrites expressions. For instance, we know that in Boolean algebra, the logical AND of FALSE and any value is false. We can express this as FALSE AND x rewrites to FALSE. We also know that the logical AND of true and some value is just that value: TRUE AND x rewrites to x. A term rewriter takes a collection of these rewrites representing (in this case) Boolean identities, and attempts to simplify expressions by performing these substitutions. In this project, you will build a system which will explore rewrites over terms and perform useful computations.

Early Deliverables

A report on the use of context free grammars to specify syntax, along with manual procedures for generating and parsing languages.
A report on the use of a parsing tool to produce derivations, and the use of grammar idioms to capture notions of associativity and priority in arithmetic expressions.
The development of an interpreter for a language that models a four function calculator with memory.
The development of a pretty printer for BNF specifications, demonstrating that programs can process their own source code.
A technical report covering all of the above work, including a bibliography.
A Term 2 project schedule with a list of deliverables and dates.

Final Deliverables

A project report containing a description of all the research you have carried out and the details of your second term implementation.
An implementation of your chosen tool, See the suggestions in the Background Section.

Computer Language Design and Engineering

A tool that takes a set of regular expressions and uses Thompson'sconstruction and the subset construction to produce a lexical analyserfor the set of expressions.
A compiler for a simple Java subset language capable of executing small programs.
An interpreter for a small language using either attribute grammars or structural operational semantics.
A parser generator using either the LR or recursive descent algorithms.
An implementation of a context free term rewriting system.

Further information

Lexical analysers are typically based on finite state automata of the kind discussed in CS1870. The algorithms are well known but implementation details are important in the development of an efficient tool, and you will be expected to show how such optimisations can be used to improve performance.
A minimal language might comprise an integer-only subset of Java which supported basic control flow. Once this has been demonstrated, then it could then be extended to support extended control flow constructs such as switch, break and procedure call, as well as extensions to the type system, especially arrays.
Language interpretation can be viewed as adding actions to BNF specifications such that when a parser recognises a sub-phrase of the language, it can immediately execute those actions. Attribute grammars work by embedding actions within the grammar and providing attributes that allow information to flow up and down the trees constructed by parsers. Structural Operational Semantics take a different approach in which actions are triggered during the `reduction' of a program to some minimal form.
A variety of parsing algorithms have been developed. Two of the most commonly used are LR parsing and Recursive Descent parsing. A parser generator takes a specification for a language L written in BNF (for which you will already have a parser from your term 1 work) and then adds actions that cause it to emit a program that when compiled, will act as a parser for L. Your task is to understand your chosen parsing algorithm, and then write a parser generator for that algorithm. Interestingly, a correctly working parser generator can be `bootstrapped', that is it can be used to process its own specification.
A term rewriting system rewrites expressions. For instance, we know that in Boolean algebra, the logical AND of FALSE and any value is false. We can express this as FALSE AND x rewrites to FALSE. We also know that the logical AND of true and some value is just that value: TRUE AND x rewrites to x. A term rewriter takes a collection of these rewrites representing (in this case) Boolean identities, and attempts to simplify expressions by performing these substitutions. In this project, you will build a system which will explore rewrites over terms and perform useful computations.

Early Deliverables

A report on the use of context free grammars to specify syntax, along with manual procedures for generating and parsing languages.
A report on the use of a parsing tool to produce derivations, and the use of grammar idioms to capture notions of associativity and priority in arithmetic expressions.
The development of an interpreter for a language that models a four function calculator with memory.
The development of a pretty printer for BNF specifications, demonstrating that programs can process their own source code.
A technical report covering all of the above work, including a bibliography.
A Term 2 project schedule with a list of deliverables and dates.

Final Deliverables

A project report containing a description of all the research you have carried out and the details of your second term implementation.
An implementation of your chosen tool, See the suggestions in the Background Section.

Computer Vision for Physical Security

Aims: Propose and develop a system that detects real-world threats using camera-based systems.

Background: Computer vision allows machines to recognise objects in real-world footage. In principle, this allows machines to flag potential threats in an automated fashion based on historical and present images in video footage of real-world environments. Automated threat detection mechanisms to help security guards identify threats might be of help to them, especially if they have to temporarily leave their post, or have to guard a significant number of areas. In this project, students are asked to implement a system that is able to observe a real environment over time and attempt to identify potential threats, independent of a number of factors, e.g. lighting conditions or wind conditions. The student is encouraged to approach this challenge as they see fit, but would be expected to design, implement and assess any methods they develop. One approach might be to implement a camera system using e.g. a web camera and OpenCV to conduct anomaly detection on real-world environments using a Markov model, and flag any issues related to potential threats.

Prerequisites:

Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
A basic background of image processing techniques or OpenCV is an advantage.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing the state of the art in automated camera systems.
A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied

Final Deliverables

Design and implement a framework for development of new and existing physical security techniques
Describe in detail the design of tests for the framework
Describe and analyse the features that have been used during the project to detect/stop threats
Critically compare the findings with related works

Reading

OpenCV: https://opencv.org/
TensorFlow: https://www.tensorflow.org/
Computer vision: a modern approach

Conformal Anomaly Discovery

Aims: Write programs implementing Conformal Predictors with various underlying algorithms such as Support Vector Machines and apply them to a several datasets to detect anomalies in the data.

Background:

Recently developed conformal predictors can be used to narrow down the search space and to improve accuracy of classification.

Early Deliverables

A review and a report of the approach in the case of classification and regression.
A design of the system and its extension for conformal predictors.
Implementation of a pilot system and some preliminary experiments.

Final Deliverables

Implementation of the system, experiments with various parameters.
Experimenting with different underlying ML algorithms.
It is desirable that the final system will have a graphical user interface.
The final report will describe the theory, the implementation issues, computational experiments with different data sets and parameters.

Suggested Extensions

Online learning with conformal predictors.

Reading

V.Vovk, A.Gammerman, G.Shafer. "Algorithmic Learning in a Random World", Springer, 2005.

Prerequisites: Good programming skills in Phyton, Java, MATLAB, or R. Excellent knowledge of linear algebra. Taking CS5920 (or CS3920) Maching Learning is an advantage.

Conformal Prediction in Data Editing: Error Correction

Aims: To develop an application of Conformal Prediction to detection and correction of possible mistakes in a data set.

Early Deliverables

Find and choose a data set
Implement a relevant version of Conformal Prediction algorithm
Impute some mistakes into the data, apply the algorithm, and observe the results.

Final Deliverables

Implement a Conformal Prediction with an additional function of error detection.
Compare different versions of Conformal Prediction by their efficiency for data correction.
Analyse and report the output.

Suggested Extensions

Propose, justify theoretically, and implement a scheme of multiple error correction.
Compare how the approach works in different (on-line, batch, cross-validation, inductive) modes.

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).
Mushroom Data Set. https://archive.ics.uci.edu/ml/datasets/Mushroom

Prerequisites: Taking CS5920 Machine Learning is recommended.

Conformal Prediction in Data Editing: Error Correction

Aims: To develop an application of Conformal Prediction to detection and correction of possible mistakes in a data set.

Early Deliverables

Find and choose a data set
Implement a relevant version of Conformal Prediction algorithm
Impute some mistakes into the data, apply the algorithm, and observe the results.

Final Deliverables

Implement a Conformal Prediction with an additional function of error detection.
Compare different versions of Conformal Prediction by their efficiency for data correction.
Analyse and report the output.

Suggested Extensions

Propose, justify theoretically, and implement a scheme of multiple error correction.
Compare how the approach works in different (on-line, batch, cross-validation, inductive) modes.

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).
Mushroom Data Set. https://archive.ics.uci.edu/ml/datasets/Mushroom

Prerequisites: Taking CS5920 Machine Learning is recommended.

Conformal Prediction of Multi-Class Structured Label Spaces

Early Deliverables

Study data sets and identify one relevant for the project, describe it and the structure of its label space.
Select a basic machine learning approach applicable to this data.
Develop and implement a multi-class version of this approach and get some initial results on the data

Final Deliverables

Survey of the data and machine learning techniques.
Develop a conformity score functions for data with structured label space.
Implement a conformal prediction algorithm applicable to the data.
Analysis of the output.

Suggested Extensions

Compare results with different basic machine learning approaches.
Compare original label space structure attached to the data against its automatic structuring by means of data analysis.

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).
Predicting structured objects with support vector machines by T.Joachims, T.Hofmann, Y.Yue, C.N.Yu. Communications of the ACM 52 (11), 97--104

Prerequisites: Taking CS5920 Machine Learning would is recommended.

Conformal Prediction of Multi-Class Structured Label Spaces

Early Deliverables

Study data sets and identify one relevant for the project, describe it and the structure of its label space.
Select a basic machine learning approach applicable to this data.
Develop and implement a multi-class version of this approach and get some initial results on the data

Final Deliverables

Survey of the data and machine learning techniques.
Develop a conformity score functions for data with structured label space.
Implement a conformal prediction algorithm applicable to the data.
Analysis of the output.

Suggested Extensions

Compare results with different basic machine learning approaches.
Compare original label space structure attached to the data against its automatic structuring by means of data analysis.

Reading

Hedging predictions in machine learning by Alex Gammerman and Vladimir Vovk. Computer Journal 50:151--177 (2007).
A tutorial on conformal prediction by Glenn Shafer and Vladimir Vovk. Journal of Machine Learning Research 9:371--421 (2008).
Predicting structured objects with support vector machines by T.Joachims, T.Hofmann, Y.Yue, C.N.Yu. Communications of the ACM 52 (11), 97--104

Prerequisites: Taking CS5920 Machine Learning would is recommended.

Conformal predictors for classification and regression

Aims: Implement and test machine learning algorithms that complement their predictions with measures of confidence

Background: A disadvantage of standard machine learning algorithms is that they only produce "bare predictions", i.e., predictions without any indications of their reliability. The technique of conformal prediction makes it possible to build measures of confidence on top of almost any prediction algorithm. Implementing conformal prediction algorithm will familiarise you with machine learning techniques such as cross-validation and ROC analysis. You will learn valuable skills prized by employers using data mining.

Early Deliverables

You will implement simple inductive conformal predictors and run them on small artificial data sets.
They will be based both on underlying algorithms implemented by you (such as nearest neighbours)
... and on advanced machine learning algorithms included in available libraries (such as gradient boosting).
You will produce a report containing descriptions of algorithms and basics of the theory.

Final Deliverables

You will implement transductive conformal predictors based on the method of k nearest neighbours; transductive conformal predictors are less computationally efficient but have better accuracy.
You will implement cross-conformal predictors, which combine the accuracy of transductive conformal predictors and the computational efficiency of inductive conformal predictors.
The algorithms will be studied empirically on benchmark data sets such as those available from the UCI data repository and the Delve repository.
For many of these data set judicious preprocessing (including normalisation of attributes or examples) will be essential.
Two aspects of the performance of the algorithms will be explored: their validity and efficiency.
The overall program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
Ideally, it will have a graphical user interface.
The report will describe: the theory behind the algorithms.
The report will describe: the implementation issues necessary to apply the theory.
The report will describe: the software engineering process involved in generating your software.
The report will describe: computational experiments with different data sets, underlying algorithms, and parameters.

Suggested Extensions

Empirical study of the performance of the method of cross validation for different numbers of folds and extend to the method of cross-conformal prediction.
Exploring the validity of conformal predictors conditional on the training set.
Studying connections between probability-type inductive conformal predictors and ROC curves.

Reading

Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9:371-421, 2008. http://arxiv.org/abs/0706.3188
Alex Gammerman and Vladimir Vovk. Hedging predictions in machine learning (with discussion). The Computer Journal, 50:151-177, 2007. http://arxiv.org/abs/cs/0611011
Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, 2005.
Vladimir Vovk. Cross-conformal predictors. http://arxiv.org/abs/1208.0806
Vladimir Vovk. Inductive conformal predictors in the batch mode. http://arxiv.org/abs/1209.2673
Seth Hettich and Steven D. Bay. The UCI KDD Archive. University of California, Department of Information and Computer Science, Irvine, CA, 1999, http://kdd.ics.uci.edu
Delve archive. http://www.cs.toronto.edu/~delve

Prerequisites: Java or C++ programming skills. CS3920 is highly recommended. This project would suit the specialised AI degree.

Conformal predictors for detecting harmful respiratory events

Background:

Non-speech body sounds recognition had not received major attentions, despite containing valuable information about the person's health conditions.

This project aims at detecting the two most popular infectious events, coughs and sneezes, from raw recordings.

The challenge for such objective is the variety of coughs and sneezes, amongst other non-speech body sounds. Hence, the confidence measure of conformal prediction would be useful in such scenario.

Early Deliverables

Producing a simple proof-of-concept implementation of classification-based conformal prediction with different underlying algorithms (e.g. SVM, k-NN, Random Forests, etc.) from scratch.
A report of the empirical results of such conformal predictors on a small synthetic dataset.

Final Deliverables

A detailed literature review of existing machine learning algorithms used for non-speech sound recognition.
Producing a detailed report of the dataset structure, including some visual analysis (e.g. t-SNE, UMAP, etc.).
Implementing conformal predictors on the whole dataset, optimising the underlying algorithms' parameters.
Carefully analysing the p-values, comparing the validity, efficiency, accuracy of the implemented conformal predictors.

Suggested Extensions

Comparing different methods of extracting sound features (e.g. MFCC, CFA, ZCR) from raw recordings, and their impacts on the prediction performance.
Using the entire dataset to train a new model, then testing such model on other real world recordings.

Reading

Nguyen, K. A. and Luo, Z. (2018). Cover Your Cough: Detection of Respiratory Events with Confidence Using a Smartwatch. Proceedings of Machine Learning Research, 91, 1-18.
Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 1015-1018). ACM.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YDEPUT

Prerequisites: Taking CS5920 Machine Learning is required. Taking CS5100 Data Analsysis would be useful, but is not compulsory. Having good knowledge of Matlab or Python is essential.

Conformal predictors for detecting harmful respiratory events

Background:

Non-speech body sounds recognition had not received major attentions, despite containing valuable information about the person's health conditions.

This project aims at detecting the two most popular infectious events, coughs and sneezes, from raw recordings.

The challenge for such objective is the variety of coughs and sneezes, amongst other non-speech body sounds. Hence, the confidence measure of conformal prediction would be useful in such scenario.

Early Deliverables

Producing a simple proof-of-concept implementation of classification-based conformal prediction with different underlying algorithms (e.g. SVM, k-NN, Random Forests, etc.) from scratch.
A report of the empirical results of such conformal predictors on a small synthetic dataset.

Final Deliverables

A detailed literature review of existing machine learning algorithms used for non-speech sound recognition.
Producing a detailed report of the dataset structure, including some visual analysis (e.g. t-SNE, UMAP, etc.).
Implementing conformal predictors on the whole dataset, optimising the underlying algorithms' parameters.
Carefully analysing the p-values, comparing the validity, efficiency, accuracy of the implemented conformal predictors.

Suggested Extensions

Comparing different methods of extracting sound features (e.g. MFCC, CFA, ZCR) from raw recordings, and their impacts on the prediction performance.
Using the entire dataset to train a new model, then testing such model on other real world recordings.

Reading

Nguyen, K. A. and Luo, Z. (2018). Cover Your Cough: Detection of Respiratory Events with Confidence Using a Smartwatch. Proceedings of Machine Learning Research, 91, 1-18.
Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia (pp. 1015-1018). ACM.
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/YDEPUT

Prerequisites: Taking CS5920 Machine Learning is required. Taking CS5100 Data Analsysis would be useful, but is not compulsory. Having good knowledge of Matlab or Python is essential.

Conformal predictors in identifying protein sequences

Aims: To assess the viability of conformal prediction in understanding the accuracy of tools in identifying protein sequences with particular protein families.

Background:

An important task in Bioinformatics is identifying if a protein sequence is part of a particular protein family. This is typically carried out using tools such as Hidden Markox Models (HMM's).

Conformal predictors suggets an alternative to estimating such probabilities by moving away from the paradigm on ROC curves etc. and attempt to create probabilities based on the models themselves.

Early Deliverables

A report on Hidden Markov Model (HMM) tools in Bioinformatics.
A report on conformal predictors.
A simple implementation of conformal prediction on a family of proteins from PFAM.

Final Deliverables

A framework to evaluate the credibility and confidence of individual HMM's from PFAM.

Suggested Extensions

Using conformal predictors on HMM models in protein sequence prediction.

Reading

The Pfam protein families database. Finn RD1, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. doi: 10.1093/nar/gkp985.
http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

Prerequisites:

Attended CS5100 and be likely to get a distinction in the module.

Conformal predictors in identifying protein sequences

Aims: To assess the viability of conformal prediction in understanding the accuracy of tools in identifying protein sequences with particular protein families.

Background:

An important task in Bioinformatics is identifying if a protein sequence is part of a particular protein family. This is typically carried out using tools such as Hidden Markox Models (HMM's).

Conformal predictors suggets an alternative to estimating such probabilities by moving away from the paradigm on ROC curves etc. and attempt to create probabilities based on the models themselves.

Early Deliverables

A report on Hidden Markov Model (HMM) tools in Bioinformatics.
A report on conformal predictors.
A simple implementation of conformal prediction on a family of proteins from PFAM.

Final Deliverables

A framework to evaluate the credibility and confidence of individual HMM's from PFAM.

Suggested Extensions

Using conformal predictors on HMM models in protein sequence prediction.

Reading

The Pfam protein families database. Finn RD1, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A. doi: 10.1093/nar/gkp985.
http://jmlr.csail.mit.edu/papers/volume9/shafer08a/shafer08a.pdf

Prerequisites:

Attended CS5100 and be likely to get a distinction in the module.

Cooperative Strategies in Multi-agent Systems

Aims: To implement a virtual tournament of a set of programmable players engaging in a round robin for the Iterated Prisoner's Dilemma game, test different strategies for these players to follow, identify winning strategies and evaluate the final results.

Background: Imagine the following social interaction presented here as a kind of game that requires intelligent reasoning: You and your accomplish are both held prisoners, having been captured by the police and charged with a serious crime. The prosecutor interrogates you separately and offers each of you a deal. If one of you, the defector, incriminates the other, while the partner remains silent, then as the defector you will be convicted of a lesser crime and your sentence will be cut to one year for providing enough information to jail your partner. Meanwhile, your silent confederate will be convicted of a more serious crime and burdened with a four year sentence. If you both remain silent, and thus cooperate with each other, there will be insufficient evidence to convict either of you of the more serious crime, and you will each receive a sentence of two years for a lesser offence. If one the other hand, you both defect by incriminating each other, you will both be convicted of the more serious crime, but given reduced reduced sentences of three years for at lest being willing to provide information.

If you follow a purely selfish strategy as a player, the best outcome in terms of years in jail is that (a) you defect and the other person cooperates, (b) you both cooperate, and (c) you both defect. If both you and your accomplish are selfish, then you will both try to defect, hence this means 3 years in jail, which is the worst outcome. So if you could only trust each other, by following a strategy that is cooperative, you would both be better off than if you had acted selfishly. A more cooperative strategy will also give you benefits in the future, especially if you had to meet your accomplish after the sentence, engage in more crime, and as it likely need to play the game again, but now with knowledge of how your accomplice behaved in the previous game. This is called the Iterated Prisoner's dilemma game.

Method: The project should design and implement, in Java, a centralised implementation of a generic virtual tournament framework for running the Iterated Prisoner's dilemma game and test different types of player agents.

A type of agent will be characterised simply by the kind of strategy the agent is using in the game (e.g. always defect, randomly choose between defect and cooperate). Your work should show how to implement these kind of agents and evaluate their performance in terms of their cooperation characteristics.

A successful project will need to be a balanced project both in terms of the intelligent behaviour of individual players (strategies) and the generality of the software engineering techniques used to implement the tournament.

Early Deliverables

Familiarisation with the idea of strategies and the identification of set of them.
Implementation of strategies and the definition of an interface that allows agents to be selected and play against each other
Definition of a tournament
Report on the design of a tournament.
Implementation of a tournament, where a user selects the number of agents, associates them with existing strategies, specifies the number of iterations and deploys a tournament
Report on the Iterative Prisoner's dilemma problem
Report on previous work with the Iterative Prisoner's dilemma

Final Deliverables

A program that allows us to simulate tournaments of the iterative prisoner's dilemma.
A GUI that allows a user to create and manage tournaments.
GUI should provide summary statistics that are tournament specific, i.e. a summary for each player/strategy and how it is compared with the other players/strategies in the tournament.
The report should describe some of the theory behind strategies in cooperative settings.
The report should provide an analysis and an evaluation of the different strategies, including an indication of which one is the best strategy, if any.
The report should contain a design of the program and a discussion of any software engineering issues encountered.
The report should describe any useful and interesting programming techniques that have been employed to develop the final prototype.

Suggested Extensions

Make use of multi-agent system platform to support the tournament in a completely distributed setting.
Use logic-based rules to specify the strategies and wrap them in the Java implementation.
Use strategy recognition techniques so that agents can detect what strategies opponents are playing, and adapt their behaviour accordingly.
Create a methodology for developing tournaments of this type.

Prerequisites: Excellent programming ability in Java. Good knowledge of Prolog / SQL is desirable but not strictly necessary.

Reading

M. Nowak, R. May and K. Sigmund. The Arithmetics of Mutual Help, Scientific American, June 1995.
R. Axelrod, W. D. Hamilton (1981), The Evolution of Cooperation, Science 211: 1390–96.
K. Stathis, A game-based architecture for developing interactive components in computational logic, Journal of Functional and Logic Programming, 20(1), Jan 2000, MIT Press.

Cross-conformal Prediction with Neural Networks

Aims: Write a program (or programs) implementing a cross-conformal predictor based on Neural Networks.

Early Deliverables

A proof-of-concept program implementing neural nets both in the case of regression and in the case of classification (where neural networks output probability predictions).
Building a cross-conformal predictor on top of neural nets and exploring its empirical validity.
A preliminary empirical study of the efficiency of the student's implementation and standard implementations (such as the one in scikit-learn).

Final Deliverables

Careful cross-validation of various parameters, including the stopping rule, learning rate, and the numbers of hidden neurons at various levels.
Implementing regularization.
Experimenting with different architectures, possibly including deep architectures.
Ideally, the final product will have a graphical user interface.
The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Careful study of the optimal number of folds.
Replacing back-propagation with other optimization algorithms.
Implementing jackknife+, a popular recent version of cross-conformal prediction.

Reading

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Chapter 11.
Mitchell, Machine Learning, 1997, Chapter 4.
Research papers.

Prerequisites: Python, R, MATLAB, Java, or C++ programming skills. Good knowledge of calculus, especially differentiation.

Cross-conformal Prediction with Neural Networks

Aims: Write a program (or programs) implementing a cross-conformal predictor based on Neural Networks.

Early Deliverables

A proof-of-concept program implementing neural nets both in the case of regression and in the case of classification (where neural networks output probability predictions).
Building a cross-conformal predictor on top of neural nets and exploring its empirical validity.
A preliminary empirical study of the efficiency of the student's implementation and standard implementations (such as the one in scikit-learn).

Final Deliverables

Careful cross-validation of various parameters, including the stopping rule, learning rate, and the numbers of hidden neurons at various levels.
Implementing regularization.
Experimenting with different architectures, possibly including deep architectures.
Ideally, the final product will have a graphical user interface.
The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Careful study of the optimal number of folds.
Replacing back-propagation with other optimization algorithms.
Implementing jackknife+, a popular recent version of cross-conformal prediction.

Reading

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Chapter 11.
Mitchell, Machine Learning, 1997, Chapter 4.
Research papers.

Prerequisites: Python, R, MATLAB, Java, or C++ programming skills. Good knowledge of calculus, especially differentiation.

Cybersecurity Visualization

Aims: Propose and develop a system that detects visualizes cybersecurity-related data.

Background: Cybersecurity visualization helps analysts and risk owners alike to make better decisions about what to do when the network or a host is attacked. In this project the student will develop novel cybersecurity visualizations. The student is free to approach the challenge as they see fit, but would be expected to design, implement and assess the visualizations they develop. These projects tend to have a focus on network traffic visualization, but the student is encouraged to visualize datasets they would be most interested in. Other interesting topics might for instance include: host-based activities (e.g. CPU/RAM/network usage statistics), network scans (incl. vulnerabilities scanning). Past students have visualized network traffic patterns and android permission violation patterns. Other projects on visualizations are possible (not just cybersecurity), depending on interest and inspiration. The student will be expected to develop real-time and offline visualizations.

Prerequisites:

Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
WebGL and OpenGL are an advantage.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing the state of the art in resilience.
A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied.

Final Deliverables

Design and implement a framework for development of new and existing visualization techniques.
Describe in detail the design of tests for the framework.
Describe and analyse the features that have been used during the project to detect/stop threats.
Critically compare the findings with related works.

Reading

WebGL fundamentals https://webgl2fundamentals.org/
SecViz.org https://secviz.org/
Bokeh, Vega, D3, Plotly visualization suites
R Marty. Applied security visualization.
H Shiravi et al. A survey of visualization systems for network security.

Deep Learning

Aims: To implement a deep learning algorithm on chosen data, and to improve its performance with architecture search and optimisation of hyperparameters

This project is to implement and assess a deep learning algorithm - typically for a feed-forward network - and to perform visualisations and experiments to analyse and improve its performance

Early Deliverables

Proof of concept program: Implementation of a deep learning algorithm for a relatively small network on a relatively simple dataset, together with experiments determining the effect of hyperparameters on convergence
A report giving an overview of a range of related deep learning methods.
A report describing and assessing the performance of the initial implementation and any experiments performed

Final Deliverables

The program(s) will work with a real dataset(s).
You will implement and compare multiple architectures and optimisation methods
You will visualize and assess the performance of the algorithms produced.
The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

Suggested Extensions

This would be an opportunity to implement several deep learning methods and get to understand them in practice by applying them to a significant dataset.

Reading

"Deep Learning", by Goodfellow, Bengio, and Courville, MIT Press 2016
"Deep Learning with Python" by Francois Chollet, Manning Publications 2018

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

This project will require some mathematical understanding; it is not simply programming.

Deep Learning

Aims: To implement a deep learning algorithm on chosen data, and to improve its performance with architecture search and optimisation of hyperparameters

This project is to implement and assess a deep learning algorithm - typically for a feed-forward network - and to perform visualisations and experiments to analyse and improve its performance

Early Deliverables

Proof of concept program: Implementation of a deep learning algorithm for a relatively small network on a relatively simple dataset, together with experiments determining the effect of hyperparameters on convergence
A report giving an overview of a range of related deep learning methods.
A report describing and assessing the performance of the initial implementation and any experiments performed

Final Deliverables

The program(s) will work with a real dataset(s).
You will implement and compare multiple architectures and optimisation methods
You will visualize and assess the performance of the algorithms produced.
The report will describe and derive the formulae for the methods that you use, and the report will also discuss the implementation.

Suggested Extensions

This would be an opportunity to implement several deep learning methods and get to understand them in practice by applying them to a significant dataset.

Reading

"Deep Learning", by Goodfellow, Bengio, and Courville, MIT Press 2016
"Deep Learning with Python" by Francois Chollet, Manning Publications 2018

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

This project will require some mathematical understanding; it is not simply programming.

Dependency Modelling and Resilience in systems

Aims: Propose, develop and trial organisational resilience methods by examining how incidents affect mission operations in network simulations.

Background: A resilient organisation must be able to efficiently recover from actuated threats such as: cyber-attacks, natural disasters, wear and tear, disruptions (e.g. technologies becoming obsolete, key personnel leaving or outbreaks such as the covid19 pandemic) as well as accidents. Practices such as documentation methodologies, archives, version control systems and backups are widely regarded as default recovery mechanisms. However, these approaches do not adequately capture vital knowledge in organisations about assets and processes. Capabilities gaps include a lack of: 1) tools to automate and facilitate documentation of real world processes, and knowledge capture for protection mechanisms in organisations; 2) mechanisms to cope with, and recover from actuated threats posed to assets and mission processes, and 3) understanding which resilience requirements are universal, and which are sector/organisation specific. In this project, students will propose, develop and trial organisational resilience methods by examining how incidents affect mission operations in network simulations. This includes: building dependency models that describe and make use of organisation dependencies, capturing insights of human thinking such as context and assimilate human understandings. These solutions are intended to make resilience more achievable in real-world settings. The value of this project is in the systematic investigation and integration of knowledge as a protection mechanism for organisations and will involve developing models of organisations, how incidents affect them, and recovery solutions.

Prerequisites:

Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing the state of the art in resilience.
A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied

Final Deliverables

Design and implement a framework for development of new and existing resilience techniques
Describe in detail the design of tests for the framework
Describe and analyse the features that have been used during the project to detect/stop threats
Critically compare the findings with related works

Reading

Julia Allen. Measures for managing operational resilience.Technical Report, 2011.
Julia Allen, Pamela Curtis, Nader Mehravari, Andrew Moore, Kevin Partridge, Robert Stoddard, and Randy Trzeciak. Analyzing cases of resilience success and failure-a research study. Technical report, Carnegie Mellon University, the Software Engineering Institute, 2012.
Richard A Caralli, Julia Allen, and David W White.CERT resilience management model: A maturity model for managing operational resilience. Addison-Wesley Professional, 2010.
Deborah Bodeau and Richard Graubart. Cyber resilience metrics: Key observations. Technical Report,2016.
Ronald J Brachman, Richard E Fikes, and Hector J Levesque. Krypton: A functional approach to knowledge representation.Computer, (10):67–73, 1983.
Linkov and Trump. The science and practice of resilience. 2019

Detecting disguised processes using application-behaviour profiling

Aims: Propose and develop a system that detects disguised processes.

Background: In order to avoid detection, malware can disguise itself as a legitimate program or hijack system processes to reach its goals. Commonly used signature-based Intrusion Detection Systems (IDS) struggle to distinguish between these processes and are thus only of limited use to detect these kind of attacks. They also have the shortcoming that they need to be updated frequently to possess the latest malware definitions. This makes them inherently prone to missing novel attacks. Misuse detection IDSs however overcome this problem by maintaining a ground truth of normal application behaviour and reporting deviations as anomalies. In this project, students will be tasked to investigate how a process' behaviours can be profiled in the attempt to identify whether it is behaving anomalously and if it can be correctly identified as malware. This project will build on existing research in this area.

Prerequisites:

Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing the state of the art in malware disguising themselves as other processes.
A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied

Final Deliverables

Design and implement a framework for development of new and existing related malware detection techniques
Describe in detail the design of tests for the framework
Describe and analyse the features that have been used during the project to detect/stop threats
Critically compare the findings with related works

Reading

Vaas and Happa. Detecting disguised processes using application-behavior profiling. 2017
Geden and Happa. Classification of malware families based on runtime behaviour. 2018

Developing a security suite for Android Smartphones

Aims: Develop a security suite for Android based smartphones.

Background: This project will develop a security suite for Android smartphones. There are several security apps on the Google Play Store which take care of certain security aspects but there isn’t any all-in-one solution which covers the security as a whole. This project will analyse the possibility of having one such security package by using secure sensor management and enforcing authentication and access control policies without modifying the ROM or Android OS.

Prerequisites:

Good programming skills, particularly Android app development, and knowledge of IDEs.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing different security apps for integrating into security suite.
A proof-of-concept implementation of the security suite
Testing the system

Final Deliverables

Design and develop the final security suite
Complete report which must include a User Manual

Reading

Conti, M et al. Context-related policy enforcement for android
Bojjagani, S. et al. Stamba: Security testing for Android mobile banking apps.
Vulnerability Test Suite for Android - NowSecure https://info.nowsecure.com › VTS-for-Android
https://www.android.com/intl/en_uk/enterprise/security/

Development of a Long-Term Archiving System

Aims: Develop a user-friendly long-term archiving system for digital documents.

Background: Digital signatures are an effective tool to guarantee the authenticity and integrity of digital documents. However, in the long term, there are issues such as expired certificates and the fact that the cryptographic algorithms used today will become insecure in future years due to advances in cryptanalysis and computing power. Therefore, it is necessary to augment signed documents with additional evidence in the form of digital timestamps issued by so-called timestamping authorities (TSAs). The goal of this project is to develop a user-friendly long-term archiving system that gives users the opportunity to a) easily apply digital signatures to documents stored on their device, b) automatically interact with TSAs to generate and refresh additional evidence, and c) verify that a given document indeed existed in the current form at a certain point in time.

Prerequisites:

Basic knowledge in cryptography (mainly hash functions, digital signatures, public-key infrastructures)
Basic knowledge in UI development, web services (mainly REST API), cryptographic libraries (mainly OpenSSL)

Early Deliverables

Implement a user-friendly signing tool that applies a digital signature to a given file
Implement a TSA service that can be accessed via a REST API
Design a suitable file format to accumulate digital evidence

Final Deliverables

Extend the signing tool with capabilities to automatically interact with TSAs to generate and refresh digital evidence
Extend the signing tool with capabilities to verify whether a digitally signed and timestamped document existed at a certain point in time

Reading

Martín Vigil, Johannes Buchmann, Daniel Cabarcas, Christian Weinert, Alexander Wiesmaier: "Integrity, Authenticity, Non-Repudiation, and Proof of Existence for Long-Term Archiving: A Survey". In Computers & Security 2015.
Christian Weinert, Denise Demirel, Martín Vigil, Matthias Geihs, Johannes Buchmann: "MoPS: A Modular Protection Scheme for Long-Term Storage". In ASIACCS 2017.

Development, Implementation, and Simulation of New Insertion Strategies for Cuckoo Filters

Aims: Develop, implement, and simulate new insertion strategies for Cuckoo filters.

Background: Probabilistic data structures for efficient membership testing like Cuckoo filters play an important role in cryptographic protocols that solve the private set intersection (PSI) problem. In many PSI use cases such as mobile contact discovery, Cuckoo filters have to handle frequent insertions of many new set elements. However, when the load factor of a Cuckoo filter increases, it is more likely that the default insertion strategy fails. In that case, a larger Cuckoo filter has to be created from scratch and distributed among all protocol participants, which causes significant communication overhead. Therefore, the goal of this project is to increase the lifetime of Cuckoo filters in PSI protocols by developing and implementing new insertion strategies that can deal with high load factors. The new strategies should then be simulated to measure their success as well as the resulting run-time overhead.

Prerequisites:

Basic knowledge in cryptography (mainly hash functions) and complexity theory (mainly "Big O" notation)
Knowledge of either C/C++, Go, Java, or Rust

Early Deliverables

Algorithms for three new insertion strategies with different trade-offs (in terms of time and storage complexity)
Implementation of at least one of the new strategies in one existing Cuckoo filter implementation
Implementation of a simulation environment that will allow to measure the success and run-time overhead of different insertion strategies

Final Deliverables

Implementation of the remaining new insertion strategies
Simulation results for all insertion strategies

Reading

Bin Fan, David G. Andersen, Michael Kaminsky, Michael Mitzenmacher: "Cuckoo Filter: Practically Better Than Bloom". In: CoNEXT 2014.
Daniel Kales, Christian Rechberger, Thomas Schneider, Matthias Senker, Christian Weinert: "Mobile Private Contact Discovery at Scale". In USENIX Security Symposium 2019.
https://github.com/efficient/cuckoofilter
https://github.com/linvon/cuckoo-filter
https://github.com/MGunlogson/CuckooFilter4J
https://github.com/axiomhq/rust-cuckoofilter

Digital Fountain Codes

Aims: Understand and implement a randomised protocol called a Digital Fountain Code - for sending a file over a channel, without the receiver having to ever ask for any part of the message to be re-sent.

Background: Digital fountain codes (or Luby codes) are an elegant solution to a natural problem. Suppose you have to send a large file over a network, so that you have to split it into a sequence of small packets. Suppose that some packets can get lost, and suppose that it is either expensive or impossible for the receiver to ask you to re-send any lost packets, and you do not know which packets did not arrive. How can you send the file so that the receiver can reconstruct it? One's first idea is to send a stream of packets, each containing a random part of the file: eventually the receiver will get all parts of the file, but the total amount of packets that the receiver has to get will be many times the size of the original file. Remarkably, using a digital fountain code, it is possible to construct randomised packets so that the receiver can reconstruct the file after receiving only about 5% more data than was in the original file. Although the basic idea is quite simple, there are numerous issues in constructing an efficient implementation, and the optimal 'degree distribution' of the packets may still be unknown.

Early Deliverables

A basic implementation for sending sequences of characters. The 'droplets' may be object instances with appropriate fields. The implementation should consist of two programs: an encoder that reads in a text file and writes out a sequence of 'droplets', and a decoder that reads in a sequence of droplets and which outputs a text file, which should be identical to the original text file.
A description of how the basic encoding and decoding algorithms work, with pseudocode and a running example. It should also include a discussion of the 'degree distribution', and of how to sample random values from a discrete distribution.

Final Deliverables

An efficient program for encoding and decoding.
A sequence of well designed and analysed experiments investigating the program's efficiency and behaviour.
The report will contain a discussion of erasure channels, contrast with feedback channels. When are erasure channels needed? What other solutions are there?
The report will contain a description of Digital Fountain Codes: basic encoding and decoding, with a running example. Role of degree distribution. Behaviour with special cases of degree distribution. Luby's degree distributions. Time and space efficiency of variants of decoding algorithm.
The report will contain 3. Design and implementation: What is overall structure of program? What are design issues? How were these design issues resolved?
The report will contain complete testing schedule: What is evidence program is correct (test cases)
The report will contain an analysis of the time and space efficiency of algorithm(s)
The report will contain, for each experiment: aim, method, tables/graphs of results, questions raised

Suggested Extensions

Implementation should be efficient: different optimisations lead to time and space efficiency. Many small extensions to algorithm are possible. You could design and carry out experiments to assess the efficiency of your algorithm.
It is possible to design experiments to check the efficiency of the packet degree distribution, and you could try to optimise the degree distribution in the light of your experiments.

Reading

Start with Wikipedia on 'Luby Transform Code', and the references it gives.

Digital Fountain Codes

Early Deliverables

A basic implementation for sending sequences of characters. The 'droplets' may be object instances with appropriate fields. The implementation should consist of two programs: an encoder that reads in a text file and writes out a sequence of 'droplets', and a decoder that reads in a sequence of droplets and which outputs a text file, which should be identical to the original text file.
A description of how the basic encoding and decoding algorithms work, with pseudocode and a running example. It should also include a discussion of the 'degree distribution', and of how to sample random values from a discrete distribution.

Final Deliverables

An efficient program for encoding and decoding.
A sequence of well designed and analysed experiments investigating the program's efficiency and behaviour.
The report will contain a discussion of erasure channels, contrast with feedback channels. When are erasure channels needed? What other solutions are there?
The report will contain a description of Digital Fountain Codes: basic encoding and decoding, with a running example. Role of degree distribution. Behaviour with special cases of degree distribution. Luby's degree distributions. Time and space efficiency of variants of decoding algorithm.
The report will contain 3. Design and implementation: What is overall structure of program? What are design issues? How were these design issues resolved?
The report will contain complete testing schedule: What is evidence program is correct (test cases)
The report will contain an analysis of the time and space efficiency of algorithm(s)
The report will contain, for each experiment: aim, method, tables/graphs of results, questions raised

Suggested Extensions

Implementation should be efficient: different optimisations lead to time and space efficiency. Many small extensions to algorithm are possible. You could design and carry out experiments to assess the efficiency of your algorithm.
It is possible to design experiments to check the efficiency of the packet degree distribution, and you could try to optimise the degree distribution in the light of your experiments.

Reading

Start with Wikipedia on 'Luby Transform Code', and the references it gives.

Distributed Cheat-Proof Games

Aims: To build a high-performance distributed cheat-proof game using recently available trusted computing hardware (Intel SGX).

Background: Popular multiplayer video games often run in a client-server model, where the server runs on a dedicated infrastructure provided by the game developer. The main benefit of this approach is that it is easier to detect and prevent cheating, since players can modify software on their own machine for their own benefit [1]. Unfortunately, this approach has several drawbacks. First, the scalability of the system is limited as the greater the number of clients, the more server infrastructure required. Indeed, popular games often experience overloaded game servers because of the number of simultaneous players [2, 3]. Furthermore, as clients cannot be trusted, they have to send substantial amounts of data to the server, which then has to do all the computation on it. This increases both the system bandwidth and CPU usage, which can degrade the user experience [4, 5].
Although cloud infrastructures, e.g., Amazon EC2, provide several benefits (maintenance, scalability, reliability), they do not provide any security guarantee: a bug in the cloud infrastructure or a malicious cloud provider could be even more dangerous than a malicious client.
Intel has recently released the Software Guard eXtensions [6, 9, 10], a set of instructions dedicated to improving the security of an application by running (parts of) it in a secure container, called an enclave. The data and computation happening in this container is protected by the CPU chip from any external attacker, be it another malicious process, the operating system, or physical access to the machine.
The goal of this project is to partition a real-world game (e.g. ioquake3 [7], a clone of the popular Quake 3: Arena video game), by integrating a subset of its components inside an enclave. As a result some computation at the server can be offloaded to the client while preventing cheating. A list of potential cheats can be found in [8].

Early Deliverables

Familiarity with necessary background on Intel SGX.
Familiarity with Intel SGX SDK. Show examples of its operation on toy/tutorial problems.
Report: Intel SGX overview
Report: Distributed Games, including architecture and potential for cheating in the game of your choice.
Proof-of-concept: Partition a real-world game of your choice (e.g. ioquake3) so that the server runs inside an enclave
Proof-of-concept: Offload some of the computation at the server to the client, inside an enclave

Final Deliverables

Modify the communication protocol between the client and the server so that the integrity of the game is preserved even when a client is cheating;
Establish a secure communication protocol, e.g., TLS, so that an attacker cannot observe the network traffic for cheating;
Evaluate the performance and scalability benefits of your distributed game.
Report: Detailed architecture description of final partitioned game.
Report: Security analysis of final partitioned game.

Suggested Extensions

Using the Intel SGX functionalities, establish a framework to remotely attest/verify the authenticity of a client/server and provision it with sensitive data.

Reading

https://support.steampowered.com/kb_article.php?ref=7849-Radz-6869
http://www.gamesradar.com/diablo-iii-error-37-causing-great-wailing-and-gnashing-teeth-players-worldwide-fail-log-solution-wait-a-bit/
http://www.ibtimes.com/division-servers-struggle-ubisoft-confirms-login-problems-2332172
Raaen et al., Latency Thresholds for Usability in Games: A Survey, NIK 2014.
Howard et al., Cascading Impact of Lag on User Experience in Cooperative Multiplayer Games, NetGames 2014.
https://software.intel.com/en-us/sgx
https://ioquake3.org/
Yahyavi et al., Watchmen: Scalable cheat-resistant support for distributed multi-player online games, ICDCS 2013.
Intel SGX Linux SDK Sample Code: https://github.com/intel/linux-sgx/tree/master/SampleCode/SampleEnclave
Intel SGX Explained, https://eprint.iacr.org/2016/086.pdf

Prerequisites: Strong interest in distributed systems, systems programming. Good programming skills (especially C/C++). Some interest or familiarity with basic security concepts.

Distributed robust Log system for IoT using DLT technology

Aims: Develop a general-purpose robust distributed auditing/Logging system for Internet of Things (IoT).

Background: A system log is considered one of important security elements in any system especially IoT system as it is necessary to log and monitor relevant and sensitive activities in the system for compliance and accountability purposes. Therefore, all architectural components need to have an auditing system at least a log file recording the history of different activities (measurements sending or receiving events). However, creating a tamper-proof and trusted auditing system appears to be very challenging as the data of this system may undergo interferences or illegitimate modifications affecting its integrity. For this reason, incorporating Blockchain (DLT) technology in such system would offer a good solution for this problem.

Prerequisites:

Ability to research a new topic and implement designs
Good programming skills particularly Node JS , Java, Golang, etc and knowledge of Windows/Linux internals.
Knowledge Data/Database Modelling
Familiarity with DLT and Blockchain technologies (e.g. HyperLedger)

Early Deliverables

A report describing architectural components of the IoT system and their relevant activities which need to be audited and addressing Blockchain technologies available for use
A design of the auditing data model
A proof of concept implementation of the auditing system using a DLT technologies
Testing the system

Final Deliverables

Design and develop the data model of the IoT-based auditing system
Develop a prototype of Blockchain-IoT-based auditing system
Complete report
The report will include a User Manual

Reading

https://www.networkmanagementsoftware.com/what-is-syslog/
https://en.wikipedia.org/wiki/Syslog
https://medium.com/swlh/top-5-dlt-blockchain-protocols-to-consider-in-your-project-dae7fc6d381d
https://www.ibm.com/blogs/research/2018/02/architecture-hyperledger-fabric/
https://hyperledger-fabric.readthedocs.io/en/release-1.4/getting_started.html

Econ: Causal Evaluation of Empirical Factor Models using the PC Algorithm

Background: This project is only available to students on the Computational Finance programme. Contact Alessio Sancetta on alessio.sancetta@rhul.ac.uk for details.

Prerequisites: Google what the PC algorithm is to see if you like this, and can use data from French's data library, among others.

Econ: Credit markets after 2008.

Background: This project is only available to students on the Computational Finance programme. Contact Pierre-Olivier Fortin on Pierre-Olivier.Fortin@rhul.ac.uk for details.

Prerequisites: Knowledge of Econometrics or Statistical analysis.

Econ: Modelling volatility of cryptocurrencies

Background: This project is only available to students on the Computational Finance programme. Contact Souhaila Al Hesso on souhaila.alhesso@rhul.ac.uk for details.

Econ: Price limits in emerging markets.

Background: This project is only available to students on the Computational Finance programme. Contact Vinay Nundlall on Vinay.Nundlall@rhul.ac.uk for details.

Prerequisites: Knowledge of Econometrics or Statistical analysis.

Econ: Price overreactions in the cryptocurrency market

Background: This project is only available to students on the Computational Finance programme. Contact Souhaila Al Hesso on souhaila.alhesso@rhul.ac.uk for details.

Econ: Technical analysis and Cryptocurrency

Background: This project is only available to students on the Computational Finance programme. Contact Souhaila Al Hesso on souhaila.alhesso@rhul.ac.uk for details.

Econ: The information content in trading volume in emerging stock exchanges.

Background: This project is only available to students on the Computational Finance programme. Contact Vinay Nundlall on Vinay.Nundlall@rhul.ac.uk for details.

Ensemble methods

Aims: Write a program (or programs) implementing various ensemble methods (such as bagging and boosting) together with some underlying algorithms.

Early Deliverables

A proof-of-concept implementation of bagging, random forests, or boosting both in the case of regression and in the case of classification.
Implementing underlying algorithms, such as decision stumps or decision trees; simple proof-of-concept implementations are sufficient.

Final Deliverables

A final program implementing bagging, random forests, or boosting (or all three) with a careful choice of the parameters.
Implementations of the underlying algorithms should be complemented by a careful empirical study of the effect of various parameters; producing optimized programs.
Comparing the performance of different ensemble techniques in the problems of classification and regression.
Using bagged and boosted decision trees as probability predictors; evaluating their perfomance using proper loss functions.
Ideally, the final product will have a graphical user interface.
The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Implementing the out-of-bag estimate of error probability.
Investigating variable importance.
Proximity plots.
Theoretical analysis of the algorithms.

Reading

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Section 8.7, Chapters 10 and 15-16
James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning, 2013, Chapter 8.
Research papers.

Prerequisites: Taking CS5100 Data Analysis. Python, R, MATLAB, Java, or C++ programming skills.

Ensemble methods

Aims: Write a program (or programs) implementing various ensemble methods (such as bagging and boosting) together with some underlying algorithms.

Early Deliverables

A proof-of-concept implementation of bagging, random forests, or boosting both in the case of regression and in the case of classification.
Implementing underlying algorithms, such as decision stumps or decision trees; simple proof-of-concept implementations are sufficient.

Final Deliverables

A final program implementing bagging, random forests, or boosting (or all three) with a careful choice of the parameters.
Implementations of the underlying algorithms should be complemented by a careful empirical study of the effect of various parameters; producing optimized programs.
Comparing the performance of different ensemble techniques in the problems of classification and regression.
Using bagged and boosted decision trees as probability predictors; evaluating their perfomance using proper loss functions.
Ideally, the final product will have a graphical user interface.
The final report will describe the theory behind the algorithms, the implementation issues necessary to apply the theory, the software engineering process involved in generating your software, computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Implementing the out-of-bag estimate of error probability.
Investigating variable importance.
Proximity plots.
Theoretical analysis of the algorithms.

Reading

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, 2009, Section 8.7, Chapters 10 and 15-16
James, Witten, Hastie, and Tibshirani, An Introduction to Statistical Learning, 2013, Chapter 8.
Research papers.

Prerequisites: Taking CS5100 Data Analysis. Python, R, MATLAB, Java, or C++ programming skills.

Exploring the graph properties of the DataCite PID graph

Aims: To build a tool that will make GraphQL queries to the datacite PID graph and to explore its graph properties.

Background:

This project will require the student to examine the behaviour of the degree of the PID graph.

Early Deliverables

A report on graphs and scale-free networks.
A report on the PID graph.
A report on GraphQL.

Final Deliverables

The development of a tool to query the PID graph and visualise graphs.
Identifying the behaviour of the degree of the PID graph.

Suggested Extensions

Exploring other graph obseravbles e.g. centrality.
Exploring using a recommender system to identify papers of interest.

Reading

Introducing the PID Graph - https://blog.datacite.org/introducing-the-pid-graph/
The Fullstack Introduction for GraphQL - https://www.howtographql.com/
Learning GraphQL, E.Porcello and A. Banks O'Reilly Media (available online through Safari Books)
Handbook of Graph Theory, J.L. Gross, J. Yellen and P. Zhang, Chapman and Hall/C (available online through Safari Books)
Scale-free networks are rare, A.D. Broido and A. Clauset, Nature Comms. (2019) DOI - 10.1038/s41467-019-08746-5

Prerequisites:

Understanding of or willingness to study graph theory.

The student must have a good understanding of Python or a similar tool.

Exploring the graph properties of the DataCite PID graph

Aims: To build a tool that will make GraphQL queries to the datacite PID graph and to explore its graph properties.

Background:

This project will require the student to examine the behaviour of the degree of the PID graph.

Early Deliverables

A report on graphs and scale-free networks.
A report on the PID graph.
A report on GraphQL.

Final Deliverables

The development of a tool to query the PID graph and visualise graphs.
Identifying the behaviour of the degree of the PID graph.

Suggested Extensions

Exploring other graph obseravbles e.g. centrality.
Exploring using a recommender system to identify papers of interest.

Reading

Introducing the PID Graph - https://blog.datacite.org/introducing-the-pid-graph/
The Fullstack Introduction for GraphQL - https://www.howtographql.com/
Learning GraphQL, E.Porcello and A. Banks O'Reilly Media (available online through Safari Books)
Handbook of Graph Theory, J.L. Gross, J. Yellen and P. Zhang, Chapman and Hall/C (available online through Safari Books)
Scale-free networks are rare, A.D. Broido and A. Clauset, Nature Comms. (2019) DOI - 10.1038/s41467-019-08746-5

Prerequisites:

Understanding of or willingness to study graph theory.

The student must have a good understanding of Python or a similar tool.

Game Playing with Monte-Carlo Tree Search

Aims: To design and implement game playing program(s) with Monte-Carlo Tree Search.

Background: Monte-Carlo Tree Search (MCTS) is a newly developed class of methods for heuristic search in game trees, and for other problems in which tree search is used. In a nutshell, the idea is to explore the game tree by simulating a great many complete sequences of random play (`rollouts'), gradually concentrating attention on the more promising paths found, and so growing an initial promising game tree, and finally selecting the estimated best move. The number of rollouts considered at each move may be tens of thousands (or even millions, depending on the problem and how fast your code is). It is a different game-tree search strategy from the older method of alpha-beta pruning.

MCTS is an active research area, and has scored some striking successes - it is the basis of the current best Go programs (Go has too high a branching factor for alpha-beta pruning to be used).

In this project, the aim is to produce and evaluate a simple implementation of MCTS for a simple game, and to do some experimental evaluations of its performance. Possible games might be Connect4 (probably a good choice) or 9x9 capture Go (requires some subtle programming).

Early Deliverables

Proof of concept programs: bandit problem solver ('bandit problems' are particularly trivial statistical games, which you should tackle first), with experiments showing how upper-confidence-bound (UCB) methods compare to naive strategies.
Proof of concept programs: Program that enables 2 human players to play the selected game (no AI moves).
Reports: UCB method for bandit problems.
Reports: Rules of the game, expressed semi-formally as state and allowable transitions.
Reports: Algorithm for MCTS.

Final Deliverables

The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles
The program should have a GUI for game play, for the case of two human players, and one player against the computer.
The report should include experimental assessments of effects on program performance of parameters of MCTS (depth of search, number of rollouts used, etc).
The report will describe the software engineering process involved in generating your software.
The report will have sections on: the theory of bandit processes, and the algorithm for MCTS; the game used, the software engineering process in development of the program; an explanation and justification of design decisions for the GUI; and an experimental analysis of the performance of MCTS.

Suggested Extensions

A possible enhancement would be a comparison of MCTS with alpha-beta pruning.
The project could be done for a relatively simple game ( such as Connect4 ) or for 9x9 capture Go, which adds some programming complexity.

Genetic Algorithms

Aims: To implement a GA (or, preferably, a range of variations of the GA) and to apply it to various optimisation problems (e.g. constraint satisfaction). The program should follow a model-view-controller design pattern. It should be possible to To design, implement, and evaluate a (range of) genetic algorithms.

Background: A genetic algorithm (GA) is a simulated model of evolution: techniques of this type are widely used to solve optimisation problems. "Solve" is perhaps the wrong word here: 'try to improve a solution' might be better, for although these algorithms are models of evolution, they often don't work very well. GAs are so widely used because they are straightforward to program, and in optimisation problems, the only harm an ineffective algorithm can do is to waste computer time.

Early Deliverables

Proof of concept programs: First implementation of GA for constraint satisfaction/function optimisation.
Report on Design Patterns.
Report on Genetic Algorithms.
Report on encoding various problems for GAs.
Report on the Theory of coalescence and genetic drift.

Final Deliverables

The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles
The program will at least enable a user to interactively view and control GA optimisation for (at least one) optimisation problem.
The program must have a text based control file from which it can be configured.
The program will have a Graphical User Interface that can be used to animate simulations.
The report will describe the software engineering process involved in generating your software.
The report will include a survey of optimisation methods, including GAs.
The report will include some results of experiments that compare different optimisation methods applied to the same problems.
The report will describe interesting programming techniques and data structures used (or able to be used) on the project.

Suggested Extensions

This is an open-ended project that requires a substantial amount of programming. The basic project may be extended in many ways.
One possibility would be to choose an optimisation problem, and then compare a range of optimisation techniques for it, including GAs.

Handwritten Greek characters recognition

Background:

Until today, Greek text search facility is far from perfect. Hence, it is interesting to apply machine learning techniques and compare their accuracies for such task.

Early Deliverables

Proof of concept program: identify one machine learning task suitable for characters recognition and implement one learning algorithm for the chosen task.
Report: An overview of the datasets. Statistics and numbers involved.
Report: A report on machine learning tasks for character recognition.

Final Deliverables

Survey of machine learning techniques used for characters recognition.
The developed machine learning software for Greek characters recognition.
The software engineering process involved in generating your software.

Suggested Extensions

Provide predictions with confidence measure.

Reading

Gatos, Basilis, et al. GRPOLY-DB: An old Greek polytonic document image database. Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015.
http://www.iit.demokritos.gr/~nstam/GRPOLY-DB

Prerequisites: Taking CS5920 Machine Learning would be useful.

Handwritten Greek characters recognition

Background:

Until today, Greek text search facility is far from perfect. Hence, it is interesting to apply machine learning techniques and compare their accuracies for such task.

Early Deliverables

Proof of concept program: identify one machine learning task suitable for characters recognition and implement one learning algorithm for the chosen task.
Report: An overview of the datasets. Statistics and numbers involved.
Report: A report on machine learning tasks for character recognition.

Final Deliverables

Survey of machine learning techniques used for characters recognition.
The developed machine learning software for Greek characters recognition.
The software engineering process involved in generating your software.

Suggested Extensions

Provide predictions with confidence measure.

Reading

Gatos, Basilis, et al. GRPOLY-DB: An old Greek polytonic document image database. Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015.
http://www.iit.demokritos.gr/~nstam/GRPOLY-DB

Prerequisites: Taking CS5920 Machine Learning would be useful.

Housing Prices

Aims: The aim of the project is to apply multiple machine learning techniques for predicting housing prices and compare their performance.

Background:

The first task is to work out house prices in a batch setup when the time is not available (or ignored). We can compare learning methods and study the effect of parameters on the test error.

The project is recommended for Computational Finance students.

Early Deliverables

Proof of concept program pre-processing the data and applying two machine learning methods to Boston Housing dataset
Report: A report describing supervised learning, the concepts 'training set' and 'test set', and the theory of the methods.

Final Deliverables

A machine learning method (such as Ridge Regression; the choice is by discussion with the supervisor) should be implemented from scratch.
Several different datasets should be used
Tests comparing different kernels, parameters, etc should be performed. The results should be visualised.
Prediction in the online mode should be implemented and results visuaised.
The report will describe different algorithms and implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions

Suggested Extensions

Methods of prediction with expert advice can be applied in the online mode.
Is there inflation? A study can be undertaken.

Reading

Boston housing dataset on Kaggle: https://www.kaggle.com/prasadperera/the-boston-housing-dataset
California housing dataset on Kaggle: https://www.kaggle.com/camnugent/california-housing-prices
De Cock, Dean. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19.3 (2011).
Y.Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
J.Shawe-Taylor and N.Cristianini, Kernel Methods for Pattern Analysis Cambridge University Press, 2004
B.Schlkopf and A.J.Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Housing Prices

Aims: The aim of the project is to apply multiple machine learning techniques for predicting housing prices and compare their performance.

Background:

The first task is to work out house prices in a batch setup when the time is not available (or ignored). We can compare learning methods and study the effect of parameters on the test error.

The project is recommended for Computational Finance students.

Early Deliverables

Proof of concept program pre-processing the data and applying two machine learning methods to Boston Housing dataset
Report: A report describing supervised learning, the concepts 'training set' and 'test set', and the theory of the methods.

Final Deliverables

A machine learning method (such as Ridge Regression; the choice is by discussion with the supervisor) should be implemented from scratch.
Several different datasets should be used
Tests comparing different kernels, parameters, etc should be performed. The results should be visualised.
Prediction in the online mode should be implemented and results visuaised.
The report will describe different algorithms and implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions

Suggested Extensions

Methods of prediction with expert advice can be applied in the online mode.
Is there inflation? A study can be undertaken.

Reading

Boston housing dataset on Kaggle: https://www.kaggle.com/prasadperera/the-boston-housing-dataset
California housing dataset on Kaggle: https://www.kaggle.com/camnugent/california-housing-prices
De Cock, Dean. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education 19.3 (2011).
Y.Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
J.Shawe-Taylor and N.Cristianini, Kernel Methods for Pattern Analysis Cambridge University Press, 2004
B.Schlkopf and A.J.Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Identifying similarities between text and the Gene Ontology

Aims: To build a tool that will identify similarities between text and the descriptions in the Gene Ontology.

Background:

This project will require the student to develop an equivalent of GOCat.

Early Deliverables

A report on Gene Ontologies.
A report on the tool GOCat explaining the methods.
A report on the data set required to build the tool.

Final Deliverables

The development of a stand alone tool to implement the equivalent of GOCat.
An API to query the tool.
Rigourous testing of the tool.

Suggested Extensions

Using more advanced methods beyond kNN for the implementation.
An API to query the tool.

Reading

Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Gobeill J Pasche E Vishnyakova D Ruch P DOI 10.1093/database/bat041
A survey on annotation tools for the biomedical literature. Neves M Leser U DOI 10.1093/bib/bbs084
The GOA database in 2009--an integrated Gene Ontology Annotation resource Barrell D Dimmer E Huntley R Binns D O'Donovan C Apweiler R dx.doi.org/10.1093/nar/gkn803

Prerequisites:

Understanding of machine learning methods such as kNN.

The student must have a good understanding of R and/or Python.

Identifying similarities between text and the Gene Ontology

Aims: To build a tool that will identify similarities between text and the descriptions in the Gene Ontology.

Background:

This project will require the student to develop an equivalent of GOCat.

Early Deliverables

A report on Gene Ontologies.
A report on the tool GOCat explaining the methods.
A report on the data set required to build the tool.

Final Deliverables

The development of a stand alone tool to implement the equivalent of GOCat.
An API to query the tool.
Rigourous testing of the tool.

Suggested Extensions

Using more advanced methods beyond kNN for the implementation.
An API to query the tool.

Reading

Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases. Gobeill J Pasche E Vishnyakova D Ruch P DOI 10.1093/database/bat041
A survey on annotation tools for the biomedical literature. Neves M Leser U DOI 10.1093/bib/bbs084
The GOA database in 2009--an integrated Gene Ontology Annotation resource Barrell D Dimmer E Huntley R Binns D O'Donovan C Apweiler R dx.doi.org/10.1093/nar/gkn803

Prerequisites:

Understanding of machine learning methods such as kNN.

The student must have a good understanding of R and/or Python.

Image and video classification

Aims: To implement object detection, image and video classification using deep learning and machine learning techniques.

Early Deliverables

Implementation of transfer learning algorithms based on pre-trained networks for any of the following tasks, (1) object detection/classification, (2) image segmentation, and (3) video classification.
A report describing transfer learning algorithms implemented with some evaluation results.

Final Deliverables

Implementing transfer learning, deep learning (e.g. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)), or hybrid deep networks for any of the aforementioned tasks.
Implementing several deep learning algorithms with different architectures, where the employed architectures are either existing, or newly proposed.
Implementing manual hyper-parameter fine-tuning of the implemented deep networks.
A comprehensive evaluation and comparison against existing state-of-the-art studies.
The final report will describe methodology, implementation and testing with theoretical analysis and detailed evaluation results.

Suggested Extensions

Implementing deep hybrid networks with new architectures.
Implementing optimal hyper-parameter selection using an automatic process.
Theoretical analysis of the algorithms.

Reading

Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning. MIT press.
Chollet, F., 2021. Deep learning with Python.
Journal/conference papers.

Image and video classification

Aims: To implement object detection, image and video classification using deep learning and machine learning techniques.

Early Deliverables

Implementation of transfer learning algorithms based on pre-trained networks for any of the following tasks, (1) object detection/classification, (2) image segmentation, and (3) video classification.
A report describing transfer learning algorithms implemented with some evaluation results.

Final Deliverables

Implementing transfer learning, deep learning (e.g. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)), or hybrid deep networks for any of the aforementioned tasks.
Implementing several deep learning algorithms with different architectures, where the employed architectures are either existing, or newly proposed.
Implementing manual hyper-parameter fine-tuning of the implemented deep networks.
A comprehensive evaluation and comparison against existing state-of-the-art studies.
The final report will describe methodology, implementation and testing with theoretical analysis and detailed evaluation results.

Suggested Extensions

Implementing deep hybrid networks with new architectures.
Implementing optimal hyper-parameter selection using an automatic process.
Theoretical analysis of the algorithms.

Reading

Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep learning. MIT press.
Chollet, F., 2021. Deep learning with Python.
Journal/conference papers.

Image-based malware classification using neural networks

Aims: Evaluating the efficacy of neural networks in classifying image-based representation of malware.

Background: Classical malware detection approaches involve extracting some binary features/signatures from the malware via static and dynamic analysis. However, most of existing approaches suffer from a number of issues, e.g. scalability, obfuscation / evasion. The goal of the project is to ascertain the practical value of approaching the malware classification problem as a computer vision task. The basis for this project is the observation that if malware binaries are plotted as grayscale images, the textural and structural patterns can be used to effectively classify binaries as either benign or malicious, as well as cluster malicious binaries into respective threat types (e.g., cryptomining, ransomware). The project is mainly focused on evaluating the effectiveness and efficacy of convolutional neural networks and deep neural networks for classifying imaged-based representation of malware into malware types.

Prerequisites:

Basics of malware, knowledge of Python/C, some notions of machine learning.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub)

Early Deliverables

A report describing techniques/methods to convert a binary into an image
A preliminary tool to convert binaries into (grayscale) images
Creation of a reasonably large dataset composed of different types of malware binaries
A preliminary set of tests with neural networks (CNN, DNN) to classify a subset of the dataset based on their image representation

Final Deliverables

Leveraging neural networks for classifying malware into families
Describe in detail the design/architecture of the system
Test the system on the entire dataset and discuss the results (efficacy, efficiency) and pros/cons of each technique
Suggestions for improving the proposed system

Reading

https://dl.acm.org/doi/pdf/10.1145/2016904.2016908
https://www.intel.com/content/www/us/en/artificial-intelligence/documents/stamina-deep-learning-for-malware-protection-whitepaper.html
https://arxiv.org/pdf/1812.07606.pdf
https://www.blackhat.com/docs/us-15/materials/us-15-Davis-Deep-Learning-On-Disassembly.pdf
https://arxiv.org/pdf/1903.11551.pdf
https://www.covert.io/research-papers/deep-learning-security/Convolutional%20Neural%20Networks%20for%20Malware%20Classification.pdf
https://dl.acm.org/doi/pdf/10.1145/2046684.2046689

Implementing lattice-based cryptography

Aims: Implementing a lattice-based public-key encryption scheme

Background: Post-quantum cryptography refers to cryptographic schemes that are believed to be secure against an adversary equipped with a sufficiently large quantum computer. In contrast, cryptosystems based on problems like factoring or solving discrete logarithms, which are commonly used today, would no longer be secure in this setting. Lattice-based cryptographic schemes are among the most prominent candidates in the ongoing effort by NIST to standardise post-quantum Public-Key Encryption schemes (PKEs), Key Encapsulation Mechanisms (KEMs), and digital signatures. The Learning with Errors (LWE) problem (and its variants) are examples of hard problems on which lattice-based cryptographic schemes are based.

Prerequisites: Some undergraduate mathematical background e.g. linear algebra, number theory as well as knowledge of C++ programming is useful.

Early Deliverables

A report describing some hard problems on lattices and how they relate
A report motivating the use of lattice-based cryptography
Proof of concept implementation of Regev's LWE-based PKE (perhaps using mathematical software e.g. Sage)

Final Deliverables

Implementation in C++ of Lindner-Peikert IND-CPA secure PKE based on LWE, Ring-LWE, or Module-LWE (instead of C++, also a different programming language can be used if approved by the project supervisor)
Use IND-CPA secure PKE as a building block to implement IND-CCA secure KEM
Report discussing LWE variants and justification of choice of underlying hardness assumption
Report explaining relevant cryptographic concepts e.g. notions of security, definitions of PKE and KEM and related primitives

Reading

https://csrc.nist.gov/Projects/post-quantum-cryptography/Post-Quantum-Cryptography-Standardization
https://cims.nyu.edu/~regev/papers/lwesurvey.pdf
http://portal.acm.org/citation.cfm?id=1060590.1060603
https://www.maths.ox.ac.uk/system/files/attachments/sage-introduction.pdf
https://web.eecs.umich.edu/~cpeikert/pubs/lwe-analysis.pdf
https://repo.zenk-security.com/Cryptographie%20.%20Algorithmes%20.%20Steganographie/Introduction%20to%20Modern%20Cryptography.pdf

Implementing the PKCS#1 v1.5 Signature Scheme

Aims: Implementing the PKCS#1 v1.5 Signature Scheme with provably secure parameters.

Background: The PKCS#1 v1.5 signature scheme is one of the most widely used signature schemes in practice, mainly due to it's use in TLS. While it has been used for decades, there was no security proof for the scheme, which cast some doubt on its usability. Although RSA-PSS has been suggested as a replacement, it is randomized and is more computationally expensive, so keeping PKCS#1 is preferable. in 2018 Jager, Kakvi and May presented a proof of security for the PKCS#1 v1.5 signature schemes, but only if larger parameters are used. The question to be answered is how much of an overhead is introduced by using the provably secure parameters.

Prerequisites: Some undergraduate mathematical background e.g. linear algebra, number theory is useful. Experience implementing programs that work with large numbers is a bonus.

Early Deliverables

A report describing provable security and how it affects parameter choices.
A report motivating the use of provably secure parameters.
Proof of concept implementation of parameterizable RSA key generation.

Final Deliverables

An implementation of PKCS#1 v1.5 signatures scheme with provably secure and standard parameters.
A benchmarking program to compare the scheme with provably secure parameters and standard ones.
Report explaining the implementation of RSA based schemes.
Report explaining relevant cryptographic concepts e.g. definitions of digital signatures, notions of security, etc.

Reading

https://eprint.iacr.org/2018/855.pdf
https://datatracker.ietf.org/doc/html/rfc8017
https://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf
https://www.openssl.org/docs/manmaster/man1/openssl-speed.html

Insider Threat Detection

Aims: Propose and develop a system that detects insider threats.

Background: This project will propose, develop and trial various detection techniques to detect insider threats. This research would be experimental in nature. Further reflection would then be given to how generalisable this result might be and what we might determine about the critical datasets required for this kind of control to be effective. The project itself, depending on student preferences, might morph into modelling of insider threats instead of detection.

Prerequisites:

Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing the state of the art in insider threat detection.
A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied.

Final Deliverables

Design and implement a framework for development of new and existing insider threat detection techniques
Describe in detail the design of tests for the framework
Describe and analyse the features that have been used during the project to detect/stop insider threats
Critically compare the findings with related works

Reading

The CERT insider threat database: https://insights.sei.cmu.edu/insider-threat/2011/08/the-cert-insider-threat-database.html
The CERT insider threat dataset: https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099
Al Tabash and Happa. Insider-threat detection using gaussian mixture models and sensitivity profiles.
Legg et al. Towards a conceptual model and reasoning structure for insider threat detection.
Agrafiotis et al. Identifying attack patterns for insider threat detection.
Rashid et al. A new take on detecting insider threats: exploring the use of hidden markov models.
Legg et al. Caught in the act of an insider attack: detection and assessment of insider threat.

Joint prediction regions for time-series models

Aims: The goal of this project is to develop and evaluate methods to construct joint prediction regions for time-series models.

Background:

From a methodological viewpoint, the key steps of the project will be:

Learn a time-series predictor using a dataset of choice
Devise a method to compute prediction standard errors for your model
Implement the bootstrap algorithm of Wolf and Wunderli, using the above two models.

Early Deliverables

Report 1: bootstrap sampling.
Report 2: overview of chosen case study.
Report 3: prediction regions, JPRs and bootstrap JPRs.
Report 4: time-series models.
Report 5: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
Implement and evaluate predictor and prediction standard errors.

Final Deliverables

Implement code for bootstrap JPRs and evaluate empirical coverage and family-wise error rate for different hyper-parameters (e.g., k, horizon length, sample size, number of bootstrap samples).
Final dissertation describing problem, methods and algorithm, implementation details, computational experiments and their analysis, and conclusions.

Suggested Extensions

1. Compare different predictors.
2. Compare bootstrap JPRs with other methods mentioned in the paper (e.g., joint marginals and Scheffé)
3. Compare bootstrap JPRs with Monte Carlo sampling based on Bayesian neural networks and uncertainty recalibration

Reading

(bootstrap) Chapter 6, Lecture notes of 36-402, CMU Undergraduate Advanced Data Analysis, available at https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
(bootstrap JPRs) Wolf, Michael, and Dan Wunderli. "Bootstrap joint prediction regions." Journal of Time Series Analysis 36.3 (2015): 352-376.
(time-series models) Stock Market Predictions with LSTM in Python, available at https://www.datacamp.com/community/tutorials/lstm-python-stock-market
(time-series models) Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2019). Dive into deep learning, chapters 8 and 9, available at https://d2l.ai/index.html
(extension 3) Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, ICML 2016, available at http://proceedings.mlr.press/v48/gal16.pdf
(extension 3) Kuleshov, Volodymyr, Nathan Fenner, and Stefano Ermon. "Accurate Uncertainties for Deep Learning Using Calibrated Regression." ICML 2018, available at https://arxiv.org/pdf/1807.00263.pdf

Joint prediction regions for time-series models

Aims: The goal of this project is to develop and evaluate methods to construct joint prediction regions for time-series models.

Background:

From a methodological viewpoint, the key steps of the project will be:

Learn a time-series predictor using a dataset of choice
Devise a method to compute prediction standard errors for your model
Implement the bootstrap algorithm of Wolf and Wunderli, using the above two models.

Early Deliverables

Report 1: bootstrap sampling.
Report 2: overview of chosen case study.
Report 3: prediction regions, JPRs and bootstrap JPRs.
Report 4: time-series models.
Report 5: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
Implement and evaluate predictor and prediction standard errors.

Final Deliverables

Implement code for bootstrap JPRs and evaluate empirical coverage and family-wise error rate for different hyper-parameters (e.g., k, horizon length, sample size, number of bootstrap samples).
Final dissertation describing problem, methods and algorithm, implementation details, computational experiments and their analysis, and conclusions.

Suggested Extensions

1. Compare different predictors.
2. Compare bootstrap JPRs with other methods mentioned in the paper (e.g., joint marginals and Scheffé)
3. Compare bootstrap JPRs with Monte Carlo sampling based on Bayesian neural networks and uncertainty recalibration

Reading

(bootstrap) Chapter 6, Lecture notes of 36-402, CMU Undergraduate Advanced Data Analysis, available at https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
(bootstrap JPRs) Wolf, Michael, and Dan Wunderli. "Bootstrap joint prediction regions." Journal of Time Series Analysis 36.3 (2015): 352-376.
(time-series models) Stock Market Predictions with LSTM in Python, available at https://www.datacamp.com/community/tutorials/lstm-python-stock-market
(time-series models) Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2019). Dive into deep learning, chapters 8 and 9, available at https://d2l.ai/index.html
(extension 3) Gal Y. & Ghahramani Z., 2016, Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, ICML 2016, available at http://proceedings.mlr.press/v48/gal16.pdf
(extension 3) Kuleshov, Volodymyr, Nathan Fenner, and Stefano Ermon. "Accurate Uncertainties for Deep Learning Using Calibrated Regression." ICML 2018, available at https://arxiv.org/pdf/1807.00263.pdf

Large-Scale Graph Analytics

Background: The list of suggested problems is as follows. You are welcome to propose and work on problems of your own subject to the supervisor's approval:

Proof-of-Concept:

Compute the number of vertices and edges in a graph
Compute maximum, minimum, and average vertex degrees
Compute and analyse the vertex degree distribution (probability mass function)
Compute the graph diameter (longest shortest path between any pair of nodes)
Compute and analyse the distribution (probability mass function) for the number of friends of a given degree k > 1

Advanced problems:

Tools: Both Giraph and GraphX are available on the department's big data cluster. Other platforms (such as GraphLab and PowerGraph) may also be available.

Early Deliverables

Report on the comparative study of the distributed platforms for processing large-scale graphs.
Implementation and evaluation of the proof-of-concept programs.
Detailed plan for the final project outcomes as agreed with the advisor.

Final Deliverables

The software you built to achieve the agreed project goals. The dissertation will report on different algorithms for the advanced problem you chose comparing their strengths and weaknesses. It will discuss techniques and findings, and provide meaningful visualisation of the results.

Reading

Prerequisites: CS5800 or prior programming experience

Learning models for glucose prediction in diabetes therapy

Aims: The goal of this project is to apply machine learning methods to forecast glucose levels in the context of automated insulin therapy for type-1 diabetes.

Background:

Early Deliverables

Report 1: overview of the artificial pancreas and underlying control algorithms.
Report: review of main models for glucose prediction.
Report: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
Implement code for data processing and for learning neural network-based glucose predictors. Perform training for at least one NN architecture and at least one prediction horizon.

Final Deliverables

Final dissertation describing problem, data, solution method, machine learning methods used, implementation details, computational experiments and their analysis, and conclusions.

Suggested Extensions

Analysis of prediction reliability using techniques like inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout.
Derive optimal insulin control policies using the learned glucose prediction models
Compare the learned deep models with more classical autoregressive models.

Reading

(for report 1) Boughton, Charlotte K., and Roman Hovorka. "Advances in artificial pancreas systems." Science translational medicine 11.484 (2019). PDF available from supervisor
(for reports 1 and 2) Lunze, Katrin, et al. "Blood glucose control algorithms for type 1 diabetic patients: A methodological review." Biomedical signal processing and control 8.2 (2013): 107-119.
(for report 2 and for ideas on neural net architecture) Li, Kezhi, et al. "Convolutional recurrent neural networks for glucose prediction." IEEE Journal of Biomedical and Health Informatics (2019).
(for report 2 and for example of autoregressive model) Sparacino, Giovanni, et al. "Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series." IEEE Transactions on biomedical engineering 54.5 (2007): 931-937.
(for monte carlo dropout) Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
(for inductive conformal prediction) Papadopoulos, Harris. "Inductive conformal prediction: Theory and application to neural networks." Tools in artificial intelligence. IntechOpen, 2008.

Prerequisites: Prerequisites: CS5950 (Deep Learning) is recommended. Good programming skills in Python, MATLAB, or R.

Learning models for glucose prediction in diabetes therapy

Aims: The goal of this project is to apply machine learning methods to forecast glucose levels in the context of automated insulin therapy for type-1 diabetes.

Background:

Early Deliverables

Report 1: overview of the artificial pancreas and underlying control algorithms.
Report: review of main models for glucose prediction.
Report: preliminary plan, including an overview of the planned solution method and experimental evaluation plan.
Implement code for data processing and for learning neural network-based glucose predictors. Perform training for at least one NN architecture and at least one prediction horizon.

Final Deliverables

Final dissertation describing problem, data, solution method, machine learning methods used, implementation details, computational experiments and their analysis, and conclusions.

Suggested Extensions

Analysis of prediction reliability using techniques like inductive conformal prediction or approximate Bayesian inference via Monte Carlo dropout.
Derive optimal insulin control policies using the learned glucose prediction models
Compare the learned deep models with more classical autoregressive models.

Reading

(for report 1) Boughton, Charlotte K., and Roman Hovorka. "Advances in artificial pancreas systems." Science translational medicine 11.484 (2019). PDF available from supervisor
(for reports 1 and 2) Lunze, Katrin, et al. "Blood glucose control algorithms for type 1 diabetic patients: A methodological review." Biomedical signal processing and control 8.2 (2013): 107-119.
(for report 2 and for ideas on neural net architecture) Li, Kezhi, et al. "Convolutional recurrent neural networks for glucose prediction." IEEE Journal of Biomedical and Health Informatics (2019).
(for report 2 and for example of autoregressive model) Sparacino, Giovanni, et al. "Glucose concentration can be predicted ahead in time from continuous glucose monitoring sensor time-series." IEEE Transactions on biomedical engineering 54.5 (2007): 931-937.
(for monte carlo dropout) Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
(for inductive conformal prediction) Papadopoulos, Harris. "Inductive conformal prediction: Theory and application to neural networks." Tools in artificial intelligence. IntechOpen, 2008.

Prerequisites: Prerequisites: CS5950 (Deep Learning) is recommended. Good programming skills in Python, MATLAB, or R.

Lexical Analysis

Aims: To build a tool which reads regular expression definitions and generates a program which 'tokenises' a string according to the regular expression specifications.

Background: Parsers form the basis of most computer interfaces, indeed the first thing most user applications need to do is parse the user's input. In general, the input to a parser is not the character string representation of the input. This string is initially processed into a sequence of tokens, elements of generic classes such as identifier, integer, relational operator etc.

A lexical analyser is a simplified parser which performs this initial processing. The UNIX standard tool LEX is an example of such a program.

This project involves implementing a lexical analyser generator based on finite state automata. The algorithms are well known but implementation details are important in the development of an efficient tool.

For this project you will need programming aptitude. The implementation may be produced using the parser generator tool rdp or from scratch. This will be discussed in the detailed project design phase.

Early Deliverables

Write up of regular expressions, Thompson's construction, subset construction, minimisation.
Implementation design for the lexer generator, with an analysis of the data structures needed.
User interface design, data structure design.
Proof of concept implementation using hand constructed automata.

Final Deliverables

The project report will contain a description of the construction of automata from regular expressions and their combination into a tokeniser, written at a level suitable for a beginning final year student to understand.
An implementation of a lexical analyser generator which can take as input a set of regular expressions, generate a corresponding set of finite state automata, and use these to tokenise an input string.

Lindenmayer systems

Aims: To implement a graphical demonstration of stochastic Lindenmayer systems and use it to display organic-like structures.

Background:

L-systems or Lindenmayer systems (named after their originator, Aristid Lindenmayer) are grammar-based mechanisms for producing geometric figures. Like a conventional context free grammar, an L-system comprises an alphabet and a set of productions. An initial seed string is used to given as a starting sentential form. Strings are transformed by applying the productions to all viable elements, and the results strings are interpreted using a graphics engine.

Recursive rules may be used to define fractal-like objects, and by attaching probabilities to individual productions we can produce images which mimic some aspects of biological systems, including plant growth.

To succeed in this project, you need to be interested in automata and graphics; and you must possess good programming skills including the use of GUI's with Java.

Early Deliverables

A set of small worked manual examples of non-stochastic L-systems.
A Java program which interprets strings and draws sequences of line segments.
Write a report surveying classes of L-systems.

Final Deliverables

Demonstrate a working GUI-based L-system environment.
Extend your system to support stochastic productions and show how they can mimic plant growth.
A final report on the project activities, fully acknowledging and citing sources.

Suggested Extensions

Investigate the relationship between L-systems and Semi-Thue systems.
Extend your system to draw 3-dimensional objects.

Reading

The book 'The algorithmic beauty of plants' available free from http://algorithmicbotany.org/papers/abop/abop.pdf .
Many online resources: start with Wikipedia

Machine Learning for Smart Meter Data

Early Deliverables

Proof of concept program: identify one machine learning task suitable for the smart meter data and implement one learning algorithm for the chosen task.
Report: An overview of the datasets used. Statistics and numbers involved.
Report: A report on machine learning tasks for smart meter data.

Final Deliverables

Survey of machine learning techniques used for smart meter data analysis.
The developed machine learning software for smart meter data.
The software engineering process involved in generating your software.
Experiments with different smart meter data.

Suggested Extensions

Extend the learning algorithms for online setting.
Provide predictions with confidence measure.

Reading

Smart*: An Open Data Set and Tools for Enabling Research in Sustainable Homes. Sean Barker, Aditya Mishra, David Irwin, Emmanuel Cecchet, Prashant Shenoy, and Jeannie Albrecht. Proceedings of the 2012 Workshop on Data Mining Applications in Sustainability (SustKDD 2012), Beijing, China, August 2012.
Smart* Data Set for Sustainability http://traces.cs.umass.edu/index.php/Smart/Smart
SmartMeter Energy Consumption Data in London Households https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
UK Domestic Appliance-Level Electricity (UK-DALE) dataset http://jack-kelly.com/data/
Revealing Household Characteristics from Smart Meter Data. Christian Beckel, Leyna Sadamori, Thorsten Staake, Silvia Santini, Energy, Vol. 78, pp. 397-410, December 2014.
Piotr Mirowski, Sining Chen, Tin Kam Ho, Chun-Nam Yu, Demand Forecasting in Smart Grids, Bell Labs Technical Journal, 18(4), pp.135-158, 2014.

Prerequisites: Java, Machine Learning, CS5100 and CS5200.

Machine Learning for Smart Meter Data

Early Deliverables

Proof of concept program: identify one machine learning task suitable for the smart meter data and implement one learning algorithm for the chosen task.
Report: An overview of the datasets used. Statistics and numbers involved.
Report: A report on machine learning tasks for smart meter data.

Final Deliverables

Survey of machine learning techniques used for smart meter data analysis.
The developed machine learning software for smart meter data.
The software engineering process involved in generating your software.
Experiments with different smart meter data.

Suggested Extensions

Extend the learning algorithms for online setting.
Provide predictions with confidence measure.

Reading

Smart*: An Open Data Set and Tools for Enabling Research in Sustainable Homes. Sean Barker, Aditya Mishra, David Irwin, Emmanuel Cecchet, Prashant Shenoy, and Jeannie Albrecht. Proceedings of the 2012 Workshop on Data Mining Applications in Sustainability (SustKDD 2012), Beijing, China, August 2012.
Smart* Data Set for Sustainability http://traces.cs.umass.edu/index.php/Smart/Smart
SmartMeter Energy Consumption Data in London Households https://data.london.gov.uk/dataset/smartmeter-energy-use-data-in-london-households
UK Domestic Appliance-Level Electricity (UK-DALE) dataset http://jack-kelly.com/data/
Revealing Household Characteristics from Smart Meter Data. Christian Beckel, Leyna Sadamori, Thorsten Staake, Silvia Santini, Energy, Vol. 78, pp. 397-410, December 2014.
Piotr Mirowski, Sining Chen, Tin Kam Ho, Chun-Nam Yu, Demand Forecasting in Smart Grids, Bell Labs Technical Journal, 18(4), pp.135-158, 2014.

Prerequisites: Java, Machine Learning, CS5100 and CS5200.

Machine learning based indoor positioning

Background:

Early Deliverables

Producing a simple proof-of-concept implementation of some classification and regression algorithms from scratch.
A report of the empirical results of such learning algorithms on a small synthetic dataset.

Final Deliverables

A detailed literature review of the machine learning application in WiFi fingerprinting.
Producing a detailed report of the dataset structure, including visual analysis (e.g. t-SNE, UMAP, etc.).
Implementing the algorithms on the entire competition dataset, and optimising their parameters.
Carefully analysing and comparing the performance accuracy of such learning algorithms.

Suggested Extensions

Providing location estimations with some confidence measure.
Extending the learning algorithms to an online setting, where the user provides continuous samples as she traverses the building.

Reading

Bahl, P. and Padmanabhan, V. N. (2000). RADAR: An in-building RF-based user location and tracking system. In INFOCOM 2000. 19th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 2, pp. 775-784).
Honkavirta, V., Perala, T., Ali-Loytty, S., and Piché, R. (2009). A comparative survey of WLAN location fingerprinting methods. In Positioning, Navigation and Communication, 2009. WPNC 2009. 6th Workshop on (pp. 243-251). IEEE.
https://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc

Prerequisites: Taking CS5920 Machine Learning would be beneficial, but is not compulsory. Having good knowledge of Matlab or Python is essential.

Machine learning based indoor positioning

Background:

Early Deliverables

Producing a simple proof-of-concept implementation of some classification and regression algorithms from scratch.
A report of the empirical results of such learning algorithms on a small synthetic dataset.

Final Deliverables

A detailed literature review of the machine learning application in WiFi fingerprinting.
Producing a detailed report of the dataset structure, including visual analysis (e.g. t-SNE, UMAP, etc.).
Implementing the algorithms on the entire competition dataset, and optimising their parameters.
Carefully analysing and comparing the performance accuracy of such learning algorithms.

Suggested Extensions

Providing location estimations with some confidence measure.
Extending the learning algorithms to an online setting, where the user provides continuous samples as she traverses the building.

Reading

Bahl, P. and Padmanabhan, V. N. (2000). RADAR: An in-building RF-based user location and tracking system. In INFOCOM 2000. 19th Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 2, pp. 775-784).
Honkavirta, V., Perala, T., Ali-Loytty, S., and Piché, R. (2009). A comparative survey of WLAN location fingerprinting methods. In Positioning, Navigation and Communication, 2009. WPNC 2009. 6th Workshop on (pp. 243-251). IEEE.
https://archive.ics.uci.edu/ml/datasets/UJIIndoorLoc

Prerequisites: Taking CS5920 Machine Learning would be beneficial, but is not compulsory. Having good knowledge of Matlab or Python is essential.

Massive scale data analytics with Hadoop

Aims: Learn basics of large-scale data analysis and data mining using MapReduce and Hadoop.

Implement Map and Reduce programs for several useful data analytics problems, and analyse their performance.

Prerequisites: Students attempting this project should be good at programming, and comfortable with basic data structures and algorithms. Basic knowledge of statistics and probability will be helpful. Students should be comfortable with Linux command line tools, and shell scripting.

Background: MapReduce, and its open source implementation Hadoop, have become a primary tool for analysing massive data sets on large-scale clusters of commodity machines (such as found in today's cloud infrastructures). The key to its popularity is a simple programming model allowing developers to implement their analytics logic by providing code for just two primitives: Map and Reduce. The resulting MapReduce program is then passed to the runtime environment, which automatically parallelises its execution on a cluster of machines. The developers can tune performance of their MapReduce programs by tweaking values of a number of parameters affecting the degree of parallelism, data locality, and others.

In this project, you are asked to code MapReduce programs for a number of typical analytics problems with varied degree of difficulty, deploy them onto a Hadoop cluster, analyse their performance, and propose an optimal parameter setting.

To test your programs, you will need to generate sample data sets, and/or download them from online sources. Pointers to several such sources will be provided by the advisor.The problems you are asked to solve are as follows:

Given a large collection of web pages, find a number of occurrences of each word. This is a "Hello World!" of MapReduce.
Distributed Grep: Given a large collection of web pages, and a regular expression, find all strings matching the regular expression. The output must be a list of tuples each of which consists of a matching string and its location in the input (e.g., file name, and offset or line number).
Basic statistics analysis of a simple data set: Given a large collection of numbers (e.g., representing times to failure of a software component resulting from a large number of experiments), generate a frequency distribution of the numbers at fixed percentiles (e.g., 10%), find values of various different order statistics, such as minimum, maximum, and median, and visualise the shape of the distribution.
Implement an advanced data mining algorithm (e.g., K-Means, or a classifier) from [4], apply it to discover interesting features in a realistic data set, and visualise the results. The problem, algorithms, and data sets will be chosen in coordination with the advisor.

Early Deliverables

Solution of Problems 1 and 2, their performance analysis, recommendation for cluster configuration and parameter settings, and result visualisation.
Reports: Map and Reduce algorithm description; experience with running your code on a Hadoop cluster; practical problems being encountered, and their solution; experience with performance analysis, and parameter tuning; recommendation for optimal parameter setting for each of the analysed problem.

Final Deliverables

Solution for Problems 3 and 4, their performance analysis, recommendation for cluster configuration and parameter settings, and visualisation of the results.
The report will provide a reflective overview of the MapReduce paradigm, and its Hadoop implementation.
The report will describe implementation of the Map and Reduce primitives for all four problems.
The report will discuss and motivate the choice of the data sets for each of the four problems.
If synthetic data sets were used, the report will outline the procedures used to generate the data sets.
The report will reflect on the experience of running, analysing and tuning your implementation on a Hadoop cluster.

Suggested Extensions

Implement advanced data mining techniques. Apply them to interesting data sets.
Talk to machine learning experts in the department to identify open massive scale analytics problems they are interested in solving, and implement and run them in the Hadoop framework. One such problem can be identifying trends in the epidemics spread using real-time data supplied by mobile devices. If you are interested to look at this problem, you should talk to Chris Watkins and Gregory Chockler.

Reading

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI 2004, pages 137-150, 2004.
Jeffrey Dean and Sanjay Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.
Jeffrey Dean. Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international conference on Parallel architectures and compilation techniques, page 1, New York, NY, USA, 2006. ACM.
Rajaraman, Lesocovec, Ulman. Mining of Massive Datasets. Cambridge University Press. Available online at http://infolab.stanford.edu/~ullman/mmds.html.
MapReduce bibliography by A. Kamil, 2010: http://www.columbia.edu/~ak2834/mapreduce.html.
Hadoop: http://hadoop.apache.org.

Metaheuristics

Aims: To study various metaheuristic approaches, both single-solution and population-based, and implement them for various optimisation problems.

Background: Metaheuristic such as Local Search, Tabu Search, Genetic and Memetic algorithms, are very powerful practical computing tools to obtain near-optimal solutions for various optimisations problems within a limited amount of time and space. (We cannot always guarantee near-optimality, but it's the case for many practical problems.)

Early Deliverables

Report on Basics of Computational Complexity.
First report on Local Search.
Report on Tabu Search.
Implementation of Tabu Search for pseudo-boolean optimisation.

Final Deliverables

Full report on Local Search.
Report on Iterated Local Search.
Implementation of Iterated Local Search for pseudo-boolean optimisation.
Report on Genetic Algorithms.
Implementation of Genetic Algorithm for pseudo-boolean optimisation.
Report on Memetic Algorithms.
Implementation of Memetic Algorithm for pseudo-boolean optimisation.
Graphic User Interface for the four implementations.
Computational comparison of the four implementations.

Suggested Extensions

Implementing metaheuristics for other optimisation problems.

Mitigating Anti-Sandboxing Tricks used by Malware

Aims: Detecting and Mitigating some Evasion Techniques used by Malware

Background: Several malware samples exploit advanced tactics to detect whether they are run in a sandboxed/virtual analysis environment. In such cases, malware samples do not perform any malicious actions to evade analysis and detection by security researchers. The goal of the project is to analyse some targets of anti-sandboxing techniques used by malware (e.g., registry keys, reverse Turing test, loaded libraries, process list) and propose/develop some countermeasures to mitigate these evasion attacks, by testing them on existing evasion tools.

Prerequisites:

Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing evasion techniques used by malware
A report giving an overview of existing detection/mitigation techniques for evasion
A clear workplan describing the set of tests to be performed, and the classes of evasion mechanisms/techniques to be analysed

Final Deliverables

A thorough analysis of evasion techniques, e.g. against sandboxing, virtualised environments and debugging
Describe in detail the design of the tests
Describe and analyse the features that have been used during the project to detect/stop evasion
Critically compare the findings with related works
Suggestions for improving the security of sandboxes with respect to this problem

Reading

A Survey On Automated Dynamic Malware Analysis Evasion and Counter-Evasion: PC, Mobile, and Web
https://github.com/a0rtega/pafish
https://github.com/LordNoteworthy/al-khaser
https://github.com/joesecurity/pafishmacro

Modeling and Inference with Ontologies

Aims: Understanding Web Ontology Language (OWL) modelling with a possible emphasis on its theoretical foundation (within Description Logic); and building a simple model for a real-world scenario using OWL.

Background:

OWL ontologies are basically descriptions of certain enterprises, reminiscent to a database modelling of an enterprise; the advantage of ontologies over databases is the ability to model not only basic individual information but rather semantic connections between entities in the enterprise (e.g., that a certain class of entities is a sub-class of a different class of entities). This gives the user the ability to infer *new* information from existing one (and not only to extract the already existing information).

Ontologies are one of the technologies of the Semantic Web and are currently used extensively in domains such as medicine, biology, and in the general area of digital humanities including archives.

The project will offer students the opportunity to learn some basic concepts behind the rich and fundamental theory of ontologies, namely Description Logic (which are weak versions of first-order logic), and the use of ontology frameworks (for example, Protege) in real (or artificial) data sets.

A successful project will (1) Investigate and understand some basic concepts of the logical theory of Description Logic: syntax and semantics; (2) Define a developed an OWL ontology for a particular chosen domain using the the Protege tool; Or work and possibly extend an existing model, exemplifying sophisticate inferencing in the model; (3) Show the applicability of the ontology as to store and infer new information from existing one. (4) There are two options for this project that a student can take: Option 1: the more logical one; Option 2: the more ontology-engineering one. Both options will have some parts from option 1 and 2, but with a different emphasis.

Early Deliverables

Project plan with tasks and milestones.
An overview of ontologies: what they are, basic technical concepts (RDF,RDFS and basic OWL).
Option 1: An overview of Description Logic that supports ontological reasoning.
Option 2: Gathering of requirements from a user, or justifying such requirements from the perspective of a potential user.
Initial experiments with Protege. And building a preliminary model for the chosen domain. (The project intends to utilise Protege but other available or developed tools for ontology engineering are certainly legitimate.

Final Deliverables

An ontology of the knowledge domain associated with the data set.
Option 2: a web front-end for the data set based on the ontology.
Option 1: A discussion of Description Logic with explicit examples in the ontology constructed.
Option 1: Investigate inferencing algorithms in Description Logic and their run-time complexity.
At least a partial encoding of the data set in the ontology.
Option 1: Exemplifying sophisticated inferencing using Protege (or other selected tools): using restrictions (quantifiers).
Option 2: The report will include examples of usage of the system.
Option 2: The report will include discuss the perceived advantages and disadvantages of using ontology frameworks for representing and querying the chosen data set

Suggested Extensions

Option 2: Connect the ontology to other existing ontologies in related areas.
Option 2: User-friendly interfaces for data representation and querying.
Option 2: Investigate inferencing algorithms in Description Logic and their run-time complexity.
Option 1: User-friendly interfaces for data representation and querying.
Option 1: User-friendly interfaces for data representation and querying.
Option 1: a web front-end for the data set based on the ontology.

Reading

Option 2: Semantic Web for the Working Ontologist. Dean Allemang and Jim Hendler.
Option 1: A Description Logic Primer. Markus Krotzsch, Frantiek Simancik, Ian Horrocks.
Both options: http://protege.stanford.edu

Modelling attacks in network simulations

Aims: Propose and develop a system that provides network simulations (including realistic attacks in them).

Background: The main objective of the project is to model attacker behaviour on attack graphs and evaluate the difference in outcomes from different attacker profiles. There are two main components: a network simulator and the ability to include attacker behaviour in the simulation. A proposed solution is developed to determine the effects of different attacker properties in a simulation. The objectives of the project include: 1) Designing an attacker classification along with templates for different attacker types. 2) Implement a proof of concept for attacker traversal on a network with an attacker that follows a strategy, with limited knowledge and resources. 3) Explore different techniques and properties to see how they affect the attacker's decisions.

Prerequisites:

Good programming skills, particularly C, C++, Python, and knowledge of Windows/Linux internals.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing the state of the art in network simulations.
A clear workplan describing the set of tests to be performed, tools to be implemented and classes of techniques to be proposed and studied.

Final Deliverables

Design and implement a framework for development of new and existing network simulation techniques
Describe in detail the design of tests for the framework
Describe and analyse the features that have been used during the project to simulate networks
Critically compare the findings with related works

Reading

GNS3: https://www.gns3.com/
https://www.sciencedirect.com/topics/computer-science/network-simulation

Modern authentication for mobile and web applications

Aims: Explore how single sign-on authentication and authorization is performed in modern mobile and web applications as well as for 3rd-party integrations and build a multi-user application that employs such mechanisms.

Background:

Modern applications often consists of multiple parts that are connected using Web-based APIs (often following REST principles). A (cloud-based) backend provides persistence and implements business logic, while frontends, such as mobile and web apps, provide user interfaces that interact with the backend services. Often, 3rd-party providers can integrate their services as well, sometimes using the same APIs as the first-party application components.

In such a diverse and distributed setting, authentication and authorization are key to protect resources from unauthorized access. Protocols like OAuth 2.0 and OpenID Connect aim to solve this issue and have found widespread adoption in industry. However, their application is often not trivial and can be easily implemented in flawed ways.

To illustrate how such mechanisms can be applied in a good way, we want to build a mock application that consists of multiple components (such as a backend, a mobile app, and a web interface) and that allows for 3rd-party integration. This application should make use of common authentication and authorization frameworks (using appropriate libraries). Due to the nature of such applications, you will need to develop several components in different programming languages, potentially using different libraries.

Prerequisites:

This is a challenging project that will require good programming skills with proficiency in multiple programming languages.
You need to be able to identify good libraries and integrate these in your project.

Early Deliverables

A skeleton of a mock application (in a field of your choice) with a backend, a RESTful API (for 3rd-party integrations and optionally frontend), and at least one frontend (web interface or mobile app).
A report describing the authentication and authorization mechanisms and best practices and how you plan to integrate these in your application.

Final Deliverables

The full application with integration of authentication and authorization mechanisms.
Complete report

Reading

https://tools.ietf.org/html/rfc6749 (OAuth 2.0)
https://tools.ietf.org/html/rfc6750 (Bearer Token Usage)
https://tools.ietf.org/html/rfc6819 (OAuth 2.0 Threat Model and Security Considerations)
https://tools.ietf.org/html/draft-ietf-oauth-security-topics-21 (Draft OAuth Security BCP)
https://tools.ietf.org/html/draft-ietf-oauth-browser-based-apps-12 (Draft Best Practices for OAuth Browser-Based Apps)
https://openid.net/specs/openid-connect-core-1_0.html (OpenID Connect Core 1.0)
https://www.oauth.com/ (OAuth 2.0 Simplified)
https://www.oauth.net/ (Community page for OAuth 2.0)
Daniel Fett, Ralf Küsters, Guido Schmitz. 2016. A Comprehensive Formal Security Analysis of OAuth 2.0. In: 23rd ACM SIGSAC Conference on Computer and Communications Security (CCS 2016). https://dl.acm.org/doi/pdf/10.1145/2976749.2978385
Daniel Fett, Ralf Küsters, Guido Schmitz. 2017. The Web SSO Standard OpenID Connect: In-Depth Formal Security Analysis and Security Guidelines. In: IEEE 30th Computer Security Foundations Symposium (CSF 2017). https://ieeexplore.ieee.org/iel7/8048777/8049639/08049720.pdf

Natural Language Understanding: Measuring the Semantic Similarity between Sentences

Aims: To implement and design various deep neural networks for measuring the semantic similarities between sentence pairs.

Early Deliverables

Proof of concept program: tf-idf based similarity measurement
Proof of concept program: word-embedding based similarity measurement
(Optional) Proof of concept program: sentence-embedding based similarity measurement.
A report describing and assessing the results of applying the proof of concept programmes on different sentence pairs.

Final Deliverables

Fine-tune existing neural models (e.g. BERT), and evaluate whether they are better at dealing problems (i) and (ii) than other models.
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Constructing more training sentence pairs, for example by using Google Translate: given a sentence in English (denoted S), first translate it into another language (denote the translated sentence by S'), and then translate S' back to English (denote the result as S''); we should assume that S' and S'' should be similar sentences.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv pre-print 1810.04805

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

Natural Language Understanding: Measuring the Semantic Similarity between Sentences

Aims: To implement and design various deep neural networks for measuring the semantic similarities between sentence pairs.

Early Deliverables

Proof of concept program: tf-idf based similarity measurement
Proof of concept program: word-embedding based similarity measurement
(Optional) Proof of concept program: sentence-embedding based similarity measurement.
A report describing and assessing the results of applying the proof of concept programmes on different sentence pairs.

Final Deliverables

Fine-tune existing neural models (e.g. BERT), and evaluate whether they are better at dealing problems (i) and (ii) than other models.
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Constructing more training sentence pairs, for example by using Google Translate: given a sentence in English (denoted S), first translate it into another language (denote the translated sentence by S'), and then translate S' back to English (denote the result as S''); we should assume that S' and S'' should be similar sentences.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv pre-print 1810.04805

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

Natural Language Understanding: Measuring the Semantic Similarity between Sentences

Aims: To implement and design various deep neural networks for measuring the semantic similarities between sentence pairs.

Early Deliverables

Proof of concept program: tf-idf based similarity measurement
Proof of concept program: word-embedding based similarity measurement
(Optional) Proof of concept program: sentence-embedding based similarity measurement.
A report describing and assessing the results of applying the proof of concept programmes on different sentence pairs.

Final Deliverables

Fine-tune existing neural models (e.g. BERT), and evaluate whether they are better at dealing problems (i) and (ii) than other models.
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Constructing more training sentence pairs, for example by using Google Translate: given a sentence in English (denoted S), first translate it into another language (denote the translated sentence by S'), and then translate S' back to English (denote the result as S''); we should assume that S' and S'' should be similar sentences.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv pre-print 1810.04805

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

Neural based Document Summarisation

Aims: To implement and design various deep neural networks for generating short summaries for long documents.

Early Deliverables

Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
(Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
A report describing and assessing the results of applying the proof of concept programmes on data.

Final Deliverables

Implement an LSTM-based Encoder-Decoder network and train it with a few hundred thousand of real data
Augment the Encoder-Decoder network with Pointer Mechanism and Attention Mechanism, compare their performance against vanilla Encoder-Decoder
(Optional) Implement Transformer-based Encoder-Decoder, compare its performance against the other models
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Applying Reinforcement Learning (RL) on top of the Encoder-Decoder models. Recent research suggests that applying RL further boost the performance.
Using Transfer Learning techniques to adjust existing language models, e.g. GPT-2, to generate summaries.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
A. See, P. J. Liu and C. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", Proceedings of the 55th ACL Annual Conference

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

Neural based Document Summarisation

Aims: To implement and design various deep neural networks for generating short summaries for long documents.

Early Deliverables

Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
(Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
A report describing and assessing the results of applying the proof of concept programmes on data.

Final Deliverables

Implement an LSTM-based Encoder-Decoder network and train it with a few hundred thousand of real data
Augment the Encoder-Decoder network with Pointer Mechanism and Attention Mechanism, compare their performance against vanilla Encoder-Decoder
(Optional) Implement Transformer-based Encoder-Decoder, compare its performance against the other models
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Applying Reinforcement Learning (RL) on top of the Encoder-Decoder models. Recent research suggests that applying RL further boost the performance.
Using Transfer Learning techniques to adjust existing language models, e.g. GPT-2, to generate summaries.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
A. See, P. J. Liu and C. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", Proceedings of the 55th ACL Annual Conference

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

Neural based Document Summarisation

Aims: To implement and design various deep neural networks for generating short summaries for long documents.

Early Deliverables

Proof of concept program: Long Short Term Memory (LSTM) Neural Network for text encoding.
(Optional) Proof of concept program: Attention Mechanism on top of (LSTM).
A report describing and assessing the results of applying the proof of concept programmes on data.

Final Deliverables

Implement an LSTM-based Encoder-Decoder network and train it with a few hundred thousand of real data
Augment the Encoder-Decoder network with Pointer Mechanism and Attention Mechanism, compare their performance against vanilla Encoder-Decoder
(Optional) Implement Transformer-based Encoder-Decoder, compare its performance against the other models
A report summarising all the experiments and analysing strengths and weaknesses of each model

Suggested Extensions

Applying Reinforcement Learning (RL) on top of the Encoder-Decoder models. Recent research suggests that applying RL further boost the performance.
Using Transfer Learning techniques to adjust existing language models, e.g. GPT-2, to generate summaries.

Reading

Neural Network Methods for Natural Language Processing, by Y. Goldberg, Morgan and Claypool 2017
Deep Learning, by I. Goodfellow, Y. Bengio and A. Courville, MIT Press 2016
A. See, P. J. Liu and C. Manning, "Get To The Point: Summarization with Pointer-Generator Networks", Proceedings of the 55th ACL Annual Conference

Prerequisites:

CS5920 (Computer Learning) would be advantageous (but not required).

CS5950 (Deep Learning) is required.

CS4990/CS5990 (Natural Language Processing) is required.

This project will require some mathematical understanding; it is not simply programming.

Offline HTML5 Maps Application

Aims: To produce an offline mapping application using recent HTML5 offline technologies such as Web Storage, Service Workers, Cache, and the IndexedDB. The application will be based on Open Street Map (OSM) data retrieved directly from the web or previously stored in a local database, and then rendered in realtime.

Background: Recent developments in HTML5 include the facility for producing "web" applications that work offline - that is, when no internet connection is present. This technology can be used to produce web-pages that can also be used as apps on mobile phones of all kinds.

Mapping applications are hugely popular on mobile devices, and many of us would be lost without them. In most cases - such as Google Maps - the maps are represented as image tiles that must be cached for offline use. Searching or route planning requires a data connection.

An alternative representation is "vector maps" where the actual location data of the roads and buildings is stored and rendered in realtime (one such app is "osmand"). Because the actual map data is available, it can be searched and presented as the user wishes without a data connection.

This project will experiment with new HTML5 technologies to produce a vector map application.

Early Deliverables

Proof of concept programs: A "hello world" offline HTML5 application using Cache and Service Workers.
Proof of concept programs: A "todo list" application using IndexedDB or Web Storage.
Proof of concept programs: An application to draw shapes using the HTML5 canvas.
Proof of concept programs: A web page that loads and lists Open Street Map data in raw form.
Reports: Basic web page development in HTML5: HTML, JavaScript, CSS.
Reports: Advanced technologies: jQuery, HTML5 canvas.
Reports: Developing an offline HTML5 application: Web Storage, Indexed DB, Service Workers and the Cache API.
Reports: Open Street Map data representation and vector vs. image tile maps.

Final Deliverables

The program will work offline.
The program will allow the user to load and display map data.
The program will allow the user to zoom and move around the map.
Features for improving rendering performance such as database indexing and image caching.

Suggested Extensions

Allowing the user to search for interesting features.
Allowing the dynamic display of different kinds of data (E.g. highlight cafes on screen).
The support of different OSM data formats.
Downloading map data dynamically when a connection is present.

Reading

Web Development Tutorials: http://www.w3schools.com/
HTML5 Canvas: https://www.w3schools.com/graphics/canvas_intro.asp
Service Workers: https://developer.mozilla.org/en-US/docs/Web/API/Service_Worker_API/Using_Service_Workers
Cache API: https://developer.mozilla.org/en-US/docs/Web/API/Cache
IndexedDB: https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API
Web Storage: https://developer.mozilla.org/en-US/docs/Web/API/Web_Storage_API/Using_the_Web_Storage_API
Open Street Map: http://wiki.openstreetmap.org/wiki/Develop

Ontologies

Aims: To design and implement an ontology–based front end for a real data set.

Background: Ontologies are structural frameworks for organizing and querying data through logical inference mechanisms that go beyond what typical databases or spreadsheets support. Ontologies are one of the technologies of the Semantic Web and are currently used extensively in domains such as medicine, biology, and in the general area of "digital humanities” including archives. One of the advantages of ontology frameworks is that they enable different sets of data to be connected, what is usually referred to as "linked data".

The project will offer students the opportunity to learn about and use ontology frameworks (for example, Protégé) in real data sets available at Royal Holloway.

A successful project will (1) define an ontology for a particular domain; (2) represent an existing data set using that ontology; (3) develop a front-end to that data set that supports "intelligent querying".

Early Deliverables

Project plan with tasks and milestones.
An overview of the mechanisms that support ontological reasoning - description logic.
Gathering of requirements from a real user.
Initial experiments with an ontology language.

Final Deliverables

An ontology of the knowledge domain associated with the data set.
At least a partial encoding of the data set in the ontology.
A web front-end for the data set based on the ontology.
The report will include a justification for the chosen ontology representation based on the requirements of the user.
The report will include examples of usage of the system.
The report will include discuss the perceived advantages and disadvantages of using ontology frameworks for representing and querying the chosen data set.

Suggested Extensions

Connect the ontology to other existing ontologies in related areas.
User-friendly interfaces for data representation and querying.

Reading

http://protege.stanford.edu
http://en.wikipedia.org/wiki/Ontology_(information_science)

Parallel Machine Learning Algorithms via GPU

Aims: A GPU consists of thousands of cores. It allows computations in massive parallelism to achieve magnificient speedup (potentially >100x faster than using CPU). The aim of this project is to implement machine learning algorithms using GPU.

Background:

Many machine learning algorithms involve computations that can be run independently in parallel. Examples include: (1) the updates to different coordinates when performing gradient descent (for linear/logistic/ridge regression problems); (2) computing the distances between observations and centroids in K-Means; (3) computing the distances between different pairs of clusters in Agglomerative Hierarchical Clustering. When the datasets are huge (say millions of observations), we really want to perform these computations in parallel, in order to speed up the whole computation process. Thanks to the recent advances of GPU technology, this approach can now be implemented with thousands of cores at a relatively low cost.

In this project, you will learn GPU programming with Python and CUDA (a parallel computing platform and API). You will pick one machine learning algorithms you have learnt in our MSc program, and implement it using GPU.

Note that in GPU programming, you need to implement many computations from first principles, because most library functions (e.g. those in numpy and sklearn) cannot be used in GPU. You are advised to choose this topic only if you are comfortable with the assignments in CS5100 or CS5100J.

Early Deliverables

Be familiar with the machine learning algorithm you will implement. Proof of Concept (PoC) program: implement the algorithm using CPU. This PoC will help you identify the comptuations that can be parallelized in GPU.
Learn GPU programming with Python and CUDA. Proof of Concept programs: (i) compute the elementwise sum of two huge matrices; (ii) compute the product of two huge matrices; (iii) compute the distance between every pair of observations in a huge dataset.

Final Deliverables

Implement the machine learning algorithm using GPU.
A report on comparing the performance of the GPU program against the PoC CPU program.

Suggested Extensions

If you choose to implement stochastic descent algorithm, there are different schemes on how coordinates or observations are sampled. Implement these schemes and compare them.
If you choose to implement K-Means algorithm, consider distance measures other than Euclidean, e.g. Minkowski norm. Then for each cluster, a generalized centroid should be computed instead of the classical centroid. One method to compute the generalized centroid is gradient descent, which can be parallelized using GPU.

Reading

Wes Armour. Course on CUDA Programming on NVIDIA GPUs. Online, first available in 2019.
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers 2004.
Yun Kuen Cheung, Richard Cole and Yixin Tao. Fully Asynchronous Stochastic Coordinate Descent: A Tight Lower Bound on the Parallelism Achieving Linear Speedup. Mathematical Programming Series A 2020.

Prerequisites: You must be comfortable with Python programming. You must be comfortable with implementing algorithms from first principles. Knowing the basics of parallel computing is preferred.

Parallel Machine Learning Algorithms via GPU

Background:

Early Deliverables

Be familiar with the machine learning algorithm you will implement. Proof of Concept (PoC) program: implement the algorithm using CPU. This PoC will help you identify the comptuations that can be parallelized in GPU.
Learn GPU programming with Python and CUDA. Proof of Concept programs: (i) compute the elementwise sum of two huge matrices; (ii) compute the product of two huge matrices; (iii) compute the distance between every pair of observations in a huge dataset.

Final Deliverables

Implement the machine learning algorithm using GPU.
A report on comparing the performance of the GPU program against the PoC CPU program.

Suggested Extensions

If you choose to implement stochastic descent algorithm, there are different schemes on how coordinates or observations are sampled. Implement these schemes and compare them.
If you choose to implement K-Means algorithm, consider distance measures other than Euclidean, e.g. Minkowski norm. Then for each cluster, a generalized centroid should be computed instead of the classical centroid. One method to compute the generalized centroid is gradient descent, which can be parallelized using GPU.

Reading

Wes Armour. Course on CUDA Programming on NVIDIA GPUs. Online, first available in 2019.
Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers 2004.
Yun Kuen Cheung, Richard Cole and Yixin Tao. Fully Asynchronous Stochastic Coordinate Descent: A Tight Lower Bound on the Parallelism Achieving Linear Speedup. Mathematical Programming Series A 2020.

Prerequisites: You must be comfortable with Python programming. You must be comfortable with implementing algorithms from first principles. Knowing the basics of parallel computing is preferred.

Path Scoring in Symbolic Execution-Based Test Generation

Aims: Build and evaluate scoring functions for increasing the effectiveness of an automated test generation framework for JavaScript based on symbolic execution.

Background:

Software testing is one of the most difficult tasks developers and managers are faced with when developing a project. In complex projects new bugs can be introduced often, and traditional methods of testing software such as unit testing have shown limited effectiveness and carry a large development cost.

Symbolic execution is a technique for testing software by automatically running a program multiple times, each time with a different input which causes a program to follow a different control flow. This can be a valuable tool for testing as it allows for the automated exploration of many distinct controls flows of a program.

Prioritising the potential test cases in the queue to be tested can improve the effectiveness of the testing framework and increase the chances of exposing an issue sooner. This is especially important with symbolic execution frameworks, where the total number of tests that could be generated is often infeasibly large. Metrics such as code coverage or number of high risk features encountered can be used as a method for scoring each test case, and the selection of useful metrics is key to improving vulnerability detection.

The project will build upon an existing framework for symbolic execution and test generation for JavaScript, and extend it with advanced path selection algorithms.

Early Deliverables

Code: A simple implementation of a test prioritisation algorithm
Code: Modifying an existing testing framework to expose code metrics for scoring paths
Report: Effectiveness of software testing metrics, e.g., code coverage
Report: Basic mechanics of symbolic execution
Report: Search strategies in symbolic execution

Final Deliverables

Code: A full test prioritisation algorithm for scoring paths based on code metrics
Code: At least two different strategies for paths scoring, e.g., code coverage or test execution time
Report: Implemented algorithms and search strategies
Report: Experimental evaluation and comparison of search strategies
Report: Responsible disclosure of vulnerabilities
Experiments with all search strategies on real world software. Building test harnesses to exercise multiple entry points.

Suggested Extensions

Implement and evaluate search strategies based on genetic algorithms
Implement and evaluate scoring metrics based on estimated bug probabilities

Reading

Patrice Godefroid, Nils Klarlund, Koushik Sen: DART: directed automated random testing. PLDI 2005: 213-223
Cristian Cadar, Daniel Dunbar, Dawson R. Engler: KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. OSDI 2008: 209-224

Prerequisites:

Advanced knowledge of JavaScript

Playing Games and Solving Puzzles Using AI

Aims: To understand simple AI techniques and algorithms and how they can be applied to solve puzzles and play games.

Background:

Many puzzles and games require logical thinking to choose a best move, or to find a hidden pattern. We have moved on a long way from games of patience. What unites them is that they are not time based and are all just for one player. They can all be solved by AI.

We most often use a backtrack based search to find the best move, or the solution. We can employ algorithms to improve backtracking like consistency pruning (e.g.) arc-consistency, look ahead (e.g.) forward checking, efficient backtracking (like backjumping) and variable and value heuristics (like Dynamic Variable Ordering).

If there is a pay off and we are trying to optimise then we can use branch and bound, or alpha/beta pruning, or even expectimax optimization.

In the end we can solve these puzzles quickly, and much better than people.

Some of these games have web based implementations, in which case you can extend the project to play the game automatically, as though you were typing, by using the debugging mode available on all modern browsers.

Most of these games can be read from picture files, using another form of AI, image processing.

A common and hard task for such games is to evaluate their difficulty. This is very hard to do as games that computers are slow at may not be those that people see as hardest. Some games like Sudoku have extensive online, pre-coded games libraries that you can use to try to learn what makes a game difficult.

Sometimes there are online cheat pages that give hints and human algorithms for solving games. Perhaps these should be incorporated into your solver.

The following would all make good examples for this project.

Constraint Satisfaction is one of the leading technologies in Artificial Intelligence for solving search problems. A constraint satisfaction problem instance can be seen as a set of related questions that need to be answered. We cannot answer any particular question without knowing the answers to those questions which affect it. You should research and report on how this technique could be used for your puzzle/game. One goal of the project might be to allow people to play your chosen game off-line. If you are puzzle solving the perhaps you could generate new puzzles, or evaluate and solve existing ones. Here you would stress a beautiful experience rather than extensive background theory.

This approach is applicable to more general AI problems found in machine vision, fleet scheduling, staff rostering, machine shop design, game playing, frequency assignment, vehicle routing etc., Perhaps your solver could be generalised to solve a whole class of such problems.

It is ideal for solving Sudoku. Constraint solvers are relatively easy to implement in Java and can solve even hard Sudoku instances in very little time. In this project you will see that the time taken by a constraint solver (although very fast indeed) really represents the difficulty rating that papers gives the puzzles. It seems that we are similar in behaviour to constraint solvers.

As you can see in the extensions it is sensible to extend your project by choosing to write it for a mobile device. This new set of acquired skills would need groundwork done in the fist term.

Other things that would be big changes and would require significant extra work, but would allow very high scores would be to use machine vision to enter puzzle or to automatically playing the game/solve the puzzles on a public website"

Early Deliverables

Proof of concept programs: Simple colourful GUI with an active button
Proof of concept programs: Relevant Data structures including a priority queue.
Proof of concept programs: Solving the eight queens problem using backtracking
Proof of concept program which solves a simple example of your puzzle.
Reports: Design Patterns for AI and Search.
Report on Backtracking and Recursion.
Reports: Constraint Satisfaction, particularly consistency techniques.
Reports: Techniques and Algorithms used by human solvers.
Reports: User Interface design for a solver
Reports: Complexity, NP hardness and the big O notation. Estimating problem size and hardness of your puzzle.

Final Deliverables

The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles
The program will have a splash screen and at least two other user interaction screens.
The program will have a Graphical User Interface that can be used to generate, load, save, solve and help with puzzle solving/game playing.
The program must be able to produce puzzles that have only one correct solution, or games that have a single best strategy.
The report will describe the software engineering process involved in generating your software.
The report will describe interesting algorithms and programming techniques used (or able to be used) on the project.
The report will discuss the choice of data structures and their performance impact on the solving and generation algorithms.
The report will describe at least a backtracking and recursion solving algorithm and will compare them (including benchmark).
The report will describe the CSP and algorithms useful for solving the CSP including generalised arc consistency, Forward Checking, DVO and Backjumping.

Suggested Extensions

Porting the program to a mobile device
Solving and benchmarking public domain sets of puzzles
Using XML based files to store and load puzzles
Solving more than one kind of puzzle
Using machine vision to enter puzzle
Automatically playing the game/solving the puzzle on a public website
The program gives hints for users when they are stuck, and allows users to step back.
The program supports printing, different screen size, timers etc.,.

Reading

http://ktiml.mff.cuni.cz/~bartak/constraints/

Predictive Power of Social Media Data

Data: You will have access to the 300G Twitter dataset on the department's cloud, or could use an alternative dataset of your own subject to the supervisor's approval.

Tools: The project must use one or a combination of the large-scale data engineering tools studied in CS5234.

Early Deliverables

Precise specification of the project goals as agreed with the advisor along with data collection and analysis tools, and methodologies you plan to use to achieve the specified goals.
Intermediate report and programs showing and reflecting upon initial findings.

Final Deliverables

The software you built to achieve the project goals.
Final report will discuss techniques and findings, and provide meaningful visualisation of the results.

Reading

Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, Noah A. Smith, "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series" In Proceedings of the International AAAI Conference on Weblogs and Social Media (2010)

Prerequisites: CS5234

Prime numbers and cryptosystems

Aims: The project involves implementing basic routines for dealing with prime numbers and then building cryptographic applications using them.

Background: Modern cryptography allows us to perform different types of information exchange over insecure channels. One of these task is to agree on a secret key over a channel where messages can be overheard. This is achieved by Diffie-Hellman protocol. Other tasks include public key and digital signature schemes; RSA key exchange can be used for them. These protocols are of great importance for bank networks.

Most such algorithms are based upon number theory, namely, the intractability of certain problems involving prime numbers.

Early Deliverables

Proof of concept program: An implementation of RSA using standard data types in Java or C++ encrypting and decrypting numbers.
Proof of concept program: A simple program generating keys.
Report: A description of the RSA encryption/decryption procedure.
Report: Examples worked out on paper and checked using the proof of concept programme.

Final Deliverables

The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
The program will use integers of arbitrary size and encrypt and decrypt test messages and files.
An application with a GUI will perform a task such as a secure chat or a secure file transfer over the network.
A combination of RSA and a symmetric encryption scheme will be implemented, where RSA will distribute the keys for the symmetric scheme.
The report will contain an overview of modern cryptography.
The report will describe RSA and number theory issues such as primality checking and factorisation.
The report will describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe the software engineering process involved in generating your software.
The report will contain the data on the running time of the programs.

Suggested Extensions

An advanced symmetric encryption scheme such as DES or AES will be implemented.
Different primality testing and factorisation methods will be implemented. Their performance will be compared and test results reported.
Attacks on RSA and/or the symmetric scheme will be implemented and tested.

Prerequisites:

Taking CS3760, Information Security (or an equivalent 2nd year course) is recommended. Taking CS3580, Advanced Data Communications can be helpful.
Good command of Java or C++.

Privacy-preserving machine learning from homomorphic encryption

Aims: To implement the homomorphic evaluation of a function relevant for secure genomic analysis, based on a recent iDash competition task

Background: A homomorphic encryption (HE) scheme is a type of encryption scheme that enables the evaluation of a function on encrypted data, without requiring access to the secret key. Several open-source libraries exist today that implement one or more homomorphic encryption schemes. Privacy-preserving machine learning is one possible application of homomorphic encryption, for example by applying a neural network to, or performing logistic regression on, encrypted data. Between 2015 and 2019, the annual iDash competition challenged participants to implement a HE-based solution to a particular machine learning task in the application area of genomics.

Prerequisites:

Experience with C++
Being able to autonomously download/compile/install libraries from various repositories (e.g., GitHub).

Early Deliverables

A report describing existing machine-learning applications of homomorphic encryption
The implementation of a homomorphic evaluation of a simple function e.g. weighted average on a small dataset

Final Deliverables

Implement the homomorphic evaluation of a function relevant to machine learning for secure genomics, inspired by recent iDash competition tasks
Justify the scheme / library choice and steps taken to optimise performance
Describe homomorphic encryption scheme, and machine learning task, underlying the solution

Reading

https://homomorphicencryption.org
http://www.humangenomeprivacy.org/2019/
https://simons.berkeley.edu/talks/idash-competition
https://github.com/Microsoft/SEAL

Ranking

Aims: The aim of this project is to implemet techniques used by search engines to rank the output, test them and analyse their performance.

Background: Given a query, a search engine needs to rank documents by relevance so that links to most relevant and authoritative documents could be shown on top of the list.

Benchmark datasets:

LETOR: Learning to Rank for Information Retrieval http://research.microsoft.com/en-us/um/beijing/projects/letor/

Kaggle Learning to Rank 2019 challenge https://www.kaggle.com/c/learning-to-rank-2019/

Yandex Internet Mathematics 2009 competition dataset (archived) https://web.archive.org/web/20150317144535/http://imat2009.yandex.ru/academic/mathematic/2009/en/

Early Deliverables

Proof of concept program: Page rank applied to a small (possibly synthetic) dataset. Its performance and efficiency should be analysed.
Report: An overview of the problem of document ranking and main approaches to it. An overview of PageRank.

Final Deliverables

Two or more different machine learning technique should be applied to a standard benchmark dataset and their performance compared.
The performance of the program should be optimised to enable them to process large amounts of data.
The report will describe the theory of underlying learning algorithms.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their esults and draw conclusions.

Suggested Extensions

Comparison of pointwise, pairwise and listwise approaches.
Use of real Internet dumps such as http://commoncrawl.org/

Reading

Christiane Rousseau, How Google works: Markov chains and eigenvalues, URL http://blog.kleinproject.org/?p=280
Chuan He, Cong Wang, Yi-xin Zhong, and Rui-Fan Li. A survey on learning to rank. Machine Learning and Cybernetics, 2008 International Conference on. DOI 10.1109/ICMLC.2008.4620685.
Tie-Yan Liu, Learning to Rank for Information Retrieval, Berlin, Heidelberg: Springer. 2011

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5920 Computer Learning would be an advantage (but is not required).

Ranking

Aims: The aim of this project is to implemet techniques used by search engines to rank the output, test them and analyse their performance.

Background: Given a query, a search engine needs to rank documents by relevance so that links to most relevant and authoritative documents could be shown on top of the list.

Benchmark datasets:

LETOR: Learning to Rank for Information Retrieval http://research.microsoft.com/en-us/um/beijing/projects/letor/

Kaggle Learning to Rank 2019 challenge https://www.kaggle.com/c/learning-to-rank-2019/

Yandex Internet Mathematics 2009 competition dataset (archived) https://web.archive.org/web/20150317144535/http://imat2009.yandex.ru/academic/mathematic/2009/en/

Early Deliverables

Proof of concept program: Page rank applied to a small (possibly synthetic) dataset. Its performance and efficiency should be analysed.
Report: An overview of the problem of document ranking and main approaches to it. An overview of PageRank.

Final Deliverables

Two or more different machine learning technique should be applied to a standard benchmark dataset and their performance compared.
The performance of the program should be optimised to enable them to process large amounts of data.
The report will describe the theory of underlying learning algorithms.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their esults and draw conclusions.

Suggested Extensions

Comparison of pointwise, pairwise and listwise approaches.
Use of real Internet dumps such as http://commoncrawl.org/

Reading

Christiane Rousseau, How Google works: Markov chains and eigenvalues, URL http://blog.kleinproject.org/?p=280
Chuan He, Cong Wang, Yi-xin Zhong, and Rui-Fan Li. A survey on learning to rank. Machine Learning and Cybernetics, 2008 International Conference on. DOI 10.1109/ICMLC.2008.4620685.
Tie-Yan Liu, Learning to Rank for Information Retrieval, Berlin, Heidelberg: Springer. 2011

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5920 Computer Learning would be an advantage (but is not required).

Real-time data protection of software

Aims: Propose and develop a system that detects GDPR and NDA violations.

Background: The General Data Protection Regulation (GDPR) and Non-Disclosure Agreements (NDA) exist to limit data from being leaked to unintended audiences. While many privacy engineering frameworks exist (see reading material for examples) to enable software engineers to meet the requirements of regulation and agreements, these still remain largely text-based, and it is up to companies to implement ad hoc systems or policies as countermeasures to limit information sharing compliance violations. In this project, the student will propose and develop a misuse-detection system that focuses on identifying personal data being shared between in systems, and whether the data ought to have been shared in the first place (and report it if not). Unlike traditional misuse detection systems, this project focuses examines the appropriate sharing of plain text, images and media files by automated behaviour of systems and humans, instead of network packets and system calls.

Prerequisites:

Good programming skills, particularly C/C++, Python, and knowledge of Windows/Linux internals.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing the state of the art in privacy engineering and detection of data protection violations.
A clear workplan describing the set of tests to be performed, tools to be implemented and classes of detection mechanisms/techniques to be proposed and studied.

Final Deliverables

Design and implement a framework for development of new and existing techniques to detect information sharing compliance violations
Describe in detail the design of tests
Describe and analyse the features that have been used during the project to detect/stop violations
Critically compare the findings with related works

Reading

Adam Barth, Anupam Datta, John C Mitchell, and Helen Nissenbaum. 2006. Privacy and contextual integrity: Framework and applications. In Security and Privacy, 2006 IEEE Symposium on. IEEE, pp. 184–198
David Basin, Søren Debois, and Thomas Hildebrandt. 2018. On Purpose and by Necessity: Compliance under the GDPR. Presentation.
David Basin, Felix Klaedtke, Srdjan Marinovic, and Eugen Zalinescu. 2015. Monitoring of temporal first-order properties with aggregations. Formal methods in system design 46, 3 (2015), 262–285.
Sean Brooks, Sean Brooks, Michael Garcia, Naomi Lefkovitz, Suzanne Lightman, and Ellen Nadeau. 2017. An Introduction to Privacy Engineering and Risk Management in Federal Systems. US Department of Commerce, National Institute of Standards and Technology.
Ann Cavoukian et al. 2014. Privacy by design documentation for software engineers version 1.0. PbD-SE) Burlington, MA: Organization for the Advancement of Structured Information Standards (OASIS), work in progress (2014).
Anupam Datta, Jeremiah Blocki, Nicolas Christin, Henry DeYoung, Deepak Garg, Limin Jia, Dilsun Kaynar, and Arunesh Sinha. 2011. Understanding and protecting privacy: Formal semantics and principled audit mechanisms. In International Conference on Information Systems Security. Springer, 1–27.
European Commission. 2018. General Data Protection Regulation. https://ec.europa.eu/commission/priorities/justice-and-fundamentalrights/data-protection/2018-reform-eu-data-protection-rules_en
Information Exchange Policy 1.0. https://www.first.org/iep/.
Privacy Management Reference Model (PMRM) https://www.oasis-open.org/committees/pmrm/
Gina Fisk, Calvin Ardi, Neale Pickett, John Heidemann, Mike Fisk, and Christos Papadopoulos. 2015. Privacy principles for sharing cyber security data. In Security and Privacy Workshops (SPW), 2015 IEEE. IEEE, 193–197.
Jassim Happa, Nick Moffat, Michael Goldsmith, and Sadie Creese. 2018. Run-Time Monitoring of Data-Handling Violations. In Computer Security. Springer, 213–232.
Bert-Jaap Koops and Ronald Leenes. 2014. Privacy regulation cannot be hardcoded. A critical comment on the 'privacy by design' provision in data-protection law. International Review of Law, Computers and Technology 28, 2 (2014), 159–171.
Lauren B Movius and Nathalie Krup. 2009. US and EU privacy policy: comparison of regulatory approaches. International Journal of Communication 3 (2009), 19.
Stuart Shapiro, Nickyra Washington, Kristen Miller, Julie Snyder, and Julie McEwen. 2014. MITRE Privacy Engineering Framework. MITRE Corporation Technical Report.
Valeria Soto-Mendoza, Patricia Serrano-Alvarado, Emmanuel Desmontils, and Jose Antonio Garcia-Macias. 2015. Policies composition based on data usage context. In Sixth International Workshop on Consuming Linked Data (COLD2015) at ISWC.
UK Information Commissioner's Office. 2016. How do we apply legitimate interests in practice? https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/legitimate-interests/how-do-we-apply-legitimate-interests-in-practice/.

Regression Algorithms for Learning

Aims: The aim of the project is to implement a popular machine learning technique called ridge regression, apply it to real-world data, analyse its performance, analyse its performance, and compare it against other methods.

Background:

Regression algorithms play an important role in machine learning. They allow us to learn dependencies between multidimensional attributes and continuous outcomes. Regression algorithms can be applied to different domains.

An example is the Boston housing problem. The Boston housing database consists of records describing houses in Boston (to be precise, average houses in certain neighbourhoods). For each house the values of attributes such as the number of rooms, distance from the city centre, quality of schools etc are known and the price, as evaluated by an expert estate agent, is given. The goal of the learner is to determine the dependency between attributes and prices and to predict the price of a house using the values of the attributes. The program is first shown a number of training examples so that it could learn the dependency and then is tested on test examples. Ridge Regression is known to perform well on this dataset.

Another example is time series data. A significant amount of real world data is temporal in nature, in that the values of the variables are sampled at multiple points over some time period, so that the data stored for each entity has an additional 'time' dimension. Such data are called time series. Examples of time series are the daily closing value of the FTSE 100 index or network traffic measurement on a router. Time series forecasting is the use of a model to predict future values based on previously observed values and regression algorithms can be used here.

Applying a learning algorithm to data is not a straightforward task. One needs to preprocess the data and choose parameters to optimise the performance. The algorithms need to be implemented in such a way so as to run fast and not to suffer from numerical error. Making sense of the results and comparing different algorithms in a fair meaningful way can be tricky. Does one algorithm outperforms another consistently, or is it a mere coincidence?

Early Deliverables

Report: An overview of ridge regression describing the concepts 'training set' and 'test set', giving the formulas for a regression algorithm and defining all terms in them.
Proof of concept program: a regression algorithm applied to a small artificial dataset.
Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.
Loading a real-world dataset, simple pre-processing and visualisation.

Final Deliverables

The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
The program will work with a real-world dataset, read the data from a file, preprocess the data and apply regression using parameters selected by the user with parameters entered by the user.
The program will have a graphical user interface.
The program will automatically perform tests such as comparison of different kernels, parameters, etc and visualise the results.
The program will implement another learning algorithm such as nearest neighbours or neural networks, and compare regression against it.
The report will describe the theory of regression and derive the formulas.
The report will describe the implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe the software engineering process involved in generating your software.
The report will describe computational experiments with different kernels and parameters and draw conclusions.
The report will describe the competitor algorithm(s), compare the performance and draw conclusions.

Suggested Extensions

Use of regression in the on-line mode with growing and sliding window.
Application of regression to different datasets (including those found by the student).
Implementation of regression-type algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies etc and computational experiments with them.

Prerequisites:

Taking CS3920 (Computer Learning) in the 3rd year is recommended.
Interest in numerical methods.
Good command of Java, C++, or Python.

Suitable for CS with AI students.

Reinforcement Learning

Aims: To understand reinforcement learning (RL) and implement RL for some simple scenarios

Background: Reinforcement Learning (RL) is the branch of machine learning in which an agent learns to perform sequences of actions in response to rewards and punishments. There is growing evidence that some animal learning uses essentially these algorithms: within the brain, rewards seem to be represented by dopamine surges.

In this project, you will understand the basic formalisation of RL as finding an optimal policy in a Markov Decision Process. This theory is nice, and not quite as complicated as it sounds (although there is a lot to learn). You will need to understand policy optimisation by dynamic programming, and the Bellman equation. This is widely applicable computational maths (not just in RL), and is good to know.

You will implement a demonstration of RL with a GUI as a MVC (model view controller) design pattern, and you will perform some experiments to determine the efficiency of the learning.

Early Deliverables

Proof of concept program: Implementation of exact policy optimisation by value iteration for a general MDP.
Proof of concept program: Implementation of Q-learning for grid world.
Proof of concept program: GUI built with buttons etc.,
Proof of concept program: MVC program with with controls.
Report on Markov decision processes (MDPs)
Report on notions of policy and value function; optimal policy/value function via the Bellman equation. Finding optimal policies by value iteration and policy iteration
Report on Concept of learning as incrementally optimising policy in a MDP
Report on Q-learning

Final Deliverables

The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles
The program will at least demonstrate Q-learning with various exploration strategies in a simple world.
The program must have a text based control file from which it can be configured.
The program will have a Graphical User Interface that can be used to animate simulations.
The report will describe the software engineering process involved in generating your software.
The report will include a description of the theory of MDPs, optimal policies and value functions via the Bellman equation, dynamic programming methods of finding optimal policies, and Q-learning.
The report will include some experiments that compare the effects of different parameter choices on the progress of Q-learning: over-optimistic values causing excessive exploration, and pessimistic initial values causing the development of superstitions.

Suggested Extensions

Implementing RL for more a more challenging problem than a grid world, such as an acrobat.
Experimental analysis of the performance of different learning algorithms is interesting but challenging

Reading

Chapter on Reinforcement Learning in 'Machine Learning' by Tom Mitchell

Prerequisites: This project is already challenging. The maths is not complicated - but you will find it new, and you really will have to understand it properly. You would find it helpful to take the third year courses in Computational Optimisation and also in Machine Learning, but this is not essential.

Ridge Regression

The project is recommended for Computational Finance students.

Early Deliverables

Proof of concept program: Kernel Ridge Regression applied to a small artificial dataset.
Report: An overview of ridge regression describing the concepts 'training set' and 'test set', giving the formulas for Ridge Regression and defining all terms in them.
Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.

Final Deliverables

The program will work with a real dataset such as Boston housing, read the data from a file, preprocess the data (normalise, split the dataset into the test and training sets) and apply Ridge Regression using a kernel with parameters.
The program will automatically perform tests such as comparison of different kernels, parameters, etc. The results will be visualised.
The program will work in batch and on-line modes.
Tests will be performed to compare plain Ridge Regression against other regression-based algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies, or gradient descent.
The report will describe the theory of Ridge Regression and derive the formulas.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions

Suggested Extensions

Application of Ridge Regression to different datasets (including those suggested by the student).
Comparison with other learning algorithms (e.g., aggregating algorithm regression, neural networks, gradient descent-based etc).
Combining regression with methods of prediction with expert advice.

Reading

Y.Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
J.Shawe-Taylor and N.Cristianini, Kernel Methods for Pattern Analysis Cambridge University Press, 2004
C.E.Rasmussen and C.Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006 http://www.gaussianprocess.org/
B.Schlkopf and A.J.Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5920 (Computer Learning) would be an advantage (but is not required).

Ridge Regression

The project is recommended for Computational Finance students.

Early Deliverables

Proof of concept program: Kernel Ridge Regression applied to a small artificial dataset.
Report: An overview of ridge regression describing the concepts 'training set' and 'test set', giving the formulas for Ridge Regression and defining all terms in them.
Report: Examples of applications of Ridge Regression worked out on paper and checked using the prototype program.

Final Deliverables

The program will work with a real dataset such as Boston housing, read the data from a file, preprocess the data (normalise, split the dataset into the test and training sets) and apply Ridge Regression using a kernel with parameters.
The program will automatically perform tests such as comparison of different kernels, parameters, etc. The results will be visualised.
The program will work in batch and on-line modes.
Tests will be performed to compare plain Ridge Regression against other regression-based algorithms such as Kernel Aggregating Algorithm Regression, Kernel Aggregating Algorithm Regression with Changing Dependencies, or gradient descent.
The report will describe the theory of Ridge Regression and derive the formulas.
The report will describe implementation issues (such as the choice of data structures, numerical methods etc) necessary to apply the theory.
The report will describe computational experiments, analyse their results and draw conclusions

Suggested Extensions

Application of Ridge Regression to different datasets (including those suggested by the student).
Comparison with other learning algorithms (e.g., aggregating algorithm regression, neural networks, gradient descent-based etc).
Combining regression with methods of prediction with expert advice.

Reading

Y.Kalnishkan, Kernel Methods, http://onlineprediction.net/?n=Main.KernelMethods
J.Shawe-Taylor and N.Cristianini, Kernel Methods for Pattern Analysis Cambridge University Press, 2004
C.E.Rasmussen and C.Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006 http://www.gaussianprocess.org/
B.Schlkopf and A.J.Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, 2001.

Prerequisites: Taking CS5100 (Data Analysis) is required. Taking CS5920 (Machine Learning) would be an advantage (but is not required).

Satisfiability Modulo Theories and Bounded Model Checking for AI Planning

Background:

Early Deliverables

Report: overview of SAT, SMT and SMT solving techniques.
Report: bounded model checking and encoding into SMT.
Install z3 and Python APIs. Develop code for SMT solving of a simple example problem (e.g., Sudoku, or job shop scheduling).
Develop code for BMC of generic transition systems.

Final Deliverables

Develop code for encoding a planning problem on a 2-d grid with obstacle as a BMC problem.
Evaluate your solution with arbitrary and randomly generated planning instances.
Analyze SMT solver performance for different grid sizes and numbers of obstacles.
Final dissertation. It should include background on SAT, SMT, BMC, and AI planning, as well as the description of your solution for the encoding of AI planning into SMT, some example solutions obtained with the developed planning method, and performance analysis results.

Suggested Extensions

Extend the planning problem and your solution with transition costs and cost constraints.
Develop code to visualize path solutions on the 2d grid.

Reading

De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: introduction and applications." Communications of the ACM 54.9 (2011): 69-77. (available at https://dl.acm.org/doi/abs/10.1145/1995376.1995394)
De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: An appetizer." Brazilian Symposium on Formal Methods (2009): 23-36. (available at https://link.springer.com/chapter/10.1007/978-3-642-10452-7_3)
Amla, Nina, et al. "An analysis of SAT-based model checking techniques in an industrial environment." (up to section 2.3 included). (available at https://link.springer.com/chapter/10.1007/11560548_20)
Lectures on " Introduction to planning" and " Planning formalisms" of the " Autonomous Intelligent Systems" course
Z3 tool page, https://github.com/Z3Prover/z3
Programming Z3 tutorial (sections 1,2,3,5), https://theory.stanford.edu/~nikolaj/programmingz3.html
Jupyter notebooks illustrating Z3's Python APIs (includes code for Programming Z3 tutorial), https://notebooks.azure.com/nbjorner/projects/z3examples
Online Z3 tutorial (using SMT-lib input format), https://rise4fun.com/z3/tutorial/guide
Dennis Yurichev, "SAT/SMT by Examples", https://yurichev.com/writings/SAT_SMT_by_example.pdf (a huge collection of maths and CS problems solved with SAT or SMT)

Prerequisites: Prerequisites: CS5980 (Autonomous Intelligent Systems) or some knowledge of AI planning. Good programming skills in Python. Some understanding of logic.

Satisfiability Modulo Theories and Bounded Model Checking for AI Planning

Background:

Early Deliverables

Report: overview of SAT, SMT and SMT solving techniques.
Report: bounded model checking and encoding into SMT.
Install z3 and Python APIs. Develop code for SMT solving of a simple example problem (e.g., Sudoku, or job shop scheduling).
Develop code for BMC of generic transition systems.

Final Deliverables

Develop code for encoding a planning problem on a 2-d grid with obstacle as a BMC problem.
Evaluate your solution with arbitrary and randomly generated planning instances.
Analyze SMT solver performance for different grid sizes and numbers of obstacles.
Final dissertation. It should include background on SAT, SMT, BMC, and AI planning, as well as the description of your solution for the encoding of AI planning into SMT, some example solutions obtained with the developed planning method, and performance analysis results.

Suggested Extensions

Extend the planning problem and your solution with transition costs and cost constraints.
Develop code to visualize path solutions on the 2d grid.

Reading

De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: introduction and applications." Communications of the ACM 54.9 (2011): 69-77. (available at https://dl.acm.org/doi/abs/10.1145/1995376.1995394)
De Moura, Leonardo, and Nikolaj Bjørner. "Satisfiability modulo theories: An appetizer." Brazilian Symposium on Formal Methods (2009): 23-36. (available at https://link.springer.com/chapter/10.1007/978-3-642-10452-7_3)
Amla, Nina, et al. "An analysis of SAT-based model checking techniques in an industrial environment." (up to section 2.3 included). (available at https://link.springer.com/chapter/10.1007/11560548_20)
Lectures on " Introduction to planning" and " Planning formalisms" of the " Autonomous Intelligent Systems" course
Z3 tool page, https://github.com/Z3Prover/z3
Programming Z3 tutorial (sections 1,2,3,5), https://theory.stanford.edu/~nikolaj/programmingz3.html
Jupyter notebooks illustrating Z3's Python APIs (includes code for Programming Z3 tutorial), https://notebooks.azure.com/nbjorner/projects/z3examples
Online Z3 tutorial (using SMT-lib input format), https://rise4fun.com/z3/tutorial/guide
Dennis Yurichev, "SAT/SMT by Examples", https://yurichev.com/writings/SAT_SMT_by_example.pdf (a huge collection of maths and CS problems solved with SAT or SMT)

Prerequisites: Prerequisites: CS5980 (Autonomous Intelligent Systems) or some knowledge of AI planning. Good programming skills in Python. Some understanding of logic.

Scalable Peer-To-Peer Key Value Store.

Aims: Implementing a scalable key value store in a peer-to-peer setting and understanding its performance.

Background:

Early Deliverables

A report giving an overview of prominent existing key value stores such as Chord and SkipGraph.
A report describing how existing key value stores deal with environmental changes.
A proof-of-concept program implementing a simple (possibly inefficient) distributed key value store.

Final Deliverables

A distributed key value store that provides standard dictionary operations and is scalable to large data sets.
A report describing the architecture of the implemented key value store.
A report on the performed experiments and simulation results.

Suggested Extensions

Supporting advanced dictionary operations, e.g., range queries.
More efficient storage of data by using coding techniques.
Simulating node churn.

Reading

Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, Hari Balakrishnan: Chord: A scalable peer-to-peer lookup service for internet applications. SIGCOMM 2001: 149-160
James Aspnes, Gauri Shah: Skip graphs. ACM Trans. Algorithms 3(4): 37 (2007)

Prerequisites: CS5860; Understanding of the basic concepts of distributed systems. Distributed programming.

Scheduling jobs with security in mind

Aims: To provide a tool for solving workflow scheduling and Access Control problems.

Background: This project deals with an Access Control problem in Information Security. The workflow satisfiability problem (WSP) asks whether there exists an assignment of authorised users to the steps in a workflow specification, subject to certain constraints on the assignment. This is a problem faced every day by very many large companies.

The problem is computationally intractable in general. However the modern world does not rely so much on the polynomial versus NP-hard distinction (P vs NP). Instead we use the much more nuanced Fixed Parameter Tractability (FPT) landscape. This is when we hope to find some problem characteristic that allows a computationally hard problem to become tractable subject only to limiting a particular parameter.

Recent research, led by researchers in this department, has discovered that the workflow problem, subject to access control restrictions, is indeed FPT in some cases. In this project you will research the theory of fixed parameter tractability and the workflow scheduling problem. You will implement heuristic approaches to solving the problem, and also implement the FPT algorithm for work flow scheduling (the parameter is the number of required steps in a given workflow).

Early Deliverables

Report on Fixed Parameter Tractability and how it is different from the P vs NP tradition, with examples.
About Workflow Scheduling subject to access control: Where it occurs, why it is important and the kinds of constraints and authorisations that are useful.
Report on the analysis of the Fixed Parameter Tractability and the NP-hardness of WSP
Report on various other constraint satisfaction techniques used to solve industrial problems such as SAT and CSP solvers.
Programs: A simple backtracking algorithm, an FPT algorithm for the Vertex Cover problem.

Final Deliverables

The program must have a full object-oriented design, with a full implementation life cycle using modern software engineering principles
The program should have an excellent GUI for experimentation.
The report should include experimental assessment of the FPT algorithm, possibly in comparison with a SAT or CSP solver.
The report will describe the software engineering process involved in generating your software.
The report will describe and analyse your implementation of the backtracking FPT algorithm for solving WSP with user-independent constraints.

Suggested Extensions

Study of the Valued WSP allowing us to provide solutions even when WSP is unsatisfiable.

Secure Online Auctions via MPC

Aims: Implement an MPC-based online auction platform.

Background: With secure multi-party computation (MPC), it is possible to compute functions over the inputs of multiple parties such that only the computation result is revealed whereas the inputs stay private. A prominent use case for MPC is running online auctions, as MPC can provide provable security guarantees for the submitted bids. The goal of this project is to utilize one existing MPC framework to implement the functionality of a first-price sealed-bid auction (FPSBA) and make it available via a usable web interface. More precisely, the envisioned workflow is as follows: 1) In a web interface, users can select items they are interested in buying and submit bids. 2) The bids are secret-shared on the client side and the shares are distributed among a number of non-colluding servers that run the MPC protocol. 3) Once all bids are submitted, the servers compute the result and notify the participants of the outcome.

Prerequisites:

Basic knowledge in cryptography
Knowledge of C/C++ and web development (HTML, CSS, JavaScript, REST)

Early Deliverables

Implementation of the first-price sealed-bid auction (FPSBA) functionality in one existing MPC framework
Implementation of REST interfaces to submit bids to and receive results from auction servers
Implementation of controller node to configure the auction platform (e.g., which items should be auctioned and how long the auction runs)

Final Deliverables

User-friendly web interface
Implementation of client-side secret sharing + communication with auction servers

Reading

Marcella Hastings, Brett Hemenway, Daniel Noble, Steve Zdancewic: "SoK: General Purpose Compilers for Secure Multi-Party Computation". In IEEE Symposium on Security and Privacy 2019.
Marcel Keller: "MP-SPDZ: A Versatile Framework for Multi-Party Computation". In CCS 2020.
Lennart Braun, Daniel Demmler, Thomas Schneider, Oleksandr Tkachenko: "MOTION - A Framework for Mixed-Protocol Multi-Party Computation". In ACM TOPS 2022.
https://github.com/data61/MP-SPDZ
https://github.com/encryptogroup/MOTION

Security Issues in Serverless Functions

Aims: Analyse the security issues of using serverless functions on the cloud.

Background: Severless functions are a relatively new cloud paradigm. Instead of deploying a full virtual machine or application, developers can deploy functions that are executed on demand when certain triggers happen. This can streamline and simplify the development of cloud-based systems as the developer doesn't have to worry about machine or application configurations. Serverless functions can sometimes result in very complex systems of interconnected functions that result in very complex environments. This can result in uncontrolled recursive calls or incorrect policies that can be used by an attacker to get access to the system. This project will analyse the security issues that developers may face when developing serverless backends and produce a set of guidelines to help them while developing using this new paradigm.

Prerequisites:

Good programming skills, particularly Python and/or NodeJS.
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report describing current serverless offerings from Google (G Cloud), Microsoft (Azure) and Amazon (AWS)
A report with an overview access control systems for G Cloud, Azure and AWS
A proof of concept function that executes in one of these clouds

Final Deliverables

A detailed descriptoin of the access control system of one of the cloud providers
Examples of functions that represent each of the risks described
A detailed description of a set of policies and recommendations so developers avoid these risks
A critical comparison with findings from related works

Reading

https://cloud.google.com
https://aws.amazon.com
https://azure.microsoft.com
https://dl.acm.org/doi/abs/10.1145/3366423.3380173
https://www.usenix.org/conference/hotcloud19/presentation/obetz
https://dl.acm.org/doi/abs/10.1145/3366615.3368351>

Single-Agent and Multi-Agent Search in Maze Games

Aims: The goal of this project is to implement general search algorithms and apply them to single-agent and multi-agent scenarios in maze games (e.g. Pac-Man).

Background:

Maze games are a video game genre in which the entire playing field is a maze. Usually, the player needs to navigate the maze to collect items and also performs other actions (e.g. escape monsters and outrace an opponent) within a time limit. The most popular example is Pac-Man by Namco (1980), but many other maze games exist.

You will model a maze game as a search problem and then design and implement agents that inhabit the maze. The agents will find efficient paths through the maze world, to reach a particular location, to collect food and to combat opponents.

Early Deliverables

Given a maze game of your interest (e.g. Pac-Man), design and implement a simple graphical visualisation for it.
Consider an agent inhabiting the maze world. Model the agent's navigation task as a search problem. Implement a graph search version of depth-first search and breadth-first search to reach a specific location in the maze. Run experiments with three different size maze: small, medium and large. Visualise the agent following the plan step-by-step.
Now change the cost function in such a way that traversing some areas is more expensive than traversing other areas (e.g. Pac-Man: you can charge more for dangerous steps in areas close to the ghosts). Implement the uniform-cost graph search algorithm to reach a specific location in the maze. Run experiments with three different size maze: small, medium and large. Visualise the agent following the plan step-by-step.

Final Deliverables

Implement the algorithm A* to find the shortest path through the maze that touches all four corners. Use two different, consistent heuristic functions. Run experiments with three different size maze: small, medium and large. Visualise the agent following the plan step-by-step.
Assume each cell in the maze has an item to collect. Formulate a search problem for collecting all items (e.g. in Pac-Man: eating all the food) and implement A* to do that in as few steps as possible. Also implement a greedy search and compare the results of the two algorithms in three different size maze: small, medium and large. Visualise the actions of the agent.
Now assume that you have additional agents in your maze, one is the main player and the other are the opponents (e.g. Pac-Man and the ghosts). Model this new problem and implement the player as a reflex agent that tries to visits all cells avoiding the opponents.
Implement an adversarial Minmax search agent for each agent; start with the player and one opponent and then increase the number of opponents. You need to write an evaluation function that evaluates states. Run experiments with three different size maze: small, medium and large. Visualise the actions of the agents.
Design and implement a new agent that uses Alpha-beta pruning to more efficiently explore the minimax tree. Run experiments with three different size maze: small, medium and large. Visualise the actions of the agents.
Design and implement an Expectimax agent to model the probabilistic behaviour of an agents who makes suboptimal choices. Run experiments with three different size maze: small, medium and large. Visualise the actions of the agents.

Suggested Extensions

Use value iteration and Q-learning to underpin the behaviour of the player.

Reading

Stuart Russell and Peter Norvig, AI: A Modern Approach, 3rd Edition.

Prerequisites: Having attended a course in AI.

Solving the Impossible - Practical solutions for intractable problems

Aims: In this project, you will select an NP-hard problem, with several applications and many interesting solution strategies, implement some of these algorithms, and report on how they work in practice.

The recommended problems are either the graph problem Vertex Cover or the SAT problem which is the original NP-hard problem. The Vertex Cover problem has a structure that is simple to understand, yet many different approaches work for solving it (see below) and there are many interesting algorithms. There are also good benchmark repositories with example problems to test your code against. The SAT problem comes from logic, and simply asks whether a given propositional formula has a satisfying assignment. For SAT there is a very powerful yet simple family of algorithms that are successfully used in industry for solving many applied problems such as program verification. Like Vertex Cover there are also good benchmark repositories with example problems to test your code against. Students are expected to choose from two options in this project: option 1 (heuristics and approximations) and Option 2 (SAT-solving).

Background: NP-hard problems are an important challenge to computer science, both in theory and in practice. You may have heard this term from the important "P vs NP" question (or from some of your courses).

These problems have the important feature that they are "puzzle-like": you can easily recognise or verify a solution to these problems if you are given one, but it can be very challenging to find a solution given an instance of such a problem, or decide whether there is no solution.

In particular, it is quite strongly believed that there are no algorithms that can efficiently find solutions (if they exist) for all instances of such a problem, in time that is "worst-case tractable" (e.g., in polynomial worst-case time).

Problems like this are ubiquitous in reality - verify that a proposed circuit design has no bugs in it; plan routes for your delivery vehicles to deliver some goods with a minimum number of delivery runs; schedule speakers at an event so that you never have two talks at the same time about the same topic... In all these cases, you may run into situations where there is no algorithm that is guaranteed to always work efficiently, but you still have to find a solution to the problem somehow. Many approaches for this have been proposed, both in theory and in practice.

Some strategies for solving NP-complete problems are using heuristics; approximation algorithms; tractable special cases; algorithms that are slow in the worst case but fast on some 'non-worst-case' inputs.

An example of an algorithm for SAT that is known to be slow for some worst-case inputs but is actually very fast for many typical is the DPLL algorithm and its extensions. DPLL is the name of a very simple algorithm for SAT introduced in 1962, which basically amounts to (a partial) exhaustive search of the space of solutions to a given SAT instance. Almost all successful industrial level SAT-solvers nowadays use variants of DPLL equipped with clever heuristics to improve their performance. Though theoretically, DPLL is known to require huge (namely, exponential) run-time, striking experience shows that current SAT-solvers are very efficient on real-world instances. Hence the ubiquitous successful use of SAT-solving in the industry.

Prerequisites: CS2860 Algorithms and Complexity and a general interest in algorithms. Further algorithms courses are helpful, but not required. Good programming skills. Basic knowledge in algorithms and run-time analysis is a plus.

Early Deliverables

You are to choose either Option 1 (heuristics and approximations) or Option 2 (SAT-solving).
Report: A full formal description of the problem.
Report: P vs NP, NP-completeness, NP-hardness
Option 1: Report: Strategies for solving NP-complete problems, such as: heuristics; approximation algorithms; tractable special cases; algorithms that are slow in the worst case but fast on some 'non-worst-case' inputs.
Option 1: Report: Fast in theory, slow in practice. Merge sort and heap sort are sorting algorithms with excellent theoretical worst-case guarantees, but they may find themselves beaten in practice by quicksort, which has worse worst-case guarantees. Another example: linear programming is the basis of many planning systems, and there are algorithms for solving LPs in polynomial time, but these are not used in practice; instead, algorithms are used that have worst-case exponential-time behaviour. Report and comment on this phenomenon.
Option 1: Proof of concept program: A program that generates a random instance of Vertex Cover, shows it to the user, reads a suggested solution from the user, and verifies whether this solution is correct.
Option 1: Proof of concept program: A simple greedy heuristic
Option 1: Proof of concept program: The simple 2-approximation algorithm
Option 2: Encoding and data structures: write a program that reads a CNF from a file in the DIMACS format and a truth-assignment (generated randomly), and outputs true or false depending if the assignment satisfies the input CNF or not.
Option 2: Investigate the DPLL algorithm and write a report that shows manually worked examples of its operation. The report should include the pseudo code of the DPLL algorithm, providing also diagrammatic examples of its run on simple example input CNF formulas. Explain the algorithm and its running time.
Option 2: Write procedures that will serve as building blocks for the e cental SAT solver. Specifically, write a procedure that gets as an input a CNF formula and a *partial* assignment to its variable and outputs the new CNF formula resulting from assigning to the original formula. This should include simplifying (getting rid) of satisfied clauses, if they exist, and should output False if the CNF formula becomes false under the given assignment.

Final Deliverables

Option 1: A program with a GUI that can show an instance on the screen and illustrate a solution for it. Your program should be able to read a problem instance from disk, in a format taken from an existing 'benchmark library', and it should also be able to generate an instance of the problem at random with specified parameters.
Option 1: The program should offer at least three different solution strategies, such as a heuristic, an approximation algorithm, and a backtracking search 'branching algorithm'.
Option 1: The report should contain a section on the outcomes of experiments done between the different algorithms for instances of some benchmark library, with conclusions drawn about which algorithms seem suitable in different situations.
Option 2: Write a program which solves SAT using the DPLL algorithm. It is advisable to write the code in a language like C++ or C. But it is also possible to implement it in java or python, as the program is intended for educational purposes.
Option 2: Demonstrate the run of the SAT-solver on substantial instances (taken from known benchmarks).
Option 2: Investigate the conflict-driven clause learning heuristic (CDCL). Show in your report manually worked examples of DPLL with the clause learning heuristic as well a pseudocode for CDCL.
Option 2: Extend your DPLL solver to uses some kind of very simple clause learning heuristic, and explain your choice, showing how it improves the run time on some of the benchmarks.
Final report should contain a section on the outcomes of experiments done between the different algorithms/SAT-solvers for instances of some benchmark library, with conclusions drawn about which algorithms seem suitable in different situations.

Suggested Extensions

A more careful theory report, including notions of 'problem reductions' for NP-complete problems. For Vertex Cover, this should at least cover the two problems Independent Set and Max Clique.
A larger range of algorithms, or algorithms using more advanced concepts. One advanced approach for this problem would be the use of an LP relaxation. Another advanced approach would be to use methods from parameterized complexity (you may ask for references here).
Careful optimisation of one particular algorithm for the problem, for example a carefully tweaked branching algorithm, to make it as fast as possible across a range of instances.
Investigations into 'lower bounds' for your problem, to test how far off each algorithm is from an optimal solution.
For option 2: Extend the DPLL SAT-solver to include a fully functional conflict driven clause learning (CDCL). The full specification for this could be taken from the Handbook (see references below) or from the Wikipedia page. Apply these extensions to the SAT-solver and systematically compare the run-time improvement.
For option 2: Extend the DPLL SAT-solver to include Restarts. The full specification for this could be taken from the Handbook (see references below) or from the Wikipedia page. Apply these extensions to the SAT-solver and systematically compare the run-time improvement.
Investigate the notion of Satisfiability Modulo a Theory (SMT). Show how it extends SAT. Apply this to your tool for a chosen theory and systematically compare the run-time improvement on relevant instances.
Investigate the theoretical foundation of SAT solving: the resolution refutation system and its provable limitations.

Reading

Selected parts of Handbook of Satisfiability. Editors: Biere, A., Heule, M., Van Maaren, H., Walsh, T. Feb 2009. Volume 185 of Frontiers in Artificial Intelligence and Applications.
https://en.wikipedia.org/wiki/Conflict-Driven_Clause_Learning

Solving the Impossible - Practical solutions for intractable problems

A good problem choice is Feedback Vertex Set (see Wikipedia for a description). This is a problem that is easy to understand, but still allows for several interesting algorithms and solution strategies (see below). It is also possible to instead select the Vertex Cover problem as recommended for the BSc full-unit level, but in that case several of the extensions should be implemented as well.

The milestones below are written under the assumption that the problem Feedback Vertex Set is used.

These problems have the important feature that they are "puzzle-like": you can easily recognise or verify a solution to these problems if you are given one, but it can be very challenging to prove or verify that an instance of such a problem does not have a solution (or that it does not have a solution that is "good enough", e.g., that all solutions are too big, too expensive...).

In particular, it is strongly believed that there are no algorithms that can efficiently find solutions (if they exist) for all instances of such a problem, in time that is "worst-case tractable" (e.g., in polynomial worst-case time).

Prerequisites: CS2860 Algorithms and Complexity and a general interest in algorithms. Further algorithms courses are helpful, but not required.

Early Deliverables

Report: A full formal description of the problem
Report: P vs NP, NP-completeness, NP-hardness
Report: Strategies for solving NP-complete problems, such as: heuristics; approximation algorithms; tractable special cases; algorithms that are slow in the worst case but fast on some 'non-worst-case' inputs
Report: Fast in theory, slow in practice. Merge sort and heap sort are sorting algorithms with excellent theoretical worst-case guarantees, but they may find themselves beaten in practice by quicksort, which has worse worst-case guarantees. Another example: linear programming is the basis of many planning systems, and there are algorithms for solving LPs in polynomial time, but these are not used in practice; instead, algorithms are used that have worst-case exponential-time behaviour. Report and comment on this phenomenon.
Proof of concept program: A program that takes a graph and either finds a cycle in it or reports that the graph contains no cycles.
Proof of concept program: A 'cleaning algorithm' to remove vertices of degree 1 and 2 and self-loops
Proof of concept program: A simple greedy heuristic, with or without using the cleaning step
Proof of concept program: An approximation algorithm, for example the 3-approximation described in the lecture notes at https://www.cs.cmu.edu/afs/cs/academic/class/15854-f05/www/scribe/lec9.pdf

Final Deliverables

A program with a GUI that can show an instance on the screen and illustrate a solution for it. Your program should be able to read a problem instance from disk, in a format taken from an existing 'benchmark library', and it should also be able to generate an instance of the problem at random with specified parameters.
The program should offer at least three different solution strategies, such as a heuristic, an approximation algorithm, and a backtracking search 'branching algorithm'.
The report should contain a section on the outcomes of experiments done between the different algorithms for instances of some benchmark library, with conclusions drawn about which algorithms seem suitable in different situations.

Suggested Extensions

A more careful theory report, including notions of 'problem reductions' for NP-complete problems. For Feedback Vertex Set, this should include at least the hardness reduction from Vertex Cover.
A larger range of algorithms, or algorithms using more advanced concepts. One advanced approach for this problem would be the use of an LP relaxation. Another advanced approach would be to use methods from parameterized complexity (you may ask for references here).
Careful optimisation of one particular algorithm for the problem, for example a carefully tweaked branching algorithm, to make it as fast as possible across a range of instances.
Investigations into 'lower bounds' for your problem, to test how far off each algorithm is from an optimal solution.

Reading

A good set of benchmark instances are found on the following page: https://pacechallenge.wordpress.com/pace-2016/track-b-feedback-vertex-set/

Sports Data Analytic

Aims: Nowadays, data analysis is an essential component for managing a sports team successfully (e.g. the Moneyball movie). In this project, you will be a data scientist of a sports team. You will analyze data to help your colleagues accomplish in coaching, scouting, and bringing in the right players. Your colleagues may not be good with numbers; to help them understand your analysis, you need to provide them with simple but informative visual illuminations.

Background:

In this introduction I will focus on football. You are welcome to choose any other sport you are familiar with, but you must ensure that you can acquire the relevant datasets.

Around the globe, there are hundreds of football matches every week. An extensive amount of data on the players' performance is collected. While some of these data is behind paywall, other data is openly available in websites including FBref, sofascore, transfermarkt and kaggle. In this project, you will be a data scientist of a sports team. By using data, you and your colleagues together aim to build a strong team.

There are many aspects a data scientist can be useful in sports team management. For instance, when your team needs to bring in a player of a certain type (say a centre back with good aerial ability), you can use clustering algorithm or principal component analysis (PCA) to identify the right players from less elite leagues. This can help the team bring in a "hidden gem" with a low transfer fee. On the other hand, you can use the players' performance data to identify their weaknesses, so the coaching staffs know which training programs should be provided to the players. Remember, your colleagues are good at kicking the ball, but they may not be as good with numbers as you. To help them understand your analysis, you need to provide them with simple but informative visual illuminations.

Sports data analytic is a science but also an art. You must have good sense about the sport you analyze. If you have watched only a few sport matches in your whole life, you are strongly advised not to choose this topic.

Early Deliverables

You will need to work with datasets from different sources. Be familiar with the datasets. Identify the attributes which are strong indicators of good players in every position.
Learn the powerful visualization library ggplot2 (or plotnine if you are using Python). In particular, learn how to represent high-dimensional data using bubble plot, and/or via PCA.
Make a plan how you are going to use the datasets to complete a mission that can help your sports team. The proposal should indicate how the data should be pre-processed, which learning algorithms you will use, and how the results of data analysis will be used.

Final Deliverables

Generate simple but informative visual illuminations which allow you to present your data analysis results to your colleagues.
Implement a software with user-friendly interface, which allows your colleagues to explore the data themselves. The software should achieve the followings: (i) recommend which players your team can bring in, based on the constraints on position, age and budget your colleagues inputs; (ii) provide your colleagues of customizable visual illuminations, and (iii) display annotated tables of detailed data when your colleague wants to look into it.

Prerequisites:

You must have done Data Analysis (CS5100 or CS5100J). You must be comfortable with programming in Python or R. Doing well in Visualisation and Exploratory Analysis (CS5250) is preferred.

Structural Bioinformatics Framework using a MapReduce formalism

Aims: The aim of this project is to be provide a framework where large numbers of protein structures can be analysed using a user-provided executable in the MapReduce formalism.

Background: Structural Biology is the study of three-dimensional macro-molecular structures (Proteins, DNA, RNA) that are found in cells. Understanding their function is a key part of Biology and is one the major drivers of drug discovery (the most recent Nobel Prize inBiology was awarded in this field). Structural Bioinformatics involves the analysis of large numbers of structures (usually protein structures) using a variety of different user-supplied executables. Large amounts of the data are available publicly in the Protein Data Bank (PDB) where nearly 100,000 structures are deposited. The analysis works in a batch mode and hence is ideal for a MapReduce formalism.

Early Deliverables

Report - Protein Structures
Report - The Protein Data Bank and the file formats used in accessing it.
Proof of Concept - distributing protein structure files amongst MapReduce cluster.
Proof of Concept - Running MapReduce using a single type of executable for analysing protein structures.

Final Deliverables

The final framework will allow users to input a map and reduce step (namely an executable to run on an individual protein structure in the former and a parser for the generated log file from the reduce step) to analyse a set of protein structures.
The developer will provide a clear set of guidelines for the map/reduce executables.
The final data should be returnable in variety of different formats.

Suggested Extensions

The PDB is updated on a regular basis. The data set can mirror the official database and is distributed amongst the MapReduce cluster.
A set of libraries so that arbitrary executables can be easily put into the MapReduce formalism.
The PDB is comprised of many different types of structure (DNA and RNA macro-molecules are included as well as proteins). An interface to select fractions of the PDB for analysis.
Implementation on a Public Mapreduce cluster e.g. Amazon.

Prerequisites: Not for students doing the MSci Computer Science (Artificial Intelligence)

Reading

The protein data bank H.M.Berman et al. Nucl. Acid. Res. (2000) 28(1): 235-242. Doi: 10.1093/nar/28.1.235

Test Coverage Instrumentation for Java Bytecode

Aims: A tool for instrumenting Java code to compute and visualise coverage metrics used for safety-critical systems.

Background:

The quality of a set of unit tests (a test suite) is difficult to assess: how much testing is enough? This question is especially important for safety-critical systems. As a consequence, many coverage metrics such as statement or function coverage have been developed to give an intuition for how well a test suite exercises the program under test. A number of coverage visualisers are available, but new metrics require new tool support.

The goal of this project is to use the Soot framework for instrumenting Java class files to insert profiling code that can gather coverage information. Soot is written in Java and allows to both analyse and modify Java bytecode. You will implement different types of instrumentation for several different coverage metrics and a visualisation for the results.

In addition to standard metrics, you should also implement MC/DC - Multiple Condition/Decision coverage, which is mandated for safety-critical avionics software.

Early Deliverables

Code: Instrumentation for statement and function coverage and basic visualisation
Report: Java Bytecode and Soot's intermediate representation
Report: Code instrumentation
Report: Survey on coverage metrics

Final Deliverables

Code: Branch coverage, condition coverage, and MC/DC coverage. GUI for running and visualising the analysis.
Report: Architecture of Soot
Report: Design and implementation of the instrumentation tool
Report: Test requirements for safety critical systems
Formal description and practical implementation of a custom coverage metric.

Suggested Extensions

Computing cyclomatic complexity using Soot's analysis components

Reading

Lam et al: The Soot framework for Java program analysis: a retrospective. Cetus Users and Compiler Infastructure Workshop (CETUS 2011)
The Soot Wiki: https://github.com/Sable/soot/wiki

Prerequisites: Confident in Java programming skills.

That Guy: Identifying and Mitigating Bad Behaviour in Autonomous Task Allocation Algorithms

Aims: Detecting and Mitigating Selfish Behaviour in Distributed Task Allocation Algorithms.

Background: Distributed task allocation algorithms are used in autonomous swarms of drones for mapping, search and rescue, and tactical operations (among many others). One must consider the role of trust in such systems: once released there is no guarantee that individual drones do not come under the sway of malicious parties, or that malicious drones are inserted into the network as reinforcements or additional resources. The identification of malicious entities and prevention of any negative impacts they may have on the mission is critical. Detecting such actors and protecting the network, without the intervention of an operator, may be required and forms the focus of this project.

Prerequisites:

Good programming skills, particularly in Python C, or C++.
A basic grasp of network security and the OSI model.
Willingness to learn and engage with simulation and statistical analysis software (NS3/SimPy and MATLAB/Octave/R).
Being able to autonomously download/compile/install tools from various repositories (e.g., GitHub).

Early Deliverables

A report on task allocation algorithms in current usage and the degree to which trust is implicit in their design
A report on the current state of the art in anomaly detection (behavioural and statistical) in task allocation systems. This may extend to underlying network infrastructure (for example selfish routing attacks in Mobile Ad hoc Networks (MANETs) performed by malicious nodes in an attempt to gain intelligence about the larger mission)
A clear workplan describing the set of tests to be performed, and a specific use case defining what type of network (MANET, traditional wireless/wired, Internet-based) and task allocation protocol will form the focus of experiments

Final Deliverables

A simulation or experimental demonstration of the principles explored in the project, such as a simple (half dozen nodes) decision making system with attack and mitigation demonstrations
A final report documenting the research, design, and implementation of the system. This report will extend to cover the experimental design, and how the choices made in the systems design phase informed experimental choices (and vice versa). A clear grasp of experimental process and hypothesis testing should be demonstrated through appropriate use of tools and techniques

Reading

https://ieeexplore.ieee.org/abstract/document/9002704
https://ieeexplore.ieee.org/abstract/document/7776949
https://gala.gre.ac.uk/id/eprint/16178/
https://ieeexplore.ieee.org/abstract/document/8758417
https://ieeexplore.ieee.org/abstract/document/7809102

The development of PKI system using Blockchain Technology (Distributed blockchain-based PKI)

Aims: Take advantage of Blockchain technologies to develop a distributed robust service of PKI.

Background: Public Key Infrastructure (PKI) is a very important element to support SSL/TLS protocol which is widely used in many modern web applications to create a secure channel. PKI is based on a public key crypto system (Private and Public key pair) to manage security keys and certificates in terms of their generation, distribution, storing, validation and revocation processes, relying on trusted issuing authorities at the root. However, almost all well-known implementations of PKI have some issues of centralisation, integrity and reliability system features in maintaining and protecting the security materials being created. Therefore, it is arguable that incorporating Blockchain technologies in developing a PKI may offer a solution for those concerns.

Prerequisites:

Ability to research a new topic and implement designs
System modelling
Good programming skills particularly Node JS, Java, Golang, C/C++ etc and knowledge of Windows/Linux internals.
Knowledge Data/Database Modelling
Familiarity with DLT and Blockchain technologies (e.g. HyperLedger)
Public Key encryption and digital certificates standards

Early Deliverables

A report describing architectural components of a PKI system and addressing Blockchain technologies potentials
A design of the PKI data model
A proof of concept implementation of the auditing system using a DLT technologies
Testing the Blockchain-based PKI system

Final Deliverables

Design and develop the data model of PKI
Develop a prototype of PKI system using Blockchain
Complete report
The report will have a User Manual

Reading

https://medium.com/swlh/top-5-dlt-blockchain-protocols-to-consider-in-your-project-dae7fc6d381d
https://www.ibm.com/blogs/research/2018/02/architecture-hyperledger-fabric/
https://hyperledger-fabric.readthedocs.io/en/release-1.4/getting_started.html
https://medium.com/swlh/top-5-dlt-blockchain-protocols-to-consider-in-your-project-dae7fc6d381d
https://pki-tutorial.readthedocs.io/en/latest/simple/
www.openssl.org
EJBCA https://www.ejbca.org
Dierks, T., (2008). The transport layer security (TLS) protocol version 1.2, RFC-5246
Vacca, J. R. (2004). Public Key Infrastructure. Boca Raton: C R C Press LLC.
Cooper, D., Santesson, S., Farrell, S., Boeyen, S., Housley, R. Polk, W. (2008) Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile (RFC-5280)
Santesson, S., Myers, M., Ankney, R., Malpani, A., Galperin, S., Adams, C., (2013) X.509 Internet Public Key Infrastructure a Certificate Status Protocol – OCSP (RFC-6960)
Eastlake, D., (2011) Transport Layer Security (TLS) Extensions: Extension Definitions (RFC-6066)

Timetabling using simulated anealing

Aims: Build a system which uses the simulated annealing algorithm to construct high quality timetables

Background:

Timetabling is a computationally hard problem. It is believed that the only way to find an optimal timetable is to construct all possible timetables and choose the best one. This exhaustive search is impractical for all but trivial problems.

Timetabling is an example of combinatorial optimisation - a problem in which a set of objects must be ordered so as to optimise some cost function. Since exhaustive search is impractical, we instead use heuristic algorithms to find solutions which are good enough, rather than guaranteed optimal. For instance, another well known combinatorial optimisation problem is the Travelling Salesman Problem in which a set of cities must be visited once each in an order which minimises the overall route length. One heuristic might be to start at the start city, then move to the next nearest city that has not yet been visited, and then repeat until no more cities are left. Actually this often produces quite good results, but it is hard to see what the equivalent heuristic might be for the timetabling problem which is multi-dimensional and has a variety of different cost factors.

Simulated annealing is a general purpose heuristic which minimises combinatorial systems using the Metropolis Algorithm - a simulation of annealing behaviour in crystals as they are warmed and slowly cooled. The underlying theory is drawn from statistical thermodynamics, and is likely to be quite opaque to most computer scientists so you are not required to master it. However, the resulting algorithm is quite easy to implement and experiment with, and produces remarkably good results in many kinds of combinatorial optimisation problems.

To succeed in this project, you need to be interested in computationally hard algorithms; and to possess good programming skills.

Early Deliverables

Write a Java program which exhaustively solves the Travelling Salesman problem for small problems - say up to 10 cities.
Write a Java program which uses the nearest city hearistic and test its performance on large problems with 100-200 cities.
Write a report on these two algorithms which gives both asymptotic and actual runtimes for your implementations.
Write a Java implementation of simulated annealing which solves large Travellig Salesman problems.
Write a report describing the effects of modifying the parameters of your algorithm.

Final Deliverables

Write a report which gives a specification of a useful timetabling problem, and describes the software engineering approach that you will use to develop the final application.
Demonstrate a working timetabling system.
A final report on the project activities, fully acknowledging and citing sources.

Suggested Extensions

Add graphical visualisations of your algorithms runtime activity and results.
Consider the proble of rip-up and retry: taking an existing solution with new constraints and costs and attempting to produce a good modified solution that retains as much as possible of the old solution.

Reading

Section on simulated annealing from the Numerical Recipes book published by Cambridge.
Many online resources: start with Wikipedia

Trace Aware Extensions of CTL Model-Checking

Aims: To produce a new model-checking tool for extensions of the popular temporal logic CTL.

Background: Computational Tree Logic (CTL) is a popular temporal logic for specifying branching-time properties of systems. A specification in this logic might read


    AG (request => AF response)

This means "it is always true that if a request event happens, there will always be a point in the future where the request is responded to".

A problem with the above example is that there may be multiple request and response events. The formula cannot ensure that each request is paired with an response. In fact, the set of all traces that satisfy this property is a "context free language".

Recent work extends CTL to allow languages such as context free languages to be built into the logic, to enable these kinds of properties to be expressed.

While many CTL model-checkers have been implemented, there are currently no implementations of model-checkers for this extended CTL. The goal of this project would be to produce the first such implementation.

Early Deliverables

Report: A survey of CTL and CTL model-checking algorithms.
Report: A survey of extended CTL.
Report: A survey of model-checking algorithms for context-free languages.
Proof of concept programs: A simple CTL model-checker.
Proof of concept programs: A simple analysis tool for context-free languages.

Final Deliverables

An analysis tool for finite-state systems and CTL extended with finite-state and context-free languages.
An experimental evaluation of the tool and comparison with CTL model-checkers.

Suggested Extensions

Extensions allowing infinite-state systems to be analysed.

Reading

Travelling Salesperson Heuristics

Aims: To understand and implement a number of heuristics for the Travelling Salesperson Problem (TSP). To study applications of TSP.

Background: : TSP is a famous problem which is easy to formulate: given n "cities", find a shortest route that visits every "city" exactly once. TSP is hard to solve and the problem has a large number of applications. Large instances of the problem are solved by heuristics which are fast algorithms, but without quality guarantee. Heuristics are also used when the distances between the "cities" are not exact.

Early Deliverables

A report describing several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy).
Reports: Complexity, NP hardness and the big O notation
Implementations of Nearest Neighbour and Repeated Nearest Neighbour.

Final Deliverables

Description of several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy, Vertex Insertion).
Implementation of all heuristics described in the report.
Tables comparing quality and runtime of these heuristics on various testbeds.
GUI for the heuristics.
A report on TSP applications.
A short report on complexity of algorithms.

Suggested Extensions

To study and implement Greedy+2-Opt. To include it into the tables.

Reading

Start with Wikipedia on TSP.

Travelling Salesperson Heuristics

Aims: To understand and implement a number of heuristics for the Travelling Salesperson Problem (TSP). To study applications of TSP.

Early Deliverables

A report describing several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy).
Reports: Complexity, NP hardness and the big O notation
Implementations of Nearest Neighbour and Repeated Nearest Neighbour.

Final Deliverables

Description of several TSP heuristics (e.g., Nearest Neighbour, Repeated Nearest Neighbour, Greedy, Vertex Insertion).
To study and implement Greedy+2-Opt.
Implementation of all heuristics described in the report.
Tables comparing quality and runtime of these heuristics on various testbeds.
GUI for the heuristics.
A report on TSP applications.

Reading

Start with Wikipedia on TSP.

Value at Risk

Aims: Write a program for computing Value at Risk. Check its performance using backtesting.

Early Deliverables

You will write a simple proof-of-concept program for computing Value at Risk for a portfolio consisting of 1 or 2 stocks using two different methods: model-building and historical simulation.
Different ways of computing Value at Risk will be back tested and the statistical significance of the results analyzed.
The report will describe the program (software engineering, algorithms etc.,).
The report will also discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).

Final Deliverables

The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation.
Allow an arbitrary number of stocks (using eigen- or Cholesky decomposition)
The final program will have a full object-oriented design, with a full implementation life cycle using modern software engineering principles.
Ideally, it will have a graphical user interface.
The final report will describe: the theory behind the algorithms.
The final report will describe: the implementation issues necessary to apply the theory.
The final report will describe: the software engineering process involved in generating your software.
The final report will describe: computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Computing conditional VaR (also known as expected shortfall)
Explore carefully the choice of parameters for EWMA and GARCH(1,1) using a loss function such as square or log loss
Empirical study of the two approaches to historical simulation for n-day VaR: reducing to 1-day VaR and direct estimation
Complement back testing by stress testing
Replacing the Gaussian model for stock returns by robust models
Computing monetary measures of risk different from VaR

Reading

John C. Hull. Options, futures and other derivatives, 7th ed., Pearson, Upper Saddle River, NJ, 2009.
Hans Follmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time, 3rd ed., de Gruyter, Berlin, 2011.
Yahoo Finance as source of data: http://finance.yahoo.com/
http://www.gloriamundi.org

Prerequisites: Java or C++ programming skills. Taking CS3930.

Value at Risk

Aims: Write a program for computing Value at Risk. Check its performance using backtesting (more generally, exploring its calibration and resolution).

The project is recommended for Computational Finance students.

Early Deliverables

You will write a simple proof-of-concept program for computing Value at Risk for a portfolio consisting of 1 or 2 stocks using two different methods: model-building and historical simulation.
Different ways of computing Value at Risk will be back tested and the statistical significance of the results analyzed.
The report will describe the program (software engineering, algorithms, etc.); discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).

Final Deliverables

The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation. Option pricing can be performed using the Black-Scholes formula, or binomial trees (for American options), or a different variety of Monte Carlo simulation (for Asian options).
Allow an arbitrary number of stocks in the model-building approach (using eigen- or Cholesky decomposition).
Allow an arbitrary number of stocks using histotical simulation.
Ideally, it will have a graphical user interface.
The final report will describe the theory behind the algorithms; the implementation issues necessary to apply the theory; the software engineering process involved in generating your software; computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Computing conditional VaR (also known as expected shortfall)
Explore carefully the choice of parameters for EWMA and GARCH(1,1) using a loss function such as square or log loss
Empirical study of the two approaches to historical simulation for n-day VaR: reducing to 1-day VaR and direct estimation
Complement back testing by stress testing.
Replacing the Gaussian model for stock returns by robust models.
Computing monetary measures of risk different from VaR.
Using proper scoring rules for quantiles for valuating computed VaRs.
Using methods of worst-case risk aggregation, such as the Rearrangement Algorithm, or worst-case online decision making, such as the Weak Aggregating Algorithm (currently active areas of research).

Reading

John C. Hull. Options, futures and other derivatives, 10th ed., Pearson, Upper Saddle River, NJ, 2018.
Hans Follmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time 3rd ed., de Gruyter, Berlin, 2011.
Yahoo Finance as source of data: http://finance.yahoo.com/
http://gloria-mundi.com

Prerequisites: Python, R, MATLAB, Java, or C++ programming skills.

Value at Risk

Aims: Write a program for computing Value at Risk. Check its performance using backtesting (more generally, exploring its calibration and resolution).

The project is recommended for Computational Finance students.

Early Deliverables

You will write a simple proof-of-concept program for computing Value at Risk for a portfolio consisting of 1 or 2 stocks using two different methods: model-building and historical simulation.
Different ways of computing Value at Risk will be back tested and the statistical significance of the results analyzed.
The report will describe the program (software engineering, algorithms, etc.); discuss different ways of estimating the variance of returns and the covariance between returns of two different securities (using the standard formula, exponentially weighted moving average, or GARCH(1,1)).

Final Deliverables

The program should be extended by allowing derivatives (such as options) in the portfolio and using Monte Carlo simulation. Option pricing can be performed using the Black-Scholes formula, or binomial trees (for American options), or a different variety of Monte Carlo simulation (for Asian options).
Allow an arbitrary number of stocks in the model-building approach (using eigen- or Cholesky decomposition).
Allow an arbitrary number of stocks using histotical simulation.
Ideally, it will have a graphical user interface.
The final report will describe the theory behind the algorithms; the implementation issues necessary to apply the theory; the software engineering process involved in generating your software; computational experiments with different data sets, methods, and parameters.

Suggested Extensions

Computing conditional VaR (also known as expected shortfall)
Explore carefully the choice of parameters for EWMA and GARCH(1,1) using a loss function such as square or log loss
Empirical study of the two approaches to historical simulation for n-day VaR: reducing to 1-day VaR and direct estimation
Complement back testing by stress testing.
Replacing the Gaussian model for stock returns by robust models.
Computing monetary measures of risk different from VaR.
Using proper scoring rules for quantiles for valuating computed VaRs.
Using methods of worst-case risk aggregation, such as the Rearrangement Algorithm, or worst-case online decision making, such as the Weak Aggregating Algorithm (currently active areas of research).

Reading

John C. Hull. Options, futures and other derivatives, 10th ed., Pearson, Upper Saddle River, NJ, 2018.
Hans Follmer and Alexander Schied. Stochastic Finance: An Introduction in Discrete Time 3rd ed., de Gruyter, Berlin, 2011.
Yahoo Finance as source of data: http://finance.yahoo.com/
http://gloria-mundi.com

Prerequisites: Python, R, MATLAB, Java, or C++ programming skills.