KWEST: A Semantically Tagged Virtual File System

Sat Jun 01 2013 Report
Pune University
✍ Aseem Gogte* , Sahil Gupta* , Harshvardhan J. Pandit* , Rohit Sharma*
Description: A virtual filesystem that allows automated semantic organisation and suggestions
published version 🔓open-access archives: harshp.com , zenodo
📦resources: code , documentation

ABSTRACT

The limitation of data representation in today’s ﬁle systems is that data representation is bound only in a single way of hierarchically organising ﬁles. A semantic ﬁle system provides addressing and querying based on the content rather than storage location. Semantic tagging is a new way to organise ﬁles by using tags in place of directories. In traditional ﬁle systems, symbolic links become non-existent when ﬁle paths are changed. Assigning multiple tags to each ﬁle ensures that the ﬁle is linked to several virtual directories based on its content. By providing semantic access to information, users can organise ﬁles in a more intuitive way. In this way, the same ﬁle can be accessed through more than one virtual directory. The metadata and linkages for tagging are stored in a relational database which is invisible to the user. This allows efﬁcient searching based on context rather than keywords. The classiﬁcation of ﬁles into various ontologies can be done by the user manually or through automated rules. For certain ﬁles types, tags can be suggested by analysing the contents of ﬁles. The system would be modular in design to allow customisation while retaining a ﬂexible and stable structure.

Keywords : virtual ﬁle system, semantics, indexing, classiﬁcation, database, tagging, information access, metadata

Chapter 1

INTRODUCTION

1.1 Overview

Information organisation and management is an essential and inevitable part of everyday computer usage. Huge amount of data is produced on daily basis. With data growing in size, we are faced with the problem of locating ﬁles. Traditional ﬁle systems impose a hierarchical structure of storage on the user. Traditional ﬁle systems are mono-hierarchical and implement directory trees to categorise and store ﬁles. In such systems, directories are the only means to access particular ﬁles.

In such systems, directories are the only means to access particular ﬁles. The path of a ﬁle contains directories, which refer to its context and categorisation. As an example

“c:nphotosncollegetripnmuseum*.jpg” refers to all photos of a museum from a college trip. In this case, it is not possible to store that photo in another directory say “c:n photosnmuseum*.jpg” without copying the ﬁle. This severely limits the searching capabilities in a ﬁle system. The user is faced with the dilemma of which directory best represents the context of current ﬁle. While storing the ﬁle is identiﬁed by its ﬁle name alone which serves as its identiﬁer. For searching a particular ﬁle, the user has to accurately remember the path and ﬁle name.

A ﬁle cannot be searched by any other information relating to its context. Creating the directory structure is based on the users organisational skills. Searching or browsing throughsomeoneelsedataistrickyastheorganisationisdifferentforeveryuser.Previous approaches to such problems provided symbolic links and aliases as an incomplete answer. Symbolic links become redundant when the target ﬁle paths are changed. Similarly, aliases may become redundant or may not function properly with certain programs. Working with such solutions requires advanced skills on the users part. Keyword based searches which extract metadata from ﬁles were brought to fore by Apple’s Spotlight[13] and Google’s Desktop Search[14] . Both function only on limited ﬁletypesanddonotallowmanualcategorisation.Thisledtothedevelopmentofsemantic ﬁle systems, containing categorisation of ﬁles based on context. It provides access to ﬁles by using categories formed from extracting metadata. It is similar to how music ﬁles can be searched by artist, genre, album etc. However, this presents a limitation on the amount and capabilities of what metadata can be extracted from a ﬁle. Virtual directories are used to represent data from the ﬁle system. These directories do not have a permanent listing and the user has to explicitly query for data. There have been several implementations based on semantic ﬁle systems.

However, they have several limitations in usability. Most of the projects are based only on a few key points, such as limitations over ﬁle types. This project thus is to create a semantic solution to the problems and shortcomings of traditional ﬁle systems while covering the limitations of other implemented projects.

1.2 Brief Introduction

Organising and retrieving information accurately and efﬁciently has attracted a lot of attention. While few have been successful, a number of innovative implementations have

emerged. KWEST is a virtual ﬁle system capable of storing semantics with which it facilitates the ﬁnding of relevant information. Information is stored in tags, which are extracted from a ﬁles metadata. This information may be generated implicitly by the system or supplied explicitly by the user. Thus, the validity of information is based on the user’s level of organising things.

The current system can extract metadata from a limited set of known ﬁle types. However, the modular architecture allows for plugins to be added which can add functionality for other ﬁle types. This allows for the project to be extended and modiﬁed according to the functionality required. The level of awareness generated by the system is based on the frequency of access and input provided by the user. The amount of relevance is determined by the tags generated and their associated ﬁles. This affects how the system categorises and searches ﬁles. Thus the actual outcome of the system which is the searching capabilities is totally dependent on these relationships. The current implementation is based on the Linux kernel. Future implementations can be extended to other platforms and devices. Furthermore, as the system is a virtual one, it needs only slight modiﬁcations to be ported to other ﬁle systems and operating systems.

1.3 Problem Deﬁnition

The goal of this project is to develop a semantic ﬁle system that extracts metadata from ﬁles and allows storage and searching based on its context. Such a system should overcome the drawbacks of traditional ﬁle systems while leveraging the limitations of other such similar implementations.

1.4 Feasibility Study

1.4.1 Technical Feasibility

The team have knowledge of C and Object Oriented concepts. The Project is being implemented using loadable kernel module known as FUSE. In the current versions of some Linux based OS this module is included in the kernel itself. The query processing and programming will be done using SQLite. It is a relational database contained in a small C programming library. In contrast to other database management systems, SQLite is not a process that is accessed from the client application, but an integral part of it. This is also open source and is freely available. Thus, the cost for developing KWEST will be minimal and hence will be feasible without the need for large capital.

1.4.2 Economic Feasibility

Cost of Software: We have used open-source technologies for building our system, thus there was no software cost incurred.

Cost of Hardware: No hardware was required to be purchased, thus no cost was incurred in building the system.

1.5 Application of a Software Engineering Approach

We are using incremental model of Software Development Life Cycle. The Software development life cycle (SDLC) is a process used by a systems analyst to develop an information system, training, and user (stakeholder) ownership. The SDLC aims to

produce a high quality system that meets or exceeds customer expectations, reaches completionwithintimeandcostestimates,workseffectivelyandefﬁcientlyinthecurrent and planned Information Technology infrastructure, and is inexpensive to maintain and cost-effective to enhance.The Software Development Life Cycle framework provides a sequence of activities for system designers and developers to follow. It consists of a set of steps or phases in which each phase of the SDLC uses the results of the previous one. A Software Development Life Cycle (SDLC) adheres to important phases that are essential for developers, such as planning, analysis, design, and implementation.

Figure 1.1: Incremental Model

1.5.1 Planning

During this stage, business opportunities and problems are identiﬁed, and information technology solutions are discussed. Multiple alternative projects may be suggested and their feasibility analysed.

Tasks proposed were

1. Analysing need of user to access related data.

2. Identifying drawbacks of existing ﬁle system.

3. The development of a project goals- identifying relations between data, ﬁle system operations, metadata storage, accessing related data.

4. The collection of project requirements.

5. The development of a project schedule.

1.5.2 System Analysis

The goal of system analysis is to determine where the problem is in an attempt to ﬁx the system.This step involves breaking down the system in different pieces to analyse the situation,analysingprojectgoals,breakingdownwhatneedstobecreatedandattempting to engage users so that deﬁnite requirements can be deﬁned.

Tasks proposed were

1. Interface for implementing ﬁle system.

2. Analyse various database alternatives based on size and speed of operation.

3. Analysis of libraries to be used for metadata operations.

4. Generation of ﬁles and tags suggestions.

5. Analysing various methods to provide user with suggestions.

1.5.3 System Design

In systems design the design functions and operations are described in detail, including screen layouts, business rules, process diagrams and other documentation. The output of this stage will describe the new system as a collection of modules or subsystems.

Tasks proposed were

1. Separating system implementation into following modules.

2. File System Operations.

3. Generating Tags.

4. Importing Semantics.

5. Exporting Semantics.

6. Database Consistency.

1.5.4 Coding

In coding we do actual implementation of system design, produced running software.

Tasks proposed were

1. Implementing modules of project using ﬁnalised programming language and libraries

2. Having modularity in code to allow for reusable modules.

3. Documenting the code using standard tools.

4. Testing project using various software testing approaches.

1.5.5 Testing

Testing is the process of evaluating a system or application, to check whether the application meets all requirements of the client and to detect the errors.

Tasks proposed were

1. Testing of executable code using robust testing tools.

2. Performing Regression Testing.

3. Usability Testing.

4. GUI Testing.

5. Stress Testing.

6. Integrity Testing.

1.5.6 Deployment

Deployment starts after the code is appropriately tested, approved for release, and sold or otherwise distributed into a production environment. This may involve installation, customisation (such as by setting parameters to the customer’s values), testing, and possibly an extended period of evaluation.

Tasks proposed were

1. The project will be deployed as a executable which will mount the KWEST ﬁle system.

2. Utilise available GUI tools in the form of ﬁle managers.

3. Installation of utility should follow standard Linux OS procedures.

Chapter 2

LITERATURE SURVEY

Over the years, organising and retrieving information accurately and efﬁciently has attracted lot of attention. While few have been successful, a number of innovative implementations [11] have emerged. The idea of using a ﬁle’s semantics as the means to categorise it has been around for quite some time. This section discusses the various implementations made in the ﬁeld of semantic ﬁle system.

An efﬁcient implementation of keyword based searching was brought to the desktop by Apple’s Spotlight[13] and Google’s Desktop Search[14] . Both allow efﬁcient and quick ﬁle retrieval based on keywords. They support many ﬁle types and have a simple interface which attracts a large number of users. However, both of them are limited to returning search results without any way to organising contents. In addition, they do not provide any provision to the user for classiﬁcation of data. This limitation prevented the user from having a personalised way to retrieve data stored by them.

Semantic systems depend on data stored inside the ﬁles rather merely relying on an ﬁle’s attributes. Most implementations use common methodologies like content recognition[2] , tagging[7] , extracting metadata, etc. to categorise ﬁles by using various algorithms.

“Semantic File System”[8] , as developed by O’Toole and Gittord in 1992, provides access to ﬁle contents and metadata by extracting the attributes using special modules called “transducers”. It was one of the very ﬁrst attempts to classify ﬁles by semantics using metadata. Its biggest drawback was the need for ﬁle type speciﬁc transducers which were necessary to extract meta information and content from the ﬁle. Also, the user does not have any say in what kind of category the ﬁle is classiﬁed under. This drawback makes it an unattractive option to the general user. It was decided during designing KWEST, that it is necessary to involve the end-user in the tagging process. This allows each user to have their own personal way of classiﬁcation and organisation of ﬁles.

NHFS (Non Hierarchical File System)[12] was a project developed by Robert Freund in July 2007. It allows the user to place any ﬁle into any number of directories. Likewise, any directory can be placed into as many directories as required. NHFS therefore allows one to create a non-hierarchical structure with poly-hierarchically connected ﬁles. This allows for a powerful metaphor of ﬁnding a ﬁle in any of the category (directory) it could be stored under. Therefore, we decided to retain this feature by using tags in place of actual directories. Tags are associated with ﬁles and other tags as well. Thus, a tag may be placed under multiple tags allowing a relationship to be deﬁned between them. This analogy is much more powerful than restricting ﬁles to actual directories. Using tags prevents duplication and redundancy, making it an efﬁcient implementation.

A more recent implementation is Tagsistant[15] , which is a semantic ﬁle system that also attempts to organise ﬁles using tags. It interacts with the Linux kernel using the FUSE module. Under Tagsistant, directories are considered to be equivalent to tags. As a consequence, creating a directory is creating a tag and putting a ﬁle inside a directory means tagging that ﬁle. After you have tagged your ﬁles, you can search all of them by using queries. Queries are just paths where each element is either a directory or logical operators. The entire system has a modular design and uses SQLite. However, it suffers from some speed issues and the lack of SQL indexes. Major ﬂaws of this design were

high consumption of inodes on real ﬁle systems and high computational time which was required to fulﬁl each request. Most of the features of Tagsistant were decided to be included in KWEST. These were modular design, SQLite repository, tagged structure, etc. whichenhancethesemanticsofaﬁlesystem. However,caremustbetakentoprevent the occurrence of similar drawbacks.

Another implementation called Tagster[16] , is a peer-to-peer tagging application for organisingdesktopdata.ItisplatformindependentandisimplementedinJAVA.Multiple ﬁles and also directories can be tagged through its interface. The selected directories are recursively examined and all ﬁles contained within them are tagged. The GUI for a Linux system consists of three main areas.

1. Tag view

It displays a list of tags.

2. Resource view

It lists resources that have the currently selected tags assigned.

3. User view

It displays a list of users that have tagged the currently selected resource with some selectedtag. ItalsoincludesGUIsupportforWindowswithsomeunresolvedissues. However, it lacks auto classiﬁcation of data due to which several common tags may be generated for each user increasing the database size.

Papers referred for the developement of KWEST.

1. K. Chang, W.T. Perdana, B. Ramadhana, K. Sethuraman, T.V. Le, and N. Chachra, Knowledge File System-A principled approach to personal information management, 2010 IEEE International Conference on Data Mining Workshops, Sydney, December 2010, pp. 1037-1044[1] .

Abstract: The Knowledge File System (KFS) is a smart virtual ﬁle system that sits between the operating system and the ﬁle system. Its primary functionality is to automatically organise ﬁles in a transparent and seamless manner so as to facilitate easyretrieval.ThinkoftheKFSasapersonalassistant,whocanﬁleeveryoneofyou documents into multiple appropriate folders, so that when it comes time for you to retrieveaﬁle,youcaneasilyﬁnditamonganyofthefoldersthatarelikelytocontain it. Technically, KFS analyses each ﬁle and hard links (which are simply pointers to a physical ﬁle on POSIX ﬁle systems) it to multiple destination directories (categories). The actual classiﬁcation can be based on a combination of ﬁle content analysis,ﬁleusageanalysis,andmanuallyconﬁguredrules.SincetheKFSorganises ﬁles using the familiar ﬁle/folder metaphor, it enjoys 3 key advantages against desktopsearchbasedsolutionssuchasGoogle’sdesktopSearch,namely1)usability, 2) portability, and 3) compatibility. The KFS has been prototyped using the FUSE (File system in USErspace) framework on Linux. Apache Lucene was used to provide traditional desktop search capability in the KFS. A machine learning text classiﬁer was used as the KFS content classiﬁer, complimenting the customisable rule-based KFS classiﬁcation framework. Lastly, an embedded database is used to log all ﬁle access to support ﬁle-usage classiﬁcation.

Usefulness: This paper describes approach to personal information management through Knowledge File System. It is designed to help users organise information usingVirtualFileSystemtoreducetheproblemofmanualinformationclassiﬁcation andretrieval. KFSprovidesfunctionssoastoautomaticallyclassifytheinformation based on the content similarity with respect to predeﬁned ontologies or also give the option for manual classiﬁcation of the information. The operations carried out on the KFS can also be monitored with the event logger feature. Searching of ﬁles can be carried out by keyword with the help of a text indexer. Furthermore the comparisons between Google desktop ﬁle system, beagle and KFS are given. Finally the details of the implementation of the KFS on the Linux platform using of FUSE are given.

2. O. Eck, and D. Schaefer, A semantic ﬁle system for integrated product data management, Advanced Engineering Informatics, 2011, pp. 177-184[4] .

Abstract:Weinitiallydiscussanumberofdisadvantagesofcurrentﬁlemanagement systems. In the body of the paper our main contribution is presented. That is, a formal mathematical model of a new semantic ﬁle system, SIL (Semantics Instead of Location), that allows engineers to access data based on semantic information rather than storage location is proposed. A major difference between our approach and previous related work is that we do not aim at yet another point solution and, instead, propose an approach that may be employed by next generation engineering

dataprocessingsystemsonalargerscale.Inaddition,acorrespondingprogramming interface along with a graphical user interface used as a ﬁle browser is presented and the beneﬁts of utilising the proposed semantic ﬁle system for product data management in the ﬁeld of integrated design of mechatronic systems are discussed.

Usefulness: This paper describes a formal mathematical model that allows engineers to access data based on semantics rather than actual storage location. The main goal of this ﬁle system is to search ﬁles based on content of data. The browsing of the system is done based on ﬁle’s metadata and attributes. Logical operator such as AND, OR, NOT are used to ﬁlter the results. The classiﬁcation allows multiple tags to be created for ﬁles. Furthermore API’s are written to create views and for the automation of notiﬁcation updates.

3. P. Mohan, S. Venkateswaran, Raghuraman, and A. Siromoney, Semantic File Retrieval in File Systems using Virtual Directories, Proc. Linux Expo Conference, Raleigh, NC, May 2007, pp. 141-151[5] .

Abstract: Hard Disk capacity is no longer a problem. However, increasing disk capacityhasbroughtwithitanewproblem,theproblemoflocatingﬁles. Retrieving a document from a myriad of ﬁles and directories is no easy task. Industry solutions are being created to address this short coming. We propose to create an extendable UNIX based File System which will integrate searching as a basic function of the ﬁle system. The File System will provide Virtual Directories which list the results of a query. The contents of the Virtual Directory are formed at run time. Although, the Virtual Directory is used mainly to facilitate the searching of ﬁle, it can also be used by plugins to interpret other queries.

Usefulness: This paper describes the design of SemFS, which provides semantics based on the ﬁle’s meta-data and attributes. It allows the usage of logical operators to ﬁlter query results. It is implemented as a user space ﬁle system upon journaling storage. The architecture is server-clientwithsupportforAPI’s to extend functionality. Features such as ﬁle tagging and versioning are also implemented.

4. R. Agarwal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, 1993 ACM SIGMOD Conference, Washington DC, USA, May 1993, pp. 207-216[6] .

Abstract : We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efﬁcient algorithm that generates all signiﬁcant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.

Usefulness : For ﬁnding frequently occurring ﬁle and tags, creating association rules based on them give and providing suggestions to user.

Many tagging systems exist that allow efﬁcient manual classiﬁcation of information. Most implementations tend to be theoretical demonstrations or compleximplementations[9]existingforsomeveryspeciﬁcpurpose. Thesesuggest the possibility of using semantics[3] in operating systems in some future date. But a major problem is scalability with regard to related information. However, on a large multi-user ﬁle system, one can get tons of tags to shift through in each folder, increasing the load for users to search and maintain data. Our idea is to introduce a new concept of relating tags to overcome this situation. Implementing all the desired and necessary features from previous implementations,our design goal is to create an efﬁcient Semantic File System which could be used by any class of users.

Chapter 3

SOFTWARE REQUIREMENT SPECIFICATION

3.1 Introduction

3.1.1 Purpose

The goal of this project is to develop a semantic ﬁle system that extracts metadata from ﬁles and allows storage and searching based on its context. The purpose of this document is to present a detailed description of KWEST. It will explain the purpose and features of the system, the interfaces of the system, what the system will do, the constraint under which it must operate and how the system will react to the users actions.

3.1.2 Intended Audience and Reading Suggestion

The intended audience of this document includes both developers and reviewers of the system.

3.1.3 Project Scope

This project is a virtual ﬁle system capable of storing semantics with which it facilitates the ﬁnding of relevant information.

1. Information is stored in tags, which are extracted from a ﬁle’s metadata. This information may be generated implicitly by the system or supplied explicitly by the user.

2. The validity of information is based on the users level of organising things.

3. The system is currently designed to extract metadata from a limited set of popular ﬁle types for audio(.mp3, .ogg, .spx, .mpc, .ape, .ﬂac, .wv, .tta, .wma, .m4a, .wav, .aif[f]), video(.ﬂv, .real, .riff(.avi), .mpeg, .qt, .asf), images(.jpeg, .gif, .png, .tiff) and PDF documents.

4. The modular architecture allows for plugins to be added which can add additional functionality, and recognition for more ﬁle types. This allows the project to be extended and modiﬁed according to the functionality required.

5. The level of awareness generated by the system is based on the frequency of access and input provided by the user.

6. The current implementation is based on the Linux kernel.

7. As the system is an virtual entity, it does not need extensive modiﬁcations to be ported to other ﬁle systems and operating systems.

User Classes and Characteristics

The system can be used by three types of users:

1. General User

Uses the system without any complex modiﬁcations in the system.

2. Advanced User

Understands the system and creates rules and automations based on personal needs.

3. Developer

UsestheAPIprovidedanddevelopsmodulesthatextendthesystem.Onlyadvanced users utilise the semantic nature of underlying ﬁle system to the fullest. This does not create any blocks for the general user, who can also use KWEST satisfactorily. Developers are a differentgroupofusers who can extendKWEST throughmodules. Thesemodulescanmodifyordeﬁneadditionalbehaviourforthesystemforspeciﬁc ﬁle types.

Operating Environment

1. KWEST requires FUSE [17] minimum version 2.8.6

2. KWEST can run on any Linux installation which contains required versions of FUSE.

3. Furthermore, since kernel version 2.6.14, FUSE has been merged into the mainstreamLinuxkerneltree.AsaresultKWESTcanrunonanyLinuxdistribution having a minimum Kernel version 2.6.14.

4. KWEST is a virtual ﬁle system mounted to a folder or a loop device.

3.1.4 Design and Implementation Constraints

Implementation Constraints

1. KWEST uses FUSE to manage userspace ﬁle systems. Hence, access is limited to the executing userspace for the program.

2. Sincetheentireapplicationisexecutedinuserspaceonly,therecannotbeinteraction swith the kernel directly.

3. AlthoughKWESTimplementsavirtualﬁlesystemwhichisaccessibletoallentities, wehaveimplementedthesystemwithcommandlineastheprimaryinterface. Other ﬁle managers such as Nautilus can only browse but not tag ﬁles. This limitation can be addressed with plugins or additional modules built for the speciﬁc ﬁle manager.

4. It is the responsibility of the user that mounting and unmounting of the system be done with standard rules and precaution.

5. The SQLite database is an integral part of KWEST and is contained in a single ﬁle. It is vital for the system that integrity of the database is maintained.

Design Constraints

1. Currently, the auto-tagging feature has been limited to common and popular ﬁle types such as audio(.mp3, .ogg, .spx, .mpc, .ape, .ﬂac, .wv, .tta, .wma, .m4a, .wav, .aif[f]), images(.jpeg, .gif, .png, .tiff), video(.ﬂv, .real, .riff(.avi), .mpeg, .qt, .asf) and .pdf documents. This functionality can be extended with modules or through special tools made speciﬁcally for this purpose.

2. The amount of information visible through common ﬁle system commands (e.g. ls - list contents) is a limitation for KWEST. We cannot show tagging information through these interfaces. Alternate methods for this can be implemented keeping the end user in mind.

3.1.5 Assumptions and Dependencies

Assumptions

1. Users of this software should be aware of how semantics are used to categorise information.

2. Users should recognise or identify appropriate tags in relation to ﬁles.

3. It is assumed that the user is well versed in organisation information and uses KWEST as a tool rather than an assistant.

4. The user should have the required privileges and rights to run KWEST and all its operations.

Dependencies

1. KWEST uses several external libraries to extract metadata from supported popular formats like TagLib for audio, Libextractor for images, video and Poppler for PDF documents.Theselibrariesarerequiredatcompilationtime.Theyenablethesystem to handle metadata extraction.

3.2 System Features

3.2.1 Tags

1. Manual Tagging

Manual tagging is the basis of semantics in KWEST. The user can assign any tag to the ﬁles in KWEST. These tags are then stored internally in a database. The user can create new tags or use tags already deﬁned by the system. Total freedom is given to the user to organise data. Multiple tags can be assigned to the to a single thus allowing its access through multiple locations without duplication of data.

2. Automatic Tagging

KWEST also features automatic tagging of ﬁles. The user can deﬁne certain rules under which ﬁles will be assigned tags. The system will implement those rules for all ﬁles satisfying the deﬁned constraints. This would prevent repetitive tagging operations for the user.

3. Importing tags

Certain popularﬁle formats such as mp3,jpeg etc have metadata embedded in them. KWEST supports such popular format and uses this metadata to automatically assign tags to the ﬁles. This feature enables the user to collectively classify and store the data under these tags.

3.2.2 Database

1. Consistency

KWEST uses an internal database to store and manage data. It is vital that the database always remains consistent. KWEST uses logging mechanisms to ensure that operations on the database always reach an endpoint.

2. Access

The database ﬁles used by KWEST are not locked down or access restricted. The KWEST API provides facilities that can be used to access the database. However, this feature is made available with the understanding that the integrity of the database will be maintained always.

3.2.3 Relation with Existing Data

1. Importing semantics

Users alreadyhave certain organisationalstructures in the waytheystore data in ﬁle systems. KWEST imports these semantics by converting the storage hierarchy to tag-based hierarchy. This allows the entire ﬁle system to be imported into KWEST along with the users previous organisation structure.

2. Files are executable ready

The ﬁles can be directly executed through the virtual ﬁle system without making any modiﬁcation to the ﬁles like audio, video ﬁles can be played through the virtual system, images can viewed and documents can be opened and read.

3.2.4 Exporting Semantics

1. Export ﬁle system

As the entire ﬁle system exists as a virtual entity, KWEST provides the export feature. Where the ﬁle system can be exported to another system where the data can be imported by another instance of KWEST.

2. Export tagged ﬁles

Itisalsopossiblefortheusertoexportdataundercertaintagstoanexternallocation. The semantic organisation showed by tags is converted to actual directories and ﬁles are then copied to these directories. This way the user can export KWEST semantics and data to outside locations.

3.2.5 Modularity

1. Modules As Plugins

KWEST is an extendable system. It can use external modules to increase functionality or to modify existing operations. Support for using modules is built into KWEST right from the design stage. Additional extraction libraries can be included using the plugin module.

2. Support for developers

KWEST provides support to developers by providing access to all internal features and database. The API layer allows developers to easily supplement internal operations with their modules.

3.3 External Interface Requirements

3.3.1 User Interfaces

The user can use the system through the command line. The system mounts a virtual ﬁle system which the user can use to navigate through. If the ﬁle explorer/browser supports virtual ﬁle system, the user can use that to navigate through the ﬁles.

3.3.2 Hardware Interfaces

No hardware interfaces used as the ﬁle system exists as a virtual entity.

3.3.3 Software Interfaces

• FUSE[17]

KWEST uses FUSE to run the ﬁle system code in user space without editing the kernel code.The FUSE module provides only a “bridge” to the actual kernel interfaces. Major ﬁle system operations are deﬁned by FUSE and forwarded to KWEST for implementation.

• SQLite[18]

KWEST uses SQLite database to store data. In contrast to other database management systems, SQLite is not a separate process but an integral part of KWEST. Database is accessed and modiﬁed for most of the operations performed by KWEST.

3.3.4 Communication Interfaces

KWEST can be accessed like any other ﬁle system via Command line or ﬁle managers like Nautilus, Thunar etc.

3.4 Nonfunctional Requirements

3.4.1 Performance Requirements

• Response Time

The response time for any action on the ﬁle system or the database should be reasonable under normal operational circumstances.

• Capacity

KWEST can be used by any user with sufﬁcient permissions to initialise the ﬁlesystem. Subsequent operations like read,write,modify etc. are restricted by user permissions similar to normal ﬁle systems.

• Scalability

KWEST provides suggestions and includes automated tagging rules for Audio, Video and Images. It also allows manual tagging of ﬁles by the user. Various modules can be further added for recognising and categorising other ﬁle types.

3.4.2 Safety Requirements

• Power Failure

The system maintains a log ﬁle for database. It prevents data corruption by committing data that has been fully written to the log. This should prevent most, if not all, data corruption.

• Data Loss

If any ﬁle is accessed which is mentioned in database but deleted from the system then it is removed from the database and appropriate message is displayed to user.

• Access Rights

The system checks if the ﬁle tagged is available to user for access. It does not allow ﬁles to be included which cannot be access by the user.

3.4.3 Security Requirements

KWEST can be used only by a single user with sufﬁcient rights to execute the software. No other user will have permissions to modify tags and ﬁles in the system.

3.4.4 Software Quality Attributes

• Availability

The System shall be available from mounting the ﬁle system until its unmounted.

• Updatability

The system shall allow for addition or deletion of ﬁles under tags.

• Reliability

– The system shall save new tags created by active user.

– The system shall save the ﬁle path of an active user to database whenever new ﬁles are added to a tag.

– The system shall maintain a log ﬁle which records every operation on database. – The system shall modify database whenever tags or ﬁles are deleted.

– The system shall remove ﬁle entries from database whenever it is unable to access them.

• Portability

The system is implemented on Linux. It is compatible with various other Linux distributions like Ubuntu, Fedora, Red Hat, etc.

• Testability

New modules designed to be added to the system, to identify ﬁle types other than audio, video and images must be tested to check if they are compatible with input-output format of the system.

• Usability

The system does not have a large learning curve as the user deals with commands common to all ﬁle systems. Documentation provided for the system will include user manuals, developer reference and common FAQ.

3.5 Other Requirements

3.5.1 Legal Requirements

All the libraries, programs used in this project are open sourced under GPL. SQLite is a database which is free to use, distribute or modify. The FUSE kernel module is merged with the Linux Kernel, which is open sourced and freely available under GPL. There are no proprietary or closed source products, libraries or interfaces used in this program. The project KWEST and its subsequent implementations will be open sourced under the GPL upon completion.

3.6 Analysis Model

3.6.1 Data Flow diagram

Level 0

Figure 3.1: Data ﬂow diagram - Level 0

Level 1

Figure 3.2: Data ﬂow diagram - Level 1

Level 2

Figure 3.3: Data ﬂow diagram - Level 2

3.6.2 Entity Relationship diagram

Figure 3.4: Entity Relationship diagram

3.7 System Implementation Plan

Phase 1: September 2012 - November 2012

Task	Start Date	End Date
Problem identiﬁcation	01/09/12	07/09/12
Information gathering	08/09/12	14/09/12
Creating problem deﬁnition	15/09/12	21/09/12
Understanding underlying technology	22/09/12	05/10/12
Analysing problem	06/10/12	12/10/12
Designing solution	13/10/12	26/10/12
Reﬁning design	27/10/12	02/11/12
Creating design report	03/11/12	09/11/12

Table 3.1: Phase 1 Implementation Plan

Phase 2: December 2012 - March 2013

Task	Start Date	End Date
Building a stub implementation	01/12/12	14/12/12
Implement ﬁle system operations	15/12/12	28/12/12
Extract Metadata from ﬁles	28/12/12	11/01/13
Implement modular plugins	12/01/13	25/01/13
Apply Apriori algorithm to database	26/01/13	08/02/13
Rigorous testing	09/02/13	22/02/13
Debugging and reﬁning	23/02/13	08/03/13
System testing	09/03/13	15/03/13
Creating implementation report	16/03/13	31/03/13

Table 3.2: Phase 2 Implementation Plan

Chapter 4

SYSTEM DESIGN

4.1 System Architecture

Figure 4.1: System Architecture

4.2 UML Diagrams

4.2.1 Use-case diagram

Figure 4.2: Use Case diagram

4.2.2 Component diagram

Figure 4.3: Component diagram

4.2.3 Deployment diagram

Figure 4.4: Deployment diagram

Chapter 5

TECHNICAL SPECIFICATIONS

5.1 Technology Details Used in the Project

5.1.1 Development

FUSE

For developing and managing the virtual ﬁle system we have used FUSE. Minimum required: 2.8

Version used: 2.9.0

SQLite

We have used SQLite as the data repository for this project. Minimum required: 3.6

Version Used: 3.7.13

Language for Implementation

We use ANSI C for implementing our project modules. Speciﬁcally, we follow the GNU C99 standards while compiling our code. GNU C99 is an extension of the C99 providing some extra features. Note: Some features are incompatible with other standards.

GNU C99

Compiler

For compiling our project modules we use the GNU Compiler Collection (GCC). Minimum required: 4.6

Version used: 4.7.2 on Linux Mint 14

5.1.2 Operating Environment

Platform

The project is based on operating systems utilising the linux kernel. Any implementation which provides a POSIX compatible environment is sufﬁcient.

Minimum required: 2.6.14

Version used: 3.5.0-generic on Linux Mint 14

Operating System

We have used various operating environments while creating and testing the project. Linux distributions were Ubuntu 12.04, Ubuntu 12.10, Fedora 18, Linux Mint 13, Linux Mint 14.

5.1.3 External Libraries

TagLib

Used for extracting Audio metadata from ﬁles. Minimum required: TagLib 1.7.1

Version used: TagLib 1.8

LibExtractor

GNU Libextractor is a library used to extract meta data from ﬁles. It supports extraction from Image, PDF and Video ﬁle types.

Minimum required: 1.0.0 Version used: 1.0.1

Poppler

Poppler is a PDF rendering library based on the xpdf-3.0 code base. Minimum required: 0.21

Version used: 0.22

5.1.4 Debugging

GDB: GNU Project Debugger

Used to debug program crashes and memory errors. Version used: 7.5

Valgrind

Used as a proﬁling tool, for memory related issues and program crashes. Version used: 3.7.0

5.2 Reference to Technology

5.2.1 Development

FUSE

FUSE is a loadable kernel module for Unix-like computer operating systems that lets non-privileged users create their own ﬁle systems without editing kernel code. This is achieved by running ﬁle system code in user space while the FUSE module provides only a bridge to the actual kernel interfaces.

http://fuse.sourceforge.net/

SQLite

SQLite is a relational database management system contained in a small C programming library. It implements a self-contained, zero-conﬁguration, transactional SQL database engine which can be embedded in applications.

http://www.sqlite.org/

Language for Implementation

We use GNU99 C standard for implementing our project modules. http://gcc.gnu.org/c99status.html

Compiler

GCC is a compiler system which provides front ends for various languages including C. It provides optimisation’s, debugging and other features to help program development. http://gcc.gnu.org/

5.2.2 Operating Environment

Platform

The project is based on operating systems utilising the linux kernel. Any implementation which provides a POSIX compatible environment is sufﬁcient.

https://www.kernel.org/

Operating System

Linux distributions were : http://www.ubuntu.com/ http://linuxmint.com/ http://fedoraproject.org/

5.2.3 External Libraries

TagLib

TagLib is a library for reading and editing the meta-data of several popular audio formats. Currently it supports both ID3v1 and ID3v2 for MP3 ﬁles, Ogg Vorbis comments and ID3 tags and Vorbis comments in FLAC,MPC,Speex,WavPack TrueAudio,WAV,AIFF, MP4 and ASF ﬁles.

http://taglib.github.com/

LibExtractor

GNU Libextractor is a library used to extract meta data from ﬁles. The goal is to provide developers of ﬁle-sharing networks, browsers or WWW-indexing bots with a universal library to obtain simple keywords and meta data to match against queries and to show to users instead of only relying on ﬁle names.

http://www.gnu.org/software/libextractor/

Poppler

Poppler is a PDF rendering library based on the xpdf-3.0 code base. http://poppler.freedesktop.org/

5.2.4 Debugging

GDB: GNU Project Debugger

GDB, the GNU Project debugger, allows you to see what is going on ‘inside’ another programwhileitexecutes–orwhatanotherprogramwasdoingatthemomentitcrashed. http://www.gnu.org/software/gdb/

Valgrind

Valgrind is a GPL licensed programming tool for memory debugging, memory leak detection,andproﬁling. Valgrindwas originallydesignedto be afree memorydebugging tool for Linux on x86, but has since evolved to become a generic framework for creating dynamic analysis tools such as checkers and proﬁlers.

http://valgrind.org/

Chapter 6

PROJECT ESTIMATION, SCHEDULE AND TEAM STRUCTURE

6.1 Project Estimate and Schedule

We use the Work Breakdown Structure (WBS) for project estimation. The WBS is organised around the primary products of the project (or planned outcomes) instead of the work needed to produce the products (planned actions). The Work Breakdown Structure presented here represents all the work required to complete this project.

1	Project Management	1 Week
1.1 1.2 1.3 1.4	Project Initiation Project Planning Project Execution and Control Project Closeout	1 Day 2 Days 3 Days 1 Day
2	Deﬁnition	2 Weeks
2.1 2.2 2.3 2.4 2.5	Start-up and Orientation Determine Project Requirements Create future process model Reconcile Project Requirements Functional Speciﬁcation	2 Days 3 Days 2 Days 2 Days 5 Days
3	System Design	3 Weeks
3.1 3.2 3.3 3.4 3.5 3.6	Technical Architecture System Standards Physical Environment Technical Speciﬁcation Creating Prototypes Test Plans	5 Days 2 Days 2 Days 3 Days 6 Days 3 Days
4	System Development	5 Weeks
4.1 4.2 4.3 4.4 4.5 4.6	Develop and Test Software Modules User Training Materials Technical Documentation Unit, Integration and System Test Results Performance testing Acceptance testing	10 Days 2 Days 5 Days 10 Days 4 Days 4 Days
5	System Implementation	3 Weeks
5.1 5.2 5.3 5.4	Acceptance Environment Data Initialisation and Conversion Test Results Acceptance test results Supporting Material	5 Days 8 Days 4 Days 4 Days
6	Transition to Ready State	1 Week
6.1 6.2 6.3	Train Users Convert Data Deploy System	2 Days 2 Days 3 Days

Table 6.1: Work Breakdown Structure

6.2 Team Structure

Member Member 1 Member 2 Member 3 Member 4

Member 1: Aseem Gogte Prior Experience:

• Well versed in C, C++

• Visual Basic

• Java

• Databases - Oracle

New Technology Learnt:

• Version Control - GIT

• FUSE

Name Aseem Gogte Sahil Gupta

Harshvardhan Pandit Rohit Sharma

• Proﬁling and Debugging Tools - Valgrind, GDB

Member 2: Sahil Gupta Prior Experience:

• Well versed in C, C++

• Visual Basic

• Java

• Databases - Oracle

New Technology Learnt:

• Version Control - GIT

• FUSE

• Databases - SQLite

Member 3: Harshvardhan Pandit Prior Experience:

• Well versed in C, C++

• Visual Basic, .NET

• Java

• System programming

New Technology Learnt:

• Version Control - GIT

• FUSE

• Proﬁling and Debugging Tools - Valgrind, GDB

Member 4: Rohit Sharma Prior Experience:

• Well versed in C, C++

• Visual Basic

• Java

• Databases - Oracle

New Technology Learnt:

• Version Control - GIT

• FUSE

• Databases - SQLite

Chapter 7

SOFTWARE IMPLEMENTATION

7.1 Introduction

The implementation ofthe software was done in a modularmannerusing the incremental approach of SDLC. ANSI C was the language used to implement the project. An SQLite database was used for creating and managing the data repository. Various external libraries were used for the extraction of the metadata. The following modules constituted our project: •

• Module 1: Creation of a virtual ﬁle system using FUSE

• Module 2: Interfacing a Data Repository using SQLite

• Module 3: Adding automated Extraction of Metadata

• Module 4: Importing Semantics in to the ﬁle system

• Module 5: Exporting Semantics from the ﬁle system

• Module 6: Association Rule learning using Apriori Algorithm

7.2 Databases

The project was to implement a semantic ﬁle system. This required a data repository to store all the information such as ﬁle name, physical location, attributes, etc. Also, the metadata extracted from the ﬁles was also stored in the database. This way the database formed a central information location for the ﬁle system. Therefore, for the ﬁle system implementation to be stable andefﬁcient,the system requiredan in-place robustdatabase whichoffers integrity andspeed. From the various options available,the SQLite database was selected and used.

SQLite is a relational database management system contained in a small C programming library. Unlike client-server database management systems, the SQLite engine has no standalone processes with which the application program communicates. Instead,theSQLitelibraryislinkedinandthusbecomesanintegralpartoftheapplication program. SQLitestorestheentiredatabase(deﬁnitions,tables,indices,andthedataitself) as a single cross- platform ﬁle on a host machine. Features include :

1. Zero Conﬁguration

2. Serverless

3. Single Database File

4. Stable Cross-platform database

5. Manifest typing

7.3 Important modules and algorithms

7.3.1 Creation of a Virtual File System using FUSE

The ﬁrst module of implementation was to create a basic ﬁle system using FUSE. Using FUSE we created a virtual ﬁle system capable of doing all the operations that a normal ﬁle system does. The module consists of the following phases: •

• Phase 1: Implement FUSE to create a basic ﬁle system structure using ANSI C as the implementation language.

• Phase 2: Connect the ﬁle system created to the data repository created using a SQLite database. Create tags and ﬁles view by querying the database through FUSE.

• Phase 3: Implementing common ﬁle operations with respect to tags and ﬁles such as read, write, open, copy, move etc.

• Phase 4: Extraction of metadata from the ﬁles using external libraries and organisation of the ﬁle system based on metadata.

• Phase 5: Displaying suggestion based on associations derived using Apriori Algorithm.

7.3.2 Interfacing a Data Repository using SQLite

The database is an important module of the ﬁle system. All the data required to browse and navigate the ﬁle system is stored in the database. FUSE interacts with the data in the database by querying for particular data based on path accessed. It is vital for the proper functioning of the system that the database always remains consistent. Logging mechanisms ensure that operations on the database always reach an endpoint. This module is used to check, correct and maintain integrity of the database by checking for redundant entries. Also, if there are new ﬁles which have not been added to KWEST, this module can help the user add them. We implement this module in the following ways: •

• Phase 1: Create database tables for a ﬁle system.

• Phase 2: The relation tables between tags, ﬁles are stored.

• Phase 3: Store the extracted metadata in the database.

• Phase 4: The association rules for the data are derived using the Apriori algorithm.

7.3.3 Adding Automated Extraction of Metadata

Metadata (meta content) is deﬁned as data providing information about one or more aspects of the data. Metadata can be stored either internally, in the same ﬁle as the data, or externally, in a separate ﬁle. Metadata that is embedded with content is called embedded metadata. The metadata of the ﬁle is extracted by using external libraries. The data repository stores the extracted metadata in a predetermined format. •

• Phase 1: Test external libraries to determine which of them can be used.

• Phase 2: Extract metadata using external libraries.

• Phase 3: Store the extracted metadata in the database.

• Phase 4: Form relations between metadata and ﬁles using association rules.

7.3.4 Importing Semantics in to the File System

Users already have certain organisational structures in the way they store data in ﬁle systems. This module imports semantics by converting the storage hierarchy to tag-based hierarchy. This means the directory structure present in the ﬁle system will be used to form tags and the ﬁles listed under the directory are tagged under that tag. •

• Phase 1: Parse the folder structure on local hard disk.

• Phase 2: Add entry for each ﬁle and folder to the database. • Phase 3: Remove or ignore hidden and system ﬁles.

• Phase 4: Prune the database entries on every start up.

7.3.5 Exporting Semantics from the File System

This module can export the storage hierarchy to some external location. The semantic organisation of tags is converted to actual directories and the ﬁles are then copied to these directories. This is similar to copying contents from one ﬁle system to another. •

• Phase 1: Copy virtual locations to external location. • Phase 2: Perform physical copy of ﬁles.

• Phase 3: Create folders and sub folders based on tags. • Phase 4: Copy suggestions using data repository.

7.3.6 Association Rule Learning using Apriori Algorithm

Association rules help in organising the ﬁle system data by providing suggestions while tagging ﬁles. These suggestions can be helpful when the user has either forgotten to tag the ﬁle, or is yet about to do it. This association rule learning approach uses the Apriori algorithm. •

• Phase 1: Run Apriori over the KWEST database. • Phase 2: Perform optimisation’s and prune steps. • Phase 3: Store association rules in database.

• Phase 4: Integrate with KWEST to show suggestions.

7.4 Buisness logic and architecture

7.4.1 Buisness logic

In computing, a ﬁle system (or ﬁlesystem) is a type of data store which can be used to store,retrieve andupdate a setofﬁles. The term couldreferto the abstractdata structures used to deﬁne ﬁles, or to the actual software or ﬁrmware components that implement the abstract ideas.

Traditionally, ﬁle systems were always developed with performance parameters in mind. However, with the data-rich systems of today, the responsibilities of a ﬁle system have increased. The onus of organisation and retrieval of data is much more on the ﬁle system than the user. The ﬁle system structure deﬁnes the capability and capacity of searches performed on it. In such a scenario, a ﬁle system that facilitates retrieval by providing features that help automate organisation is a lucrative option.

7.4.2 Expenses and Legal ramniﬁcations

The software has been open sourced under the Apache License. As such, there are no cost requirments that can be incurred by the adaption of KWEST. All the utilised tools, libraries, platforms and algorithms are free to use. This encourages technology adaption for other students, developers and organisations.

7.4.3 Novelty of idea

There have been no previous implementations of adapting data mining techniques to a ﬁle system. This suggests the novelty of the idea of KWEST and the possibilities for future work.

7.4.4 Buisness Architecture

Chapter 8

SOFTWARE TESTING

8.1 Introduction

8.1.1 Purpose

Software testing can be stated as the process of validating and verifying that a computer program/application/product:

• Meets the requirements that guided its design and development. • Works as expected.

• Can be implemented with the same characteristics. • Satisﬁes the needs of stakeholders.

Software testing, depending on the testing method employed, can be implemented at any time in the development process. Traditionally most of the test effort occurs after the requirements have been deﬁned and the coding process has been completed, but in the Agile approaches most of the test effort is on-going. As such, the methodology of the test is governed by the chosen software development methodology.

8.1.2 Scope

The testing of the system was done manually and no testing tools were used. The test plan describes the unit, functional, performance, usability, regression tests that were performed. Only codes that were pushed as commits were considered as candidates for testing.

8.1.3 Intended Audience

The testing of this system is intended for 3 types of audiences:

1. End Users: The users who will be using the system will review the testing as a mark of stability and performance of the system.

2. Developers: Will view the testing for knowing existing limitations and bugs. 3. Reviewers: Will use these test results as a metric to evaluate the project.

8.2 Test Plan

8.2.1 Target Items

The following have been identiﬁed as targets for testing: 1. Code and associated areas

2. File system operations 3. Databases: SQLite3

4. Operating Systems

8.2.2 Outline of Tests

Tests performed

1. Performance tests

2. Functional tests

3. Data Integrity tests

4. Regression tests

5. Usability tests

8.2.3 Test Approach

Any bugs found should be reported with related information, which should include who discovered it, how, a description of the bug, and who ﬁxed it and when. Also, re-testing of the code done to make sure that defect has been ﬁxed and there no new bugs produced due to change in code.

1. Performance Testing:

The focus of Performance testing is checking a software program’s

• Speed : Determines whether the system responds quickly.

• Scalability : Determines maximum user load the software application can handle.

• Stability : Determines if the application is stable under varying loads.

Tools required: Software timers Success criteria:

a) Manual(user) perception does not notice any “lags”.

b) Time to perform operations is within an acceptable range.

2. Functional Testing:

The prime objective of Functional testing is checking the functionalities of the software system. It mainly concentrates on -

• Mainline functions : Testing the main functions of an application.

• Basic Usability : It involves basic usability testing of the system. It checks whetheran usercan freelynavigate throughthe screens withoutanydifﬁculties.

• Accessibility : Checks the accessibility of the system for the user.

• Error Conditions : Usage of testing techniques to check for error conditions. It checks whether suitable error messages are displayed.

Tools required: None(manual testing)

Success criteria: All of the following are successfully tested:

a) all key use-case scenarios. b) all key features.

3. Data Integrity Testing:

Data integrity refers to the quality of the data in databases and is the measurement by which users examine data quality, reliability and usefulness. Data integrity testing veriﬁes that converted data is accurate andfunctions correctly within a given application. Testing data integrity involves:

• Database : Verifying that correct values are saved in databases. • Write-back : Correct data is written to disk.

• Read : Correct data is read from disk.

• File Integrity : Operations do not break existing ﬁles.

Tools required: File compare tools (manual testing)

Success criteria: All of the following are successfully tested:

a) ﬁles are same in size, byte-blocks, permissions and parameters. b) data is not changed, modiﬁed or removed unless intended.

4. Regression Testing:

Regression Testing is required when there is a :

• Change in requirements and code is modiﬁed according to the requirement • New feature is added to the software

• Defect ﬁxing

• Performance issue ﬁx

Tools required: None(manual testing)

Success criteria: All of the following are successfully tested:

a) all previous operations are successfully executed. b) previously solved bugs are not re-introduced.

c) operations do not suffer from unwanted performance hits.

5. Usability Testing:

Goal of this testing is to satisfy users and it mainly concentrates on the following parameters of a system:

Effectiveness of the system

• Is the system is easy to learn?

• Is the system useful and adds value to the target audience?

• Is Content, Colour, Icons, Images used are aesthetically pleasing ? Efﬁciency

• Navigation required to reach desired screen/web page should be very less. Scroll bars shouldn’t be used frequently.

• Uniformity in the format of screen/pages in your application/website. • Provision to search within your software application or website

Accuracy

• No outdated or incorrect data like contact information/address should be present.

• No broken links should be present. • User Friendliness

• Controls used should be self-explanatory and must not require training to operate

• Help should be provided for the users to understand the application / website • Alignment with above goals helps in effective usability testing

Tools required: None(manual testing)

Success criteria: All of the following are successfully tested:

a) operations are not changed drastically from a traditional ﬁle system. b) user can use the ﬁle system without any special tools.

8.2.4 Entry and Exit Criteria:

1. Test Plan

A Test Plan Entry Criteria: Code is complete and has been pushed to the Git repository.

B Test Plan Exit Criteria: All functional requirements have been veriﬁed.

C Suspension and Resumption Criteria: Testing will be suspended on critical design ﬂaws that will changes in redesign of critical components. Testing will resume when the coding is complete and code is reviewed successfully.

2. Test Cycle

A Test Cycle Entry Criteria: When a module has been completed.

B Test Cycle Exit Criteria: All tests speciﬁed at the start of the testing have completed successfully.

8.2.5 Risks, Dependencies, Assumptions, Constraints

Risk	Mitigation Strategy	Contingency
FUSE API changes	Use FUSE version numbers to run a static check while compiling for required version of FUSE.	Change operation code to new version.
External Library is no longer maintained	Try to use the latest version number of library available and keep a source ready for distribution.	Change to alternate library.
Performance has degraded	Code with performance in mind, using fast algorithms and approaches.	Use proﬁling tools to detect memory issues and static code analysers for code checking.

Table 8.1: Risk Management

8.2.6 Problem Reporting, Escalation, and Issue Resolution

Eachbugwillbegivenapriority,whichwilldeterminewhenitisaddressedinthecurrent iteration. The bug priority may change due to other bugs, issues or re-evaluation of the bug by a peer review.

8.3 Test Cases

8.3.1 Introduction

The purpose of this Test Case document is to specify and communicate the speciﬁc conditions which need to be validated to enable an assessment of the system. Test Cases aremotivatedbymanythingsbutwillusuallyincludeasubsetofUseCases,performance

characteristics and the risks the project is concerned with. A separate test case document is prepared for each testing phase (unit, integration, integrity, etc.) identiﬁed in the test plan. The test cases should be organised into related groups that are meaningful to the project – i.e. test suites.

8.3.2 File System Operations

Testing the ﬁle system for implementations of required operations:

Operation	Status	Comment
getattr readdir access truncate destroy open release mknod rename unlink read write chmod chown mkdir rmdir symlink readlink link utimens statfs fsync setxattr getxattr listxattr removexattr	YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES YES INVALID INVALID INVALID NO NO INVALID NO NO NO NO	checks whether the given path exits lists the contents of the given tag checks for access to speciﬁed tag closes ﬁle after operation called on ﬁle system unmount opens for ﬁle for access releases ﬁle after access creates new ﬁle renames ﬁles and folders removes ﬁle from system reads data from ﬁle writes data to ﬁle changes permissions changes owner creates new directory removes directory not required in KWEST not required in KWEST not required in KWEST not implemented not implemented not required in KWEST not implemented not implemented not implemented not implemented

Table 8.2: File system operations

8.3.3 Performance of ﬁle system operations

Comparison of KWEST ﬁle system against underlying ﬁle system. Test Bench:

• Operating System: Linux Mint 14 3.5.0-25-generic

• Original ﬁle system: ext4 500GB disk with partition size 150GB

• RAM: 4GB

• swap: 8GB on disk

• CPU utilisation: average 4

Contents of Music folder imported into KWEST:

• Audio: 17 ﬁles totalling 102.9MB

• Images: 81 ﬁles totalling 196MB

• PDF: 11 ﬁles totalling 17.9MB

• Video: 4 ﬁles totalling 1GB

• Others: 7 ﬁles totalling 7MB

• Total: 120 ﬁles of size 1.3GB

Test	Time taken	Comment
all ﬁles	120sec	total ﬁle size imported was 1.3GB
videos	40sec	extracting metadata from videos is more expensive compared to other ﬁle types
images	35sec	images having metadata take longer than those without
audio	4sec	audio ﬁles are the fastest to parse and load
PDF	2sec	PDF ﬁles are parsed quickly as compared to other document types
forming associations	2sec	time is proportional to number of common ﬁles in user tags

Table 8.3: Performance tests for mounting KWEST

Test scripts

The following is a simple test script used to evaluate ﬁle system operations. The script works by calculating the difference in time before and after the execution of operations.

#store current time let DA=(‘date +%s ‘)

#perform file system operation ls -R kwest/src/mnt

#store new time

let DB=(‘date +%s‘) #calculate the difference let DC=$DB-$DA

#output the time taken echo $DC

Operation	ext4	KWEST	Comment
list directory	500ms	550ms	there is no noticeable difference
read ﬁle	700ms	850ms	some extra time is taken to read a ﬁle depending on the amount of data being read. In general, there is no noticeable difference.
write ﬁle	1200ms	2200ms	(for5MB textﬁle) writing takes slightlymore time, but the difference is within acceptable range.
read and write	1010ms	2400ms	(for 5MB text ﬁle) reading and writing simultaneously does not produce any performance degradation.

Table 8.4: Performance tests for using KWEST

READ operation TEST

Read operation test included reading from a 4.1MB ﬁle and outputing the contents on terminal. The ﬁle was accessed on the KWEST ﬁle system.

GNU nano 2.2.6 File: ./test.sh echo "READ test"

echo "cat ./mnt/Files/Music/t2.txt let kT=0

for i in 1 2 3 4 5 do

echo "Run#" $i #store current time let kS=(‘date +%s‘)

#perform file system operation

cat ./mnt/Files/Music/t2.txt > ./mnt/Files/Music/test.txt #store new time

let kE=(‘date +%s‘) #calculate the difference let kO=$kE-$kS

let kT=kT+kO

#output the time taken

echo "kwest operation time = " $kO "sec" done

let kT=kT/5

echo "average time taken = " $kT

Figure 8.1: Read operation average time for 4.1MB text ﬁle

WRITE operation test

Write operation test included reading from a 4.1MB ﬁle and writing the contents to another ﬁle. Both the ﬁles were accessed from the KWEST ﬁle system.

GNU nano 2.2.6 File: ./test.sh echo "READ test"

echo "cat ./mnt/Files/Music/t2.txt > ./mnt/Files/Music/ test.txt"

let kT=0

for i in 1 2 3 4 5 do

echo "Run#" $i #store current time let kS=(‘date +%s‘)

#perform file system operation

cat ./mnt/Files/Music/t2.txt > ./mnt/Files/Music/test.txt #store new time

let kE=(‘date +%s‘) #calculate the difference let kO=$kE-$kS

let kT=kT+kO

#output the time taken

echo "kwest operation time = " $kO "sec" done

let kT=kT/5

echo "average time taken = " $kT

Figure 8.2: Write operation average time on 4.1MB text ﬁle

8.3.4 Proﬁling Code

Code can be proﬁled using Manual methods, or using speciﬁc tools such as Valgrind, GDB, Splint etc. For testing KWEST, we have used the following proﬁling tools:

GDB

GDB can be used to debug the ﬁle system and check for memory leaks, errors and irregular operations. The sample output given below shows a clean mount and unmount of the KWEST ﬁle system.

$ gdb ./kwest

GNU gdb (GDB) 7.5-ubuntu

Copyright (C) 2012 Free Software Foundation, Inc. Reading symbols from kwest/src/kwest....

(gdb) run -s -d -f mnt

Starting program: kwest/src/kwest -s -d -f mnt [Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib/x86_64-linux-gnu/ libthread_db.so.1".

KWEST - A Semantically Tagged Virtual File System ...

...

[Inferior 1 (process 20863) exited normally] (gdb) bt

No stack.

Test	Status	Possible Errors
list directory	PASS	I/O error, illegal operation, transport endpoint not connection, connection abort
read ﬁle	PASS	I/O error, illegal operation, access denied, database error
write ﬁle	PASS	I/O error,illegal operation,access denied,ﬁle system busy
mknod	PASS	Operation not permitted, I/O error, database error
unlink	PASS	Device busy, Operation not permitted, I/O error, database error
mkdir	PASS	Operation not permitted, I/O error, database error
rmdir	PASS	Device busy, Operation not permitted, I/O error, database error
chmod	PASS	Access denied, I/O error, device busy, database error
associations	PASS	memoryerror,segmentationfault,unconditionaljump,I/O error
fuse main	PASS	incompatible version

Table 8.5: GDB debugging for KWEST

Chapter 9

RESULTS

9.1 File System View after Mounting

The KWEST ﬁle system needs to be mounted by running a program and specifying a mount point. Once mounted, the ﬁle system can be used through any tool or application.

9.1.1 Terminal View

Figure 9.1: KWEST ﬁle system in Terminal after mounting

9.1.2 File Manager View

A ﬁle manager or ﬁle browser is a computer program that provides a user interface to work with ﬁle systems. Files are typically displayed in a hierarchy. This is the default and only way traditional ﬁle systems can work. For semantic ﬁle systems like KWEST, virtual entries act like folders and ﬁles. These allow the ﬁle system to specify what entries to display without adhering to any strict hierarchy.

File managers used and tested with KWEST include:

• Nautilus - The default ﬁle manager on Gnome distributions.

• Nemo - The default ﬁle manager for Cinnamon desktop.

• PCManFM - Lightweight ﬁle manager found in LXDE.

• XFE - File manager for the X Window system.

• Dolphin - The default ﬁle manager for KDE distributions.

• Thunar - Fast and extendable ﬁle manager which supports plugins.

Figure 9.2: KWEST ﬁle system in PCManFM after mounting

9.1.3 Organisation by File Types

The KWEST ﬁle system extracts metadata from ﬁles while it is being imported. Depending on the extracted metadata, ﬁles are categorised and put into folders corresponding to their types. Currently, KWEST supports the following ﬁle types:

1. Audio ﬁles

2. Images

3. PDF Documents

4. Videos

The separation of ﬁles into individual folders based on their types makes it easy to ﬁnd a particular ﬁle. This is because each ﬁle type has its own metadata, which is used to categorise that ﬁle. An automated and organised view of ﬁles helps avoid clutter and provides efﬁcient searching facilities.

Figure 9.3: Chaotic organisation in normal ﬁle system vs KWEST auto-organisation

9.2 Usage and Performance

Figure 9.4: Performance test for KWEST with normal user task loads

Although KWEST is a virtual ﬁle system, it can be used for normal daily tasks such as listening to music, reading documents, watching videos, etc. In the screenshot below, the applications each run or open a ﬁle on the KWEST ﬁle system simultaneously. No perceptible lag or performance degradation was noticed while playing music or watching the video. The ﬁle system performed sufﬁciently well in all four application operations.

9.3 Displaying Suggested Files

Suggestions are ﬁles which are not actually present in that tag, but can be tagged as per the system’s recommendation. There are two types of suggestions:

• Probably Related : The ﬁles probably belong to the tag, but the system is not sure about it.

• Related : The ﬁle deﬁnitely belongs to the tag, and the user can tag it.

Figure 9.5: KWEST showing suggestions for including ﬁles in the favourites tag

9.4 Exporting Files

Figure 9.6: KWEST Export function - Copy ﬁle to external location

The entries, ﬁles, tags in KWEST are all virtual entries. They do not exist anywhere on a physical disk. In order to “export” this organisation to another location outside of KWEST, there is the Export feature. Practically there is not difference between a normal copy and export. Export copies the virtual entries as real ﬁles by physically accessing the location where they are stored, and then copying them. To the end user, there is no increase in steps or time for the copying to complete.

Chapter 10

DEPLOYMENT AND MAINTAINENCE

10.1 Installation and Uninstallaion

10.1.1 Installing

• Get the latest copy of KWEST by downloading from the project hosting site: https://code.google.com/p/kwest/downloads

• After extracting the contents, open up a terminal and type in the following commands:

make kwest_libs

export LD_LIBRARY_PATH=../lib:D_LIBRARY_PATH make

10.1.2 Mounting and Unmounting

• To mount a KWEST ﬁle system, the user needs to run the mount script which will mount the ﬁle system in the speciﬁed folder.

./kwest mount-point

• TounmountaKWESTﬁlesystem,theuserneedstorunthefusermount-ucommand with the argument path of KWEST mount point. A successful unmount operation does not return any message.

fusermount -u mount-point

10.1.3 Uninstallation and Removal of Data

1. KWEST “installs” ﬁles in the users local directory.

2. To clean allthe installation data including the database andlog ﬁles,the usershould use the make clean or make ob commands.

3. These commands are present in the Makeﬁle, which comes with the project source.

4. In case of manual installation, there is a KWEST folder in the .conﬁg directory.

5. Incomplete removal of ﬁles may cause haphazard execution of the program or corruption of ﬁles.

10.2 User Help

KWEST is a virtual ﬁle system. This means that the folders and ﬁles represented by it are a part of its virtual organisation. Each ﬁle in KWEST represents an actual ﬁle stored somewhere on the underlying ﬁle system. The main focus of using KWEST is organisation.

10.2.1 Importing Files and Folders

• By default KWEST imports from the users HOME folder. It recursively scans all the sub folders and imports the folder hierarchy into KWEST. Hidden ﬁles are ignored by KWEST.

• The user can also explicitly use the import tool to import ﬁles and folders into KWEST. The import tool accepts the folder to import as argument and imports all ﬁles and folders within it. It can work regardless of whether the KWEST ﬁle system is mounted or not.

• While importing, the metadata of the ﬁle is used to organise each ﬁle. For e.g.: A music ﬁle contains the song’s artist, which is used to categorise that ﬁle.

10.3 Browsing Files by Type

10.3.1 Audio

This folder contains all the ﬁles recognised by the system as being of type audio. Upon accessing the folder, each Audio ﬁle is further categorised by Album, Artist, Genre. If a particular metadata is absent for the audio ﬁle, KWEST categorises it in the Unknown folder.

For e.g. If the audio ﬁle does not have any artist associated with it, it can be found in UnknownArtist.

Figure 10.1: Audio ﬁles being categorised by /Album and /Artist/Album views

10.3.2 Image

This folder contains all the image ﬁles recognised by the system. Inside, the images are organised by ImageCreator and ImageDate. The ImageCreator sub folder is based on Creators like software - Adobe Photoshop, or hardware - Camera Models. Each ImageCreator tag is further organised by ImageDate. The ImageDate folder contains images sorted by Month-Year of creation. E.g.: A picture taken on “2nd March 1992” will appear under “1992Mar”. As with Audio, ﬁles with metadata missing will appear under appropriate Unknown sub folders.

Figure 10.2: Images being categorised by Creator and CreationDate

10.3.3 PDF

All PDF documents are tagged with the PDF tag. Each PDF document is organised by its Author,Publisher,Subject and Title. Files with metadata missing are appropriately tagged under Unknown tags.

Figure 10.3: PDF documents being separated by Author and Title

10.3.4 Video

Currently, the system organises video based only on Length, with the categories being Short, Medium and Long. A short video is anything with less than 1800s of playtime. Videos with play time equal to or greater than 5400s are considered Long, and anything between them is considered Medium. The Video tag does not contain any categories as most of the videos stored by the user do not have any metadata present.

Figure 10.4: Browsing the KWEST Video folder in a terminal

10.3.5 USER

The tag ‘USER’ refers to the username of the current user. The user can create and manage his own personal tags in this folder.

Figure 10.5: User tags displayed in KWEST

10.3.6 Using Suggestions to tag ﬁles

KWEST helps the user with organisation by providing suggestions for tagging ﬁles. These suggestions are provided as ﬁles preﬁxed with the word - “SUGGESTED”. The user may make use of that suggestion by tagging that ﬁle in the current tag.

e.g. in tag Zodiac, there are 3 suggestions: ARIES, GEMINI, CANCER each shown as a ﬁle with SUGGESTED preﬁxed in their names. To make use of the suggestion on ARIES, the user tags the ﬁle ARIES under the tag Zodiac. The ﬁle is now seen in the tag without the suggested preﬁx. The other suggestions are still present and may be used further or deleted.

Figure 10.6: KWEST offering suggestions for tag favourites

Chapter 11

Conclusion and Future Scope

11.1 Conclusion

Considering data organisation

A ﬁle system, considering that it stores data, is created with stability and performance in mind. An end-user is more concerned with how they can store and access their data efﬁciently. Using a KWEST ﬁle system, the user can organise their data efﬁciently. All ﬁles are stored and accessed by their content or context rather than just a bunch of string-names. This allows the user to think and remember the ﬁle in terms of what it represents, rather than a path-name which may not be related to the data it contains. This semantic approach is helpful to the user, as it becomes easier to manage for them to search for and manage their data.

Automation of Tasks

KWEST automatically uses the metadata embedded in a ﬁle to apply tags and categorise it.Thisautomationhelpstheuserbyprovidingaccesstoﬁlesaccordingtotheirrespective contexts. By doing this, the user gets all Audio ﬁles under the Audio folder, all Image ﬁles under the Image folder and so on. Further, each folder is organised by meta-type belonging to that mime type. E.g. the Audio folder is sorted by Album, Artist, Genre.

Providing Suggestions

It may happen that the user inadvertently misses out tagging some ﬁle, which may result in an incomplete organisation. The user may later search for that particular ﬁle in the tag, but will not ﬁnd it. In a KWEST ﬁle system, based on the occurrences of ﬁles in various user tags, the system provides suggestions that help the user tag a ﬁle in appropriate places. This helps avoid missing out on important ﬁles, and allows a faster method of organisation as the ﬁles to be tagged are available as suggestions.

Performance and Stability

Although a KWEST ﬁle system is created and operates as a VFS, there is no perceptible lag or performance hit on the operations the user performs. Common operations like listening to music, watching a video, writing to a document can be carried out smoothly.

Also, since KWEST is based on an always stable implementation of FUSE, the ﬁle system itself is stable. Strict coding standards and rigorous tests against memory allow the ﬁle system to remain stable even under moderate usage.

11.2 Future Scope

The project, with its novel concept of Applying association rule learning in a semantic ﬁle system; is the ﬁrst implementation of a semantic ﬁle system to actively help the user categorise and organise their data.

Support for more Standard File Types

Right now the import feature can extract metadata only from a ﬁxed set of ﬁle types. More ﬁle types can be handled which increase the feature and usefulness of the ﬁle system. Every operating system or ﬁle manager has a certain knowledge of what kind of metadata each ﬁle type can contain. Using this knowledge in the KWEST ﬁle system will allow the user to be able to browse a ﬁle system completely based on it’s semantics.

More Relations and Associations

Currently, the system creates associations based on the common occurrences of ﬁles between various tags. This is done using the Apriori algorithm. There are a lot of other interesting approaches which can be utilised. Like algorithms to create various different associations. OrchangingthewayApriorihandlesﬁlesandtags. Fileaccesses,frequency of usage, explicit user choices can also be utilised for forming results.

Efﬁciency and Performance

Although the system is both efﬁcient and performant, it can be vastly improved to provide a high-quality ﬁle system. Operations can be threaded to reduce the wait time. Simultaneous access to ﬁles can be used to provide a ﬂuid experience. Algorithm throughput can be raised to get more accurate results. These are just a few performance and efﬁciency related things we can do with KWEST. The ultimate approach is to integrate this ﬁle system at the kernel level. This will allow performance and stability similar those of traditional ﬁle systems.

Collect Data from Various Locations

The KWEST ﬁle system imports data only from the user’s HOME folder. The data stored there might not be the only location a user wants to use. In today’s world, each person has a multitude of devices ranging from laptops,PCs,tablets and phones to Cloud services like Dropbox, Google Drive. Each of them have data which the KWEST ﬁle system can utilise to form associations and display using virtual suggestions. It will result in a uniﬁed view of all the user’s data categorised and organised, which is spread and available across all of their devices.

11.3 Need for KWEST Tomorrow

With data usage almost doubling with each passing year, people are bound to focus on organising it. A tool like KWEST, with its semantic roots, automation and suggestions will be immensely helpful when a large data storage has to be properly catalogued, organised and accessed. Thus, we have tried to implement a research based project with its usefulness reﬂected in the problems of tomorrow.

Chapter 12

APPENDIX

A: MATHEMATICAL MODEL

The relationship between ﬁles andtags can be representedby using Settheory. Settheory is the branch of mathematics that studies sets, which are collections of objects. The following mathematical model represents the working of this ﬁlesystem.

The following dynamic and variable sets are deﬁned as, F : Set of Files

T : Set of Tags

S : Set of Tags in query ( S T )

.1 Relation between Files (F) and Tags (T)

R=f(f;t)j f hastagt; f 2F;t 2Tg

Here R deﬁnes the relation between a ﬁle f and its tag t where R F T. This relationship is many-to-many. That is a ﬁle can have many tags, and a tag can describe many ﬁles.

.2 Association between Tags (T)

Using discovered associations, we can form various relations between tags. These tag-to-tag help in displaying related information. For any two tags there exists a distinct relation between them given by r. The function Xr(A;B) returns the relation between two tags. Associations can be broadly categorized as:

• AB : A and B are not directly related, but there may exist some indirect relation between them.

• AB : A and B are directly related, where A always has a path leading to B. This relation is similar to AB.

• A./B : A and B are not directly related, but B supplements additional information related to A.

• f : This relation states that there does not exist any relation between the two tags.

.3 Operations

g(f)= ft : f Rtg

g is an operation which takes input as a ﬁle f and returns the set of tags (t 2S) related by R to that ﬁle.

h(t)= ff : f Rtg

h is an operation which takes input as a tag t and returns the set of ﬁles (f 2F ) related by R to that tag.

.4 Storing Tags and Files

The relation R is stored as a set of ordered pairs (f;t), where RFT. The operations g and h operate on these ordered pairs and return mapped or matched elements. A relation which has to be added must be represented in the form of of ordered pair (f;t). Storage of all relations is given by FT where ordered pairs exist according to R=ff 2F;t 2T j f Rtg.

For example, we have the sets and their relations as:

F =ff1; f2; f3g;T =ft1;t2;t3g;R=ff1Rt1; f2Rt2; f3Rt1; f1Rt3g

Then we store this relation by its ordered pairs given by:

R=f(f1;t1);(f2;t2);(f3;t1);(f1;t3)g

.5 Extraction of metadata

The extraction of metadata is deﬁned by the function XE which takes a ﬁle f(f 2F) and returns a set of tags (TE T) that form the relation (f Rt :t 2TE).

XE(f)= TE 22T

Addition of new information (metadata, tags) is done as:

if(t 2= T)then(T T [ftg)

We then store this relation as a ordered pair f(f;t)8t 2TEg. At the end of this operation TE T will hold true.

.6 Importing Semantics

The existing ﬁle-directory structure can be imported to the system and represented in the form of tags and ﬁles. We deﬁne:

FH : Set of Files on hard-disk which are not represented in system.

DH : Set of Directories on hard-disk which are not represented in system.

Then for every ﬁle stored within a directory d, the relation R is expressed as (f Rd). When importing semantics we create the ordered pair (f;d) given by the relation (f Rd;8d 2Dd). Where Dd contains the directory the ﬁle is stored in, as well as every parent directory of that directory itself.

Directories also contain sub-directories which we store in the form of tag relationships. We represent them as: 8(d1;d2)2DH, if d2 d1 (d2 is a sub-directory of d1) store the relation as

d1 !d2

.7 Apriori Algorithm

The apriori algorithm[10] is a classic algorithm for learning association rules. The algorithm is designed to operate on databases containing transactions. As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing individual items purchased), the algorithm attempts to ﬁnd subsets which are common to at least a minimum number C of the itemsets. Apriori uses a ”bottom up” approach, where frequent subsets are extended one item at a time (a step known as

candidategeneration),andgroupsofcandidatesaretestedagainstthedata. Thealgorithm terminates when no further successful extensions are found.

The purpose of the apriori algorithm is to ﬁnd associations between different sets of data. Each set of data has a number of items and is called a transaction. The output of Apriori is sets of rules that tell us how often items are contained in sets of data.

Itemset

A collection of one or more items. Example: A, B, C

k-itemset

An itemset that contains k items.

Support count (S)

Number of transactions containing an itemset. Example: S(A, B) = 2

Support (supp)

The support supp(X) of an itemset X is deﬁned as the proportion of transactions in the data set which contain the itemset. Suppose minsup is the minimum support threshold. Example: supp(A, B) = 2/5

Frequent Itemset (L)

Anitemsetsatisﬁesminimumsupportiftheoccurrencefrequencyoftheitemsetisgreater or equal to a threshold. If an itemset satisﬁes minimum support, then it is a frequent itemset. Thusanitemsetwhosesupportisgreaterorequaltominsupisafrequentitemset.

Conﬁdence

The conﬁdence of a rule is deﬁned as,

Conf(A!B)= supp(A!B)=supp(A) (12.1) Suppose minconf is the minimum conﬁdence threshold.

Rule Generation

Given a setoftransactions T,the goalofassociation rule mining is to ﬁndallrules having

support minsupthreshold (12.2) confidenceminconf threshold (12.3)

Given a frequent itemset L, ﬁnd all non-empty subsets f L such that f ! L f satisﬁes the minimum conﬁdence requirement.

Example: If A;B;C is a frequent itemset, then the following candidate rules are formed AB!C; AC !B; BC !A; A!BC; B!AC; C !AB

If jLj = k, then there are (2k) 2 candidate association rules (ignoring L ! Fand F!L)

Apriori principle

The princciple sttes that if an itemset is frequent, then all of its subsets must also be frequent. Apriori principle holds due to the following property of the support measure:

8X;Y :(X Y)!s(X)s(Y) (12.4)

Support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

Algorithm Input

T - Database of transactions I - Items

L - Itemset s - support

c - conﬁdence

Output

R - Association rules satisfying s and c

Algorithm to Generate Frequent Itemsets Apriori(T,s)

L1 Large1 itemset k 2

whileLk 1 =F

Ck =Generate(Lk 1) fortransactionst 2T

C Subset(Ck;t) for candidatesc2C

count[c] count[c]+1 Lk c2Ckjcount[c]s k k+1

return[Lk

Algorithm to Generate Association Rules GenRule

R=F;

for eachl 2L do

for eachx l suchthat x =Fand x =l do if(supp(l)=supp(x)) cthen

R=R [ (x !(l x));

B: TESTING OF DATA

Testing of the system will be done on the following points:

.2 Storing the relation R:

The working of the system depends on correctly storing the relation R. These tests check whether the relations are stored and represented correctly.

1. For any ﬁle f associated with tag t there should exist an ordered pair (f;t).

2. For a ﬁle f associated with tags S = ft1;t2;t3g, the operation g(f) should return exacty S.

3. For a tagt containing ﬁles F =ff1; f2; f3g, the operation h(t) should return exactly F.

.3 Relation between tags:

1. The relation r between two tags t1;t2 is given by function Xr(t1;t2)= c.

2. The relation r is distinct i.e. there exists only one relation between any two tags. If contradictions arise where more than one relation is present between two tags, then all those relations must be made void and r =f. The user can then explicitly specify which relation should be created between those two tags.

3. Extensive testing must be done on relations to determine which r needs to be set under certain conditions.

.4 Extraction of Metadata:

Let M be the set of all metadata tagst for ﬁle f. The function XE represents an algorithm to extract t from f. It returns a set TE such that ft 2TE j f Rtg and (TE M). To test the efﬁciency of the function, or the effectiveness of it, we compare the cardinality of the generated set TE with the set of Metadata M. The efﬁciency can be calculated by

Efﬁciencye= jTE j

Using e we can compare algorithms and their efﬁciency. Appropriate algorithms can be chosen for various ﬁle types so that efﬁciency of the entire system remains high. For e.g.

XE1 :fe=0:7 for audio;e=0:4 for images g XE2 :fe=0:6 for audio;e=0:6 for images g

Then we have the following options:

1. XE2 is a better choice as it yeilds a more consisten efﬁciency.

2. XE1 is used only for audio and XE2 is used only for image extractions.

.5 Queries and their results:

Queries are parsed into tokens of the form (t1;s;t2). We need to test whether s returns the correct results for the query. A Query Q(S) is said to be successfully executed when the expected result are shown.

• The Query q(t) for a single tag t should return a set of ﬁles (f 2 F ) through the operation h(t).

• In a query if no operation is given, Intersubsection should be performed.

• For Query Q(S) the operations should be performed from left to right unless precedence is speciﬁed by paranthesis.

Example: Let t1;t2;t3 be tags having the following ﬁles: t1 : fphoto1;photo2;doc1g

t2 : fphoto1;doc1;ppt1g t3 : fphoto1;doc3g

1. q(t) = F where f should follow the relation (f 2 F j f RT). For example, q(t1) returns F =fphoto1;photo2;doc1g.

2. Consider a query containing tags S=ft1;t2g. It generates results by performing the operations Q(S)= q(t1)sq(t2) where s is an operator.

a) Putting s = [ for Union in query Q(S) = q(t1)[q(t2); the result should be F =fphoto1;photo2;doc1;ppt1g.

b) Putting s = \ for Intersubsection in query Q(S) = q(t1)\q(t2); the result should be F =fphoto1;doc1g.

c) Puttings=nforSetDifferenceinqueryQ(S)=q(t1)nq(t2);theresultshould be F =fphoto2g,whereas the query Q(S)=q(t2)nq(t1) willreturn the result as F =fppt1g.

d) Putting s = for Symmetric Difference in query Q(S) = q(t1) q(t2); the result should be F =fphoto2;ppt1g.

3. Consider a query containing tags S = ft1;t2;t3g. It performs operations as : Q(S) = q(t1)s1q(t2)s2q(t3) The default order for processing is from left to right unless precedence is speciﬁed through parenthesis.

a) For the query Q(S) = q(t1)q(t2)q(t3) the result will be returned as ﬁles fphoto1g

b) For the query Q(S) = q(t1)[q(t2)\q(t3) the result will be returned as ﬁles fphoto1;doc1;doc3g

After verifying that individual queries return correct results, we must check whether Q(S) runs correctly as well. This is done by seperating the query into tokens, and calculating their results in turn. It must be veriﬁed that results are correct and have not been mis-intepreted through tokenization of the query.

.6 Forming Associations

Consider a database, D, consisting of 9 transactions. Suppose minimum support count required is 2 (i.e. minsup = 2=9 = 22% ). Let minimum conﬁdence required is minconf =70%. We have to ﬁrst ﬁnd out the frequent itemset using apriori algorithm. Then, Association rules will be generated using minsup and minconf.

TagID	Files
T100	F1, F2, F5
T101	F1, F2, F5
T102	F2, F4
T103	F2, F3
T104	F1, F2, F4
T105	F1, F3
T106	F2, F3
T107	F1, F3
T108	F1, F2, F3, F5
T109	F1, F2, F3

Step 1: Generating initial Candidate Itemset

In the ﬁrst iteration of the algorithm, each item is a member of the set of candidate.

Itemset	Support count
fF1g	6
fF2g	7
fF3g	6
fF4g	2
fF5g	2

Step 2: Generating 1-itemset Frequent Pattern

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.

Itemset	Support count
fF1g	6
fF2g	7
fF3g	6
fF4g	2
fF5g	2

Step 3: Generating 2-itemset Frequent Pattern

To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.

Itemset	Support count
fF1, F2g	4
fF1, F3g	4
fF1, F5g	2
fF2, F3g	4
fF2, F4g	2
fF2, F5g	2

Step 4: Generating 3-itemset Frequent Pattern Thegenerationofthesetofcandidate3-itemsets,C3,involvesuseoftheAprioriProperty. First, we generate C3 using L2 join L2.

C3 = ffF1, F2, F3g, fF1, F2, F5g, fF1, F3, F5g, fF2, F3, F4g, fF2, F3, F5g, fF2, F4, F5gg.

Now we will apply Apriori property to determine which candidate itemsets are frequent. The 2-item subsets of fF1, F2, F3g are fF1, F2g, fF1, F3g and fF2, F3g. Since all

2-item subsets of fF1, F2, F3g are members of L2, We will keep fF1, F2, F3g in C3. The 2-item subsets of fF2, F3, F5g are fF2, F3g, fF2, F5g and fF3, F5g. But, fF3,

F5g is not a member of L2 and hence it is violating Apriori Property. Thus we will remove fF2, F3, F5g from C3. Therefore,

C3 = ffF1, F2, F3g, fF1, F2, F5gg.

Now, the transactions in D are scanned in order to determine L3, consisting of those candidates 3-itemsets in C3 having minimum support.

Itemset	Support count
fF1, F2, F3g	2
fF1, F2, F5g	2

Step 5: Generating 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in ffF1, F2, F3, F5gg, this itemset is removed since its subset ffF2, F3, F5gg is not frequent. Thus,C4=F , and algorithm terminates, having found all of the frequent items. This completes our apriori algorithm.

Step 6: Generating Association Rules from Frequent Itemsets

For each frequent itemset ’l’, generate all nonempty subsets of l. For every nonempty subset s of l, output the rule s!(l s) if supp(l)=supp(s)minconf

We had L = ffF1g, fF2g, fF3g, fF4g, fF5g, fF1, F2g, fF1, F3g, fF1, F5g, fF2, F3g, fF2, F4g, fF2, F5g, fF1, F2, F3g, fF1, F2, F5gg.

Consider l = fF1, F2, F5g. Its all nonempty subsets are fF1, F2g, fF1, F5g, fF2, F5g, fF1g, fF2g, fF5g.

The association rules are shown below, each listed with its conﬁdence.

1. R1:F1;F2!F5 Confidence=suppfF1;F2;F5g=suppfF1;F2g=2=4=50% R1 is Rejected.

2. R2:F1;F5!F2 Confidence=suppfF1;F2;F5g=suppfF1;F5g=2=2=100% R2 is Selected.

3. R3:F2;F5!F1 Confidence=suppfF1;F2;F5g=suppfF2;F5g=2=2=100% R3 is Selected.

4. R4:F1!F2;F5 Confidence=suppfF1;F2;F5g=suppfF1g=2=6=33% R4 is Rejected.

5. R5:F2!F1;F5 Confidence=suppfF1;F2;F5g=suppfF2g=2=7=29% R5 is Rejected.

6. R6:F5!F1;F2 Confidence=suppfF1;F2;F5g=suppfF5g=2=2=100% R6 is Selected.

In this way, we have found the following three strong association rules.

1. F1;F5!F2

2. F2;F5!F1

3. F5!F1;F2

C: PAPERS PUBLISHED

Sr.No. Paper Name

[1] A. Gogte, S. Gupta, H. Pandit, R. Sharma, Using association rule learninginaSemanticﬁlesystem,InternationalConferenceonAdvanced Computer Sciences and Information Technology, Pune, Februrary 15, 2013, pp. 29-31. [Published]

[2] A. Gogte, S. Gupta, H. Pandit, R. Sharma, KWEST - A Semantically Tagged Virtual File System, International Conference on Advanced Computer Engineering and Applications, Trivandrum, 2012. [Accepted]

D: PAPERS REFERRED

Sr.No. Paper Name

[1] K. Chang, W.T. Perdana, B. Ramadhana, K. Sethuraman, T.V. Le, and N. Chachra, Knowledge File System-A principled approach to personal information management, 2010 IEEE International Conference on Data Mining Workshops, Sydney, December 2010, pp. 1037-1044.

[2] O. Eck, and D. Schaefer, A semantic ﬁle system for integrated product data management, Advanced Engineering Informatics, 2011, pp. 177-184.

[3] P. Mohan, S. Venkateswaran, Raghuraman, and A. Siromoney, Semantic FileRetrievalinFileSystemsusingVirtualDirectories,Proc.LinuxExpo Conference, Raleigh, NC, May 2007, pp. 141-151.

[4] R. Agarwal, and R. Srikant, Fast algorithms for mining association rules in large databases, 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, September 1994, pp. 487-499.

E : CONTRIBUTION OF TEAM MEMBERS

Member 1: Aseem Gogte

• Literature Survey

• Mathematical model

• Module for Extraction of metadata from audio using TagLib

• Testing related to File operations

• Source code formatting and documentation

Member 2: Sahil Gupta

• Designing database queries

• Database design

• Module for importing and exporting semantics

• Tag associations

• Integrating database with FUSE

• Data mining, association rule learning alternatives

• Module for Apriori algorithm study and implementation

Member 3: Harshvardhan Pandit

• Design program ﬂow

• Component design

• Module for interact with FUSE technology

• Module for extraction of metadata from images and videos using Libextrator

• Plugin architecture for metadata extraction

• API for developers, dynamic libraries

• Logging mechanism for for debugging code

• Debugging using GDB, valgrind

Member 4: Rohit Sharma

• Database design

• Module for managing database consistency

• Module for extraction of metadata from PDF using Poppler

• Integrating modules

• Testing related to database

APPENDIX F: GLOSSARY

Acronym SDLC FUSE GCC GPL

API GUI FAQ SFS VFS KFS

Deﬁnition

Software Development Life Cycle File system in Userspace

GNU Compiler Collection General Public License

Application programming interface Graphical user interface Frequently asked questions Semantic File System

Virtual File System Knowledge File System

Chapter 13

BIBLIOGRAPHY

[1] K. Chang, W.T. Perdana, B. Ramadhana, K. Sethuraman, T.V. Le, and N. Chachra, Knowledge File System-A principled approach to personal information management, 2010 IEEE International Conference on Data Mining Workshops, Sydney, December 2010, pp. 1037-1044.

[2] S. Gopal, Y. Yang, K. Salomatin, and J. Carbonell, Statistical Learning for File-Type Identiﬁcation, 2011 10th International Conference on Machine Learning and Applications, May 2011, pp. 68-73.

[3] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, Semantic-Aware Metadata Organization Paradigm in Next-Generation File Systems, IEEE Transactions On Parallel And Distributed Systems, Vol. 23, No. 2, February 2012, pp. 337-344.

[4] O. Eck, and D. Schaefer, A semantic ﬁle system for integrated product data management, Advanced Engineering Informatics, 2011, pp. 177-184.

[5] P. Mohan, S. Venkateswaran, Raghuraman, and A. Siromoney, Semantic File Retrieval in File Systems using Virtual Directories, Proc. Linux Expo Conference, Raleigh, NC, May 2007, pp. 141-151.

[6] R. Agarwal, T. Imielinski, and A. Swami, Mining Association Rules between Sets of Items in Large Databases, 1993 ACM SIGMOD Conference, Washington DC, USA, May 1993, pp. 207-216.

[7] S. Bloehdorn, O. Grlitz, S. Schenk, and M. Vlkel, TagFS - Tag Semantics for Hierarchical File Systems, 6th International Conference on Knowledge Management (I-KNOW 06), Graz, Austria, September 6-8, 2006.

[8] D. Gifford, P. Jouvelot, M. Sheldon, and J.W. O’Toole, Sematic File Systems. 13th ACM Symposium on Operating Systems Principles, ACM Operating Systems Review, October 1991, pp. 16-25.

[9] A. Schroder, R. Fritzsche, S. Schmidt, A. Mitschick, and K. Meißner, Semantic Extension of a Hierarchical Storage Management System for Small and Medium-sized Enterprises, 1st International Workshop on Semantic Digital Archives, 2011.

[10] R. Agarwal, and R. Srikant, Fast algorithms for mining association rules in large databases, 20th International Conference on Very Large Data Bases, VLDB, Santiago, Chile, September 1994, pp. 487-499.

[11] C. Mangold, A survey and classiﬁcation of semantic search approaches, Int. J. Metadata, Semantics and Ontology, Vol. 2, No. 1, pp.23-34.

[12] R. Freund, File Systems and Usability — the Missing Link, Cognitive Science, University of Osnabruck, July 2007.

[13] Apple Spotlight, from http://developer.apple.com/macosx/spotlight.html

[14] Google Desktop Search, from http://googledesktop.blogspot.in

[15] Tagsistant, from http://www.tagsistant.net

[16] G.Olaf (2008), Tagster, from http://www.uni-koblenz.de

[17] Homepage and documentation File system in USERspace (FUSE), from http://fuse.sourceforge.net

[18] Homepage and documentation SQLite database, from http://www.sqlite.org