Run-Length Encoded BWT

7/6/2019 COMP9319 2019T2 Assignment 2

COMP9319 2019T2 Assignment 2: RLFM Index (Run-Length BWT)

Your task in this assignment is to create a search program that implements BWT backward search, which can efficiently search an RLFM record file. The original file (before RLFM) format is:

    [<offset1>]<text1>[<offset2>]<text2>[<offset3>]<text3>... ...

where <offset1>, <offset2>, <offset3>, etc. are integer values that are used as unique record identifiers; and <text1>, <text2>, <text3>, etc. are record values (text), which include any ASCII characters with ASCII values from 32 to 126, tab (ASCII 9) and newline (ASCII 10 and 13). For simplicity, there will be no open or close square bracket in the record values.

Your C/C++ program, called rlebwt, accepts:
a command argument of either:
    -m for the number of matching substrings (count duplicates),
    -r for the number of unique matching records,
    -a for listing the identifiers of all the matching records (no duplicates and in ascending order), or
    -n for displaying the record value of a given record identifier;
the path to an RLFM encoded file (without its file extension);
the path to an index folder; and
a quoted query string (i.e., the search term) for option -m, -r, -a, or a quoted record identifier for option -n

as command-line input arguments. The search term can be up to 512 characters. To make the assignment easier, we assume that the search is case sensitive.

If -a is specified, using the given query string, rlebwt will perform backward search on the given RLFM encoded file, and output the sorted and unique identifiers (no duplicates) of all the records that contain the input query string to the standard output. Each identifier is enclosed in a pair of square brackets, one line (ending with a '\n') for each match.

If -m is specified, given a query string, rlebwt will output the total number of matching substrings (count duplicates) to the standard output. The output is the total number, with an ending newline character. Similarly, rlebwt will output the total number of unique matching records (do not count duplicates) if -r is specified.

If -n is specified, using the given record identifier, rlebwt will output the original record value (text) to the standard output with a '\n' at the end.

For any of the above options, if a match cannot be found, simply output nothing.

Although you do not need to submit a BWT encoder, it is a part of this assignment that you will implement a simple BWT encoding program based on RLFM (this will help you in understanding the lecture materials and assist in testing your assignment).

File Extensions and Formats

Sample files are provided in ~cs9319/a2/.

    wagner % pwd
    /import/kamen/1/cs9319/a2
    wagner % ls -l
    total 12088
    -rw-r--r-- 1 cs9319 cs9319  911184 Jun 27 23:12 dblp.b
    -rw-r--r-- 1 cs9319 cs9319  911184 Jun 27 23:12 dblp.bb
    -rw-r--r-- 1 cs9319 cs9319 3132682 Jun 27 23:12 dblp.s
    -r--r--r-- 1 cs9319 cs9319 7289468 Jun 27 23:12 dblp.txt
    -rw-r--r-- 1 cs9319 cs9319    3191 Jun 27 22:40 shopping.b
    -rw-r--r-- 1 cs9319 cs9319    3191 Jun 27 22:40 shopping.bb
    -rw-r--r-- 1 cs9319 cs9319   13700 Jun 27 22:40 shopping.s
    -r--r--r-- 1 cs9319 cs9319   25525 Jun 27 23:15 shopping.txt
    -rw-r--r-- 1 cs9319 cs9319       2 Jun 27 23:11 simple1.b
    -rw-r--r-- 1 cs9319 cs9319       2 Jun 27 23:11 simple1.bb
    -rw-r--r-- 1 cs9319 cs9319      11 Jun 27 23:11 simple1.s
    -r--r--r-- 1 cs9319 cs9319      15 Jun 27 23:11 simple1.txt
    -rw-r--r-- 1 cs9319 cs9319       8 Jun 27 23:11 simple2.b
    -rw-r--r-- 1 cs9319 cs9319       8 Jun 27 23:11 simple2.bb
    -rw-r--r-- 1 cs9319 cs9319      35 Jun 27 23:11 simple2.s
    -r--r--r-- 1 cs9319 cs9319      58 Jun 27 23:11 simple2.txt
    -rw-r--r-- 1 cs9319 cs9319      70 Jun 27 23:11 simple3.b
    -rw-r--r-- 1 cs9319 cs9319      70 Jun 27 23:11 simple3.bb
    -rw-r--r-- 1 cs9319 cs9319     378 Jun 27 23:11 simple3.s
    -r--r--r-- 1 cs9319 cs9319     553 Jun 27 23:11 simple3.txt
    wagner %

The file extensions represent their corresponding types:

FILENAME.txt - the original text file. It is provided for your reference only. It will not be available during auto marking.
FILENAME.s - corresponds to S in the RLFM lecture slides and its original paper. It is the BWT text with the consecutive duplicates removed.
FILENAME.b - corresponds to the bit array B in the RLFM lecture slides and its original paper. It is in binary format, which can be inspected using xxd as shown later.
FILENAME.bb - corresponds to the bit array B' in the RLFM lecture slides and its original paper. It is in binary format, which can be inspected using xxd as shown later. This file is not provided during auto marking. Your rlebwt will need to generate it.

For the B and B' arrays, after the last bit is written to the file, fill in the gap (if any) of the last byte with bit 1. Check the xxd examples below for details.

Initialization and External Files

Whenever rlebwt is executed using a given file FILENAME, for example:

    rlebwt -X FILENAME INDEX_FOLDER QUERY_STRING

where X can be any one of the options (-m, -r, -a, -n), it will take FILENAME.s and FILENAME.b as input; and also check if FILENAME.bb exists. If FILENAME.bb does not exist, it will generate one. After that, it will check if INDEX_FOLDER exists. If not, it will create it as an index folder.
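The padding rule and the relationship between S, B and B' can be checked against the sample bytes in the xxd listing of this document. The sketch below is only an illustration, not the required implementation. It assumes that B' describes the same runs as B after a stable sort of the runs by character (an assumption that reproduces the simple2.bb bytes from simple2.s and simple2.b). The helper names getBit, decodeRuns and buildBPrime are hypothetical, and everything is held in memory, which ignores the assignment's memory limit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Read bit i of a packed array, most significant bit first (the order xxd -b prints).
inline bool getBit(const std::vector<std::uint8_t>& bits, std::size_t i) {
    return ((bits[i / 8] >> (7 - i % 8)) & 1u) != 0;
}

// Recover the length of each run from B, given the number of runs (= size of S).
// A run starts at a 1 bit and extends over the 0 bits that follow it.
// The 1 bits that pad the final byte come after the last run and are ignored.
std::vector<std::size_t> decodeRuns(const std::vector<std::uint8_t>& b, std::size_t runCount) {
    std::vector<std::size_t> lengths;
    const std::size_t totalBits = b.size() * 8;
    for (std::size_t i = 0; i < totalBits; ++i) {
        if (getBit(b, i)) {
            if (lengths.size() == runCount) break;  // first padding bit reached
            lengths.push_back(1);
        } else if (!lengths.empty()) {
            ++lengths.back();
        }
    }
    return lengths;
}

// Build B' (assumption: runs stably sorted by character), padding the last byte with 1 bits.
std::vector<std::uint8_t> buildBPrime(const std::string& s, const std::vector<std::uint8_t>& b) {
    const std::vector<std::size_t> lengths = decodeRuns(b, s.size());
    if (lengths.size() != s.size()) return {};  // S and B disagree; give up in this sketch

    std::vector<std::size_t> order(s.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::stable_sort(order.begin(), order.end(), [&](std::size_t x, std::size_t y) {
        return static_cast<unsigned char>(s[x]) < static_cast<unsigned char>(s[y]);
    });

    std::vector<std::uint8_t> out;
    std::size_t bitPos = 0;
    auto putBit = [&](bool bit) {
        if (bitPos % 8 == 0) out.push_back(0);
        if (bit) out.back() = static_cast<std::uint8_t>(out.back() | (1u << (7 - bitPos % 8)));
        ++bitPos;
    };
    for (std::size_t r : order) {
        putBit(true);                                       // start of the run
        for (std::size_t k = 1; k < lengths[r]; ++k) putBit(false);  // its duplicates
    }
    while (bitPos % 8 != 0) putBit(true);  // gap filler for the last byte
    return out;
}
```

Given the 35 characters of simple2.s and the eight bytes of simple2.b, buildBPrime returns the eight bytes shown for simple2.bb (df e1 40 e5 93 ff 7c 7f). Calling decodeRuns on the two bytes of simple1.b with 11 runs gives run lengths that sum to 15, the size of simple1.txt.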
Index files will then be generated inside this index folder accordingly.

In addition to the B' array, your solution is allowed to write out up to 6 external index files that are in total no larger than the total size of the given input FILENAME.s file plus 2 x the size of the given FILENAME.b. If your index files are larger than this limit, you will receive zero points for the tests that involve that given FILENAME.

You may assume that the index folder (and its index files inside) will not be deleted during all the tests for a given FILENAME, and all the INDEX_FOLDER are uniquely and correspondingly named. Therefore, to save time, you only need to generate the index files when their folder does not exist yet.

Example

Suppose the original file (say dummy.txt) before RLFM is:

    [3]Computers in industry[25]Data compression[33]Integration[40]Big data indexing[90]1990-02-19[190]20.55

Some examples:

    %wagner> rlebwt -m ~/a2/dummy ~/a2/dummyIndex "in"
    4
    %wagner> rlebwt -r ~/a2/dummy ~/a2/dummyIndex "in"
    2
    %wagner> rlebwt -a ~/a2/dummy ~/a2/dummyIndex "in"
    [3]
    [40]
    %wagner> rlebwt -n ~/a2/dummy ~/a2/dummyIndex "3"
    Computers in industry
    %wagner>

In the above example, we assume dummy.s and dummy.b exist in the a2 folder of our home directory. rlebwt will generate dummy.bb inside a2, and will then create an index folder called dummyIndex (with the index files inside dummyIndex) inside a2 as well.

In the following example, we assume dummy.s and dummy.b exist in the XYZ folder of the account MyAccount. You will check if dummy.bb exists in ~MyAccount/XYZ/. If not, your submitted rlebwt will generate dummy.bb in ~MyAccount/XYZ/ (assume you have write permission in that folder).
You will create an index folder called dummy (with the index files inside dummy) at your current directory.

    %wagner> rlebwt -m ~MyAccount/XYZ/dummy dummy "in "
    1
    %wagner> rlebwt -r ~MyAccount/XYZ/dummy dummy "in "
    1
    %wagner>
    %wagner> rlebwt -a ~MyAccount/XYZ/dummy dummy "In"
    [33]
    %wagner>
    %wagner> rlebwt -m ~MyAccount/XYZ/dummy dummy "9"
    3
    %wagner> rlebwt -r ~MyAccount/XYZ/dummy dummy "9"
    1
    %wagner> rlebwt -a ~MyAccount/XYZ/dummy dummy "9"
    [90]
    %wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "9"
    %wagner>
    %wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "90"
    1990-02-19
    %wagner> rlebwt -n ~MyAccount/XYZ/dummy dummy "25"
    Data compression
    %wagner>

Note that it is possible that your submission may be tested with the B' files provided. For example, the RLFM encoded file path could be ~cs9319/a2/simple1 and the path to the index folder could be ~/a2/myIndex. Since simple1.bb is already there, you do not need to generate the B' file again and can just read and use it from ~cs9319/a2/. You will then generate the index folder called myIndex in your own a2 folder.

Inspecting the Binary Files

You may find the tool xxd useful to inspect the binary files that correspond to the B and B' arrays. For example, you may use xxd to inspect the provided sample files:

    wagner % pwd
    /import/kamen/1/cs9319/a2
    wagner %
    wagner % xxd -b simple1.b
    0000000: 10111111 11101001                                      ..
    wagner % xxd -b simple1.bb
    0000000: 11101011 00111111                                      .?
    wagner % cat simple1.s
    awagner %
    wagner % xxd -b simple2.b
    0000000: 10000110 10111111 11111111 11111001 00000010 00111100  .....<
    0000006: 11100110 10111111                                      ..
    wagner % xxd -b simple2.bb
    0000000: 11011111 11100001 01000000 11100101 10010011 11111111  ..@...
    0000006: 01111100 01111111                                      |.
    wagner % cat simple2.s
    [1[1endgnad1234245ndbnb]ngnabdiaiaiwagner %
    wagner %

In particular, simple1.s has 11 characters. Therefore, there will be 11 ones in the array B that correspond to the 11 characters in simple1.s. Since all the zeros represent the duplicates of these 11 characters, you can observe that the last bit "1" is just a gap filler. This also means the original text file will contain 15 characters.

Compiling Your Submission

We will use the make command below to compile your solution. Please provide a makefile and ensure that the code you submit can be compiled on a CSE Linux machine, e.g., wagner. Solutions that have compilation errors will receive zero points for the entire assignment.

    make

Your solution should not write out any external files other than the B' file and the index folder with a maximum of six files inside. Any solution that writes out external files other than these files will receive zero points for the entire assignment.

Performance

Your solution will be marked based on space and runtime performance. Your solution will not be tested against any RLFM encoded files with their original files that are larger than 160MB.

Runtime memory is assumed to be always less than 16MB. Runtime memory consumption will be measured by valgrind massif with the option --pages-as-heap=yes, i.e., all the memory used by your program will be measured. Any solution that violates this memory requirement will receive zero points for that query test. To help you get started, your tutor will provide an overview on valgrind and makefile during the tutorial in week 5 or week 6.

Any solution that runs for more than 90 seconds on a machine with similar specification as wagner for the first query on a given RLFM file will be killed, and will receive zero points for the queries for that RLFM file. After that, any solution that runs for more than 20 seconds for any one of the subsequent queries on that RLFM file will be killed, and will receive zero points for that query test. We will use the time command and count both the user and system time as runtime measurement.

Documentation and Code Inspection

Your source code will be inspected. Marks may be deducted if your code is very poor on readability and ease of understanding; your code does not follow the requirements of this specification document; or your code is written deliberately to avoid being accurately measured by valgrind.

Assumptions/clarifications

To avoid large runtime memory for sorting, none of the test cases for marking will result in more than 5,000 record matches.

The input filename is a path to the given RLFM encoded file (without its extension .s and .b). Please open these files as read-only in case you do not have the write permission for these files.

Marks will be deducted for output of any extra text, other than the required, correct answers (in the right order). This extra information includes (but is not limited to) debugging messages, line numbers and so on.

You can assume that the input query string will not be an empty string (i.e., ""). Furthermore, except with the command argument -n, search terms containing only numbers shall not match any record identifiers. Finally, search terms containing any square bracket will not be tested.

You may assume that offset >= 0 and will fit in an unsigned int.

When counting the number of substring matches (i.e., with -m option), to make it easier for backward search matching, all combinations of matches should be counted.
E.g., there are 2 matches of "aa" on the record value "aaa"; 2 matches of "ana" on "banana".

You are allowed to use up to 6 external index files to enhance the performance of your solution. However, if you believe that your solution is fast enough without using index files, you do not have to generate these files. Even in such a case, your solution should still accept a path to an index folder as one of the input arguments as specified.

A record will not be unreasonably long, e.g., you will not see a line that is 5,000+ chars long.

Empty records may exist in the original files (before RLFM). However, these records will never be matched during searching because the empty string will not be used as a search term when testing your program.

Marking

This assignment is worth 100 points. Below is an indicative marking proportion for large vs small files. You may want to ensure your assignment works properly for small files before enhancing it for larger files.

    Test Files         Points
    TXT size <= 8MB    60
    TXT size > 8MB     40

Bonus

Bonus marks (up to 10 points) will be awarded for the solution that achieves 100 points and runs the fastest overall (i.e., the shortest total time to finish all the tests). Note: regardless of the bonus marks you receive in this assignment, the maximum final mark for the subject is capped at 100.

Submission

Deadline: Monday 5th August 12:00 (noon). Late submissions will have marks deducted from the maximum achievable mark at the rate of roughly 1% of the total mark per hour that they are late (i.e., 24% per day), and no submissions will be accepted after 3 days late.

Use the give command below to submit the assignment or submit via WebCMS3:

    give cs9319 a2 makefile *.h *.c *.cpp

Please use "classrun" to check your submission to make sure that you have submitted all the necessary files.

Plagiarism

The work you submit must be your own work.
Submission of work partially or completely derived from any other person or jointly written with any other person is not permitted. The penalties for such an offence may include negative marks, automatic failure of the course and possibly other academic discipline. Assignment submissions will be examined both automatically and manually for such submissions.

Relevant scholarship authorities will be informed if students holding scholarships are involved in an incident of plagiarism or other misconduct.

Do not provide or show your assignment work to any other person, apart from the teaching staff of this subject. If you knowingly provide or show your assignment work to another person for any reason, and work derived from it is submitted, you may be penalized, even if the work was submitted without your knowledge or consent. This may apply even if your work is submitted by a third party unknown to you.
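
As a closing illustration of the -m counting rule in the assumptions list: counting every overlapping occurrence is what backward search yields naturally, because each row in the final range corresponds to one starting position of the pattern. The following is a minimal sketch over a plain, uncompressed BWT, not the RLFM structure the assignment requires. It assumes a hypothetical sentinel character 0x01 appended to the text (not part of the assignment's file format), builds the BWT by naive rotation sorting, and answers occ by a linear scan; a real solution would answer occ from S, B and B' plus index files. The function names naiveBwt and countMatches are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive BWT by sorting all rotations; only suitable for tiny demo strings.
std::string naiveBwt(const std::string& text) {
    const std::string t = text + '\x01';  // hypothetical sentinel, smaller than any record character
    const std::size_t n = t.size();
    std::vector<std::size_t> rot(n);
    std::iota(rot.begin(), rot.end(), std::size_t{0});
    std::sort(rot.begin(), rot.end(), [&](std::size_t a, std::size_t b) {
        for (std::size_t k = 0; k < n; ++k) {
            const unsigned char ca = static_cast<unsigned char>(t[(a + k) % n]);
            const unsigned char cb = static_cast<unsigned char>(t[(b + k) % n]);
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    std::string bwt(n, ' ');
    for (std::size_t i = 0; i < n; ++i) bwt[i] = t[(rot[i] + n - 1) % n];
    return bwt;
}

// Count occurrences of pattern (overlaps included) with backward search.
std::size_t countMatches(const std::string& bwt, const std::string& pattern) {
    // C[c] = number of symbols in the text that are smaller than c.
    std::array<std::size_t, 257> C{};
    for (unsigned char ch : bwt) ++C[static_cast<std::size_t>(ch) + 1];
    for (std::size_t c = 1; c <= 256; ++c) C[c] += C[c - 1];

    // occ(c, i) = number of occurrences of c in bwt[0, i); linear scan for clarity.
    auto occ = [&](unsigned char c, std::size_t i) {
        return static_cast<std::size_t>(std::count(
            bwt.begin(), bwt.begin() + static_cast<std::ptrdiff_t>(i), static_cast<char>(c)));
    };

    std::size_t lo = 0, hi = bwt.size();  // half-open range of matching rows
    for (auto it = pattern.rbegin(); it != pattern.rend() && lo < hi; ++it) {
        const unsigned char c = static_cast<unsigned char>(*it);
        lo = C[c] + occ(c, lo);
        hi = C[c] + occ(c, hi);
    }
    return hi > lo ? hi - lo : 0;
}
```

With these helpers, countMatches(naiveBwt("banana"), "ana") returns 2 and countMatches(naiveBwt("aaa"), "aa") returns 2, matching the two cases quoted in the assumptions list.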
