Crowd-Powered Systems

Spring 2017 :: ECE 69500 :: Purdue University

Warm-ups

These assignments will help you learn enough web development to create your project, while also ensuring that every group member knows a common set of tools.

Warmup #0: self-introduction in HTML

Due Sun 1/15

Write a brief self-introduction.

The body of your HTML should contain just one <div> tag, which should be 500 pixels wide and 250 pixels high. This will allow us to have a single page with everyone's self-introduction. Here is a template to start with.

<!DOCTYPE html>
  <html>
  <head>
    <meta charset="utf-8">
    <title>__YOUR_NAME_HERE__</title>
    <style type="text/css">
      /* CSS style for the <div> immediately within the <body> */
      body > div {
        width:  500px;
        height: 250px;
        border: solid black 1px;
      }

      /* CSS style to remove any margin/padding/border from the <body> so these fit nicely. */
      body {
        margin:  0px;
        padding: 0px;
        border:  0px;
      }
    </style>
  </head>
  <body>
    <div id="content">
      __YOUR_SELF_INTRODUCTION_GOES_HERE__
    </div>
  </body>
  </html>
  

To turn in, send email to aq@purdue.edu with subject line warmup 0 submission [ece695cps] and attach a ZIP file with your files. If you would like to host it yourself, include a publicly accessible URL, and I will embed that in the big page of everyone's self-introductions.

By default, these will be publicly accessible from a page on this course web site. If you prefer that yours be kept private, just let me know.

See all self-introductions

Warmup #1: post a HIT via AMT web site

Due Sun 1/22

Think of an idea for a task, post it to AMT, and keep track of the time and cost.

Think of a small, real task that you could ask workers to do. It should be something specific to you and potentially useful.  We want to avoid posting tasks that are completely useless. That said, it's okay if you only post a small part of the overall job (i.e., a few tasks).  Think of this as a practice run.

Here are some ideas to get you thinking: (Some of these are easier than others.)
  • translate something into another language (hint: you can filter workers by country)
  • do a mini-experiment (example)
  • find products on a shopping site that meet your criteria
  • suggest ideas (e.g., project name, solution to a problem)
  • tag some photos
  • transcribe some handwritten text
  • transcribe some audio
  • guess the answer to a hard question
  • enter menu items from a restaurant(s)
  • check several possible travel itineraries for an upcoming trip
  • reformat references for a bibliography
  • look up nutrition information
  • evaluate several choices and pick the best (e.g., which photo is prettiest)
  • find references for a paper you are writing
  • draw something (example)

Once you have decided on the task and what kind of information you want workers to provide, write on a piece of paper how long you expect the task to take them, in minutes. Although this assignment will be turned in by email (so the instructor won't see the paper), you will be comparing your time estimate with the actual time it takes workers. You should aim for a task that can reasonably be completed in 3-10 minutes.

The amount of time a task takes will determine the amount of money you offer workers to do it. “Fair pay” is hard to define for a distributed marketplace, but if you look around the existing HITs on Mechanical Turk, it's easy to find instances of clearly unfair pay. For this class, we will define “fair pay” as $9 per hour, averaged over all workers who do the task properly. By learning to estimate the time it takes to do your task, you can ensure that your noble intentions translate into noble outcomes.
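
As a rough illustration of the arithmetic, the reward for one HIT follows from your time estimate and the $9 per hour target. The function name and the choice to round up to the next whole cent are assumptions made for this sketch, not part of the assignment.

```python
import math

FAIR_HOURLY_CENTS = 900  # $9 per hour, as defined for this class


def reward_for_minutes(minutes, hourly_cents=FAIR_HOURLY_CENTS):
    """Return the reward in dollars for a task expected to take `minutes`."""
    # Work in cents and round up so the worker is never paid below the target.
    cents = math.ceil(minutes * hourly_cents / 60)
    return cents / 100


print(reward_for_minutes(3))   # 0.45
print(reward_for_minutes(10))  # 1.5
```

If workers turn out to need longer than you estimated, the effective hourly rate drops below the target, which is why the estimate is worth checking against the actual times.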

Please read these Guidelines for Academic Requesters. Everyone is expected to follow them, and the rules below. (Some of these will make sense once you get into the assignment.)

These are designed to avoid some potential problems. Exceptions are possible. See the instructor.

Post your task on Mechanical Turk using the web interface.  Keep track of the time you spend creating the interface.  The Requester Best Practices Guide may also be helpful.  Aim to pay $9 per hour.  Always include a feedback box in the task so that workers can tell you if something was unclear. Label it as "optional".  You can be reimbursed up to $5.00 for this assignment.  After receiving the results, check their accuracy yourself (if it makes sense for your task).

Send me an email with the following:
  1. What did you ask the workers to do?
  2. Give a screenshot of your task interface.
  3. Do you trust the results? Why?
  4. How long did it take to create the user interface?
  5. How long did it take to get your results (from when you submitted the HITs)?
  6. How long did it take before the first HIT was accepted?
  7. What was your original time estimate for your task that you wrote on the paper?
  8. Give a table with the assignment ID, worker ID, accept time, and submit time of each assignment.
  9. What was the average/min/max time per assignment?
  10. Give the amount of time spent by each worker.
  11. What was your total cost?
  12. What was the hourly rate on each assignment? … for each worker? … overall?
  13. How long would it have taken you to do the same work yourself?
  14. Were there any tasks that you thought of using, but were impractical? Describe.
  15. Attach a ZIP file containing the HTML template you used and the input file.
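
Several of the questions above (time per assignment, time per worker, hourly rates) can be answered from the accept and submit times alone. The following is a minimal sketch, assuming you have copied those times into a list by hand. The IDs, timestamps, timestamp format, and reward are all made up for illustration, and the cost shown excludes any AMT fees.

```python
from datetime import datetime

# Hypothetical rows: (assignment ID, worker ID, accept time, submit time).
ROWS = [
    ("ASSIGNMENT_1", "WORKER_A", "2017-01-20 14:00:00", "2017-01-20 14:04:00"),
    ("ASSIGNMENT_2", "WORKER_B", "2017-01-20 14:10:00", "2017-01-20 14:16:00"),
    ("ASSIGNMENT_3", "WORKER_A", "2017-01-20 14:30:00", "2017-01-20 14:35:00"),
]
REWARD = 0.75  # dollars per assignment (hypothetical)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format; adjust to match your data


def seconds_spent(accept_time, submit_time):
    """Return the number of seconds between accepting and submitting."""
    start = datetime.strptime(accept_time, TIME_FORMAT)
    end = datetime.strptime(submit_time, TIME_FORMAT)
    return (end - start).total_seconds()


def hourly_rate(dollars, seconds):
    """Convert a payment for `seconds` of work into dollars per hour."""
    return dollars * 3600 / seconds


durations = [seconds_spent(accept, submit) for _, _, accept, submit in ROWS]

# Total time per worker (a worker may have done several assignments).
seconds_per_worker = {}
for _, worker_id, accept, submit in ROWS:
    spent = seconds_spent(accept, submit)
    seconds_per_worker[worker_id] = seconds_per_worker.get(worker_id, 0) + spent

summary = {
    "average": sum(durations) / len(durations),
    "min": min(durations),
    "max": max(durations),
}
overall_rate = hourly_rate(REWARD * len(ROWS), sum(durations))

print(summary)
print(seconds_per_worker)
print(overall_rate)
```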

For this assignment, you do not need to submit a URL.

You have two options for paying for HITs:
  1. Pay for them using your personal credit card and request reimbursement. See the instructor for details.
  2. Instructor will put credit on your account using his personal credit card and request reimbursement on your behalf.

Note: In general, IRB approval is needed for any “human subjects research”. That would include anything you might eventually include in published research results. Approval is not needed for an exercise such as this, as long as the sole purpose is your training. See the Purdue IRB web site—especially their Determination of Human Subjects Research Worksheet—for more information about this.

Warmup #2: design and implement task UI

Due Wed 2/1
In this assignment, you will create a more feature-rich task interface and serve it from a web application. At this step, the focus is on the design and implementation, especially on the client side (web browser). You will be scratching the surface on several of the key technologies you need to know. (In warm-up #3, you will go deeper on some of them.)

Learning goals
Steps
  1. Choose a task. Feel free to re-use the problem you chose for warm-up #1, pick from the examples in that assignment, or choose something of your own. However, it must meet the requirements below.
  2. Design and implement your task UI. You will probably want to sketch your idea on paper before you start implementing. Then, using HTML and CSS, create a UI form. You must incorporate at least three of the following in a way that makes sense: The jQuery library is recommended, but you are free to use something else lightweight or even build from scratch. Heavyweight client libraries are not recommended.
  3. Create a very simple Python+Flask web application to serve your task UI. This shouldn't require more than about 10-15 source lines of Python code (excluding comments/whitespace), and most of that will be boilerplate which you can find in the Flask tutorial. For now, your application should serve a random instance of the task each time it is loaded. For example, if you were having workers tag photographs, your application would insert a random image URL into your task template.
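
The "random instance" idea in step 3 can be sketched without any web framework. This is only an illustration of the selection step, using the standard library. In your actual application, Flask and its template engine would take the place of string.Template. The image URLs and the template text are hypothetical.

```python
import html
import random
from string import Template

# Hypothetical task inputs; yours would be your own image URLs.
IMAGE_URLS = [
    "https://example.com/photos/1.jpg",
    "https://example.com/photos/2.jpg",
    "https://example.com/photos/3.jpg",
]

# Stand-in for a task template; a real one would be a full HTML page.
TASK_TEMPLATE = Template('<img src="$image_url" alt="Photo to tag">')


def render_random_task(rng=random):
    """Pick one input at random and insert it into the task template."""
    image_url = rng.choice(IMAGE_URLS)
    # Escape the value so it is safe inside an HTML attribute.
    return TASK_TEMPLATE.substitute(image_url=html.escape(image_url, quote=True))


print(render_random_task())
```
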
Requirements – summary of the above
Non-requirements – you do not need to do any of these:
  • post on AMT
  • post task UI publicly (yet)
  • capture results (yet)
  • use database (yet)

New to Python and/or JavaScript?… Now is a great time to start learning. Fortunately, both are much easier to learn than C, C++, C#, or Java, so if you are confident with one of those, you should be fine. To help you get started quickly, I recommend starting with a short tutorial that covers all of the main aspects of the language. The resources page has a short tutorial for each of these languages. I strongly recommend that you type each example (not copy-paste) and test that it works. Feel free to send me questions (even easy ones) by email.

Warmup #3: implement server backend + add UI tracking

Due Tue 2/14

You are now ready to start collecting results. We will use an SQLite database to collect the results. In addition to the user's inputs, you will also capture some basic information about the worker's interaction with your interface.

  1. Extend your interface from warm-up #2 with a submit button. Your form should use the POST method.
  2. Add a new route to accept the data. The new route should have the same URL as your task UI, but should accept its input as POST. The data should be stored in the database, along with the IP address, host name, and submit time.
  3. Extend your task route so that it passes the time the page was sent from the server (UTC, according to the server).
  4. Extend your interface further so that it tracks the page load time and clicks (see below) in memory. These should be stored as JSON in an <input type="hidden" value="…"> element.
  5. Just before the form is submitted, the task UI should store the submit time in an <input type="hidden" …> element so that it can be passed to the server, as well.
  6. Extend the server a bit more so that it stores the tracking data. To keep things simple, you may store the tracking data as JSON (as is) in the database, if you like.
  7. Add one more route to your web application for a results page which displays all of the results received so far.
Requirements – summary of the above
Non-requirements – you do not need to do any of these:
  • display tracking data in results page
  • post on AMT
  • post task UI publicly (yet)

Structure – Follow the structure from the “favorite_numbers” example (shared 2/8), unless you have something else you like much better. (This was announced by email 2/8.) The point is that you should follow a deliberate, purposeful structure for your code. Don't just dump the code in any way that it will work.

Submission – To submit, create a .tar.xz archive in /d/g/695/YourUserName on the server and send me an email with just the SHA1 hash of the file (no attachment). From within your code directory, enter the following commands:

tar cJfv /d/g/695/$USER/warmup3.tar.xz *
sha1sum /d/g/695/$USER/warmup3.tar.xz

Copy-paste the output into an email with subject line “warmup 3 submission [ece695cps]”.

Hopefully, this will sidestep the issues some people have experienced with .js files being stripped by email virus scanners. If you have any trouble with that, feel free to send by email like before.

Q & A
What's the difference between GET and POST?
There are two key differences: (a) With GET, parameters are included in the URL, whereas POST requests pass their parameters in the body of the HTTP request. (b) GET requests are expected to have no side effects, whereas POST requests may be used for submitting orders, creating accounts, and so forth. This article explains this further.
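
Difference (a) can be seen without sending anything over the network. The sketch below builds one request of each kind with Python 3's standard library and prints where the parameters ended up. The URL and parameter names are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({"color": "blue", "count": 3})

# GET: parameters travel in the URL's query string; there is no body.
get_request = Request("https://example.com/task?" + params)

# POST: parameters travel in the request body (as bytes); the URL stays clean.
post_request = Request("https://example.com/task", data=params.encode("ascii"))

# Nothing is sent here; we only inspect the request objects.
print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```
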
How do I specify that a function should be used for GET or POST (but not both)?
In your @app.route('…') decorator, add a methods=[…] keyword argument with a list of the methods to be accepted. This article gives example code.
How do I pass data from the server (Python) to the browser (JavaScript)?
One way is to use your Flask/Jinja template to insert the data as JSON into a <script> tag. This post has some example code. As an alternative for data that you simply want to pass back to the server when the form is submitted, you could skip the <script> tag and just insert it into the value="…" attribute of an <input type="hidden" name="…" value="…"> element.
How do I pass data from the browser (JavaScript) to the server (Python)?
Convert the data to JSON using JSON.stringify(…) and store it in the value="…" attribute of an <input type="hidden" name="…" value="…"> element. For example, if your data was called tracking_data and your hidden input had name="tracking_data_input" (i.e., <input type="hidden" name="tracking_data_input" value="">), you could store the data with jQuery("input[name=tracking_data_input]").val(JSON.stringify(tracking_data));. Then, in your Python+Flask code, you would use either request.args (for GET requests) or request.form (for POST requests) to get the JSON text, and then json.loads(…) to convert it to a Python object.
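
The server-side half of this round trip can be sketched with the standard library alone. Here urllib.parse.parse_qs stands in for what request.form gives you in Flask, and the shape of the tracking data is an assumption made for illustration. Only the field name tracking_data_input comes from the example above.

```python
import json
from urllib.parse import parse_qs, urlencode

# Hypothetical tracking data, as the browser might have collected it.
tracking_data = [{"type": "click", "x": 10, "y": 20}]

# What the browser sends: a form-encoded body whose hidden field holds JSON text.
body = urlencode({"tracking_data_input": json.dumps(tracking_data)})

# What the server does: read the field as text, then decode the JSON.
form = parse_qs(body)  # stand-in for request.form in Flask
tracking_json = form["tracking_data_input"][0]
decoded = json.loads(tracking_json)

print(decoded)
```
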
How do I detect when the user has clicked the mouse (in JavaScript)?
Create an event handler using the jQuery .on(…) method. For example, to capture every click to a button, you would use $("input[type=button]").on("click", function(evt) { /* do stuff */ });
You will set that up inside the existing jQuery(document).ready(…) handler. Remember that at the time your JS first runs, the document doesn't exist yet, so you can't refer to document.body until the page has loaded. In your task HTML template, you will have a hidden input (e.g., <input type="hidden" id="tracking" name="tracking" value="[]">). On each click event, you just keep adding to that.
What's with the “$” used for jQuery?
It is actually an alias for a global object called “jQuery”. If you prefer to make your code more explicit, you can substitute “jQuery” everywhere you see “$”. For example, $(document.body).on("click", …) becomes jQuery(document.body).on("click", …).
Why would I need to worry about passing data between the browser and server?
For this warm-up, you will need to pass the page sent time to the browser, so that it can pass it back to the server when the form is submitted. Also, the tracking data will need to be passed from the browser to the server when the form is submitted.
How do I get the user agent (browser type)?
Use request.user_agent.string from your Python code. This is oddly hard to find in the Flask/Werkzeug documentation but very easy to find by searching Google for [flask user agent].
How do I get the user's IP address?
Google [flask ip address] and click the first result. See the code example that was shared 2/8. You don't want to use flask.request.remote_addr because that would return 127.0.0.1 when running on the server.
How do I use an SQLite database with Flask?
This article gives you all of the building blocks you need. You are welcome to use the db.py module included in the “favorite_numbers” example that was shared 2/8.
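
If you would rather see the bare building blocks first, here is a minimal sketch using only Python 3's built-in sqlite3 module. It is not the db.py module from the “favorite_numbers” example. The table name, column names, and function names are assumptions made for illustration, and the tracking data is stored as JSON text (as is).

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) result table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """create table if not exists result (
               id integer primary key,
               answer text not null,
               ip_address text,
               tracking_json text,
               submit_time_utc text not null)"""
    )
    return conn


def save_result(conn, answer, ip_address, tracking_json):
    """Store one submission, stamping it with the server's UTC time."""
    submit_time = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "insert into result (answer, ip_address, tracking_json, submit_time_utc)"
            " values (?, ?, ?, ?)",  # placeholders prevent SQL injection
            (answer, ip_address, tracking_json, submit_time),
        )


def load_results(conn):
    """Return all submissions so far, oldest first, for a results page."""
    rows = conn.execute(
        "select answer, ip_address, tracking_json, submit_time_utc"
        " from result order by id"
    ).fetchall()
    return [
        {"answer": a, "ip_address": ip, "tracking": json.loads(t), "submit_time_utc": s}
        for a, ip, t, s in rows
    ]
```

In a web application, you would pass a file path instead of ":memory:" so that the results survive a restart.
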
Should tracking be done on the results page?
No. You will collect data about how workers use your task page, and store the data in the database at the time they submit their work. Your results page will likely have just one user (you). No tracking is needed for that.
The purpose of the tracking is to preserve information that might help you troubleshoot any problems that may come up, either with usability, browser-specific issues, or suspected fraud (e.g., random clicking, bots, etc.). The code you write here will likely be useful in your project. You should store IP, host name, and browser information for every project you do. Click tracking is also highly recommended. You can do all of this in around 20 lines of JavaScript.
If you like, you can display the data at the bottom of the results page (e.g., as a blob of JSON), but that is not required. Just make sure you check the data somehow, to be sure that it is accurate.

Note: The Q&A section may continue to be updated as questions come up.

Warmup #4: complete crowd-powered system

Due Wed 4/5

In this final warm-up, you will integrate your previous warm-ups into AMT and improve the overall quality of the whole thing.

  1. Improve the UI based on peer and instructor feedback.
  2. Improve the code to meet basic security and quality standards.
  3. Enhance the results page to update in real time using server push.
  4. Adapt to be able to run on Mechanical Turk.

The result will integrate all of the technical and design skills you are expected to know by the end of this class.

Additional details or hints may be added, as they come up, in green text.

1. Improve the UI based on peer and instructor feedback.

Fix any known issues. Make sure your task UI meets the requirements given in the previous warm-ups. Then, start your application on the server using the nohup command, as described in the tips (below). After testing it yourself, send a link to the two people listed after your user name below by Mon 3/20 @ 11:59 PM with CC to the instructor and subject line warmup 4 feedback request [ece695cps].

abaigele ⇒ {charlesj, tseng24} charlesj ⇒ {tseng24, kulkar18} tseng24 ⇒ {kulkar18, aalshai} kulkar18 ⇒ {aalshai, djampani}
aalshai ⇒ {djampani, dlemus} djampani ⇒ {dlemus, mzaim} dlemus ⇒ {mzaim, carrells} mzaim ⇒ {carrells, zhan1486}
carrells ⇒ {zhan1486, liu1274} zhan1486 ⇒ {liu1274, vmanam} liu1274 ⇒ {vmanam, abaigele} vmanam ⇒ {abaigele, charlesj}
Each person should receive two feedback requests. Perform the tasks as a worker, before putting on your design hat. Send ≥4 comments by Wed 3/22 @ 11:59 PM to your classmate, with CC to the instructor and subject line warmup 4 response [ece695cps].
If the above timeline is too tight, you may negotiate an alternative with your feedback partners, as long as it is convenient for all.

Tips

  • To keep your application running after you log off, first create a log directory:
    $ mkdir -m700 log # create a log directory if it doesn't exist already
    Then, use nohup to start it in the background. The output will go into log/nohup.out .
    $ nohup python main.py >> log/nohup.out &
  • To monitor the output of your application, when running with nohup:
    $ tail -f log/nohup.out
  • To kill your application, first get its process ID:
    $ ps uxw
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    aq       16263  0.0  0.0 145044  3072 ?        S    19:28   0:00 sshd: aq@pts/2
    aq       20923  0.0  0.0 151060  1848 pts/2    R+   21:27   0:00 ps uxw
    aq      144851  0.0  0.0 231332 17796 ?        S    Mar02   0:00 python main.py
    Then, kill it:
    $ kill 144851

2. Improve the code to meet basic security and quality standards.

Make sure your code meets the following standards:

  1. None of the OWASP top 10 vulnerabilities are present.
  2. Code is readable and has no dead code (e.g., functions never called).
  3. Generated pages should pass W3C validation with no errors or warnings.
  4. At most one JavaScript global variable (i.e., attached to the window object) should be created by your application.
  5. Task inputs (e.g., image filenames, etc.) should be stored in a database or text file—not in your task template or code.
  6. Follow the task design guidelines given in the readings (and summarized here).
  7. Use whitespace to group related form components. (This brief article summarizes most of what was covered in class on this.)
  8. Set the stage for good relations with workers:
    • Include a prominent contact link (e.g., “Questions?: yourusername@purdue.edu”) near the instructions or submit button.
    • Include a feedback box at the bottom, labelled as "optional".
    • Include your name, affiliation, and purpose (e.g., “This HIT was posted by Your Name Here for a class assignment at Purdue University taught by Prof. Alex Quinn.”) in slightly smaller type below the submit button.

Tips

  • To select which task inputs to present without creating an insecure direct object reference, use a pre-assigned batch code in your URL. For example, if your task were to categorize words, then you might assign a random 6-letter code to each batch of 4 inputs, like this:
    insert into batch (code, words) values ('flremf', '["zip", "run", "wit", "cod"]');
    insert into batch (code, words) values ('lteaag', '["mow", "rip", "raw", "cap"]');
    …
    Then, you would post a HIT with that code in the URL, like this (using CrowdLib to post to AMT):
    hit_type.create_hit("https://crowd.ecn.purdue.edu/99/?batch_code=flremf")
    hit_type.create_hit("https://crowd.ecn.purdue.edu/99/?batch_code=lteaag")
    …
    A worker who accepts the HIT would get one of those URLs (in an IFRAME). Your application would then look up the inputs associated with that batch_code and pass them to the task template, like this:
    @app.route('/task/')
    def task_page():
        batch_code = flask.request.args['batch_code']
        words_json = db.query('select words from batch where code=?;', batch_code)[0]
        return flask.render_template('task.html', words=json.loads(words_json))
  • To pass the server render time through the browser safely, there are three main approaches:
    • Use the aes module to encrypt it. It is already installed and easy to use.
      >>> key = aes.generateRandomKey(16).encode('base64').rstrip()  # create new key; store it with your code
      >>> print key
      n69w9SaqLxtL7EpYINyMZw==
      >>> c = aes.encryptData(key, unicode("SECRET")).encode('base64').rstrip()  # encrypt
      >>> print c
      LE1dFT1kh3R4UXE/gbQ0dYSjWD0eRikXhKZelPI8SQ0=
      >>> aes.decryptData(key, c.decode('base64'))  # decrypt
      'SECRET'
    • Use the hmac or hashlib modules to pass a signature along with it. Example:
      >>> signature = hashlib.sha1(str((key, "MESSAGE"))).hexdigest()  # calculate signature
      >>> print signature
      4e015ff4eef849d275f573b153df9df86a1b6b02
      >>> if signature != hashlib.sha1(str((key, "MESSAGE"))).hexdigest():  # check signature
      ...     raise DataIntegrityError("signature did not match")  # exception class could be defined by you
    • Every time the task page is loaded, save a new form_presentation_code (random characters) in the database, along with the render time, and anything else you would like to save.
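
For the signature approach, the hmac module can do the keyed hashing for you. The sketch below is a variation on the hashlib example above, not a replacement required by the assignment. It is written for Python 3, and the function names, the key handling, and the sample render time are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

# Hypothetical key. Here it is regenerated on every run; in a real application
# you would create it once and store it with your code, as in the aes example.
SECRET_KEY = secrets.token_bytes(32)


def sign(message, key=SECRET_KEY):
    """Return a hex signature for `message` that only the key holder can produce."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(message, signature, key=SECRET_KEY):
    """Return True if `signature` matches `message` under `key`."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign(message, key), signature)


render_time = "2017-03-20T18:30:00Z"  # hypothetical server render time
signature = sign(render_time)         # send both to the browser in hidden inputs

print(verify(render_time, signature))             # True
print(verify("2017-03-20T18:00:00Z", signature))  # False
```

When the form comes back, the server recomputes the signature from the submitted render time and rejects the submission if the two do not match.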

3. Enhance the results page to update in real time using server push.

Enhance your results page so that it displays the following, in real-time:

  1. List of workers who have the page open currently, including IP address, host name, hit ID (if known), worker ID (if known), and assignment ID (if known).
  2. Summary of the results received so far. A very simple table is fine.
Additional requirements:
  1. Use Flask-SocketIO to push updates to the results page and to track which workers have which task pages open.
  2. Track workers who currently have the task open using Flask-SocketIO.
  3. If you use any example code from the web, it must be clearly acknowledged with a comment ("/* Credit: <author>, <scope>, <license> */"), and you must clean it up so that it fits seamlessly into your code with any unused parts removed.
  4. Do not use disk (e.g., files, database, etc.) to pass messages between threads/requests.
  5. No periodic page refreshing, short polling, or busy waiting.

Tips

  • You can find a working example of Flask-SocketIO in /d/g/all/examples/flask.socketio.full with the minor changes needed to make this work on our server.
  • We will cover the necessary code on Thu 3/23 and Tue 3/28.

4. Adapt to be able to run on Mechanical Turk.

Your application will be embedded in an IFRAME by AMT. This is known as an ExternalQuestion in the terminology of the AMT API. There are other methods, but this offers the most flexibility. They will append the assignment ID, hit ID, and worker ID to the URL. When the worker presses Submit, the results must be passed back to AMT via a URL that they will specify, along with the assignment ID, which will be appended to your URL.

To access AMT via Python, we will use CrowdLib. You should read the documentation sections on concepts and settings carefully, and at least skim the sections on examples and the API. We will discuss this in class on Thu 3/23 and/or Tue 3/28.

  1. When a worker previews your task, assignmentId will be set to ASSIGNMENT_ID_NOT_AVAILABLE, like this:
    https://crowd.ecn.purdue.edu/99/task/?batch_code=flremf&assignmentId=ASSIGNMENT_ID_NOT_AVAILABLE&hitId=ABC123
    Disable all form controls (using JavaScript) and display a prominent notice at the top reminding them they are in preview mode. Do not allow submission of the form. The URL parameters will also include hitId (HIT ID) which you can ignore.
  2. When a worker accepts your task, assignmentId will be set to an assignment ID, like this:
    https://crowd.ecn.purdue.edu/99/task/?batch_code=flremf&assignmentId=A1ZXYS9876&hitId=ABC123&workerId=WXY789&turkSubmitTo=https://www.mturk.com
    You will also have a parameter called turkSubmitTo, which is the URL to which your form must direct the result when the worker clicks Submit. Insert the assignment ID into an <input type="hidden" name="assignmentId" value="…"> element and set your form.action to turkSubmitTo + "/mturk/externalSubmit". The URL parameters will also include hitId (HIT ID) and workerId (worker ID), which you can ignore.
  3. To enable the real-time display of new results, use an event handler to send the results to your application first via AJAX, and then send them to AMT. (See the cautionary note in the tips.)
  4. When you are testing your application, you will not have assignmentId, hitId, workerId, or turkSubmitTo. Your application should still work. Have it POST the results back to the same URL (e.g., /task) and display them as JSON.
  5. Your application must run from crowd.ecn.purdue.edu, and should work when called by AMT or directly (for testing).
  6. Create a simple control panel that lists all assignments received so far, and can (a) post a batch of HITs, (b) cancel all HITs, and (c) approve/reject assignments.

Tips

  • We will cover the necessary code on Thu 3/23 and Tue 3/28.
  • See the AWS documentation for ExternalQuestion for more information.
  • When the worker is previewing a HIT, assignmentId is set to “ASSIGNMENT_ID_NOT_AVAILABLE”.
  • hitId and workerId do not need to be passed back to AMT when the form is submitted.
  • Be very careful that, under any and all circumstances, the form will still be submitted back to AMT, even if your JavaScript code has a bug or the server responds unexpectedly (or not at all).

Submission

Send email to aq@purdue.edu with subject line warmup 4 submission [ece695cps]. In the body, include the following:

  1. What changes did you make as a result of the feedback?
  2. What feedback did you choose not to address?
  3. How much would you offer (reward) for one HIT? How did you determine this amount?
  4. Which of the following transports does your application use (directly or indirectly) to push events/messages to the client (browser)? (a) long polling, (b) WebSockets, (c) Server-Sent Events, (d) WebRTC, (e) something else
  5. Is there any limit on the number of connected clients?
  6. If so, what will happen if the limit is exceeded?
  7. Does your application have any known browser support limitations (e.g., IE9, IE8, etc.)?
  8. Are multiple threads and/or processes used (directly or indirectly)? If so… (a) which is used? (b) how? (c) why? (d) what determines the maximum number of threads? (e) if you had 100 clients listening and then one more connected, would a new thread and/or process need to be created?
  9. Which of the following are used in your code (directly) to express or manage concurrency? (a) coroutines, (b) futures, (c) condition variables, (d) callbacks, (e) auxiliary message queue server, (f) something else.
  10. Include screenshots in the email of your application running in the AMT sandbox, in preview and accepted modes.
  11. Copy your code as warmup4.tar.xz (like before) and include the SHA1 checksum in your email.
  12. How many hours did this take you? (My goal was to keep it down to 10-15 hours total.)

Scoring

Scoring will be from 0 to 6 as follows:

In addition:

Warm-up assignments may be modified up to 1 week prior to the due date.