coding – Page 2 – The rattled cough of a Hyper-Empathetic PM

Update my Contacts with Python: exploring LinkedIn’s and iCloud’s Contact APIs

Published on 2016-12-31 by paranoidmike

TL;DR Wow is it an adventure to decipher how to interact with undocumented web services like I found on LinkedIn and iCloud. Migrating data from LinkedIn to iCloud looks possible, but I got stuck at implementing the PUT operation to iCloud using Python.

Background: Because I have a shoddy memory for details about all the people I meet, and because LinkedIn appears to be de-prioritizing their role as a professional contact manager, I want to make my iPhone Contacts my system of record for all data about people I meet professionally. Which means scraping as much useful data as possible from LinkedIn and uploading it to iCloud Contacts (since my people-centric data is currently centered more around my iPhone than a Google Contacts approach).

In our last adventure, I stumbled across a surprisingly well-formed and useful API for pulling data from LinkedIn about my Connections:

https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

Available Data

Upon inspection, the results give me a lot of the data I was hoping to import into my iCloud Contacts:

crucial: Date we first connected on LinkedIn (“connectionDate” as time-since-epoch), Tags (“tags” as list of dictionaries), Picture (“profileImageUrl” as URI), first name (“firstName” as string), last name (“lastName” as string)
want: current company (“company” as dictionary), current title (“title” as string)
metadata: phone number (“phoneNumbers” as dictionary)

What doesn’t it give? Notes, Twitter ID, web site addresses, previous companies, email address. [What else does it give that could be useful? LinkedIn profile URL (“profileUrl” as the permanent URL, not the “friendly URL” that many of us have generated such as https://www.linkedin.com/in/mikelonergan). I can see how it would be helpful at a meetup to browse through my iPhone contacts to their LinkedIn profile to refresh myself on their work history. Creepy, desperate, but something I’ve done a few times when I’m completely blanking.]

What can I get from the User Data Archive? Notes are found in the Contacts.csv, and email address is found in Connections.csv. Matching those two files’ data together with what I can pull from the Contacts API shouldn’t be a challenge (concat firstName + lastName, and among the data set of my 684 contacts, I doubt I’ll find any collisions). Then matching those records to my iCloud Contacts *should* be just a little harder (I expect to match 50% of my existing contacts by emailAddress, then another fraction by phone number; the rest will likely be new records for my Contacts, with maybe one or two that I’ll have to merge by hand at the end).
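
Here’s a minimal sketch of that name-based join. It assumes the column headers that the data export uses for the two CSV files (First Name, Last Name, Email Address, Notes) and the firstName/lastName keys from the Contacts API. The function names and the sample rows are made up for illustration, and collisions are counted rather than assumed away:

```python
import csv
import io
from collections import Counter


def name_key(first, last):
    """Build a case-insensitive 'first last' key for matching records."""
    return f"{first} {last}".strip().casefold()


def merge_by_name(api_contacts, connections_rows, contacts_rows):
    """Attach email (Connections.csv) and notes (Contacts.csv) to API records."""
    emails = {name_key(r["First Name"], r["Last Name"]): r["Email Address"]
              for r in connections_rows}
    notes = {name_key(r["First Name"], r["Last Name"]): r["Notes"]
             for r in contacts_rows}
    merged = []
    for contact in api_contacts:
        key = name_key(contact["firstName"], contact["lastName"])
        merged.append({**contact,
                       "email": emails.get(key, ""),
                       "notes": notes.get(key, "")})
    # Names that occur more than once cannot be matched safely by name alone
    counts = Counter(name_key(c["firstName"], c["lastName"]) for c in api_contacts)
    collisions = sorted(k for k, n in counts.items() if n > 1)
    return merged, collisions


# Made-up sample data standing in for the real export files
connections_csv = io.StringIO(
    "First Name,Last Name,Email Address\nTony,Stank,tony@example.com\n")
contacts_csv = io.StringIO(
    "First Name,Last Name,Notes\nTony,Stank,Met at a meetup\n")
api_contacts = [{"firstName": "Tony", "lastName": "Stank", "title": "CEO"}]

merged, collisions = merge_by_name(
    api_contacts,
    list(csv.DictReader(connections_csv)),
    list(csv.DictReader(contacts_csv)))
print(merged[0]["email"], "|", merged[0]["notes"], "|", collisions)
```

Any name that shows up in the collisions list would need to be merged by hand.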

Planning the “tracer bullet”

So what’s the smallest piece of code I can pull together to prove this scenario actually works? It’ll need at least these features (assumes Python):

can authenticate to LinkedIn via at least one supported protocol (e.g. OAuth 2.0)
can pull down the first 10 JSON records from Contacts API and hold them in a list
can enumerate the First + Last Name and pull out “title” for that record
can authenticate to iCloud
- Note: I may need to disable 2-factor authentication that is currently enabled on my account
can find a matching First + Last Name in my iCloud Contacts
can write the title field to the iCloud contact
- Note: I’m worried least about existing data for the title field
can upload the revised record to iCloud so that it replicates successfully to my iPhone

That should cover all the essential operations for the least-complicated data, without having to worry about edge cases like “what if the contact doesn’t exist in iCloud” or “what if there’s already data in the field I want to fill”.

Step 1: authenticate to LinkedIn

There are plenty of packages and modules on GitHub for accessing LinkedIn, but the ones I’ve evaluated all use the REST APIs, with their dual-secrets authentication mechanism, to get at the data (e.g. this one, this one, that one, another one).

Or am I making this more complicated than it is? This python module simply used username + password in their call to an HTTP ‘endpoint’. Let’s assume that judicious use of the requests package is sufficient for my needs.

I thought I’d build an anaconda kernel and a jupyter notebook to experiment with the modules I’m looking at. But when I attempted to install the requests package in my new Anaconda environment, I got back this error:

LinkError:
Link error: Error: post-link failed for: openssl-1.0.2j-0

Quick search turns up a couple of open conda issues that don’t give me any immediate relief. OK, forget this for a bit – the “root” kernel will do fine for the moment.

Next let’s try this code and see what we get back:

import requests
r = requests.get('https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007', auth=('mikethecanuck@gmail.com', 'linkthis'))
r.status_code

Output is simply “401”. Dang, authentication wasn’t *quite* that easy.

So I tried that URL in an incognito tab, and it displays this to me without an existing auth cookie:

{"status":"Member is not Logged in."}

And as soon as I open another tab in that incognito window and authenticate to the linkedin.com site, the first tab with that contacts query returns the detailed JSON I was expecting.

Digging deeper, it appears that when I authenticate to https://www.linkedin.com through the incognito tab, I receive back one cookie labelled “lidc”, and that an “lidc” cookie is also sent to the server on the successful request to the contacts API.

But setting the cookie manually with the value returned from a previous request still leads to 401 response:

url = 'https://www.linkedin.com/connected/api/v2/contacts?start=40&count=10&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007'
cookies = dict(lidc="b=OGST00:g=43:u=1:i=1482261556:t=1482347956:s=AQGoGetJeZPEDz3sJhm_2rQayX5ZsILo")
r2 = requests.get(url, cookies=cookies)

I tried two other approaches that people have used in the past – some even successfully with certain pages on LinkedIn – but eventually I decided that I’m getting ratholed on trying to reverse-engineer an undocumented (and more than likely unusually-constructed) API, when I can quite easily dump the data out of the API by hand and then do the rest of my work successfully. (Yes I know that disqualifies me as a ‘real coder’, but I think we both know I was never going to win that medal – but I will win the medal for “results-oriented” not “pedantically chasing my tail”.)

Thus, knowing that I’ve got 684 connections on LinkedIn (saw that in the footer of a response), I submitted the following queries and copy-pasted the results into 4 separate .JSON files for offline processing:

https://www.linkedin.com/connected/api/v2/contacts?start=0&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

https://www.linkedin.com/connected/api/v2/contacts?start=200&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

https://www.linkedin.com/connected/api/v2/contacts?start=400&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

https://www.linkedin.com/connected/api/v2/contacts?start=600&count=200&fields=id%2Cname%2CfirstName%2ClastName%2Ccompany%2Ctitle%2Clocation%2Ctags%2Cemails%2Csources%2CdisplaySources%2CconnectionDate%2CsecureProfileImageUrl&sort=CREATED_DESC&_=1481999304007

Oddly, the four sets of results contain 196, 198, 200 and 84 items – they assert that I have 684 connections, but can only return 678 of them? I guess that’s one of the consequences of dealing with a “free” data repository (even if it started out as mine).
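
For the record, those four URLs don’t have to be typed by hand. This is a small sketch that rebuilds them with the standard library; the field list and sort order are copied from the URLs above, the function name is my own invention, and I’ve left off the trailing “_” parameter on the assumption that it’s only a cache-buster:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.linkedin.com/connected/api/v2/contacts"
FIELDS = ["id", "name", "firstName", "lastName", "company", "title",
          "location", "tags", "emails", "sources", "displaySources",
          "connectionDate", "secureProfileImageUrl"]


def page_urls(total, page_size=200):
    """Return one contacts URL per page of `page_size` records."""
    urls = []
    for start in range(0, total, page_size):
        # urlencode percent-encodes the commas in the field list as %2C
        query = urlencode({"start": start,
                           "count": page_size,
                           "fields": ",".join(FIELDS),
                           "sort": "CREATED_DESC"})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = page_urls(684)
for url in urls:
    print(url)
```

These still have to be opened in a logged-in browser tab, since the authentication problem above remains unsolved.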

Step 2: read the JSON file and parse a list of connections

I’m sure I could be more efficient than this, but as far as getting a working result, here’s the arrangement of code I used to start accessing structured list data from the Contacts API output I shunted to a file:

import json

# Read the Contacts API output that was saved to a file
with open("Connections-API-results.json") as contacts_file:
    contacts_json = json.load(contacts_file)

# The list of connections lives under the 'values' key
contacts_list = contacts_json['values']
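
Since I actually saved four files rather than one, here’s a sketch of how they could be stitched together into a single list. The file names are hypothetical, and de-duplicating on the “id” field is my assumption about what makes a record unique:

```python
import json
import os
import tempfile


def load_contacts(paths):
    """Read each JSON file and return one de-duplicated list of contacts."""
    seen_ids = set()
    contacts = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            page = json.load(f)
        for contact in page["values"]:
            if contact["id"] not in seen_ids:
                seen_ids.add(contact["id"])
                contacts.append(contact)
    return contacts


# Demonstration with two tiny made-up pages written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    pages = [
        {"values": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]},
        {"values": [{"id": 2, "name": "B"}, {"id": 3, "name": "C"}]},
    ]
    paths = []
    for number, page in enumerate(pages):
        path = os.path.join(folder, f"connections-{number}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(page, f)
        paths.append(path)
    all_contacts = load_contacts(paths)

print(len(all_contacts))
```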

Step 3: pulling data out of the list of connections

It turns out this is pretty easy, e.g.:

for contact in contacts_list:
    print(contact['name'], contact['title'])

Messing around a little further, trying to make sense of the connectionDate value from each record, I found that this returns an ISO 8601-style date string that I can use later:

import time
# connectionDate is in milliseconds since the epoch, so divide by 1000
print(time.strftime("%Y-%m-%d", time.localtime(contacts_list[15]['connectionDate'] / 1000)))

e.g. for the record at index “15”, that returned 2007-03-15.
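
One caveat: time.localtime() uses the time zone of whatever machine runs the code, so a timestamp close to midnight could land on a different day on another computer. Here’s a sketch of a variant pinned to UTC; whether LinkedIn’s timestamps are meant to be read as UTC is my assumption, and the sample value is illustrative rather than taken from a real record:

```python
from datetime import datetime, timezone


def connection_date(ms_since_epoch):
    """Convert a millisecond epoch timestamp to a YYYY-MM-DD string (UTC)."""
    moment = datetime.fromtimestamp(ms_since_epoch / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


# 1173960000000 ms is an illustrative value, not taken from a real record
print(connection_date(1173960000000))
```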

Data issue: it turns out that not all records have a profileImageUrl key (e.g. for those oddball security geeks among my contacts who refuse to publish a photo on their LinkedIn profile), so I got to handle my first expected exception 🙂

Assembling all the useful data I wanted for all my Connections into a single list of dictionaries, I was able to make the following work (as you can find in my repo):

import time

stripped_down_connections_list = []

for contact in contacts_list:
    # connectionDate is in milliseconds since the epoch
    date_first_connected = time.strftime(
        "%Y-%m-%d", time.localtime(contact['connectionDate'] / 1000))

    # Not every record has a profile picture
    picture_url = ""
    try:
        picture_url = contact['profileImageUrl']
    except KeyError:
        pass

    # Each tag is a dictionary; keep only its name
    tags = [tag['name'] for tag in contact['tags']]

    # Keep only the first phone number, if there is one
    phone_number = ""
    try:
        phone_number = {"type": contact['phoneNumbers'][0]['type'],
                        "number": contact['phoneNumbers'][0]['number']}
    except (KeyError, IndexError):
        pass

    stripped_down_connections_list.append({
        "firstName": contact['firstName'],
        "lastName": contact['lastName'],
        "title": contact['title'],
        "company": contact['company']['name'],
        "connectionDate": date_first_connected,
        "profileImageUrl": picture_url,
        "tags": tags,
        "phoneNumber": phone_number,
    })

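The try/except blocks get repetitive once more than a couple of fields can be missing. Here’s a sketch of the same idea using dict.get() with defaults instead. The function name is made up, and I’m assuming (without having checked every record) that company and phoneNumbers can be absent or empty as well:

```python
def strip_contact(contact):
    """Return only the fields of interest, tolerating missing keys."""
    company = contact.get("company") or {}
    phones = contact.get("phoneNumbers") or []
    first_phone = phones[0] if phones else {}
    return {
        "firstName": contact.get("firstName", ""),
        "lastName": contact.get("lastName", ""),
        "title": contact.get("title", ""),
        "company": company.get("name", ""),
        "profileImageUrl": contact.get("profileImageUrl", ""),
        "tags": [tag["name"] for tag in contact.get("tags") or []],
        "phoneNumber": first_phone.get("number", ""),
    }


# Made-up record with no company, picture or phone number
sparse = {"firstName": "Tony", "lastName": "Stank", "tags": [{"name": "meetup"}]}
print(strip_contact(sparse))
```
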
Step 4: Authenticate to iCloud

For this step, I’m working with the pyicloud package, hoping that they’ve worked out both (a) Apple’s two-factor authentication and (b) read/write operations on iCloud Contacts.

I set up yet another jupyter notebook and tried out a couple of methods to import PyiCloud (based on these suggestions here), at least one of which does a fine job. With picklepete’s suggested 2FA code added to the mix, I appear to be able to complete the authentication sequence to iCloud.

APPLE_ID = 'REPLACE@ME.COM'
APPLE_PASSWORD = 'REPLACEME'

import sys  # needed for the sys.exit() calls below
from importlib.machinery import SourceFileLoader

foo = SourceFileLoader("pyicloud", "/Users/mike/code/pyicloud/pyicloud/__init__.py").load_module()
api = foo.PyiCloudService(APPLE_ID, APPLE_PASSWORD)

if api.requires_2fa:
    import click
    print("Two-factor authentication required. Your trusted devices are:")

    devices = api.trusted_devices
    for i, device in enumerate(devices):
        print(" %s: %s" % (i, device.get('deviceName',
        "SMS to %s" % device.get('phoneNumber'))))

    device = click.prompt('Which device would you like to use?', default=0)
    device = devices[device]
    if not api.send_verification_code(device):
        print("Failed to send verification code")
        sys.exit(1)

    code = click.prompt('Please enter validation code')
    if not api.validate_verification_code(device, code):
        print("Failed to verify verification code")
        sys.exit(1)

Step 5: matching on First + Last with iCloud

Caveat: there are a number of my contacts who have appended titles, certifications etc to their lastName field in LinkedIn, such that I won’t be able to match them exactly against my cloud-based contacts.
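
If I do come back to this step, something like the following sketch might recover most of those near-misses before comparing names. The list of credentials is a handful of hypothetical examples (not a survey of my actual contacts), and the function names are invented:

```python
import re

# Hypothetical examples of what people append to their last name
CREDENTIAL = re.compile(r"^(mba|pmp|phd|cissp|csm|p\.?eng)\.?$", re.IGNORECASE)


def clean_last_name(last_name):
    """Drop anything after a comma, then any trailing credential tokens."""
    name = last_name.split(",")[0].strip()
    tokens = name.split()
    while len(tokens) > 1 and CREDENTIAL.match(tokens[-1]):
        tokens.pop()
    return " ".join(tokens)


def match_key(first_name, last_name):
    """Case-insensitive key for comparing LinkedIn and iCloud names."""
    return (first_name.strip().casefold(), clean_last_name(last_name).casefold())


print(clean_last_name("Appleseed, PMP, CISSP"))
print(clean_last_name("Appleseed MBA"))
print(match_key("Johnny", "Appleseed, PMP") == match_key("johnny", "appleseed"))
```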

I’m not even worried about this step, because I quickly got worried about…

Step 6: write to the iCloud contacts (?)

Here’s where I’m stumped: I don’t think the PyiCloud package has any support for non-GET operations against the iCloud Contacts service. There appears to be support for POST in the Reminders module, but not in any of the other services modules (including Contacts).

So I sniffed the wire traffic in Chrome Dev Tools, to see what’s being done when I make an update to any iCloud.com contact. There’s two possible operations: a POST method call for a new contact, or a PUT method call for an update to an existing contact.

Here’s the Request Payload for a new contact:

{"contacts":[{"contactId":"2EC49301-671B-431B-BC8C-9DE6AE15D21D","firstName":"Tony","lastName":"Stank","companyName":"Stark Enterprises","isCompany":false}]}

Here’s the Request Payload for an update to that existing contact (I added homepage URL):

{"contacts":[{"firstName":"Tony","lastName":"Stank","contactId":"2EC49301-671B-431B-BC8C-9DE6AE15D21D","prefix":"","companyName":"Stark Enterprises","etag":"C=1432@U=afe27ad8-80ce-4ba8-985e-ec4e365bc6d3","middleName":"","isCompany":false,"suffix":"","urls":[{"label":"HOMEPAGE","field":"http://stark.com"}]}]}

There are four requests being made for either type of change to iCloud contacts (at least via the iCloud.com web interface that I am using as a model for what the code should be doing).

Here’s the details for these calls when I create a new Contact:

Request URL: https://p28-contactsws.icloud.com/co/contacts/card/?clientBuildNumber=16HProject79&clientId=63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA&clientMasteringNumber=16H71&clientVersion=2.1&dsid=197715384&prefToken=914266d4-387b-4e13-a814-7e1b29e001c3&syncToken=DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1426
Request Payload: {"contacts":[{"contactId":"E2DDB4F8-0594-476B-AED7-C2E537AFED4C","urls":[{"label":"HOMEPAGE","field":"http://apple.com"}],"phones":[{"label":"MOBILE","field":"(212) 555-1212"}],"emailAddresses":[{"label":"WORK","field":"johnny.appleseed@apple.com"}],"firstName":"Johnny","lastName":"Appleseed","companyName":"Apple","notes":"Dummy contact for iCloud automation experiments","isCompany":false}]}
Request URL: https://p28-contactsws.icloud.com/co/changeset?clientBuildNumber=16HProject79&clientId=63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA&clientMasteringNumber=16H71&clientVersion=2.1&dsid=197715384&prefToken=914266d4-387b-4e13-a814-7e1b29e001c3&syncToken=DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1427
Request URL: https://webcourier.push.apple.com/aps?tok=bc3dd94e754fd732ade052eead87a09098d3309e5bba05ed24272ede5601ae8e&ttl=43200
Request URL: https://feedbackws.icloud.com/reportStats
Request Payload: {"stats":[{"httpMethod":"POST","statusCode":200,"hostname":"www.icloud.com","urlPath":"/co/contacts/card/","clientTiming":395,"uncompressedResponseSize":14469,"region":"OR","country":"US","time":"Wed Dec 28 2016 12:13:48 GMT-0800 (PST) (1482956028436)","timezone":"PST","browserLocale":"en-us","statName":"contactsRequestInfo","sessionID":"63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA","platform":"desktop","appName":"contacts","isLiteAccount":false},{"httpMethod":"POST","statusCode":200,"hostname":"www.icloud.com","urlPath":"/co/changeset","clientTiming":237,"uncompressedResponseSize":2,"region":"OR","country":"US","time":"Wed Dec 28 2016 12:13:48 GMT-0800 (PST) (1482956028675)","timezone":"PST","browserLocale":"en-us","statName":"contactsRequestInfo","sessionID":"63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA","platform":"desktop","appName":"contacts","isLiteAccount":false}]}

I am 99% sure that the only request that actually changes the Contact data is the first one (https://p28-contactsws.icloud.com/co/contacts/card/), so I’ll ignore the other three calls from here on out.
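
Those request URLs are hard to read with all the percent-encoding, so here’s a quick sketch that pulls apart the first one (the new-contact request captured above) using only the standard library. I’m not claiming to know what each parameter means; this just makes them legible:

```python
from urllib.parse import urlsplit, parse_qs

# Request URL captured above for creating a new contact
url = ("https://p28-contactsws.icloud.com/co/contacts/card/"
       "?clientBuildNumber=16HProject79"
       "&clientId=63D7078B-F94B-4AB6-A64D-EDFCEAEA6EEA"
       "&clientMasteringNumber=16H71&clientVersion=2.1&dsid=197715384"
       "&prefToken=914266d4-387b-4e13-a814-7e1b29e001c3"
       "&syncToken=DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1426")

parts = urlsplit(url)
# parse_qs percent-decodes the values and returns a list per parameter
params = {name: values[0] for name, values in parse_qs(parts.query).items()}

print(parts.netloc, parts.path)
for name, value in sorted(params.items()):
    print(f"{name} = {value}")
```

Once decoded, the syncToken appears to carry the same UUID that shows up after “U=” in the etag values, plus a counter after “S=” that differs by one between the two captured requests. That’s only an observation from these few samples, not documented behaviour.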

Here’s the details of the first request when I edit an existing Contact:

Request URL: https://p28-contactsws.icloud.com/co/contacts/card/?clientBuildNumber=16HProject79&clientId=792EFA4A-5A0D-47E9-A1A5-2FF8FFAF603A&clientMasteringNumber=16H71&clientVersion=2.1&dsid=197715384&method=PUT&prefToken=914266d4-387b-4e13-a814-7e1b29e001c3&syncToken=DAVST-V1-p28-FT%3D-%40RU%3Dafe27ad8-80ce-4ba8-985e-ec4e365bc6d3%40S%3D1427
Request Payload: {"contacts":[{"lastName":"Appleseed","notes":"Dummy contact for iCloud automation experiments","contactId":"E2DDB4F8-0594-476B-AED7-C2E537AFED4C","prefix":"","companyName":"Apple","phones":[{"field":"(212) 555-1212","label":"MOBILE"}],"isCompany":false,"suffix":"","firstName":"Johnny","urls":[{"field":"http://apple.com","label":"HOMEPAGE"},{"label":"HOME","field":"http://johnny.name"}],"emailAddresses":[{"field":"johnny.appleseed@apple.com","label":"WORK"}],"etag":"C=1427@U=afe27ad8-80ce-4ba8-985e-ec4e365bc6d3","middleName":""}]}

So here’s what’s puzzling me so far: both the POST (create) and PUT (edit) operations include a contactId parameter. Its value is the same from POST to PUT (i.e. I believe that means it’s referencing the same record). When I create a second new Contact, the contactId is different than the contactId submitted in the Request Payload for the first new Contact (so it’s presumably not a dummy value). And yet when I look at the request/response for the initial page load when I click “+” and “New Contact”, I don’t see a request sent from the browser to the server (so the server isn’t sending down a contactID – not at that moment at least – perhaps it’s cached earlier?).

Explained another way, this is how I believe the sequence works (based on repeated analysis of the network traffic from Chrome to the iCloud endpoint and back):

User loads icloud.com, Contacts page (#contacts), clicks “+” and selects “New Contact”
- Browser sends no request, but rather builds the New Contact form from cached code
User adds data and clicks the Done button for the new Contact
- Browser sends POST request to https://p28-contactsws.icloud.com/co/contacts/card/ with a bunch of form data on the URL, a whole raft of cookies and the JSON request payload [including contactId=x]
- Server sends response
User clicks Edit on that new contact, updates some data and clicks Done
- Browser sends PUT request to https://p28-contactsws.icloud.com/co/contacts/card/ with form data, cookies and JSON request payload [including the same contactId=x]
- Server sends response

So the question is: if I’m creating a net-new Contact, how does the web client get a valid contactId that iCloud will accept? Near as I can figure, digging through the javascript-packed.js this page uses, this is the function that generates a UUID at the client:

Contacts.Contact = Contacts.Record.extend({
 primaryKey: "contactId",
 contactId: CW.Record.attr(String, {
 defaultValue: function() {
 return CW.upperCaseUUID()
 }
 })

Using this function (IIUC):

UUID: function() {
 var e = new Array(36),
 t = 0,
 n = ["8", "9", "a", "b"];
 if (window.crypto && window.crypto.getRandomValues) {
 var r = new Uint8Array(18);
 crypto.getRandomValues(r);
 for (t = 0; t < 18; t++) e[t * 2 + 1] = (r[t] >> 4).toString(16), e[t * 2] = (r[t] & 15).toString(16);
 e[19] = n[r[9] >> 6]
 } else {
 while (t < 36) e[t] = (Math.random() * 16 | 0).toString(16), t++;
 e[19] = n[Math.random() * 4 | 0]
 }
 return e[8] = e[13] = e[18] = e[23] = "-", e[14] = "4", e.join("")
 }

[Aside: I sincerely hope this is a standard library for UUID, not something Apple wrote themselves, in case I ever need to generate iCloud-compatible UUIDs.]
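
For what it’s worth, the JavaScript above forces position 14 to “4” and position 19 to one of 8, 9, a or b, which is the layout of a random (version 4) UUID. If that reading is right, Python’s uuid.uuid4() should produce IDs of the same shape. Here’s a sketch that builds a body shaped like the captured new-contact payload; the function name is invented, nothing is sent anywhere, and I have not tested whether iCloud would accept an ID generated this way:

```python
import json
import uuid


def new_contact_payload(first_name, last_name, company_name=""):
    """Build a request body shaped like the captured new-contact payload."""
    contact = {
        # uuid4() yields a random (version 4) UUID; the web client upper-cases it
        "contactId": str(uuid.uuid4()).upper(),
        "firstName": first_name,
        "lastName": last_name,
        "companyName": company_name,
        "isCompany": False,
    }
    return {"contacts": [contact]}


payload = new_contact_payload("Tony", "Stank", "Stark Enterprises")
print(json.dumps(payload))
```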

Whoa – Pause

I need to take a step back and re-examine my goals and what I can specifically address. I have learned a lot about both LinkedIn and iCloud, but I didn’t set out to recreate them, just find a way to make consistent use of the data I already have.

Parsing PDFs using Python

Published on 2016-12-29 by paranoidmike

I’m part of a project that has a need to import tabular data into a structured database, from PDF files that are based on digital or analog inputs. [Digital input = PDF generated from computer applications; analog input = PDF generated from scanned paper documents.]

These are the preliminary research notes I made for myself a while ago that I am now publishing for reference by other project members. These are neither conclusive nor comprehensive, but they are directionally relevant.

In short, the amount of work it takes for code to parse structured data from analog input PDFs is a significant hurdle, not to be underestimated (this blog post was the single most awe-inspiring find I made). The strongest possible recommendation based on this research is GET AS MUCH OF THE DATA FROM DIGITAL SOURCES AS YOU CAN.

Packages/libraries/guidance

The basics: https://automatetheboringstuff.com/chapter13/ PyPDF2
A more involved tutorial examining many packages: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167: Pdfrw, slate, PDFQuery, PDFMiner, PyPDF2
Further searching: https://pypi.python.org/pypi?%3Aaction=search&term=pdf&submit=search: lots of packages
One StackOverflow comparison: http://stackoverflow.com/questions/6413441/python-pdf-library: PyPDF2, PDFMiner, ReportLab
One high-level package: https://github.com/pmaupin/pdfrw
Others suggested by Ed Borasky: ijmbarr/parsing-pdfs, reesepathak/pdf-mining
Others I found: poppler

Evaluation of Packages

Pdfrw: https://github.com/pmaupin/pdfrw
- Python 2 & 3 (3.3, 3.4)
- Last updated 2016-10
- Heavily oriented to a printing workflow: manipulating paging, sizing, embedded images
- Multiple references to reportlab (complementary functionality)
- “There are a lot of incorrectly formatted PDFs floating around; support for these is added in some cases. The decision is often based on what acroread and okular do with the PDFs; if they can display them properly, then eventually pdfrw should, too, if it is not too difficult or costly.”
- Great writeup of the trials of wrangling PDF document internal structures
Pdfminer http://www.unixuser.org/~euske/python/pdfminer/index.html
- Python 2 (pdfminer3k, pdfminer.six apparently support Python 3)
- Last updated 2016-09
- Can export PDF to other formats (e.g. HTML)
Slate https://pypi.python.org/pypi/slate
- Supports Python 2 & 3
- Last updated 2015-11
- Wrapper around PDFMiner for ease of use
- Focused on extracting text from PDFs
ReportLab http://www.reportlab.com/opensource/
- Python 2.7 or 3.3+
- Source currently tracked here: https://bitbucket.org/rptlab/reportlab/
- Commercially-backed open source
- Oriented primarily to creating PDFs
PdfQuery https://pypi.python.org/pypi/pdfquery
- Python 2 or 3
- Last updated 2016-03
- Also a wrapper around PDFMiner
- Also meant for ease of use
- Orients to JQuery or XPath syntax (i.e. requires no explicit knowledge of internal layout complexities)
XPDF http://www.foolabs.com/xpdf/about.html
- Couple of years old
- Includes PDFInfo which does a great job of exporting metadata
ijmbarr/parsing-pdfs https://github.com/ijmbarr/parsing-pdfs
- Specifically tackling tabular data
- FANTASTIC writeup of the low-level grind in extracting tabular data from PDFs that weren’t designed for ease of reuse: http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
Reesepathak/pdf-mining https://github.com/reesepathak/pdf-mining
- (did not examine)
Poppler https://poppler.freedesktop.org
- (did not examine)

Possible issues

Encryption of the file
Compression of the file
Vector images, charts, graphs, other image formats
Form XObjects
Text contained in figures
Does text always appear in the same place on the page, or different every page/document?

PDF examples I tried parsing, to evaluate the packages

IRS 1040A
2015-16-prelim-doc-web.pdf (Bellingham city budget)
- Tabular data begins on page 30 (labelled Page 28)
- PyPDF2 Parsing result: None of the tabular data is exported
- SCARY: some financial tables are split across two pages
2016-budget-highlights.pdf (Seattle city budget summary)
- Tabular data begins on page 15-16 (labelled 15-16)
- PyPDF2 Parsing result: this data parses out
FY2017 Proposed Budget-Lowell-MA (Lowell)
- Financial tabular data starts at page 95-104, then 129-130, 138-139
- More interesting are the small breakouts on subsequent pages e.g. 149, 151, 152, 162; 193, 195, 197
- PyPDF2 Parsing result: all data I sampled appears to parse out

Experiment ideas

Build an example PDF for myself with XLS tables, and then see what comes out when the contents are parsed using one of these libraries
Build a script that spits out useful metadata about the document: which app/library generated it (e.g. Producer, Creator), size, # of pages
Build another script to verify there’s a non-trivial amount of ASCII/Unicode text in the document (i.e. so we confirm it doesn’t have to be OCR’d)
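
The last two ideas could start from something like this sketch. It assumes a PyPDF2 release that provides PdfReader (2.x or later; older releases name these PdfFileReader, getDocumentInfo and extractText), and the 50-characters-per-page threshold is an arbitrary starting point rather than a measured value. Both function names are made up:

```python
def summarize_pdf(path):
    """Return producer, creator, page count and per-page text for a PDF.

    Assumes PyPDF2 2.x or later is installed; the import is kept inside the
    function so the rest of this sketch works without it.
    """
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    info = reader.metadata  # may be None when the PDF carries no metadata
    return {
        "producer": info.producer if info else None,
        "creator": info.creator if info else None,
        "pages": len(reader.pages),
        "page_texts": [page.extract_text() or "" for page in reader.pages],
    }


def needs_ocr(page_texts, min_chars_per_page=50):
    """Guess whether a document is a scan: too little extractable text."""
    if not page_texts:
        return True
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


# The threshold of 50 characters per page is an arbitrary starting point
print(needs_ocr(["", "   ", ""]))
print(needs_ocr(["Revenue 2015 2016 " * 20, "Expenditures by department " * 20]))
```

Feeding summarize_pdf(path)["page_texts"] into needs_ocr() would give a rough first pass at sorting digital-input PDFs from analog-input ones.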

Experiments tried

Create an Anaconda Notebook to write some PDF scripts
- Created “PDF Experiments” environment on Win10 Anaconda install
- Fired up Anaconda Prompt and ran “pip install pypdf2” (see http://stackoverflow.com/questions/18640305/how-do-i-keep-track-of-pip-installed-packages-in-an-anaconda-conda-environment#18640601)
- Created “PDF Experiments” notebook and ran a script that included “import PyPDF2” (the module name is case-sensitive) – successfully using the function
Extract the contents of Google Spreadsheets exported as PDF
- Result: no readable text exported using pypdf2 [5 spreadsheets attempted]

Update my Contacts with Python: thinking through the options

Published on 2016-12-17 by paranoidmike

Alright folks, that’s the bell. When LinkedIn stops thinking of itself as a professional contact manager, you know there’s no profit in it, and it’s time to manage this stuff yourself.

Problem To Solve

I’ve been hemming and hawing for a couple of years, ever since Evernote shut down their Hello app, about how to remember who I’ve met and where I met them. I’m a Meetup junkie (with no rehab in sight) and I’ve developed a decent network of friends and acquaintances that make it easy for me to attend new events and conferences in town – I’ll always “know” someone there (though not always remember why/how I know them or even what their name is).

When I first discovered Evernote Hello, it seemed like the perfect tool for me – provided me a timeline view of all the people I’d met, with rich notes on all the events I’d seen them at and where those places were. It never entirely gelled, it sporadically did and did NOT support business card import (pay for play mostly), and it was only good for those people who gave me enough info for me to link them. Even with all those imperfections, I remember regularly scanning that list (from a quiet corner at a meetup/party/conference) before approaching someone I *knew* I’d seen before, but couldn’t remember why. [Google Glass briefly promised to solve this problem for me too, but that tech is off somewhere, licking its wounds and promising to come back in ten years when we’re ready for it.]

What other options do I have, before settling in to “do it myself”?

Pay the big players e.g. SalesForce, LinkedIn
- Salesforce: smallest SKUs I could find @ $25/month [nope]
- LinkedIn “Sales” SKU: $65/month [NOPE]
Get a cheap/trustworthy/likely-to-survive-more-than-a-year app
- Plenty of apps I’ve evaluated that sound sketchy, or likely to steal your data, or are so under-funded that they’re likely to die off in a few months

Requirements

Do it myself then. Now I’ve got a smaller problem set to solve:

Enforce synchronization between my iPhone Contacts.app, the iCloud replica (which isn’t a perfect replica) and my Google Contacts (which are a VERY spotty replica).
- Actually, let’s be MVP about this: all I *need* right now is a way of automating edits to Contacts on my iPhone. I assume that the most reliable way of doing this is to make edits to the iCloud.com copy of the contact and let it replicate down to my phone.
- the Google Contacts sync is a future-proofing move, and one that theoretically sounded free (just needed to flip a toggle on my iPhone profile), but which in practice seems to be built so badly that only about 20% of my contacts have ever sync’d with Google
Add/update information to my contacts such as photos, “first met” context (who introduced, what event met at) and other random details they’ve confessed to me (other attempts to hook my memory) – *WITHOUT* linking my iPhone contacts with either LinkedIn or Facebook (who will of course forever scrape all that data up to their cloud, which I do *not* want to do – to them or me).

Test the Sync

How can I test my requirements in the cheapest way possible?

Make hand edits to the iCloud.com contacts and check that it syncs to the iPhone Contacts.app
- Result: sync to iPhone within seconds
Make hand edits to contacts in Contacts.app and check that it syncs to iCloud.com contact
- Result: sync to iCloud within seconds

OK, so once I have data that I want to add to an iCloud contact, and code (Python for me please!) that can write to iCloud contacts, it should be trivial to edit/append.

Here’s all the LinkedIn Data I Want

Data that’s crucial to remembering who someone is:

Date we first connected on LinkedIn
Tags
Notes
Picture

Additional data that can help me fill in context if I want to dig further:

current company
current title
Twitter ID
Web site addresses
Previous companies

And metadata that can help uniquely identify people when reading or writing from other directories:

Email address
Phone number

How to Get my LinkedIn connection data?

OK, so (as of 2016-12-15 at 12:30pm PST) there’s three ways I can think of pulling down the data I’ve peppered into my LinkedIn connections:

User Data Archive: request an export of your user data from LinkedIn
LinkedIn API: request data for specified Connections using LinkedIn’s supported developer APIs
Web Scraping: iterate over every Connection and pull fields via CSS using e.g. Beautiful Soup

User Data Archive

This *sounds* like the most efficient and straightforward way to get this data. The “Relationship Section” announcement even implies that I’ll get everything I want:

If you want to download your existing Notes and Tags, you’ll have the option to do so through March 31, 2017…. Your notes and tags will be in the file named Contacts.

The initial data dump included everything except a Contacts.csv file. The later Complete_LinkedInDataExport_12-16-2016 [ISO 8601 anyone?] included the data promised and nearly nothing else:

Connections.csv: First Name, Last Name, Email Address, Current Company, Current Position, Tags
Contacts.csv: First Name, Last Name, Email (mostly blank), Notes, Tags

I didn’t expect to get Picture, but I was hoping for Date First Connected, and while the rest of the data isn’t strictly necessary, it’s certainly annoying that LinkedIn is so friggin frugal.

Regardless, I have almost no other source for pictures for my professional contacts, and that is pretty essential for recalling someone I’ve met only a handful of times, so while helpful, this wasn’t sufficient.

LinkedIn API

The next most reliable way to attack this data is to programmatically request it. However, as I would’ve expected from this “roach motel” of user-generated data, they don’t even support an API to request all Connections from your user account (merely sign-in and submit data).

Where they do make reference to user data, it’s in a highly-regulated set of Member Profile fields:

With the r_basicprofile permission, you can get first-name, last-name, positions, picture-url plus some other data I don’t need
With the r_emailaddress permission, you can get the user’s primary email address
For developers accepted into “Apply with LinkedIn”, and with the r_fullprofile permission, you can further get date-of-birth and member-url-resources
For those “Apply with LinkedIn” developers who have the r_contactinfo permission, you can further get phone-numbers and twitter-accounts

After registering a new application, I am immediately given the ability to grant the following permissions to my app: r_basicprofile, r_emailaddress. That’ll get me picture-url, if I can figure out a way to enumerate all the Connections for my account.

(A half-hour sorting through Chrome Dev Tools’ Network outputs later…)

Looks like there’s a handy endpoint that lets the browser enumerate pretty much all the data I want:

That bears further investigation.

Web Scraping

While this approach doesn’t have the restrictions built into the LinkedIn APIs, there are at least three challenges I can foresee so far:

LinkedIn requires authentication, and OAuth 2.0 at that (at least for API access). Integrating OAuth into a Beautiful Soup script isn’t something I’ve heard of before, but I’m seeing some interesting code fragments and tutorials that could be helpful, and it appears that the requests package can do OAuth 1 & 2.
LinkedIn has helpfully implemented the “infinite scroll” AJAX behaviour on the Connections page.
- There are ways to work with this behaviour, but it sure feels cumbersome – to the point I almost feel like doing this work by hand would just be faster.
Navigating automatically to each linked page (each Connection) from the Connections page isn’t something I am entirely confident about
- Though I imagine it should be as easy as “for each Connection in Connections, load the page, then find the data with this CSS attribute, and store it in an array of whatever form you like”. The mechanize package promises to make the link navigation easy.
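To make the "find the data with this CSS attribute" idea concrete, here is a rough sketch. It swaps Beautiful Soup for the standard library's html.parser so it runs with no extra installs, and it skips the authentication and page-loading challenges entirely. The class names ("connection", "name", "title") and the sample markup are hypothetical, not LinkedIn's real page structure.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # > 0 while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.css_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Hypothetical markup standing in for one downloaded Connection page.
sample_page = """
<div class="connection"><span class="name">Ada Lovelace</span>
<span class="title">Programmer</span></div>
"""
collector = ClassTextCollector("name")
collector.feed(sample_page)
collector.close()
names = [text.strip() for text in collector.results]
```

The same collector could be run once per field (name, title and so on) for each page, appending the results to whatever list or dictionary is convenient.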

Am I Ready for This Much Effort?

It sure feels like there are a lot of barriers in the way of just collecting the info I’ve accumulated in LinkedIn about my connections. Would it take me less time to just browse each connection page and hand copy/paste the data from LinkedIn to iCloud? Almost certainly. Putting together a Beautiful Soup + requests + various github modules solution would probably take me 20-30 hours, I’m guessing, from all the reading and piecing together code fragments from various sources, to debugging and troubleshooting, to making something that spits out the data and then automatically uploads it without mucking up existing data.

Kinda takes the fun out of it that way, doesn’t it? I mean, the “glory” of writing code that’ll do something I haven’t found anyone else do, that’s a little boost of ego and all. Still, it’s hard to believe this kind of thing hasn’t been solved elsewhere – am I the only person with this bad of a memory, and this much of a drive to keep myself from looking like Leonard Shelby at every meetup?

What’s worse though, for embarking on this thing, is that I’d bet in six months’ time, LinkedIn and/or iCloud will have ‘broken’ enough of their site(s) that I wouldn’t be able to just re-use what I wrote the first time. Maintenance of this kind of specialized/unique code feels pretty brutal, especially if no one else is expected to use it (or at least, I don’t have any kind of following to make it likely folks will find my stuff on github).

Still, I don’t think I can leave this itch entirely unscratched. My gut tells me I should dig into that Contacts API first before embarking on the spelunking adventure that is Beautiful Soup.

This time, success: Flask-on-AWS tutorial (with advanced use of virtualenv)

Published on 2016-12-12 by paranoidmike, 1 Comment

Last time I tried this, I ended up semi-deliberately choosing to use Python 3 for a tutorial that (I didn’t realize quickly enough) was built around Python 2.

After cleaning up my experiment I remembered that the default python on my MacBook was still python 2.7.10, which gave me the idea I might be able to re-run that tutorial with all-Python 2 dependencies. Or so it seemed.

Strangely, the first step both went better and no better than last time:

Mac4Mike:flask-aws-tutorial mike$ virtualenv flask-aws
Using base prefix '/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5'
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3.5
Also creating executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Installing setuptools, pip, wheel...done.

Yes it didn’t throw any errors, but no it didn’t use the base Python 2 that I’d hoped. Somehow the fact that I’ve installed Python 3 on my system is still getting picked up by virtualenv, so I needed to dig further into how virtualenv can be used to truly insulate from Python 3.

Found a decent article here that gave me hope, and even though they punted to using the virtualenvwrapper scripts, it still clued me in to the virtualenv parameter “-p”, so this seemed to work like a charm:

Mac4Mike:flask-aws-tutorial mike$ virtualenv flask-aws -p /usr/bin/python
Running virtualenv with interpreter /usr/bin/python
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Installing setuptools, pip, wheel...done.

This time? The requirements install worked like a charm:

Successfully installed Flask-0.10.1 Flask-SQLAlchemy-2.0 Flask-WTF-0.10.3 Jinja2-2.7.3 MarkupSafe-0.23 PyMySQL-0.6.3 SQLAlchemy-0.9.8 WTForms-2.0.1 Werkzeug-0.9.6 argparse-1.2.1 boto-2.28.0 itsdangerous-0.24 newrelic-2.74.0.54

Then (since I still had all the config in place), I ran pip install awsebcli and skipped all the way to the bottom of the tutorial and tried eb deploy:

INFO: Deploying new version to instance(s).                         
ERROR: Your requirements.txt is invalid. Snapshot your logs for details.
ERROR: [Instance: i-01b45c4d01c070555] Command failed on instance. Return code: 1 Output: (TRUNCATED)...)
  File "/usr/lib64/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/opt/python/run/venv/bin/pip install -r /opt/python/ondeck/app/requirements.txt' returned non-zero exit status 1. 
Hook /opt/elasticbeanstalk/hooks/appdeploy/pre/03deploy.py failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
INFO: Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
ERROR: Unsuccessful command execution on instance id(s) 'i-01b45c4d01c070555'. Aborting the operation.
ERROR: Failed to deploy application.

This kept barfing over and over until I remembered that the target environment was still configured for Python 3.4. Fortunately or not, you can’t change major versions of the platform – so back to eb init I go (with the -i parameter to re-initialize).

This time around? The command eb deploy worked like a charm.

Lesson: be *very* explicit about your Python versions when messing with someone else’s code. [Duh.]
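One way to be explicit is a small guard at the top of a script. This is only a sketch: the function name require_python is my own invention, and the 2.7 threshold in the commented usage is just the version this tutorial happened to need.

```python
import sys


def require_python(major, minor=0, version_info=None):
    """Return True if the interpreter is the wanted major version and at least `minor`."""
    info = version_info if version_info is not None else sys.version_info
    return info[0] == major and info[1] >= minor


# Example: a script written for Python 2.7 could start with
#     if not require_python(2, 7):
#         sys.exit("This code expects Python 2.7, got " + sys.version.split()[0])
```

The optional version_info argument exists only so the check can be exercised without switching interpreters.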

Getting to know AWS with Python (the free way)

Published on 2016-12-11 (updated 2016-12-12) by paranoidmike, 1 Comment

I’ve run through all the easy AWS tutorials and I was looking to roll my sleeves up and take it one level deeper, so (given I’ve recently completed a Python class, and given I’m trying to stay on a budget) I hunted around and found some ideas for putting a sample Flask app up on AWS Elastic Beanstalk that costs as little as possible to use. Here’s how I wove them together (into an almost-functional cloud solution first time around).

Set aside Anaconda, install Python3

First step in the tutorial I found was building the Python app locally (then eventually “freezing” it and uploading to AWS EB). [Note there’s a similar tutorial from AWS here, which is good for comparison.] [Also note: after I mostly finished my work, I confirmed that the tutorial I’m using was written for Python 2.7, and yet I’d blundered into it with Python 3.4. Not for the faint of heart, nor for those who like non-Degraded health status.]

Which starts with using virtualenv to build up the app profile.

But for me, virtualenv threw an error after I installed it (with pip install virtualenv):

ERROR: virtualenv is not compatible with this system or executable

Okay…so Ryan Wilcox gave us a clue that you might be using the wrong python. Running which python told me I’m directed to the anaconda version, so I commented that PATH modification out of .bash_profile and installed python3 using homebrew (which I’d previously installed on my MacBook).

Debug virtualenv

Surprising no one but myself, by removing anaconda and installing python3, I lost access not only to virtualenv but also to pip, so I guess all that was wrapped in the anaconda environment. Running brew install pip reports the following:

Updating Homebrew...
Error: No available formula with the name "pip" 
Homebrew provides pip via: `brew install python`. However you will then
have two Pythons installed on your Mac, so alternatively you can install
pip via the instructions at:
  https://pip.readthedocs.io/en/stable/installing/

The only option from that article seems to be running get-pip.py, so I ran python3 get-pip.py (in case the script installed a different pip based on which version of python was running the script). Running pip --version returned this, so I felt safe enough to proceed:

pip 9.0.1 from /usr/local/lib/python3.5/site-packages (python 3.5)

Ran pip install virtualenv, now we’re back where I started (but without the anaconda crutch that I don’t entirely understand, and didn’t entirely trust to be compatible with these AWS EB tutorials).

Running virtualenv flask-aws from within the local clone of the tutorial repo throws this:

Using base prefix '/usr/local/Cellar/python3/3.5.2_3/Frameworks/Python.framework/Versions/3.5'
Overwriting /Users/mike/code/flask-aws-tutorial/flask-aws/lib/python3.5/orig-prefix.txt with new content
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3.5
Not overwriting existing python script /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python (you must use /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3.5)
Traceback (most recent call last):
  File "/usr/local/bin/virtualenv", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/site-packages/virtualenv.py", line 713, in main
    symlink=options.symlink)
  File "/usr/local/lib/python3.5/site-packages/virtualenv.py", line 925, in create_environment
    site_packages=site_packages, clear=clear, symlink=symlink))
  File "/usr/local/lib/python3.5/site-packages/virtualenv.py", line 1370, in install_python
    os.symlink(py_executable_base, full_pth)
FileExistsError: [Errno 17] File exists: 'python3.5' -> '/Users/mike/code/flask-aws-tutorial/flask-aws/bin/python3'

Hmm, is this what virtualenv does? OK, if that’s true why is something intended to insulate from external dependencies seemingly getting borked by external dependencies?

Upon closer examination, it appears that what the virtualenv script tried to do the first time was generate a symlink to a python3.5 interpreter, and it got tripped up when it found a symlink already there. However, the second time I ran the command (hoping that it was self-healing), I got yelled at about “too many symlinks”, and discovered that the targets now form a circular loop:

Mac4Mike:bin mike$ ls -la
total 24
drwxr-xr-x  5 mike  staff  170 Dec 11 12:26 .
drwxr-xr-x  6 mike  staff  204 Dec 11 12:26 ..
lrwxr-xr-x  1 mike  staff    9 Dec 11 12:26 python -> python3.5
lrwxr-xr-x  1 mike  staff    6 Dec 11 11:52 python3 -> python
lrwxr-xr-x  1 mike  staff    6 Dec 11 11:52 python3.5 -> python

If I’m reading the above stack trace correctly, they were trying to create a symlink from python3.5 to some other target. So all it should require is deleting the python3.5 symlink, yes? Easy.
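For spotting this kind of loop without squinting at ls output, a short sketch along these lines could help. It is a standalone helper of my own (find_symlink_loops is not part of virtualenv), and it only follows links that live directly in the one directory you hand it.

```python
import os


def find_symlink_loops(directory):
    """Return the names of symlinks in `directory` whose chain of targets loops back on itself."""
    looping = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        seen = set()
        while os.path.islink(path):
            if path in seen:
                looping.append(name)
                break
            seen.add(path)
            target = os.readlink(path)
            # Relative targets are resolved against the link's own directory.
            path = os.path.normpath(os.path.join(os.path.dirname(path), target))
    return looping
```

Pointed at the bin folder listed above, it should report all three links (python, python3 and python3.5), since each one's chain ends up in the same cycle.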

Aside: then running virtualenv flask-aws from the correct location (but an old, not-refreshed shell) spills this:

Using base prefix '/Users/mike/anaconda3'
Overwriting /Users/mike/code/flask-aws-tutorial/flask-aws/lib/python3.5/orig-prefix.txt with new content
New python executable in /Users/mike/code/flask-aws-tutorial/flask-aws/bin/python
Traceback (most recent call last):
  File "/Users/mike/anaconda3/bin/virtualenv", line 11, in <module>
    sys.exit(main())
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 713, in main
    symlink=options.symlink)
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 925, in create_environment
    site_packages=site_packages, clear=clear, symlink=symlink))
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 1387, in install_python
    raise e
  File "/Users/mike/anaconda3/lib/python3.5/site-packages/virtualenv.py", line 1379, in install_python
    stdout=subprocess.PIPE)
  File "/Users/mike/anaconda3/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/Users/mike/anaconda3/lib/python3.5/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg)
OSError: [Errno 62] Too many levels of symbolic links

Try again from a shell where anaconda’s been removed from $PATH? Works fine:

Installing setuptools, pip, wheel...done.

Step 2: Setting up the Flask environment

Installing the requirements (using pip install -r requirements.txt) went *mostly* smooth, but it broke down on either the distribute or itsdangerous dependency:

Collecting distribute==0.6.24 (from -r requirements.txt (line 11))
  Downloading distribute-0.6.24.tar.gz (620kB)
    100% |████████████████████████████████| 624kB 1.1MB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/setuptools/__init__.py", line 2, in <module>
        from setuptools.extension import Extension, Library
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/setuptools/extension.py", line 2, in <module>
        from setuptools.dist import _get_unpatched
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/setuptools/dist.py", line 103
        except ValueError, e:
                         ^
    SyntaxError: invalid syntax
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-6gettd5v/distribute/

Unfortunately the “/pip-build-6gettd5v/” folder was already deleted by the time I got there to look at its contents. So I re-ran the script and noticed that all dependencies through to distribute reported as “Using cached distribute-0.6.24.tar.gz”.

Figured I’d just pip install itsdangerous by hand, and if I got into too much trouble I could always uninstall it right? Well, that didn’t seem to help arrest this error – presumably pip caches all the requirements files first and then installs them – so I figured I’d divide the requirements.txt file into two (creating requirements2.txt and stuffing the last three files there instead) and see if that helped bypass the problem in installing the first 11 requirements.

Nope, that didn’t work, so I also moved the line “distribute==0.6.24” out of requirements.txt into requirements2.txt:

Successfully installed Flask-0.10.1 Flask-SQLAlchemy-2.0 Flask-WTF-0.10.3 Jinja2-2.7.3 MarkupSafe-0.23 PyMySQL-0.6.3 SQLAlchemy-0.9.8 WTForms-2.0.1 Werkzeug-0.9.6 argparse-1.2.1

Yup, that did it, so off to pip install -r requirements2.txt:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-5y_r3gg0/distribute/

OK, so then I removed the “distribute==0.6.24” line entirely from requirements2.txt:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-5g9i5fpl/wsgiref/

Crap, wsgiref too? OK, pip install wsgiref==0.1.2 by hand:

Collecting wsgiref==0.1.2
  Using cached wsgiref-0.1.2.zip
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-suqskeut/wsgiref/setup.py", line 5, in <module>
        import ez_setup
      File "/private/var/folders/dw/yycg4bz1347cx5v_crcjl5580000gn/T/pip-build-suqskeut/wsgiref/ez_setup/__init__.py", line 170
        print "Setuptools version",version,"or greater has been installed."
                                 ^
    SyntaxError: Missing parentheses in call to 'print'

Shit, even worse – the package itself craps out. Good news is, I know exactly what’s wrong here, because the print method (function?) requires () in Python3. (See, I *did* learn something from that anti-Zed rant.)

Double crap – wsgiref‘s latest version IS 0.1.2.

Triple-crap – that code’s docs haven’t been touched in forever, and don’t exist in a git repo against which issues can be filed.

Damn, this *really* sucks. Seems I’ve been trapped in the hell that is “tutorials written for Python2”. The only way I could get around this error is to edit the source of the cached wsgiref-0.1.2.zip file – do such files get deleted from $TMPDIR when pip exits? Or are they stuffed in ~/Library/Caches/pip in some obfuscated filename?

Ultimately it didn’t matter – I found that pip install --download ~/ wsgiref==0.1.2 worked to dump the file exactly where I wanted.

And yes, wrapping lines 170 and 171 with () after the print verb (and re-zip’ing the folder contents) allowed me to fully install wsgiref-0.1.2 using pip install --no-index --find-links=~/ wsgiref-0.1.2.zip.
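For reference, the two Python 2 constructs that broke these installs (the comma form of except in distribute, and the print statement in wsgiref’s ez_setup) look like this once written the Python 3 way. The function below is an illustrative stand-in of my own, not code from either package.

```python
def parse_version(text):
    """Parse a dotted version string, reporting bad input the Python 3 way."""
    try:
        return tuple(int(part) for part in text.split("."))
    except ValueError as e:  # Python 2 also allowed: except ValueError, e:
        # Python 2 statement form was: print "Bad version string:", e
        print("Bad version string:", e)
        return None
```

Both Python 3 spellings also work on Python 2.7 in this case (print happens to receive a tuple there, so the output looks a little different), which is why fixing the source in place is usually harmless.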

OK, finally ran pip install boto==2.28.0 and that’s all the dependencies installed.

Next!

Step 3: Create the AWS RDS

While the AWS UI has changed in subtle but UX-improving ways, this part wasn’t hard to follow. I wasn’t happy about seeing the “all traffic, all sources” security group applied here, but not knowing what my ultimate target configuration might look like, I made the classic blunder of saying to self, “I’ll lock it down later.”

Step 4: Add tables

It wasn’t entirely clear, so here’s the construction of the SQLALCHEMY_DATABASE_URI parameter in config.py:

Username = Master Username you chose on the Specify DB Details page
e.g. “awsflasktst”
Password = Master Password you chose on the Specify DB Details page
e.g. “awsflasktstpwd”
Endpoint = Endpoint string from DB Instance page
e.g. “flask_tst.cxcmcajseabc.us-west-2.rds.amazonaws.com:3306”
Database name = Database Name you chose on the Configure Advanced Settings page
e.g. “awstflasktstdb”

So in my case that line read (well, it didn’t because I’m not telling you my *actual* configuration, but this lets you know how I parsed the tutorial guidance):

SQLALCHEMY_DATABASE_URI = 'mysql+pymysql://awsflasktst:awsflasktstpwd@flask_tst.cxcmcajseabc.us-west-2.rds.amazonaws.com:3306/awstflasktstdb'
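The same assembly can be sketched in code, which makes it harder to misplace a colon or slash. build_database_uri is a helper name I made up, the values are the same placeholder examples as above, and the import is the Python 3 location of quote_plus (the tutorial itself targets Python 2.7, where it lives in urllib).

```python
from urllib.parse import quote_plus


def build_database_uri(username, password, endpoint, database):
    """Assemble a mysql+pymysql URI from the four values found in the RDS console."""
    return "mysql+pymysql://{user}:{pwd}@{host}/{db}".format(
        user=quote_plus(username),
        pwd=quote_plus(password),  # escape characters such as '@' or '/'
        host=endpoint,  # already includes the port, e.g. "...:3306"
        db=database,
    )


SQLALCHEMY_DATABASE_URI = build_database_uri(
    "awsflasktst",
    "awsflasktstpwd",
    "flask_tst.cxcmcajseabc.us-west-2.rds.amazonaws.com:3306",
    "awstflasktstdb",
)
```

Escaping the password matters only if it contains URL-special characters, but it is cheap insurance.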

Step 4.6 Install NewRelic

Because I’m a former Relic and I’m still enthused about APM technology, I decided to install the newrelic package via pip install newrelic. A few steps into the Quick Start guide and it was reporting all the…tiny amounts of traffic I was giving myself.

God knows if this’ll survive the pip freeze, but it’s sure worth a shot. [Spoiler: cloud deployment was ultimately unsuccessful, so it’s moot.]

Step 5: Elastic Beanstalk setup

Installing the CLI was easy enough:

Successfully installed awsebcli-3.8.7 blessed-1.9.5 botocore-1.4.85 cement-2.8.2 colorama-0.3.7 docker-py-1.7.2 dockerpty-0.4.1 docopt-0.6.2 docutils-0.13.1 jmespath-0.9.0 pathspec-0.5.0 python-dateutil-2.6.0 pyyaml-3.12 requests-2.9.1 semantic-version-2.5.0 six-1.10.0 tabulate-0.7.5 wcwidth-0.1.7 websocket-client-0.40.0

Setup of the user account was fairly easy, although again it’s clear that AWS has done a lot of refactoring of their UI flows. Just read ahead a little in the tutorial and you’ll be able to fill in the blanks. (Although again, it’s a little disturbing to see a tutorial be this cavalier about the access permissions being granted to a simple web app.)

When we got to the eb init command I ran into a snag:

ERROR: ConfigParseError :: Unable to parse config file: /Users/mike/.aws/config

I checked and I *have* a config file there, and it’s already been populated with Packer credentials info (for a separate experiment I was working on previously).

So what did I do? I renamed the config and credentials files to set them aside for now, of course!
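A gentler first step than renaming might be checking whether the file even parses as INI. The sketch below uses Python’s standard configparser; I’m assuming (without having confirmed it) that the EB CLI’s complaint came from a similar kind of parsing failure, and its own parser may well be stricter or more lenient than this one.

```python
import configparser


def config_problem(text):
    """Return None if `text` parses as INI, else a short description of the error."""
    parser = configparser.ConfigParser()
    try:
        parser.read_string(text)
    except configparser.Error as err:
        return "{}: {}".format(type(err).__name__, err)
    return None
```

Feeding it the contents of ~/.aws/config (for example via open(os.path.expanduser("~/.aws/config")).read()) would show whether a stray line outside any [section] header is the culprit.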

Like me, you might also see an EB application listed, so I chose “2” not “1”. No big deal.

There was also an unexpected option to use AWS CodeCommit, but since I’m not even sure I’m keeping this app around, I did not take that choice for now.

(And once again, this tutorial was super-cavalier about ignoring the SSH option that *everyone* always uses, so I just closed my eyes and prayed I never inherit these bad habits…)

“Time to deploy this bad boy.” Bad boy indeed – Python 2.7, *no* security measures whatsoever. This boy is VERY bad.

Step 6: Deployment

So there’s this apparent dependency between the EBCLI and your local git repo. If your changes aren’t committed, then what gets deployed won’t reflect those changes.

However, the part about running git push on the repo seemed unnecessary. If the local CLI tool checks my repo, it’s only checking the local one, not the remote one, so once we run git commit, it shouldn’t matter whether those changes were pushed upstream.

However, knowing that I monkeyed with the requirements.txt file, I thought I’d also git add that one to my commit history:

git add requirements.txt
git commit -m "added newrelic requirement"

After initiating eb create, I was presented with a question never mentioned in the tutorial (another new AWS feature!), and not documented specifically in any of the ELB documentation I reviewed:

Select a load balancer type
1) classic
2) application
(default is 1):

While my experience in this space tells me “application” sounds like the right choice, I chose the default for sake of convenience/sanity.

Proceeded with eb deploy, and it was looking good until we ran into the problem I was expecting to see:

ERROR: Your requirements.txt is invalid. Snapshot your logs for details.
ERROR: [Instance: i-01b45c4d01c070555] Command failed on instance. Return code: 1 Output: (TRUNCATED)...)
  File "/usr/lib64/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/opt/python/run/venv/bin/pip install -r /opt/python/ondeck/app/requirements.txt' returned non-zero exit status 1. 
Hook /opt/elasticbeanstalk/hooks/appdeploy/pre/03deploy.py failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
INFO: Command execution completed on all instances. Summary: [Successful: 0, Failed: 1].
WARN: Environment health has transitioned from Pending to Degraded. Command failed on all instances. Initialization completed 7 seconds ago and took 3 minutes.
ERROR: Create environment operation is complete, but with errors. For more information, see troubleshooting documentation.

Lesson

Ultimately unsuccessful, but was it satisfying? Heck yes, grappling with all the tools and troubleshooting various aspects of the procedures was a wonderful learning experience. [And as it turned out, was a great primer to lead into a less hand-holding tutorial I found later that I’ll try next.]

Coda

Well, heck. That was easier than I thought. Succeeded after another run-through by forcing Python 2.7 on both the local app configuration and in a new Elastic Beanstalk app.

Git Tricks I Keep Having to Look Up

Published on 2016-11-25 (updated 2016-11-26) by paranoidmike

Another trail of breadcrumbs for myself…me being the kind of guy I am, I try to do things the “git” way so I don’t piss off those upstream who might otherwise not have the energy to deal with another PR.

Sync Fork with Upstream
http://stackoverflow.com/questions/3903817/pull-new-updates-from-original-github-repository-into-forked-github-repository?rq=1

Git Squash
https://ariejan.net/2011/07/05/git-squash-your-latests-commits-into-one/

Reverting Git Commits
http://stackoverflow.com/questions/6971717/github-how-to-revert-changes-to-previous-state#6971775

Switching branches
http://stackoverflow.com/questions/1475037/switching-branches-in-git

Push.default warning
http://stackoverflow.com/questions/13148066/warning-push-default-is-unset-its-implicit-value-is-changing-in-git-2-0#13148313

The Git Workflow (according to Atlassian)
https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow

Switching to SSH for an existing local repo
http://stackoverflow.com/questions/6565357/git-push-requires-username-and-password

The classic…
http://nvie.com/posts/a-successful-git-branching-model/

Occupied Neurons, November edition (late)

Published on 2016-11-09 (updated 2016-12-03) by paranoidmike

Docker In Production: a History of Failure

A cautionary tale to counter some of the newbie hype around the new Infrastructure Jesus that is Docker. I’ve fallen prey to the hype as well, assuming that (a) Docker is ready for prime time, (b) Docker is universally beneficial for all workloads and (c) Docker is measurably superior to the infrastructure design patterns that it intends to replace.

That said, the article is long on complaints, and doesn’t attempt to back its claims with data, third-party verification or unemotional hyperbole. I’m sure we’ll see many counter-articles claiming “it works for me”, “I never saw these kinds of problems” and “what’s this guy’s agenda?” I’ll still pay attention to commentary like this, because it reads to me like the brain dump of a person exhausted from chasing their tail all year trying to find a tech combo that they can just put in production and not devote unwarranted levels of monitoring and maintenance to. I think their expectations aren’t unreasonable. It sure sounds like the Docker team are more ambitious or cavalier than their position and staffing levels warrant.

Wat

This is one of the most hilarious and horrifying expeditions into the dark corners of (un?)intended consequences in coding languages. Watching this made me feel like I’m more versed in the lessons of the absurd “stupid pet tricks” with many languages, even if I’d never use 99% of these in real life. It also made me feel like “did someone deliberately allow these in the language design, or did some nearly-insane persons just end up naturally stumbling on these while trying to make the language do things it should never have done?”

Is Agile dying a slow death? Or is it being reborn?

This guy captures all my attitudes about “Agile according to the rules” versus “getting an organization tuned to collaborate and learn as fast as possible”. While extra/unnecessary process makes us feel like we have guard rails to keep people from making mistakes, in my experience what it *actually* does is drive DISengagement and risk aversion in most employees, knowing that unless they have explicit permission to break the rules, their great new idea is likely to attract organizational antibodies.

Stanford’s password policy shuns one-size-fits-all security

This is better than a Bigfoot sighting! An actual organization who’ve thought about security risk vs punishing anti-usability and come up with an approach that should satisfy both campaigns! This UX-in-security bigot can finally die a happy man.

A famed hacker is grading thousands of programs – and may revolutionise software in the process

May not get to the really grotty code security issues that are biting us some days, and probably giving a few CIOs a false sense of security. Controversial? Yes.

A necessary next step as software grows up as an engineering discipline? Absolutely.

Let’s see many more security geeks meeting the software developer where they live, and stop expecting em to voluntarily become part-time security experts just because someone came up with another terrific Hollywood Security Theater plot.

A Rebuttal for Python 3

Why are some old-school Pythonistas so damned pissy about Python 3 – to the point of (in at least one egregiously dishonest case) writing long articles trying to dissuade others from using it? Are they still butthurt at Guido for making breaking changes that don’t allow them to run their old Python 2 code on the Python 3 runtime? Do they not like change? Are they aware that humans are imperfect and sometimes have to admit mistakes/try something different? I find it fascinating to watch these kinds of holy wars – it gives the best kinds of insights into what frailties and hot buttons really motivate people.

The best quote’s in the comments: “Wow, I haven’t seen this much bullshit in a “technical” article in a while. A Donald Trump transcript is more honest and informative than that. I seriously doubt Zed Shaw himself believes a single paragraph there; if he actually does, he should stop acting like a Python expert and admit he’s an idiot.”

How The Web Became Unreadable

It’s painful to see some designers slavishly devote their efforts more to the third hand fashion they hear about from other designers, than to the end users of the sites and services to which they deliver their designs. I love a lot of the design work that’s come out the last few years – the jumbled mess that was web design ten years ago was painful – but the practical implications of how that design is consumed in the wild must be paramount. And it is where I am the final decision maker on shipping software.

Mashing the marvelous wrapper until it responds, part 1: prereq/setup

Published on 2016-07-24 (updated 2016-11-13) by paranoidmike

I haven’t used a dynamic language for coding nearly as much as strongly-typed, compiled languages so approaching Python was a little nervous-making for me. It’s not every day you look into the abyss of your own technical inadequacies and find a way to keep going.

Here’s how embarrassing it got for me: I knew enough to clone the code to my computer and to copy the example code into a .py file, but beyond that it felt like I was doing the same thing I always do when learning a new language: trying to guess at the basics of using the language that everyone who’s writing about it already knows and has long since forgotten, they’re so obvious. Obvious to everyone but the neophyte.

Second is that I don’t respond well to the canonical means of learning a language (at least according to all the “Learn [language_x] from scratch” books I’ve picked up over the years), which is

Chapter 1: History, Philosophy and Holy Wars of the Language
Chapter 2: Installing The Author’s Favourite IDE
Chapter 3: Everything You Don’t Have a Use For in Data Types
Chapter 4: Advanced Usage of Variables, Consts and Polymorphism
Chapter 5: Hello World
Chapter 6: Why Hello World Is a Terrible Lesson
Chapter 7: Author’s Favourite Language Tricks

… etc.

I tend to learn best by attacking a specific, relevant problem hands-on – having a real problem I felt motivated to attack is how these projects came to be (EFSCertUpdater, CacheMyWork). So for now, despite a near-complete lack of context or mentors, I decided to dive into the code and start monkeying with it.

Riches of Embarrassment

I quickly found a number of “learning opportunities” – I didn’t know how to:

Run the example script (hint: install the python package for your OS, make sure the python binary is in your current shell’s path, and don’t use the Windows Git Bash shell as there’s some weird bug currently at work)

Install the dependencies (hint: run “pip install xxxx”, where “xxxx” is whatever shows up at the end of an error message like this one):

```
C:\Users\Mike\code\marvelous>python example.py
Traceback (most recent call last):
  File "example.py", line 5, in <module>
    from config import public_key, private_key
ImportError: No module named config
```

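In this example, I ran “pip install config” to make this error go away (though, as the next item shows, the “config” this script actually wanted turned out to be a local file holding my keys, not a package from pip).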
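Rather than discovering missing modules one traceback at a time, you can ask Python up front which imports would fail. This is only a sketch, assuming Python 3.4 or later (where `importlib.util.find_spec` exists); the helper name `missing_modules` and the list of module names are made up for illustration. Keep in mind that the name you pass to pip is not always the same as the name in the import line, and that a local file such as config.py counts as a module too.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot currently find."""
    missing = []
    for name in names:
        # find_spec returns None when no module of that name is on sys.path
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical list: replace it with whatever your script imports.
    wanted = ["json", "config", "requests"]
    for name in missing_modules(wanted):
        print("Missing module:", name)
```

Anything it reports is either a package to install or a local file (like config.py) that still needs to be created next to the script.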

Set the public & private keys (hint: there was some mention of setting environment variables, but it turns out that for this example script I had to paste them into a file named “config” – no, for Python the file needs to be named “config.py”, even though it’s text, not a script you would run on its own – and make sure the config.py file is stored in the same folder as the script you’re running). Its contents should look similar to these (no, these aren’t really my keys):
```
    public_key = 81c4290c6c8bcf234abd85970837c97 
    private_key = c11d3f61b57a60997234abdbaf65598e5b96
```
Nope, don’t forget – when you declare a variable in most languages, and the value is not numeric, you have to wrap the value in some type of quotation marks. [Y’see, this is one of the things that bugs me about languages that don’t enforce static typing – without it, it’s easy for casual users to forget how strings have to be handled]:
```
    public_key = '81c4290c6c8bcf234abd85970837c97' 
    private_key = 'c11d3f61b57a60997234abdbaf65598e5b96'
```
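Since environment variables were mentioned as the other option, here is a sketch of a config.py that reads the keys from the environment instead of having them pasted in. The variable names `MARVEL_PUBLIC_KEY` and `MARVEL_PRIVATE_KEY` and the helper `read_key` are hypothetical, not something the wrapper defines; the calling script would still use the same `from config import public_key, private_key` line.

```python
import os


def read_key(name, environ=os.environ):
    """Look up one key by name; environment values always arrive as strings."""
    # Returns an empty string when the variable has not been set.
    return environ.get(name, "")


# Hypothetical variable names; set them in your shell before running the script.
public_key = read_key("MARVEL_PUBLIC_KEY")
private_key = read_key("MARVEL_PRIVATE_KEY")
```

One side benefit of this approach is that the keys never end up in a file you might accidentally commit.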
Properly call into other Classes in your code – I started to notice in Robert’s Marvelous wrapper that his Python code would do things like this – the comic.py file defined
```
     class ComicSchema(Schema):
```
…and the calling code would state
```
    import comic 
    … 
    schema = comic.ComicSchema()
```
This was initially confusing to me, because I’m used to compiled languages like C# where you import the defined name of the Class, not the filename container in which the class is defined. If this were C# code, the calling code would probably look more like this:
```
    using ComicSchema;
    … 
    _schema Schema = ComicSchema();
```
(Yes, I’m sure I’ve borked the C# syntax somehow, but for sake of this sad explanation, I hope you get the idea where my brain started out.)

I’m inferring that for a scripted/dynamic language like Python, the Python interpreter doesn’t have any preconceived notion of where to find the Classes – it has to be instructed to look at specific files first (import comic, which I’m guessing implies import comic.py), then further to inspect a specified file for the Class of interest (schema = comic.ComicSchema(), where comic. indicates the file to inspect for the ComicSchema() class).
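That inference can be checked with a small experiment. The sketch below writes a stand-in comic.py into a temporary folder (the real ComicSchema in the wrapper subclasses Schema; this stand-in is an empty class made up for illustration), tells Python to search that folder, and then imports it both ways.

```python
import importlib
import os
import sys
import tempfile

# Write a tiny stand-in for comic.py into a temporary folder.
folder = tempfile.mkdtemp()
with open(os.path.join(folder, "comic.py"), "w") as f:
    f.write("class ComicSchema(object):\n    pass\n")

# "import comic" searches the folders listed in sys.path for comic.py.
sys.path.insert(0, folder)
importlib.invalidate_caches()

import comic                      # binds the module (the file) to the name "comic"
schema = comic.ComicSchema()      # module name, dot, class name

from comic import ComicSchema     # binds the class name directly instead
same_kind = ComicSchema()

print(comic.__file__)             # shows which comic.py was actually loaded
print(type(schema) is type(same_kind))
```

The second form, `from comic import ComicSchema`, is probably the closest Python gets to the "import the class by name" habit from C#.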

Status: Learning

So far, I’m feeling (a) stupid that I have to admit these were not things with which I sprang from the womb, (b) grateful Python’s not *more* punishing, (c) smart-ish that fundamental debugging is something I’ve still retained and (d) good that I can pass along these lessons to other folks like me.

Coding Again? Experimenting with the Marvel API

Published on 2016-07-14 (updated 2016-10-09) by paranoidmike

I’ve been hanging around developers *entirely* too much lately.

These days I find myself telling myself the story that unless I get back into coding, I’m not going to be relevant in the tech industry any longer.

Hanging out (aka volunteering) at developer-focused conferences will do that to you:

Open Source Bridge (June 2016)
.NET Fringe (July 2016)

Volunteering on open source projects will do that to you (jQuery Foundation’s infrastructure team).

Interviewing for engineering-focused Product Owner and Technical Product Manager roles will do that to you. (Note: when did “technical” become equivalent to “I actively code in my day job/spare time”?)

One of the hang-ups I have that keeps me from investing the immense amount of grinding time it takes to make working code is that I haven’t found an itch to scratch that bugs me enough that I’m willing to commit myself to the effort. Plenty of ideas float their way past my brain, but very few (like CacheMyWork) get me emotionally engaged enough to surmount the activation energy necessary to fight alone past all the barriers: lonely nights, painful problem articulation, lack of buddy to work on it, and general frustration that I don’t know all the tricks and vocabulary that most good coders do.

Well, it finally happened. I found something that should keep me engaged: creating a stripped-down search interface into the Marvel comics catalogue. Marvel.com provides a search on their site but I:

keep forgetting where they buried it,
find it cumbersome and slow to use, and
never know if the missing references (e.g. appearances of Captain Marvel as a guest in others’ comics that aren’t returned in the search results) are because the search doesn’t work, or because the data is actually missing

Marvel launched an API a couple of years ago – I heard about it at the time and felt excited that my favourite comics publisher had embraced the Age of APIs. But didn’t feel like doing anything with it.

Fast forward two years: I’m a diehard user of Marvel Unlimited, my comics reading is about half-Marvel these days, and I’m spending a lot of time trying to weave together a picture of how the characters relate, when they’ve bumped into each other, what issue certain happenings occurred in, etc.

Possible questions I could answer if I write some code:

How socially-connected is Spidey compared with Wolverine?
When is the first appearance of any character?
What’s the chronological publication order of every comic crossover in any comics Event?

Possible language to use:

C# (know it)
F# (big hawtness at the .NET Fringe conf)
Python (feel like I should learn it)
TypeScript (ES6 – like JavaScript with static types and other frustration-killers)
ScriptCS (a scriptable C#)

More important than choice of language though is availability of wrappers for the API – while I’m sure it would be very instructive to immediately start climbing the cliff of building “zero tech” code, I learn far faster when I have visible results, than when I’m still fiddling with getting the right types for my variables or trying to remember where and when to set the right kind of closing braces.

So for the sake of argument I’m going to try out the second package I found – Robert Kuykendall’s “marvelous” Python wrapper: https://github.com/rkuykendall/marvelous
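For a sense of the plumbing a wrapper saves you from, here is a sketch of the request signing that, as I understand Marvel's developer documentation, every server-side call needs: a timestamp, the public key, and an MD5 hash of timestamp + private key + public key. The function name `auth_params` is made up for illustration and is not taken from the marvelous code; the keys in any real call would be your own.

```python
import hashlib
import time


def auth_params(public_key, private_key, ts=None):
    """Build the ts / apikey / hash query parameters for one API request."""
    # Use the current time unless the caller supplies a timestamp.
    ts = str(int(time.time()) if ts is None else ts)
    # hash = md5(ts + private key + public key), written as hex digits
    digest = hashlib.md5((ts + private_key + public_key).encode("utf-8")).hexdigest()
    return {"ts": ts, "apikey": public_key, "hash": digest}
```

The resulting dictionary would be appended to the query string of each request, which is exactly the kind of chore I would rather let someone else's package handle.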

See you when I’ve got something to report.

What I learned today about server-side programming

Published on 2012-11-09 (updated 2016-11-13) by paranoidmike

Attended PDX Web & Design meetup tonight, where Eric Redmond delivered a primer on modern server-side programming. What’s cool about Eric’s talks is he gives us a smattering of the stuff that he’s touched in the last few months – and he’s got a voracious appetite for fun, productive tools that support the lazy programmer. (Kindred spirit here – don’t make me do anything more than I absolutely have to to get the job done.)

Here are the notes I took tonight as Eric was entertaining me (and the rest of the room). Mostly noted what technologies he mentioned, a few key lessons, and a few opinions that resonated with me.

– http://www.socketstream.com (& GitHub)
– CoffeeScript
– Socketracer.com
– “ws://” protocol
– Server-side frameworks:
— PHP (easy/messy – bad programming practices)
— Express.js (difficult) – built on Node.js
— Sinatra (just right – “Ruby on Rails’ little brother”)

– CMD shell: SharpEnviro (win), iTerm (mac)
– txt editor: Notepad++ (win), Sublime Text (mac) — or Sublime for windows
– httpd for win: WinLAMP or XAMPP
– Codecademy.com/Tracks/JavaScript
– heroku.com – hosting provider for hosting Express.js apps (and tons more frameworks)
– Express.js takes care of implementing things like sockets that Node.js makes you write on your own
– HTML templating languages = jade, slim
— jade templates remove the angle brackets (<>); instead of “end block” tags it uses indenting to define hierarchy
– Sinatra – ruby-lang.org, sinatrarb.com, rubygems.org
– Erlang is now Eric Redmond’s favoured programming language (doesn’t recommend it)
– TryRuby.org – teach yourself Ruby from scratch
– larger-scale frameworks (generally MVC): RailwayJS (RailwayJS.com), Django (www.djangoproject.com), Ruby on Rails (rubyonrails.org)
– CMS’s: refinery, redmine, typo3

“Learning to program teaches you how to think. Computer science is a liberal art.” — Steve Jobs
